
 


ISPRS Journal of Photogrammetry and Remote Sensing 169 (2020) 337–350

 

 

 

 

 

 

 


 

MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding

Xiaoman Qi a, 1, Panpan Zhu b, c, 1, Yuebin Wang a,*, Liqiang Zhang c, Junhuan Peng a, Mengfan Wu c, Jialong Chen a, Xudong Zhao a, Ning Zang a, P. Takis Mathiopoulos d

a School of Land Science and Technology, China University of Geosciences, Beijing 100083, China

b College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

c Beijing Key Laboratory of Environmental Remote Sensing and Digital Cities, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

d Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens 15784, Greece


 


ARTICLE INFO

 

Keywords:
Multi-label image dataset
Semantic scene understanding
Convolutional Neural Network (CNN)
Image classification
Image retrieval


ABSTRACT


To better understand scene images in the field of remote sensing, multi-label annotation of scene images is necessary. Moreover, to enhance the performance of deep learning models for semantic scene understanding tasks, it is vital to train them on large-scale annotated data. However, most existing datasets are annotated with a single label, which cannot describe complex remote sensing images well because a scene image may contain multiple land cover classes. Few multi-label high spatial resolution remote sensing datasets have been developed to train deep learning models for multi-label based tasks, such as scene classification and image retrieval. To address this issue, in this paper we construct a multi-label high spatial resolution remote sensing dataset named MLRSNet for semantic scene understanding with deep learning from the overhead perspective. It is composed of high-resolution optical satellite or aerial images. MLRSNet contains a total of 109,161 samples within 46 scene categories, and each image has at least one of 60 predefined labels. We have designed visual recognition tasks, including multi-label based image classification and image retrieval, in which a wide variety of deep learning approaches are evaluated with MLRSNet. The experimental results demonstrate that MLRSNet is a significant benchmark for future research, and it complements the current widely used datasets such as ImageNet, filling gaps in multi-label image research. Furthermore, we will continue to expand MLRSNet. MLRSNet and all related materials have been made publicly available at https://data.mendeley.com/datasets/7j9bv9vwsx/1 and https://github.com/cugbrs/MLRSNet.git.


 

 


1.   Introduction

With the availability of enormous numbers of remote sensing images produced by satellites and airborne sensors, high-resolution remote sensing image analyses have stimulated a flood of interest in the domains of remote sensing and computer vision (Toth and Jóźków, 2016), such as image classification or land cover mapping (Cheng et al., 2017; Gómez et al., 2016; You and Dong, 2020; Zhao et al., 2016), image retrieval (Wang et al., 2016), and object detection (Cheng et al., 2014; Han et al., 2014), etc. The great potential offered by these platforms in terms of observation capability poses great challenges for semantic scene understanding (Bazi, 2019). For instance, as these data are obtained from different locations, at different times and even with different satellites or airborne sensors, there are large variations among the scene images, which creates difficulties for semantic scene understanding tasks such as multi-label based image retrieval and image classification.

Furthermore, remote sensing images usually contain abundant information about ground objects, which creates challenges for semantic scene understanding tasks (Chaudhuri et al., 2017). However, accurately labeling each piece of data is extremely expensive when the amount of data is huge. Therefore, research on weakly supervised segmentation that exploits image-level information from multi-label classification networks has attracted the attention of some scholars (Ge et al., 2018; Xia et al., 2015). Moreover, there have been many explorations of the use of multi-label data, such as land cover classification (Stivaktakis et al., 2019), high-precision image retrieval (Chaudhuri et al., 2017),


 

 

* Corresponding author.

E-mail address: wangyuebin@cugb.edu.cn (Y. Wang).

1 X. Qi and P. Zhu contributed equally to this work.

https://doi.org/10.1016/j.isprsjprs.2020.09.020

Received 4 February 2020; Received in revised form 26 September 2020; Accepted 28 September 2020

Available online 9 October 2020

0924-2716/© 2020 International Society for Photogrammetry and Remote Sensing, Inc. (ISPRS). Published by Elsevier B.V. All rights reserved.


image semantic segmentation (Xia et al., 2015), and the transfer of models trained on multi-label data to other visual tasks (e.g., image object recognition) (Gong et al., 2019). Therefore, multi-label datasets now attract increasing attention in the remote sensing community because they are not expensive but have great research potential. For these reasons, multi-label annotation of an image is necessary to present more details of the image and improve the performance of scene understanding. In addition, the multi-label annotation of an image can reveal potential correlations among the labels; for example, "road" and "car" tend to occur together in a remote sensing image, and grass and water often accompany a golf course. This provides a better understanding of scene images, which is impossible for single-label image scene understanding. Therefore, annotating images with multiple labels is a vital step for semantic scene understanding in remote sensing.

What is more, previous studies have proven that traditional machine learning methods cannot adequately mine ground object scene information (Cordts et al., 2016; Jeong et al., 2019; Kendall et al., 2015; Zhu et al., 2019). Recently, deep learning approaches, as a popular technology, have shown great potential for providing solutions to problems related to semantic scene understanding, and many scholars have conducted relevant studies (Fang et al., 2020; Han et al., 2018; Hu et al., 2015; Ma et al., 2019; Paoletti et al., 2018; Wang et al., 2018; Zhang et al., 2016; Zhou et al., 2019). For example, a highly reliable end-to-end real-time object detection-based situation recognition system was proposed for autonomous vehicles (Jeong et al., 2019). In another work (Cordts et al., 2016), the authors determined that fully convolutional networks achieve decent results in urban scene understanding. Scene classification CNNs have also been shown to significantly outperform previous approaches (Zhou et al., 2017). In Workman et al. (2017), a novel CNN architecture was proposed for estimating geospatial functions such as population density, land cover, or land use. Moreover, CNNs have also been used to identify weeds (Hung et al., 2014) and vehicles (Chen et al., 2014), etc.

Additionally, a logarithmic relationship between the performance of deep learning methods on vision tasks and the quantity of training data used for representation learning was recently demonstrated (Sun et al., 2017). This work showed that the power of CNNs on large-scale image recognition tasks can be substantially improved if the CNNs are trained on large multi-perspective samples. At present, there are several widely used annotated datasets of various scales, including image classification datasets such as ImageNet (Deng et al., 2009), Places (Zhou et al., 2017), PASCAL VOC (Everingham et al., 2015) and YouTube-8M (Abu-El-Haija et al., 2016), and semantic segmentation datasets such as PASCAL Context (Mottaghi et al., 2014), Microsoft COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016) and the Mapillary Vistas Dataset (Neuhold et al., 2017). However, in these benchmarks, the data of outdoor objects on the ground are usually collected from ground-level views. In addition, there are object-centric remote sensing image datasets constructed for scene classification, for instance, AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), the Brazilian coffee scene dataset (Penatti et al., 2015), the UC-Merced dataset (Yang and Newsam, 2010), and the WHU-RS19 dataset (Xia et al., 2010). However, these datasets are insufficient for scene understanding due to their high intra-class diversity and low inter-class variation, together with the limited number of remote sensing images (Xia et al., 2017). The SEN12MS dataset (Schmitt et al., 2019) has recently attracted more attention in the domain of land use mapping. It consists of 180,662 triplets sampled over all meteorological seasons. Each triplet includes a dual-pol synthetic aperture radar (SAR) image patch, a multi-spectral Sentinel-2 image patch, and four different MODIS land cover maps following different internationally established classification schemes. However, SEN12MS contains no more than 17 classes under a selected classification scheme, which may also be insufficient for understanding the complex real world.

Moreover, it is worth noting that each image in most of the aforementioned datasets is annotated with a single label representing the most significant semantic content of the image. However, single-label annotation is sufficient only for simple problems, such as distinguishing between the coffee class and the non-coffee class, and it is difficult to address more complex scene understanding tasks. Multiple label-related methods have recently been found to be useful for scene understanding, such as multi-label image search and retrieval problems, where multiple class labels are simultaneously assigned to each image (Boutell et al., 2004; Li et al., 2010; Ranjan et al., 2015; Zhang and Zhou, 2007). Thus, several published multi-label archives are publicly available, for example, multi-label UAV image datasets such as the Trento dataset (Bazi, 2019) and the Civezzano dataset (Bazi, 2019), and multi-label remote sensing image retrieval (RSIR) archives such as MLRSIR (Shao et al., 2018). The Trento and Civezzano datasets both contain 14 classes, with a total of 4000 and 4105 images, respectively. A multi-label RSIR archive was released in 2017, which is considered to be the first open-source dataset for multi-label RSIR (Chaudhuri et al., 2017). Afterward, MLRSIR (Shao et al., 2018), a pixel-wise dataset for multi-label RSIR, was presented by Wuhan University and has 21 broad categories with 100 images per category. However, training CNNs on the above datasets easily results in overfitting, since the CNN models used for multi-label archives often contain millions of parameters; thus, a considerable quantity of labeled data is required to fully train the models. Although BigEarthNet (Sumbul et al., 2019) can deal with the problem of overfitting, its limited distribution and unique data source could reduce the intra-class diversity, which raises difficulties for developing robust scene understanding algorithms.

To overcome the above issues and better understand ground objects, in this paper we propose a novel large-scale high-resolution multi-label remote sensing dataset termed MLRSNet for semantic scene understanding. It contains 109,161 high-resolution remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1500 to 3000. The images have a fixed size of 256 × 256 pixels with various pixel resolutions. Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. We also illustrate the construction procedure of the MLRSNet dataset and give evaluations and comparisons of several deep learning methods for multi-label based image classification and image retrieval. The experiments indicate that multi-label based deep learning methods can achieve better performance on image classification and image retrieval.

In summary, three major contributions of this paper are as follows:

 

(1)  A review of related popular datasets is provided by summarizing their properties, covering single-label and multi-label datasets of different scales, most of which are insufficient for remote sensing scene understanding tasks.

(2)  A multi-label high spatial resolution remote sensing dataset, i.e., MLRSNet, is developed for semantic scene understanding. To our knowledge, it is a large high-resolution multi-label remote sensing dataset with the most abundant multi-label information. The dataset also has high intra-class diversity, which provides a better data resource for evaluating and advancing numerous methods in semantic scene understanding.

(3)  The state-of-the-art neural network methods for multi-label image classification and multi-label image retrieval using MLRSNet are evaluated. These results show that deep-learning-based methods achieve significant performance for multi-label based image classification and image retrieval tasks.

2.   MLRSNet: A multi-label high spatial resolution remote sensing dataset

How to improve the performance of existing multi-label image classification and retrieval approaches using machine learning and other artificial intelligence technologies has attracted much attention in the remote sensing community (Chua et al., 2009). However, learning-based methods require a large number of labeled samples. To advance the state-of-the-art methods in scene understanding of remote sensing, we construct MLRSNet, a new large-scale high-resolution multi-label remote sensing image dataset.

Fig. 1. Illustration of the number of samples per category in MLRSNet. There are 109,161 samples within 46 scene categories.

2.1.   Description of MLRSNet

MLRSNet is composed of 109,161 labeled RGB images from all around the world annotated into 46 broad categories: airplane, airport, bareland, baseball diamond, basketball court, beach, bridge, chaparral, cloud, commercial area, dense residential area, desert, eroded farmland, farmland, forest, freeway, golf course, ground track field, harbor&port, industrial area, intersection, island, lake, meadow, mobile home park, mountain, overpass, park, parking lot, parkway, railway, railway station, river, roundabout, shipping yard, snowberg, sparse residential area, stadium, storage tank, swimming pool, tennis court, terrace, transmission tower, vegetable greenhouse, wetland, and wind turbine. The number of sample images varies greatly among the broad categories, from 1500 to 3000, as shown in Fig. 1. Additionally, each image in the dataset is assigned several of 60 predefined class labels, and the number of labels associated with each image varies between 1 and 13. The number of images associated with each predefined label is listed in Table 1, and some samples with their corresponding multi-label results are shown in Fig. 2.

Table 1
Number of images present in the dataset for each class label. There are 60 predefined class labels in total.

Class label | Number | Class label | Number | Class label | Number
Airplane | 2306 | Freeway | 2500 | Roundabout | 2039
Airport | 2480 | Golf course | 2515 | Runway | 2259
Bare soil | 39,345 | Grass | 49,391 | Sand | 11,014
Baseball diamond | 1996 | Greenhouse | 2601 | Sea | 4980
Basketball court | 3726 | Gully | 2413 | Ships | 4092
Beach | 2485 | Habor | 2492 | Snow | 3565
Bridge | 2772 | Intersection | 2497 | Snowberg | 2555
Buildings | 51,305 | Island | 2493 | Sparse residential area | 1829
Cars | 34,013 | Lake | 2499 | Stadium | 2462
Chaparral | 5903 | Mobile home | 2499 | Swimming pool | 5078
Cloud | 1798 | Mountain | 5468 | Tanks | 2500
Containers | 2500 | Overpass | 2652 | Tennis court | 2499
Crosswalk | 2673 | Park | 1682 | Terrace | 2345
Dense residential area | 2774 | Parking lot | 7061 | Track | 3693
Desert | 2537 | Parkway | 2537 | Trail | 12,376
Dock | 2492 | Pavement | 56,383 | Transmission tower | 2500
Factory | 2667 | Railway | 4399 | Trees | 70,728
Field | 15,142 | Railway station | 2187 | Water | 27,834
Football field | 1057 | River | 2493 | Wetland | 3417
Forest | 3562 | Road | 37,783 | Wind turbine | 2049

Fig. 2. Example images of 44 categories (except bareland and cloud) from the MLRSNet dataset are shown, and the corresponding multi-labels of each image are reported at the right of the related image.

Besides, MLRSNet has multiple resolutions: the pixel resolution ranges from about 10 m to 0.1 m, and the size of each multi-label image is fixed at 256 × 256 pixels to cover a scene at various resolutions. Compared to the afore-mentioned scene understanding datasets in Section 1, MLRSNet has significantly larger variability in terms of geographic origins and number of object categories. Different from ImageNet (Deng et al., 2009), which collects data of outdoor objects from ground-level views, MLRSNet describes the objects on Earth from an overhead perspective through satellite or aerial sensors. Therefore, deep neural networks can be trained on MLRSNet combined with ImageNet. We can achieve much higher recognition precision of the scene and effectively address the challenges of object rotation, within-class variability, and between-class similarity. Table 2 lists the differences between MLRSNet and other widely used scene understanding datasets.

In contrast with the existing remote sensing image datasets, MLRSNet has the following notable characteristics:

Hierarchy: MLRSNet contains 3 first-class categories, namely land use and land cover (e.g., commercial area, farmland, forest, industrial area, mountain), natural objects and landforms (e.g., beach, cloud, island, lake, river, chaparral), and man-made objects and landforms (e.g., airplane, airport, bridge, freeway, overpass), as well as 46 second-class categories (as shown in Fig. 1) and 60 third-class labels (as shown in Table 1).

Multi-label: As shown in Fig. 2, each image in the MLRSNet dataset has one or more corresponding labels because a remote sensing image usually contains many classes of objects that are not mutually exclusive. Several experiments (Shao et al., 2018; Zhang et al., 2018) have indicated that multi-label datasets tend to achieve more satisfactory performance than single-label datasets in the tasks of image classification or image retrieval.

Large-scale: As shown in Table 2, MLRSNet has a large number of high-resolution multi-label remote sensing scene images. It contains 109,161 high-resolution remote sensing images annotated into 46 categories, and the number of sample images in a category varies from 1500 to 3000, both of which are larger than those of most other listed datasets. MLRSNet is a large-scale high-resolution remote sensing dataset collected for scene image recognition that can cover a much wider range of satellite or aerial images. It is meant to serve as an alternative to advance the development of methods in scene image recognition, particularly deep-learning-based approaches that require large quantities of labeled training data.

Diversity: To increase the generalization ability of the dataset, we attempt to characterize MLRSNet according to the object distributions for geographical and seasonal distribution, weather conditions, viewing perspectives, capturing time, and image resolution, i.e., large variations in spatial resolution, viewpoint, object pose, illumination and background as well as occlusion.
 

Table 2
Statistics of our database and comparisons of current state-of-the-art remote sensing benchmarks.

Dataset | Number of Total Samples | Number of Categories | Sample Number in Each Category | Image Sizes | Image Spatial Resolution (m) | Reference
UC-Merced | 2100 | 21 | 100 | 256 × 256 | 0.3 | Yang and Newsam (2010)
NWPU-RESISC45 | 31,500 | 45 | 700 | 256 × 256 | ~30–0.2 | Cheng et al. (2017)
AID | 10,000 | 30 | 220–420 | 600 × 600 | 8–0.5 | Xia et al. (2017)
MLRSIR | 2,100 | 21 | 100 | 256 × 256 | 0.3 | Shao et al. (2018)
SEN12MS | 564,768 | 17 | – | 256 × 256 | 10–500 | Schmitt et al. (2019)
BigEarthNet | 590,326 | 43 | 328–217,119 | up to 120 × 120 | 10, 20, 60 | Sumbul et al. (2019)
MLRSNet | 109,161 | 46 | 1500–3000 | 256 × 256 | ~10–0.1 | Our work

The number of categories for SEN12MS is counted following the International Geosphere Biosphere Programme (IGBP) classification scheme (Loveland and Belward, 1997).

 

2.2.   Construction of MLRSNet

 

MLRSNet is a remote sensing community-led dataset for people who want to visualize the world from overhead perspectives. To construct MLRSNet, we gather a team of more than 50 annotators in the remote sensing domain and spend more than six months on the whole process. The construction of MLRSNet is mainly composed of three procedures, i.e., scene sample collection, database quality control, and database sample diversity improvement.

2.2.1.    Scene category and sample collection

To satisfy the hierarchy criterion, the first asset of a high-quality dataset is covering an exhaustive list of representative scene categories. To achieve this goal, we investigate all scene classes of the existing datasets to form a list of scene categories. In the process, we merge some similar semantic scene categories from different datasets into a new category. For example, "playground" and "ground track field" are taken as "ground track field", and "harbor" and "port" are taken as a new category called "harbor&port". We also search the keywords "object-based image analysis (OBIA)", "geographic object-based image analysis (GEOBIA)", "land cover classification", "land use classification", "geospatial image retrieval" and "geospatial object detection" on Web of Science and Google Scholar to carefully select some new meaningful scene classes. Consequently, we obtain 46 scene categories in total, as shown in Fig. 1. Moreover, most existing datasets are labeled only with the names of the categories, which describe the most significant semantic content of the images, while the primitive classes (multiple labels) present in the images are ignored. Thus, we associate each image with one or more land-cover class labels (i.e., multi-labels) based on visual inspection. For every scene category, we randomly select 100 images and annotate the primitive classes in each image. Next, we count the primitive classes in these images and filter out the primitive classes whose number of occurrences is no more than 5. Finally, we obtain 60 multi-labels that occur frequently in remote sensing samples. Generally, scene categories are scene-level labels and primitive classes are object-level labels.
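As a minimal illustration of this counting-and-filtering step, the sketch below expresses the rule in Python; the file names, the dictionary structure and the use of a frequency counter are assumptions for illustration only, since the actual annotation was carried out manually by the team.

```python
from collections import Counter

# Per-image primitive-class annotations for the 100 sampled images of one scene
# category, e.g. {"img_001.jpg": ["grass", "trees", "road"], ...} (hypothetical names).
annotations = {
    "img_001.jpg": ["grass", "trees", "road"],
    "img_002.jpg": ["grass", "buildings", "cars", "road"],
    # ... remaining sampled images of the category
}

# Count how often each primitive class occurs across the sampled images and keep
# only those occurring more than 5 times, mirroring the filtering rule above.
counts = Counter(label for labels in annotations.values() for label in labels)
kept_labels = sorted(label for label, n in counts.items() if n > 5)
print(kept_labels)
```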


 

Fig. 3. The experimental process of quality control. (a) The experimental process of a single annotator. (b) The experimental process of the administrator.


 

Table 3

Confidence score table for different categories of data samples, showing how the annotators' judgments influence the probability of an image being a good image.

Accept                  Reject                 Airport                   Bridge                 Island                  Parkway

 

0                           1                          0.13                       0.05                      0.03                     0.14

1                           0                          0.80                       0.87                      0.89                     0.67

1                           1                          0.51                       0.49                      0.50                     0.52

2                           0                          0.90                       0.97                      0.98                     0.84

0                           2                          0.05                       0.03                      0.02                     0.13

3                           0                          0.97                       0.99                      1.00                     0.90

    2                           1                          0.82                       0.86                      0.88                     0.73             

 

Compared with other satellite or aerial image datasets, the samples in MLRSNet carry additional meaningful information, such as hierarchy and multi-label information. In particular, when multiple labels of scene samples are available, we can search for ground objects more precisely by comparing sample features. With this information, many multi-label tasks can be addressed, such as multi-label image classification, multi-label image retrieval and object detection.

Data diversity is ensured by the data sources and manual control. More than 20 people collect data samples from multi-resolution, multi-continent, multi-time, multi-light and multi-viewpoint data sources to characterize MLRSNet. Like most of the existing datasets, such as AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), and WHU-RS19 (Xia et al., 2010), MLRSNet is also extracted from Google Earth, where images come from different remote imaging sensors. The satellite sensors include but are not limited to GeoEye-1, WorldView-1, WorldView-2, SPOT-7, Pleiades-1A, and Pleiades-1B. Images can also be collected by cameras for aerial photography. We collect data from all over the world to satisfy the diversity criteria, and the samples in MLRSNet cover more than 100 countries and regions. In addition, we control the data diversity. More details can be found in Section 2.2.3.


2.2.2.    Data quality control

Aiming to develop a highly accurate dataset, we implement a quality control process. In the process, we rely on another 20 annotators to verify the ground-truth label of each candidate image collected in the previous process, including scene sample annotation, counting confidence scores, and disposing of confusing data.

 

Fig. 3 illustrates the experimental process of quality control. A sample is randomly presented to an annotator, who selects a category from the 46 predefined category names. If the annotation result remains consistent with the ground-truth label of the candidate sample, the system gives an "accept" response; otherwise, it gives a "reject" response. Because of human subjectivity and the complexity of the images, different images need different numbers of annotations. The solution, following ImageNet (Deng et al., 2009), is to require multiple annotators to tag the images individually. While annotators are instructed to label an image, we use a confidence table (see Table 3) to dynamically determine the number of annotations needed for different categories of images. Table 3 shows examples for "airport", "bridge", "island" and "parkway". The confidence score indicates the probability that an image is a good image given the annotator votes. After data labeling for approximately two months, every sample is labeled several times until a predetermined confidence score threshold is reached. Therefore, the data samples with a confidence score ≥ 0.97 are retained while the others are removed.
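The paper does not state how the confidence scores in Table 3 are computed; one plausible reading, in line with the ImageNet-style annotation protocol cited above, is a simple Bayesian update over accept/reject votes. The sketch below is only an assumption: the prior and the annotator accuracies are illustrative numbers, not values from the paper.

```python
def confidence_score(n_accept, n_reject, prior_good=0.5,
                     p_accept_given_good=0.9, p_accept_given_bad=0.15):
    """Posterior probability that an image is a 'good' image given the votes.

    prior_good          : category-dependent prior that a candidate image is good
    p_accept_given_good : chance an annotator accepts a genuinely good image
    p_accept_given_bad  : chance an annotator accepts a bad image
    All three numbers are assumptions chosen for illustration.
    """
    like_good = (p_accept_given_good ** n_accept) * ((1 - p_accept_given_good) ** n_reject)
    like_bad = (p_accept_given_bad ** n_accept) * ((1 - p_accept_given_bad) ** n_reject)
    return prior_good * like_good / (prior_good * like_good + (1 - prior_good) * like_bad)

# Images keep collecting votes until the score crosses a threshold such as 0.97.
for votes in [(1, 0), (2, 0), (3, 0), (1, 1)]:
    print(votes, round(confidence_score(*votes), 2))
```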

We observe that the boundaries of some data pairs are blurry, e.g., airport and runway, intersection and crosswalk, desert and bareland (see Fig. 4). For this reason, we gather our annotators for a discussion about the boundaries of these data pairs. After that, we begin a second labeling round for the data in these ambiguous pairs with a confidence score of <0.97. Similarly, after the second labeling round, we preserve the samples scored ≥0.97 and deprecate the others. Finally, we collect more than 100,000 data samples within 46 scene categories.


 

 

 

Fig. 4. Boundaries among scene categories can be blurry. The images show a soft transition between airport vs. runway, crosswalk vs. intersection and bareland vs. desert.


 

Fig. 5. A screenshot of the visual tool that computes the relative diversity of scene datasets. Different pairs of samples are randomly presented to a person who is instructed to select the most similar pair. Each trial is composed of 4 pairs from each database, giving a total of 12 pairs to choose from.

 


Table 4
Details of each CNN model used in the experiments.

CNNs | Layers | Parameters | Top-1 Accuracy | Top-5 Accuracy | Year
InceptionV3 | 47 | 23 M | 0.779 | 0.937 | 2014
VGGNet16 | 16 | 138 M | 0.713 | 0.901 | 2014
VGGNet19 | 19 | 143 M | 0.713 | 0.900 | 2014
ResNet50 | 50 | 25 M | 0.749 | 0.921 | 2015
ResNet101 | 101 | 44 M | 0.764 | 0.928 | 2015
DenseNet121 | 121 | 8 M | 0.750 | 0.923 | 2017
DenseNet169 | 169 | 14 M | 0.762 | 0.932 | 2017
DenseNet201 | 201 | 20 M | 0.773 | 0.936 | 2017

The top-1 and top-5 accuracy refer to the models' performance on the ImageNet validation dataset.

2.2.3.    Data diversity improvement

An ideal dataset, expected to generalize well, should have high diversity, which means it should include a high variability of appearances, locations, resolutions, scales, background clutter and occlusions.

We use a measure to quantify the relative diversity of image datasets, following the practice in reference (Zhou et al., 2017). During the procedure of comparing the diversity of data samples, when our dataset's diversity is lower than that of other datasets, we can improve the quality and number of data samples for a certain category.

We develop a tool with a graphical user interface, as shown in Fig. 5. We ask 10 annotators to measure the relative diversities among AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), and MLRSNet. Each time, different pairs of samples are randomly presented to an annotator who is instructed to select the most similar pair. Each trial is composed of 4 pairs from each database, giving a total of 12 pairs to choose from. We run 50 trials per category and 10 observers per trial, for the 20 categories in common.

Fig. 6 shows the results of the relative diversity for all 20 scene categories common to the three databases. The results show that there is a large variation in terms of diversity among the three datasets, and MLRSNet is the most diverse of the three. The average relative diversity is 0.78 for MLRSNet, 0.56 for AID (Xia et al., 2017), and 0.69 for NWPU-RESISC45 (Cheng et al., 2017). The categories with the smallest variation in diversity in MLRSNet are baseball diamond, beach, sparse residential area and storage tank. Then, we conduct a random rotation, resize and crop for all images in these four categories.

Fig. 6. Relative diversity of each category (20 categories) in a different dataset. MLRSNet (in red line) contains the most diverse set of images. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

3.   Scene classification

Scene classification is a fundamental task in remote sensing image understanding. Recently, classification using convolutional neural networks (CNNs) has achieved significant performance. MLRSNet can be taken as a benchmark to evaluate the classification performances of different CNNs.

 

3.1.   Experimental settings

 

Eight popular CNN architectures, i.e., InceptionV3 (Szegedy et al., 2016), VGGNet16 (Simonyan and Zisserman, 2014), VGGNet19 (Simonyan and Zisserman, 2014), ResNet50 (He et al., 2016), ResNet101 (He et al., 2016), DenseNet121 (Huang et al., 2017), DenseNet169 (Huang et al., 2017) and DenseNet201 (Huang et al., 2017), are chosen to address the remote sensing image classification problem, and the details of each model are shown in Table 4. It should be noted that the final layer of each CNN is replaced by a dense connection with 60 nodes, and the result of the dense connection is activated by a sigmoid function. The sigmoid maps the values of the output vector of the network to the interval (0, 1), indicating the score for each class. We then binarize the output vector with a threshold of 0.5 to generate a multi-label prediction, such as [1, 0, 1, …, 0], where 1 indicates that the image is annotated with the corresponding label and 0 indicates that it is not. Finally, the models are trained on MLRSNet. We call the fine-tuned models MLRSNet-CNN, e.g., MLRSNet-VGGNet16.

Table 5
Parameters utilized for model fine-tuning.

Package | Epochs | Batch Size | Optimizer | Learning rate
Keras | 10 | 32 | Adam | 0.01
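The paper does not publish its training code; the following minimal Keras sketch shows one way the head replacement and 0.5-thresholding described in the paragraph above could look. The DenseNet201 backbone, the average-pooling choice and the function names are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_LABELS = 60  # number of predefined class labels in MLRSNet

# Load an ImageNet-pretrained backbone without its original 1000-way softmax head.
backbone = tf.keras.applications.DenseNet201(
    include_top=False, weights="imagenet", input_shape=(256, 256, 3), pooling="avg")

# Replace the final layer with a 60-node dense connection activated by a sigmoid.
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(backbone.output)
model = models.Model(inputs=backbone.input, outputs=outputs)

# At inference time, binarize the sigmoid scores with a threshold of 0.5
# to obtain a multi-label prediction such as [1, 0, 1, ..., 0].
def predict_multilabel(images, threshold=0.5):
    scores = model.predict(images)          # shape: (batch, 60), values in (0, 1)
    return (scores >= threshold).astype(np.int32)
```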

InceptionV3 (Szegedy et al., 2016): The inception module was first proposed in reference (Szegedy et al., 2016) by Google and was adopted for image classification and object detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. The authors further explored methods to scale up networks in ways that aim at utilizing the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization, which is named InceptionV3.

VGGNet16, VGGNet19 (Simonyan and Zisserman, 2014): VGG was originally developed for the ImageNet dataset by the Oxford Visual Geometry Group in the ILSVRC14. To investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting, Simonyan and Zisserman thoroughly evaluated networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which showed a significant improvement in accuracy. In this work, we use the two models that show the corresponding performance in scene classification, named VGGNet16 and VGGNet19.

ResNet50, ResNet101 (He et al., 2016): Residual Nets (ResNet) is a framework presented by Microsoft Research to ease the training of networks that are substantially deeper than those used previously. This model won 1st place in the ILSVRC 2015 classification task. ResNet50 is the 50-layer ResNet, and ResNet101 is the 101-layer ResNet.

DenseNet121, DenseNet169, DenseNet201 (Huang et al., 2017): The dense convolution network (DenseNet) connects each layer to every other layer in a feed-forward manner and has L(L+1)/2 direct connections for a convolutional network with L layers. DenseNets are widely used because they have several compelling advantages, such as alleviating the vanishing gradient problem, strengthening feature propagation, encouraging feature reuse and substantially reducing the number of parameters. DenseNet121, DenseNet169 and DenseNet201 are the 121-layer, 169-layer and 201-layer DenseNet, respectively.

To comprehensively evaluate the classification performances of different CNNs, three training–testing ratios are considered: (i) 20%–10%–70%, i.e., we randomly select 20% of the dataset for training, 10% for validation and the rest for testing; (ii) 30%–10%–60%; (iii) 40%–10%–50%.

We choose TensorFlow and the Python package Keras for our experiments. The aforementioned eight methods, which are pretrained on the ImageNet dataset, are obtained from the URL: https://github.com/fchollet/deep-learning-models/releases. In the experiments, binary cross-entropy measures how far away from the true value (which is either 0 or 1) the prediction is for each of the classes and then averages these class-wise errors to obtain the final loss. The formula of binary cross-entropy adopted for multi-label classification is as follows:

$$L = -\frac{1}{mq}\sum_{i=1}^{m}\sum_{j=1}^{q}\left(t_i^j \log \hat{t}_i^j + \left(1 - t_i^j\right)\log\left(1 - \hat{t}_i^j\right)\right)$$

where t_i^j ∈ {0, 1} denotes the jth ground-truth label for training image X_i, t̂_i^j is the output of the sigmoid layer, m is the number of training images and q is the number of classes in total.

To improve the generalization capability, we fine-tune the models using the parameters listed in Table 5. All the CNN models are implemented on a 2.10 GHz 48-core CPU. In addition, a TITAN RTX GPU is used for acceleration. We try to avoid introducing random errors by duplicating experiments. In this study, we repeat each experiment five times and plot an error-bar graph from the results.
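For concreteness, here is a small NumPy sketch of the multi-label binary cross-entropy defined earlier in this subsection; the array shapes and toy numbers are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def multilabel_bce(t, t_hat, eps=1e-7):
    """Binary cross-entropy for multi-label classification, as in the formula above.

    t     : (m, q) array of ground-truth labels t_i^j in {0, 1}
    t_hat : (m, q) array of sigmoid outputs in (0, 1)
    m is the number of training images, q the number of classes (60 for MLRSNet).
    """
    t_hat = np.clip(t_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(t * np.log(t_hat) + (1.0 - t) * np.log(1.0 - t_hat))

# Toy check with m = 2 images and q = 4 labels (values are illustrative only).
t = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0]], dtype=float)
t_hat = np.array([[0.9, 0.2, 0.7, 0.1],
                  [0.3, 0.8, 0.2, 0.1]], dtype=float)
print(multilabel_bce(t, t_hat))  # np.mean averages over all m*q class-wise errors
```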

 

3.2.   Evaluation protocols

 

We compute two commonly used evaluation metrics, i.e., mean average precision and average F1 score to quantitatively evaluate the classification results.

The F1 score is a comprehensive metric for evaluating the classification performance of each model and can be defined as:

$$F_1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{precision} = \frac{|L_c \cap L_r|}{|L_c|}, \qquad \mathrm{recall} = \frac{|L_c \cap L_r|}{|L_r|}$$

where L_c is the final label vector output by the network for a sample and L_r is the ground-truth label vector of the sample; ∩ denotes intersection and |·| denotes the number of nonzero entries. A higher value of the F1 score represents a better classification performance.

Here, it should be pointed out that all precision and recall values are computed separately for each sample and then averaged across samples.
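A short NumPy sketch of the sample-averaged precision, recall and F1 described above follows. The per-sample-then-average reading of the F1 computation and the epsilon guard against empty label sets are assumptions; the paper's exact implementation is not given.

```python
import numpy as np

def sample_averaged_f1(pred, truth, eps=1e-12):
    """Sample-wise precision, recall and F1, averaged over samples.

    pred, truth : (n_samples, n_labels) binary arrays; pred is the thresholded
    network output L_c, truth is the ground-truth label vector L_r.
    """
    inter = np.sum(pred * truth, axis=1)               # |L_c ∩ L_r| per sample
    precision = inter / (np.sum(pred, axis=1) + eps)   # |L_c ∩ L_r| / |L_c|
    recall = inter / (np.sum(truth, axis=1) + eps)     # |L_c ∩ L_r| / |L_r|
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision.mean(), recall.mean(), f1.mean()

# Toy example with 2 samples and 5 labels (illustrative values only).
pred = np.array([[1, 0, 1, 0, 0], [0, 1, 1, 0, 1]])
truth = np.array([[1, 0, 0, 0, 0], [0, 1, 1, 1, 0]])
print(sample_averaged_f1(pred, truth))
```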

 

3.3.   Experimental results

 

We evaluate eight methods using MLRSNet and present the results of multi-label image classification as follows. As shown in Tables 6 and 7, the fine-tuned models can achieve good classification performances on MLRSNet, which indicates that deep-learning-based models have the ability to obtain discriminative features. The statistical results of repeated trials are shown in Figs. 7 and 8.

Table 6
Mean Average Precision (%) of the eight fine-tuned models under different training ratios.

CNNs | 20% | 30% | 40%
MLRSNet-InceptionV3 | 81.50 | 82.33 | 84.84
MLRSNet-VGGNet16 | 67.88 | 72.66 | 75.39
MLRSNet-VGGNet19 | 66.12 | 69.53 | 73.60
MLRSNet-ResNet50 | 82.65 | 84.28 | 86.01
MLRSNet-ResNet101 | 83.26 | 84.19 | 85.72
MLRSNet-DenseNet121 | 75.96 | 77.99 | 80.25
MLRSNet-DenseNet169 | 82.16 | 86.42 | 87.35
MLRSNet-DenseNet201 | 87.25 | 87.84 | 88.77

The best results are shown in bold.

Table 7
F1 score of the eight fine-tuned models under different training ratios.

CNNs | 20% | 30% | 40%
MLRSNet-InceptionV3 | 0.7746 | 0.8016 | 0.8146
MLRSNet-VGGNet16 | 0.5743 | 0.6534 | 0.6855
MLRSNet-VGGNet19 | 0.5677 | 0.6120 | 0.6329
MLRSNet-ResNet50 | 0.7530 | 0.8176 | 0.8353
MLRSNet-ResNet101 | 0.7618 | 0.7703 | 0.8226
MLRSNet-DenseNet121 | 0.7154 | 0.7389 | 0.7571
MLRSNet-DenseNet169 | 0.8138 | 0.8408 | 0.8521
MLRSNet-DenseNet201 | 0.8381 | 0.8414 | 0.8538

The best results are shown in bold.


 

Fig. 7. The statistical results of Mean Average Precision. The bar chart shows the average, and the error line presents the standard deviation.

 


Fig. 8. The statistical results of F1 score. The bar chart shows the average, and the error line presents the standard deviation.

 


It is worth noting that MLRSNet-DenseNet201 obtains significantly better metric values in the comparative experiments. In particular, the MLRSNet-DenseNet201 model achieves an overall improvement as the number of data samples increases. MLRSNet-DenseNet201 and MLRSNet-DenseNet169 achieve an F1 score of over 0.80 within the 10 epochs when the training ratio is 20%.

Moreover, with increasing training data, the performances of the models increase. This suggests that increasing the data size can further improve the performances of deep learning models.

Fig. 9 shows several annotation examples on MLRSNet when the fine-tuned DenseNet201 is employed. Here, annotations in black font are included in the ground-truth labels, whereas annotations in red font are incorrect labels tagged by the model. The green fonts are correct labels that the model does not tag. It is obvious that the first seven images are all annotated correctly. The 8th to 10th images do not include incorrect annotations but miss no more than two correct labels, and the last five images contain one incorrect label. Hence, we can further find that MLRSNet-DenseNet201 has outstanding performance in multi-label classification. This model can also be adopted to help us label scene images in the future.

4.   Image retrieval

With the sharp increase in the volume of remote sensing images, image retrieval has become an important topic of research in remote sensing. We then show the application of MLRSNet to image retrieval. Similarly, it can be used as a benchmark dataset to evaluate the retrieval performances of different models.


4.1.   Experimental settings

 

We use the afore-mentioned eight fine-tuned CNN models (in Section

3) to evaluate the retrieval performances of the dataset in this work. Images are fed into the CNNs, and features are extracted from the last layer of each network. Euclidean distance is selected to calculate the similarity between the query images and the images in the retrieval archives. Only if a retrieval result and the query image belong to the same category do we consider that the query has been satisfied.

Since the fine-tuned CNNs trained with 40% of the MLRSNet images are applied to perform the retrieval experiments, the remaining 60% of the images are adopted as the testing queries and retrieval database. To make full use of the dataset for the retrieval experiments and to minimize random error, in this section we randomly split this remaining portion of MLRSNet into testing queries and a retrieval database at three different ratios: 5% vs. 55%, 10% vs. 50%, and 15% vs. 45%. Taking 5% vs. 55% as an example, 5% of the images from each category are selected as query images to query the rest of the dataset. Databases of different sizes are further used to validate the effectiveness of MLRSNet for image retrieval experiments. Table 8 shows the number of images in the testing queries and in the retrieval database when 5%, 10% and 15% of the images from each category are selected as the testing queries.
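A minimal sketch of the retrieval procedure described above is given below; the feature matrices, variable names and toy data are assumptions, since the paper does not release its retrieval code.

```python
import numpy as np

def retrieve(feats_q, feats_db, top_k=100):
    """Rank database images for each query by Euclidean distance in feature space.

    feats_q  : (N_q, d) features of the testing queries from the last CNN layer
    feats_db : (N_db, d) features of the retrieval database images
    """
    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (np.sum(feats_q ** 2, axis=1, keepdims=True)
          + np.sum(feats_db ** 2, axis=1)
          - 2.0 * feats_q @ feats_db.T)
    # Indices of the top-k most similar (smallest-distance) database images per query.
    return np.argsort(d2, axis=1)[:, :top_k]

# Toy example with random features; a retrieved image counts as correct when it
# shares the query image's scene category.
rng = np.random.default_rng(0)
feats_q, feats_db = rng.normal(size=(3, 128)), rng.normal(size=(50, 128))
print(retrieve(feats_q, feats_db, top_k=5))
```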

 

4.2.   Retrieval metrics

To evaluate the retrieval performance, we use average normalized modified retrieval rank (ANMRR), mean average precision (mAP), and precision at k (P@k, where k is the number of retrieved images) as metrics. In the following experiments, the ANMRR, mAP, and P@k are the averaged values over all the query images.


 

Fig. 9. Some samples of image classification by the MLRSNet-DenseNet201 model. Multi-labels of each image are reported below the related image; the red font indicates an incorrect classification result, while the green font indicates correct labels that the model did not tag. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

 


 

Table 8

The number of images in testing queries and in retrieval database when different percentages are chosen.           

    Percentage                                            5%                                    10%                                  15%         

Testing queries                                   3275                                  6550                                  9825

Retrieval database                               62,222                               58,947                               55,672


For a detailed description of ANMRR, we refer the reader to the reference (Manjunath et al., 2001). The formulas of the last two metrics are as follows.

$$P@k = \frac{1}{N_q}\sum_{i=1}^{N_q}\frac{N_s}{K}$$


 

 


 

                                                                                                                         

Fig. 10. The precision-recall curves of MLRSNet-CNNs with different percentages of the testing queries. (a), (b) and (c) correspond to testing query percentages of 5%, 10% and 15%, respectively.


 

Table 9

The retrieval results of different methods. For ANMRR, lower value indicates better performance, while for mAP and P@k, larger is better.

Percentage      Models                          ANMRR      mAP        P@10       P@50       P@100      P@500
5%              MLRSNet-Inception V3            0.3506     0.7307     0.7524     0.7270     0.7130     0.6575

MLRSNet-VGGNet16                                  0.5503                                0.4868                              0.5073                              0.4833                              0.4699                              0.4107

MLRSNet-VGGNet19                                  0.6248                                0.3690                              0.3878                              0.3656                              0.3539                              0.3084

MLRSNet-ResNet50                                   0.2407                                0.8333                              0.8462                              0.8314                              0.8220                              0.7610

MLRSNet-ResNet101                                 0.3001                                0.7580                              0.7730                              0.7556                              0.7456                              0.6928

MLRSNet-DenseNet121                            0.3440                                0.7513                              0.7758                              0.7478                              0.7316                              0.6577

MLRSNet-DenseNet169                            0.1665                                0.8874                              0.8969                              0.8863                              0.8791                              0.8414

MLRSNet-DenseNet201                            0.1557                                0.8959                              0.9031                              0.8949                              0.8898                              0.8596

10%                                        MLRSNet-Inception V3                             0.3518                                0.7321                              0.7547                              0.7283                              0.7135                              0.6535

MLRSNet-VGGNet16                                  0.5579                                0.4721                              0.4922                              0.4683                              0.4552                              0.3956

MLRSNet-VGGNet19                                  0.6238                                0.3709                              0.3883                              0.3676                              0.3561                              0.3070

MLRSNet-ResNet50                                   0.2496                                0.8208                              0.8339                              0.8191                              0.8082                              0.7415

MLRSNet-ResNet101                                 0.3073                                0.7435                              0.7695                              0.7394                              0.7219                              0.6832

MLRSNet-DenseNet121                            0.3493                                0.7604                              0.7810                              0.7571                              0.7443                              0.6423

MLRSNet-DenseNet169                            0.1693                                0.8834                              0.8928                              0.8823                              0.8747                              0.8327

MLRSNet-DenseNet201                            0.1583                                0.8936                              0.9018                              0.8925                              0.8869                              0.8520

15%                                        MLRSNet-Inception V3                             0.3569                                0.7225                              0.7450                              0.7189                              0.7042                              0.6372

MLRSNet-VGGNet16                                  0.5626                                0.4533                              0.4760                              0.4489                              0.4358                              0.3847

MLRSNet-VGGNet19                                  0.6213                                0.3694                              0.3883                              0.3661                              0.3521                              0.3053

MLRSNet-ResNet50                                   0.2503                                0.8141                              0.8299                              0.8123                              0.7988                              0.7317

MLRSNet-ResNet101                                 0.3096                                0.7516                              0.7675                              0.7498                              0.7368                              0.6745

MLRSNet-DenseNet121                            0.3505                                0.7355                              0.7631                              0.7306                              0.7141                              0.6301

MLRSNet-DenseNet169                            0.1724                                0.8786                              0.8893                              0.8775                              0.8687                              0.8223

MLRSNet-DenseNet201                            0.1577                                0.8935                              0.9003                              0.8929                              0.8869                              0.8465

The best results are shown in bold.

 


Fig. 11. The ANMRR results of features for a set of categories with 10% of images extracted by MLRSNet-CNNs.

 

p@K = \frac{N_s}{K}

\mathrm{mAP} = \frac{1}{N_q}\sum_{i=1}^{N_q}\frac{1}{m_i}\sum_{k=1}^{m_i} p@k

where N_q is the number of queries, N_s is, for a given query, the number of images in the result list that are considered correct, K is the number of retrieved images, and m_i is the number of result images for query i. We set m_i = 100, i.e., we compute the mean of the top-100 average retrieval precision.
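To make the computation concrete, the following is a minimal NumPy sketch of p@k and of mAP exactly as defined above (for each query, average p@k over the top m_i = 100 ranks, then average over all queries). The relevance vectors and function names are illustrative and are not code from the paper.

```python
import numpy as np

def precision_at_k(relevant, k):
    """p@k = N_s / K: fraction of the top-k retrieved images that are correct."""
    return float(np.sum(relevant[:k])) / k

def mean_average_precision(relevance_lists, m=100):
    """mAP as defined above: for each query i, average p@k over k = 1..m_i,
    then take the mean over all N_q queries (top-100 average retrieval precision)."""
    aps = []
    for relevant in relevance_lists:
        m_i = min(m, len(relevant))
        ap = np.mean([precision_at_k(relevant, k) for k in range(1, m_i + 1)])
        aps.append(ap)
    return float(np.mean(aps))

# Illustrative relevance judgements (1 = correct, 0 = incorrect) for two queries.
queries = [np.array([1, 1, 0, 1, 0]), np.array([0, 1, 1, 0, 0])]
print(mean_average_precision(queries, m=5))
```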

 

4.3.   Experimental results

The precision-recall curves of the eight methods with different percentages of testing queries are shown in Fig. 10. Table 9 lists the values of ANMRR, mAP, and P@k (k = 10, 50, 100, 500) obtained when the MLRSNet-CNNs are used and different percentages of the testing queries are chosen.
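For readers unfamiliar with ANMRR, it follows the standard MPEG-7 formulation: the retrieval ranks of a query's ground-truth images are averaged, ranks falling outside a window K(q) are replaced by a fixed penalty, and the result is normalized to [0, 1], where lower values mean better retrieval. The sketch below is a generic illustration under the common choice K(q) = 2·NG(q); the exact window used for MLRSNet is not restated here and is an assumption.

```python
import numpy as np

def anmrr(rank_lists, ng_list, window=2.0):
    """ANMRR (MPEG-7 style). rank_lists[i] holds the 1-based retrieval ranks of the
    NG(i) ground-truth images of query i; ranks beyond K(i) = window * NG(i) are
    penalized with the constant 1.25 * K(i). Lower values indicate better retrieval."""
    nmrrs = []
    for ranks, ng in zip(rank_lists, ng_list):
        k = window * ng
        penalized = [r if r <= k else 1.25 * k for r in ranks]
        avr = np.mean(penalized)                     # average (penalized) rank
        mrr = avr - 0.5 - ng / 2.0                   # modified retrieval rank
        nmrr = mrr / (1.25 * k - 0.5 - ng / 2.0)     # normalized to [0, 1]
        nmrrs.append(nmrr)
    return float(np.mean(nmrrs))

# Illustrative example: one query whose 3 ground-truth images appear at ranks 1, 2 and 10.
print(anmrr([[1, 2, 10]], [3]))
```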

By analyzing Fig. 10 and Table 9, it can be observed that most of the MLRSNet-CNNs obtain impressive retrieval results regardless of the percentage of testing queries. This demonstrates that CNNs indeed perform well on the image retrieval task using MLRSNet.
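As a rough illustration of how such a retrieval run is typically set up (a sketch, not the authors' implementation): each image is passed through a trained MLRSNet-CNN, the penultimate activations serve as an image descriptor, and the database is ranked by similarity to the query descriptor. The backbone choice, preprocessing, and function names below are assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Hypothetical feature extractor: a DenseNet201 backbone with the classifier removed,
# so each image is described by the pooled activations of the last dense block.
backbone = models.densenet201()            # in practice, fine-tuned on MLRSNet
backbone.classifier = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def describe(batch):
    """batch: float tensor of shape (N, 3, 224, 224), already normalized."""
    feats = backbone(batch)
    return F.normalize(feats, dim=1)       # L2-normalize so dot product = cosine similarity

def rank_database(query_feat, db_feats, top_k=100):
    """Return indices of the top_k database images most similar to the query."""
    sims = db_feats @ query_feat           # cosine similarities against every database image
    return torch.topk(sims, k=min(top_k, db_feats.shape[0])).indices

# Toy usage with random tensors standing in for preprocessed images.
db = describe(torch.randn(8, 3, 224, 224))
q = describe(torch.randn(1, 3, 224, 224))[0]
print(rank_database(q, db, top_k=5))
```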


Especially, MLRSNet-DenseNet201 outperforms the other models in mAP and P@k (k = 10, 50, 100, 500) across the different percentages of testing queries on the MLRSNet dataset. The mAP results of MLRSNet-DenseNet201 indicate a relative improvement of about 1% over MLRSNet-DenseNet169, and a relative increase of 6.26–7.94% over MLRSNet-ResNet50. Meanwhile, MLRSNet-DenseNet201 achieves a lower ANMRR than the other models at every percentage of testing queries. In Fig. 10, the precision-recall curve of MLRSNet-DenseNet201 is superior to those of the other methods regardless of the size of the retrieval database. As a result, MLRSNet-DenseNet201 achieves better performance than the other approaches. The reason is that each layer in the DenseNet accepts the features of all previous layers as input, which maximizes the information flow between layers in the network. Besides, for the three DenseNet variants, as the number of layers increases, the models obtain more representative and discriminative image features, yielding better experimental performance.
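The dense connectivity referred to here can be sketched as follows: within a dense block, every layer receives the concatenation of the block input and all preceding layers' outputs. The PyTorch snippet below is a minimal illustration of that pattern with arbitrary layer sizes, not the actual DenseNet-201 implementation.

```python
import torch
import torch.nn as nn

class TinyDenseBlock(nn.Module):
    """Each layer takes the concatenation of the block input and all previous
    layer outputs, so information (and gradients) flow directly between layers."""
    def __init__(self, in_channels, growth_rate=12, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate                  # every layer adds growth_rate channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all previous feature maps
            features.append(out)
        return torch.cat(features, dim=1)

block = TinyDenseBlock(in_channels=16)
print(block(torch.randn(1, 16, 32, 32)).shape)       # -> torch.Size([1, 64, 32, 32])
```

Because each convolution only adds a small number of channels while reusing everything before it, deeper blocks accumulate increasingly rich features, which is consistent with the trend observed from DenseNet121 to DenseNet201.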


 

Moreover, Fig. 11 shows the ANMRR results of the features extracted for several categories by the eight methods when 10% of MLRSNet is chosen as the testing queries. It is evident that the ANMRR results of the MLRSNet-VGGNets are large for most categories, meaning that their retrieval performance is poor. This may be because such simple network structures can only obtain limited representations of the images, leading to poor experimental performance. Meanwhile, it can be seen that the same network performs differently across categories. In particular, most networks perform well on categories that are easily recognized (e.g., lake, harbor & port, and swimming pool).

Fig. 12. The precision-recall curves of MLRSNet-DenseNet201 on a set of categories with the MLRSNet.

Taking MLRSNet-DenseNet201 as an example, we show in Fig. 12 the precision-recall curves for a set of categories when the percentage of testing queries is 10%. This illustrates that although the overall retrieval performance of MLRSNet-DenseNet201 is impressive, its performance on several categories, e.g., basketball court and commercial area, is not satisfactory. The reason may be that the high intra-class diversity of these two categories increases the difficulty of image retrieval.

Fig. 13 shows two examples of result images retrieved by MLRSNet-DenseNet201 when the query image is randomly selected from the airplane category and the stadium category and the percentage of testing queries is 10%. The multi-labels associated with each image are given below it. From the retrieved results, it is obvious that the MLRSNet-DenseNet201 model accurately detects the multi-label objects associated with a given query image and retrieves the most visually similar images from the database.

Compared with single-label remote sensing image retrieval, image retrieval using a multi-label dataset adds more constraints (the multiple labels) to the retrieval process, thereby achieving more accurate retrieval results. We use the name of each category as the label of every image in that category, apply the resulting single-label remote sensing dataset (SLRSNet) to train the DenseNet201 model, and perform the same retrieval experiments. The retrieval results are shown in Fig. 14. Comparing Figs. 13 and 14, we find that the multi-label retrieval results share more labels with the query image, indicating that they are more similar to it. Taking the stadium category as an example, when we query stadium images that contain a swimming pool, multi-label image retrieval better matches the labels 'swimming pool' and 'water', whereas single-label image retrieval has difficulty meeting this requirement.
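A simple way to quantify this effect (an illustration, not a procedure from the paper) is to count how many labels each retrieved image shares with the query's label set; multi-label retrieval tends to produce larger overlaps than retrieval trained on a single label per image. The labels below are hypothetical.

```python
def label_overlap(query_labels, retrieved_labels_list):
    """Return, for each retrieved image, how many of its labels it shares with the query."""
    query = set(query_labels)
    return [len(query & set(labels)) for labels in retrieved_labels_list]

# Hypothetical labels for a stadium query and three retrieved images.
query = ["stadium", "swimming pool", "water", "grass"]
retrieved = [
    ["stadium", "swimming pool", "water", "trees"],   # multi-label style match
    ["stadium", "cars", "road"],                      # partial match
    ["stadium"],                                      # single-label style match
]
print(label_overlap(query, retrieved))                # -> [3, 1, 1]
```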

 

5. Conclusion

The MLRSNet is a multi-label high spatial resolution remote sensing dataset for semantic scene understanding with deep learning from the overhead perspective. MLRSNet has distinctive characteristics: hierarchy, large scale, diversity, and multi-label annotation. Experiments, including multi-label scene classification and image retrieval, are conducted with different deep neural networks. From these experimental results, we can conclude that MLRSNet can be adopted as a benchmark dataset for performance evaluation of multi-label image retrieval and scene classification. MLRSNet complements the current large object-centric datasets such as ImageNet. In future work, we will continue to expand the MLRSNet and apply the dataset to other recognition tasks, such as semantic/instance segmentation in large-scale scene images and ground object recognition.

Fig. 13. The retrieval results of the airplane (top) and stadium (bottom) categories by the MLRSNet-DenseNet201 model when the percentage of testing queries is 10%.

Fig. 14. The retrieval results of the airplane (top) and stadium (bottom) categories by the SLRSNet-DenseNet201 model when the percentage of testing queries is 10%.

 

Declaration of Competing Interest

 

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

 

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.isprsjprs.2020.09.020.

 
