ISPRS Journal of Photogrammetry and Remote Sensing
169 (2020) 337–350
MLRSNet: A multi-label high spatial resolution remote sensing dataset for semantic scene understanding
Xiaoman Qi a, 1, Panpan Zhu b, c, 1, Yuebin Wang a,*, Liqiang Zhang c, Junhuan Peng a, Mengfan Wu c, Jialong Chen a, Xudong Zhao a, Ning Zang a, P. Takis Mathiopoulos d
a School of Land Science and Technology, China University of Geosciences, Beijing 100083, China
b College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
c Beijing Key Laboratory of Environmental Remote Sensing and Digital Cities, Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China
d Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens 15784, Greece
A R T I C L E I N F O
Keywords:
Multi-label image dataset
Semantic scene understanding
Convolutional Neural Network (CNN)
Image classification
Image retrieval
A B S T R A C T
To better understand scene images in the
field of remote sensing, multi-label annotation of scene images is necessary.
Moreover, to enhance the performance of deep learning models for dealing with
semantic scene understanding tasks, it is vital to train them on large-scale annotated data. However, most existing datasets
are annotated by a single label,
which cannot describe the complex remote
sensing images well
because scene images might have multiple land cover classes. Few multi-label high
spatial resolution remote
sensing datasets have
been developed to train deep learning models for multi-label based
tasks, such as scene classification and image retrieval. To address this issue,
in this paper, we construct a multi-label high spatial resolution remote sensing dataset named MLRSNet for semantic scene
understanding with deep learning from the overhead
perspective. It is composed
of high-resolution optical
satellite or aerial
images. MLRSNet contains
a total of 109,161 samples within 46 scene categories, and each image has at least one of 60 predefined labels.
We have designed visual
recognition tasks, including multi-label based image classification and image retrieval, in which a wide variety
of deep learning approaches are evaluated with
MLRSNet. The experimental results demonstrate that
MLRSNet is a significant benchmark for future
research, and it complements current widely used datasets such as ImageNet, filling gaps in multi-label image research. Furthermore, we will continue to expand MLRSNet. MLRSNet and all related materials have been made publicly available at https://data.mendeley.com/datasets/7j9bv9vwsx/1 and https://github.com/cugbrs/MLRSNet.git.
1. Introduction
With the availability of enormous numbers
of remote sensing
images produced by satellites and airborne sensors, high-resolution
remote sensing image analyses
have stimulated a flood of interest in the domain of remote sensing and computer
vision (Toth and Jóźków, 2016), such as image classification or land cover mapping (Cheng et al., 2017; Gómez et al., 2016; You and Dong, 2020; Zhao et al., 2016), image
retrieval (Wang et al., 2016), and object detection
(Cheng et al., 2014; Han et al.,
2014), etc. The great potential offered by
these platforms in terms of observation capability poses great challenges for
semantic scene understanding (Bazi, 2019). For instance, as these data are obtained
from different locations, at different times and even with different
satellites or
airborne sensors, there are large variations among the
scene images, which creates difficulties for the tasks of semantic
scene understanding, such as
multi-label based image retrieval and image classification.
Furthermore, remote sensing images usually contain abundant information about ground objects, which creates challenges for semantic scene understanding tasks (Chaudhuri et al., 2017). However, accurately labeling each sample is extremely expensive when the amount of data is huge. Therefore, weakly-supervised segmentation based on image-level information from multi-label classification networks has attracted the attention of some scholars (Ge et al., 2018; Xia et al., 2015). Moreover, there have been many explorations in the use of multi-label data, such as land cover classification (Stivaktakis et al., 2019), high-precision image retrieval (Chaudhuri et al., 2017),
image semantic segmentation (Xia et al., 2015), and migrating models trained on multi-label data to other visual tasks (e.g., image object recognition) (Gong et al., 2019). Multi-label datasets therefore attract increasing attention in the remote sensing community because they are inexpensive to exploit yet offer considerable research potential. For these reasons, multi-label annotation of an image is necessary to present more details of the image and improve the performance of scene understanding. In addition, multi-label annotation can reveal potential correlations among labels; for example, "road" and "car" tend to occur together in a remote sensing image, and "grass" and "water" often accompany "golf course". This provides a better understanding of scene images, which is impossible for single-label image scene understanding. Therefore, annotating images with multiple labels is a vital step for semantic scene understanding in remote sensing.
Moreover, previous studies have shown that traditional machine learning methods cannot adequately mine ground object scene information (Cordts et al., 2016; Jeong et al., 2019; Kendall et al., 2015; Zhu et al., 2019). Recently, deep learning approaches have shown great potential for providing solutions to problems related to semantic scene understanding, and many scholars have conducted relevant studies (Fang et al., 2020; Han et al., 2018; Hu et al., 2015; Ma et al., 2019; Paoletti et al., 2018; Wang et al., 2018; Zhang et al., 2016; Zhou et al., 2019). For instance, a highly reliable end-to-end real-time object detection-based situation recognition system was proposed for autonomous vehicles (Jeong et al., 2019). In another work (Cordts et al., 2016), the authors determined that fully convolutional networks achieve decent results in urban scene understanding, and scene classification CNNs have been shown to significantly outperform previous approaches (Zhou et al., 2017). In (Workman et al., 2017), a novel CNN architecture for estimating geospatial functions, such as population density, land cover, or land use, was proposed. Moreover, CNNs have also been used to identify weeds (Hung et al., 2014) and vehicles (Chen et al., 2014).
Additionally, a logarithmic relationship between the performance of deep learning methods on vision tasks and the quantity of training data used for representation learning was recently demonstrated (Sun et al., 2017). This work showed that the power of CNNs on large-scale image recognition tasks can be substantially improved if the CNNs are trained on large multi-perspective sample sets. At present, there exist several widely used annotated datasets of various scales, including image classification datasets such as ImageNet (Deng et al., 2009), Places (Zhou et al., 2017), PASCAL VOC (Everingham et al., 2015) and YouTube-8M (Abu-El-Haija et al., 2016), and semantic segmentation datasets such as PASCAL Context (Mottaghi et al., 2014), Microsoft COCO (Lin et al., 2014), Cityscapes (Cordts et al., 2016) and the Mapillary Vistas Dataset (Neuhold et al., 2017). However, in these benchmarks, the data of outdoor objects on the ground are usually collected from ground-level views.
In addition, several object-centric remote sensing image datasets have been constructed for scene classification, for instance, AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), the Brazilian coffee scene dataset (Penatti et al., 2015), the UC-Merced dataset (Yang and Newsam, 2010), and the WHU-RS19 dataset (Xia et al., 2010). However, these datasets are insufficient for scene understanding due to their high intra-class diversity and low inter-class variation, together with their limited numbers of remote sensing images (Xia et al., 2017). The SEN12MS dataset (Schmitt et al., 2019) has recently attracted attention in the domain of land use mapping. It consists of 180,662 triplets sampled over all meteorological seasons. Each triplet comprises a dual-polarization synthetic aperture radar (SAR) image patch, a multi-spectral Sentinel-2 image patch, and four different MODIS land cover maps following different internationally established classification schemes. However, SEN12MS contains no more than 17 classes under any selected classification scheme, which may also be insufficient for understanding the complex real world.
Moreover, it is worth noting that each image in most of the afore-mentioned datasets is annotated with a single label representing the most significant semantic content of the image. Single-label annotation is sufficient for simple problems, such as distinguishing between coffee and non-coffee classes, but it is difficult to address more complex scene understanding tasks. Multi-label methods have recently been found useful for scene understanding, such as multi-label image search and retrieval problems, where multiple class labels are simultaneously assigned to each image (Boutell et al., 2004; Li et al., 2010; Ranjan et al., 2015; Zhang and Zhou, 2007). Thus, several multi-label archives have been published and are publicly available, for example, multi-label UAV image datasets such as the Trento dataset (Bazi, 2019) and the Civezzano dataset (Bazi, 2019), and multi-label remote sensing image retrieval (RSIR) archives such as MLRSIR (Shao et al., 2018). The Trento dataset and the Civezzano dataset both contain 14 classes and contain a total of 4000 images and 4105 images, respectively. A multi-label RSIR archive was released in 2017, which is considered to be the first open-source dataset for multi-label RSIR (Chaudhuri et al., 2017). Afterward, MLRSIR (Shao et al., 2018), a pixel-wise dataset for multi-label RSIR, was presented by Wuhan University; it has a total of 21 broad categories with 100 images per category. However, training CNNs on the above datasets easily results in overfitting, since the CNN models used for multi-label archives often contain millions of parameters, and a considerable quantity of labeled data is required to fully train them. Although BigEarthNet (Sumbul et al., 2019) can deal with the problem of overfitting, its limited distribution and single data source could reduce the intra-class diversity, which raises difficulties for developing robust scene understanding algorithms.
To overcome the above issues and better understand ground objects, in this paper we propose a novel large-scale high-resolution multi-label remote sensing dataset termed "MLRSNet" for semantic scene understanding. It contains 109,161 high-resolution remote sensing images that are annotated into 46 categories, and the number of sample images in a category varies from 1500 to 3000. The images have a fixed size of 256 × 256 pixels with various pixel resolutions. Moreover, each image in the dataset is tagged with several of 60 predefined class labels, and the number of labels associated with each image varies from 1 to 13. We also illustrate the construction procedure of the MLRSNet dataset and give evaluations and comparisons of several deep learning methods for multi-label based image classification and image retrieval. The experiments indicate that multi-label based deep learning methods can achieve better performance on image classification and image retrieval.
In summary, three major contributions of this paper are as follows:
(1) A review of related popular datasets is provided, summarizing their properties. It covers single-label and multi-label datasets of different scales, most of which are insufficient for remote sensing scene understanding tasks.
(2) A multi-label high spatial resolution remote sensing dataset, i.e., MLRSNet, is developed for semantic scene understanding. To our knowledge, it is a large high-resolution multi-label remote sensing dataset with the most abundant multi-label information, and it has high intra-class diversity, which provides a better data resource for evaluating and advancing numerous methods in semantic scene understanding.
(3) State-of-the-art neural network methods for multi-label image classification and multi-label image retrieval are evaluated using MLRSNet. The results show that deep-learning-based methods achieve significant performance on multi-label based image classification and image retrieval tasks.
2. MLRSNet: A multi-label high spatial
resolution remote sensing dataset
How to improve the performance of
existing multi-label image classification and retrieval approaches using machine learning and other artificial intelligence technologies has attracted much attention in the
remote sensing community (Chua et al., 2009). However, for learning-based methods, a large number of labeled samples is required. To advance the state-of-the-art methods in scene understanding of remote sensing, we construct MLRSNet, a new large-scale high-resolution multi-label remote sensing image dataset.

MLRSNet is composed of 109,161 labeled RGB images from all around the world, annotated into 46 broad categories: airplane, airport, bareland, baseball diamond, basketball court, beach, bridge, chaparral, cloud, commercial area, dense residential area, desert, eroded farmland, farmland, forest, freeway, golf course, ground track field, harbor&port, industrial area, intersection, island, lake, meadow, mobile home park, mountain, overpass, park, parking lot, parkway, railway, railway station, river, roundabout, shipping yard, snowberg, sparse residential area, stadium, storage tank, swimming pool, tennis court, terrace, transmission tower, vegetable greenhouse, wetland, and wind turbine. The number of sample images varies greatly among the broad categories, from 1500 to 3000, as shown in Fig. 1. Additionally, each image in the dataset is assigned several of 60 predefined class labels, and the number of labels associated with each image varies between 1 and 13. The number of images present in the dataset associated with each predefined label is listed in Table 1, and some samples with corresponding multi-label results are shown in Fig. 2.

Besides, MLRSNet has multiple resolutions: the pixel resolution ranges from about 10 m to 0.1 m, and the size of each multi-label image is fixed to 256 × 256 pixels to cover a scene at various resolutions. Compared with the afore-mentioned scene understanding datasets in Section 1, MLRSNet has a significantly larger variability in terms of geographic origins and number of object categories. Different from ImageNet (Deng et al., 2009), which collects the data of outdoor objects from ground-level views, MLRSNet describes the objects on Earth from an overhead perspective through satellite or aerial sensors. Therefore, deep neural networks can be trained on MLRSNet combined with ImageNet, so that we can achieve much higher recognition precision of the scene and effectively address the challenges of object rotation, within-class variability, and between-class similarity. Table 2 lists the differences between MLRSNet and other widely used scene understanding datasets.

Fig. 1. Illustration of the number of samples per category in MLRSNet. There are 109,161 samples within 46 scene categories.

Table 1
Number of images present in the dataset for each class label. There are 60 predefined class labels in total.

Class label | Number | Class label | Number | Class label | Number
Airplane | 2306 | Freeway | 2500 | Roundabout | 2039
Airport | 2480 | Golf course | 2515 | Runway | 2259
Bare soil | 39,345 | Grass | 49,391 | Sand | 11,014
Baseball diamond | 1996 | Greenhouse | 2601 | Sea | 4980
Basketball court | 3726 | Gully | 2413 | Ships | 4092
Beach | 2485 | Harbor | 2492 | Snow | 3565
Bridge | 2772 | Intersection | 2497 | Snowberg | 2555
Buildings | 51,305 | Island | 2493 | Sparse residential area | 1829
Cars | 34,013 | Lake | 2499 | Stadium | 2462
Chaparral | 5903 | Mobile home | 2499 | Swimming pool | 5078
Cloud | 1798 | Mountain | 5468 | Tanks | 2500
Containers | 2500 | Overpass | 2652 | Tennis court | 2499
Crosswalk | 2673 | Park | 1682 | Terrace | 2345
Dense residential area | 2774 | Parking lot | 7061 | Track | 3693
Desert | 2537 | Parkway | 2537 | Trail | 12,376
Dock | 2492 | Pavement | 56,383 | Transmission tower | 2500
Factory | 2667 | Railway | 4399 | Trees | 70,728
Field | 15,142 | Railway station | 2187 | Water | 27,834
Football field | 1057 | River | 2493 | Wetland | 3417
Forest | 3562 | Road | 37,783 | Wind turbine | 2049

In contrast with the existing remote sensing image datasets, MLRSNet has the following notable characteristics:

Hierarchy: MLRSNet contains 3 first-class categories, i.e., land use and land cover (e.g., commercial area, farmland, forest, industrial area, mountain), natural objects and landforms (e.g., beach, cloud, island, lake, river, chaparral), and man-made objects and landforms (e.g., airplane, airport, bridge, freeway, overpass), 46 second-class categories (as shown in Fig. 1) and 60 third-class labels (as shown in Table 1).
Multi-label: As shown in Fig. 2, each image in the MLRSNet dataset has one or more corresponding labels, because a remote sensing image usually contains many classes of objects that are not mutually exclusive. Several experiments (Shao et al., 2018; Zhang et al., 2018) have indicated that multi-label datasets tend to achieve more satisfactory performance than single-label datasets in the tasks of image classification or image retrieval.
Large-scale: As shown in Table 2, MLRSNet has a large number of high-resolution multi-label remote sensing scene images. It contains 109,161 high-resolution remote sensing images annotated into 46 categories, and the number of sample images in a category varies from 1500 to 3000, all of which are larger than most other listed datasets. MLRSNet is a large-scale high-resolution remote sensing dataset collected for scene image recognition that can cover a much wider range of satellite or aerial images. It is meant to serve as an alternative to advance the development of methods in scene image recognition, particularly deep-learning-based approaches that require large quantities of labeled training data.

Fig. 2. Example images of 44 categories (except "bareland" and "cloud") from the MLRSNet dataset are shown, and the corresponding multi-labels of each image are reported at the right of the related image.
Diversity: To increase the generalization ability of the dataset, we attempt to characterize MLRSNet according to the geographical and seasonal distributions of objects, weather conditions, viewing perspectives, capturing time, and image resolution, i.e., with large variations in spatial resolution, viewpoint, object pose, illumination and background, as well as occlusion.

Table 2
Statistics of our database and comparisons with current state-of-the-art remote sensing benchmarks.

Dataset | Number of Total Samples | Number of Categories | Sample Number in Each Category | Image sizes | Image Spatial Resolution (m) | Reference
AID | 10,000 | 30 | 220–420 | 600 × 600 | 8–0.5 | Xia et al. (2017)
MLRSIR | 2,100 | 21 | 100 | 256 × 256 | 0.3 | Shao et al. (2018)
SEN12MS | 564,768 | 17 | – | 256 × 256 | 10–500 | Schmitt et al. (2019)
MLRSNet | 109,161 | 46 | 1500–3000 | 256 × 256 | ~10–0.1 | Our work

The number of categories for SEN12MS is counted following the International Geosphere Biosphere Programme (IGBP) classification scheme (Loveland and Belward, 1997).
MLRSNet is a remote sensing community-led dataset for people who want to view the world from overhead perspectives. To construct MLRSNet, we gather a team of more than 50 annotators in the remote sensing domain and spend more than six months on the whole process. The construction of MLRSNet is mainly composed of three procedures, i.e., scene sample collection, database quality control, and database sample diversity improvement.
2.2.1. Scene category and sample collection
To satisfy the hierarchy criterion, the first asset of a high-quality dataset is covering an exhaustive list of representative scene categories. To achieve this goal, we investigate all scene classes of the existing datasets to form a list of scene categories. In the process, we merge some semantically similar scene categories from different datasets into a new category. For example, "playground" and "ground track field" are merged into "ground track field", and "harbor" and "port" are merged into a new category called "harbor&port". We also search the keywords "object-based image analysis (OBIA)", "geographic object-based image analysis (GEOBIA)", "land cover classification", "land use classification", "geospatial image retrieval" and "geospatial object detection" on Web of Science and Google Scholar to carefully select some new meaningful scene classes. Consequently, we obtain 46 scene categories in total, as shown in Fig. 1. Moreover, most existing datasets are labeled only with category names, which describe the most significant semantic content of the image, while the primitive classes (multiple labels) present in the images are ignored. Thus, we associate each image with one or more land-cover class labels (i.e., multi-labels) based on visual inspection. For every scene category, we randomly select 100 images and annotate the primitive classes in the images. Next, we count the primitive classes and filter out those whose number of occurrences is no more than 5. Finally, we obtain 60 multiple labels that occur frequently in remote sensing samples. Generally, scene categories are scene-level labels and primitive classes are object-level labels.
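The counting-and-filtering step just described can be illustrated with a minimal sketch. The annotation structure and label strings below are hypothetical placeholders, assuming the per-image primitive-class annotations are available as lists of strings.

```python
from collections import Counter

# Hypothetical per-image primitive-class annotations for the 100 sampled
# images of one scene category (file names and labels are illustrative).
annotations = {
    "airport_0001.jpg": ["airplane", "runway", "grass", "pavement"],
    "airport_0002.jpg": ["airplane", "buildings", "cars", "pavement"],
    # ... remaining sampled images
}

# Count how often each primitive class occurs across the sampled images.
counts = Counter(label for labels in annotations.values() for label in labels)

# Keep only primitive classes that occur more than 5 times, as described above.
frequent_labels = sorted(label for label, n in counts.items() if n > 5)
print(frequent_labels)
```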
Fig. 3. The experimental process of quality control. (a) The experimental process
of a single annotator. (b) The experimental process of the administrator.
Table 3
Confidence score table for different categories of data samples, showing how the annotators' judgments influence the probability of an image being a good image.

Accept | Reject | Airport | Bridge | Island | Parkway
0 | 1 | 0.13 | 0.05 | 0.03 | 0.14
1 | 0 | 0.80 | 0.87 | 0.89 | 0.67
1 | 1 | 0.51 | 0.49 | 0.50 | 0.52
2 | 0 | 0.90 | 0.97 | 0.98 | 0.84
0 | 2 | 0.05 | 0.03 | 0.02 | 0.13
3 | 0 | 0.97 | 0.99 | 1.00 | 0.90
2 | 1 | 0.82 | 0.86 | 0.88 | 0.73
Compared with other satellite or aerial image datasets, the samples in MLRSNet carry additional meaningful information, such as hierarchy and multi-label information. In particular, when scene samples have multiple labels, we can search for ground objects more precisely by comparing sample features. With this information, many multi-label tasks can be addressed, such as multi-label image classification, multi-label image retrieval and object detection.
Data diversity is ensured by the data sources and by manual control. Data samples are collected by more than 20 people from multi-resolution, multi-continent, multi-time, multi-illumination and multi-viewpoint data sources to characterize MLRSNet. Like most existing datasets, such as AID (Xia et al., 2017), NWPU-RESISC45 (Cheng et al., 2017), and WHU-RS19 (Xia et al., 2010), MLRSNet is also extracted from Google Earth, where images come from different remote imaging sensors. The satellite sensors include, but are not limited to, GeoEye-1, WorldView-1, WorldView-2, SPOT-7, Pleiades-1A, and Pleiades-1B. Images can also be collected by aerial photography cameras. We collect data from all over the world to satisfy the diversity criterion, and the samples in MLRSNet cover more than 100 countries and regions. In addition, we control the data diversity; more details can be found in Section 2.2.3.
2.2.2. Database quality control
Aiming to develop a highly accurate dataset, we implement a quality control process. In this process, we rely on another 20 annotators to verify the ground-truth label of each candidate image collected in the previous procedure; the process includes scene sample annotation, confidence score counting, and disposal of confusing data.
Fig. 3 illustrates the experimental process of quality control. A sample is randomly presented to an annotator, who selects a category from the 46 predefined category names. If the annotation result is consistent with the ground-truth label of the candidate sample, the system gives an "accept" response; otherwise, it gives a "reject" response. Because of human subjectivity and the complexity of the images, different images require different numbers of annotations. The solution, following ImageNet (Deng et al., 2009), is to have multiple annotators tag the images individually. While annotators label an image, we use a confidence table (see Table 3) to dynamically determine the number of annotations needed for different categories of images. Table 3 shows examples for "airport", "bridge", "island" and "parkway". The confidence score indicates the probability that an image is a good image given the annotator votes. After approximately two months of data labeling, every sample has been labeled several times until a predetermined confidence score threshold is reached. The data samples with a confidence score of at least 0.97 are retained, while the others are removed.
We observe that the boundaries of some data pairs are blurry, e.g.,
airport and runway,
intersection and crosswalk, desert and bareland
(see Fig. 4). For this reason,
we gather our annotators for a discussion about the boundary of these data pairs. After that, we begin a second labeling round for data in these ambiguous
pairs with a confidence score of
<0.97. Similarly, after the second labeling round, we preserve samples scored at least 0.97 and discard the others. Finally, we collect more than 100,000 data samples within 46 scene categories.
Fig. 4. Boundaries
among scene categories can be blurry. The images show a soft transition between
airport vs. runway, crosswalk vs. intersection and bareland vs. desert.
Fig. 5. A screenshot of
the visual tool that computes the relative diversity of scene datasets.
Different pairs of samples are randomly presented to a person who is instructed
to select the most similar pair. Each trial is composed of 4 pairs from each
database, giving a total of 12 pairs to choose from.
Fig. 6. Relative diversity of each category (20 categories) in different datasets. MLRSNet (red line) contains the most diverse set of images. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
2.2.3. Data diversity improvement
An ideal dataset, expected to generalize well, should have high diversity, which means it should include high variability in appearance, location, resolution, scale, background clutter and occlusion. We use a measure to quantify the relative diversity of image datasets, following the practice in (Zhou et al., 2017). During the procedure of comparing the diversity of data samples, if the diversity of our dataset is lower than that of the other datasets for a certain category, we improve the quality and number of data samples for that category.
We
develop a tool with a graphical user interface, as shown in Fig. 5. We ask 10 annotators to measure the relative diversities among AID (Xia et
al., 2017), NWPU-RESISC45 (Cheng et al.,
2017), and MLRSNet. Each time, different pairs of samples are randomly presented to
an annotator who is instructed to
select the most similar pair. Each trial is composed of 4 pairs from each database, giving a total of 12 pairs to
choose from. We run 50 trials per category and 10 observers per trial, for the
20 categories in common.
Table 4
Details of each CNN model used in the experiments. The top-1 and top-5 accuracies refer to each model's performance on the ImageNet validation dataset.

CNNs | Layers | Parameters | Top-1 Accuracy | Top-5 Accuracy | Year
InceptionV3 | 47 | 23 M | 0.779 | 0.937 | 2014
VGGNet16 | 16 | 138 M | 0.713 | 0.901 | 2014
VGGNet19 | 19 | 143 M | 0.713 | 0.900 | 2014
ResNet50 | 50 | 25 M | 0.749 | 0.921 | 2015
ResNet101 | 101 | 44 M | 0.764 | 0.928 | 2015
DenseNet121 | 121 | 8 M | 0.750 | 0.923 | 2017
DenseNet169 | 169 | 14 M | 0.762 | 0.932 | 2017
DenseNet201 | 201 | 20 M | 0.773 | 0.936 | 2017

Fig. 6 shows the results of the relative diversity for all 20 scene categories common to the three databases. The results show that there is a large variation in diversity among the three datasets, and MLRSNet is the most diverse of the three. The average relative diversity is 0.78 for MLRSNet, 0.56 for AID (Xia et al., 2017), and 0.69 for NWPU-RESISC45 (Cheng et al., 2017). The categories with the smallest diversity in MLRSNet are baseball diamond, beach, sparse residential area and storage tank; we therefore conduct random rotation, resizing and cropping for all images in these four categories.
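The rotation/resize/crop augmentation mentioned above could be implemented along these lines; this is a sketch with illustrative parameter ranges (and rotations restricted to multiples of 90°), not the authors' exact settings.

```python
import tensorflow as tf

def augment(image, out_size=256):
    """Random rotation, resize and crop for a 256 x 256 RGB image tensor."""
    # Rotate by a random multiple of 90 degrees.
    image = tf.image.rot90(image, k=tf.random.uniform([], 0, 4, dtype=tf.int32))
    # Upscale slightly, then crop back to 256 x 256 so content is shifted/rescaled.
    scale = tf.random.uniform([], 1.0, 1.3)
    new_size = tf.cast(scale * out_size, tf.int32)
    image = tf.image.resize(image, [new_size, new_size])
    image = tf.image.random_crop(image, [out_size, out_size, 3])
    return image
```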
3. Scene
classification
Scene classification is a
fundamental task in remote sensing image understanding. Recently,
classification using convolutional neural networks (CNNs) has achieved
significant performance. MLRSNet can be taken as a benchmark to evaluate the
classification performances of different CNNs.
Eight popular CNN architectures, i.e., InceptionV3 (Szegedy et al., 2016), VGGNet16 (Simonyan and Zisserman, 2014), VGGNet19 (Simonyan and Zisserman, 2014), ResNet50 (He et al., 2016), ResNet101 (He et al., 2016), DenseNet121 (Huang et al., 2017), DenseNet169 (Huang et al., 2017) and DenseNet201 (Huang et al., 2017), are chosen to address the remote sensing image classification problem, and the details of each model are shown in Table 4. It should be noted that the final layer of each CNN is replaced by a dense (fully connected) layer with 60 nodes, and the output of this layer is activated by a sigmoid function. The sigmoid maps the values of the network's output vector to the interval (0, 1), indicating the score for each class. We then binarize the output vector with a threshold of 0.5 to generate a multi-label prediction such as [1, 0, 1, …, 0], where 1 indicates that the image is annotated with the corresponding label and 0 that it is not. Finally, the models are trained on MLRSNet. We call the fine-tuned models MLRSNet-CNNs, e.g., MLRSNet-VGGNet16.

Table 5
Parameters utilized for model fine-tuning.

Package | Epochs | Batch Size | Optimizer | Learning rate
Keras | 10 | 32 | Adam | 0.01
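A minimal Keras sketch of this multi-label setup is given below. It assumes the images and 60-dimensional binary label vectors are already loaded as arrays (X_train and Y_train are placeholders); the backbone choice and data pipeline are illustrative rather than the authors' exact code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers

NUM_LABELS = 60  # number of predefined class labels in MLRSNet

# ImageNet-pretrained backbone without its 1000-way softmax head.
backbone = tf.keras.applications.DenseNet201(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(256, 256, 3))

# Replace the final layer with a 60-node dense layer and sigmoid activation,
# so each node scores one label independently in (0, 1).
outputs = layers.Dense(NUM_LABELS, activation="sigmoid")(backbone.output)
model = models.Model(backbone.input, outputs)

# Fine-tuning setup from Table 5: Keras, 10 epochs, batch size 32, Adam, lr 0.01,
# with the binary cross-entropy loss used for multi-label classification.
model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy")

# X_train: (N, 256, 256, 3) images; Y_train: (N, 60) binary label vectors.
# model.fit(X_train, Y_train, epochs=10, batch_size=32, validation_split=0.1)

# Binarize the sigmoid scores at 0.5 to obtain multi-label predictions.
def predict_labels(model, images, threshold=0.5):
    scores = model.predict(images)
    return (scores >= threshold).astype(np.int32)
```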
InceptionV3 (Szegedy et al., 2016): The inception module was first proposed by Google (Szegedy et al., 2016) and was adopted for image classification and object detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. The authors further explored methods to scale up networks in ways that utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization; the resulting model is named InceptionV3.

VGGNet16, VGGNet19 (Simonyan and Zisserman, 2014): VGG was originally developed for the ImageNet dataset by the Oxford Visual Geometry Group for ILSVRC14. To investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting, Simonyan and Zisserman thoroughly evaluated networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which showed a significant improvement in accuracy. In this work, we use two such models, VGGNet16 and VGGNet19.

ResNet50, ResNet101 (He et al., 2016): Residual Networks (ResNet) constitute a framework presented by Microsoft Research to ease the training of networks that are substantially deeper than those used previously. This model won 1st place in the ILSVRC 2015 classification task. ResNet50 is the 50-layer ResNet, and ResNet101 is the 101-layer ResNet.

DenseNet121, DenseNet169, DenseNet201 (Huang et al., 2017): The dense convolutional network (DenseNet) connects each layer to every other layer in a feed-forward manner and has L(L+1)/2 direct connections for a convolutional network with L layers. DenseNets are widely used because they have several compelling advantages, such as alleviating the vanishing gradient problem, strengthening feature propagation, encouraging feature reuse, and substantially reducing the number of parameters. DenseNet121, DenseNet169 and DenseNet201 are the 121-layer, 169-layer and 201-layer DenseNets, respectively.

To comprehensively evaluate the classification performances of the different CNNs, three training–validation–testing ratios are considered: (i) 20%–10%–70%, i.e., we randomly select 20% of the dataset for training, 10% for validation and the rest for testing; (ii) 30%–10%–60%; and (iii) 40%–10%–50%.

We choose TensorFlow and the Python package Keras for our experiments. The aforementioned eight models, pretrained on the ImageNet dataset, are obtained from https://github.com/fchollet/deep-learning-models/releases. In the experiments, binary cross-entropy measures how far the prediction is from the true value (which is either 0 or 1) for each class and then averages these class-wise errors to obtain the final loss. The binary cross-entropy adopted for multi-label classification is

L = -\frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{q}\left( t_{j}^{i}\log \hat{t}_{j}^{i} + (1 - t_{j}^{i})\log(1 - \hat{t}_{j}^{i}) \right)

where t_{j}^{i} \in \{0, 1\} denotes the jth ground-truth label of training image X_i, \hat{t}_{j}^{i} is the corresponding output of the sigmoid layer, m is the number of training images, and q is the total number of classes.

To improve the generalization capability, we fine-tune the models using the parameters listed in Table 5. All the CNN models are implemented on a 2.10 GHz 48-core CPU, and a TITAN RTX GPU is used for acceleration. To avoid introducing random errors, we repeat each experiment five times and plot error-bar graphs from the results.

We compute two commonly used evaluation metrics, i.e., mean average precision and the average F1 score, to quantitatively evaluate the classification results. The F1 score is a comprehensive metric for evaluating the classification performance of each model and is defined as

F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}

where \mathrm{precision} = \frac{|L_c \cap L_r|}{|L_c|} and \mathrm{recall} = \frac{|L_c \cap L_r|}{|L_r|}. L_c is the final label vector produced by the network for a sample, L_r is the ground-truth label vector of the sample, \cap denotes intersection, and |\cdot| denotes the number of nonzero entries. A higher F1 score represents better classification performance. It should be pointed out that all precision and recall values are computed separately for each sample and then averaged across samples.

We evaluate the eight methods on MLRSNet and present the results of multi-label image classification as follows. As shown in Tables 6 and 7, the fine-tuned models achieve good classification performance on MLRSNet, which indicates that deep-learning-based models are well suited to multi-label remote sensing scene classification.

Table 6
Mean Average Precision (%) of the eight fine-tuned models under different training ratios. The best results are shown in bold.

CNNs | 20% | 30% | 40%
MLRSNet-InceptionV3 | 81.50 | 82.33 | 84.84
MLRSNet-VGGNet16 | 67.88 | 72.66 | 75.39
MLRSNet-VGGNet19 | 66.12 | 69.53 | 73.60
MLRSNet-ResNet50 | 82.65 | 84.28 | 86.01
MLRSNet-ResNet101 | 83.26 | 84.19 | 85.72
MLRSNet-DenseNet121 | 75.96 | 77.99 | 80.25
MLRSNet-DenseNet169 | 82.16 | 86.42 | 87.35
MLRSNet-DenseNet201 | 87.25 | 87.84 | 88.77

Table 7
F1 score of the eight fine-tuned models under different training ratios. The best results are shown in bold.

CNNs | 20% | 30% | 40%
MLRSNet-InceptionV3 | 0.7746 | 0.8016 | 0.8146
MLRSNet-VGGNet16 | 0.5743 | 0.6534 | 0.6855
MLRSNet-VGGNet19 | 0.5677 | 0.6120 | 0.6329
MLRSNet-ResNet50 | 0.7530 | 0.8176 | 0.8353
MLRSNet-ResNet101 | 0.7618 | 0.7703 | 0.8226
MLRSNet-DenseNet121 | 0.7154 | 0.7389 | 0.7571
MLRSNet-DenseNet169 | 0.8138 | 0.8408 | 0.8521
MLRSNet-DenseNet201 | 0.8381 | 0.8414 | 0.8538
Fig.
7. The statistical results of Mean Average Precision. The bar
chart shows the average, and the error line presents the standard deviation.
Fig.
8. The statistical results of F1 score. The bar chart shows the average, and the error line presents the
standard deviation.
It is worth noting that MLRSNet-DenseNet201 obtains significantly better metric values in the comparative experiments. In particular, the MLRSNet-DenseNet201 model achieves an overall improvement as the number of data samples increases. MLRSNet-DenseNet201 and MLRSNet-DenseNet169 achieve F1 scores above 0.80 within 10 epochs even when the training ratio is 20%. Moreover, with increasing training data, the performances of the models improve, which suggests that increasing the data size can further improve the performance of deep learning models.
Fig. 9 shows several annotation examples on MLRSNet when the fine-tuned DenseNet201 is employed. Here, annotations in black font are included in the ground-truth labels, whereas annotations in red font are incorrect labels tagged by the model; annotations in green font are correct labels that the model did not tag. The first seven images are all annotated correctly. The 8th to 10th images include no incorrect annotations but miss at most two correct labels, and the last five images each contain one incorrect label. Hence, we find that MLRSNet-DenseNet201 has outstanding performance in multi-label classification. This model can also be adopted to help label scene images in the future.
4. Image
retrieval
With the sharp increase in the
volume of remote sensing images, image retrieval has become an important topic
of research in RS. We then show the application of MLRSNet to image retrieval.
Similarly, it can be used as a benchmark dataset to evaluate the retrieval
performances of different models.
We use the afore-mentioned eight fine-tuned CNN models (from Section 3) to evaluate the retrieval performance on the dataset. Images are fed into the CNNs, and features are extracted from the last layer of each network. The Euclidean distance is used to calculate the similarity between the query images and the images in the retrieval archive. Only if a retrieved result and the query image belong to the same category do we consider the query to be satisfied.
Since the fine-tuned CNNs trained with 40% of the MLRSNet images are applied to perform the retrieval experiments, the remaining 60% of the images are adopted as the testing queries and the retrieval database. To make full use of the dataset for the retrieval experiments and to minimize random error, in this section we randomly split this portion of MLRSNet into testing queries and a retrieval database at three different ratios: 5% vs. 55%, 10% vs. 50%, and 15% vs. 45%. Taking 5% vs. 55% as an example, 5% of the images from each category are selected as query images to query the rest of the dataset. Databases of different sizes are further used to validate the effectiveness of MLRSNet for image retrieval experiments. Table 8 shows the number of images in the testing queries and in the retrieval database when 5%, 10% and 15% of the images from each category are selected as the testing queries.
To evaluate the retrieval
performance, we use average normalized modified retrieval rank (ANMRR), mean
average precision (mAP), and
precision at k (P@k, where k is the number of retrieved images) as metrics. In the following
experiments, the ANMRR, mAP, and P@k are the
averaged values over all the query images.
Fig. 9. Some samples of image classification by the MLRSNet-DenseNet201 model. The multi-labels of each image are reported below the related image; red font indicates an incorrect classification result, while green font indicates a correct label that the model did not tag. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Table 8
The number of images in the testing queries and in the retrieval database when different percentages are chosen.

Percentage | 5% | 10% | 15%
Testing queries | 3275 | 6550 | 9825
Retrieval database | 62,222 | 58,947 | 55,672
For a detailed description of ANMRR, we refer the reader to (Manjunath et al., 2001). The formulas of the other two metrics are as follows:

P@K = \frac{1}{N_q}\sum_{i=1}^{N_q}\frac{N_s}{K}

mAP = \frac{1}{N_q}\sum_{i=1}^{N_q}\frac{1}{m_i}\sum_{k=1}^{m_i} p(k)

where N_q is the number of queries, N_s is, for a given query, the number of images in the result that are considered correct, K is the number of retrieved images, and m_i is the number of result images for a given query i. We set m_i = 100, that is, we calculate the mean of the top-100 average retrieval precision.

Fig. 10. The precision-recall curves of MLRSNet-CNNs with different percentages of the testing queries. (a), (b), (c) represent testing-query percentages of 5%, 10% and 15%, respectively.

Fig. 11. The ANMRR results of features for a set of categories, with 10% of the images, extracted by MLRSNet-CNNs.

Table 9
The retrieval results of different methods. For ANMRR, a lower value indicates better performance, while for mAP and P@k, larger is better. The best results are shown in bold.

Percentage | Models | ANMRR | mAP | P@10 | P@50 | P@100 | P@500
5% | MLRSNet-InceptionV3 | 0.3506 | 0.7307 | 0.7524 | 0.7270 | 0.7130 | 0.6575
5% | MLRSNet-VGGNet16 | 0.5503 | 0.4868 | 0.5073 | 0.4833 | 0.4699 | 0.4107
5% | MLRSNet-VGGNet19 | 0.6248 | 0.3690 | 0.3878 | 0.3656 | 0.3539 | 0.3084
5% | MLRSNet-ResNet50 | 0.2407 | 0.8333 | 0.8462 | 0.8314 | 0.8220 | 0.7610
5% | MLRSNet-ResNet101 | 0.3001 | 0.7580 | 0.7730 | 0.7556 | 0.7456 | 0.6928
5% | MLRSNet-DenseNet121 | 0.3440 | 0.7513 | 0.7758 | 0.7478 | 0.7316 | 0.6577
5% | MLRSNet-DenseNet169 | 0.1665 | 0.8874 | 0.8969 | 0.8863 | 0.8791 | 0.8414
5% | MLRSNet-DenseNet201 | 0.1557 | 0.8959 | 0.9031 | 0.8949 | 0.8898 | 0.8596
10% | MLRSNet-InceptionV3 | 0.3518 | 0.7321 | 0.7547 | 0.7283 | 0.7135 | 0.6535
10% | MLRSNet-VGGNet16 | 0.5579 | 0.4721 | 0.4922 | 0.4683 | 0.4552 | 0.3956
10% | MLRSNet-VGGNet19 | 0.6238 | 0.3709 | 0.3883 | 0.3676 | 0.3561 | 0.3070
10% | MLRSNet-ResNet50 | 0.2496 | 0.8208 | 0.8339 | 0.8191 | 0.8082 | 0.7415
10% | MLRSNet-ResNet101 | 0.3073 | 0.7435 | 0.7695 | 0.7394 | 0.7219 | 0.6832
10% | MLRSNet-DenseNet121 | 0.3493 | 0.7604 | 0.7810 | 0.7571 | 0.7443 | 0.6423
10% | MLRSNet-DenseNet169 | 0.1693 | 0.8834 | 0.8928 | 0.8823 | 0.8747 | 0.8327
10% | MLRSNet-DenseNet201 | 0.1583 | 0.8936 | 0.9018 | 0.8925 | 0.8869 | 0.8520
15% | MLRSNet-InceptionV3 | 0.3569 | 0.7225 | 0.7450 | 0.7189 | 0.7042 | 0.6372
15% | MLRSNet-VGGNet16 | 0.5626 | 0.4533 | 0.4760 | 0.4489 | 0.4358 | 0.3847
15% | MLRSNet-VGGNet19 | 0.6213 | 0.3694 | 0.3883 | 0.3661 | 0.3521 | 0.3053
15% | MLRSNet-ResNet50 | 0.2503 | 0.8141 | 0.8299 | 0.8123 | 0.7988 | 0.7317
15% | MLRSNet-ResNet101 | 0.3096 | 0.7516 | 0.7675 | 0.7498 | 0.7368 | 0.6745
15% | MLRSNet-DenseNet121 | 0.3505 | 0.7355 | 0.7631 | 0.7306 | 0.7141 | 0.6301
15% | MLRSNet-DenseNet169 | 0.1724 | 0.8786 | 0.8893 | 0.8775 | 0.8687 | 0.8223
15% | MLRSNet-DenseNet201 | 0.1577 | 0.8935 | 0.9003 | 0.8929 | 0.8869 | 0.8465

The precision-recall curves of the eight methods with different percentages of testing queries are shown in Fig. 10. Table 9 shows the values of ANMRR, mAP, and P@k (k = 10, 50, 100, 500) obtained when the MLRSNet-CNNs are used and different percentages of the testing queries are chosen.

By analyzing Fig. 10 and Table 9, it can be observed that most of the MLRSNet-CNNs obtain impressive retrieval results regardless of the percentage of testing queries, which demonstrates that CNNs indeed perform well on the image retrieval task using MLRSNet. In particular, MLRSNet-DenseNet201 outperforms the other models in mAP and P@k (k = 10, 50, 100, 500) for all percentages of testing queries on the MLRSNet dataset. The mAP results of MLRSNet-DenseNet201 indicate a relative improvement of about 1% over MLRSNet-DenseNet169, and a relative increase of 6.26–7.94% over MLRSNet-ResNet50.
Meanwhile, MLRSNet-DenseNet201 is lower than the other models in ANMRR for all percentages of testing queries. In Fig. 10, the precision-recall curve of MLRSNet-DenseNet201 is superior to those of the other methods regardless of the size of the retrieval database. As a result, MLRSNet-DenseNet201 achieves better performance than the other approaches. The reason lies in that each layer in the DenseNet accepts all previous layers' features as input, which maximizes the information flow between all layers in the network. Besides, for the three DenseNet models, as the number of layers increases, the models obtain more representative and discriminative image features, showing better experimental performance.
|
Moreover, Fig. 11 shows the ANMRR results of features extracted from several categories by the eight methods when we choose 10% of MLRSNet as the testing queries. It is evident that the ANMRR results of the MLRSNet-VGGNets are large for most categories, meaning that their retrieval performance is poor. This may be because the simple network structures can only obtain limited representations of the images, leading to poor experimental performance. Meanwhile, it can be seen that the same network performs differently for various categories. In particular, most networks perform well on the categories that can be recognized easily (e.g., lake, harbor&port and swimming pool).

Fig. 12. The precision-recall curves of MLRSNet-DenseNet201 on a set of categories of MLRSNet.
Taking MLRSNet-DenseNet201 as an example, Fig. 12 shows the precision-recall curves for a set of categories when the percentage of testing queries is 10%. It illustrates that although the overall retrieval performance of MLRSNet-DenseNet201 is impressive, the performance on several categories is not satisfactory, e.g., basketball court and commercial area. The reason may be that the high intra-class diversities of these two categories increase the difficulty of image retrieval.
Fig. 13 shows two examples of result images retrieved by MLRSNet-DenseNet201 when the query image is randomly selected from the airplane category and the stadium category, with the percentage of testing queries set to 10%. The multi-labels associated with each image are given below the related image. From the retrieved results, it is obvious that the MLRSNet-DenseNet201 model accurately detects the multi-label image objects associated with a given query image and retrieves the most visually similar images from the database.
Compared with single-label remote sensing image retrieval, image retrieval using a multi-label dataset adds more constraints (multiple labels) to the retrieval process, thereby achieving more accurate retrieval results. To verify this, we use the category name as the single label of each image, apply the resulting single-label remote sensing dataset (SLRSNet) to train a DenseNet201 model, and perform the same retrieval experiments. The retrieval results are shown in Fig. 14. By comparing Figs. 13 and 14, we find that the multi-label retrieval results share more labels with the query image, indicating that they are more similar to it. Taking the stadium category as an example, when we query stadium images that contain a "swimming pool", the multi-label image retrieval better matches the labels "swimming pool" and "water", whereas single-label image retrieval has difficulty meeting this requirement.
5. Conclusion
The
MLRSNet is a multi-label high spatial resolution remote sensing
dataset for semantic scene understanding with deep learning from the overhead
perspective. MLRSNet has distinctive characteristics: hierarchy, large scale, diversity, and multi-label annotation. Experiments, including multi-label scene classification and image retrieval, are conducted with different deep neural networks. From these experimental results, we conclude that MLRSNet can be adopted as a benchmark dataset for performance evaluation of multi-label image retrieval and scene classification. MLRSNet complements current large object-centric datasets such as ImageNet. In future work, we will continue to expand MLRSNet and apply the dataset to other recognition tasks, such as semantic/instance segmentation in large-scale scene images and ground object recognition.

Fig. 13. The retrieval results of the airplane (top) and stadium (bottom) categories by the MLRSNet-DenseNet201 model when the percentage of the testing queries is 10%.

Fig. 14. The retrieval results of the airplane (top) and stadium (bottom) categories by the SLRSNet-DenseNet201 model when the percentage of the testing queries is 10%.
Declaration of Competing
Interest
The authors declare that they have
no known competing financial interests or personal
relationships that could have appeared
to influence the work reported
in this paper.
Appendix A. Supplementary material
Supplementary data to this
article can be found online
at https://doi.org/10.1016/j.isprsjprs.2020.09.020.
References
Abu-El-Haija,
S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B.,
Vijayanarasimhan, S., 2016. Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675.
Bazi, Y., 2019.
Two-branch neural network for learning multi-label classification in UAV imagery. In: IGARSS 2019–2019 IEEE International Geoscience and Remote
Sensing Symposium. IEEE,
pp. 2443–2446.
Boutell, M.R.,
Luo, J., Shen, X., Brown, C.M., 2004. Learning multi-label scene classification.
Pattern Recogn. 37 (9), 1757–1771.
Chaudhuri, B., Demir, B.,
Chaudhuri, S., Bruzzone, L., 2017. Multilabel remote sensing
image retrieval
using a semisupervised graph-theoretic method. IEEE Trans. Geosci. Remote Sens. 56
(2), 1144–1158.
Chen, X., Xiang,
S., Liu, C.L., Pan, C.H., 2014. Vehicle detection in satellite images by hybrid deep
convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 11 (10),
Cheng, G., Han,
J., Lu, X., 2017. Remote sensing image scene classification: Benchmark and state of the
art. Proc. IEEE 105 (10), 1865–1883.
Cheng, G., Han,
J., Zhou, P., Guo, L., 2014. Multi-class geospatial object detection and geographic image
classification based on collection of part detectors. ISPRS J. Photogramm.
Remote Sens. 98, 119–132.
Chua, T.S., Tang,
J., Hong, R., Li, H., Luo, Z., Zheng, Y., 2009. NUS-WIDE: a real-world web image
database from National University of Singapore. In: Proceedings of the
ACM International Conference on
Image and Video Retrieval, pp. 1–9.
Cordts, M.,
Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S.,
Schiele, B., 2016. The cityscapes dataset for semantic urban scene understanding.
In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3213–3223.
Deng, J., Dong,
W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale hierarchical
image database. In: 2009 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 248–255.
Everingham, M.,
Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A., 2015. The PASCAL visual
object classes challenge: a retrospective. Int. J. Comput. Vision
Fang, B., Li, Y., Zhang, H., Chan, J.C.W.,
2020. Collaborative learning
of lightweight convolutional neural network and deep clustering for hyperspectral image
semi- supervised classification with limited training
samples. ISPRS J. Photogramm.
Ge, W., Yang, S.,
Yu, Y., 2018. Multi-evidence filtering and fusion for multi-label classification,
object detection and semantic segmentation based on weakly
supervised learning.
In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 1277–1286.
Gómez, C., White, J.C., Wulder, M.A., 2016. Optical remotely sensed time series data for
land cover classification: A review. ISPRS
J. Photogramm. Remote
Sens. 116, 55–72.
Gong, T., Liu,
B., Chu, Q., Yu, N., 2019. Using multi-label classification to improve object detection.
Neurocomputing 370, 174–185.
Han, J., Zhou,
P., Zhang, D., Cheng, G., Guo, L., Liu, Z., Bu, S., Wu, J., 2014. Efficient, simultaneous
detection of multi-class geospatial targets based on visual saliency modeling and
discriminative learning of sparse coding. ISPRS J. Photogramm.
Han, W., Feng,
R., Wang, L., Cheng, Y., 2018. A semi-supervised generative framework with deep
learning features for high-resolution remote sensing image scene
classification. ISPRS J.
Photogramm. Remote Sens. 145, 23–43.
He, K., Zhang, X., Ren, S., Sun,
J., 2016. Deep residual learning for image recognition. In:
Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hu, F., Xia, G.S., Hu, J.,
Zhang, L., 2015. Transferring deep convolutional neural
networks for the
scene classification of high-resolution remote sensing imagery. Remote Sens. 7
(11), 14680–14707.
Huang, G., Liu,
Z., Van Der Maaten, L., Weinberger, K.Q., 2017. Densely connected convolutional
networks. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 4700–4708.
Hung, C., Xu, Z.,
Sukkarieh, S., 2014. Feature learning based approach for weed classification
using high resolution aerial images from a digital camera mounted on a UAV. Remote Sens.
6 (12), 12037–12054.
Jeong, H.J., Choi, S.Y., Jang,
S.S., Ha, Y.G., 2019. Driving scene understanding using
hybrid deep
neural network. In: 2019 IEEE International Conference on Big Data and Smart Computing
(BigComp), pp. 1–4.
Kendall,
A., Badrinarayanan, V., Cipolla, R., 2015. Bayesian
segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene
understanding. arXiv preprint arXiv:1511.02680.
Li, R., Zhang, Y., Lu, Z., Lu, J., Tian, Y., 2010. Technique of image retrieval based on multi-label image annotation. In: 2010 Second International Conference on Multimedia and Information Technology, vol. 2, pp. 10–13.
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.
L., 2014.
Microsoft coco: Common objects in context. In: European Conference on Computer Vision.
Springer, Cham, pp. 740–755.
Loveland, T.R., Belward, A.S.,
1997. The international geosphere biosphere programme
data and
information system global land cover data set (DISCover). Acta Astronaut. 41 (4–10), 681–689.
Ma, L., Liu, Y.,
Zhang, X., Ye, Y., Yin, G., Johnson, B.A., 2019. Deep learning in remote sensing
applications: A meta-analysis and review. ISPRS J. Photogramm. Remote
Manjunath, B.S.,
Ohm, J.R., Vasudevan, V.V., Yamada, A., 2001. Color and texture descriptors. IEEE
Trans. Circuits Syst. Video Technol. 11 (6), 703–715.
Mottaghi, R.,
Chen, X., Liu, X., Cho, N.G., Lee, S.W., Fidler, S., Urtasun, R., Yuille, A., 2014. The role of
context for object detection and semantic segmentation in the wild.
In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898.
Neuhold, G.,
Ollmann, T., Rota Bulo, S., Kontschieder, P., 2017. The mapillary vistas dataset for
semantic understanding of street scenes. In: Proceedings of the IEEE
International Conference on
Computer Vision, pp. 4990–4999.
Paoletti, M.E., Haut, J.M.,
Plaza, J., Plaza, A., 2018. A new deep convolutional neural
network for fast
hyperspectral image classification. ISPRS J. Photogramm. Remote Sens. 145, 120–147.
Penatti, O.A.,
Nogueira, K., Dos Santos, J.A., 2015. Do deep features generalize from everyday objects
to remote sensing and aerial scenes domains?. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition Workshops,
Ranjan, V., Rasiwasia, N., Jawahar, C.V., 2015. Multi-label cross-modal retrieval. In: Proceedings of
the IEEE International Conference on Computer
Vision,
Schmitt,
M., Hughes, L.H., Qiu, C., Zhu, X.X.,
2019. SEN12MS – A
Curated Dataset of Georeferenced Multi-Spectral
Sentinel-1/2 Imagery for Deep Learning and Data Fusion.
arXiv preprint arXiv:1906.07789.
Shao, Z., Yang,
K., Zhou, W., 2018. A benchmark dataset for performance evaluation of multi-label
remote sensing image retrieval. Remote Sens. 10 (6).
Simonyan, K., Zisserman, A.,
2014. Very deep convolutional networks for large-scale
image
recognition. In: Proceedings of the International Conference on Learning Representations
2015, pp. 19–36.
Stivaktakis, R.,
Tsagkatakis, G., Tsakalides, P., 2019. Deep learning for multilabel land cover scene categorization using data augmentation. IEEE Geosci. Remote Sens. Lett.
Sumbul, G.,
Charfuelan, M., Demir, B., Markl, V., 2019. Bigearthnet: A large-scale benchmark archive
for remote sensing image understanding. In: IGARSS 2019–2019 IEEE
International Geoscience and Remote Sensing Symposium. IEEE,
Sun, C., Shrivastava, A., Singh,
S., Gupta, A., 2017. Revisiting unreasonable effectiveness
of data in deep learning era. In: Proceedings of the IEEE International Conference on Computer Vision,
pp. 843–852.
Szegedy, C.,
Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethinking the inception
architecture for computer vision. In: Proceedings of the IEEE Conference
on Computer Vision and Pattern
Recognition, pp. 2818–2826.
Toth, C., Jóźków, G., 2016.
Remote sensing platforms and sensors: A
survey. ISPRS J. Photogramm.
Remote Sens. 115, 22–36.
Wang, S., Quan,
D., Liang, X., Ning, M., Guo, Y., Jiao, L., 2018. A deep learning framework for
remote sensing image registration. ISPRS J. Photogramm. Remote
Wang, Y., Zhang,
L., Tong, X., Zhang, L., Zhang, Z., Liu, H., Xing, X., Mathiopoulos, P.T., 2016. A
three-layered graph-based learning approach for remote sensing image
retrieval. IEEE Trans. Geosci.
Remote Sens. 54 (10), 6020–6034.
Workman, S., Zhai, M., Crandall,
D.J., Jacobs, N., 2017. A unified model for near and
remote sensing.
In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2688–2697.
Xia, G.S., Hu, J., Hu, F., Shi,
B., Bai, X., Zhong, Y., Zhang, L., Lu, X., 2017. AID: A
benchmark data
set for performance evaluation of aerial scene classification. IEEE Trans. Geosci.
Remote Sens. 55 (7), 3965–3981.
Xia, G.S., Yang,
W., Delon, J., Gousseau, Y., Sun, H., Maître, H., 2010. Structural high- resolution
satellite image indexing. In: ISPRS TC VII Symposium – 100 Years ISPRS, Vienna, Austria,
pp. 298–303.
Xia, Y., Zhu, Q.,
Wei, W., 2015. Weakly supervised random forest for multi-label image clustering and
segmentation. In: Proceedings of the 5th ACM on International
Conference on Multimedia
Retrieval, pp. 227–233.
Yang, Y., Newsam, S., 2010.
Bag-of-visual-words and spatial extensions for land-use
classification.
In: Proceedings of the 18th SIGSPATIAL International Conference on Advances in
Geographic Information Systems, pp. 270–279.
You, N., Dong, J., 2020.
Examining earliest identifiable timing of crops using all
available
Sentinel 1/2 imagery and Google Earth Engine. ISPRS J. Photogramm. Remote Sens. 161,
109–123.
Zhang, J., Wu,
Q., Shen, C., Zhang, J., Lu, J., 2018. Multilabel image classification with regional latent
semantic dependencies. IEEE Trans. Multimedia 20 (10), 2801–2813.
Zhang, L., Zhang, L., Du, B., 2016. Deep learning for remote sensing
data: A technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 4 (2), 22–40.
Zhang, M.L.,
Zhou, Z.H., 2007. ML-KNN: A lazy learning approach to multi-label leaming. Pattern
Recogn. 40 (7), 2038–2048.
Zhao, J., Zhong, Y., Shu, H.,
Zhang, L., 2016. High-resolution image classification
integrating
spectral-spatial-location cues by conditional random fields. IEEE Trans. Image Process. 25
(9), 4033–4045.
Zhou, B., Lapedriza, A., Khosla,
A., Oliva, A., Torralba, A., 2017. Places: A 10 million
image database
for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40 (6), 1452–1464.
Zhou, P., Han,
J., Cheng, G., Zhang, B., 2019. Learning compact and discriminative stacked
autoencoder for hyperspectral image classification. IEEE Trans. Geosci.
Remote Sens. 57 (7), 4823–4833.
Zhu, Q., Sun, X., Zhong, Y., Zhang, L., 2019. High-resolution remote sensing image scene understanding: A
review. In: IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing
Symposium, pp. 3061–3064.