<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Environ. Sci.</journal-id>
<journal-title>Frontiers in Environmental Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Environ. Sci.</abbrev-journal-title>
<issn pub-type="epub">2296-665X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1010630</article-id>
<article-id pub-id-type="doi">10.3389/fenvs.2022.1010630</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Environmental Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Urban functional zone classification based on self-supervised learning: A case study in Beijing, China</article-title>
<alt-title alt-title-type="left-running-head">Lu et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/fenvs.2022.1010630">10.3389/fenvs.2022.1010630</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Lu</surname>
<given-names>Weipeng</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1948325/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Qi</surname>
<given-names>Ji</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1843662/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Feng</surname>
<given-names>Huihui</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1268571/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>School of Geosciences and Info-Physics</institution>, <institution>Central South University</institution>, <addr-line>Changsha</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1455833/overview">Penghai Wu</ext-link>, Anhui University, China</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1984257/overview">Junli Li</ext-link>, Anhui Agricultural University, China</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1566028/overview">Guojie Wang</ext-link>, Nanjing University of Information Science and Technology, China</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Huihui Feng, <email>hhfeng@csu.edu.cn</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Environmental Informatics and Remote Sensing, a section of the journal Frontiers in Environmental Science</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>17</day>
<month>11</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>10</volume>
<elocation-id>1010630</elocation-id>
<history>
<date date-type="received">
<day>03</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>07</day>
<month>11</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Lu, Qi and Feng.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Lu, Qi and Feng</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Urban functional zones (UFZs) are the fundamental units of urban management and operation. Advances in earth observation and deep learning technology provide opportunities for automatically and intelligently classifying UFZs <italic>via</italic> remote sensing images. However, current methods based on deep learning require numerous high-quality annotations to train a well-performing model, which is time-consuming. Thus, how to train a reliable model using a few annotated samples is a problem in UFZ classification. Self-supervised learning (SSL) can optimize models using numerous unannotated data. In this paper, we introduce SSL into UFZ classification, using the instance discrimination pretext task to guide a model to learn useful features from over 50,000 unannotated remote sensing images, and fine-tune the model using 700 to 7,000 annotated samples. The validation experiment in Beijing, China reveals that 1) with a few annotated samples, SSL can achieve a kappa coefficient and an overall accuracy 2.1&#x2013;11.8% and 2.0&#x2013;10.0% higher than those of supervised learning (SL), and 2) SSL can also obtain results comparable to those of the SL paradigm trained with twice as much annotated data. The less data used for fine-tuning, the more obvious the advantage of SSL over SL. Besides, the comparison between the model pretrained on the research region and that pretrained on the benchmark reveals that objects with displacement and incompleteness are more difficult for models to classify accurately.</p>
</abstract>
<kwd-group>
<kwd>self-supervised learning</kwd>
<kwd>urban functional zone</kwd>
<kwd>remote sensing</kwd>
<kwd>deep learning</kwd>
<kwd>image classification</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Urban functional zones (UFZs), including commercial zones, industrial zones, and residential zones, host specific social activities. The spatial distribution of UFZs describes the city structure and reveals the land demand, playing an important role in urban management (<xref ref-type="bibr" rid="B42">Zhang et al., 2017</xref>; <xref ref-type="bibr" rid="B5">Chen et al., 2018</xref>). Nowadays, geographic big data such as points of interest and geo-tagged photos have become available and have been used to analyze UFZ spatial patterns. For example, <xref ref-type="bibr" rid="B40">Yin et al. (2021b)</xref> used the density of points of interest to determine the type of parcels and map out the UFZs. <xref ref-type="bibr" rid="B18">Kang et al. (2021)</xref> used photos from Flickr to investigate landscapes and guide the tourism industry. However, these data are uploaded by users, so their quality is uncontrollable (<xref ref-type="bibr" rid="B39">Yin et al., 2021a</xref>). The advance in earth observation provides high spatiotemporal-resolution remote sensing imagery (RSI), which is widely used for UFZ classification research (<xref ref-type="bibr" rid="B1">Bao et al., 2020</xref>; <xref ref-type="bibr" rid="B2">Cao et al., 2020</xref>; <xref ref-type="bibr" rid="B22">Liu et al., 2021</xref>).</p>
<p>Traditional RSI interpretation relies on handcrafted features (<xref ref-type="bibr" rid="B9">Dai and Yang, 2010</xref>; <xref ref-type="bibr" rid="B47">Zhu et al., 2014</xref>; <xref ref-type="bibr" rid="B3">Castelluccio et al., 2015</xref>), in which radiometric, texture, and shape features are used for image classification and retrieval (<xref ref-type="bibr" rid="B26">Luo et al., 2013</xref>). <xref ref-type="bibr" rid="B43">Zhang et al. (2018)</xref> proposed a hierarchical bottom-up and top-down feedback model to improve the classification accuracy of UFZs using handcrafted features such as the gray-level co-occurrence matrix (GLCM). <xref ref-type="bibr" rid="B11">Du et al. (2019)</xref> used the window independent context (WIC) feature to extract spatial units of UFZs from very-high-resolution RSI. However, designing a good handcrafted feature requires expert experience, and such features have low robustness, so they cannot provide satisfactory results in complex RSI interpretation tasks like UFZ classification (<xref ref-type="bibr" rid="B6">Cheng et al., 2017</xref>).</p>
<p>Recently, with the development of deep learning technology, methods based on high-level visual features, such as convolutional neural networks (CNNs), have been employed for intelligent and automatic feature extraction (<xref ref-type="bibr" rid="B17">Ioffe and Szegedy, 2015</xref>; <xref ref-type="bibr" rid="B14">He et al., 2016</xref>; <xref ref-type="bibr" rid="B30">Szegedy et al., 2016</xref>). More and more UFZ researchers have adopted CNNs for representation and classification (<xref ref-type="bibr" rid="B24">Liu et al., 2017</xref>; <xref ref-type="bibr" rid="B8">Cheng et al., 2018</xref>; <xref ref-type="bibr" rid="B34">Wang et al., 2018</xref>). For UFZ classification, CNNs have become an essential tool in the past 5 years (<xref ref-type="bibr" rid="B1">Bao et al., 2020</xref>; <xref ref-type="bibr" rid="B23">Liu et al., 2020</xref>; <xref ref-type="bibr" rid="B37">Xu et al., 2020</xref>; <xref ref-type="bibr" rid="B46">Zhou et al., 2020</xref>; <xref ref-type="bibr" rid="B12">Du et al., 2021</xref>; <xref ref-type="bibr" rid="B25">Lu et al., 2022</xref>). <xref ref-type="bibr" rid="B46">Zhou et al. (2020)</xref> proposed super-object-based CNNs to classify UFZs in RSI, using AlexNet (<xref ref-type="bibr" rid="B20">Krizhevsky et al., 2012</xref>), a typical CNN model, to determine the class of a clipped RSI. <xref ref-type="bibr" rid="B12">Du et al. (2021)</xref> designed a multi-scale semantic segmentation network combined with an object-level conditional random field to map UFZs at the object level.</p>
<p>Generally, the training of a CNN follows the supervised learning (SL) paradigm, which fits the parameters using numerous annotated training data. Under the SL paradigm, training a stable model requires a large number of high-quality samples (<xref ref-type="bibr" rid="B27">Ma et al., 2017</xref>). Large-scale image classification datasets, such as ImageNet (<xref ref-type="bibr" rid="B20">Krizhevsky et al., 2012</xref>) and the <bold>P</bold>attern <bold>A</bold>nalysis, <bold>S</bold>tatistical modeling and <bold>C</bold>omput<bold>A</bold>tional <bold>L</bold>earning <bold>V</bold>isual <bold>O</bold>bject <bold>C</bold>lasses (PASCAL VOC) challenge dataset (<xref ref-type="bibr" rid="B13">Everingham et al., 2010</xref>), have promoted the development of SL in computer vision. However, when SL is applied to fields like remote sensing and medical imaging, this training paradigm often suffers from insufficient training samples. Annotating an RSI dataset requires professional knowledge and tedious work, so building an RSI dataset as large as ImageNet is costly. Therefore, it is difficult to train a well-performing UFZ classification model with existing datasets under the SL paradigm.</p>
<p>Transfer learning (TL) pretrains a model on large-scale datasets <italic>via</italic> SL and then finetunes part of the model parameters on target tasks, such as UFZ classification, which reduces the annotation requirement (<xref ref-type="bibr" rid="B35">Wang et al., 2020</xref>; <xref ref-type="bibr" rid="B38">Yang et al., 2020</xref>). TL assumes that the model can learn a general representation from large-scale datasets, and that the representation can be transferred into the remote sensing domain with a few annotated data. However, TL requires that the data used for pretraining and finetuning have the same number of channels. Natural images have three channels, the red, green, and blue bands (RGB bands), but different RSIs have different numbers of channels. For example, multispectral and hyperspectral images have more than three channels, and panchromatic imagery has only one channel. This difference in channel numbers makes it difficult to finetune an RGB-pretrained model on RSIs. In addition, RGB-band RSIs have quite different visual characteristics from natural images, due to different imaging mechanisms such as viewing angle and distance. Therefore, how to train a model <italic>via</italic> massive unannotated RSIs remains a problem.</p>
<p>In the past few years, self-supervised learning (SSL) has become popular for model pretraining and obtains results comparable to those of previous learning paradigms in computer vision tasks such as image classification, semantic segmentation, and object detection (<xref ref-type="bibr" rid="B10">Doersch and Zisserman, 2017</xref>; <xref ref-type="bibr" rid="B28">Similarities, 2021</xref>; <xref ref-type="bibr" rid="B33">Tao et al., 2021</xref>; <xref ref-type="bibr" rid="B21">Li et al., 2022</xref>). SSL trains models to learn useful knowledge <italic>via</italic> pretext tasks, whose annotations are obtained directly from the training data. Thus, SSL has the advantage that its pretraining stage is label-free. Recently, SSL research on RSI has made great progress, but most studies only used public benchmarks for experiments (<xref ref-type="bibr" rid="B41">Yu et al., 2020</xref>; <xref ref-type="bibr" rid="B45">Zhao et al., 2020</xref>; <xref ref-type="bibr" rid="B29">Stojnic and Risojevic, 2021</xref>). For example, <xref ref-type="bibr" rid="B32">Tao et al. (2022)</xref> investigated the potential of SSL for RSI interpretation on three open RSI datasets: <bold>EuroSAT</bold>, the Land Use and Land Cover Classification dataset with Sentinel-2 (<xref ref-type="bibr" rid="B15">Helber et al., 2019</xref>); the <bold>A</bold>erial <bold>I</bold>mage <bold>D</bold>ataset <bold>(AID)</bold> (<xref ref-type="bibr" rid="B36">Xia et al., 2017</xref>); and <bold>NWPU-RESISC45</bold>, the <bold>RE</bold>mote <bold>S</bold>ensing <bold>I</bold>mage <bold>S</bold>cene <bold>C</bold>lassification (RESISC) dataset created by <bold>N</bold>orthwestern <bold>P</bold>olytechnical <bold>U</bold>niversity (NWPU) with <bold>45</bold> classes (<xref ref-type="bibr" rid="B6">Cheng et al., 2017</xref>). No relevant studies have been carried out in practical applications. The RSIs selected for open RSI datasets and those used in practical applications differ in two respects (<xref ref-type="bibr" rid="B6">Cheng et al., 2017</xref>, <xref ref-type="bibr" rid="B7">2020</xref>; <xref ref-type="bibr" rid="B16">Hong, 2021</xref>).<list list-type="simple">
<list-item>
<p>&#x2022; First, in the images selected for open RSI datasets, the objects of interest are always at the center. In the images used for practical applications, the location of key objects is random, so it is difficult to crop images with the target at the center. This displacement of objects causes samples to be misclassified by models pre-trained on benchmarks.</p>
</list-item>
<list-item>
<p>&#x2022; Second, the image size of a benchmark is always fixed, but the scales of objects in benchmarks differ (e.g., factory and airport). Thus, the spatial resolution of benchmark images varies to ensure that the key object is completely contained in the image. In practice, however, the spatial resolution and the image size are both fixed, so a large-scale object might be cropped into several patches, which are difficult for the model to classify accurately.</p>
</list-item>
</list>
</p>
<p>Therefore, this paper introduces SSL into the UFZ classification of the region inside the Sixth Ring Road of Beijing, China, and investigates the different performance of SSL on open RSI datasets and in practical application. Specifically, we pretrain the model on unannotated RSI of downtown Beijing <italic>via</italic> SSL, and then collect a small-scale UFZ classification dataset to fine-tune the model. Notably, to mimic the practical application, all samples used in the experiments are randomly cropped with fixed resolution and size. The experimental results show that SSL has advantages over SL in terms of sample demand and final classification accuracy.</p>
</sec>
<sec sec-type="materials|methods" id="s2">
<title>2 Materials and methods</title>
<sec id="s2-1">
<title>2.1 Study area and data</title>
<p>This study takes Beijing, China as the research region (shown in <xref ref-type="fig" rid="F1">Figure 1</xref>). It has a spatial coverage of 3300&#xa0;km<sup>2</sup> (longitude 116&#xb0;04&#x2032;-116&#xb0;44&#x2032;E and latitude 39&#xb0;40&#x2032;-40&#xb0;11&#x2032;N) and a population of 21,000,000. This region contains a variety of urban landscapes, effectively representing both old and new city areas as well as urban and suburban districts.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>The study area and its location in Beijing, China.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g001.tif"/>
</fig>
<p>The RSI used in this paper is downloaded from Bing Virtual with a size of <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mn>53248</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>69632</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> pixels in the WGS-84 framework. For SSL, the entire image is meshed into 56,576 patches with a size of 256 &#xd7; 256 pixels. Considering current research (<xref ref-type="bibr" rid="B44">Zhang et al., 2020</xref>; <xref ref-type="bibr" rid="B22">Liu et al., 2021</xref>; <xref ref-type="bibr" rid="B25">Lu et al., 2022</xref>) and the &#x201c;Code for classification of urban and rural land use and planning standards of development land (GB50137)&#x201d; issued by the Ministry of Housing and Urban-Rural Development of the People&#x2019;s Republic of China, we divide the UFZs into 10 types. For model finetuning, we annotate a few patches manually. The classification system and the number of patches for each UFZ type are shown in <xref ref-type="table" rid="T1">Table 1</xref>, and some of the annotated patches are shown in <xref ref-type="fig" rid="F2">Figure 2</xref>.</p>
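<p>The meshing step itself is plain array slicing. The following sketch is illustrative only (the function name is ours, and the mosaic is assumed to be loaded as a NumPy array); note that (53248 / 256) &#xd7; (69632 / 256) = 208 &#xd7; 272 = 56,576, matching the patch count above.</p>
<preformat preformat-type="code">import numpy as np

def mesh_into_patches(image, patch_size=256):
    # image: NumPy array holding the mosaic, e.g., shape (53248, 69632, 3);
    # (53248 // 256) * (69632 // 256) = 208 * 272 = 56,576 patches
    rows = image.shape[0] // patch_size
    cols = image.shape[1] // patch_size
    patches = []
    for r in range(rows):
        for c in range(cols):
            patches.append(image[r * patch_size:(r + 1) * patch_size,
                                 c * patch_size:(c + 1) * patch_size])
    return patches
</preformat>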
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>The classification system and number of patches for each UFZ type of the collected dataset.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Category</th>
<th align="left">Definition</th>
<th align="left">Number of patches</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Commercial</td>
<td align="left">Financial center, retail center, shopping mall, office building</td>
<td align="left">314</td>
</tr>
<tr>
<td align="left">Residential</td>
<td align="left">Residence, urban shantytown, and rural settlement</td>
<td align="left">1226</td>
</tr>
<tr>
<td align="left">Institutional</td>
<td align="left">educational, medical, cultural, administrative office, and public services</td>
<td align="left">763</td>
</tr>
<tr>
<td align="left">Industrial</td>
<td align="left">factories, warehouse</td>
<td align="left">275</td>
</tr>
<tr>
<td align="left">Transportation</td>
<td align="left">Railway, highway, port and its surrounding water, bus station, railway station, airport, gasoline station</td>
<td align="left">875</td>
</tr>
<tr>
<td align="left">Open Space</td>
<td align="left">urban park, botanic garden, and other urban grasslands</td>
<td align="left">531</td>
</tr>
<tr>
<td align="left">Construction</td>
<td align="left">vacant land, bare land, and land under construction</td>
<td align="left">1310</td>
</tr>
<tr>
<td align="left">Forest</td>
<td align="left">non-urban development land with dense trees</td>
<td align="left">333</td>
</tr>
<tr>
<td align="left">Agricultural</td>
<td align="left">vegetable field, cropland, orchard, and other agricultural lands</td>
<td align="left">1105</td>
</tr>
<tr>
<td align="left">Water</td>
<td align="left">natural and artificial waterbody</td>
<td align="left">572</td>
</tr>
<tr>
<td align="left">&#x3a3;</td>
<td align="left">7304</td>
<td align="left"/>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Samples of the 10 urban function zone (UFZ) types.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g002.tif"/>
</fig>
</sec>
<sec id="s2-2">
<title>2.2 Paradigm of SL and SSL</title>
<p>Supervised learning (SL) is a model training paradigm that has been widely used in big data analysis. Given a dataset <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>k</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and a model <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mi mathvariant="script">F</mml:mi>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> with randomly initialized parameters, SL optimizes <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mi mathvariant="script">F</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> to minimize the error between <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<p>SSL initializes a model&#x2019;s parameters using pretext tasks, such as image reconstruction, rotation prediction, and instance discrimination (<xref ref-type="bibr" rid="B31">Tao et al., 2020</xref>). By solving a pretext task, the model learns useful features from unannotated samples. Here, we introduce the instance discrimination task used in our research.</p>
<p>Given an image (instance) <inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> and its two augmentation views <inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, instance discrimination is to distinguish the positive sample of <inline-formula id="inf10">
<mml:math id="m10">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> from a set of samples <inline-formula id="inf11">
<mml:math id="m11">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula id="inf12">
<mml:math id="m12">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf13">
<mml:math id="m13">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are positive samples of each other. From the perspective of feature space (<xref ref-type="fig" rid="F3">Figure 3</xref>), the goal of instance discrimination is to pull positive samples together and push them apart from the other (negative) samples. A similarity loss function is designed to complete the task (<xref ref-type="disp-formula" rid="e1">Eq. 1</xref>).<disp-formula id="e1">
<mml:math id="m14">
<mml:mrow>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="italic">log</mml:mi>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="italic">cos</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mo>&#x3c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>&#x3e;</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2260;</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mi mathvariant="italic">exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="italic">cos</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mo>&#x3c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x3e;</mml:mo>
<mml:mo>/</mml:mo>
<mml:mi>&#x3c4;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <inline-formula id="inf14">
<mml:math id="m15">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf15">
<mml:math id="m16">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are the projected representations of the two augmentation views. <inline-formula id="inf16">
<mml:math id="m17">
<mml:mrow>
<mml:mi mathvariant="italic">cos</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mo>&#x3c;</mml:mo>
<mml:mi mathvariant="bold-italic">u</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold-italic">v</mml:mi>
<mml:mo>&#x3e;</mml:mo>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="bold-italic">u</mml:mi>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold-italic">u</mml:mi>
</mml:mrow>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold-italic">v</mml:mi>
</mml:mrow>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula id="inf17">
<mml:math id="m18">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:mi>b</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>l</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the indicator function, whose value is 1 if <italic>bool</italic> is true and 0 otherwise. <inline-formula id="inf18">
<mml:math id="m19">
<mml:mrow>
<mml:mi>&#x3c4;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the temperature parameter.</p>
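<p><xref ref-type="disp-formula" rid="e1">Eq. 1</xref> is the normalized temperature-scaled cross-entropy loss popularized by contrastive methods such as SimCLR (<xref ref-type="bibr" rid="B4">Chen et al., 2020</xref>). A minimal PyTorch sketch is given below for illustration; it assumes the 2N projected views are arranged so that rows 2k and 2k+1 form a positive pair.</p>
<preformat preformat-type="code">import torch
import torch.nn.functional as F

def instance_discrimination_loss(z, tau=0.5):
    # z: (2N, d) projected views; rows 2k and 2k+1 are the two
    # augmentation views of instance k (0-indexed)
    z = F.normalize(z, dim=1)            # unit vectors, so dot product = cosine
    sim = z @ z.t() / tau                # (2N, 2N) scaled similarity matrix
    sim.fill_diagonal_(float('-inf'))    # implements the k != i indicator
    # each row's positive partner: XOR with 1 pairs rows (0,1), (2,3), ...
    pos = torch.arange(z.shape[0], device=z.device) ^ 1
    # row-wise cross-entropy = -log of the softmax weight on the positive pair
    return F.cross_entropy(sim, pos)
</preformat>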
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Understanding the instance discrimination task from the perspective of feature space.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g003.tif"/>
</fig>
</sec>
<sec id="s2-3">
<title>2.3 Implementation of SSL on UFZ classification</title>
<p>As shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, the implementation of SSL on UFZ classification includes two steps: 1) learning useful knowledge <italic>via</italic> SSL, and 2) finetuning the pre-trained model to the UFZ classification domain <italic>via</italic> SL. In the first step, the model will be trained on large-scale unannotated RSIs to learn useful knowledge. In the second step, the model will be finetuned on a small-scale UFZ classification dataset with annotation to obtain a UFZ classification model.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Flowchart of UFZ classification by SSL.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g004.tif"/>
</fig>
<sec id="s2-3-1">
<title>2.3.1 Learning potential useful knowledge <italic>via</italic> SSL</title>
<p>In this study, we use instance discrimination as the pretext task, as it can guide the model to learn the invariance of an image and the difference between two images (<xref ref-type="bibr" rid="B4">Chen et al., 2020</xref>).</p>
<p>We design a CNN that contains a visual feature encoder <inline-formula id="inf19">
<mml:math id="m20">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and a feature projector <inline-formula id="inf20">
<mml:math id="m21">
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>g</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> to represent augmentation views and complete the instance discrimination task. The SSL training has three steps (a code sketch of one training iteration follows the list):<list list-type="simple">
<list-item>
<p>1) Generation of positive samples: randomly select a mini batch of unannotated samples <inline-formula id="inf21">
<mml:math id="m22">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> from a large-scale dataset <inline-formula id="inf22">
<mml:math id="m23">
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, and augment them by two random augmentation rules <inline-formula id="inf23">
<mml:math id="m24">
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf24">
<mml:math id="m25">
<mml:mrow>
<mml:msub>
<mml:mi>t</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (e.g., rotation, flip, random mask, and dithering). In this way, a set of augmentation views <inline-formula id="inf25">
<mml:math id="m26">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is generated, in which <inline-formula id="inf26">
<mml:math id="m27">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>k</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf27">
<mml:math id="m28">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> are a pair of positive samples.</p>
</list-item>
<list-item>
<p>2) Representation of augmentation views: encode the augmentation views in <inline-formula id="inf28">
<mml:math id="m29">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> by <inline-formula id="inf29">
<mml:math id="m30">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> to get the visual representation <inline-formula id="inf30">
<mml:math id="m31">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and project the representation by <inline-formula id="inf31">
<mml:math id="m32">
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>g</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. In this way, all augmentation views are projected as <inline-formula id="inf32">
<mml:math id="m33">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> in the instance discriminative space.</p>
</list-item>
<list-item>
<p>3) Instance discrimination: optimize <inline-formula id="inf33">
<mml:math id="m34">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf34">
<mml:math id="m35">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>g</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> by minimizing the similarity loss in <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>.</p>
</list-item>
</list>
</p>
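<p>The three steps above can be combined into one pretraining iteration, sketched below for illustration. Here augment stands for the two random augmentation rules, and instance_discrimination_loss is the function sketched under <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>; the optimizer choice and other details may differ from the actual training configuration.</p>
<preformat preformat-type="code">import torch

def ssl_pretrain_step(f, g, augment, batch, optimizer, tau=0.5):
    # batch: N unannotated patches, shape (N, 3, 256, 256)
    # 1) generation of positive samples: two random views per instance,
    #    interleaved so that rows 2k and 2k+1 form a positive pair
    views = torch.stack([augment(batch), augment(batch)], dim=1).flatten(0, 1)
    # 2) representation: encode with f(.|theta_f), project with g(.|theta_g)
    z = g(f(views))
    # 3) discrimination: update theta_f and theta_g via the similarity loss
    loss = instance_discrimination_loss(z, tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</preformat>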
</sec>
<sec id="s2-3-2">
<title>2.3.2 Finetuning the pre-trained model to the UFZ classification domain</title>
<p>Finetuning the pre-trained model <italic>via</italic> SL means using a small-scale annotated dataset to adjust some parameters of the pre-trained model. In this study, we use the collected UFZ classification dataset to finetune the pre-trained model to the UFZ classification domain through the following two steps (a code sketch of one finetuning step follows <xref ref-type="disp-formula" rid="e2">Eq. 2</xref>):<list list-type="simple">
<list-item>
<p>1) Extracting useful features: randomly sample a mini batch of data <inline-formula id="inf35">
<mml:math id="m36">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">y</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> from the annotated dataset <inline-formula id="inf36">
<mml:math id="m37">
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mo>,</mml:mo>
<mml:mover accent="true">
<mml:mi mathvariant="bold-italic">Y</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, and extract features with the pretrained feature encoder <inline-formula id="inf37">
<mml:math id="m38">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> to get the feature representation <inline-formula id="inf38">
<mml:math id="m39">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</list-item>
<list-item>
<p>2) Finetuning the model by SL: randomly initialize a classifier <inline-formula id="inf39">
<mml:math id="m40">
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>&#x3c6;</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and classify <inline-formula id="inf40">
<mml:math id="m41">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> to predict the class distribution probability <inline-formula id="inf41">
<mml:math id="m42">
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:msubsup>
<mml:mo>}</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> and optimize <inline-formula id="inf42">
<mml:math id="m43">
<mml:mrow>
<mml:mi>&#x3c6;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> by minimizing the cross-entropy (CE) loss (<xref ref-type="disp-formula" rid="e2">Eq. 2</xref>) between <inline-formula id="inf43">
<mml:math id="m44">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf44">
<mml:math id="m45">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">y</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula id="inf45">
<mml:math id="m46">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi mathvariant="double-struck">I</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf46">
<mml:math id="m47">
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>C</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf47">
<mml:math id="m48">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the total class number. In this study, <inline-formula id="inf48">
<mml:math id="m49">
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is 10.</p>
</list-item>
</list>
<disp-formula id="e2">
<mml:math id="m50">
<mml:mrow>
<mml:msub>
<mml:mi>l</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">y</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:munderover>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:munderover>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mi mathvariant="italic">log</mml:mi>
<mml:mo>&#x2061;</mml:mo>
<mml:mo>&#x2008;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
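<p>The corresponding finetuning step is sketched below for illustration. The whole encoder is kept trainable here; freezing part of its parameters is an equally valid way to &#x201c;adjust some parameters&#x201d; of the pre-trained model.</p>
<preformat preformat-type="code">import torch
import torch.nn.functional as F

def finetune_step(f, classifier, batch, labels, optimizer):
    # batch: (n, 3, 256, 256) annotated patches; labels: (n,) classes in 0..9
    h = f(batch)                            # 1) features from the pretrained encoder
    logits = classifier(h)                  # 2) class scores; SoftMax is inside the loss
    loss = F.cross_entropy(logits, labels)  # the CE loss of Eq. 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
</preformat>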
</sec>
<sec id="s2-3-3">
<title>2.3.3 Implementation details</title>
<p>In the experiment, we use ResNet50 (<xref ref-type="bibr" rid="B14">He et al., 2016</xref>) as the backbone of the visual encoder <inline-formula id="inf49">
<mml:math id="m51">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>f</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, and take two stacked fully connected (FC) layers with the <bold>RE</bold>ctified <bold>L</bold>inear <bold>U</bold>nit (<inline-formula id="inf50">
<mml:math id="m52">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>) activation function as the projector <inline-formula id="inf51">
<mml:math id="m53">
<mml:mrow>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>g</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. For an image <inline-formula id="inf52">
<mml:math id="m54">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (or its augmentation view <inline-formula id="inf53">
<mml:math id="m55">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>), the model first extracts its visual feature with the feature extractor <inline-formula id="inf54">
<mml:math id="m56">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="double-struck">R</mml:mi>
<mml:mn>2048</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, and then projects <inline-formula id="inf55">
<mml:math id="m57">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> to the instance discriminative space by projector <inline-formula id="inf56">
<mml:math id="m58">
<mml:mrow>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> (<xref ref-type="disp-formula" rid="e3">Eq. 3</xref>), in which <inline-formula id="inf57">
<mml:math id="m59">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="double-struck">R</mml:mi>
<mml:mrow>
<mml:mn>1024</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>2048</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf58">
<mml:math id="m60">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="double-struck">R</mml:mi>
<mml:mrow>
<mml:mn>128</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>1024</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> are the learnable weights of the FC layers, and <inline-formula id="inf59">
<mml:math id="m61">
<mml:mrow>
<mml:mi>&#x3b4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="italic">max</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> denotes the <inline-formula id="inf60">
<mml:math id="m62">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>U</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> function.<disp-formula id="e3">
<mml:math id="m63">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">z</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>g</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold-italic">&#x3b8;</mml:mi>
<mml:mi>g</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi>&#x3b4;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>
</p>
<p>For model finetuning, we take a classifier with a single FC layer. Mathematically, the classification process can be expressed by <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>.<disp-formula id="e4">
<mml:math id="m64">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>S</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold-italic">h</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
<disp-formula id="e5">
<mml:math id="m65">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi mathvariant="italic">exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:munder>
<mml:mrow>
<mml:mi mathvariant="italic">exp</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>
<inline-formula id="inf61">
<mml:math id="m66">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>3</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the weight of the FC layer, and <inline-formula id="inf62">
<mml:math id="m67">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the class probability distribution of image <inline-formula id="inf63">
<mml:math id="m68">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. <inline-formula id="inf64">
<mml:math id="m69">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>M</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is the normalized exponential function whose expression is <xref ref-type="disp-formula" rid="e5">Eq. 5</xref>. <inline-formula id="inf65">
<mml:math id="m70">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold-italic">p</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denotes the probability of image <inline-formula id="inf66">
<mml:math id="m71">
<mml:mrow>
<mml:msub>
<mml:mover accent="true">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x223c;</mml:mo>
</mml:mover>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> belonging to class <inline-formula id="inf67">
<mml:math id="m72">
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
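<p>The stated dimensions translate directly into a model definition. The sketch below, assuming the torchvision implementation of ResNet50 with its original classification head removed, mirrors the sizes above: 2048-dimensional features, a 2048-to-1024-to-128 projector, and a 2048-to-10 classifier.</p>
<preformat preformat-type="code">import torch.nn as nn
from torchvision.models import resnet50

# visual feature encoder f: ResNet50 with its final FC layer removed
backbone = resnet50()
encoder = nn.Sequential(*list(backbone.children())[:-1], nn.Flatten())  # (N, 2048)

# projector g: two stacked FC layers with ReLU; W1 is 1024 x 2048, W2 is 128 x 1024
projector = nn.Sequential(nn.Linear(2048, 1024), nn.ReLU(), nn.Linear(1024, 128))

# classifier phi used for finetuning: a single FC layer, W3 is 10 x 2048
classifier = nn.Linear(2048, 10)  # SoftMax is applied in the loss (Eq. 4, Eq. 5)
</preformat>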
</sec>
</sec>
</sec>
<sec sec-type="results" id="s3">
<title>3 Results</title>
<p>For quantitative evaluation, we use the Kappa coefficient (Kappa) and overall accuracy (OA) as the overall evaluation indexes, and the producer accuracy (PA), user accuracy (UA), and F1 score (F1) as the evaluation indexes for each category. <xref ref-type="table" rid="T2">Table 2</xref> shows the evaluation results of the two initialization strategies with different numbers of finetuning samples. When 100% of the finetuning samples are used, the SSL method gains a better result than SL: Kappa and OA increase by 2.4% and 2.1%, respectively. According to F1, the SSL method achieves the best results in 8 out of 10 categories. For both the SSL and SL models, forest and water have F1 values above 0.9, owing to their simple texture. Residential zones, transportation, open space, construction, and agricultural land are also visually distinguishable, so their F1 values are over 0.75. However, commercial, institutional, and industrial zones with strong social attributes are visually ambiguous and difficult to classify accurately using the visual characteristics provided by remote sensing images, so their F1 values are relatively low.</p>
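<p>All five indexes can be derived from a single confusion matrix; the NumPy sketch below illustrates the standard definitions used here (it is not the evaluation script itself).</p>
<preformat preformat-type="code">import numpy as np

def evaluate(conf):
    # conf[i, j]: number of samples of true class i predicted as class j
    total = conf.sum()
    oa = np.trace(conf) / total                        # overall accuracy
    pe = (conf.sum(0) * conf.sum(1)).sum() / total**2  # expected chance agreement
    kappa = (oa - pe) / (1 - pe)                       # Kappa coefficient
    pa = np.diag(conf) / conf.sum(1)                   # producer accuracy (recall)
    ua = np.diag(conf) / conf.sum(0)                   # user accuracy (precision)
    f1 = 2 * pa * ua / (pa + ua)                       # per-class F1 score
    return oa, kappa, pa, ua, f1
</preformat>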
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Evaluation of the models using SSL initialization and random initialization with different percentages of finetuning samples. Com: commercial, Res: residential, Ins: institutional, Ind: industrial, Tra: transportation, OS: open space, Con: construction, For: forest, Agr: agricultural.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Initialization</th>
<th rowspan="2" align="left">Sample (%)</th>
<th colspan="10" align="left">UA (%)/PA (%)/F1</th>
<th rowspan="2" align="left">Kappa</th>
<th rowspan="2" align="left">OA (%)</th>
</tr>
<tr>
<th align="left">Com</th>
<th align="left">Res</th>
<th align="left">Ins</th>
<th align="left">Ind</th>
<th align="left">Tra</th>
<th align="left">OS</th>
<th align="left">Con</th>
<th align="left">For</th>
<th align="left">Agr</th>
<th align="left">Water</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="15" align="left">SSL on the Research Region</td>
<td align="char" char=".">100</td>
<td align="left">57.9</td>
<td align="left">85.3</td>
<td align="left">67.5</td>
<td align="left">74.1</td>
<td align="left">83.6</td>
<td align="left">81.6</td>
<td align="left">77.9</td>
<td align="left">98.5</td>
<td align="left">90.3</td>
<td align="left">95.3</td>
<td align="char" char=".">0.796</td>
<td align="char" char=".">82.2</td>
</tr>
<tr>
<td align="left"/>
<td align="left">34.9</td>
<td align="left">87.8</td>
<td align="left">71.9</td>
<td align="left">72.7</td>
<td align="left">82.2</td>
<td align="left">84.0</td>
<td align="left">83.2</td>
<td align="left">95.6</td>
<td align="left">87.9</td>
<td align="left">94.4</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.436</td>
<td align="left">0.865</td>
<td align="left">0.696</td>
<td align="left">0.734</td>
<td align="left">0.829</td>
<td align="left">0.828</td>
<td align="left">0.804</td>
<td align="left">0.970</td>
<td align="left">0.891</td>
<td align="left">0.949</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">80</td>
<td align="left">51.4</td>
<td align="left">83.9</td>
<td align="left">65.9</td>
<td align="left">75.5</td>
<td align="left">85.2</td>
<td align="left">80.8</td>
<td align="left">77.7</td>
<td align="left">98.5</td>
<td align="left">89.8</td>
<td align="left">92.3</td>
<td align="char" char=".">0.790</td>
<td align="char" char=".">81.7</td>
</tr>
<tr>
<td align="left"/>
<td align="left">30.2</td>
<td align="left">86.9</td>
<td align="left">71.9</td>
<td align="left">72.7</td>
<td align="left">82.3</td>
<td align="left">79.2</td>
<td align="left">82.4</td>
<td align="left">97.0</td>
<td align="left">87.8</td>
<td align="left">94.7</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.380</td>
<td align="left">0.854</td>
<td align="left">0.688</td>
<td align="left">0.741</td>
<td align="left">0.837</td>
<td align="left">0.800</td>
<td align="left">0.800</td>
<td align="left">0.977</td>
<td align="left">0.888</td>
<td align="left">0.935</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">40</td>
<td align="left">40.9</td>
<td align="left">79.1</td>
<td align="left">60.6</td>
<td align="left">70.8</td>
<td align="left">82.5</td>
<td align="left">79.8</td>
<td align="left">73.3</td>
<td align="left">98.5</td>
<td align="left">89.4</td>
<td align="left">91.2</td>
<td align="char" char=".">0.753</td>
<td align="char" char=".">78.5</td>
</tr>
<tr>
<td align="left"/>
<td align="left">14.3</td>
<td align="left">88.2</td>
<td align="left">71.2</td>
<td align="left">61.8</td>
<td align="left">75.4</td>
<td align="left">78.3</td>
<td align="left">80.5</td>
<td align="left">95.5</td>
<td align="left">84.2</td>
<td align="left">90.4</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.212</td>
<td align="left">0.834</td>
<td align="left">0.655</td>
<td align="left">0.660</td>
<td align="left">0.788</td>
<td align="left">0.790</td>
<td align="left">0.767</td>
<td align="left">0.970</td>
<td align="left">0.867</td>
<td align="left">0.907</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">20</td>
<td align="left">40.0</td>
<td align="left">77.1</td>
<td align="left">58.3</td>
<td align="left">74.1</td>
<td align="left">77.8</td>
<td align="left">82.5</td>
<td align="left">67.0</td>
<td align="left">98.5</td>
<td align="left">86.8</td>
<td align="left">91.9</td>
<td align="char" char=".">0.727</td>
<td align="char" char=".">76.4</td>
</tr>
<tr>
<td align="left"/>
<td align="left">3.2</td>
<td align="left">86.5</td>
<td align="left">64.1</td>
<td align="left">36.4</td>
<td align="left">70.3</td>
<td align="left">75.5</td>
<td align="left">85.9</td>
<td align="left">95.5</td>
<td align="left">86.0</td>
<td align="left">89.5</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.059</td>
<td align="left">0.815</td>
<td align="left">0.611</td>
<td align="left">0.488</td>
<td align="left">0.739</td>
<td align="left">0.788</td>
<td align="left">0.753</td>
<td align="left">0.970</td>
<td align="left">0.864</td>
<td align="left">0.907</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">10</td>
<td align="left">0</td>
<td align="left">73.3</td>
<td align="left">53.6</td>
<td align="left">0</td>
<td align="left">79.3</td>
<td align="left">92.2</td>
<td align="left">52.6</td>
<td align="left">98.4</td>
<td align="left">75.4</td>
<td align="left">90.7</td>
<td align="char" char=".">0.642</td>
<td align="char" char=".">69.3</td>
</tr>
<tr>
<td align="left"/>
<td align="left">0</td>
<td align="left">85.3</td>
<td align="left">48.4</td>
<td align="left">0</td>
<td align="left">54.9</td>
<td align="left">55.7</td>
<td align="left">87.4</td>
<td align="left">91.0</td>
<td align="left">84.6</td>
<td align="left">86.0</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">-</td>
<td align="left">0.789</td>
<td align="left">0.509</td>
<td align="left">-</td>
<td align="left">0.649</td>
<td align="left">0.694</td>
<td align="left">0.657</td>
<td align="left">0.946</td>
<td align="left">0.797</td>
<td align="left">0.883</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td rowspan="15" align="left">Random initialization</td>
<td align="char" char=".">100</td>
<td align="left">35.0</td>
<td align="left">83.4</td>
<td align="left">67.3</td>
<td align="left">70.6</td>
<td align="left">81.0</td>
<td align="left">71.8</td>
<td align="left">85.3</td>
<td align="left">97.0</td>
<td align="left">89.6</td>
<td align="left">95.3</td>
<td align="char" char=".">0.779</td>
<td align="char" char=".">80.6</td>
</tr>
<tr>
<td align="left"/>
<td align="left">33.3</td>
<td align="left">82.0</td>
<td align="left">73.9</td>
<td align="left">65.5</td>
<td align="left">82.9</td>
<td align="left">79.2</td>
<td align="left">84.4</td>
<td align="left">97.0</td>
<td align="left">86.0</td>
<td align="left">89.5</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.341</td>
<td align="left">0.827</td>
<td align="left">0.704</td>
<td align="left">0.679</td>
<td align="left">0.819</td>
<td align="left">0.753</td>
<td align="left">0.848</td>
<td align="left">0.970</td>
<td align="left">0.878</td>
<td align="left">0.923</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">80</td>
<td align="left">29.8</td>
<td align="left">81.6</td>
<td align="left">64.8</td>
<td align="left">81.8</td>
<td align="left">82.4</td>
<td align="left">67.5</td>
<td align="left">82.3</td>
<td align="left">95.6</td>
<td align="left">91.4</td>
<td align="left">95.2</td>
<td align="char" char=".">0.773</td>
<td align="char" char=".">80.2</td>
</tr>
<tr>
<td align="left"/>
<td align="left">22.2</td>
<td align="left">83.3</td>
<td align="left">67.3</td>
<td align="left">65.5</td>
<td align="left">82.9</td>
<td align="left">76.4</td>
<td align="left">88.9</td>
<td align="left">97.0</td>
<td align="left">86.9</td>
<td align="left">86.8</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.255</td>
<td align="left">0.824</td>
<td align="left">0.660</td>
<td align="left">0.727</td>
<td align="left">0.826</td>
<td align="left">0.717</td>
<td align="left">0.855</td>
<td align="left">0.963</td>
<td align="left">0.891</td>
<td align="left">0.908</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">40</td>
<td align="left">34.8</td>
<td align="left">74.7</td>
<td align="left">57.2</td>
<td align="left">57.1</td>
<td align="left">73.3</td>
<td align="left">69.6</td>
<td align="left">76.7</td>
<td align="left">96.8</td>
<td align="left">85.8</td>
<td align="left">91.6</td>
<td align="char" char=".">0.709</td>
<td align="char" char=".">74.7</td>
</tr>
<tr>
<td align="left"/>
<td align="left">25.4</td>
<td align="left">84.5</td>
<td align="left">59.5</td>
<td align="left">36.4</td>
<td align="left">75.4</td>
<td align="left">75.5</td>
<td align="left">80.5</td>
<td align="left">91.0</td>
<td align="left">79.2</td>
<td align="left">86.0</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.294</td>
<td align="left">0.793</td>
<td align="left">0.583</td>
<td align="left">0.444</td>
<td align="left">0.744</td>
<td align="left">0.724</td>
<td align="left">0.786</td>
<td align="left">0.938</td>
<td align="left">0.824</td>
<td align="left">0.887</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">20</td>
<td align="left">33.7</td>
<td align="left">69.6</td>
<td align="left">52.7</td>
<td align="left">71.4</td>
<td align="left">71.3</td>
<td align="left">60.9</td>
<td align="left">69.7</td>
<td align="left">93.8</td>
<td align="left">75.6</td>
<td align="left">82.0</td>
<td align="char" char=".">0.636</td>
<td align="char" char=".">68.2</td>
</tr>
<tr>
<td align="left"/>
<td align="left">47.6</td>
<td align="left">81.2</td>
<td align="left">50.3</td>
<td align="left">36.4</td>
<td align="left">68.0</td>
<td align="left">66.0</td>
<td align="left">63.4</td>
<td align="left">89.6</td>
<td align="left">74.2</td>
<td align="left">79.8</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.395</td>
<td align="left">0.750</td>
<td align="left">0.515</td>
<td align="left">0.482</td>
<td align="left">0.696</td>
<td align="left">0.633</td>
<td align="left">0.664</td>
<td align="left">0.916</td>
<td align="left">0.749</td>
<td align="left">0.809</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="char" char=".">10</td>
<td align="left">31.0</td>
<td align="left">65.6</td>
<td align="left">44.7</td>
<td align="left">50.0</td>
<td align="left">72.3</td>
<td align="left">55.6</td>
<td align="left">58.3</td>
<td align="left">76.6</td>
<td align="left">72.3</td>
<td align="left">85.9</td>
<td align="char" char=".">0.574</td>
<td align="char" char=".">63.0</td>
</tr>
<tr>
<td align="left"/>
<td align="left">34.9</td>
<td align="left">80.8</td>
<td align="left">44.4</td>
<td align="left">18.2</td>
<td align="left">53.7</td>
<td align="left">47.2</td>
<td align="left">66.8</td>
<td align="left">88.1</td>
<td align="left">71.9</td>
<td align="left">74.6</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left"/>
<td align="left">0.328</td>
<td align="left">0.724</td>
<td align="left">0.446</td>
<td align="left">0.267</td>
<td align="left">0.616</td>
<td align="left">0.510</td>
<td align="left">0.623</td>
<td align="left">0.819</td>
<td align="left">0.721</td>
<td align="left">0.798</td>
<td align="left"/>
<td align="left"/>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="fig" rid="F5">Figure 5</xref> shows the UFZ map predicted by two models. One model is initialized by SSL on the research region (SSL model) with 100% finetuning samples and another is randomly initialized (SL model) with 100% finetuning samples. We show four results in <xref ref-type="fig" rid="F5">Figure 5</xref>, which intuitively demonstrate the superiority of the SSL model in UFZ classification. The comparison chart shows that SL is prone to misclassifying UFZs with visual homogeneity, such as open space, forest, residential zone and commercial zone. For example, in region A, the SSL model accurately identifies the area with forest trails as forest, while the SL model misclassifies it as an open space. A possible reason is that the SL model cannot distinguish between forest trails and park trails when the samples are limited, while the SSL model can distinguish between the two using many unlabeled samples.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>UFZ classification using 100% of the training samples. <bold>(A)</bold> UFZ map predicted by the SSL model; <bold>(B)</bold> UFZ map predicted by the SL model; <bold>(C)</bold> comparison between the results of the SSL model and the SL model.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g005.tif"/>
</fig>
<sec id="s3-1">
<title>3.1 The advantages of SSL in UFZ classification</title>
<p>To compare the performance of SSL and SL in UFZ classification, we carry out a set of experiments using 10%, 20%, 40%, 80%, and 100% of the training samples for finetuning. The overall results are shown in <xref ref-type="fig" rid="F6">Figure 6</xref>; for a detailed quantitative evaluation, please refer to <xref ref-type="table" rid="T2">Table 2</xref>. Compared with the randomly initialized (SL-based) model, the model pretrained <italic>via</italic> SSL achieves better results. The advantages of SSL are as follows:<list list-type="simple">
<list-item>
<p>1) Using the same number of training samples, the SSL-based model achieves higher Kappa and OA than the SL-based model (the computation of both metrics is sketched after this list), and the fewer the training samples, the more obvious the advantage. When 100% of the training samples are used for finetuning, the Kappa and OA of the SSL-based model are 2.1% and 2.0% higher than those of the SL-based model, respectively. When the training samples are reduced to 10%, the corresponding improvements are 11.8% and 10.0%.</p>
</list-item>
<list-item>
<p>2) The SSL-based model achieves results comparable to or better than those obtained by the SL-based model while using fewer samples. When the SSL-based model uses 10% (Kappa: 0.642; OA: 69.3%) and 20% (Kappa: 0.727; OA: 76.4%) of the samples for finetuning, the results are better than those obtained by the SL-based model using 20% (Kappa: 0.636; OA: 68.2%) and 40% (Kappa: 0.709; OA: 74.7%) of the samples, respectively.</p>
</list-item>
</list>
</p>
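<p>For reference, both headline metrics can be reproduced from a vector of predictions in a few lines. The snippet below is a minimal sketch using standard scikit-learn calls on synthetic labels; the data are invented for illustration and are not the measurements reported in <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<preformat>
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, 500)         # ground-truth UFZ labels (synthetic)
y_pred = y_true.copy()
noise = rng.choice(500, size=100, replace=False)
y_pred[noise] = rng.integers(0, 10, 100)  # corrupt ~20% of the predictions

oa = accuracy_score(y_true, y_pred)        # overall accuracy (OA)
kappa = cohen_kappa_score(y_true, y_pred)  # chance-corrected agreement
print(f"OA = {oa:.1%}, Kappa = {kappa:.3f}")
</preformat>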
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Results obtained using different ratios of training samples. R.I., random initialization.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g006.tif"/>
</fig>
</sec>
<sec id="s3-2">
<title>3.2 Spatial patterns of the urban functional zones</title>
<p>As shown in the map in <xref ref-type="fig" rid="F5">Figure 5</xref>, there are many institutional zones downtown, such as government buildings, universities, and research institutes, because Beijing is the cultural and political center of China. Residential zones account for the largest proportion of the city center. The suburbs contain large areas of forest, open space, and agricultural land. The construction regions between the urban and suburban areas reflect the ongoing expansion of Beijing.</p>
<p>In this study, we analyze the spatial patterns of the UFZs in the research region. The location quotient (LQ) is used to evaluate the degree of specialization of a region (<xref ref-type="bibr" rid="B19">Kolars and Haggett, 1967</xref>). LQ is calculated by <xref ref-type="disp-formula" rid="e6">Eq. 6</xref>, in which the area ratio of UFZ <inline-formula id="inf68">
<mml:math id="m73">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> in region <inline-formula id="inf69">
<mml:math id="m74">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is divided by the area ratio of UFZ <inline-formula id="inf70">
<mml:math id="m75">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> in the research region <inline-formula id="inf71">
<mml:math id="m76">
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="e6">
<mml:math id="m77">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:msubsup>
<mml:mi>Q</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
</mml:msubsup>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
</mml:msubsup>
<mml:mo>/</mml:mo>
<mml:msup>
<mml:mi>s</mml:mi>
<mml:mi>r</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>c</mml:mi>
</mml:msub>
<mml:mo>/</mml:mo>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>When <inline-formula id="inf72">
<mml:math id="m78">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:msubsup>
<mml:mi>Q</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> &#x3e; 1.5, UFZ <inline-formula id="inf73">
<mml:math id="m79">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> has a high superiority in region <inline-formula id="inf74">
<mml:math id="m80">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>; when <inline-formula id="inf75">
<mml:math id="m81">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:msubsup>
<mml:mi>Q</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> is between 1 and 1.5, UFZ <inline-formula id="inf76">
<mml:math id="m82">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> exceeds the average level in region <inline-formula id="inf77">
<mml:math id="m83">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>; if <inline-formula id="inf78">
<mml:math id="m84">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:msubsup>
<mml:mi>Q</mml:mi>
<mml:mi>c</mml:mi>
<mml:mi>r</mml:mi>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> &#x3c; 1, UFZ <inline-formula id="inf79">
<mml:math id="m85">
<mml:mrow>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> is below the average level in region <inline-formula id="inf80">
<mml:math id="m86">
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. By calculating LQ, the development status of the UFZs and their degree of functional mixing can be analyzed.</p>
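<p>As a worked illustration of <xref ref-type="disp-formula" rid="e6">Eq. 6</xref>, the short sketch below computes LQ and applies the three thresholds above. The areas are invented for demonstration and are not the Beijing measurements behind the tables that follow.</p>
<preformat>
def location_quotient(area_c_r, area_r, area_c_s, area_s):
    """Eq. 6: LQ_c^r = (s_c^r / s^r) / (s_c / s).
    area_c_r: area of UFZ c inside region r
    area_r:   total UFZ area of region r
    area_c_s: area of UFZ c in the whole research region s
    area_s:   total UFZ area of the research region s"""
    return (area_c_r / area_r) / (area_c_s / area_s)

# Hypothetical example: a UFZ occupying 4 of 60 km^2 in region r,
# versus 30 of 900 km^2 in the whole research region.
lq = location_quotient(4.0, 60.0, 30.0, 900.0)
if lq &gt; 1.5:
    level = "high superiority"
elif lq &gt; 1.0:
    level = "exceeds the average level"
else:
    level = "below the average level"
print(f"LQ = {lq:.2f}: {level}")   # LQ = 2.00: high superiority
</preformat>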
<p>The LQ based on ring roads (from the 2nd to the 6th ring road) and administrative districts (Xicheng, Dongcheng, Haidian, Chaoyang, Shijingshan, and Fengtai) is calculated and shown in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Location quotient based on the ring road.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Category</th>
<th align="left">Inside 2nd</th>
<th align="left">2nd-3rd</th>
<th align="left">3rd-4th</th>
<th align="left">4th-5th</th>
<th align="left">5th-6th</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Commercial</td>
<td align="left">2.01</td>
<td align="left">2.42</td>
<td align="left">2.51</td>
<td align="left">1.67</td>
<td align="left">0.59</td>
</tr>
<tr>
<td align="left">Residential</td>
<td align="left">1.81</td>
<td align="left">1.41</td>
<td align="left">1.69</td>
<td align="left">1.15</td>
<td align="left">0.85</td>
</tr>
<tr>
<td align="left">Institutional</td>
<td align="left">2.82</td>
<td align="left">3.04</td>
<td align="left">2.00</td>
<td align="left">1.27</td>
<td align="left">0.66</td>
</tr>
<tr>
<td align="left">Industrial</td>
<td align="left">0.17</td>
<td align="left">0.18</td>
<td align="left">0.59</td>
<td align="left">0.93</td>
<td align="left">1.13</td>
</tr>
<tr>
<td align="left">Transportation</td>
<td align="left">0.70</td>
<td align="left">1.01</td>
<td align="left">1.22</td>
<td align="left">1.17</td>
<td align="left">0.95</td>
</tr>
<tr>
<td align="left">Open Space</td>
<td align="left">0.27</td>
<td align="left">0.18</td>
<td align="left">0.30</td>
<td align="left">1.13</td>
<td align="left">1.11</td>
</tr>
<tr>
<td align="left">Construction</td>
<td align="left">0.20</td>
<td align="left">0.44</td>
<td align="left">0.65</td>
<td align="left">1.11</td>
<td align="left">1.07</td>
</tr>
<tr>
<td align="left">Forest</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">0.01</td>
<td align="left">1.41</td>
</tr>
<tr>
<td align="left">Agricultural</td>
<td align="left">0.01</td>
<td align="left">0.00</td>
<td align="left">0.03</td>
<td align="left">0.23</td>
<td align="left">1.36</td>
</tr>
<tr>
<td align="left">Water</td>
<td align="left">0.85</td>
<td align="left">0.26</td>
<td align="left">0.19</td>
<td align="left">0.67</td>
<td align="left">1.20</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf81">
<mml:math id="m87">
<mml:mrow>
<mml:mi>&#x3bc;</mml:mi>
<mml:mo>&#xb1;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.88 &#xb1; 0.94</td>
<td align="left">0.89 &#xb1; 1.02</td>
<td align="left">0.92 &#xb1; 0.84</td>
<td align="left">0.93 &#xb1; 0.47</td>
<td align="left">1.03 &#xb1; 0.26</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Location quotient based on administrative district.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Category</th>
<th align="left">Xicheng</th>
<th align="left">Dongcheng</th>
<th align="left">Haidian</th>
<th align="left">Chaoyang</th>
<th align="left">Shijing</th>
<th align="left">Fengtai</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">Commercial</td>
<td align="left">1.50</td>
<td align="left">1.52</td>
<td align="left">0.74</td>
<td align="left">1.29</td>
<td align="left">0.35</td>
<td align="left">1.10</td>
</tr>
<tr>
<td align="left">Residential</td>
<td align="left">1.73</td>
<td align="left">1.67</td>
<td align="left">0.81</td>
<td align="left">1.12</td>
<td align="left">1.25</td>
<td align="left">1.01</td>
</tr>
<tr>
<td align="left">Institutional</td>
<td align="left">2.69</td>
<td align="left">2.58</td>
<td align="left">1.39</td>
<td align="left">0.77</td>
<td align="left">0.69</td>
<td align="left">0.90</td>
</tr>
<tr>
<td align="left">Industrial</td>
<td align="left">0.11</td>
<td align="left">0.32</td>
<td align="left">0.41</td>
<td align="left">1.24</td>
<td align="left">1.09</td>
<td align="left">1.44</td>
</tr>
<tr>
<td align="left">Transportation</td>
<td align="left">0.82</td>
<td align="left">0.79</td>
<td align="left">0.67</td>
<td align="left">1.17</td>
<td align="left">0.84</td>
<td align="left">1.25</td>
</tr>
<tr>
<td align="left">Open Space</td>
<td align="left">0.11</td>
<td align="left">0.31</td>
<td align="left">1.16</td>
<td align="left">0.70</td>
<td align="left">0.52</td>
<td align="left">1.38</td>
</tr>
<tr>
<td align="left">Construction</td>
<td align="left">0.16</td>
<td align="left">0.32</td>
<td align="left">0.56</td>
<td align="left">1.42</td>
<td align="left">0.89</td>
<td align="left">1.01</td>
</tr>
<tr>
<td align="left">Forest</td>
<td align="left">0.00</td>
<td align="left">0.00</td>
<td align="left">1.83</td>
<td align="left">0.00</td>
<td align="left">4.03</td>
<td align="left">0.53</td>
</tr>
<tr>
<td align="left">Agricultural</td>
<td align="left">0.00</td>
<td align="left">0.02</td>
<td align="left">1.35</td>
<td align="left">1.06</td>
<td align="left">0.07</td>
<td align="left">0.68</td>
</tr>
<tr>
<td align="left">Water</td>
<td align="left">1.23</td>
<td align="left">0.20</td>
<td align="left">0.89</td>
<td align="left">1.37</td>
<td align="left">0.58</td>
<td align="left">0.70</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf82">
<mml:math id="m88">
<mml:mrow>
<mml:mi mathvariant="italic">&#x3bc;</mml:mi>
<mml:mo>&#xb1;</mml:mo>
<mml:mi mathvariant="italic">&#x3c3;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">0.83 &#xb1; 0.88</td>
<td align="left">0.77 &#xb1; 0.82</td>
<td align="left">0.98 &#xb1; 0.42</td>
<td align="left">1.01 &#xb1; 0.41</td>
<td align="left">1.03 &#xb1; 1.05</td>
<td align="left">1 &#xb1; 0.29</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Commercial, residential, and institutional zones show different degrees of superiority across the regions divided by the ring roads. The commercial zones show high superiority inside the 5th ring road, the institutional zones are concentrated inside the 4th ring road, and the residential zones are prominent inside the 2nd ring road and between the 3rd and 4th ring roads. Away from the downtown area, the superiority of these functional zones decreases while that of the other UFZs increases.</p>
<p>From the perspective of administrative districts, the commercial, residential, and institutional zones show superiority in the inner city (the Xicheng and Dongcheng districts). Forest shows superiority in the Haidian and Shijingshan districts, which share the Western Hills National Forest Park. In the Chaoyang and Fengtai districts, most UFZ categories are near the average level.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<sec id="s4-1">
<title>4.1 The gap between benchmarks and practical application</title>
<p>As mentioned in the Introduction, SSL has been investigated in depth on various datasets, but it is rarely used in practical applications like UFZ classification, and the gap between benchmarks and practical applications is likewise overlooked. Here, we conduct an experiment in which two ResNet50 models are pretrained <italic>via</italic> SSL, one on the samples generated from the research region and the other on the samples of the AID dataset (<xref ref-type="bibr" rid="B36">Xia et al., 2017</xref>), and both are then finetuned with 100% of the annotated samples collected in the research region. <xref ref-type="table" rid="T5">Table 5</xref> compares the performance of the two models. A sketch of this kind of contrastive pretraining step is given after the table.</p>
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Quantitative result of the models pretrained on the research region and AID.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">Initialization</th>
<th colspan="10" align="left">UA (%)/PA (%)/F1</th>
<th rowspan="2" align="left">Kappa</th>
<th rowspan="2" align="left">OA</th>
</tr>
<tr>
<th align="left">Com</th>
<th align="left">Res</th>
<th align="left">Ins</th>
<th align="left">Ind</th>
<th align="left">Tra</th>
<th align="left">OS</th>
<th align="left">Con</th>
<th align="left">For</th>
<th align="left">Agr</th>
<th align="left">Water</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td rowspan="3" align="left">SSL on the Research Region</td>
<td align="left">57.9</td>
<td align="left">85.3</td>
<td align="left">67.5</td>
<td align="left">74.1</td>
<td align="left">83.6</td>
<td align="left">81.6</td>
<td align="left">77.9</td>
<td align="left">98.5</td>
<td align="left">90.3</td>
<td align="left">95.3</td>
<td rowspan="3" align="char" char=".">0.796</td>
<td rowspan="3" align="char" char=".">82.2 (%)</td>
</tr>
<tr>
<td align="left">34.9</td>
<td align="left">87.8</td>
<td align="left">71.9</td>
<td align="left">72.7</td>
<td align="left">82.2</td>
<td align="left">84.0</td>
<td align="left">83.2</td>
<td align="left">95.6</td>
<td align="left">87.9</td>
<td align="left">94.4</td>
</tr>
<tr>
<td align="left">0.436</td>
<td align="left">0.865</td>
<td align="left">0.696</td>
<td align="left">0.734</td>
<td align="left">0.829</td>
<td align="left">0.828</td>
<td align="left">0.804</td>
<td align="left">0.970</td>
<td align="left">0.891</td>
<td align="left">0.949</td>
</tr>
<tr>
<td rowspan="3" align="left">SSL on AID</td>
<td align="left">46.2</td>
<td align="left">69.0</td>
<td align="left">64.9</td>
<td align="left">68.4</td>
<td align="left">79.7</td>
<td align="left">77.3</td>
<td align="left">63.0</td>
<td align="left">90.0</td>
<td align="left">79.9</td>
<td align="left">90.3</td>
<td rowspan="3" align="char" char=".">0.779</td>
<td rowspan="3" align="char" char=".">80.6 (%)</td>
</tr>
<tr>
<td align="left">19.1</td>
<td align="left">87.4</td>
<td align="left">55.6</td>
<td align="left">47.3</td>
<td align="left">67.4</td>
<td align="left">64.2</td>
<td align="left">74.1</td>
<td align="left">94.0</td>
<td align="left">82.8</td>
<td align="left">89.5</td>
</tr>
<tr>
<td align="left">0.270</td>
<td align="left">0.771</td>
<td align="left">0.599</td>
<td align="left">0.559</td>
<td align="left">0.731</td>
<td align="left">0.701</td>
<td align="left">0.681</td>
<td align="left">0.920</td>
<td align="left">0.813</td>
<td align="left">0.899</td>
</tr>
</tbody>
</table>
</table-wrap>
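<p>The reference list includes SimCLR (<xref ref-type="bibr" rid="B4">Chen et al., 2020</xref>); as a hedged sketch of what one step of such contrastive pretraining of a ResNet50 backbone can look like, the code below implements the NT-Xent loss on random tensors standing in for two augmented views of the same patches. The projection-head width, temperature, and optimizer settings are illustrative assumptions, not the exact configuration of this study.</p>
<preformat>
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent (SimCLR) loss for a batch of paired embeddings."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N x D, unit-norm rows
    sim = z @ z.t() / tau                         # scaled cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n),  # positive of i is i + n
                         torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

backbone = resnet50()                             # randomly initialized
backbone.fc = torch.nn.Identity()                 # expose 2048-d features
proj = torch.nn.Sequential(torch.nn.Linear(2048, 256),
                           torch.nn.ReLU(),
                           torch.nn.Linear(256, 128))
opt = torch.optim.Adam(list(backbone.parameters()) + list(proj.parameters()),
                       lr=1e-3)

x1 = torch.rand(8, 3, 224, 224)  # random tensors stand in for augmented
x2 = torch.rand(8, 3, 224, 224)  # views of the same image patches
loss = nt_xent(proj(backbone(x1)), proj(backbone(x2)))
loss.backward()
opt.step()
print(float(loss))
</preformat>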
<p>Compared with the model pretrained on AID, the model pretrained on the research region achieves 15.9% and 12.0% higher Kappa and OA, respectively. The average F1 obtained by the model pretrained on the research region is 19.2% higher than that of the AID-pretrained model, with a maximum increase of 62% for the commercial zone.</p>
<p>As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, many patches in our dataset contain incomplete objects, whereas the samples in public datasets such as AID are carefully selected, so they have higher visual discriminability and their features are easier for a model to capture. <xref ref-type="fig" rid="F7">Figure 7</xref> shows some samples misclassified by the AID-pretrained model: the objects are not centered in the patches and some objects are incomplete, yet the same samples are accurately classified by the model pretrained on the research region. For example, airplanes are important objects for identifying an airport, but in the practical samples they are not at the center of the patches, which leads to misclassification. This conflicts with the prior knowledge learned from the benchmark, namely that the objects determining the category of an image should be at the image center.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Samples misclassified by the model pretrained on AID. The text at the bottom of each subfigure gives the ground truth and the prediction. For example, the image in the upper left corner belongs to transportation but was misclassified as agricultural land. Com: commercial, Res: residential, Ins: institutional, Ind: industrial, Tra: transportation, OS: open space, Con: construction, For: forest, Agr: agricultural.</p>
</caption>
<graphic xlink:href="fenvs-10-1010630-g007.tif"/>
</fig>
</sec>
</sec>
<sec sec-type="conclusion" id="s5">
<title>5 Conclusion</title>
<p>Current SL-based UFZ classification methods require a large number of training samples, which are difficult to acquire. This study therefore investigates UFZ classification based on SSL. We collect 7,304 typical UFZ samples as finetuning and testing data and map the UFZ distribution inside the 6<sup>th</sup> ring road of Beijing. The experimental results show that SSL achieves better classification results than SL when the same number of training samples is used, and results comparable to SL with half of the training samples. However, the classification accuracy for commercial, institutional, and industrial zones remains unsatisfactory due to visual ambiguity. In addition, the comparison between the model pretrained on the research region and that pretrained on the benchmark reveals the difficulties of applying SSL in practice: the displacement and incompleteness of objects in real data degrade the performance of SSL models.</p>
<p>In the future, we will use social sensing data, such as geo-tagged photos, taxi trajectories, and points of interest, as supplementary information for UFZ classification.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>WL: Data curation, Software, Visualization, Writing-Original draft preparation. JQ: Investigation and Reviewing. HF: Conceptualization, Methodology, Supervision, Editing.</p>
</sec>
<sec id="s10">
<title>Funding</title>
<p>This work was supported by the Inner Mongolia Science &#x26; Technology Plan (2022YFSJ0014).</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ming</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>DFCNN-based semantic recognition of urban functional zones by integrating remote sensing data and POI data</article-title>. <source>Remote Sens. (Basel).</source> <volume>12</volume>, <fpage>1088</fpage>. <pub-id pub-id-type="doi">10.3390/rs12071088</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cao</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Tu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Deep learning-based remote and social sensing data fusion for urban region function recognition</article-title>. <source>ISPRS J. Photogramm. Remote Sens.</source> <volume>163</volume>, <fpage>82</fpage>&#x2013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2020.02.014</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Castelluccio</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Poggi</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sansone</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Verdoliva</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Land use classification in remote sensing images by convolutional neural networks</article-title>. <comment>
<italic>arXiv preprint arXiv:1508.00092</italic> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1508.00092">https://arxiv.org/abs/1508.00092</ext-link> (Accessed August 01, 2015)</comment>.</citation>
</ref>
<ref id="B4">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kornblith</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Norouzi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>A simple framework for contrastive learning of visual representations</article-title>,&#x201d; in <conf-name>Proceedings of the International Conference on Machine Learning (ICML) Proceedings of Machine Learning Research. (PMLR), 1597&#x2013;1607</conf-name>, <conf-date>July 2020</conf-date>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://proceedings.mlr.press/v119/chen20j.html">http://proceedings.mlr.press/v119/chen20j.html</ext-link> (Accessed June 5, 2021)</comment>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tian</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Social functional mapping of urban green space using remote sensing and social sensing data</article-title>. <source>ISPRS J. Photogramm. Remote Sens.</source> <volume>146</volume>, <fpage>436</fpage>&#x2013;<lpage>452</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2018.10.010</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Remote sensing image scene classification: Benchmark and state of the art</article-title>. <source>Proc. IEEE</source> <volume>105</volume>, <fpage>1865</fpage>&#x2013;<lpage>1883</lpage>. <pub-id pub-id-type="doi">10.1109/jproc.2017.2675998</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Xia</surname>
<given-names>G.-S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Remote sensing image scene classification meets deep learning: Challenges, methods, benchmarks, and opportunities</article-title>. <source>IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.</source> <volume>13</volume>, <fpage>3735</fpage>&#x2013;<lpage>3756</lpage>. <pub-id pub-id-type="doi">10.1109/JSTARS.2020.3005403</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>When deep learning meets metric learning: Remote sensing image scene classification via learning discriminative CNNs</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>56</volume>, <fpage>2811</fpage>&#x2013;<lpage>2821</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2017.2783902</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dai</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Satellite image classification via two-layer sparse coding with biased image representation</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>8</volume>, <fpage>173</fpage>&#x2013;<lpage>176</lpage>. <pub-id pub-id-type="doi">10.1109/lgrs.2010.2055033</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Doersch</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Multi-task self-supervised visual learning</article-title>,&#x201d; in <conf-name>Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV)</conf-name>, <conf-loc>Venice, Italy</conf-loc>, <conf-date>October 2017</conf-date> (<publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/iccv.2017.226</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Context-enabled extraction of large-scale urban functional zones from very-high-resolution images: A multiscale segmentation approach</article-title>. <source>Remote Sens. (Basel).</source> <volume>11</volume>, <fpage>1902</fpage>. <pub-id pub-id-type="doi">10.3390/rs11161902</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Mapping large-scale and fine-grained urban functional zones from VHR images using a multi-scale semantic segmentation network and object based approach</article-title>. <source>Remote Sens. Environ.</source> <volume>261</volume>, <fpage>112480</fpage>. <pub-id pub-id-type="doi">10.1016/j.rse.2021.112480</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Everingham</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Van Gool</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>C. K. I.</given-names>
</name>
<name>
<surname>Winn</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>The pascal visual object classes (VOC) challenge</article-title>. <source>Int. J. Comput. Vis.</source> <volume>88</volume>, <fpage>303</fpage>&#x2013;<lpage>338</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-009-0275-4</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Deep residual learning for image recognition</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, <conf-date>June 2016</conf-date>, <fpage>770</fpage>&#x2013;<lpage>778</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Helber</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Bischke</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Dengel</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Borth</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification</article-title>. <source>IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.</source> <volume>12</volume>, <fpage>2217</fpage>&#x2013;<lpage>2226</lpage>. <pub-id pub-id-type="doi">10.1109/JSTARS.2019.2918242</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hong</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yao</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chanussot</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>X. X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model</article-title>. <source>ISPRS J. Photogrammetry Remote Sens.</source> <volume>13</volume>, <fpage>68</fpage>&#x2013;<lpage>80</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2021.05.011</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ioffe</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Szegedy</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>,&#x201d; in <conf-name>Proceedings of the International Conference on Machine Learning</conf-name>, <conf-loc>Guangzhou, China</conf-loc>, <conf-date>July 2015</conf-date>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Cho</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Yoon</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Transfer learning of a deep learning model for exploring tourists&#x2019; urban image using geotagged photos</article-title>. <source>ISPRS Int. J. Geoinf.</source> <volume>10</volume>, <fpage>137</fpage>. <pub-id pub-id-type="doi">10.3390/ijgi10030137</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kolars</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Haggett</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>1967</year>). <article-title>Locational Analysis in human geography</article-title>. <source>Econ. Geogr.</source> <volume>43</volume>, <fpage>276</fpage>. <pub-id pub-id-type="doi">10.2307/143300</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>Imagenet classification with deep convolutional neural networks</article-title>,&#x201d; in <conf-name>Proceedings of the Advances in neural information processing systems</conf-name>, <conf-loc>Lake Tahoe, NV, USA.</conf-loc>, <conf-date>December 2012</conf-date>, <fpage>1097</fpage>&#x2013;<lpage>1105</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Q.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>Global and local contrastive self-supervised learning for semantic segmentation of HR remote sensing images</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>60</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1109/TGRS.2022.3147513</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>B. H.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>Y. B.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Classification schemes and identification methods for urban functional zone: A review of recent papers</article-title>. <source>Appl. Sci. (Basel).</source> <volume>11</volume>, <fpage>9968</fpage>. <pub-id pub-id-type="doi">10.3390/app11219968</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H. M.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y. Y.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>J. B.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>J. C.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>W. T.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Recognizing urban functional zones by a hierarchical fusion method considering landscape features and human activities</article-title>. <source>Trans. Gis</source> <volume>24</volume>, <fpage>1359</fpage>&#x2013;<lpage>1381</lpage>. <pub-id pub-id-type="doi">10.1111/tgis.12642</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Hang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Learning multiscale deep features for high-resolution satellite image scene classification</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>56</volume>, <fpage>117</fpage>&#x2013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2017.2743243</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Tao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>A unified deep learning framework for urban functional zone extraction based on multi-source heterogeneous data</article-title>. <source>Remote Sens. Environ.</source> <volume>270</volume>, <fpage>112830</fpage>. <pub-id pub-id-type="doi">10.1016/j.rse.2021.112830</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Indexing of remote sensing images with different resolutions by multiple features</article-title>. <source>IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.</source> <volume>6</volume>, <fpage>1899</fpage>&#x2013;<lpage>1912</lpage>. <pub-id pub-id-type="doi">10.1109/JSTARS.2012.2228254</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>A review of supervised object-based land-cover image classification</article-title>. <source>ISPRS J. Photogramm. Remote Sens.</source> <volume>130</volume>, <fpage>277</fpage>&#x2013;<lpage>293</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2017.06.001</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schmarje</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Santarossa</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Schr&#xf6;der</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Koch</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>A survey on semi-self- and unsupervised learning for image classification</article-title>. <source>IEEE Access</source> <volume>9</volume>, <fpage>82146</fpage>&#x2013;<lpage>82168</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2021.3084358</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Stojnic</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Risojevic</surname>
<given-names>V.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Self-supervised learning of remote sensing scene representations using contrastive multiview coding</article-title>,&#x201d; in <conf-name>Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</conf-name>, <conf-loc>Nashville, TN, USA</conf-loc>, <conf-date>June 2021</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>1182</fpage>&#x2013;<lpage>1191</lpage>. <pub-id pub-id-type="doi">10.1109/cvprw53098.2021.00129</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Szegedy</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Vanhoucke</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Ioffe</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shlens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wojna</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Rethinking the inception architecture for computer vision</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <conf-loc>Las Vegas, NV, USA</conf-loc>, <conf-date>June 2016</conf-date>, <fpage>2818</fpage>&#x2013;<lpage>2826</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Remote sensing image scene classification with self-supervised paradigm under limited labeled samples</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>1</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/LGRS.2020.3038420</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Remote sensing image scene classification with self-supervised paradigm under limited labeled samples</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/lgrs.2020.3038420</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tao</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Remote sensing image intelligent interpretation: From supervised learning to self-supervised learning</article-title>. <source>Acta Geod. Cartogr. Sinica</source> <volume>50</volume>, <fpage>1122</fpage>&#x2013;<lpage>1134</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chanussot</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Scene classification with recurrent attention of VHR remote sensing images</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>57</volume>, <fpage>1155</fpage>&#x2013;<lpage>1167</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2018.2864987</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Scale-equalizing pyramid convolution for object detection</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, <conf-date>June 2020</conf-date>, <fpage>13359</fpage>&#x2013;<lpage>13368</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xia</surname>
<given-names>G. S.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>J. W.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Shi</surname>
<given-names>B. G.</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhong</surname>
<given-names>Y. F.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Aid: A benchmark data set for performance evaluation of aerial scene classification</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>55</volume>, <fpage>3965</fpage>&#x2013;<lpage>3981</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2017.2685945</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Qing</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A new remote sensing images and point-of-interest fused (RPF) model for sensing urban functional regions</article-title>. <source>Remote Sens. (Basel).</source> <volume>12</volume>, <fpage>1032</fpage>. <pub-id pub-id-type="doi">10.3390/rs12061032</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Xie</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Transfer learning or self-supervised learning? A tale of two pretraining paradigms</article-title>. <comment>
<italic>arXiv:2007.04234 [cs, stat]</italic>. Available at: <ext-link ext-link-type="uri" xlink:href="http://arxiv.org/abs/2007.04234">http://arxiv.org/abs/2007.04234</ext-link> (Accessed September 19, 2021)</comment>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hamm</surname>
<given-names>N. A. S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Xing</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <article-title>Integrating remote sensing and geospatial big data for urban land use mapping: A review</article-title>. <source>Int. J. Appl. Earth Obs. Geoinf.</source> <volume>103</volume>, <fpage>102514</fpage>. <pub-id pub-id-type="doi">10.1016/j.jag.2021.102514</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Hamm</surname>
<given-names>N. A. S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>You</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2021b</year>). <article-title>Decision-level and feature-level integration of remote sensing and geospatial big data for urban land use mapping</article-title>. <source>Remote Sens.</source> <volume>13</volume>, <fpage>1579</fpage>. <pub-id pub-id-type="doi">10.3390/rs13081579</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Attention GANs: Unsupervised deep feature learning for aerial scene classification</article-title>. <source>IEEE Trans. Geosci. Remote Sens.</source> <volume>58</volume>, <fpage>519</fpage>&#x2013;<lpage>531</lpage>. <pub-id pub-id-type="doi">10.1109/tgrs.2019.2937830</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Hierarchical semantic cognition for urban functional zones with VHR satellite images and POI data</article-title>. <source>ISPRS J. Photogramm. Remote Sens.</source> <volume>132</volume>, <fpage>170</fpage>&#x2013;<lpage>184</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2017.09.007</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Integrating bottom-up classification and top-down feedback for improving urban land-cover and functional-zone mapping</article-title>. <source>Remote Sens. Environ.</source> <volume>212</volume>, <fpage>231</fpage>&#x2013;<lpage>248</lpage>. <pub-id pub-id-type="doi">10.1016/j.rse.2018.05.006</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Heuristic sample learning for complex urban scenes: Application to urban functional-zone mapping with VHR images and POI data</article-title>. <source>ISPRS J. Photogramm. Remote Sens.</source> <volume>161</volume>, <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1016/j.isprsjprs.2020.01.005</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Piao</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>When self-supervised learning meets scene classification: Remote sensing scene classification based on a multitask learning framework</article-title>. <source>Remote Sens.</source> <volume>12</volume>, <fpage>3276</fpage>. <pub-id pub-id-type="doi">10.3390/rs12203276</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ming</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Lv</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>SO&#x2013;CNN based urban functional zone fine division with VHR remote sensing image</article-title>. <source>Remote Sens. Environ.</source> <volume>236</volume>, <fpage>111458</fpage>. <pub-id pub-id-type="doi">10.1016/j.rse.2019.111458</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Zhong</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Multi-feature probability topic scene classifier for high spatial resolution remote sensing imagery</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS)</conf-name>, <conf-loc>Quebec City, QC, Canada</conf-loc>, <conf-date>July 2014</conf-date>, <fpage>1</fpage>&#x2013;<lpage>4</lpage>. <pub-id pub-id-type="doi">10.1109/igarss.2014.6947071</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>