<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2023.1114186</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Robust deep semi-supervised learning with label propagation and differential privacy</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yan</surname> <given-names>Zhicong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2113088/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Li</surname> <given-names>Shenghong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Duan</surname> <given-names>Zhongli</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhao</surname> <given-names>Yuanyuan</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Cyber Science and Engineering, Shanghai Jiao Tong University</institution>, <addr-line>Shanghai</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Art and Design, Zhengzhou Institute of Industrial Application Technology, Zhengzhou University, Zhengzhou</institution>, <addr-line>Henan</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>School of Information Science and Engineering, Hangzhou Normal University, Hangzhou</institution>, <addr-line>Zhejiang</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Chaofeng Zhang, Advanced Institute of Industrial Technology, Japan</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Chong Di, Qilu University of Technology, China; Jianwen Xu, Muroran Institute of Technology, Japan</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Shenghong Li <email>shli&#x00040;sjtu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>05</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>5</volume>
<elocation-id>1114186</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>12</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>20</day>
<month>04</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Yan, Li, Duan and Zhao.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Yan, Li, Duan and Zhao</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Semi-supervised learning (SSL) methods provide a powerful tool for utilizing abundant unlabeled data to strengthen standard supervised learning. Traditional graph-based SSL methods prevail in classical SSL problems thanks to their intuitive implementation and effective performance. However, they run into trouble when applied to image classification with modern deep learning, since the diffusion algorithms suffer from the curse of dimensionality. In this study, we propose a simple and efficient SSL method that combines a graph-based SSL paradigm with differential privacy. We aim to develop a coherent latent feature space for deep neural networks so that the diffusion algorithm operating in that space can give more precise predictions for unlabeled data. Our approach achieves state-of-the-art performance on the Cifar10, Cifar100, and Mini-imagenet benchmark datasets and obtains an error rate of 18.56% on Cifar10 using only 1% of all labels. Furthermore, our approach inherits the benefits of graph-based SSL methods with a simple training process and can be easily combined with any network architecture.</p></abstract>
<kwd-group>
<kwd>deep semi-supervised learning</kwd>
<kwd>label propagation</kwd>
<kwd>differential privacy</kwd>
<kwd>robust learning</kwd>
<kwd>mixup data augmentation</kwd>
</kwd-group>
<counts>
<fig-count count="2"/>
<table-count count="5"/>
<equation-count count="17"/>
<ref-count count="36"/>
<page-count count="10"/>
<word-count count="7309"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Computer Security</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Deep neural networks (DNNs) have become the first choice for computer vision applications due to their prominent performance and flexibility. However, the harsh requirement for precisely annotated data has largely constrained the wide application of deep learning: the collection and annotation of large-scale data are extremely costly and time-consuming in some professional industries (e.g., healthcare, finance, and manufacturing). Therefore, semi-supervised learning (SSL), which utilizes abundant unlabeled data in deep learning applications, has become an important research trend in the field of artificial intelligence (Chapelle et al., <xref ref-type="bibr" rid="B4">2006</xref>; Tarvainen and Valpola, <xref ref-type="bibr" rid="B30">2017</xref>; Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>).</p>
<p>Various approaches to SSL for image classification have been proposed in recent years based on a few prototypical assumptions. For instance, the manifold assumption, which states that high-dimensional data usually lie on low-dimensional manifolds, has led to consistency-based SSL methods. Two other well-known assumptions, the cluster assumption and the low-density separation assumption, have also inspired several lines of research. To summarize, two main research directions show great promise.</p>
<p>One direction explores generative model-based approaches. For instance, the VAE (Kingma and Welling, <xref ref-type="bibr" rid="B15">2013</xref>), the generative adversarial network (GAN) (Goodfellow et al., <xref ref-type="bibr" rid="B9">2014</xref>), and normalizing flows (Kobyzev et al., <xref ref-type="bibr" rid="B16">2020</xref>) can establish a low-dimensional hidden variable capturing the manifold of the input data, and Bayesian inference can then be applied to optimize the posterior probability of both labeled and unlabeled examples (Kingma et al., <xref ref-type="bibr" rid="B14">2014</xref>; Makhzani et al., <xref ref-type="bibr" rid="B23">2015</xref>; Rasmus et al., <xref ref-type="bibr" rid="B26">2015</xref>; Maal&#x000F8;e et al., <xref ref-type="bibr" rid="B22">2016</xref>). However, GANs are known to have great difficulty generating high-resolution images, despite a large amount of research in recent years (Radford et al., <xref ref-type="bibr" rid="B25">2015</xref>; Gulrajani et al., <xref ref-type="bibr" rid="B10">2017</xref>; Brock et al., <xref ref-type="bibr" rid="B3">2018</xref>), making these approaches hard to scale to large and complex datasets (Yu et al., <xref ref-type="bibr" rid="B33">2019</xref>). Furthermore, the extensive computational cost of training generative models makes these approaches less practical in real-world applications (Brock et al., <xref ref-type="bibr" rid="B3">2018</xref>).</p>
<p>Another direction tries to exert proper regularization on the classifier using unlabeled data. These regularizations fall into two categories: one is consistency regularization, where two similar images, or two networks with related parameters, are encouraged to produce similar outputs (Sajjadi et al., <xref ref-type="bibr" rid="B28">2016</xref>; Tarvainen and Valpola, <xref ref-type="bibr" rid="B30">2017</xref>; Miyato et al., <xref ref-type="bibr" rid="B24">2019</xref>). The other is based on the data graph. In traditional machine learning, manifold-assumption-based algorithms usually build a graph to describe the manifold structure and then employ the graph Laplacian to induce smoothness on the data manifold; examples include the <italic>Harmonic function</italic> (HF) (Zhu et al., <xref ref-type="bibr" rid="B36">2003</xref>), <italic>Label Propagation</italic> (LP) (Zhou et al., <xref ref-type="bibr" rid="B35">2003</xref>; Gong et al., <xref ref-type="bibr" rid="B8">2015</xref>), and <italic>Manifold Regularization</italic> (MR) (Belkin et al., <xref ref-type="bibr" rid="B2">2006</xref>).</p>
<p>These two types of semi-supervised methods have their respective strengths and weaknesses. Consistency-based methods consider only the perturbations around each data point, ignoring the connections between data points; therefore, they do not fully utilize the data structure, such as manifolds or clusters. This shortcoming can be avoided if the data structure is taken into consideration using graph-based methods, which define convex optimization problems and have closed-form solutions (Zhou et al., <xref ref-type="bibr" rid="B35">2003</xref>; Gong et al., <xref ref-type="bibr" rid="B8">2015</xref>; Tu et al., <xref ref-type="bibr" rid="B31">2015</xref>). Conversely, the performance of graph-based methods degrades if the input data graph fails to satisfy the following conditions (Belkin et al., <xref ref-type="bibr" rid="B2">2006</xref>): capturing the manifold structure of the input space and representing the similarity between two data points. Some traditional research aims at improving the graph quality (Jebara et al., <xref ref-type="bibr" rid="B12">2009</xref>). However, it is extremely difficult to capture the manifold of high-dimensional image data, causing poor performance on image recognition tasks (Kamnitsas et al., <xref ref-type="bibr" rid="B13">2018</xref>; Luo et al., <xref ref-type="bibr" rid="B21">2018</xref>; Li et al., <xref ref-type="bibr" rid="B18">2020</xref>, <xref ref-type="bibr" rid="B19">2022</xref>; Ren et al., <xref ref-type="bibr" rid="B27">2022</xref>).</p>
<p>Motivated by the above observations, we introduce differential privacy (Dwork and Roth, <xref ref-type="bibr" rid="B6">2014</xref>) and mixup data augmentation (Zhang et al., <xref ref-type="bibr" rid="B34">2018</xref>) into the graph-based SSL method. For both labeled and unlabeled data points, we force the network's predictions to change linearly along the vector from one data point to another, which encourages the intermediate features to change linearly as well. We also employ differential privacy to further boost the consistency of the latent feature space by adding random noise to its latent representation layers. We observe that such regularization results in a more compact and coherent latent feature space and leads to a high-quality graph that captures the data manifold more accurately. Compared with the consistency-based methods (Sajjadi et al., <xref ref-type="bibr" rid="B28">2016</xref>; Tarvainen and Valpola, <xref ref-type="bibr" rid="B30">2017</xref>; Miyato et al., <xref ref-type="bibr" rid="B24">2019</xref>), the proposed method demonstrates superior performance with fewer labels and much faster convergence, meaning it can reach the same performance at a lower computational cost. Compared with previous graph-based SSL methods, the proposed method tends to form a more coherent latent feature space and achieves higher performance.</p>
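As a concrete illustration of the latent-space noise injection mentioned above, here is a minimal numpy sketch; the isotropic Gaussian noise model and the scale `sigma` are hypothetical choices for illustration, not the paper's exact differential-privacy mechanism.

```python
import numpy as np

def noisy_latent(z, sigma=0.1, rng=None):
    """Perturb a batch of latent feature vectors with isotropic Gaussian
    noise, in the spirit of the differential-privacy-style regularization
    of the latent representation layer described in the text.
    `sigma` is a hypothetical noise scale, not a value from the paper."""
    rng = np.random.default_rng(0) if rng is None else rng
    return z + sigma * rng.standard_normal(z.shape)

# Perturb a batch of 4 latent vectors of dimension 128 during training.
z = np.zeros((4, 128))
z_noisy = noisy_latent(z, sigma=0.1)
```

During training, such a perturbation would be applied to the intermediate feature layer on each forward pass, so the network is encouraged to produce representations that remain consistent under small random displacements.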
<p>To summarize, the two key contributions of our study are as follows:
<list list-type="bullet">
<list-item><p>We propose a simple but effective regularization method that can be applied to a graph-based SSL framework to regularize the latent space of the deep neural network during training, so that the graph-based SSL framework works better on high-dimensional image datasets.</p></list-item>
<list-item><p>We experimentally show that the proposed method achieves a significant performance improvement over the previous graph-based SSL method (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>) on standard SSL benchmarks and demonstrates results competitive with other state-of-the-art SSL methods.</p></list-item>
</list></p></sec>
<sec id="s2">
<title>2. Related work</title>
<p>In this section, we roughly categorize recent advances in SSL into generative methods and graph-based methods and briefly introduce each type.</p>
<sec>
<title>2.1. Generative SSL methods</title>
<p>Instead of directly estimating the posterior probability <italic>p</italic>(<italic>y</italic>|<italic>x</italic>), generative methods focus on learning the class-conditional distribution <italic>p</italic>(<italic>x</italic>|<italic>y</italic>) or the joint distribution <italic>p</italic>(<italic>x, y</italic>) &#x0003D; <italic>p</italic>(<italic>x</italic>|<italic>y</italic>)<italic>p</italic>(<italic>y</italic>), from which the posterior probability is computed using Bayes&#x00027; formula. In this framework, SSL can be modeled as a missing-data problem in which the unlabeled data are used to optimize the marginal distribution <inline-formula><mml:math id="M1"><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>Y</mml:mi></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>|</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The joint log-likelihood on both the labeled data set <italic>D</italic><sub><italic>L</italic></sub> and the unlabeled data set <italic>D</italic><sub><italic>U</italic></sub> is naturally taken as the objective function (Chapelle et al., <xref ref-type="bibr" rid="B4">2006</xref>) as follows:
<disp-formula id="E1"><label>(1)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:mo class="qopname">log</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>y</mml:mi></mml:mrow></mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where &#x003C0;<sub><italic>y</italic></sub> &#x0003D; <italic>p</italic>(<italic>y</italic>|&#x003B8;) is the class prior. Previous studies under this framework use an auto-encoder (Rasmus et al., <xref ref-type="bibr" rid="B26">2015</xref>) or a variational auto-encoder (Kingma and Welling, <xref ref-type="bibr" rid="B15">2013</xref>) to model <italic>p</italic>(<italic>x, y</italic>). Unfortunately, since the neural network has excessive representational power, optimizing the marginal distribution does not guarantee recovering the correct joint distribution, which makes these approaches perform less well on large datasets.</p>
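For intuition, the objective in Equation (1) can be evaluated directly on a toy one-dimensional, two-class Gaussian mixture; the means, priors, and data points below are made-up illustrative values, not from the paper.

```python
import numpy as np

def log_gauss(x, mu, sigma=1.0):
    # log N(x; mu, sigma^2) for scalar data
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def ssl_log_likelihood(labeled, unlabeled, mus, priors):
    """Joint log-likelihood of Equation (1) for a 1-D, two-class
    Gaussian mixture; `mus` and `priors` play the role of theta."""
    # Labeled term: sum over (x_i, y_i) of log pi_{y_i} p(x_i | y_i, theta)
    ll = sum(np.log(priors[y]) + log_gauss(x, mus[y]) for x, y in labeled)
    # Unlabeled term: sum over x_i of log of the marginal p(x_i)
    for x in unlabeled:
        ll += np.log(sum(p * np.exp(log_gauss(x, m)) for p, m in zip(priors, mus)))
    return ll

labeled = [(-2.0, 0), (2.1, 1)]
unlabeled = [-1.8, 1.9, 0.1]
ll = ssl_log_likelihood(labeled, unlabeled, mus=[-2.0, 2.0], priors=[0.5, 0.5])
```

Parameters whose class means match the two clusters yield a higher joint log-likelihood than parameters that ignore the cluster structure, which is exactly what maximizing Equation (1) exploits.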
<p>To better estimate <italic>p</italic>(<italic>x</italic>|<italic>y</italic>), the generative adversarial network (GAN) (Goodfellow et al., <xref ref-type="bibr" rid="B9">2014</xref>) has been introduced into the SSL framework. GAN is well-known for generating high-quality realistic images: Radford et al. (<xref ref-type="bibr" rid="B25">2015</xref>) employed fake examples from a conditional GAN as additional training data, and Salimans et al. (<xref ref-type="bibr" rid="B29">2016</xref>) strengthened the discriminator to classify input data as well as distinguish fake examples from real data. In addition, BadGAN (Dai et al., <xref ref-type="bibr" rid="B5">2017</xref>) argued that generating low-quality examples that lie in the low-density areas between classes can better guide the classifier in positioning its decision boundary. However, these approaches are largely empirical and lack theoretical analysis, and they perform weakly compared with more recently emerged approaches.</p></sec>
<sec>
<title>2.2. Graph-based SSL methods</title>
<p>Graph-based methods operate on a weighted graph <italic>G</italic> &#x0003D; (<italic>V, E</italic>) with adjacency matrix <italic>A</italic>, where the vertex set <italic>V</italic> comprises all data samples, denoted as <italic>D</italic> &#x0003D; <italic>D</italic><sub><italic>L</italic></sub> &#x0222A; <italic>D</italic><sub><italic>U</italic></sub>, and the entries <italic>A</italic><sub><italic>ij</italic></sub> of the adjacency matrix are based on the similarities between vertices <italic>x</italic><sub><italic>i</italic></sub>, <italic>x</italic><sub><italic>j</italic></sub> &#x02208; <italic>V</italic>. The smoothness assumption states that close data points should have similar predictions. Label Propagation (LP) iteratively propagates the class posterior of each node to its neighbors, spreading faster through short connections between data nodes, until a global equilibrium is reached. Zhou et al. (<xref ref-type="bibr" rid="B35">2003</xref>) showed that the same solution is obtained by enforcing the smoothness term, or equivalently, by minimizing the following energy on the graph:
<disp-formula id="E2"><label>(2)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mtext>f</mml:mtext></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mtext>f</mml:mtext></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mtext>f</mml:mtext></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mtext>f</mml:mtext></mml:mrow><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:msup><mml:mo>&#x00394;</mml:mo><mml:mtext>f</mml:mtext></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Here, f<sub><italic>i</italic></sub> is the pseudo label of the <italic>i</italic>-th data sample, and &#x00394; &#x0003D; <italic>D</italic> &#x02212; <italic>A</italic> is the traditional graph Laplacian, where <italic>D</italic> is a diagonal matrix with <inline-formula><mml:math id="M4"><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. If f is replaced with f(X), the output of a parameterized function on all data samples, then R(f) becomes a graph Laplacian regularizer that forces the function f to be harmonic. Several modifications of Equation (2) have been proposed to mitigate the effect of outliers in the graph (Gong et al., <xref ref-type="bibr" rid="B8">2015</xref>; Tu et al., <xref ref-type="bibr" rid="B31">2015</xref>). An inevitable drawback of these approaches is that their performance relies heavily on the quality of the input graph.</p>
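As a minimal illustration of the diffusion behind Equation (2), the following numpy sketch implements label propagation in the closed form popularized by Zhou et al. (2003); note that this variant uses the symmetrically normalized adjacency S = D<sup>-1/2</sup>AD<sup>-1/2</sup> rather than the unnormalized Laplacian of Equation (2), and the value of &#x003B1; and the toy two-cluster graph are illustrative choices, not settings from this paper.

```python
import numpy as np

def label_propagation(A, Y, alpha=0.99):
    """Closed-form label propagation in the style of Zhou et al. (2003).

    A : (n, n) symmetric adjacency matrix of the data graph.
    Y : (n, c) one-hot labels, with all-zero rows for unlabeled points.
    Returns the diffused class scores F = (I - alpha * S)^-1 Y,
    where S = D^-1/2 A D^-1/2 is the normalized adjacency.
    """
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ A @ D_inv_sqrt
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, Y)

# Two clusters {0,1,2} and {3,4,5}; only nodes 0 and 3 carry a label.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
Y = np.zeros((6, 2))
Y[0, 0] = 1.0  # node 0 -> class 0
Y[3, 1] = 1.0  # node 3 -> class 1
F = label_propagation(A, Y)
pseudo = F.argmax(axis=1)  # pseudo-labels for every node
```

In this toy graph, the two labels diffuse within their respective clusters, so every unlabeled node receives the label of the labeled node it is connected to.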
<p>Our study is inspired by a series of recent methods that utilize LP with a dynamically constructed graph during optimization. Acting like EM algorithms, these methods alternate between two steps: first, the embeddings obtained by the deep neural network are used to construct a nearest-neighbor graph, and LP is performed on this graph to infer pseudo-labels for the unlabeled images; then, the network is trained using both labeled and pseudo-labeled data. Apart from the time spent on LP, this scheme uses only standard back-propagation to train the deep neural network, making it fast and efficient. Luo et al. (<xref ref-type="bibr" rid="B21">2018</xref>) used the graph constructed from embeddings obtained by the teacher model, which performs better than the student model. Kamnitsas et al. (<xref ref-type="bibr" rid="B13">2018</xref>) utilized LP to infer clusters in the network's intermediate representation and then encouraged points in the same cluster to be closer. Iscen et al. (<xref ref-type="bibr" rid="B11">2019</xref>) introduced entropy as an uncertainty measure for pseudo-labels and down-weighted the cost of uncertain examples. All of these approaches construct the graph actively during optimization to improve the quality of the pseudo-labels.</p></sec></sec>
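The graph-construction step these methods share can be sketched as follows; the use of cosine similarity, k = 3, and non-negative weight clipping are assumptions chosen for illustration rather than details taken from any one of the cited papers.

```python
import numpy as np

def knn_graph(embeddings, k=3):
    """Build an affinity matrix for label propagation from network
    embeddings: keep each point's k nearest neighbors by cosine
    similarity, clip negative weights to zero, and symmetrize.
    The choice of k and of raw cosine weights is illustrative only."""
    Z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = Z @ Z.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops from the neighbor search
    A = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        nn = np.argsort(sim[i])[-k:]          # indices of the k nearest neighbors
        A[i, nn] = np.maximum(sim[i, nn], 0)  # keep non-negative weights
    return np.maximum(A, A.T)                 # symmetrize the adjacency

rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 16))  # stand-in for network embeddings
A = knn_graph(Z, k=3)
```

Rebuilding this graph from fresh embeddings at each outer iteration is what makes the alternating scheme EM-like: better embeddings yield a better graph, which yields better pseudo-labels for the next round of training.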
<sec id="s3">
<title>3. The proposed method</title>
<p>The common challenge of graph-based methods is the need for a well-behaved graph that captures the geometric manifold of the input data. In the following, we formalize our approach and emphasize our efforts to improve the quality of the graph. The key motivation of our method is to form a better network representation for better graph construction.</p>
<sec>
<title>3.1. Problem formulation</title>
<p>We assume that the input space is <inline-formula><mml:math id="M5"><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow><mml:mo>&#x02286;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. We have a collection of <italic>l</italic> labeled samples <italic>X</italic><sub><italic>L</italic></sub>: &#x0003D; {<italic>x</italic><sub>1</sub>, ..., <italic>x</italic><sub><italic>l</italic></sub>} with <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow></mml:math></inline-formula>, their labels are given by <italic>Y</italic><sub><italic>L</italic></sub> &#x0003D; {<italic>y</italic><sub>1</sub>, ...., <italic>y</italic><sub><italic>l</italic></sub>} with <italic>y</italic><sub><italic>i</italic></sub> &#x02208; <italic>C</italic>, where <italic>C</italic> &#x0003D; {1, ..., <italic>c</italic>} is the set of discrete labels for <italic>c</italic> classes. In addition, <italic>u</italic> extra samples <italic>X</italic><sub><italic>U</italic></sub> &#x0003D; {<italic>x</italic><sub><italic>l</italic>&#x0002B;1</sub>, ..., <italic>x</italic><sub><italic>l</italic>&#x0002B;<italic>u</italic></sub>} are given without any label information. The whole set of samples is denoted as <italic>D</italic> &#x0003D; <italic>X</italic><sub><italic>L</italic></sub> &#x0222A; <italic>X</italic><sub><italic>U</italic></sub>. 
The transductive goal in SSL is to find the possible label set &#x00176; for all unlabeled samples, while the inductive goal is to find a classifier <inline-formula><mml:math id="M7"><mml:mi>f</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow><mml:mo>&#x021A6;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> which can generalize well on unseen samples by utilizing all samples <italic>D</italic> and label <italic>Y</italic><sub><italic>L</italic></sub>. In this study, we focus on the inductive settings and use the convolutional neural network (CNN) as the classifier in our experiments.</p></sec>
<sec>
<title>3.2. Overview</title>
<p>Given a randomly initialized neural network <italic>f</italic> parameterized by &#x003B8;, we introduce a new optimization process for SSL that can be summarized as follows. First, we pre-train the network using only labeled data to warm it up, introducing mixup data augmentation. Then, we start the iterative SSL training process: we perform label propagation to infer pseudo-labels and optimize the network with both labeled and pseudo-labeled data. We point out two critical improvements over previous approaches: mixup regularization with pseudo-labeled data and deformed graph Laplacian-based label propagation. In addition, we incorporate the pseudo-label certainty and class-balancing strategies from the study by Iscen et al. (<xref ref-type="bibr" rid="B11">2019</xref>) into our approach. A graphical view of the proposed approach is shown in <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
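The stages just described can be summarized in the following training skeleton; every hook (`pretrain`, `extract_features`, `propagate_labels`, `fit`) is a hypothetical stand-in defined trivially so the control flow runs end to end, not the authors' implementation.

```python
def train_ssl(model, labeled, unlabeled, outer_iters=3):
    """Skeleton of the proposed optimization process: supervised mixup
    pre-training followed by alternating label propagation and
    full-dataset training, repeated T' (here, outer_iters) times.
    All hooks below are hypothetical stand-ins."""
    log = []
    model = pretrain(model, labeled)                 # warm-up on labels only
    for t in range(outer_iters):                     # repeated T' times
        feats = extract_features(model, labeled + unlabeled)
        pseudo = propagate_labels(feats, labeled)    # graph construction + LP
        model = fit(model, labeled, pseudo)          # train on labels + pseudo-labels
        log.append(t)
    return model, log

# Trivial stand-ins so the skeleton is runnable.
pretrain = lambda m, L: m
extract_features = lambda m, D: [0.0] * len(D)
propagate_labels = lambda f, L: [0] * len(f)
fit = lambda m, L, P: m
model, log = train_ssl("net", ["x1"], ["x2", "x3"])
```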
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Overview of the proposed method. The colored points are the t-SNE visualization of the feature vectors extracted from the Cifar10 training data by the same deep neural network at different training stages (we squeeze the 128-D feature vectors into 2-D plane coordinates using t-SNE). Starting from a randomly initialized network, we first train it with labeled data (<italic>l</italic> &#x0003D; 1,000) only and differential privacy to form a primitive network representation. Then, we perform label propagation and train the network on the entire dataset, with both handcrafted labels and pseudo-labels. We repeat this process <italic>T</italic>&#x02032; times until convergence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-05-1114186-g0001.tif"/>
</fig></sec>
<sec>
<title>3.3. Supervised mixup pre-training</title>
<p>In the early stage of the training process, the neural network <italic>f</italic> consists of randomly initialized weight parameters, so its output is chaotic and carries little semantic information about the input images. In previous studies, the network is pre-trained on the labeled samples by minimizing the supervised cost only, using standard optimization techniques. The optimization target is the expectation of the loss function &#x02113; over the labeled data distribution <italic>P</italic><sub><italic>D</italic></sub> as follows:
<disp-formula id="E3"><label>(3)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x02113;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:mfrac><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:munderover><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is a point distribution that is employed to estimate the true data distribution <italic>P</italic><sub><italic>D</italic></sub> when labeled data set <italic>X</italic><sub><italic>L</italic></sub> and correspondent label <italic>Y</italic><sub><italic>L</italic></sub> are given. With fewer labeled data in the SSL setting, the point distribution is hardly able to estimate the true data distribution <italic>P</italic><sub><italic>D</italic></sub>, causing the model overfitting and degeneration. Previous studies take measures to mitigate this problem, using a larger learning rate and fewer pre-training epochs (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>). Other regularization techniques such as dropout and weight decay are adopted by Luo et al. (<xref ref-type="bibr" rid="B21">2018</xref>) and Miyato et al. (<xref ref-type="bibr" rid="B24">2019</xref>).</p>
<p>In this study, we introduce mixup (Zhang et al., <xref ref-type="bibr" rid="B34">2018</xref>) in the pre-training procedure. Instead of a simple point distribution, mixup proposes a vicinal distribution to estimate <italic>P</italic><sub><italic>D</italic></sub>, whose generative process is summarized as follows:
<disp-formula id="E4"><label>(4)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x003BB;</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mi>B</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>: =</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>y</mml:mi><mml:mo>: =</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where &#x003B1; &#x02208; (0, &#x0221E;) is a hyperparameter that controls the strength of interpolation between data pairs. We denote the resulting mixup distribution as <italic>D</italic><sub><italic>M</italic></sub>(<italic>D</italic><sub><italic>L</italic></sub>, &#x003B1;); notably, <italic>D</italic><sub><italic>M</italic></sub> degrades to <italic>D</italic><sub><italic>L</italic></sub> as &#x003B1; &#x02192; 0. Replacing <italic>D</italic><sub><italic>L</italic></sub> with the mixup distribution in Equation (3) yields:
<disp-formula id="E5"><label>(5)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x02113;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
In a nutshell, we simply sample two data points at random each time and perform standard supervised training on the mixed data point. Notably, the interpolation between <italic>y</italic><sub><italic>i</italic></sub> and <italic>y</italic><sub><italic>j</italic></sub> is an interpolation between two one-hot encoded probability vectors. This modification does not alter the optimization process, so any optimizer and network architecture can be used with this regularization.</p>
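As an illustration, the generative process of Equation (4) can be sketched in a few lines of NumPy; the function name, toy data, and default hyperparameters below are our own illustrative choices, not part of the paper's implementation.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.75, rng=np.random.default_rng(0)):
    """One mixup draw (Equation 4): pair each example with a randomly
    permuted partner and convexly interpolate inputs and one-hot labels."""
    lam = rng.beta(alpha, alpha)           # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))         # random partners (x_j, y_j)
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix

# toy usage: 4 examples with 2 features, 3 classes (one-hot labels)
x = np.arange(8.0).reshape(4, 2)
y = np.eye(3)[[0, 1, 2, 0]]
x_mix, y_mix = mixup_batch(x, y)
```

Because &#x003BB; lies in (0, 1), each mixed label row remains a valid probability vector, which is why any standard loss and optimizer can consume it unchanged.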
<p>We summarize the benefits of introducing mixup as twofold:
<list list-type="order">
<list-item><p>The overfitting problem is greatly alleviated. Because the output posterior probability of the classifier is forced to transition linearly from class to class, the decision boundaries between classes are pushed into the intermediate area, reducing the number of undesirable predictions outside the training examples. The experiments will demonstrate that mixup regularization reduces the test error of the pre-trained network.</p></list-item>
<list-item><p>The internal representation of the classifier, like the output, is encouraged to transition linearly, yielding abstract representations in a smooth and coherent feature space; more accurate label propagation can then be performed based on sample distances in that feature space.</p></list-item>
</list></p></sec>
<sec>
<title>3.4. Label propagation via nearest neighbor graph</title>
<p>Given a classification network <italic>f</italic> with pre-trained parameters &#x003B8;, we describe how to infer labels for the unlabeled examples; these pseudo-labels are employed to guide the SSL optimization process in the following subsection. First, we construct a graph based on the network's internal representations of all examples, and then we apply label propagation on the graph to infer pseudo-labels.</p>
<p>In most cases, a deep neural network can be seen as a sequence of non-linear layers or transformations, each giving an internal representation of the input. While the low-level representations capture more details of the input image, the high-level representations contain more semantic information, and the last layer maps the input feature into class probabilities. Formally, network <italic>f</italic> can be decomposed as <italic>f</italic> &#x0003D; <italic>g</italic> &#x025E6; <italic>h</italic>, where <inline-formula><mml:math id="M12"><mml:mi>h</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow><mml:mo>&#x021A6;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is a feature extractor and <italic>g</italic> : &#x0211D;<sup><italic>d</italic></sup> &#x021A6; &#x0211D;<sup><italic>c</italic></sup> is the output layer, usually a fully-connected layer with softmax. We take <italic>h</italic> as a low-dimensional feature extractor and denote the feature vector of the <italic>i</italic>-th example as <italic>v</italic><sub><italic>i</italic></sub>: &#x0003D; <italic>h</italic>(<italic>x</italic><sub><italic>i</italic></sub>). We extract the feature set of <italic>D</italic> as <italic>V</italic> &#x0003D; {<italic>v</italic><sub>1</sub>, ..., <italic>v</italic><sub><italic>l</italic></sub>, <italic>v</italic><sub><italic>l</italic>&#x0002B;1</sub>, ..., <italic>v</italic><sub><italic>n</italic></sub>} for similarity computation.</p>
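The decomposition <italic>f</italic> = <italic>g</italic> &#x025E6; <italic>h</italic> can be made concrete with a toy fully-connected network; the weights, dimensions, and function names below are illustrative stand-ins and not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 4))   # toy feature-extractor weights (d = 4)
W2 = rng.normal(size=(4, 3))   # toy output-layer weights (c = 3)

def h(x):
    """Feature extractor h: X -> R^d."""
    return np.tanh(x @ W1)

def g(v):
    """Output layer g: R^d -> R^c, a linear map followed by softmax."""
    logits = v @ W2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def f(x):
    """Full classifier f = g o h."""
    return g(h(x))

X = rng.normal(size=(6, 5))
V = h(X)                       # feature vectors v_i used for the graph
```

The graph in the next step is built on the intermediate features `V`, not on the class probabilities `f(X)`.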
<p>The next step is to construct a graph on the <italic>n</italic> data nodes. Since computing the full <italic>n</italic> &#x000D7; <italic>n</italic> affinity matrix <italic>A</italic> is intractable for large <italic>n</italic>, we approximate it by constructing a nearest neighbor graph, counting only the similarity between nodes and their <italic>k</italic> nearest neighbors. Thus, we create a graph with a sparse affinity matrix <italic>A</italic> &#x02208; &#x0211D;<sup><italic>n</italic> &#x000D7; <italic>n</italic></sup> whose elements are:
<disp-formula id="E6"><label>(6)</label><mml:math id="M13"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>: =</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mtext class="textrm" mathvariant="normal">if</mml:mtext><mml:mi>i</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02227;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">NN</mml:mtext></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mtd><mml:mtd><mml:mtext class="textrm" mathvariant="normal">otherwise</mml:mtext></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where NN<sub><italic>k</italic></sub>(<italic>v</italic><sub><italic>i</italic></sub>) denotes the set of <italic>k</italic> nearest neighbors of <italic>v</italic><sub><italic>i</italic></sub> in <italic>D</italic>, and <italic>s</italic> is the similarity function. The choice of <italic>s</italic> is quite flexible. Since we need to approximate the semantic similarity of two instances, we adopt the Gaussian similarity function <inline-formula><mml:math id="M14"><mml:mi>s</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mo>-</mml:mo><mml:mo>&#x02225;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>/</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> with hyperparameter &#x003C3;. Notably, approximate nearest neighbor (ANN) algorithms can be applied to accelerate the graph construction for large <italic>n</italic>.</p>
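A minimal brute-force sketch of the graph construction in Equation (6), assuming a dense array suffices at toy scale; at the paper's scale a sparse matrix and an ANN index would replace the exhaustive search.

```python
import numpy as np

def knn_affinity(V, k=3, sigma=1.0):
    """Affinity matrix of Equation (6): a_ij = s(v_i, v_j) when v_i is
    among the k nearest neighbors of v_j, and zero otherwise."""
    n = len(V)
    # pairwise squared Euclidean distances between feature vectors
    d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    A = np.zeros((n, n))
    for j in range(n):
        order = [i for i in np.argsort(d2[:, j]) if i != j]
        for i in order[:k]:                            # NN_k(v_j)
            A[i, j] = np.exp(-d2[i, j] / sigma ** 2)   # Gaussian similarity
    return A

V = np.random.default_rng(0).normal(size=(8, 4))
A = knn_affinity(V, k=3)
W = A + A.T            # symmetric affinity W := A + A^T, used next
```

Each column of `A` has exactly `k` nonzero entries, so the matrix stays sparse in spirit even though a dense array is used here for clarity.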
<p>Hereafter, let <italic>W</italic>: &#x0003D; <italic>A</italic> &#x0002B; <italic>A</italic><sup>T</sup> be the symmetric affinity matrix and <inline-formula><mml:math id="M15"><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mi>W</mml:mi><mml:msup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula> its normalized counterpart, where <italic>D</italic> is the diagonal degree matrix whose elements are defined by <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>: =</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>. Furthermore, the volume of the graph is <inline-formula><mml:math id="M17"><mml:mi>v</mml:mi><mml:mo>=</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>.</p>
<p>After defining these quantities, we describe our LP algorithm in terms of two <italic>n</italic> &#x000D7; <italic>c</italic> matrices <italic>Y</italic> and <italic>Z</italic>. <italic>Y</italic> is the matrix of given labels with rows {<italic>y</italic><sub>1</sub>, ..., <italic>y</italic><sub><italic>l</italic></sub>, <italic>y</italic><sub><italic>l</italic>&#x0002B;1</sub>, ..., <italic>y</italic><sub><italic>n</italic></sub>}, where the first <italic>l</italic> rows are the one-hot encoded labels of the labeled examples and the rest are zero vectors. <italic>Z</italic> holds the desired class posterior probabilities, obtained by minimizing the following cost function:
<disp-formula id="E7"><label>(7)</label><mml:math id="M18"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>Z</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mi mathvariant="script">Q</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02225;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></
mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mrow><mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02225;</mml:mo><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x02225;</mml:mo><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:msup><mml:mo>&#x00394;</mml:mo><mml:mi>Z</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mtext>T</mml:mtext></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi><mml:mo>/</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mrow><mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mo>&#x02225;</mml:mo><mml:mi>Z</mml:mi><mml:mo>-</mml:mo><mml:mi>Y</mml:mi><mml:msubsup><mml:mrow><mml:mo>&#x02225;</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Here, <italic>z</italic><sub><italic>i</italic></sub> is the <italic>i</italic>-th row of matrix <italic>Z</italic>, <inline-formula><mml:math id="M19"><mml:mo>&#x00394;</mml:mo><mml:mo>=</mml:mo><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow></mml:math></inline-formula> is the normalized graph Laplacian, and &#x02225;&#x000B7;&#x02225;<sub><italic>F</italic></sub> is the Frobenius norm. The first term encourages smoothness, in that similar examples tend to induce the same predictions, and the last term attempts to maintain the predictions for labeled examples (Zhou et al., <xref ref-type="bibr" rid="B35">2003</xref>). In addition, outliers, indicated by a lower degree <italic>d</italic><sub><italic>ii</italic></sub>, are forced to have weak labels by the second term. The degrees of smoothness and outlier weakness are controlled by the weight parameters &#x003B2; and &#x003B3;, respectively. To find the optimal <italic>Z</italic>, we set the derivative of Equation (7) with respect to <italic>Z</italic> to zero and obtain:
<disp-formula id="E8"><label>(8)</label><mml:math id="M20"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B2;</mml:mi><mml:mo>&#x00394;</mml:mo><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi><mml:mo>/</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:mi>Y</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Thus, the optimal <italic>Z</italic> is defined as follows:
<disp-formula id="E9"><label>(9)</label><mml:math id="M21"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>&#x00394;</mml:mo><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi><mml:mo>/</mml:mo><mml:mi>v</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mi>Y</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Let <inline-formula><mml:math id="M22"><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> and ignore the constant scaling factor in Equation (9). We obtain:
<disp-formula id="E10"><label>(10)</label><mml:math id="M23"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Z</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mi>D</mml:mi><mml:msup><mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow><mml:mi>Y</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
Directly computing <italic>Z</italic><sup>&#x0002A;</sup> by Equation (10) is often intractable for large <italic>n</italic> because the inverse matrix <inline-formula><mml:math id="M24"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mi>D</mml:mi><mml:msup><mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math></inline-formula> is not sparse. Instead, we use the conjugate gradient (CG) method to solve the linear system:
<disp-formula id="E11"><label>(11)</label><mml:math id="M25"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mi>D</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mi>Z</mml:mi><mml:mo>=</mml:mo><mml:mi>Y</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
This method is applicable because <inline-formula><mml:math id="M26"><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>I</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="script">W</mml:mi></mml:mrow><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mi>D</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:math></inline-formula> is a positive-definite matrix. The CG method has been adopted in many LP applications (Zhou et al., <xref ref-type="bibr" rid="B35">2003</xref>; Gong et al., <xref ref-type="bibr" rid="B8">2015</xref>; Tu et al., <xref ref-type="bibr" rid="B31">2015</xref>; Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>). Finally, the pseudo-label for an unlabeled example is given as follows:
<disp-formula id="E12"><label>(12)</label><mml:math id="M27"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
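The pipeline of Equations (10) to (12) can be sketched as follows; the hand-rolled CG routine, the toy chain graph, and the parameter defaults are our own illustrative choices, shown only to make the linear system concrete.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Plain CG for a symmetric positive-definite system A x = b."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def label_propagation(W, Y, beta=1.0, gamma=0.1):
    """Solve (I - k1*W_norm - k2*D) Z = Y column by column (Equation 11),
    with k1 = beta/(1+beta+gamma) and k2 = gamma/((1+beta+gamma)*v)."""
    n = W.shape[0]
    d = W.sum(axis=1)                        # degrees D_ii
    W_norm = W / np.sqrt(np.outer(d, d))     # D^{-1/2} W D^{-1/2}
    v = d.sum()                              # graph volume
    k1 = beta / (1.0 + beta + gamma)
    k2 = gamma / ((1.0 + beta + gamma) * v)
    M = np.eye(n) - k1 * W_norm - k2 * np.diag(d)
    return np.column_stack([conjugate_gradient(M, Y[:, c])
                            for c in range(Y.shape[1])])

# toy chain graph 0-1-2-3-4-5 with the two endpoints labeled (2 classes)
W = np.zeros((6, 6))
for i in range(5):
    W[i, i + 1] = W[i + 1, i] = 1.0
Y = np.zeros((6, 2)); Y[0, 0] = 1.0; Y[5, 1] = 1.0
Z = label_propagation(W, Y)
pseudo = Z.argmax(axis=1)     # Equation (12): hard pseudo-labels
```

On this chain, the propagated scores decay with graph distance from each labeled endpoint, so nodes nearer node 0 receive class 0 and nodes nearer node 5 receive class 1.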
Equation (12) is a hard assignment that selects the most confident class for each example; however, the contrast between classes reflects the certainty of each example. Following Iscen et al. (<xref ref-type="bibr" rid="B11">2019</xref>), we associate a measure of confidence with each unlabeled example by calculating the entropy of <italic>Z</italic>:
<disp-formula id="E13"><label>(13)</label><mml:math id="M28"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>: =</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x01E91;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where <inline-formula><mml:math id="M29"><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> is the normalized counterpart of <italic>z</italic><sub><italic>i</italic></sub>, in other words, <inline-formula><mml:math id="M30"><mml:msub><mml:mrow><mml:mi>&#x01E91;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>/</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and <italic>H</italic>: &#x0211D;<sup><italic>c</italic></sup> &#x021A6; &#x0211D; is the entropy function.</p></sec>
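The confidence measure of Equation (13) is straightforward to compute; the helper name below is our own, and clamping to nonnegative values is our assumption to guard against tiny negative entries from the CG solve.

```python
import numpy as np

def certainty_weights(Z):
    """Equation (13): normalize each row of Z to a distribution z_hat_i,
    then w_i = 1 - H(z_hat_i) / log(c)."""
    Z = np.maximum(Z, 0.0)                       # guard against tiny negatives
    z_hat = Z / Z.sum(axis=1, keepdims=True)
    c = Z.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        logs = np.where(z_hat > 0, np.log(np.where(z_hat > 0, z_hat, 1.0)), 0.0)
    H = -(z_hat * logs).sum(axis=1)              # row-wise entropy
    return 1.0 - H / np.log(c)

Z = np.array([[0.9, 0.1],    # fairly confident row
              [0.5, 0.5],    # maximally uncertain row
              [1.0, 0.0]])   # fully confident row
w = certainty_weights(Z)
```

A uniform row yields a weight of 0 and a one-hot row yields 1, so confidently propagated examples dominate the weighted unsupervised loss that follows.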
<sec>
<title>3.5. Mixup regularization with differential privacy</title>
<p>Given the pseudo-labels and confidence measures of all unlabeled data, we associate them with each example and denote the point distribution of unlabeled data as <inline-formula><mml:math id="M31"><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mfrac><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mi>l</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>l</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>u</mml:mi></mml:mrow></mml:munderover><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>y</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x00177;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>w</mml:mi><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. To take the confidence coefficient into account, we propose the mixup distribution of unlabeled data, <italic>D</italic><sub><italic>MU</italic></sub>, whose generative process is summarized as follows:
<disp-formula id="E14"><label>(14)</label><mml:math id="M32"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x003BB;</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mi>B</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>x</mml:mi><mml:mo>: =</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x00177;</mml:mi><mml:mo>: =</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>w</mml:mi><mml:mo>: =</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
In this generative process, the input data are interpolated at a random ratio between two examples, and the corresponding pseudo-labels and confidence scores are interpolated in the same proportion.</p>
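The generative process of Equation (14) can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation; the uniform random pair sampling shown here is an assumption of the sketch (in practice, pairs are drawn within a mini-batch, as described in Section 3.6).

```python
import numpy as np

def mixup_unlabeled(X, Y_pseudo, w, alpha=1.0, rng=None):
    """Sample one mixed example (x, y, w) following Equation (14).

    X        : (u, d) array of unlabeled inputs
    Y_pseudo : (u, c) array of one-hot (or soft) pseudo-labels
    w        : (u,) array of confidence scores
    alpha    : Beta distribution parameter for the mixing ratio
    """
    rng = rng or np.random.default_rng()
    i, j = rng.integers(0, len(X), size=2)           # draw a random pair
    lam = rng.beta(alpha, alpha)                      # lambda ~ Beta(alpha, alpha)
    x = lam * X[i] + (1 - lam) * X[j]                 # interpolate inputs
    y = lam * Y_pseudo[i] + (1 - lam) * Y_pseudo[j]   # interpolate pseudo-labels
    wm = lam * w[i] + (1 - lam) * w[j]                # interpolate confidences
    return x, y, wm
```

Because the same λ interpolates the input, the pseudo-label, and the confidence, the mixed target remains a valid probability vector whenever the pseudo-labels are.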
<p>The abundant unlabeled data are used in the training process by minimizing the following cost along with labeled data.
<disp-formula id="E15"><label>(15)</label><mml:math id="M33"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder><mml:mrow><mml:mo>&#x1D53C;</mml:mo></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi><mml:mo>,</mml:mo><mml:mi>&#x00177;</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>M</mml:mi><mml:mi>U</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>w</mml:mi><mml:mi>&#x02113;</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>&#x00177;</mml:mi></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
While the interpolation of labeled data helps the network form better representations, interpolating unlabeled data in the unsupervised term brings two benefits: (1) Decision boundaries are pushed away from unlabeled data, a desired property under the low-density separation assumption: the model is forced to make neutral predictions in the region between different samples, that is, between different clusters. (2) Each cluster in the hidden space is encouraged to carry a single pseudo-label class. If the points of one cluster are pseudo-labeled with two different classes, the mixup loss term tears the cluster apart: its interior points are simultaneously pushed toward neutral predictions, as interpolations of edge points, and toward confident predictions, as points in the middle of the cluster.</p>
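The inner term of Equation (15), the confidence-weighted loss w·ℓ(f<sub>θ</sub>(x), ŷ) with a soft mixed target, can be sketched as follows; the choice of cross-entropy for ℓ and the batch layout are assumptions of this sketch.

```python
import numpy as np

def weighted_soft_cross_entropy(logits, soft_targets, weights, eps=1e-12):
    """Mean over the batch of w_i * CE(softmax(logits_i), y_hat_i).

    logits       : (n, c) raw network outputs f_theta(x)
    soft_targets : (n, c) mixed pseudo-labels y-hat (rows sum to 1)
    weights      : (n,) mixed confidence scores w
    """
    # numerically stable log-softmax
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True) + eps)
    ce = -(soft_targets * log_probs).sum(axis=1)   # per-example cross-entropy
    return float((weights * ce).mean())            # confidence-weighted average
```

Low-confidence pseudo-labels thus contribute proportionally less to the unsupervised risk, which is how the confidence measure tempers noisy propagated labels.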
<p>Moreover, we employ differential privacy by directly adding noise to the latent representation of the deep neural network during training:
<disp-formula id="E16"><label>(16)</label><mml:math id="M34"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where &#x003F5; is randomly sampled from <inline-formula><mml:math id="M35"><mml:mrow><mml:mi mathvariant="script">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. This procedure is inspired by the Ladder network (Rasmus et al., <xref ref-type="bibr" rid="B26">2015</xref>). Adding noise to the latent representation makes the neural network more resistant to dataset bias and encourages a more coherent latent space.</p>
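Equation (16) amounts to injecting Gaussian noise between the feature extractor h and the classifier head g during training. A minimal sketch follows; the concrete h and g here are placeholders, not the paper's networks.

```python
import numpy as np

def noisy_forward(x, h, g, sigma=0.1, rng=None):
    """f*_theta(x) = g(h(x) + eps), with eps ~ N(0, sigma^2 I) (Equation 16).

    h : feature extractor mapping input to a latent representation
    g : classifier head mapping the (perturbed) latent to an output
    """
    rng = rng or np.random.default_rng()
    z = h(x)                                    # latent representation h(x)
    eps = rng.normal(0.0, sigma, size=z.shape)  # Gaussian perturbation
    return g(z + eps)
```

With σ = 0 the perturbation vanishes and the function reduces to the ordinary forward pass g(h(x)).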
<p>Finally, we finetune the network by minimizing the following objective function using both labeled and unlabeled data:
<disp-formula id="E17"><label>(17)</label><mml:math id="M36"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>L</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003BB;</mml:mi><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>u</mml:mi><mml:mi>n</mml:mi><mml:mi>s</mml:mi><mml:mi>u</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>U</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
where &#x003BB; is a coefficient that controls the effects of the unsupervised term.</p></sec>
<sec>
<title>3.6. Iterative training</title>
<p>We summarize our approach with the above definitions. Given a convolutional neural network <italic>f</italic> with randomly initialized weights &#x003B8;, we begin by training the network with mixup regularization for <italic>T</italic> epochs using the supervised loss term (Equation 5); then we start the following iterative process. First, we extract the feature vector set <italic>V</italic> on the entire training set <italic>X</italic> and construct a nearest neighbor graph by computing the adjacency matrix via Equation (6). Second, we perform label propagation by solving the linear system (Equation 11) and assign pseudo-labels and confidence scores via Equations (12) and (13). Finally, we train the network for one epoch by minimizing the cost (Equation 17) on both the labeled and unlabeled datasets. This iterative process is repeated for <italic>T</italic>&#x02032; epochs.</p>
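The label propagation step of this loop can be sketched as follows. This dense NumPy version assumes the standard diffusion system (I − αS)Z = Y with S = D<sup>−1/2</sup>WD<sup>−1/2</sup>, which should be checked against Equations (6) and (11); the hand-rolled conjugate gradient stands in for the library solver, and the small graph sizes of this sketch ignore the sparse k-NN construction used in practice.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=200):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = float(r @ r)
    if rs ** 0.5 < tol:
        return x
    for _ in range(max_iter):
        Ap = A @ p
        step = rs / float(p @ Ap)
        x = x + step * p
        r = r - step * Ap
        rs_new = float(r @ r)
        if rs_new ** 0.5 < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def label_propagation(W, Y, alpha=0.99):
    """Diffuse labels Y over graph W by solving (I - alpha*S) Z = Y,
    where S = D^{-1/2} W D^{-1/2} is the normalized affinity matrix."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * np.outer(d_inv_sqrt, d_inv_sqrt)
    A = np.eye(len(W)) - alpha * S
    return np.column_stack(
        [conjugate_gradient(A, Y[:, c]) for c in range(Y.shape[1])])

def pseudo_labels_and_confidence(Z):
    """y_hat_i = argmax_j z_ij and w_i = 1 - H(z_hat_i)/log(c), where
    z_hat_i is the row-normalized score vector (Equations 12 and 13)."""
    Zc = np.maximum(Z, 1e-12)
    P = Zc / Zc.sum(axis=1, keepdims=True)   # row-normalize z_i
    entropy = -(P * np.log(P)).sum(axis=1)   # entropy per row
    c = Z.shape[1]
    return P.argmax(axis=1), 1.0 - entropy / np.log(c)
```

On a toy graph of two disconnected pairs with one seed label per pair, the diffusion assigns each unlabeled node the class of its neighbor with near-maximal confidence.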
<p>The whole training process is summarized in <xref ref-type="table" rid="T6">Algorithm 1</xref>, where the procedure <italic>Optimize</italic>() refers to the mini-batch optimization of the given loss term for one epoch. In our experiments, we randomly sample a mini-batch of data and perform mixup interpolation within this mini-batch; this strategy reduces I/O consumption and was reported to cause no harm to the results by Zhang et al. (<xref ref-type="bibr" rid="B34">2018</xref>). The procedure <italic>NearestNeighborGraph</italic>() refers to the construction of the nearest neighbor graph based on the feature vector set <italic>V</italic> and the computation of edge values in the graph.</p>
<table-wrap position="float" id="T6">
<label>Algorithm 1</label>
<caption><p>Mini-batch training with LP for SSL.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr><td align="left" valign="top"><monospace>&#x003B8; &#x02190; initialize randomly;</monospace></td></tr>
<tr><td align="left" valign="top"><monospace><bold>for</bold> epoch &#x02208;[1, ..., <italic>T</italic>] <bold>do</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;&#x003B8; &#x02190; <italic>Optimize</italic>(<italic>R</italic><sub><italic>S</italic></sub>(<italic>X</italic><sub><italic>L</italic></sub>, <italic>Y</italic><sub><italic>L</italic></sub>, &#x003B8;));</monospace> </td></tr>
<tr><td align="left" valign="top"><monospace><bold>end for</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace><bold>for</bold> epoch &#x02208;[1, ..., <italic>T</italic>&#x02032;] <bold>do</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<bold>for</bold> <italic>i</italic> &#x02208; 1, ..., <italic>n</italic> <bold>do</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;&#x000A0;&#x000A0;<italic>v</italic><sub><italic>i</italic></sub> &#x02190; <italic>h</italic><sub>&#x003B8;</sub>(<italic>x</italic><sub><italic>i</italic></sub>);</monospace> </td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<bold>end for</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<italic>A</italic> &#x02190; <italic>NearestNeighborGraph</italic>(<italic>V</italic>) ;</monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<italic>W</italic> &#x02190; <italic>A</italic> &#x0002B; <italic>A</italic><sup>T</sup>;</monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;&#x00394; &#x02190; <italic>I</italic> &#x02212; <italic>D</italic><sup>&#x02212;1/2</sup><italic>WD</italic><sup>&#x02212;1/2</sup>;</monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<italic>Z</italic> &#x02190; solve with CG;</monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<bold>for</bold> <italic>i</italic> &#x02208; 1, ..., <italic>n</italic> <bold>do</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;&#x000A0;&#x000A0;<inline-formula><mml:math id="M37"><mml:msub><mml:mrow><mml:mi>&#x00177;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:mo class="qopname">arg</mml:mo><mml:msub><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>;</monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;&#x000A0;&#x000A0;<inline-formula><mml:math id="M38"><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02190;</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>H</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>z</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>l</mml:mi><mml:mi>o</mml:mi><mml:mi>g</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>;</monospace> </td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;<bold>end for</bold></monospace></td></tr>
<tr><td align="left" valign="top"><monospace>&#x000A0;&#x000A0;&#x003B8; &#x02190; <italic>Optimize</italic>(<italic>R</italic><sub><italic>all</italic></sub>(<italic>X</italic><sub><italic>L</italic></sub>, <italic>Y</italic><sub><italic>L</italic></sub>, <italic>X</italic><sub><italic>U</italic></sub>, &#x00176;<sub><italic>U</italic></sub>, &#x003B8;));</monospace> </td></tr>
<tr><td align="left" valign="top"><monospace><bold>end for</bold></monospace></td></tr>
</tbody>
</table>
</table-wrap>
</sec></sec>
<sec id="s4">
<title>4. Experiments</title>
<p>In this section, we conduct our experiments with several standard image datasets commonly used in image classification. We first describe the datasets and our implementation details, and then, we compare the proposed method with the state-of-the-art methods. Finally, we conduct an ablation study to give a deep investigation into our method.</p>

<sec>
<title>4.1. Datasets</title>
<p>We conduct experiments on three datasets: Cifar10, Cifar100, and Mini-imagenet. Cifar10 is widely used in related studies, and Mini-imagenet is adopted, as by Iscen et al. (<xref ref-type="bibr" rid="B11">2019</xref>), to evaluate the proposed method on a large-scale dataset. These datasets are commonly used in the SSL setting: a fixed number of labels is randomly retained, the network is trained with these labels and all image data while the use of the remaining labels is forbidden during training, and evaluation is performed on the test set for a fair comparison with fully supervised methods.</p>
<sec>
<title>4.1.1. Cifar10, Cifar100</title>
<p>The Cifar10 and Cifar100 datasets (Krizhevsky, <xref ref-type="bibr" rid="B17">2009</xref>) are adopted in the evaluation of previous SSL methods. Both consist of small images of size 32 &#x000D7; 32. The training set of Cifar10 contains 50 k images and its test set contains 10 k images, collected from 10 classes. Cifar100 likewise has 50 k training and 10 k test images, but collected from 100 classes. For Cifar10, we randomly choose 50, 100, 200, and 400 images from each class as the labeled images in our evaluation, corresponding to 500, 1,000, 2,000, and 4,000 labels in total. For each class, we also choose 500 images as validation images and employ the best model in validation to obtain the final performance on the test dataset. Following common practice, we repeat the selection process 10 times, run the algorithm once on each dataset split, and report the mean and standard deviation of the test error.</p></sec>
<sec>
<title>4.1.2. Mini-imagenet</title>
<p>Mini-imagenet was proposed by Gidaris and Komodakis (<xref ref-type="bibr" rid="B7">2018</xref>) for few-shot learning evaluation and is a simplified version of the Imagenet dataset. We adopt the same setting as the study by Iscen et al. (<xref ref-type="bibr" rid="B11">2019</xref>). Mini-imagenet consists of 100 classes with 600 images in each class; we randomly choose 500 images per class for the training set and use the remaining 100 images for testing.</p></sec></sec>
<sec>
<title>4.2. Implementation details</title>
<sec>
<title>4.2.1. Networks</title>
<p>We adopt a &#x0201C;13-layer&#x0201D; network for the experiments on Cifar10 and Cifar100, which is the baseline used in all experiments in <xref ref-type="table" rid="T1">Table 1</xref>, and Resnet-18 is employed for the experiments on Mini-imagenet. All of these networks consist of a feature extractor <italic>h</italic><sub>&#x003B8;</sub> followed by a linear classification layer. We remove the <italic>l</italic><sub>2</sub>-normalization applied after the feature extractor in the study by Iscen et al. (<xref ref-type="bibr" rid="B11">2019</xref>), which slightly hurt performance in our setting, since we employ the Mahalanobis distance between features instead of the dot product as the similarity function.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Comparison with state-of-the-art methods on Cifar10 using 13-layer ConvNet network architecture.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Nb. labels</bold></th>
<th valign="top" align="center"><bold>500 labels</bold></th>
<th valign="top" align="center"><bold>1,000 labels</bold></th>
<th valign="top" align="center"><bold>2,000 labels</bold></th>
<th valign="top" align="center"><bold>4,000 labels</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Nb. images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Supervised w/o. mixup</td>
<td valign="top" align="center">46.22 &#x000B1; 2.93</td>
<td valign="top" align="center">33.09 &#x000B1; 1.13</td>
<td valign="top" align="center">24.32 &#x000B1; 0.34</td>
<td valign="top" align="center">17.75 &#x000B1; 0.15</td>
</tr> <tr>
<td valign="top" align="left">Supervised w. mixup</td>
<td valign="top" align="center">44.65 &#x000B1; 1.01</td>
<td valign="top" align="center">34.84 &#x000B1; 1.37</td>
<td valign="top" align="center">24.86 &#x000B1; 0.42</td>
<td valign="top" align="center">16.89 &#x000B1; 0.16</td>
</tr> <tr>
<td valign="top" align="left">BadGAN<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Dai et al., <xref ref-type="bibr" rid="B5">2017</xref>)</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">14.41 &#x000B1; 0.30</td>
</tr> <tr>
<td valign="top" align="left">VAT<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Miyato et al., <xref ref-type="bibr" rid="B24">2019</xref>)</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">11.36 &#x000B1; 0.34</td>
</tr> <tr>
<td valign="top" align="left">MT<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Tarvainen and Valpola, <xref ref-type="bibr" rid="B30">2017</xref>)</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">27.36 &#x000B1; 1.30</td>
<td valign="top" align="center">15.73 &#x000B1; 0.31</td>
<td valign="top" align="center">12.31 &#x000B1; 0.28</td>
</tr> <tr>
<td valign="top" align="left">SWA<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Athiwaratkun et al., <xref ref-type="bibr" rid="B1">2019</xref>)</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">15.58 &#x000B1; 0.12</td>
<td valign="top" align="center">11.02 &#x000B1; 0.23</td>
<td valign="top" align="center">9.05 &#x000B1; 0.21</td>
</tr> <tr>
<td valign="top" align="left">LP<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>)</td>
<td valign="top" align="center">32.40 &#x000B1; 1.80</td>
<td valign="top" align="center">22.02 &#x000B1; 0.88</td>
<td valign="top" align="center">15.66 &#x000B1; 0.35</td>
<td valign="top" align="center">12.69 &#x000B1; 0.29</td>
</tr> <tr>
<td valign="top" align="left">LP&#x0002B;MT<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>)</td>
<td valign="top" align="center">24.02 &#x000B1; 2.44</td>
<td valign="top" align="center">16.93 &#x000B1; 0.70</td>
<td valign="top" align="center">13.22 &#x000B1; 0.29</td>
<td valign="top" align="center">10.61 &#x000B1; 0.28</td>
</tr> <tr>
<td valign="top" align="left">ICT<xref ref-type="table-fn" rid="TN1"><sup>&#x02020;</sup></xref> (Verma et al., <xref ref-type="bibr" rid="B32">2019</xref>)</td>
<td valign="top" align="center">&#x02013;</td>
<td valign="top" align="center">15.48 &#x000B1; 0.78</td>
<td valign="top" align="center"><bold>9.26 &#x000B1; 0.09</bold></td>
<td valign="top" align="center"><bold>7.29 &#x000B1; 0.09</bold></td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>18.56 &#x000B1; 1.58</bold></td>
<td valign="top" align="center"><bold>14.74 &#x000B1; 0.55</bold></td>
<td valign="top" align="center">10.14 &#x000B1; 0.30</td>
<td valign="top" align="center">8.58 &#x000B1; 0.27</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The error rate is reported over 10 runs.</p>
<fn id="TN1">
<label>&#x02020;</label><p>Denotes scores reported in previous studies. Bold values mark the best result under the described experimental settings.</p></fn>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>4.2.2. Hyper-parameters</title>
<p>The following hyper-parameters are adopted in all experiments. First, we train the model with labeled data for 30 epochs; then we finetune the model with all data for 270 epochs for the experiments on Cifar10 and Cifar100 and for 370 epochs for the experiments on Mini-imagenet. The training is performed using the SGD optimizer in all experiments. The learning rate is decayed from 0.1 to 0 with cosine annealing (Loshchilov and Hutter, <xref ref-type="bibr" rid="B20">2016</xref>), and the momentum and weight decay parameters are set to 0.9 and 0.0001, respectively.</p>
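The cosine-annealed schedule decays the learning rate from 0.1 to 0 over training. A minimal sketch follows; the per-epoch granularity and the exact normalization of the progress variable are assumptions of this sketch.

```python
import math

def cosine_lr(epoch, total_epochs, lr_max=0.1, lr_min=0.0):
    """Cosine annealing from lr_max down to lr_min over total_epochs
    (in the style of Loshchilov and Hutter, 2016, without restarts)."""
    t = epoch / max(total_epochs - 1, 1)   # training progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

The schedule starts at lr_max, decreases monotonically, and reaches lr_min at the final epoch.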
<p>Three hyperparameters, <italic>k, k</italic><sub>1</sub>, and <italic>k</italic><sub>2</sub>, are introduced in Section 3.5: we set <italic>k</italic> &#x0003D; 10 in Equation (6) for fast graph construction and set <italic>k</italic><sub>1</sub> &#x0003D; 0.99, <italic>k</italic><sub>2</sub> &#x0003D; 0.0005 in Equation (11), where we implement the CG algorithm using the python sci-kit package. The two mixup coefficients &#x003B1;<sub><italic>su</italic></sub> and &#x003B1;<sub><italic>unsu</italic></sub> are set to 1.0 in all our experiments. We set the value of &#x003BB; in Equation (17) to 10 for all experiments.</p></sec></sec>
<sec>
<title>4.3. Comparison with state-of-the-art methods</title>
<p>In this section, we present a comparison with the state-of-the-art methods. We choose representative methods from three categories: a generative SSL method (BadGAN; Dai et al., <xref ref-type="bibr" rid="B5">2017</xref>), consistency-based SSL methods [VAT (Miyato et al., <xref ref-type="bibr" rid="B24">2019</xref>), MT (Tarvainen and Valpola, <xref ref-type="bibr" rid="B30">2017</xref>), and ICT (Verma et al., <xref ref-type="bibr" rid="B32">2019</xref>)], and a graph-based SSL method (LP; Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>). The performance of the various methods on the three datasets is reported in <xref ref-type="table" rid="T1">Tables 1</xref>&#x02013;<xref ref-type="table" rid="T3">3</xref>, respectively.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Comparison with the state-of-the-art methods on Cifar100 using 13-layer ConvNet network architecture.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Nb. labels</bold></th>
<th valign="top" align="center"><bold>4,000 labels</bold></th>
<th valign="top" align="center"><bold>10,000 labels</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td valign="top" align="left"><bold>Nb. images</bold></td>
<td valign="top" align="center"><bold>50,000 images</bold></td>
<td valign="top" align="center"><bold>50,000 images</bold></td>
</tr> <tr>
<td valign="top" align="left">Supervised w/o. mixup</td>
<td valign="top" align="center">51.82 &#x000B1; 0.51</td>
<td valign="top" align="center">39.81 &#x000B1; 0.53</td>
</tr> <tr>
<td valign="top" align="left">Supervised w. mixup</td>
<td valign="top" align="center">52.43 &#x000B1; 0.43</td>
<td valign="top" align="center">38.53 &#x000B1; 0.29</td>
</tr> <tr>
<td valign="top" align="left">LP<xref ref-type="table-fn" rid="TN2"><sup>&#x02020;</sup></xref> (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>)</td>
<td valign="top" align="center">46.20 &#x000B1; 0.76</td>
<td valign="top" align="center">38.43 &#x000B1; 1.88</td>
</tr> <tr>
<td valign="top" align="left">LP&#x0002B;MT<xref ref-type="table-fn" rid="TN2"><sup>&#x02020;</sup></xref> (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>)</td>
<td valign="top" align="center">43.73 &#x000B1; 0.20</td>
<td valign="top" align="center">35.92 &#x000B1; 0.47</td>
</tr>
<tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>38.87 &#x000B1; 0.43</bold></td>
<td valign="top" align="center"><bold>32.15 &#x000B1; 0.25</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The error rate is reported over 10 runs.</p>
<fn id="TN2">
<label>&#x02020;</label><p>Denotes scores reported in previous studies. Bold values mark the best result under the described experimental settings.</p></fn>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Comparison with state-of-the-art methods on Mini-imagenet using the resnet-18 network.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="center"><bold>4,000 labels</bold></th>
<th valign="top" align="center"><bold>10,000 labels</bold></th>
</tr>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Nb. labeled images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Supervised</td>
<td valign="top" align="center">63.57 &#x000B1; 0.59</td>
<td valign="top" align="center">48.25 &#x000B1; 0.33</td>
</tr> <tr>
<td valign="top" align="left">LP&#x0002B;MT<xref ref-type="table-fn" rid="TN3"><sup>&#x02020;</sup></xref></td>
<td valign="top" align="center">70.29 &#x000B1; 0.81</td>
<td valign="top" align="center">57.58 &#x000B1; 1.47</td>
</tr> <tr>
<td valign="top" align="left"><bold>Ours</bold></td>
<td valign="top" align="center"><bold>48.86 &#x000B1; 0.11</bold></td>
<td valign="top" align="center"><bold>40.08 &#x000B1; 0.93</bold></td>
</tr>
<tr>
<td valign="top" align="left">Fully supervised with all labels</td>
<td valign="top" align="center">31.97 &#x000B1; 1.46</td>
<td valign="top" align="center">31.97 &#x000B1; 1.46</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The error rate is reported over three runs.</p>
<fn id="TN3">
<label>&#x02020;</label><p>Denotes scores reported in previous studies. Bold values mark the best result under the described experimental settings.</p></fn>
</table-wrap-foot>
</table-wrap>
<p>The proposed method outperforms the other methods with the same network architecture. On the Cifar10 dataset, our method achieves a significant error rate reduction (&#x0007E;20%) compared with its precursor (Iscen et al., <xref ref-type="bibr" rid="B11">2019</xref>), showing that our method amends its weaknesses and successfully narrows the performance gap between the graph-based SSL framework and other SSL methods. Compared with the best consistency-based method to our knowledge (Verma et al., <xref ref-type="bibr" rid="B32">2019</xref>), our method performs slightly worse with 4,000 labels in total but outperforms it with fewer labeled images. This shows that the advantage of traditional graph-based SSL, its more effective use of the available labels, still applies to modern deep learning architectures. We also use even fewer labels to evaluate the robustness of our method: with only 500 labels (&#x0007E;1% of the training set), our method still achieves an 18.56% error rate on the Cifar10 dataset.</p></sec>
<sec>
<title>4.4. Ablation studies</title>
<p>We conduct ablation studies to investigate the impact of mixup regularization on the pseudo-labels. To assess the quality of pseudo-labels, accuracy is an important indicator. Beyond this, we use the confidence score to calculate the weighted accuracy of pseudo-labels: <inline-formula><mml:math id="M39"><mml:mi>A</mml:mi><mml:mi>c</mml:mi><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mi>e</mml:mi><mml:mi>i</mml:mi><mml:mi>g</mml:mi><mml:mi>h</mml:mi><mml:mi>t</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:mfrac><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>u</mml:mi></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. During the iterative optimization process, if the model cannot correct wrong pseudo-labels, the confidence of those mistakes will increase and eventually approach 1, leading to <italic>Acc</italic><sub><italic>weighted</italic></sub> &#x02248; <italic>Acc</italic>. The weighted accuracy indicator therefore reflects whether the model really learns something useful from unlabeled images or merely memorizes the pseudo-labels.
<xref ref-type="fig" rid="F2">Figure 2</xref> shows the progress of the accuracy and the weighted accuracy of the pseudo-labels throughout training. The experiments are conducted on Mini-imagenet with 100 labels per class. The results show that, with no mixup regularization or with mixup regularization applied only to labeled data, the accuracy of pseudo-labels increases only at the beginning and then plateaus, while the weighted accuracy curve keeps declining until it approaches the accuracy curve. These results imply that, without regularization, the deep neural network simply memorizes the pseudo-labels due to its excessive representational capacity. In contrast, our regularization method successfully alleviates this undesired phenomenon, and the accuracy of pseudo-labels keeps increasing throughout training.</p>
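<p>As a minimal illustrative sketch (not the authors' code), the confidence-weighted accuracy defined above can be computed as follows, where the function and argument names are our own assumptions:</p>

```python
import numpy as np

def weighted_pseudo_label_accuracy(pseudo_labels, true_labels, confidences):
    """Acc_weighted = (1/u) * sum_i w_i * delta(y_hat_i == y_i),
    where w_i is the confidence score of the i-th pseudo-label."""
    pseudo_labels = np.asarray(pseudo_labels)
    true_labels = np.asarray(true_labels)
    confidences = np.asarray(confidences, dtype=float)
    # delta(y_hat_i == y_i): 1 when the pseudo-label matches ground truth
    correct = (pseudo_labels == true_labels).astype(float)
    # Averaging over u unlabeled points implements the 1/u * sum term
    return float(np.mean(confidences * correct))
```

If the model merely memorizes its pseudo-labels, the confidences on wrong labels drift toward 1 and this quantity collapses toward the unweighted accuracy, which is the diagnostic behavior discussed above.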
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Pseudo-label accuracy and weighted pseudo-label accuracy under different mixup conditions on Mini-imagenet (10,000 labels are given in the training process, and accuracy is calculated against the ground truth). &#x003B1; &#x0003D; 0.0 means no mixup operation, and &#x003B1; &#x0003D; 1.0 means the mixup coefficient &#x003BB; is drawn from a uniform distribution. The results show that applying the regularization in both losses greatly improves pseudo-label accuracy during the iterative training process.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-05-1114186-g0002.tif"/>
</fig>
<p>In <xref ref-type="table" rid="T4">Table 4</xref>, we compare performance on Mini-imagenet. The results show that our method greatly reduces the error rate compared with the baseline method, even on a high-resolution image dataset. To investigate the effectiveness of differential privacy in the proposed method, we vary the noise scale from 0 to 1.0 and report the performance on different datasets in <xref ref-type="table" rid="T5">Table 5</xref>. The results clearly show that the added noise reduces the error rate of the final model.</p>
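<p>The noise-scale experiment can be sketched as a Gaussian-mechanism-style perturbation. This is a hedged illustration only: the exact injection point in the pipeline (e.g., gradients or propagated label scores) is assumed, and the function name is hypothetical.</p>

```python
import numpy as np

def perturb_with_noise(values, noise_scale, seed=None):
    """Add zero-mean Gaussian noise with standard deviation `noise_scale`
    (the sigma varied from 0 to 1.0 in Table 5) to a vector of values,
    in the style of the Gaussian mechanism of differential privacy."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # sigma = 0 recovers the noiseless baseline row of Table 5
    return values + rng.normal(loc=0.0, scale=noise_scale, size=values.shape)
```

With <italic>&#x003C3;</italic> = 0 the baseline is recovered exactly; larger scales trade stronger perturbation against the regularization benefit observed in Table 5.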
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Impact of mixup regularization on pair of labeled data points or unlabeled data points.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Nb. labels</bold></th>
<th valign="top" align="center"><bold>10,000 labels</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>Nb. images</bold></th>
<th valign="top" align="center"><bold>50,000 images</bold></th>
</tr>
<tr>
<th valign="top" align="left"><bold>LP&#x0002B;MT<sup>&#x02020;</sup></bold></th>
<th valign="top" align="center"><bold>57.35</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">LP&#x0002B;MT</td>
<td valign="top" align="center">48.07</td>
</tr> <tr>
<td valign="top" align="left">LP&#x0002B;MT&#x0002B;Su.mixup</td>
<td valign="top" align="center">44.39</td>
</tr>
<tr>
<td valign="top" align="left">LP&#x0002B;MT&#x0002B;Su.mixup&#x0002B;Unsu.mixup</td>
<td valign="top" align="center">39.43</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The error rate is reported on Mini-imagenet with 10,000 labels.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Impact of varying the noise scale &#x003C3;.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Noise scale</bold></th>
<th valign="top" align="center"><bold>Cifar10</bold></th>
<th valign="top" align="center"><bold>Cifar100</bold></th>
<th valign="top" align="center"><bold>Mini-imagenet</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">0</td>
<td valign="top" align="center">12.31</td>
<td valign="top" align="center">37.98</td>
<td valign="top" align="center">43.72</td>
</tr> <tr>
<td valign="top" align="left">0.01</td>
<td valign="top" align="center">10.24</td>
<td valign="top" align="center">35.32</td>
<td valign="top" align="center">41.30</td>
</tr> <tr>
<td valign="top" align="left">0.1</td>
<td valign="top" align="center">8.58</td>
<td valign="top" align="center">32.15</td>
<td valign="top" align="center">39.65</td>
</tr>
<tr>
<td valign="top" align="left">1</td>
<td valign="top" align="center">8.69</td>
<td valign="top" align="center">31.90</td>
<td valign="top" align="center">39.61</td>
</tr>
</tbody>
</table>
</table-wrap></sec></sec>
<sec id="s5">
<title>5. Conclusion and future work</title>
<p>In this study, we present a simple but effective regularization method within the graph-based SSL framework. Building on the previously proposed method that extends traditional graph-based SSL to modern deep learning for image recognition, our study strengthens this research line with two critical measures: imposing regularization on the latent space of the deep neural network and preventing outlier data points from hurting the label propagation process. We show that our approach is effective and practical in utilizing unlabeled images through evaluation on both simple datasets (Cifar10 and Cifar100) and a complex, high-resolution dataset (Mini-imagenet). Furthermore, our method is computationally efficient and easy to implement: the experiment on Mini-imagenet takes approximately 5 h on a single NVIDIA 1080TI GPU. Our study also demonstrates that differential privacy is an effective technique for constraining the excessive representational power of deep neural networks. Future work includes designing more delicate and effective regularization techniques within the SSL framework to further narrow the performance gap between semi-supervised learning and fully supervised learning with all labels.</p></sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>This research study was funded in part by the Defence Industrial Technology Development Program under grant No. JCKY2020604B004.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Athiwaratkun</surname> <given-names>B.</given-names></name> <name><surname>Finzi</surname> <given-names>M.</given-names></name> <name><surname>Izmailov</surname> <given-names>P.</given-names></name> <name><surname>Wilson</surname> <given-names>A. G.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;There are many consistent explanations of unlabeled data: why you should average,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Belkin</surname> <given-names>M.</given-names></name> <name><surname>Niyogi</surname> <given-names>P.</given-names></name> <name><surname>Sindhwani</surname> <given-names>V.</given-names></name></person-group> (<year>2006</year>). <article-title>Manifold regularization: a geometric framework for learning from labeled and unlabeled examples</article-title>. <source>J. Mach. Learn. Res</source>. <volume>7</volume>, <fpage>2399</fpage>&#x02013;<lpage>2434</lpage>.<pub-id pub-id-type="pmid">26091754</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Brock</surname> <given-names>A.</given-names></name> <name><surname>Donahue</surname> <given-names>J.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Large scale GAN training for high fidelity natural image synthesis,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chapelle</surname> <given-names>O.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name> <name><surname>Zien</surname> <given-names>A.</given-names></name></person-group> (<year>2006</year>). <source>Semi-Supervised Learning</source>. <publisher-name>The MIT Press</publisher-name>. <pub-id pub-id-type="doi">10.7551/mitpress/9780262033589.001.0001</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dai</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Yang</surname> <given-names>F.</given-names></name> <name><surname>Cohen</surname> <given-names>W. W.</given-names></name> <name><surname>Salakhutdinov</surname> <given-names>R. R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Good semi-supervised learning that requires a bad GAN,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <italic>Vol. 30</italic>.</citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dwork</surname> <given-names>C.</given-names></name> <name><surname>Roth</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>The algorithmic foundations of differential privacy</article-title>. <source>Found. Trends Theor. Comput. Sci</source>. <volume>9</volume>, <fpage>211</fpage>&#x02013;<lpage>407</lpage>. <pub-id pub-id-type="doi">10.1561/0400000042</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gidaris</surname> <given-names>S.</given-names></name> <name><surname>Komodakis</surname> <given-names>N.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Dynamic few-shot visual learning without forgetting,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>4367</fpage>&#x02013;<lpage>4375</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00459</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>C.</given-names></name> <name><surname>Liu</surname> <given-names>T.</given-names></name> <name><surname>Tao</surname> <given-names>D.</given-names></name> <name><surname>Fu</surname> <given-names>K.</given-names></name> <name><surname>Tu</surname> <given-names>E.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Deformed graph Laplacian for semisupervised learning</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>26</volume>, <fpage>2261</fpage>&#x02013;<lpage>2274</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2014.2376936</pub-id><pub-id pub-id-type="pmid">25608310</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Pouget-Abadie</surname> <given-names>J.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Warde-Farley</surname> <given-names>D.</given-names></name> <name><surname>Ozair</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Generative adversarial nets,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Montreal, QC</publisher-loc>).</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gulrajani</surname> <given-names>I.</given-names></name> <name><surname>Ahmed</surname> <given-names>F.</given-names></name> <name><surname>Arjovsky</surname> <given-names>M.</given-names></name> <name><surname>Dumoulin</surname> <given-names>V.</given-names></name> <name><surname>Courville</surname> <given-names>A. C.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Improved training of wasserstein GANs,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <volume>Vol. 30</volume>.</citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Iscen</surname> <given-names>A.</given-names></name> <name><surname>Tolias</surname> <given-names>G.</given-names></name> <name><surname>Avrithis</surname> <given-names>Y.</given-names></name> <name><surname>Chum</surname> <given-names>O.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Label propagation for deep semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>. (<publisher-loc>Long Beach, CA</publisher-loc>). <pub-id pub-id-type="doi">10.1109/CVPR.2019.00521</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jebara</surname> <given-names>T.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Chang</surname> <given-names>S.-F.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Graph construction and b-matching for semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Machine Learning (ICML)</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>441</fpage>&#x02013;<lpage>448</lpage>. <pub-id pub-id-type="doi">10.1145/1553374.1553432</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kamnitsas</surname> <given-names>K.</given-names></name> <name><surname>Castro</surname> <given-names>D.</given-names></name> <name><surname>Le Folgoc</surname> <given-names>L.</given-names></name> <name><surname>Walker</surname> <given-names>I.</given-names></name> <name><surname>Tanno</surname> <given-names>R.</given-names></name> <name><surname>Rueckert</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>&#x0201C;Semi-supervised learning via compact latent space clustering,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Stockholm</publisher-loc>), <fpage>2459</fpage>&#x02013;<lpage>2468</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Mohamed</surname> <given-names>S.</given-names></name> <name><surname>Jimenez Rezende</surname> <given-names>D.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Semi-supervised learning with deep generative models,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Montreal, QC</publisher-loc>), <italic>Vol. 27</italic>.<pub-id pub-id-type="pmid">29989965</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Auto-encoding variational bayes</article-title>. <source>arXiv preprint arXiv:1312.6114</source>.<pub-id pub-id-type="pmid">32176273</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kobyzev</surname> <given-names>I.</given-names></name> <name><surname>Prince</surname> <given-names>S. J.</given-names></name> <name><surname>Brubaker</surname> <given-names>M. A.</given-names></name></person-group> (<year>2020</year>). <article-title>Normalizing flows: an introduction and review of current methods</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>43</volume>, <fpage>3964</fpage>&#x02013;<lpage>3979</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2020.2992934</pub-id><pub-id pub-id-type="pmid">32396070</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <source>Learning multiple layers of features from tiny images</source> (Master&#x00027;s thesis). University of Toronto, <publisher-loc>Toronto, ON, Canada</publisher-loc>.</citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Ota</surname> <given-names>K.</given-names></name> <name><surname>Dong</surname> <given-names>M.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Desvig: decentralized swift vigilance against adversarial attacks in industrial artificial intelligence systems</article-title>. <source>IEEE TII</source> <volume>16</volume>, <fpage>3267</fpage>&#x02013;<lpage>3277</lpage>. <pub-id pub-id-type="doi">10.1109/TII.2019.2951766</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Yang</surname> <given-names>W.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Multi-tentacle federated learning over software-defined industrial internet of things against adaptive poisoning attacks</article-title>. <source>IEEE Trans. Indus. Inform</source>. <volume>19</volume>, <fpage>1260</fpage>&#x02013;<lpage>1269</lpage>. <pub-id pub-id-type="doi">10.1109/TII.2022.3173996</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loshchilov</surname> <given-names>I.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2016</year>). <article-title>SGDR: Stochastic gradient descent with warm restarts</article-title>. <source>arXiv preprint arXiv:1608.03983</source>.</citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Luo</surname> <given-names>Y.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>M.</given-names></name> <name><surname>Ren</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Smooth neighbors on teacher graphs for semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> <publisher-loc>Salt Lake City, UT</publisher-loc>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00927</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maal&#x000F8;e</surname> <given-names>L.</given-names></name> <name><surname>S&#x000F8;nderby</surname> <given-names>C. K.</given-names></name> <name><surname>S&#x000F8;nderby</surname> <given-names>S. K.</given-names></name> <name><surname>Winther</surname> <given-names>O.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Auxiliary deep generative models,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Machine Learning (ICML)</source> (<publisher-loc>New York, NY</publisher-loc>), <volume>Vol. 48</volume>, <fpage>1445</fpage>&#x02013;<lpage>1453</lpage>.</citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Makhzani</surname> <given-names>A.</given-names></name> <name><surname>Shlens</surname> <given-names>J.</given-names></name> <name><surname>Jaitly</surname> <given-names>N.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Frey</surname> <given-names>B.</given-names></name></person-group> (<year>2015</year>). <article-title>Adversarial autoencoders</article-title>. <source>arXiv preprint arXiv:1511.05644</source>.</citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Miyato</surname> <given-names>T.</given-names></name> <name><surname>Maeda</surname> <given-names>S.-I.</given-names></name> <name><surname>Koyama</surname> <given-names>M.</given-names></name> <name><surname>Ishii</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>Virtual adversarial training: a regularization method for supervised and semi-supervised learning</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>41</volume>, <fpage>1979</fpage>&#x02013;<lpage>1993</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2018.2858821</pub-id><pub-id pub-id-type="pmid">30040630</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Metz</surname> <given-names>L.</given-names></name> <name><surname>Chintala</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Unsupervised representation learning with deep convolutional generative adversarial networks</article-title>. <source>arXiv preprint arXiv:1511.06434</source>.</citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rasmus</surname> <given-names>A.</given-names></name> <name><surname>Berglund</surname> <given-names>M.</given-names></name> <name><surname>Honkala</surname> <given-names>M.</given-names></name> <name><surname>Valpola</surname> <given-names>H.</given-names></name> <name><surname>Raiko</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Semi-supervised learning with ladder networks,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Montreal, QC</publisher-loc>), <italic>Vol. 28</italic>.<pub-id pub-id-type="pmid">36876903</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>G.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>G.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>Guizani</surname> <given-names>M.</given-names></name></person-group> (<year>2022</year>). <article-title>Protecting intellectual property with reliable availability of learning models in ai-based cybersecurity services</article-title>. <source>IEEE TDSC</source> <fpage>1</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1109/TDSC.2022.3222972</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sajjadi</surname> <given-names>M.</given-names></name> <name><surname>Javanmardi</surname> <given-names>M.</given-names></name> <name><surname>Tasdizen</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Regularization with stochastic transformations and perturbations for deep semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>1163</fpage>&#x02013;<lpage>1171</lpage>.</citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Salimans</surname> <given-names>T.</given-names></name> <name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Zaremba</surname> <given-names>W.</given-names></name> <name><surname>Cheung</surname> <given-names>V.</given-names></name> <name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Improved techniques for training GANs,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Barcelona</publisher-loc>), <italic>Vol. 29</italic>.</citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tarvainen</surname> <given-names>A.</given-names></name> <name><surname>Valpola</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>1195</fpage>&#x02013;<lpage>1204</lpage>.</citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tu</surname> <given-names>E.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Kasabov</surname> <given-names>N.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>Posterior distribution learning (PDL): a novel supervised learning framework using unlabeled samples to improve classification performance</article-title>. <source>Neurocomputing</source> <volume>157</volume>, <fpage>173</fpage>&#x02013;<lpage>186</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2015.01.020</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Verma</surname> <given-names>V.</given-names></name> <name><surname>Lamb</surname> <given-names>A.</given-names></name> <name><surname>Kannala</surname> <given-names>J.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Lopez-Paz</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Interpolation consistency training for semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)</source> (<publisher-loc>Macao</publisher-loc>), <fpage>3635</fpage>&#x02013;<lpage>3641</lpage>. <pub-id pub-id-type="doi">10.24963/ijcai.2019/504</pub-id><pub-id pub-id-type="pmid">34735894</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>B.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Ma</surname> <given-names>J.</given-names></name> <name><surname>Zhu</surname> <given-names>Z.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Tangent-normal adversarial regularization for semi-supervised learning,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>10676</fpage>&#x02013;<lpage>10684</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.01093</pub-id><pub-id pub-id-type="pmid">37015525</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Cisse</surname> <given-names>M.</given-names></name> <name><surname>Dauphin</surname> <given-names>Y. N.</given-names></name> <name><surname>Lopez-Paz</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Mixup: beyond empirical risk minimization,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>D.</given-names></name> <name><surname>Bousquet</surname> <given-names>O.</given-names></name> <name><surname>Lal</surname> <given-names>T. N.</given-names></name> <name><surname>Weston</surname> <given-names>J.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;Learning with local and global consistency,&#x0201D;</article-title> in <source>Proceedings of the Advances in Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Ghahramani</surname> <given-names>Z.</given-names></name> <name><surname>Lafferty</surname> <given-names>J. D.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;Semi-supervised learning using Gaussian fields and harmonic functions,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Machine Learning (ICML)</source> (<publisher-loc>Washington, DC</publisher-loc>), <fpage>912</fpage>&#x02013;<lpage>919</lpage>.</citation></ref>
</ref-list> 
</back>
</article>