<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2020.00002</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>On Robustness of Neural Architecture Search Under Label Noise</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Chen</surname> <given-names>Yi-Wei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/855112/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Song</surname> <given-names>Qingquan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/880499/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liu</surname> <given-names>Xi</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/880488/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Sastry</surname> <given-names>P. S.</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/881432/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Hu</surname> <given-names>Xia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/277895/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>DATALab, Department of Computer Science and Engineering, Texas A&#x00026;M University</institution>, <addr-line>College Station, TX</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Electrical and Computer Engineering, Texas A&#x00026;M University</institution>, <addr-line>College Station, TX</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Electrical Engineering, Indian Institute of Science</institution>, <addr-line>Bangalore</addr-line>, <country>India</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Kuansan Wang, Microsoft Research, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Xiangnan He, National University of Singapore, Singapore; Chao Lan, University of Wyoming, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Yi-Wei Chen <email>yiwei_chen&#x00040;tamu.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>11</day>
<month>02</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>3</volume>
<elocation-id>2</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>12</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>10</day>
<month>01</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Chen, Song, Liu, Sastry and Hu.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Chen, Song, Liu, Sastry and Hu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Neural architecture search (NAS), which aims to automatically find proper neural architectures for a given task, has recently attracted extensive attention in supervised learning applications. In most real-world situations, the class labels provided in the training data may be noisy for many reasons, such as subjective judgments, inadequate information, and random human errors. Existing work has demonstrated the adverse effects of label noise on the learning of the weights of neural networks. These effects could become more critical in NAS, since architectures are not only trained with noisy labels but are also compared based on their performance on noisy validation sets. In this paper, we systematically explore the robustness of NAS under label noise. We show that label noise in the training and/or validation data can lead to various degrees of performance variation. Through empirical experiments, we show that using robust loss functions can mitigate this performance degradation under symmetric label noise as well as under a simple model of class-conditional label noise, and we provide a theoretical justification for this. Both empirical and theoretical results provide a strong argument in favor of employing a robust loss function in NAS under high noise levels.</p></abstract>
<kwd-group>
<kwd>deep learning</kwd>
<kwd>automated machine learning</kwd>
<kwd>neural architecture search</kwd>
<kwd>label noise</kwd>
<kwd>robust loss function</kwd>
</kwd-group>
<counts>
<fig-count count="1"/>
<table-count count="3"/>
<equation-count count="7"/>
<ref-count count="40"/>
<page-count count="9"/>
<word-count count="6910"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Label noise, which corrupts the labels of training instances, has been widely investigated due to its unavoidability in real-world situations and harmfulness to classifier learning algorithms (Fr&#x000E9;nay and Verleysen, <xref ref-type="bibr" rid="B8">2013</xref>). Many recent studies have presented both empirical and analytical insights on learning of neural networks under label noise. Specifically, in the context of risk minimization, there are many recent studies on robust loss functions for learning classifiers under label noise (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Patrini et al., <xref ref-type="bibr" rid="B27">2017</xref>; Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>).</p>
<p>Neural architecture search (NAS) seeks to learn an appropriate architecture for a neural network in addition to learning the appropriate weights for the chosen architecture. It has the potential to revolutionize the deployment of neural network classifiers in a variety of applications. One requirement for such learning is a large number of training instances with correct labels. However, generating large sets of labeled instances is often difficult, and the labeling process (e.g., crowdsourcing) has to contend with many random labeling errors. As mentioned above, label noise can adversely affect the learning of the weights of a neural network. For NAS, the problem is compounded because we need to search for the architecture as well. Since different architectures are learned using training data and compared based on their validation performance, label noise in the training and validation (hold-out) data may cause a wrong assessment of architectures during the search process. Label noise can thus result in undesirable architectures being preferred by the search algorithm, leading to a loss of performance. In this paper, we systematically investigate the effect of label noise on NAS. We show that label noise in the training or validation data can lead to different degrees of performance variation. Recently, some robust loss functions have been proposed for learning the weights of a network under label noise (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>). Standard NAS algorithms use the categorical cross-entropy (CCE) loss function. We demonstrate through simulations that the use of a robust loss function (in place of CCE) in NAS can mitigate the effect of severe label noise. 
We provide a theoretical justification for this observed performance: for a class of loss functions that satisfies a robustness condition, we show that, under symmetric label noise, the relative risks of different classifiers are the same regardless of whether or not the data are corrupted with label noise.</p>
<p>The main contributions of the paper can be summarized as follows. We provide, for the first time, a systematic investigation of the effects of label noise on NAS. We provide theoretical and empirical justification for using loss functions that satisfy a robustness criterion. We show that robust loss functions are attractive because they deliver better performance under high noise levels than the standard CCE loss.</p>
</sec>
<sec id="s2">
<title>2. Background</title>
<sec>
<title>2.1. Robust Risk Minimization</title>
<p>In the context of multi-class classification, the feature vector is represented as <inline-formula><mml:math id="M1"><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>&#x02286;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, and the corresponding class label is denoted by <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02026;</mml:mo><mml:mi>c</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Y</mml:mi></mml:mrow></mml:math></inline-formula>. A classifier <inline-formula><mml:math id="M3"><mml:mi>f</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is learned to map each feature vector to a vector of scores, which are later used to decide a class. In this paper, we assume <italic>f</italic> is a DNN with a softmax output. 
Ideally, we could have a clean labeled dataset <inline-formula><mml:math id="M4"><mml:mi>D</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> drawn <italic>i.i.d</italic>. from an unknown joint distribution <inline-formula><mml:math id="M5"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> over <inline-formula><mml:math id="M6"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">Y</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
<p>In the presence of label noise, the noisy dataset is represented as <inline-formula><mml:math id="M7"><mml:msup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo stretchy="false">{</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> sampled <italic>i.i.d</italic>. from the noisy distribution <inline-formula><mml:math id="M8"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, where &#x01EF9;<sub><bold>x</bold></sub> is the noisy label. A noise model could capture the relationship between <inline-formula><mml:math id="M9"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M10"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> by:</p>
<disp-formula id="E1"><mml:math id="M11"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>&#x02200;</mml:mo><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The problem of robust learning of classifiers under label noise can be informally summed up as follows. We get noisy data drawn from <inline-formula><mml:math id="M12"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and use it to learn a classifier; however, the learned classifier has to perform well on clean data drawn according to <inline-formula><mml:math id="M13"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula>.</p>
<p>One can consider different label noise models based on what we assume regarding &#x003B7;<sub><bold>x</bold>,<italic>jk</italic></sub> (Fr&#x000E9;nay and Verleysen, <xref ref-type="bibr" rid="B8">2013</xref>; Manwani and Sastry, <xref ref-type="bibr" rid="B25">2013</xref>; Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Patrini et al., <xref ref-type="bibr" rid="B27">2017</xref>). In this paper, we consider only symmetric noise and hierarchical (class conditional) noise. If &#x003B7;<sub><bold>x</bold>,<italic>jk</italic></sub> &#x0003D; 1 &#x02212; &#x003B7; for <italic>j</italic> &#x0003D; <italic>k</italic>, <inline-formula><mml:math id="M14"><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> for <italic>j</italic> &#x02260; <italic>k</italic>, then the noise is said to be symmetric or uniform. If &#x003B7;<sub><bold>x</bold>,<italic>jk</italic></sub> is a function of (<italic>j, k</italic>) and independent of <bold>x</bold>, then it is called class conditional noise. We consider a particular case where the set of class labels can be partitioned into some subsets, and label noise is symmetric within each subset. We call this hierarchical noise. This is more realistic because, for example, when the labels are obtained through crowdsourcing, it is likely that different breeds of dogs may be confused with each other, although a dog may never be mislabeled as a car.</p>
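The two noise models above can be made concrete with a small sketch. The helper names below (<monospace>flip_labels_symmetric</monospace>, <monospace>flip_labels_hierarchical</monospace>) are illustrative, not part of any library: symmetric noise keeps a label with probability 1 &#x02212; &#x003B7; and otherwise replaces it by one of the other <italic>c</italic> &#x02212; 1 classes uniformly, while hierarchical noise confines the flips to each subset of the class partition (so a dog breed may turn into another breed, but never into a car).

```python
import numpy as np

def flip_labels_symmetric(labels, eta, num_classes, rng):
    """Symmetric (uniform) noise: each label is kept with probability
    1 - eta, else replaced uniformly by one of the other classes,
    i.e., eta_jk = eta / (c - 1) for j != k."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < eta
    for i in np.where(flip)[0]:
        others = [k for k in range(num_classes) if k != labels[i]]
        labels[i] = rng.choice(others)
    return labels

def flip_labels_hierarchical(labels, eta, partition, rng):
    """Hierarchical noise: symmetric flips restricted to each subset of
    the class partition, e.g. partition = [[0, 1], [2, 3]]."""
    labels = labels.copy()
    group_of = {k: g for g in partition for k in g}
    flip = rng.random(len(labels)) < eta
    for i in np.where(flip)[0]:
        others = [k for k in group_of[labels[i]] if k != labels[i]]
        if others:  # singleton groups have nothing to flip to
            labels[i] = rng.choice(others)
    return labels
```

Under the hierarchical model, a corrupted label always stays inside the same subset of the partition as the true label, which is what distinguishes it from general class-conditional noise.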
<p>Here we define the robustness of risk minimization algorithms (Manwani and Sastry, <xref ref-type="bibr" rid="B25">2013</xref>). Given a classifier <italic>f</italic>, its risk under loss function <inline-formula><mml:math id="M15"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:math></inline-formula> is defined as <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> and <italic>f</italic><sup>&#x0002A;</sup> denotes the minimizer of <inline-formula><mml:math id="M17"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. This is often referred to as L-risk to distinguish it from the usual Bayes risk, but we will call it risk here. 
Similarly, under noisy distribution the risk of <italic>f</italic> is given by <inline-formula><mml:math id="M18"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> and the corresponding minimizer of <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is <inline-formula><mml:math 
id="M20"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. We say the loss function <inline-formula><mml:math id="M21"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:math></inline-formula> is noise-tolerant or robust if:</p>
<disp-formula id="E2"><mml:math id="M22"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mo>&#x025E6;</mml:mo><mml:msup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>d</mml:mi><mml:mo>&#x025E6;</mml:mo><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>Pred</italic> &#x025E6; <italic>f</italic>(<italic>x</italic>) denotes the decision on classification scores <italic>f</italic>(<italic>x</italic>) and <inline-formula><mml:math id="M23"><mml:mi>P</mml:mi><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow></mml:msub></mml:math></inline-formula> denotes probability under the clean data distribution. Essentially, the above equation indicates that the classifiers learned with clean and noisy data both have the same generalization error under the noise-free distribution.</p>
<p>Robustness of risk minimization, as defined above, depends on the specific loss function employed. It has been proved that symmetric loss functions are robust to symmetric noise (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>). A loss function <inline-formula><mml:math id="M24"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:math></inline-formula> is symmetric if it satisfies Equation 1 (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>).</p>
<disp-formula id="E3"><label>(1)</label><mml:math id="M25"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mo>,</mml:mo><mml:mo>&#x02200;</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>f</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>That is, for any example <bold>x</bold> and classifier <italic>f</italic>, the loss summation over all classes is equal to a constant <italic>C</italic>. However, the above robustness is defined for the minimizer of the true risk. One can show that the consistency of empirical risk minimization holds under symmetric noise (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>). Hence, given a sufficient number of examples, empirical risk minimization would also be robust if we use a symmetric loss function.</p>
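The symmetry condition in Equation 1 is easy to check numerically. As a sketch (the function names here are illustrative): the mean absolute error (MAE) between a softmax output <italic>p</italic> and a one-hot label gives loss 2(1 &#x02212; <italic>p<sub>j</sub></italic>) for class <italic>j</italic>, so its sum over all <italic>c</italic> classes is always 2(<italic>c</italic> &#x02212; 1), whereas for CCE the sum over classes varies with the prediction.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mae_loss(p, j):
    # L1 distance between the softmax output and the one-hot label,
    # which equals 2 * (1 - p[j]).
    onehot = np.zeros_like(p)
    onehot[j] = 1.0
    return np.abs(p - onehot).sum()

def cce_loss(p, j):
    # Categorical cross entropy for class j.
    return -np.log(p[j])

c = 3
p1 = softmax(np.array([2.0, -1.0, 0.5]))
p2 = softmax(np.array([0.1, 0.1, 3.0]))
# MAE satisfies Equation 1: the sum over classes is 2(c-1) for any p.
sums_mae = [sum(mae_loss(p, j) for j in range(c)) for p in (p1, p2)]
# CCE does not: the sum over classes depends on the prediction p.
sums_cce = [sum(cce_loss(p, j) for j in range(c)) for p in (p1, p2)]
```

Here both entries of <monospace>sums_mae</monospace> equal 4 (i.e., 2(<italic>c</italic> &#x02212; 1) with <italic>c</italic> = 3), while the two entries of <monospace>sums_cce</monospace> differ, so MAE is symmetric in the sense of Equation 1 and CCE is not.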
</sec>
<sec>
<title>2.2. Robustness of NAS</title>
<p>In this paper, our focus is on NAS. Normally, in learning a neural network classifier, one learns only the weights, with the architecture chosen beforehand. In the context of NAS, however, one needs to learn both the architecture and the weights. Let us now denote by <italic>f</italic> the architecture and by &#x003B8; the weights of the architecture. Then, risk minimization can involve two different loss functions, as below.</p>
<disp-formula id="E4"><label>(2)</label><mml:math id="M26"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mtext>arg&#x000A0;min&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">F</mml:mi></mml:mrow></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder 
class="msub"><mml:mrow><mml:mtext>arg&#x000A0;min&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>We employ the loss <inline-formula><mml:math id="M27"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> for learning the architecture, while we use <inline-formula><mml:math id="M28"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> for learning the weights of any specific architecture. Notice from the above that we use the training data to learn the appropriate weights for any given architecture, while we use the validation data to learn the best architecture.</p>
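The bilevel structure of Equation (2) can be sketched with a toy search, with the caveat that this is an illustration of the objective rather than any particular NAS algorithm: each candidate "architecture" is reduced to a feature mask, the inner problem fits weights &#x003B8; on the training split (a closed-form least-squares fit standing in for minimizing <italic>L</italic><sub>2</sub>), and the outer problem picks the candidate with the smallest validation error (standing in for <italic>L</italic><sub>1</sub>). All function names below are illustrative.

```python
import numpy as np

def fit_weights(X, Y_onehot):
    # Inner problem: fit weights theta on the training split
    # (least squares onto one-hot labels stands in for minimizing L2).
    return np.linalg.lstsq(X, Y_onehot, rcond=None)[0]

def val_error(W, X, y):
    # Outer criterion: misclassification rate on the validation split
    # (stands in for the validation loss L1).
    return float(np.mean((X @ W).argmax(axis=1) != y))

def search(candidates, X_tr, y_tr, X_val, y_val, num_classes):
    # Outer problem: each candidate feature mask plays the role of one
    # architecture f; keep the one whose trained weights do best on
    # the validation data.
    Y_tr = np.eye(num_classes)[y_tr]
    best_err, best_mask = np.inf, None
    for mask in candidates:
        W = fit_weights(X_tr[:, mask], Y_tr)
        err = val_error(W, X_val[:, mask], y_val)
        if err < best_err:
            best_err, best_mask = err, mask
    return best_err, best_mask
```

The sketch also makes the paper's concern visible: if <monospace>y_val</monospace> were corrupted by label noise, the outer comparison could rank an inferior candidate above the better one even when the inner fit is sound.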
<p>The corresponding quantities under the noisy distribution would be:</p>
<disp-formula id="E5"><label>(3)</label><mml:math id="M29"><mml:mtable columnalign="center"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mtext>arg&#x000A0;min&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">F</mml:mi></mml:mrow></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mtext>arg&#x000A0;min&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mi>r</mml:mi><mml:mi>a</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For the robustness of NAS, as earlier, we want the final performance to be unaffected by whether or not there is label noise. Thus, we still need that the test error, under noise-free distribution, of <italic>f</italic><sup>&#x0002A;</sup> and <inline-formula><mml:math id="M30"><mml:msubsup><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> be the same. However, there are some crucial issues to be noted here.</p>
<p>The parameters &#x003B8; of each <italic>f</italic> in the search space are optimized by minimizing the empirical risk of <inline-formula><mml:math id="M31"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> on <italic>D</italic><sub><italic>train</italic></sub>, and the best architecture <italic>f</italic> is then selected according to the empirical risk of <inline-formula><mml:math id="M32"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> on <italic>D</italic><sub><italic>val</italic></sub>. Thus, in NAS, label noise in the training data and in the validation data may have different effects on the final learned classifier. Moreover, during the architecture search phase, each architecture is trained only for a few epochs before the risks of different architectures are compared. Hence, in addition to requiring the same risk minimizers under noisy and noise-free distributions, the relative risks of any two classifiers should remain the same irrespective of label noise.</p>
<p>In NAS, the most common choice for <inline-formula><mml:math id="M33"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the 0&#x02013;1 loss (i.e., accuracy), while for <inline-formula><mml:math id="M34"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> it is categorical cross entropy (CCE). Suppose <bold>p</bold> is the output of the softmax layer and let <italic>t</italic> be the class label of an example. The CCE is defined by <inline-formula><mml:math id="M35"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">log</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></inline-formula> The 0&#x02013;1 loss is known to be symmetric and hence robust. CCE, however, is unbounded and therefore does not satisfy the symmetry condition in Equation 1. Intuitively, we can mitigate the adverse effects of symmetric noise on NAS by replacing <inline-formula><mml:math id="M36"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> with a symmetric loss function. Robust log loss (RLL) (Kumar and Sastry, <xref ref-type="bibr" rid="B20">2018</xref>) is one such modification of CCE:</p>
<disp-formula id="E6"><mml:math id="M37"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B1;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1; &#x0003E; 0 is a hyper-parameter and <italic>c</italic> denotes the number of classes. RLL satisfies the symmetry condition (Equation 1) and compares (in log scale) the probability score of the desired output with the average probability score of all other labels, whereas the CCE loss only looks at the probability score of the desired output. Another symmetric loss is the mean absolute error (MAE), defined by <inline-formula><mml:math id="M38"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>p</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:munderover><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo></mml:math></inline-formula>. Since MAE takes a longer training time to converge (Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>), we use RLL in place of CCE in NAS and leave other symmetric loss functions (Charoenphakdee et al., <xref ref-type="bibr" rid="B6">2019</xref>) for future work.</p>
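As a concrete check, RLL can be implemented in a few lines of NumPy (our own sketch, not the authors' code). Summing RLL over all labels yields the constant C = c log((alpha+1)/alpha) regardless of p, which is exactly the symmetry condition of Equation 1; the analogous CCE sum varies with p.

```python
import numpy as np

def rll(p, t, alpha=0.1):
    """Robust log loss (Kumar & Sastry, 2018) for a softmax output p and label t."""
    c = len(p)
    others = np.delete(np.log(alpha + p), t)   # log(alpha + p_j) for j != t
    return (np.log((alpha + 1) / alpha)
            - np.log(alpha + p[t])
            + others.sum() / (c - 1))

def cce(p, t):
    """Categorical cross entropy."""
    return -np.log(p[t])

# Symmetry check (Equation 1): the RLL sum over all c labels is the constant
# C = c * log((alpha + 1) / alpha) for ANY probability vector p.
rng = np.random.default_rng(0)
c = 10
for _ in range(3):
    p = rng.dirichlet(np.ones(c))
    assert np.isclose(sum(rll(p, t) for t in range(c)),
                      c * np.log((0.1 + 1) / 0.1))
```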
</sec>
</sec>
<sec id="s3">
<title>3. Theoretical Result</title>
<p>As discussed earlier, we want a loss function that ensures that the relative risks of two different classifiers remain the same with and without label noise. Here we prove this for symmetric loss functions.</p>
<p><bold>Theorem 1</bold>. Let <inline-formula><mml:math id="M39"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:math></inline-formula> be a symmetric loss function, <inline-formula><mml:math id="M40"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> be a noise-free distribution, and <inline-formula><mml:math id="M41"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> be a noisy distribution with symmetric noise <inline-formula><mml:math id="M42"><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mfrac><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, where <italic>c</italic> is the number of total classes. The risk of <italic>f</italic> over <inline-formula><mml:math id="M43"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:math></inline-formula> is <inline-formula><mml:math id="M44"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and over <inline-formula><mml:math id="M45"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is <inline-formula><mml:math id="M46"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Then, given any two classifiers <italic>f</italic><sub>1</sub> and <italic>f</italic><sub>2</sub>, if <inline-formula><mml:math id="M47"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0003C;</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M48"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0003C;</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and vice versa.</p>
<p>Proof 1. Though this result is not explicitly available in the literature, it follows easily from the proof of Theorem 1 in Ghosh et al. (<xref ref-type="bibr" rid="B10">2017</xref>). For completeness, we present the proof here. For symmetric label noise, we have:<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref></p>
<disp-formula id="E7"><mml:math id="M49"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mstyle 
mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x01EF9;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x1D53C;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x02260;</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>C</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B7;</mml:mi><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B7;</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Note that <italic>C</italic> is the constant in the symmetry condition (Equation 1), and <italic>c</italic> signifies the number of all classes.</p>
<p>For the third equality, we calculate the expectation of a function of &#x01EF9;<sub><bold>x</bold></sub> conditioned on <italic>y</italic><sub><bold>x</bold></sub> and <bold>x</bold>, where the random variable &#x01EF9;<sub><bold>x</bold></sub> takes the value <italic>y</italic><sub><bold>x</bold></sub> with probability 1 &#x02212; &#x003B7; and each of the other labels with equal probability &#x003B7;/(<italic>c</italic> &#x02212; 1).</p>
<p>Thus, <inline-formula><mml:math id="M50"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is a linear function of <inline-formula><mml:math id="M51"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Also, since <inline-formula><mml:math id="M52"><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mfrac><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, we have <inline-formula><mml:math id="M53"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mfrac><mml:mrow><mml:mi>&#x003B7;</mml:mi><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0003E;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>. 
Hence, the above shows that <inline-formula><mml:math id="M54"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0003C;</mml:mo><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> implies <inline-formula><mml:math id="M55"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0003C;</mml:mo><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and vice versa. This completes the proof.</p>
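The linear relation derived in the proof can also be checked numerically. The following Monte Carlo sketch (our own illustration, not from the paper) uses the 0-1 loss, a symmetric loss for which the constant is C = c - 1, so the relation specializes to R_eta(f) = eta + (1 - eta*c/(c-1)) * R(f); it verifies both the relation and the preserved ranking of two classifiers.

```python
import numpy as np

rng = np.random.default_rng(0)
c, eta, n = 5, 0.5, 200_000          # noise level eta < (c-1)/c = 0.8
y = rng.integers(0, c, n)            # clean labels

# Symmetric noise: keep each label w.p. 1 - eta, otherwise move it to a
# uniformly chosen different label.
hit = rng.random(n) < eta
y_noisy = np.where(hit, (y + rng.integers(1, c, n)) % c, y)

def risk(pred, labels):
    return np.mean(pred != labels)   # empirical 0-1 risk

# Two classifiers: f1 is correct 80% of the time, f2 only 60%.
f1 = np.where(rng.random(n) < 0.8, y, (y + 1) % c)
f2 = np.where(rng.random(n) < 0.6, y, (y + 1) % c)

# Linear relation from the proof (with C = c - 1 for the 0-1 loss).
for f in (f1, f2):
    predicted = eta + (1 - eta * c / (c - 1)) * risk(f, y)
    assert abs(risk(f, y_noisy) - predicted) < 1e-2
assert risk(f1, y_noisy) < risk(f2, y_noisy)   # ranking is preserved
```

Note that even at eta = 0.5 the better classifier f1 still has the lower noisy risk, which is precisely what lets architecture selection proceed on noisy validation labels.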
<p><bold>Remark 1</bold>. Theorem 1 shows that, under a symmetric loss function, the risk ranking of different neural networks remains the same on noisy and clean data. Since the 0&#x02013;1 loss is symmetric, using it as <inline-formula><mml:math id="M56"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> in NAS keeps the risk ranking of different neural networks consistent; in theory, we can therefore discover the same optimal network architecture from noisy validation data as from clean validation data. Moreover, <italic>f</italic><sup>&#x0002A;</sup> is provably the global minimizer of both <inline-formula><mml:math id="M57"><mml:msub><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M58"><mml:msubsup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> when <inline-formula><mml:math id="M59"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:math></inline-formula> is symmetric (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>). 
When we adopt a symmetric loss as <inline-formula><mml:math id="M60"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>, we obtain <inline-formula><mml:math id="M61"><mml:msup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold-italic"><mml:mi>&#x003B8;</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>. Hence, as long as <inline-formula><mml:math id="M62"><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mfrac><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, <inline-formula><mml:math id="M63"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is the 0&#x02013;1 loss, and <inline-formula><mml:math id="M64"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> is a symmetric loss, NAS is robust to symmetric label noise.</p>
<p><bold>Remark 2</bold>. Theorem 1 establishes the rank consistency of the <italic>true</italic> risk under noisy and noise-free data. A further result (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>, Theorem 4) shows that the empirical risk converges uniformly to the true risk. With its aid, the linear relationship in Theorem 1 also holds for the empirical risk. This implies that, under a symmetric loss function, the relative ranking of classifiers by empirical risk (given sufficient samples) is the same as that by true risk under both noisy and noise-free data. However, the sample complexity is higher under noisy labels.</p>
</sec>
<sec id="s4">
<title>4. Experiments</title>
<p>To explore how label noise affects NAS and to examine the ranking consistency of symmetric loss functions, we design noisy-label settings on the CIFAR (Krizhevsky and Hinton, <xref ref-type="bibr" rid="B19">2009</xref>) benchmarks using DARTS (Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>) and ENAS (Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>).</p>
<sec>
<title>4.1. Dataset and Settings</title>
<sec>
<title>4.1.1. Dataset</title>
<p>The CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton, <xref ref-type="bibr" rid="B19">2009</xref>) consist of 32 &#x000D7; 32 color images with 10 and 100 classes, respectively. Following AutoKeras (Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>), each dataset is split into 45,000 training, 5,000 validation, and 10,000 testing images. All subsets are preprocessed by per-pixel mean subtraction, random horizontal flips, and 32 &#x000D7; 32 random crops after padding with 4 pixels. We corrupt the training and validation labels with noise and always keep the testing labels clean, which is common in the literature (Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>). The validation set is used to select the best neural architecture during searching and to decide the best training epoch during final retraining; the test set is used only to report performance.</p>
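The preprocessing pipeline just described (per-pixel mean subtraction, random horizontal flip, and a 32 &#x000D7; 32 crop from a 4-pixel-padded image) can be sketched in plain NumPy as follows; the function name, batch size, and array layout are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def preprocess(images, mean):
    """Per-pixel mean subtraction, random horizontal flip, and a 32x32
    random crop after 4-pixel zero padding (NHWC uint8 batch assumed)."""
    out = images.astype(np.float32) - mean            # per-pixel mean subtraction
    out = np.pad(out, ((0, 0), (4, 4), (4, 4), (0, 0)))  # pad H and W by 4
    crops = np.empty((len(images), 32, 32, 3), np.float32)
    for i, img in enumerate(out):
        if rng.random() < 0.5:                        # random horizontal flip
            img = img[:, ::-1]
        r, c = rng.integers(0, 9, 2)                  # crop offset in 0..8
        crops[i] = img[r:r + 32, c:c + 32]
    return crops

batch = rng.integers(0, 256, (8, 32, 32, 3))
mean = batch.mean(axis=0)                             # per-pixel training mean
aug = preprocess(batch, mean)
```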
</sec>
<sec>
<title>4.1.2. Noise Construction</title>
<p>We provide a theoretical guarantee for the performance of RLL under symmetric noise. Meanwhile, to better demonstrate the effectiveness of RLL, we evaluate it under both symmetric and hierarchical noise.</p>
<list list-type="bullet">
<list-item><p>Symmetric noise (Kumar and Sastry, <xref ref-type="bibr" rid="B20">2018</xref>): Each class has an equal chance of being corrupted into any other class. This chance can be captured by a matrix <italic>P</italic><sub>&#x003B7;</sub> &#x0003D; &#x003B7;<italic>B</italic> &#x0002B; (1 &#x02212; &#x003B7;)<italic>I</italic>, whose element in the <italic>i</italic>-th row and <italic>j</italic>-th column is the probability of the true label <italic>i</italic> being changed into label <italic>j</italic>. To be specific, <italic>I</italic> is the identity matrix; all elements of the matrix <italic>B</italic> are <inline-formula><mml:math id="M65"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> (where <italic>c</italic> is the number of classes) except that the diagonal values are zero, and &#x003B7; is the adjustable noise level. We inject symmetric noise into CIFAR-10 with &#x003B7; of [0.2, 0.4, 0.6].</p></list-item>
<list-item><p>Hierarchical noise (Hendrycks et al., <xref ref-type="bibr" rid="B13">2018</xref>): Each label class can uniformly flip to any other class that belongs to the same &#x0201C;superclass.&#x0201D; For instance, the &#x0201C;baby&#x0201D; class can flip to one of the four other categories (e.g., &#x0201C;boy&#x0201D; and &#x0201C;girl&#x0201D;) in the &#x0201C;people&#x0201D; superclass, but not to &#x0201C;bed&#x0201D; or &#x0201C;bear.&#x0201D; Since CIFAR-100 inherently provides the superclass information, we add hierarchical noise to CIFAR-100 with noise level &#x003B7; of [0.2, 0.4, 0.6].</p></list-item>
</list>
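The symmetric-noise construction above can be sketched as follows. This is an illustrative stand-alone implementation (function names are ours, not from any released codebase) that builds the transition matrix <italic>P</italic><sub>&#x003B7;</sub> and resamples each label from its row.

```python
import random

def symmetric_noise_matrix(c, eta):
    """P_eta = eta * B + (1 - eta) * I: diagonal entries are 1 - eta, and the
    off-diagonal mass eta is spread uniformly over the c - 1 wrong classes."""
    off = eta / (c - 1)
    return [[1.0 - eta if i == j else off for j in range(c)] for i in range(c)]

def corrupt_labels(labels, c, eta, seed=0):
    rng = random.Random(seed)
    P = symmetric_noise_matrix(c, eta)
    # Resample each label according to the row of P for its true class.
    return [rng.choices(range(c), weights=P[y])[0] for y in labels]

labels = [i % 10 for i in range(1000)]
noisy = corrupt_labels(labels, c=10, eta=0.6)
flip_rate = sum(a != b for a, b in zip(labels, noisy)) / len(labels)
print(f"observed flip rate: {flip_rate:.2f}")  # close to eta = 0.6
```

Hierarchical noise follows the same recipe, except that the off-diagonal mass of each row is spread only over the classes sharing the same superclass.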
</sec>
<sec>
<title>4.1.3. NAS Algorithms</title>
<p>To investigate the noisy label problem in NAS, we select two representative NAS methods, DARTS (Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>) and ENAS (Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>). Empirical results on AutoKeras (Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>) can be found in the <xref ref-type="supplementary-material" rid="SM2">Supplementary Material</xref> as well.</p>
<list list-type="bullet">
<list-item><p>DARTS searches neural architectures by gradient descent. It assigns continuous architectural weights to the candidate network operations and jointly optimizes the neural network weights and the architectural weights by gradient descent. The experiment setting of DARTS can be found in section 1 of the <xref ref-type="supplementary-material" rid="SM2">Supplementary Material</xref>.</p></list-item>
<list-item><p>ENAS discovers neural architectures by reinforcement learning. Although its RNN controller still samples candidate network operations with the REINFORCE rule (Williams, <xref ref-type="bibr" rid="B33">1992</xref>), ENAS shares the weights of network operations across different search iterations. The experiment setting of ENAS can be found in section 2 of the <xref ref-type="supplementary-material" rid="SM2">Supplementary Material</xref>.</p></list-item>
</list>
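The continuous relaxation at the heart of DARTS can be sketched in a few lines. The toy operations and scalar "feature" below are ours for illustration only; real DARTS mixes convolution and pooling operations on feature maps, and learns the architectural weights by gradient descent rather than fixing them.

```python
import math

def softmax(alphas):
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    s = sum(exps)
    return [e / s for e in exps]

# Toy candidate operations on a scalar feature; real DARTS mixes conv/pool ops.
ops = [
    ("identity", lambda x: x),
    ("double",   lambda x: 2.0 * x),
    ("zero",     lambda x: 0.0),
]

def mixed_op(x, alphas):
    """DARTS-style continuous relaxation: an edge outputs the softmax-weighted
    sum of every candidate operation; alphas are the architectural weights."""
    weights = softmax(alphas)
    return sum(w * op(x) for w, (_, op) in zip(weights, ops))

alphas = [0.0, 2.0, -1.0]  # normally learned jointly with the network weights
print(mixed_op(3.0, alphas))
# Discretization step: keep the operation with the largest architectural weight.
best_name = max(zip(alphas, ops), key=lambda t: t[0])[1][0]
print(best_name)  # "double"
```

Because the mixture is differentiable in the alphas, a noisy validation loss directly perturbs the gradient signal that ranks the operations, which is why DARTS is sensitive to validation-label noise in our experiments.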
</sec>
</sec>
<sec>
<title>4.2. The Impact of Label Noise on the Performance of NAS</title>
<p>To demonstrate how erroneous labels affect the performance of NAS, we intentionally introduce symmetric noise (&#x003B7; &#x0003D; 0.6) into training labels, validation labels, or both (all noisy). Different NAS methods execute under clean labels (all clean) and these three noisy settings. We evaluate each searcher by measuring the testing accuracy of its best-discovered architecture. Searched networks are retrained with clean labels or polluted labels, denoted as &#x0201C;all clean&#x0201D; and &#x0201C;all noisy,&#x0201D; respectively. The former shows how noise in the search phase affects the performance of standard NAS; the latter reflects how noise alters the search quality of NAS in practical situations. Furthermore, since test accuracy evaluates the search quality, we also include RLL to reduce the noise effect in the retraining phase.</p>
<p>The main results are shown in <xref ref-type="table" rid="T1">Table 1</xref>. In the clean retraining setting, the optimal network architectures found by DARTS and ENAS with noisy labels achieve performance comparable to the ones searched with clean labels. One possible reason is that both DARTS and ENAS adopt the cell search space, which is limited. As long as the networks can be fully retrained with clean labels, they achieve similar performance; the architectural variance resulting from label noise does not lead to noticeable performance differences. A similar observation was made by Li and Talwalkar (<xref ref-type="bibr" rid="B22">2019</xref>).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>NAS on CIFAR-10 with symmetric noise (&#x003B7; &#x0003D; 0.6).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>DARTS (Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>)</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>ENAS (Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>)</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>All clean</bold></th>
<th valign="top" align="center"><bold>Noisy valid</bold></th>
<th valign="top" align="center"><bold>Noisy train</bold></th>
<th valign="top" align="center"><bold>All noisy</bold></th>
<th valign="top" align="center"><bold>All clean</bold></th>
<th valign="top" align="center"><bold>Noisy valid</bold></th>
<th valign="top" align="center"><bold>Noisy train</bold></th>
<th valign="top" align="center"><bold>All noisy</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">Clean CCE Retrain</td>
<td valign="top" align="center">96.98</td>
<td valign="top" align="center">96.22</td>
<td valign="top" align="center">95.42</td>
<td valign="top" align="center">96.69</td>
<td valign="top" align="center">95.84</td>
<td valign="top" align="center">96.13</td>
<td valign="top" align="center">95.84</td>
<td valign="top" align="center">95.88</td>
</tr>
<tr>
<td valign="top" align="center">Noisy CCE Retrain</td>
<td valign="top" align="center">81.01</td>
<td valign="top" align="center">78.76</td>
<td valign="top" align="center">81.35</td>
<td valign="top" align="center">81.62</td>
<td valign="top" align="center">79.33</td>
<td valign="top" align="center">80.46</td>
<td valign="top" align="center">78.61</td>
<td valign="top" align="center">80.34</td>
</tr>
<tr>
<td valign="top" align="center">Noisy RLL Retrain</td>
<td valign="top" align="center">85.63</td>
<td valign="top" align="center">84.85</td>
<td valign="top" align="center">87.11</td>
<td valign="top" align="center">87.53</td>
<td valign="top" align="center">79.38</td>
<td valign="top" align="center">80.07</td>
<td valign="top" align="center">79.22</td>
<td valign="top" align="center">79.80</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The test accuracy is shown in percentage. Noisy train or noisy valid corrupts training or validation labels, while all noisy pollutes both training and validation labels. NAS algorithms search architectures by CCE under the above settings and retrain the searched architectures by CCE or RLL (&#x003B1; = 0.01)</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>When it comes to retraining the networks with noisy labels, their accuracy drops significantly. The performance differences stem from the classical issue of label noise in deep neural networks (Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>). With the help of RLL, the architectures searched by DARTS achieve better performance, whereas those from ENAS do not. Another important observation for ENAS is that the performance under the four search settings is comparable. One reason is that the 0-1 loss in ENAS provides a certain robustness to noisy validation labels, which counteracts the negative effect of symmetric noise. Since the search quality of ENAS appears robust to symmetric noise, we do not explore ENAS further in the following experiments.</p>
<p>When we focus on the noisy retraining of DARTS, the performance of &#x0201C;noisy valid&#x0201D; is the lowest among the settings. The decrease in search quality is partially because the <inline-formula><mml:math id="M66"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> of DARTS is CCE, which is not robust to symmetric noise; DARTS may not be able to rank the performance of different architectures correctly in this setting. The inferior performance caused by noisy validation labels in other machine learning models has also been reported by Inouye et al. (<xref ref-type="bibr" rid="B14">2017</xref>). Moreover, the &#x0201C;all noisy&#x0201D; searcher is supposed to produce the worst test accuracy since it has both noisy training and validation labels. Surprisingly, the empirical results show that &#x0201C;all noisy&#x0201D; in DARTS even outperforms &#x0201C;all clean.&#x0201D; A possible conjecture is that the &#x0201C;all noisy&#x0201D; searcher is optimized under the same retraining setting, so the resulting network is implicitly adapted to noisy labels. This finding warrants further exploration in the future, such as adopting NAS to discover more robust neural architectures. Despite that, label noise in the search phase generally has a negative influence on NAS performance, and DARTS especially suffers from noisy validation labels.</p>
</sec>
<sec>
<title>4.3. Noise Influence of the Risk Ranking</title>
<p>Since NAS aims to find architectures that outperform others, obtaining a correct performance ranking among different neural networks plays a crucial role in NAS. As long as NAS can recognize the correct performance ranking during the search phase, it has a high chance of finally recommending the best neural architecture. Theorem 1 reveals that symmetric loss functions have this desired property under symmetric noise. To evaluate the practical effects of the theorem, we construct two different neural networks (<xref ref-type="table" rid="T2">Table 2</xref>) by randomly choosing the network operations as well as the locations of the skip connections. Each network has 8 layers with 36 initial channels. We also exclude the auxiliary layer to avoid its additional loss.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Two neural network architectures for the ranking of empirical risk.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Network architecture 1</bold></th>
<th valign="top" align="left"><bold>Network architecture 2</bold></th>
</tr>
</thead>
<tbody>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">Normal cell</td>
<td valign="top" align="left"><inline-graphic xlink:href="fdata-03-00002-i0001.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="fdata-03-00002-i0002.tif"/></td>
</tr>
<tr>
<td valign="top" align="left">Reduce cell</td>
<td valign="top" align="left"><inline-graphic xlink:href="fdata-03-00002-i0003.tif"/></td>
<td valign="top" align="left"><inline-graphic xlink:href="fdata-03-00002-i0004.tif"/></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Each network has eight layers comprising normal cells and reduce cells</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>We train the networks for 350 epochs under clean and noisy training labels, where symmetric noise of &#x003B7; &#x0003D; 0.6 is injected. Proof 1 of section 3 shows that the noisy true risk is positively correlated with the clean true risk. Although we do not have access to the true risk, if the empirical risk of a loss function conforms to this relationship, the loss is likely to satisfy Theorem 1 in practice. Therefore, we inspect the closeness between the empirical noisy risk and its ideal counterpart, which is computed by applying the linear function of Proof 1 to the empirical clean risk. Specifically, the Pearson correlation coefficient (PCC) is used to measure the degree of closeness (0 &#x0003C; <italic>PCC</italic> &#x02A7D; 1 indicates a positive correlation).</p>
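The closeness check can be sketched as follows. The per-epoch risk curves and the constant <italic>C</italic> below are hypothetical numbers for illustration; the linear relation is the one used in Proof 1 of section 3 (cf. Ghosh et al., 2017) for a symmetric loss whose sum over all labels equals <italic>C</italic>.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ideal_noisy_risk(clean_risks, eta, c, C):
    """Linear relation for a symmetric loss (sum over all labels equals C)
    under symmetric noise: R_eta(f) = (1 - eta*c/(c-1)) * R(f) + eta*C/(c-1)."""
    slope = 1.0 - eta * c / (c - 1)
    intercept = eta * C / (c - 1)
    return [slope * r + intercept for r in clean_risks]

# Hypothetical per-epoch clean risk curve and a slightly perturbed noisy curve.
clean = [2.0, 1.5, 1.1, 0.8, 0.6, 0.5]
eta, c, C = 0.6, 10, 18.0  # e.g., C = 2 * (c - 1) for the MAE loss
ideal = ideal_noisy_risk(clean, eta, c, C)
noisy = [r + 0.02 * (-1) ** i for i, r in enumerate(ideal)]
print(f"PCC = {pearson(noisy, ideal):.3f}")  # close to 1: strong agreement
```

Since &#x003B7; &#x0003C; (c &#x02212; 1)/c, the slope is positive, so a measured noisy curve tracking the ideal curve yields a PCC near 1.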
<p><xref ref-type="fig" rid="F1">Figure 1</xref> displays the RLL and CCE training loss of the first network under noise-free and noisy labels. After obtaining the curve of the empirical clean risk, we draw the ideal curve for the noisy risk according to Proof 1 of section 3. The expectation is that the curve of the noisy risk in RLL should be close to the ideal curve, while that of CCE should not. As we can notice, the curves of the noisy risk in CCE deviate from the ideal curves. In contrast, the curves of the noisy risk in RLL stay closer to the ideal curves than those of CCE. Moreover, the PCC of RLL displays a positive correlation (<italic>PCC</italic> &#x0003E; 0), which also supports that the empirical risk of RLL is very close to the ideal one. The reasons that the empirical noisy risks do not perfectly match the ideal one include: (1) the training samples are not sufficient, and (2) the hyper-parameters are not optimal for learning the networks. The second network presents similar results (see <xref ref-type="supplementary-material" rid="SM1">Figure S1</xref>). Therefore, symmetric loss functions have the capability to make the risk ranking under noisy labels consistent with the one under clean labels in practice.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The empirical risk of the first network (depicted in <xref ref-type="table" rid="T2">Table 2</xref>). Symmetric noise of &#x003B7; &#x0003D; 0.6 is introduced into the training labels. The curves of the empirical risk (A1 clean and A1 noisy) come from training the network with CCE or RLL (&#x003B1; = 0.01). The ideal curve (A1 ideal) for the noisy risk is computed from Proof 1 of section 3 using A1 clean. When A1 noisy is close to A1 ideal, the loss can be understood to follow Theorem 1 in practice. The bottom RLL panels show that the A1 noisy curves are closer to the A1 ideal curves than in the CCE panels.</p></caption>
<graphic xlink:href="fdata-03-00002-g0001.tif"/>
</fig>
</sec>
<sec>
<title>4.4. NAS Improvement With Symmetric Loss Function</title>
<p>In practice, the networks produced by NAS are trained on potentially wrong labels. We want to see whether NAS can still discover high-performance networks in this harsh environment with the help of a symmetric loss function, particularly the robust log loss (RLL). Label noise degrades the performance of neural networks, but a symmetric loss can alleviate the adverse influence, as shown in Kumar and Sastry (<xref ref-type="bibr" rid="B20">2018</xref>). Thus, in this experiment, regardless of whether DARTS searches networks with CCE or RLL, we use RLL in the final retraining phase. Apart from DARTS, ResNet-18 (He et al., <xref ref-type="bibr" rid="B12">2016</xref>) is also included in the experiment for performance comparison. Moreover, we are interested in how NAS with RLL works under another type of label noise, so we also report results under the hierarchical noise of CIFAR-100.</p>
<p>The results presented in <xref ref-type="table" rid="T3">Table 3</xref> show that RLL can still help NAS discover high-performance network architectures under high noise levels. Under both symmetric and hierarchical noise, DARTS with RLL reaches an accuracy similar to DARTS with CCE under &#x003B7; &#x0003D; 0.2 and 0.4, and the RLL variant outperforms the CCE variant under &#x003B7; &#x0003D; 0.6. One possible reason is that DARTS is robust to mild noise due to its small search space, whereas severe noise introduces intense uncertainty for DARTS; RLL helps DARTS determine relatively robust neural architectures in this harsh condition. From the empirical results, we can claim that the symmetric (robust) loss function, RLL, improves the search quality under high-level label noise. More results for another representative search algorithm, AutoKeras (Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>), can be found in the <xref ref-type="supplementary-material" rid="SM2">Supplementary Material</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>NAS with RLL.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>Symmetric Noise (CIFAR-10)</bold></th>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>Hierarchical Noise (CIFAR-100)</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>0.2</bold></th>
<th valign="top" align="center"><bold>0.4</bold></th>
<th valign="top" align="center"><bold>0.6</bold></th>
<th valign="top" align="center"><bold>0.2</bold></th>
<th valign="top" align="center"><bold>0.4</bold></th>
<th valign="top" align="center"><bold>0.6</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">ResNet-18</td>
<td valign="top" align="center">92.05 &#x000B1; 0.40</td>
<td valign="top" align="center">88.95 &#x000B1; 0.14</td>
<td valign="top" align="center">82.77 &#x000B1; 0.61</td>
<td valign="top" align="center">61.27 &#x000B1; 0.60</td>
<td valign="top" align="center">53.50 &#x000B1; 0.94</td>
<td valign="top" align="center">39.99 &#x000B1; 2.17</td>
</tr>
<tr>
<td valign="top" align="left">DARTS-CCE</td>
<td valign="top" align="center"><bold>94.91</bold> &#x000B1; 0.19</td>
<td valign="top" align="center"><bold>91.02</bold> &#x000B1; 0.78</td>
<td valign="top" align="center">83.31 &#x000B1; 2.88</td>
<td valign="top" align="center"><bold>67.82</bold> &#x000B1; 0.70</td>
<td valign="top" align="center">52.57 &#x000B1; 1.03</td>
<td valign="top" align="center">39.22 &#x000B1; 2.50</td>
</tr>
<tr>
<td valign="top" align="left">DARTS-RLL</td>
<td valign="top" align="center">94.66 &#x000B1; 0.67</td>
<td valign="top" align="center">90.77 &#x000B1; 1.56</td>
<td valign="top" align="center"><bold>86.24</bold> &#x000B1; 0.85</td>
<td valign="top" align="center">66.47 &#x000B1; 1.68</td>
<td valign="top" align="center"><bold>53.68</bold> &#x000B1; 1.96</td>
<td valign="top" align="center"><bold>46.41</bold> &#x000B1; 2.65</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Test accuracy and standard deviation (3 runs) are represented in percentage. DARTS searches architectures with CCE or RLL (&#x003B1; = 0.01), and then the resulting optimal neural network is trained again from scratch by RLL (&#x003B1; = 0.01). Noise contaminates both training and validation labels with different noise levels. Bold font exhibits the best result in each column</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="s5">
<title>5. Related Work</title>
<sec>
<title>5.1. Neural Architecture Search</title>
<p>Neural architecture search (NAS) aims to automate the design of network architectures. Currently, the mainstream approaches to NAS include Bayesian optimization (Kandasamy et al., <xref ref-type="bibr" rid="B17">2018</xref>; Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>), reinforcement learning (Zoph and Le, <xref ref-type="bibr" rid="B39">2017</xref>; Cai et al., <xref ref-type="bibr" rid="B3">2018</xref>; Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>; Zoph et al., <xref ref-type="bibr" rid="B40">2018</xref>), evolutionary algorithms (Real et al., <xref ref-type="bibr" rid="B30">2017</xref>, <xref ref-type="bibr" rid="B29">2019</xref>) and gradient-based optimization (Luo et al., <xref ref-type="bibr" rid="B24">2018</xref>; Cai et al., <xref ref-type="bibr" rid="B4">2019</xref>; Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>). Regardless of the approach, NAS consists of two phases: the search phase and the final-retrain phase. During the search phase, NAS repeatedly generates and evaluates a variety of intermediate network architectures. These networks are trained on the training set for a short time (e.g., tens of epochs), and their performance, measured on the validation set, is used as a guideline to discover better network architectures. In the final-retrain phase, the optimal network architecture is trained with additional regularization techniques, e.g., Shake-Shake (Gastaldi, <xref ref-type="bibr" rid="B9">2017</xref>), DropPath (Larsson et al., <xref ref-type="bibr" rid="B21">2017</xref>), and Cutout (DeVries and Taylor, <xref ref-type="bibr" rid="B7">2017</xref>). This phase usually takes hundreds of epochs, after which the trained network is evaluated on the unseen test set. In general, the two phases use the same training set.</p>
<p>From the perspective of the search space of network architectures, existing works can be divided into the complete architecture search space (Real et al., <xref ref-type="bibr" rid="B30">2017</xref>; Zoph and Le, <xref ref-type="bibr" rid="B39">2017</xref>; Kandasamy et al., <xref ref-type="bibr" rid="B17">2018</xref>; Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>) and the cell search space (Cai et al., <xref ref-type="bibr" rid="B3">2018</xref>, <xref ref-type="bibr" rid="B4">2019</xref>; Luo et al., <xref ref-type="bibr" rid="B24">2018</xref>; Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>; Zoph et al., <xref ref-type="bibr" rid="B40">2018</xref>; Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>; Real et al., <xref ref-type="bibr" rid="B29">2019</xref>). The first search space allows NAS to look for complete networks and provides a high diversity of resulting network architectures. The second restricts NAS to searching small architectures for two kinds of cells (a normal cell and a reduction cell); it also requires a pre-defined base network architecture that contains the searched cells for evaluation, which implies that many intermediate networks share similar overall architectures. Most existing works adopt the cell search space because it is significantly smaller than the complete one, which reduces the enormous search time.</p>
<p>Due to limited hardware resources, our experiments focus on the cell search space, including DARTS (Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>) and ENAS (Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>). We also explore the impact of label noise on AutoKeras (Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>). To the best of our knowledge, no prior work has studied the effect of label noise on NAS.</p>
</sec>
<sec>
<title>5.2. Learning Under Corrupted Labels</title>
<p>Great progress has been made in research on the robustness of learning algorithms under corrupted labels (Arpit et al., <xref ref-type="bibr" rid="B1">2017</xref>; Chang et al., <xref ref-type="bibr" rid="B5">2017</xref>; Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Patrini et al., <xref ref-type="bibr" rid="B27">2017</xref>; Zhang et al., <xref ref-type="bibr" rid="B36">2017</xref>, <xref ref-type="bibr" rid="B37">2018</xref>; Jiang et al., <xref ref-type="bibr" rid="B15">2018</xref>; Ren et al., <xref ref-type="bibr" rid="B31">2018</xref>; Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>). A comprehensive overview of previous studies in this area can be found in Fr&#x000E9;nay and Verleysen (<xref ref-type="bibr" rid="B8">2013</xref>). The proposed approaches for learning under label noise can generally be categorized into a few groups.</p>
<p>The first group comprises mostly label-cleansing methods that aim to correct mislabeled data (Brodley and Friedl, <xref ref-type="bibr" rid="B2">1999</xref>) or adjust the sampling weights of unreliable training instances, as in Co-teaching and related approaches (Chang et al., <xref ref-type="bibr" rid="B5">2017</xref>; Han et al., <xref ref-type="bibr" rid="B11">2018</xref>; Jiang et al., <xref ref-type="bibr" rid="B15">2018</xref>; Ren et al., <xref ref-type="bibr" rid="B31">2018</xref>; Yu et al., <xref ref-type="bibr" rid="B35">2019</xref>). Another group treats the true but unknown labels as latent variables and the noisy labels as observed variables, so that EM-like algorithms can be used to learn the true label distribution of the dataset (Xiao et al., <xref ref-type="bibr" rid="B34">2015</xref>; Vahdat, <xref ref-type="bibr" rid="B32">2017</xref>; Khetan et al., <xref ref-type="bibr" rid="B18">2018</xref>). The third broad group aims to learn directly from noisy labels under the generic risk minimization framework and focuses on noise-robust algorithms (Manwani and Sastry, <xref ref-type="bibr" rid="B25">2013</xref>; Natarajan et al., <xref ref-type="bibr" rid="B26">2013</xref>; Ghosh et al., <xref ref-type="bibr" rid="B10">2017</xref>; Patrini et al., <xref ref-type="bibr" rid="B27">2017</xref>; Zhang and Sabuncu, <xref ref-type="bibr" rid="B38">2018</xref>). There are two general approaches here: one constructs a new loss function using estimated noise distributions, while the other develops conditions on loss functions so that risk minimization is inherently robust. In either case, theoretical guarantees on the robustness of classifier learning algorithms can be derived.</p>
<p>All the above approaches learn the parameters of specific classifiers from data with label noise. In NAS, we need to learn a suitable architecture for the neural network in addition to learning its weights. Our work differs from the above studies in that we discuss robustness in NAS under corrupted labels, while most of the above works focus on the robustness of training in supervised learning. We investigate the effect of label noise on NAS at multiple levels.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>Neural architecture search has gained increasing attention in recent years due to its flexibility and its remarkable power to reduce the burden of neural network design. The pervasive existence of label noise in real-world datasets motivates us to investigate the problem of neural architecture search under label noise. Through both theoretical and experimental analyses, we studied the robustness of NAS under label noise. We showed that symmetric label noise adversely affects the search ability of DARTS, while ENAS is robust to the noise. We further demonstrated the benefits of employing a specific robust loss function in search algorithms. These conclusions provide a strong argument in favor of adopting a symmetric (robust) loss function to guard against high-level label noise. In the future, we could explore the factors that cause DARTS to achieve superior performance under noisy training and validation labels, and also investigate other symmetric loss functions for NAS.</p>
</sec>
<sec sec-type="data-availability-statement" id="s7">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="https://www.cs.toronto.edu/&#x0007E;kriz/cifar.html">https://www.cs.toronto.edu/&#x0007E;kriz/cifar.html</ext-link>.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>Y-WC was responsible for most of the writing and conducted the experiments on DARTS (Liu et al., <xref ref-type="bibr" rid="B23">2019</xref>) and ENAS (Pham et al., <xref ref-type="bibr" rid="B28">2018</xref>). QS proofread the paper, proposed the idea of rank consistency along with PS and XL, conducted the experiments on AutoKeras (Jin et al., <xref ref-type="bibr" rid="B16">2019</xref>), and contributed equally with Y-WC. XL organized the work related to learning under corrupted labels. PS wrote the remarks on the theoretical results and refined the whole paper. XH supervised the progress and provided helpful discussion.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ack>
<p>We would like to thank XH for providing enormous computing resources for experiments. We also thank the anonymous reviewers for their useful comments.</p>
</ack>
<sec sec-type="supplementary-material" id="s9">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2020.00002/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2020.00002/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Image_1.PNG" id="SM1" mimetype="image/png" xmlns:xlink="http://www.w3.org/1999/xlink"/>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM2" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Arpit</surname> <given-names>D.</given-names></name> <name><surname>Jastrz&#x00119;bski</surname> <given-names>S.</given-names></name> <name><surname>Ballas</surname> <given-names>N.</given-names></name> <name><surname>Krueger</surname> <given-names>D.</given-names></name> <name><surname>Bengio</surname> <given-names>E.</given-names></name> <name><surname>Kanwal</surname> <given-names>M. S.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>A closer look at memorization in deep networks</article-title>, in <source>Proceedings of the 34th International Conference on Machine Learning</source> (<publisher-loc>Sydney, NSW</publisher-loc>).</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brodley</surname> <given-names>C. E.</given-names></name> <name><surname>Friedl</surname> <given-names>M. A.</given-names></name></person-group> (<year>1999</year>). <article-title>Identifying mislabeled training data</article-title>. <source>J. Art. Intell. Res.</source> <volume>11</volume>, <fpage>131</fpage>&#x02013;<lpage>167</lpage>.</citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>H.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>W.</given-names></name> <name><surname>Han</surname> <given-names>S.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Path-level network transformation for efficient architecture search</article-title>, in <source>Proceedings of the 35th International Conference on Machine Learning</source> (<publisher-loc>Stockholm</publisher-loc>).</citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>H.</given-names></name> <name><surname>Zhu</surname> <given-names>L.</given-names></name> <name><surname>Han</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>ProxylessNAS: direct neural architecture search on target task and hardware</article-title>, in <source>International Conference on Learning Representations</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>H.-S.</given-names></name> <name><surname>Learned-Miller</surname> <given-names>E.</given-names></name> <name><surname>McCallum</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Active bias: training more accurate neural networks by emphasizing high variance samples</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>1002</fpage>&#x02013;<lpage>1012</lpage>.</citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Charoenphakdee</surname> <given-names>N.</given-names></name> <name><surname>Lee</surname> <given-names>J.</given-names></name> <name><surname>Sugiyama</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>On symmetric losses for learning from corrupted labels</article-title>. <source>arXiv preprint</source> arXiv:1901.09314.</citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>DeVries</surname> <given-names>T.</given-names></name> <name><surname>Taylor</surname> <given-names>G. W.</given-names></name></person-group> (<year>2017</year>). <article-title>Improved regularization of convolutional neural networks with cutout</article-title>. <source>arXiv preprint</source> arXiv:1708.04552.</citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fr&#x000E9;nay</surname> <given-names>B.</given-names></name> <name><surname>Verleysen</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>Classification in the presence of label noise: a survey</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst.</source> <volume>25</volume>, <fpage>845</fpage>&#x02013;<lpage>869</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2013.2292894</pub-id><pub-id pub-id-type="pmid">24808033</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gastaldi</surname> <given-names>X.</given-names></name></person-group> (<year>2017</year>). <article-title>Shake-shake regularization</article-title>. <source>arXiv preprint</source> arXiv:1705.07485.</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ghosh</surname> <given-names>A.</given-names></name> <name><surname>Kumar</surname> <given-names>H.</given-names></name> <name><surname>Sastry</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>Robust loss functions under label noise for deep neural networks</article-title>, in <source>Thirty-First AAAI Conference on Artificial Intelligence</source> (<publisher-loc>San Francisco, CA</publisher-loc>).</citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>B.</given-names></name> <name><surname>Yao</surname> <given-names>Q.</given-names></name> <name><surname>Yu</surname> <given-names>X.</given-names></name> <name><surname>Niu</surname> <given-names>G.</given-names></name> <name><surname>Xu</surname> <given-names>M.</given-names></name> <name><surname>Hu</surname> <given-names>W.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>Co-teaching: robust training of deep neural networks with extremely noisy labels</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>8527</fpage>&#x02013;<lpage>8537</lpage>.</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Deep residual learning for image recognition</article-title>, in <source>The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>).</citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hendrycks</surname> <given-names>D.</given-names></name> <name><surname>Mazeika</surname> <given-names>M.</given-names></name> <name><surname>Wilson</surname> <given-names>D.</given-names></name> <name><surname>Gimpel</surname> <given-names>K.</given-names></name></person-group> (<year>2018</year>). <article-title>Using trusted data to train deep networks on labels corrupted by severe noise</article-title>, in <source>Advances in Neural Information Processing Systems 31</source>, eds <person-group person-group-type="editor"><name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Grauman</surname> <given-names>K.</given-names></name> <name><surname>Cesa-Bianchi</surname> <given-names>N.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>10477</fpage>&#x02013;<lpage>10486</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Inouye</surname> <given-names>D. I.</given-names></name> <name><surname>Ravikumar</surname> <given-names>P.</given-names></name> <name><surname>Das</surname> <given-names>P.</given-names></name> <name><surname>Datta</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Hyperparameter selection under localized label noise via corrupt validation</article-title>, in <source>Learning With Limited Labeled Data (NeurIPS Workshop)</source> (<publisher-loc>Long Beach, CA</publisher-loc>).</citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>L.</given-names></name> <name><surname>Zhou</surname> <given-names>Z.</given-names></name> <name><surname>Leung</surname> <given-names>T.</given-names></name> <name><surname>Li</surname> <given-names>L.-J.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). <article-title>MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels</article-title>, in <source>Proceedings of the 35th International Conference on Machine Learning</source> (<publisher-loc>Stockholm</publisher-loc>).</citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jin</surname> <given-names>H.</given-names></name> <name><surname>Song</surname> <given-names>Q.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name></person-group> (<year>2019</year>). <article-title>Auto-Keras: an efficient neural architecture search system</article-title>, in <source>ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Anchorage, AK</publisher-loc>).</citation></ref>
<ref id="B17">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kandasamy</surname> <given-names>K.</given-names></name> <name><surname>Neiswanger</surname> <given-names>W.</given-names></name> <name><surname>Schneider</surname> <given-names>J.</given-names></name> <name><surname>Poczos</surname> <given-names>B.</given-names></name> <name><surname>Xing</surname> <given-names>E. P.</given-names></name></person-group> (<year>2018</year>). <article-title>Neural architecture search with Bayesian optimisation and optimal transport</article-title>, in <source>Advances in Neural Information Processing Systems 31</source> (<publisher-loc>Montreal, QC</publisher-loc>).</citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Khetan</surname> <given-names>A.</given-names></name> <name><surname>Lipton</surname> <given-names>Z. C.</given-names></name> <name><surname>Anandkumar</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning from noisy singly-labeled data</article-title>, in <source>International Conference on Learning Representations</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation></ref>
<ref id="B19">
<citation citation-type="other"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2009</year>). <article-title>Learning multiple layers of features from tiny images</article-title>. Technical report, Citeseer.</citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>H.</given-names></name> <name><surname>Sastry</surname> <given-names>P. S.</given-names></name></person-group> (<year>2018</year>). <article-title>Robust loss functions for learning multi-class classifiers</article-title>, in <source>2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC)</source> (<publisher-loc>Miyazaki</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>687</fpage>&#x02013;<lpage>692</lpage>.</citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Larsson</surname> <given-names>G.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Shakhnarovich</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>FractalNet: ultra-deep neural networks without residuals</article-title>, in <source>ICLR</source> (<publisher-loc>Toulon</publisher-loc>: <publisher-name>International Conference on Learning Representations</publisher-name>).</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Talwalkar</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Random search and reproducibility for neural architecture search</article-title>. <source>arXiv preprint</source> arXiv:1902.07638.</citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>DARTS: differentiable architecture search</article-title>, in <source>International Conference on Learning Representations</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Luo</surname> <given-names>R.</given-names></name> <name><surname>Tian</surname> <given-names>F.</given-names></name> <name><surname>Qin</surname> <given-names>T.</given-names></name> <name><surname>Chen</surname> <given-names>E.</given-names></name> <name><surname>Liu</surname> <given-names>T.-Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Neural architecture optimization</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>).</citation></ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Manwani</surname> <given-names>N.</given-names></name> <name><surname>Sastry</surname> <given-names>P. S.</given-names></name></person-group> (<year>2013</year>). <article-title>Noise tolerance under risk minimization</article-title>. <source>IEEE Trans. Cybernetics</source> <volume>43</volume>, <fpage>1146</fpage>&#x02013;<lpage>1151</lpage>. <pub-id pub-id-type="doi">10.1109/TSMCB.2012.2223460</pub-id><pub-id pub-id-type="pmid">23193242</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Natarajan</surname> <given-names>N.</given-names></name> <name><surname>Dhillon</surname> <given-names>I. S.</given-names></name> <name><surname>Ravikumar</surname> <given-names>P. K.</given-names></name> <name><surname>Tewari</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Learning with noisy labels</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe</publisher-loc>), <fpage>1196</fpage>&#x02013;<lpage>1204</lpage>.</citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Patrini</surname> <given-names>G.</given-names></name> <name><surname>Rozza</surname> <given-names>A.</given-names></name> <name><surname>Krishna Menon</surname> <given-names>A.</given-names></name> <name><surname>Nock</surname> <given-names>R.</given-names></name> <name><surname>Qu</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Making deep neural networks robust to label noise: a loss correction approach</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1944</fpage>&#x02013;<lpage>1952</lpage>.</citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pham</surname> <given-names>H.</given-names></name> <name><surname>Guan</surname> <given-names>M.</given-names></name> <name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Le</surname> <given-names>Q.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Efficient neural architecture search via parameters sharing</article-title>, in <source>Proceedings of the 35th International Conference on Machine Learning</source> (<publisher-loc>Stockholm</publisher-loc>).</citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Real</surname> <given-names>E.</given-names></name> <name><surname>Aggarwal</surname> <given-names>A.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2019</year>). <article-title>Regularized evolution for image classifier architecture search</article-title>, in <source>Thirty-Third AAAI Conference on Artificial Intelligence</source> (<publisher-loc>Honolulu, HI</publisher-loc>).</citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Real</surname> <given-names>E.</given-names></name> <name><surname>Moore</surname> <given-names>S.</given-names></name> <name><surname>Selle</surname> <given-names>A.</given-names></name> <name><surname>Saxena</surname> <given-names>S.</given-names></name> <name><surname>Suematsu</surname> <given-names>Y. L.</given-names></name> <name><surname>Tan</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Large-scale evolution of image classifiers</article-title>, in <source>Proceedings of the 34th International Conference on Machine Learning</source> (<publisher-loc>Sydney, NSW</publisher-loc>).</citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>M.</given-names></name> <name><surname>Zeng</surname> <given-names>W.</given-names></name> <name><surname>Yang</surname> <given-names>B.</given-names></name> <name><surname>Urtasun</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning to reweight examples for robust deep learning</article-title>, in <source>International Conference on Machine Learning</source> (<publisher-loc>Stockholm</publisher-loc>), <fpage>4331</fpage>&#x02013;<lpage>4340</lpage>.</citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vahdat</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Toward robustness against label noise in training deep discriminative neural networks</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>5596</fpage>&#x02013;<lpage>5605</lpage>.</citation></ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Williams</surname> <given-names>R. J.</given-names></name></person-group> (<year>1992</year>). <article-title>Simple statistical gradient-following algorithms for connectionist reinforcement learning</article-title>. <source>Mach. Learn.</source> <volume>8</volume>, <fpage>229</fpage>&#x02013;<lpage>256</lpage>.</citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xiao</surname> <given-names>T.</given-names></name> <name><surname>Xia</surname> <given-names>T.</given-names></name> <name><surname>Yang</surname> <given-names>Y.</given-names></name> <name><surname>Huang</surname> <given-names>C.</given-names></name> <name><surname>Wang</surname> <given-names>X.</given-names></name></person-group> (<year>2015</year>). <article-title>Learning from massive noisy labeled data for image classification</article-title>, in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>2691</fpage>&#x02013;<lpage>2699</lpage>.</citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>X.</given-names></name> <name><surname>Han</surname> <given-names>B.</given-names></name> <name><surname>Yao</surname> <given-names>J.</given-names></name> <name><surname>Niu</surname> <given-names>G.</given-names></name> <name><surname>Tsang</surname> <given-names>I.</given-names></name> <name><surname>Sugiyama</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>How does disagreement help generalization against label corruption?</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>7164</fpage>&#x02013;<lpage>7173</lpage>.</citation></ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>C.</given-names></name> <name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Hardt</surname> <given-names>M.</given-names></name> <name><surname>Recht</surname> <given-names>B.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name></person-group> (<year>2017</year>). <article-title>Understanding deep learning requires rethinking generalization</article-title>, in <source>ICLR</source> (<publisher-loc>Toulon</publisher-loc>: <publisher-name>International Conference on Learning Representations</publisher-name>).</citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Cisse</surname> <given-names>M.</given-names></name> <name><surname>Dauphin</surname> <given-names>Y. N.</given-names></name> <name><surname>Lopez-Paz</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>mixup: beyond empirical risk minimization</article-title>, in <source>International Conference on Learning Representations</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Sabuncu</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Generalized cross entropy loss for training deep neural networks with noisy labels</article-title>, in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>8778</fpage>&#x02013;<lpage>8788</lpage>.</citation></ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2017</year>). <article-title>Neural architecture search with reinforcement learning</article-title>, in <source>International Conference on Learning Representations</source> (<publisher-loc>Toulon</publisher-loc>).</citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Vasudevan</surname> <given-names>V.</given-names></name> <name><surname>Shlens</surname> <given-names>J.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning transferable architectures for scalable image recognition</article-title>, in <source>The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>). <pub-id pub-id-type="doi">10.1109/CVPR.2018.00907</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>Note that the expectation over clean data is taken under the joint distribution of <bold>x</bold>, <italic>y</italic><sub><bold>x</bold></sub>, while that over noisy data is taken under the joint distribution of <bold>x</bold>, &#x01EF9;<sub><bold>x</bold></sub>.</p></fn>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work is, in part, supported by DARPA under grant &#x00023;FA8750-17-2-0116 and &#x00023;W911NF-16-1-0565 and NSF under grant &#x00023;IIS-1657196. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.</p></fn>
</fn-group>
</back>
</article>