<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2022.1052972</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Dynamic transfer learning with progressive meta-task scheduler</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Wu</surname> <given-names>Jun</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2013565/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>He</surname> <given-names>Jingrui</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/560852/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Department of Computer Science, University of Illinois at Urbana-Champaign</institution>, <addr-line>Champaign, IL</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>School of Information Sciences, University of Illinois at Urbana-Champaign</institution>, <addr-line>Champaign, IL</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Shuhan Yuan, Utah State University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Chao Lan, University of Oklahoma, United States; Depeng Xu, University of North Carolina at Charlotte, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jingrui He <email>jingrui&#x00040;illinois.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>03</day>
<month>11</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>1052972</elocation-id>
<history>
<date date-type="received">
<day>24</day>
<month>09</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>17</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Wu and He.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Wu and He</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Dynamic transfer learning refers to the knowledge transfer from a static source task with adequate label information to a dynamic target task with little or no label information. However, most existing theoretical studies and practical algorithms of dynamic transfer learning assume that the target task is continuously evolving over time. This strong assumption is often violated in real-world applications, e.g., the target distribution may change suddenly at some time stamp. To solve this problem, in this paper, we propose a novel meta-learning framework <monospace>L2S</monospace> based on a progressive meta-task scheduler for dynamic transfer learning. The key idea of <monospace>L2S</monospace> is to incrementally learn to schedule the meta-pairs of tasks and then learn the optimal model initialization from those meta-pairs of tasks for fast adaptation to the newest target task. The effectiveness of our <monospace>L2S</monospace> framework is verified both theoretically and empirically.</p></abstract>
<kwd-group>
<kwd>transfer learning</kwd>
<kwd>distribution shift</kwd>
<kwd>dynamic environment</kwd>
<kwd>meta-learning</kwd>
<kwd>task scheduler</kwd>
<kwd>image classification</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="2"/>
<equation-count count="12"/>
<ref-count count="41"/>
<page-count count="11"/>
<word-count count="6391"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Transfer learning (Pan and Yang, <xref ref-type="bibr" rid="B24">2009</xref>; Tripuraneni et al., <xref ref-type="bibr" rid="B28">2020</xref>) improves the generalization performance of a learning algorithm on the target task by leveraging the knowledge from a relevant source task. It has been shown (Ben-David et al., <xref ref-type="bibr" rid="B3">2010</xref>; Long et al., <xref ref-type="bibr" rid="B19">2015</xref>; Ganin et al., <xref ref-type="bibr" rid="B10">2016</xref>; Zhang et al., <xref ref-type="bibr" rid="B38">2019</xref>) that knowledge transferability across tasks can be theoretically guaranteed under mild conditions, e.g., when source and target tasks share the same labeling function. One assumption behind these works is that source and target tasks are sampled from a stationary task distribution. More recently, it has been observed that in the context of transfer learning, tasks might be sampled from a non-stationary task distribution, i.e., the learning task might evolve over time in real scenarios. This setting can be formulated as a dynamic transfer learning problem from a static source task<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> with adequate label information to a dynamic target task with little or no label information (see <xref ref-type="fig" rid="F1">Figure 1</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Illustration of dynamic transfer learning from a static source task (e.g., sketch image classification with fully labeled examples) to a dynamic target task (e.g., real-world image classification with only unlabeled examples).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1052972-g0001.tif"/>
</fig>
<p>Most existing works (Hoffman et al., <xref ref-type="bibr" rid="B14">2014</xref>; Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>; Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>; Wang H. et al., <xref ref-type="bibr" rid="B30">2020</xref>; Wu and He, <xref ref-type="bibr" rid="B32">2020</xref>, <xref ref-type="bibr" rid="B34">2022b</xref>) on dynamic transfer learning assume that the target task changes continuously over time. This assumption allows deriving the generalization error bound of dynamic transfer learning in terms of the distribution shift between consecutive time stamps. Nevertheless, we show that these error bounds are not tight when the task distribution changes suddenly at some time stamp. Therefore, previous works can hardly be applied to real scenarios where the task distribution does not always evolve continuously. Such a sudden distribution shift can be induced by unexpected events, e.g., adversarial attacks (Wu and He, <xref ref-type="bibr" rid="B35">2021</xref>), system failures (Lu et al., <xref ref-type="bibr" rid="B21">2018</xref>), etc.</p>
<p>To solve this problem, we derive the generalization error bound of dynamic transfer learning in terms of adaptively scheduled meta-pairs of tasks. This result is closely related to existing error bounds (Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>; Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), which were expressed in terms of the distribution shift between consecutive time stamps. In contrast, we consider all meta-pairs of tasks, i.e., pairs of tasks that transfer knowledge from an older time stamp to a newer one. As a result, our error bound can remain tight even when the task distribution shifts suddenly at some time stamp. Then, by minimizing the error bound, we propose a novel meta-learning framework <monospace>L2S</monospace> based on a progressive meta-task scheduler for dynamic transfer learning. In this framework, we automatically learn the sampling probabilities for meta-pairs of tasks based on task relatedness. The effectiveness of the <monospace>L2S</monospace> framework is then verified on a variety of dynamic transfer learning tasks. The major contributions of this paper are summarized as follows.</p>
<list list-type="bullet">
<list-item><p>We consider a relaxed assumption for dynamic transfer learning, i.e., the target task distribution may change suddenly at some time stamp as it evolves over time. We then derive the generalization error bounds of dynamic transfer learning under this relaxed assumption.</p></list-item>
<list-item><p>We propose a novel meta-learning framework <monospace>L2S</monospace> based on a progressive meta-task scheduler for dynamic transfer learning. Different from recent work (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), <monospace>L2S</monospace> learns to schedule the meta-pairs of tasks based on task relatedness.</p></list-item>
<list-item><p>Experiments on various data sets demonstrate the effectiveness of our <monospace>L2S</monospace> framework over state-of-the-art baselines.</p></list-item>
</list>
<p>The rest of the paper is organized as follows. We review the related work in Section 2. The problem of dynamic transfer learning is defined in Section 3. In Section 4, we derive the error bounds of dynamic transfer learning, followed by the proposed <monospace>L2S</monospace> framework in Section 5. The empirical analysis of <monospace>L2S</monospace> is provided in Section 6. Finally, we conclude the paper in Section 7.</p>
</sec>
<sec id="s2">
<title>2. Related work</title>
<p>In this section, we briefly introduce the related work on dynamic transfer learning and meta-learning.</p>
<sec>
<title>2.1. Dynamic transfer learning</title>
<p>Dynamic transfer learning (Hoffman et al., <xref ref-type="bibr" rid="B14">2014</xref>; Bitarafan et al., <xref ref-type="bibr" rid="B4">2016</xref>; Mancini et al., <xref ref-type="bibr" rid="B22">2019</xref>) refers to the knowledge transfer from a static source task to a dynamic target task. Compared to standard transfer learning on static source and target tasks (Pan and Yang, <xref ref-type="bibr" rid="B24">2009</xref>; Zhou et al., <xref ref-type="bibr" rid="B41">2017</xref>, <xref ref-type="bibr" rid="B40">2019a</xref>,<xref ref-type="bibr" rid="B39">b</xref>; Tripuraneni et al., <xref ref-type="bibr" rid="B28">2020</xref>; Wu and He, <xref ref-type="bibr" rid="B35">2021</xref>), dynamic transfer learning is a more challenging but realistic problem setting due to its time-evolving task relatedness. More recently, various dynamic transfer learning frameworks have been built along the following directions: self-training (Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>; Chen and Chao, <xref ref-type="bibr" rid="B6">2021</xref>; Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>), incremental distribution alignment (Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>; Wulfmeier et al., <xref ref-type="bibr" rid="B36">2018</xref>; Wang H. et al., <xref ref-type="bibr" rid="B30">2020</xref>; Wu and He, <xref ref-type="bibr" rid="B32">2020</xref>, <xref ref-type="bibr" rid="B33">2022a</xref>), meta-learning (Liu et al., <xref ref-type="bibr" rid="B18">2020</xref>; Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), and contrastive learning (Tang et al., <xref ref-type="bibr" rid="B26">2021</xref>; Taufique et al., <xref ref-type="bibr" rid="B27">2022</xref>). Most existing works assume that the task distribution is continuously evolving over time, and very little effort has been devoted to studying dynamic transfer learning when this assumption is violated in real scenarios. Compared to previous works (Liu et al., <xref ref-type="bibr" rid="B18">2020</xref>; Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>; Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), in this paper, we focus on a more realistic dynamic transfer learning setting with the relaxed assumption that the task distribution may change suddenly at some time stamp.</p>
</sec>
<sec>
<title>2.2. Meta-learning</title>
<p>Meta-learning (Hospedales et al., <xref ref-type="bibr" rid="B15">2021</xref>) leverages the knowledge from a set of prior meta-training tasks for fast adaptation to new tasks. In the context of few-shot classification, meta-learning aims to find the optimal model initialization (Finn et al., <xref ref-type="bibr" rid="B7">2017</xref>, <xref ref-type="bibr" rid="B8">2018</xref>; Wang L. et al., <xref ref-type="bibr" rid="B31">2020</xref>; Yao et al., <xref ref-type="bibr" rid="B37">2021</xref>) from previously seen tasks such that the model can be fine-tuned on a new task with only a few gradient steps. These methods assume that all tasks follow a stationary task distribution. More recently, this meta-learning paradigm has been extended to the online learning setting where a sequence of tasks is sampled from non-stationary task distributions (Finn et al., <xref ref-type="bibr" rid="B9">2019</xref>; Acar et al., <xref ref-type="bibr" rid="B1">2021</xref>). Following previous work (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), we formulate dynamic transfer learning as a meta-learning problem, which aims to learn the optimal model initialization for knowledge transfer across any meta-pair of tasks. In contrast to Wu and He (<xref ref-type="bibr" rid="B34">2022b</xref>), where the meta-pairs of tasks are simply constructed from tasks at consecutive time stamps, we propose to learn the sampling probabilities for meta-pairs of tasks based on the task relatedness during model training. This helps our meta-learning framework avoid the negative transfer induced by meta-pairs of tasks sampled from a suddenly shifted task distribution.</p>
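The scheduling idea above can be sketched as a MAML-style loop in which each update samples a meta-pair of tasks (i, j) with i &lt; j, adapts the shared initialization on the older task i, and updates the initialization against the newer task j, with more closely related pairs receiving higher sampling probability. The following minimal NumPy sketch is our own illustration under simplifying assumptions (a linear model with squared loss, and oracle task parameters standing in for the relatedness that would be estimated during training); it is not the authors' L2S implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy task sequence: task j is linear regression with parameters w_j.
# Tasks 0-2 drift smoothly; task 3 is a sudden shift.
true_w = [np.array([1.0, 0.0]), np.array([1.0, 0.1]),
          np.array([1.0, 0.2]), np.array([-1.0, 0.5])]

def sample_task(j, n=32):
    X = rng.normal(size=(n, 2))
    return X, X @ true_w[j]

def loss_grad(theta, X, y):
    r = X @ theta - y
    return float(np.mean(r ** 2)), 2.0 * X.T @ r / len(y)

# Relatedness-based scheduler over meta-pairs (i, j), i < j: a softmax
# over negative parameter distances (an illustrative oracle stand-in).
pairs = [(i, j) for i in range(4) for j in range(i + 1, 4)]
dist = np.array([np.linalg.norm(true_w[i] - true_w[j]) for i, j in pairs])
prob = np.exp(-dist) / np.exp(-dist).sum()

theta = np.zeros(2)          # shared meta-initialization
alpha, beta = 0.05, 0.05     # inner / outer step sizes
for _ in range(500):
    i, j = pairs[rng.choice(len(pairs), p=prob)]
    Xi, yi = sample_task(i)
    Xj, yj = sample_task(j)
    _, g_inner = loss_grad(theta, Xi, yi)
    adapted = theta - alpha * g_inner      # adapt on the older task i
    _, g_outer = loss_grad(adapted, Xj, yj)
    theta = theta - beta * g_outer         # update init toward newer task j
```

In practice, the relatedness (and hence the sampling distribution) would be estimated from data and updated progressively as new target time stamps arrive, which is what down-weights meta-pairs straddling a sudden shift.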
</sec>
</sec>
<sec id="s3">
<title>3. Preliminaries</title>
<p>In this section, we present the notation and formal problem definition of dynamic transfer learning.</p>
<sec>
<title>3.1. Notation</title>
<p>Let <inline-formula><mml:math id="M1"><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant="script">Y</mml:mi></mml:mrow></mml:math></inline-formula> be the input feature space and output label space respectively. We consider the dynamic transfer learning problem (Hoffman et al., <xref ref-type="bibr" rid="B14">2014</xref>; Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>) with a static source task <inline-formula><mml:math id="M3"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and a dynamic target task <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> with time stamp <italic>j</italic>. 
In this case, we assume that there are <italic>m</italic><sup><italic>s</italic></sup> labeled training examples <inline-formula><mml:math id="M5"><mml:msup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msubsup></mml:math></inline-formula> in the source task. 
Let <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> be the number of unlabeled training examples <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:msubsup></mml:math></inline-formula> in the <italic>j</italic><sup>th</sup> target task. Let <inline-formula><mml:math id="M8"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula> be the hypothesis class on <inline-formula><mml:math id="M9"><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow></mml:math></inline-formula> where a hypothesis is a function <inline-formula><mml:math id="M10"><mml:mi>h</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="script">X</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mi mathvariant="script">Y</mml:mi></mml:mrow></mml:math></inline-formula>. 
<inline-formula><mml:math id="M11"><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the loss function such that <inline-formula><mml:math id="M12"><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="script">Y</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="script">Y</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0211D;</mml:mi></mml:math></inline-formula>. The expected classification error on the source task <inline-formula><mml:math id="M13"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is defined as <inline-formula><mml:math id="M14"><mml:msup><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0007E;</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle 
mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> for any <inline-formula><mml:math id="M15"><mml:mi>h</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>, and its empirical estimate is given by <inline-formula><mml:math id="M16"><mml:msup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. 
The expected error <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and empirical error <inline-formula><mml:math id="M18"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> of the target task at the <italic>j</italic><sup>th</sup> time stamp can also be defined similarly.</p>
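For concreteness, the empirical error above is simply the average loss over the labeled examples; with the binary loss <monospace>L(y&#x00302;, y) = |y&#x00302; &#x02212; y|</monospace> used later in the analysis, it is the misclassification rate. A minimal sketch (the hypothesis and data below are illustrative assumptions, not from the paper):

```python
import numpy as np

def empirical_error(h, X, y):
    """Empirical estimate (1/m) * sum_i L(h(x_i), y_i) of the expected
    error, with the binary loss L(y_hat, y) = |y_hat - y|."""
    y_hat = np.array([h(x) for x in X])
    return float(np.mean(np.abs(y_hat - y)))

# Illustrative hypothesis: predict class 1 when the first feature is positive.
h = lambda x: int(x[0] > 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
# Noisy labels: mostly, but not always, the sign of the first feature.
y = (X[:, 0] + 0.1 * rng.normal(size=100) > 0).astype(int)

err = empirical_error(h, X, y)  # misclassification rate in [0, 1]
```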
</sec>
<sec>
<title>3.2. Problem definition</title>
<p>Following previous works (Hoffman et al., <xref ref-type="bibr" rid="B14">2014</xref>; Bitarafan et al., <xref ref-type="bibr" rid="B4">2016</xref>; Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>), we formally define the problem of dynamic transfer learning as follows.</p>
<p><bold> Definition 3.1</bold>. <italic>(Dynamic Transfer Learning) Given a labeled static source task <inline-formula><mml:math id="M19"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> and an unlabeled dynamic target task <inline-formula><mml:math id="M20"><mml:msubsup><mml:mrow><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>, dynamic transfer learning aims to learn the prediction function for the newest target task <inline-formula><mml:math id="M21"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> by leveraging the knowledge from historical source and target tasks</italic>.</p>
<p>The key challenge of dynamic transfer learning is the time-evolving task relatedness between source and target tasks. Recent works (Liu et al., <xref ref-type="bibr" rid="B18">2020</xref>; Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>; Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>) derived generalization error bounds under the assumption that the data distribution of the target task changes continuously over time. Intuitively, in this case, the expected error on the newest target task is bounded in terms of the largest distribution gap [e.g., <inline-formula><mml:math id="M22"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mn>0</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>N</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow><mml:mi>&#x00394;</mml:mi><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>] across consecutive time stamps. As a result, these generalization error bounds are not tight when the task distribution shifts significantly at some time stamp. As shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, the task distribution shifts smoothly from time stamp 1 to time stamp 2, but changes sharply from time stamp 2 to time stamp 3. 
In real scenarios, this sharp distribution shift might be induced by unexpected events, e.g., adversarial manipulation (Wu and He, <xref ref-type="bibr" rid="B35">2021</xref>). This motivates us to study dynamic transfer learning under the more relaxed assumption that the task distribution may shift suddenly at some time stamp.</p>
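The sudden-shift setting can be made concrete with a toy simulation (our illustration, not data from the paper): the target distribution drifts slowly for several time stamps and then jumps at the last one, so any guarantee driven by the largest consecutive-stamp gap becomes loose:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target-task mean per time stamp: smooth drift for stamps 0-3,
# then a sudden jump at stamp 4.
means = [np.array([0.0, 0.0]), np.array([0.2, 0.0]), np.array([0.4, 0.0]),
         np.array([0.6, 0.0]), np.array([5.0, 5.0])]

def sample_target(j, n=200):
    return means[j] + rng.normal(scale=0.5, size=(n, 2))

# Crude proxy for the distribution gap between consecutive stamps:
# the distance between empirical means (standing in for the
# H-divergence used in the formal analysis).
samples = [sample_target(j) for j in range(5)]
gaps = [float(np.linalg.norm(samples[j + 1].mean(axis=0)
                             - samples[j].mean(axis=0)))
        for j in range(4)]
# gaps[0..2] stay near the 0.2 drift size; gaps[3] spikes.
```

A bound controlled by the maximum consecutive-stamp gap is dominated by the single spike, whereas scheduling over all meta-pairs of tasks can route knowledge transfer around it.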
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Challenges of dynamic transfer learning where the task distribution is suddenly changed at time stamp 3. Here orange circle and green square denote data points from two classes, and the dashed line indicates the optimal decision boundary at different time stamps.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1052972-g0002.tif"/>
</fig></sec>
</sec>
<sec id="s4">
<title>4. Theoretical analysis</title>
<p>In this section, we provide the theoretical analysis for dynamic transfer learning.</p>
<sec>
<title>4.1. Generalization error bound</title>
<p>We derive the generalization error bound of dynamic transfer learning as follows. Following Ben-David et al. (<xref ref-type="bibr" rid="B3">2010</xref>) and Liu et al. (<xref ref-type="bibr" rid="B18">2020</xref>), we use <inline-formula><mml:math id="M23"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>-divergence to measure the distribution shift across tasks and Vapnik-Chervonenkis (VC) dimension to measure the complexity of a class of functions <inline-formula><mml:math id="M24"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>. Without loss of generality, we consider a binary classification problem (i.e., <inline-formula><mml:math id="M25"><mml:mrow><mml:mi mathvariant="script">Y</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>) with the loss function <inline-formula><mml:math id="M26"><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x00177;</mml:mi><mml:mo>,</mml:mo><mml:mi>y</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>|</mml:mo><mml:mi>&#x00177;</mml:mi><mml:mo>-</mml:mo><mml:mi>y</mml:mi><mml:mo>|</mml:mo></mml:math></inline-formula>. The following theorem shows that the expected error of the newest target task <inline-formula><mml:math id="M27"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be bounded in terms of the historical source and target knowledge.</p>
<p><bold> Theorem 4.1</bold>. <italic>(Generalization Error Bound) Let <inline-formula><mml:math id="M28"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula> be a hypothesis space of VC dimension <italic>d</italic>. If there are <italic>m</italic> labeled source examples i.i.d. drawn from <inline-formula><mml:math id="M29"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> (denoted as <inline-formula><mml:math id="M30"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> as well) and <italic>m</italic> unlabeled target examples i.i.d. drawn from <inline-formula><mml:math id="M31"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> for each time stamp <italic>j</italic> &#x0003D; 1, &#x022EF; , <italic>N</italic>&#x0002B;1<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>, then for any &#x003B4;&#x0003E;0 and <inline-formula><mml:math id="M32"><mml:mi>h</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>, with probability at least 1&#x02212;&#x003B4;, the expected error of the newest target task <inline-formula><mml:math id="M33"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be bounded as follows</italic>.</p>
<disp-formula id="E1"><mml:math id="M34"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msubsup><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo stretchy='true'>(</mml:mo><mml:msubsup><mml:mover accent='true'><mml:mi>&#x003F5;</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>d</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x0210B;</mml:mi><mml:mo>&#x00394;</mml:mo><mml:mi>&#x0210B;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>j</mml:mi><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo 
stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;</mml:mtext></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>+</mml:mo><mml:mi mathvariant='script'>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo>+</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mi>m</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo>/</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>m</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:msqrt></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><italic>where <inline-formula><mml:math id="M35"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and <italic>w</italic><sub><italic>ij</italic></sub>&#x02265;0 if <italic>i</italic>&#x0003C;<italic>j</italic>, <italic>w</italic><sub><italic>ij</italic></sub> &#x0003D; 0 otherwise. <inline-formula><mml:math id="M36"><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> if 1 &#x02264; <italic>j</italic> &#x02264; <italic>N</italic> and <italic>i</italic>&#x0003C;<italic>j</italic>, and <inline-formula><mml:math id="M37"><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> if <italic>j</italic> &#x0003D; <italic>N</italic>&#x0002B;1 and <italic>i</italic>&#x0003C;<italic>j</italic>, &#x003B7;<sub><italic>ij</italic></sub> &#x0003D; 0 otherwise. Here &#x003BB; denotes the combined error of the ideal hypothesis over all the tasks, i.e., <inline-formula><mml:math id="M38"><mml:mi>&#x003BB;</mml:mi><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo class="qopname">&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msubsup><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M39"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi 
mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the empirical estimate of <inline-formula><mml:math id="M40"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>-divergence over finite examples</italic>.</p>
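<p>To make the structure of this bound concrete, the sketch below evaluates its first (task-weighted) term for hypothetical values of the empirical errors and divergence estimates. All numbers and variable names are illustrative assumptions, not quantities from the paper.</p>

```python
import numpy as np

def make_eta(w):
    """Build the eta_ij coefficients of Theorem 4.1 from the weights w_ij.
    Tasks are indexed 0..T-1: task 0 is the source and task T-1 is the
    newest target, so T = N + 2."""
    T = w.shape[0]
    eta = np.zeros_like(w)
    for i in range(T):
        for j in range(i + 1, T):
            if j != T - 1:                 # case j in 1..N: eta_ij = 1/2
                eta[i, j] = 0.5
            elif w[i, j] > 0:              # case j = N + 1
                eta[i, j] = 0.5 * (1.0 + w[:i, i].sum() / w[i, j])
    return eta

def bound_main_term(w, eps_hat, d_hat):
    """Task-weighted first term of the bound: for each meta-pair (i, j)
    with i preceding j, accumulate w_ij * (eps_hat_i + eta_ij * d_hat_ij)."""
    eta = make_eta(w)
    T = w.shape[0]
    return sum(w[i, j] * (eps_hat[i] + eta[i, j] * d_hat[i, j])
               for i in range(T) for j in range(i + 1, T))

# Hypothetical setting: N = 2 historical targets, so tasks 0, 1, 2, 3.
T = 4
w = np.triu(np.full((T, T), 1.0 / 6.0), k=1)   # uniform weight on all 6 pairs
eps_hat = np.array([0.05, 0.10, 0.12, 0.0])    # made-up empirical errors
d_hat = np.full((T, T), 0.2)                   # made-up divergence estimates
value = bound_main_term(w, eps_hat, d_hat)
```

<p>Note how, for pairs ending at the newest target, eta_ij grows with the weight already accumulated on the intermediate task, exactly as in the theorem statement.</p>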
<p>Note that this error bound also holds for other existing distribution discrepancy measures (see Corollary 4.3), though we consider <inline-formula><mml:math id="M41"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>-divergence (Ben-David et al., <xref ref-type="bibr" rid="B3">2010</xref>) in Theorem 4.1. Furthermore, we show the generalization error bound of dynamic transfer learning from the perspective of meta-learning. That is, instead of sharing the hypothesis <inline-formula><mml:math id="M42"><mml:mi>h</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula> for all the tasks, we learn a common initialized model <inline-formula><mml:math id="M44"><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula> across tasks. The task-specific model <italic>h</italic><sub><italic>i</italic></sub> is then obtained via a one-step gradient update for the target task at the <italic>i</italic><sup>th</sup> time stamp, i.e., <inline-formula><mml:math id="M45"><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover><mml:mo>-</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>, where <inline-formula><mml:math id="M46"><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover></mml:math></inline-formula> denote the parameters of <inline-formula><mml:math id="M47"><mml:msub><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover></mml:math></inline-formula>, respectively, and <inline-formula><mml:math id="M48"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> is the meta-learning loss for updating the task-specific model parameters. If we let <inline-formula><mml:math id="M49"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:mfrac><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:munderover><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo>&#x00304;</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>, the following theorem provides the generalization error bound based on meta-learning.</p>
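<p>As a concrete sketch of the one-step gradient update above, the snippet below adapts a shared initialization on one task's data, using a logistic-regression log-loss as a stand-in for the meta-learning loss; the data, step size, and function names are illustrative assumptions.</p>

```python
import numpy as np

def inner_update(theta_bar, X, y, beta=0.1):
    """One-step gradient update of the shared initialization theta_bar on the
    data (X, y) of one task, yielding task-specific parameters theta_i.
    Logistic regression stands in for the meta-learning loss."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta_bar)))   # sigmoid predictions
    grad = X.T @ (p - y) / len(y)                # gradient of the mean log-loss
    return theta_bar - beta * grad

def log_loss(theta, X, y):
    """Mean binary cross-entropy, used here to check that one step helps."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Illustrative task data; theta_bar would come from the meta-learner.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
theta_bar = np.zeros(2)
theta_i = inner_update(theta_bar, X, y)
```

<p>With a small step size, a single gradient step already lowers the task loss relative to the shared initialization, which is the role the inner update plays in the bound of Theorem 4.2.</p>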
<p><bold> Theorem 4.2</bold>. <italic>(Meta-Learning Generalization Error Bound) Let <inline-formula><mml:math id="M50"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula> be a hypothesis space of VC dimension <italic>d</italic>. If there are <italic>m</italic> labeled source examples i.i.d. drawn from <inline-formula><mml:math id="M51"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> (denoted as <inline-formula><mml:math id="M52"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> as well) and <italic>m</italic> unlabeled target examples i.i.d. drawn from <inline-formula><mml:math id="M53"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> for each time stamp <italic>j</italic> &#x0003D; 1, &#x022EF; , <italic>N</italic>&#x0002B;1, then for any &#x003B4;&#x0003E;0 and a proper inner learning rate &#x003B2;, with probability at least 1&#x02212;&#x003B4;, the expected error of the newest target task <inline-formula><mml:math id="M54"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be bounded in the following</italic>.</p>
<disp-formula id="E2"><mml:math id="M55"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msubsup><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo stretchy='true'>(</mml:mo><mml:msubsup><mml:mover accent='true'><mml:mi>&#x003F5;</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>d</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x0210B;</mml:mi><mml:mo>&#x00394;</mml:mo><mml:mi>&#x0210B;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi 
mathvariant='script'>D</mml:mi><mml:mi>j</mml:mi><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy='true'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>+</mml:mo><mml:mi mathvariant='script'>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mi>m</mml:mi></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>m</mml:mi></mml:munderover><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mrow><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mover accent='true'><mml:mi>h</mml:mi><mml:mo stretchy='true'>&#x000AF;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>x</mml:mi></mml:mstyle><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow><mml:mo>|</mml:mo></mml:mrow></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mstyle></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:mi>&#x003BB;</mml:mi><mml:mo>+</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mi>d</mml:mi><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mi>m</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>2</mml:mn><mml:mo>/</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo 
stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:msubsup><mml:mrow><mml:mstyle displaystyle='true'><mml:msubsup><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msubsup><mml:mrow><mml:msubsup><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mi>m</mml:mi></mml:mfrac></mml:mrow></mml:msqrt></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><italic>where <inline-formula><mml:math id="M56"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:math></inline-formula>, and <italic>w</italic><sub><italic>ij</italic></sub>&#x02265;0 if <italic>i</italic>&#x0003C;<italic>j</italic>, <italic>w</italic><sub><italic>ij</italic></sub> &#x0003D; 0 otherwise. <inline-formula><mml:math id="M57"><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula> if 1 &#x02264; <italic>j</italic> &#x02264; <italic>N</italic> and <italic>i</italic>&#x0003C;<italic>j</italic>, and <inline-formula><mml:math id="M58"><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> if <italic>j</italic> &#x0003D; <italic>N</italic>&#x0002B;1 and <italic>i</italic>&#x0003C;<italic>j</italic>, &#x003B7;<sub><italic>ij</italic></sub> &#x0003D; 0 otherwise. Here &#x003BB; denotes the combined error of the ideal hypothesis over all the tasks, i.e., <inline-formula><mml:math id="M59"><mml:mi>&#x003BB;</mml:mi><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mi>h</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo class="qopname">&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:msubsup><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and <inline-formula><mml:math id="M60"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi 
mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the empirical estimate of <inline-formula><mml:math id="M61"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>-divergence over finite examples</italic>.</p>
<p>We observe from Theorem 4.2 that the parameter <italic>w</italic><sub><italic>ij</italic></sub> plays an important role in the generalization error bound of dynamic transfer learning. Intuitively, a higher value of <italic>w</italic><sub><italic>ij</italic></sub> should be assigned to an easy meta-pair of tasks <inline-formula><mml:math id="M62"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> with stronger class discrimination over <inline-formula><mml:math id="M63"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> [i.e., smaller <inline-formula><mml:math id="M64"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>] and smaller distribution shift between <inline-formula><mml:math id="M65"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M66"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> [i.e., smaller <inline-formula><mml:math id="M67"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>].</p>
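<p>This observation suggests a simple weighting heuristic, sketched below: softmax weights over meta-pairs that decay with an estimated pair difficulty (empirical error plus divergence). This is only an illustration of the principle, not the scheduler proposed in the paper, and all names are hypothetical.</p>

```python
import numpy as np

def schedule_weights(eps_hat, d_hat, temp=1.0):
    """Illustrative scheduler: put softmax weight on each meta-pair of tasks
    (i, j) with i preceding j, favoring pairs whose source error eps_hat[i]
    and shift estimate d_hat[i, j] are small.  Heuristic sketch only."""
    T = d_hat.shape[0]
    pairs = [(i, j) for i in range(T) for j in range(i + 1, T)]
    difficulty = np.array([eps_hat[i] + d_hat[i, j] for i, j in pairs])
    probs = np.exp(-difficulty / temp)
    probs /= probs.sum()                  # the w_ij sum to 1, as required
    w = np.zeros((T, T))
    for (i, j), p in zip(pairs, probs):
        w[i, j] = p
    return w
```

<p>Easy pairs thus receive more weight, and the temperature controls how sharply the scheduler concentrates on them.</p>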
</sec>
<sec>
<title>4.2. Connection to existing bounds</title>
<p>The following corollary shows that the error bound in Theorem 4.1 can be generalized by considering various domain discrepancy measures.</p>
<p><bold> Corollary 4.3</bold>. <italic>Under the same assumptions as in Theorem 4.1, for any &#x003B4;&#x0003E;0 and <inline-formula><mml:math id="M68"><mml:mi>h</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula>, there exist <italic>w</italic><sub><italic>ij</italic></sub>&#x02265;0 and &#x003B7;<sub><italic>ij</italic></sub>&#x02265;0 such that, with probability at least 1&#x02212;&#x003B4;, the expected error of the newest target task <inline-formula><mml:math id="M69"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> can be bounded as follows</italic>.</p>
<disp-formula id="E3"><label>(1)</label><mml:math id="M70"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msubsup><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02264;</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B7;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo 
stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003A9;</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><italic>where <inline-formula><mml:math id="M71"><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x000B7;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> can be instantiated with existing distribution discrepancy measures, including discrepancy distance (Mansour et al., <xref ref-type="bibr" rid="B23">2009</xref>), maximum mean discrepancy (Long et al., <xref ref-type="bibr" rid="B19">2015</xref>), Wasserstein distance (Shen et al., <xref ref-type="bibr" rid="B25">2018</xref>), <italic>f</italic>-divergence (Acuna et al., <xref ref-type="bibr" rid="B2">2021</xref>), etc. Here &#x003A9; denotes the corresponding sample complexity when the distribution discrepancy measure is selected</italic>.</p>
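<p>For instance, the maximum mean discrepancy mentioned above admits a simple empirical estimate; the sketch below computes the (biased) squared MMD with an RBF kernel, with an assumed bandwidth parameter gamma, as one possible instantiation of the discrepancy measure in Corollary 4.3.</p>

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased empirical estimate of the squared maximum mean discrepancy
    between samples X and Y under an RBF kernel with bandwidth gamma."""
    def k(A, B):
        # Pairwise squared distances via broadcasting, then the RBF kernel.
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

<p>Identical samples give an estimate of zero, and the estimate grows as the two samples drift apart, mirroring the role of the divergence term in the bound.</p>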
<p>Corollary 4.3 shows the flexibility in generalizing existing static transfer learning theories (Mansour et al., <xref ref-type="bibr" rid="B23">2009</xref>; Ben-David et al., <xref ref-type="bibr" rid="B3">2010</xref>; Ghifary et al., <xref ref-type="bibr" rid="B11">2016</xref>; Shen et al., <xref ref-type="bibr" rid="B25">2018</xref>; Zhang et al., <xref ref-type="bibr" rid="B38">2019</xref>; Acuna et al., <xref ref-type="bibr" rid="B2">2021</xref>) to the dynamic transfer learning setting. Moreover, Corollary 4.3 is closely related to existing generalization error bounds for dynamic transfer learning (Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>; Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>) under different choices of the parameters <italic>w</italic><sub><italic>ij</italic></sub> and &#x003B7;<sub><italic>ij</italic></sub>.</p>
<list list-type="bullet">
<list-item><p>When <italic>w</italic><sub><italic>ij</italic></sub> and &#x003B7;<sub><italic>ij</italic></sub> are given by</p></list-item>
</list>
<disp-formula id="E4"><mml:math id="M72"><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mfrac><mml:mi>&#x003C4;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>N</mml:mi><mml:mtext>&#x000A0;and&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<disp-formula id="E5"><mml:math id="M73"><mml:mrow><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>&#x003C1;</mml:mi><mml:msqrt><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msqrt><mml:mo stretchy='false'>(</mml:mo><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn><mml:mtext>&#x000A0;and&#x000A0;</mml:mtext><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mi>&#x003C1;</mml:mi><mml:msqrt><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msqrt><mml:mo stretchy='false'>(</mml:mo><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo><mml:mo>/</mml:mo><mml:mi>&#x003C4;</mml:mi><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>N</mml:mi><mml:mtext>&#x000A0;and&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>where &#x003C4;&#x02208;&#x0211D;. Then, as &#x003C4; &#x02192; 0, Corollary 4.3 recovers the generalization error bound of Wang et al. (<xref ref-type="bibr" rid="B29">2022</xref>):</p>
<disp-formula id="E6"><mml:math id="M74"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msubsup><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02264;</mml:mo><mml:msup><mml:mi>&#x003F5;</mml:mi><mml:mi>s</mml:mi></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>h</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x003C1;</mml:mi><mml:msqrt><mml:mrow><mml:msup><mml:mi>R</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msqrt><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>d</mml:mi><mml:mrow><mml:msub><mml:mi>W</mml:mi><mml:mi>p</mml:mi></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>+</mml:mo><mml:mi mathvariant='script'>O</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo 
stretchy='false'>)</mml:mo></mml:mrow><mml:mi>m</mml:mi></mml:mfrac></mml:mrow></mml:msqrt><mml:mo>+</mml:mo><mml:mfrac><mml:mi>N</mml:mi><mml:mrow><mml:msqrt><mml:mi>m</mml:mi></mml:msqrt></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msqrt><mml:mrow><mml:mi>m</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:mfrac><mml:mo>+</mml:mo><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mi>log</mml:mi><mml:msup><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>m</mml:mi><mml:mi>N</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mn>3</mml:mn><mml:mi>L</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:msqrt></mml:mrow></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>+</mml:mo><mml:mrow><mml:mrow><mml:msqrt><mml:mrow><mml:mfrac><mml:mrow><mml:mi>log</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mrow></mml:msqrt></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M75"><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:math></inline-formula> is the hypothesis class of <italic>R</italic>-Lipschitz <italic>L</italic>-layer fully-connected neural networks with a 1-Lipschitz activation function.</p>
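<p>To make the case analysis above concrete, the following Python sketch instantiates the chain-structured weights of the two piecewise definitions. The function name <monospace>chain_weights</monospace> and the reading that the <italic>i</italic> = 0 case of <italic>w</italic><sub><italic>ij</italic></sub> applies to the pair (0, 1), matching the chain structure of the &#x003B7;<sub><italic>ij</italic></sub> definition, are our assumptions for illustration only.</p>

```python
import numpy as np

def chain_weights(N, tau, rho, R):
    """Sketch of the piecewise weights w_ij and eta_ij above.

    Indices run over tasks 0..N+1, where task 0 is the source task.
    Assumption: the i = 0 case of w_ij is applied to the pair (0, 1),
    mirroring the chain structure of the eta_ij definition.
    """
    w = np.zeros((N + 2, N + 2))
    eta = np.zeros((N + 2, N + 2))
    scale = rho * np.sqrt(R ** 2 + 1) * (N + 1)
    w[0, 1] = 1.0 / (N + 1)           # source -> first target stamp
    eta[0, 1] = scale
    for i in range(1, N + 1):         # consecutive stamps i -> i + 1
        w[i, i + 1] = tau / (N + 1)   # vanishes as tau -> 0
        eta[i, i + 1] = scale / tau   # blows up as tau -> 0
    return w, eta
```

<p>As &#x003C4; &#x02192; 0 the consecutive-pair weights vanish while their thresholds diverge, so only the direct source-to-target term survives, which is how the static-style bound is recovered.</p>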
<list list-type="bullet">
<list-item><p>When <italic>w</italic><sub><italic>ij</italic></sub> and &#x003B7;<sub><italic>ij</italic></sub> are given by</p></list-item>
</list>
<disp-formula id="E7"><mml:math id="M76"><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow><mml:mtext>&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mi>&#x003B7;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtable columnalign='left'><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>if&#x000A0;</mml:mtext><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn><mml:mo>=</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:mtd></mml:mtr><mml:mtr columnalign='left'><mml:mtd columnalign='left'><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo></mml:mrow></mml:mtd><mml:mtd columnalign='left'><mml:mrow><mml:mtext>otherwise</mml:mtext></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:mrow></mml:mrow></mml:math></disp-formula>
<p>Then, Corollary 4.3 recovers the generalization error bound of Wu and He (<xref ref-type="bibr" rid="B34">2022b</xref>):</p>
<disp-formula id="E8"><label>(2)</label><mml:math id="M77"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msubsup><mml:mi>&#x003F5;</mml:mi><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02264;</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>&#x003F5;</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:mi>h</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mover accent='true'><mml:mi>d</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x0210B;</mml:mi><mml:mo>&#x00394;</mml:mo><mml:mi>&#x0210B;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi 
mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mo>+</mml:mo><mml:msub><mml:mi>&#x003A9;</mml:mi><mml:mi>L</mml:mi></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003A9;<sub><italic>L</italic></sub> is a Rademacher complexity term.</p>
<p>Compared to existing theoretical results (Wang et al., <xref ref-type="bibr" rid="B29">2022</xref>; Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), with appropriate <italic>w</italic><sub><italic>ij</italic></sub>, our generalization error bound in Corollary 4.3 is substantially tighter when there exists some time stamp <italic>i</italic> such that <inline-formula><mml:math id="M79"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is large. This motivates us to develop a progressive meta-task scheduler in the meta-learning framework for dynamic transfer learning. 
The crucial idea is to automatically learn the values <italic>w</italic><sub><italic>ij</italic></sub>, based on the intuition that assigning a large weight <italic>w</italic><sub><italic>ij</italic></sub> to an easy meta-pair of tasks <inline-formula><mml:math id="M80"><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> makes the error bound tighter.</p>
</sec>
</sec>
<sec sec-type="methods" id="s5">
<title>5. Methodology</title>
<p>Following Wu and He (<xref ref-type="bibr" rid="B34">2022b</xref>), we propose a meta-learning framework named <monospace>L2S</monospace> for dynamic transfer learning by empirically minimizing the error bound in Theorem 4.2. Instead of uniformly sampling the meta-pairs of tasks from consecutive time stamps (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), in this paper we learn a progressive meta-task scheduler that automatically formulates the meta-pairs of tasks from the dynamic target task.</p>
<p>The overall objective function of <monospace>L2S</monospace> for learning the prediction function of <inline-formula><mml:math id="M81"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> on the (<italic>N</italic>&#x0002B;1)<sup>th</sup> time stamp is given as follows.</p>
<disp-formula id="E9"><label>(3)</label><mml:math id="M82"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:munder><mml:mi>min</mml:mi><mml:mi>&#x003B8;</mml:mi></mml:munder><mml:munder><mml:mrow><mml:mi>&#x000A0;min</mml:mi></mml:mrow><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle></mml:munder><mml:mi mathvariant='script'>&#x000A0;J</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>,</mml:mo><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>w</mml:mi></mml:mstyle><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mover accent='true'><mml:mi>&#x003F5;</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mover 
accent='true'><mml:mi>d</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mi>&#x0210B;</mml:mi><mml:mo>&#x00394;</mml:mo><mml:mi>&#x0210B;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>j</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>s.t.&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>N</mml:mi></mml:munderover><mml:mrow><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:munderover><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle></mml:mrow></mml:mstyle><mml:mo>=</mml:mo><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mtext>s.t.&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo 
stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:msup><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>j</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B8; denotes the trainable parameters and <inline-formula><mml:math id="M83"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the meta-training loss. &#x003B7;&#x02265;0 is a hyper-parameter balancing the classification error and the discrepancy minimization.</p>
<p>The proposed <monospace>L2S</monospace> framework has three crucial components: meta-pairs of tasks, meta-training, and meta-testing. The overall training procedures of <monospace>L2S</monospace> are illustrated in <xref ref-type="fig" rid="F6">Algorithm 1</xref>.</p>
<list list-type="bullet">
<list-item><p><bold>Meta-Pairs of Tasks:</bold> Following the theoretical results in Section 4.1, we formulate the candidate meta-pairs of tasks from any two different time stamps <inline-formula><mml:math id="M84"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (<italic>i</italic>&#x0003C;<italic>j</italic>). It can be considered as a simple knowledge transfer from <inline-formula><mml:math id="M85"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> to <inline-formula><mml:math id="M86"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. Here we simply denote the source task <inline-formula><mml:math id="M87"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> as <inline-formula><mml:math id="M88"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. 
Since we focus on learning the prediction function on the target task at a new time stamp, we consider the knowledge transfer from an old time stamp <italic>i</italic> to a new time stamp <italic>j</italic>, i.e., <italic>i</italic>&#x0003C;<italic>j</italic>. Note that as suggested in Theorem 4.2, those candidate meta-pairs of tasks might not have equal sampling probability for meta-training. Therefore, we propose a progressive meta-pair scheduler to incrementally learn the sampling probability of every candidate meta-pair of tasks.</p></list-item>
</list>
<list list-type="simple">
<list-item><p>As shown in Theorem 4.2, the sampling probability <italic>w</italic><sub><italic>ij</italic></sub> is strongly related to the classification error on <inline-formula><mml:math id="M89"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and the empirical distribution discrepancy between <inline-formula><mml:math id="M90"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M91"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. However, since we have only unlabeled training examples for the target task, it is intractable to accurately estimate the classification error on <inline-formula><mml:math id="M92"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> (<italic>i</italic> &#x0003D; 1, 2, &#x022EF; ) for the target task. One solution is to incrementally estimate the pseudo-labels of unlabeled target examples and then obtain the classification error using these pseudo-labels. However, this approach is largely affected by the quality of the pseudo-labels. 
Instead, in this paper, we simply learn the sampling probability using the empirical distribution discrepancy between <inline-formula><mml:math id="M93"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M94"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> because this distribution discrepancy involves only the unlabeled examples. That is, the sampling probability <italic>w</italic><sub><italic>ij</italic></sub> is learned as follows.</p></list-item>
</list>
<disp-formula id="E10"><label>(4)</label><mml:math id="M95"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mo class="qopname">exp</mml:mo><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<list list-type="simple">
<list-item><p>where &#x00393; is a normalization term. This indicates that a meta-pair of tasks with a smaller distribution discrepancy has a larger probability of being sampled for meta-training. Intuitively, a smaller distribution discrepancy facilitates knowledge transfer across tasks (Ganin et al., <xref ref-type="bibr" rid="B10">2016</xref>; Zhang et al., <xref ref-type="bibr" rid="B38">2019</xref>). Therefore, we can sample a set of meta-pairs of tasks <inline-formula><mml:math id="M96"><mml:mrow><mml:mi mathvariant="script">S</mml:mi></mml:mrow></mml:math></inline-formula> according to these sampling probabilities for meta-training.</p></list-item>
</list>
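<p>A minimal sketch of the sampling rule in Equation (4), assuming the pairwise discrepancies have already been estimated for every candidate meta-pair; the dictionary-based interface and the name <monospace>pair_sampling_probs</monospace> are illustrative.</p>

```python
import math

def pair_sampling_probs(discrepancies):
    """Map estimated discrepancies d_ij (keyed by candidate pair (i, j),
    i < j) to sampling probabilities w_ij = exp(1 / d_ij) / Gamma,
    where Gamma normalizes the probabilities to sum to one."""
    scores = {pair: math.exp(1.0 / d) for pair, d in discrepancies.items()}
    gamma = sum(scores.values())  # normalization term Gamma
    return {pair: s / gamma for pair, s in scores.items()}
```

<p>A pair with a smaller estimated discrepancy receives a larger sampling probability, as intended by the scheduler.</p>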
<list list-type="bullet">
<list-item><p><bold>Meta-Training:</bold> Following Wu and He (<xref ref-type="bibr" rid="B34">2022b</xref>), the meta-training over meta-pairs of tasks is given as follows. Let <inline-formula><mml:math id="M97"><mml:msub><mml:mrow><mml:mi>&#x003B6;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi 
mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> be the loss function over the validation set on a meta-pair of tasks. Then the model initialization &#x003B8; can be learned by</p></list-item>
</list>
<disp-formula id="E11"><label>(5)</label><mml:math id="M98"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:mtext>&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;</mml:mtext><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>arg</mml:mi><mml:munder><mml:mrow><mml:mi>&#x000A0;min</mml:mi></mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:munder><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>S</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:msub><mml:mi>&#x003B6;</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mstyle><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mi>M</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:msup><mml:mi>&#x02112;</mml:mi><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>i</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi mathvariant='script'>D</mml:mi><mml:mi>j</mml:mi><mml:mi>t</mml:mi></mml:msubsup><mml:mo stretchy='false'>)</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<list list-type="simple">
<list-item><p>where <italic>M</italic><sub><italic>ij</italic></sub>:&#x003B8; &#x02192; &#x003B8;<sub><italic>ij</italic></sub> is a function which maps the model initialization &#x003B8; into the optimal task-specific parameter &#x003B8;<sub><italic>ij</italic></sub>. Similar to the model-agnostic meta-learning (MAML) (Finn et al., <xref ref-type="bibr" rid="B7">2017</xref>), <italic>M</italic><sub><italic>ij</italic></sub>(&#x003B8;) can be instantiated by one or a few gradient descent updates in practice. In this case, the meta-training loss is given by <inline-formula><mml:math id="M99"><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow><mml:mo>&#x00394;</mml:mo><mml:mrow><mml:mi mathvariant="script">H</mml:mi></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> over the training set.</p></list-item>
<list-item><p>As illustrated in <xref ref-type="fig" rid="F6">Algorithm 1</xref>, the predictive function is incrementally learned for the target task at every historical time stamp, and then the pseudo-labels of unlabeled target examples can be inferred.</p></list-item>
</list>
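<p>As a minimal illustrative sketch (not the authors' implementation), the inner mapping <italic>M</italic><sub><italic>ij</italic></sub>(&#x003B8;) and the meta-training loss above amount to one MAML-style gradient step followed by a task loss plus a weighted divergence term. The function names (<monospace>inner_adapt</monospace>, <monospace>meta_loss</monospace>), the one-parameter toy task, and the step size are assumptions made for exposition.</p>

```python
def inner_adapt(theta, grad_fn, alpha=0.25, steps=1):
    """M_ij(theta): map the shared initialization theta to task-specific
    parameters via one or a few gradient-descent updates (MAML-style)."""
    for _ in range(steps):
        theta = theta - alpha * grad_fn(theta)
    return theta

def meta_loss(theta_adapted, task_loss, divergence, eta=1.0):
    """Meta-training loss on a meta-pair: the task error evaluated at the
    adapted parameters plus eta times the estimated task divergence."""
    return task_loss(theta_adapted) + eta * divergence

# Toy one-parameter task: loss (theta - 3)^2 with gradient 2 * (theta - 3).
task_loss = lambda th: (th - 3.0) ** 2
task_grad = lambda th: 2.0 * (th - 3.0)

theta_ij = inner_adapt(0.0, task_grad)                     # 0 - 0.25 * (-6) = 1.5
loss_val = meta_loss(theta_ij, task_loss, divergence=0.1)  # 2.25 + 0.1 = 2.35
```

<p>In the full framework, <monospace>grad_fn</monospace> would be the gradient of the classification loss on the meta-pair, and <monospace>divergence</monospace> the estimated distribution distance between the two tasks.</p>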
<list list-type="bullet">
<list-item><p><bold>Meta-Testing:</bold> The optimal parameters &#x003B8;<sub><italic>N</italic>&#x0002B;1</sub> on the newest target task <inline-formula><mml:math id="M100"><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> could be learned by fine-tuning the optimal model initialization &#x003B8; on a selective meta-pair of tasks <inline-formula><mml:math id="M101"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>.</p></list-item>
</list>
<disp-formula id="E12"><label>(6)</label><mml:math id="M102"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>M</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mi>a</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<list list-type="simple">
<list-item><p>where &#x003B8; is the optimized model initialization learned in the meta-training phase. Here we choose the meta-pair of tasks <inline-formula><mml:math id="M103"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> by estimating the sampling probability <italic>w</italic><sub><italic>k</italic>(<italic>N</italic>&#x0002B;1)</sub> (<italic>k</italic> &#x0003D; 0, 1, &#x022EF; , <italic>N</italic>) and choosing <italic>k</italic> with the largest value <italic>w</italic><sub><italic>k</italic>(<italic>N</italic>&#x0002B;1)</sub>.</p></list-item>
</list>
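<p>The meta-testing step above, i.e., selecting the candidate task index <italic>k</italic> with the largest sampling probability <italic>w</italic><sub><italic>k</italic>(<italic>N</italic>&#x0002B;1)</sub> and then fine-tuning the initialization &#x003B8; with a gradient step as in Equation (6), can be sketched as follows. This is a hedged illustration with hypothetical names (<monospace>select_meta_pair</monospace>, <monospace>meta_test_finetune</monospace>) and toy numbers, not the authors' code.</p>

```python
import numpy as np

def select_meta_pair(sampling_probs):
    """Progressive scheduler at meta-test time: among the candidate tasks
    k = 0, 1, ..., N, pick the index with the largest estimated sampling
    probability w_{k(N+1)} to pair with the newest target task."""
    return int(np.argmax(sampling_probs))

def meta_test_finetune(theta, meta_grad, beta=0.01):
    """Equation (6): theta_{N+1} <- theta - beta * (gradient of the meta
    loss on the selected meta-pair of tasks)."""
    return theta - beta * meta_grad

w = np.array([0.10, 0.05, 0.60, 0.25])   # hypothetical w_{k(N+1)}, k = 0..3
k = select_meta_pair(w)                  # index of the largest weight -> 2
theta_new = meta_test_finetune(1.0, meta_grad=5.0)  # 1.0 - 0.01 * 5.0 = 0.95
```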
<fig id="F6" position="float">
<label>Algorithm 1</label>
<caption><p>Learning to Schedule (<monospace>L2S</monospace>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1052972-g0006.tif"/>
</fig>
</sec>
<sec id="s6">
<title>6. Experiments</title>
<p>In this section, we provide an empirical analysis of the <monospace>L2S</monospace> framework on various data sets.</p>
<sec>
<title>6.1. Experimental setup</title>
<p>We used the following publicly available image data sets:</p>
<list list-type="bullet">
<list-item><p>Rotating MNIST (Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>): The original MNIST (LeCun et al., <xref ref-type="bibr" rid="B17">1998</xref>) is a handwritten digit image data set with 60,000 images from 10 categories. Rotating MNIST is a semi-synthetic version of MNIST in which each image is rotated by some angle. Following Bobu et al. (<xref ref-type="bibr" rid="B5">2018</xref>) and Kumar et al. (<xref ref-type="bibr" rid="B16">2020</xref>), we rotate each image by an angle to generate the time-evolving classification task. More specifically, for the source task, we randomly choose 32 images and rotate them by an angle between 0 and 10 degrees. All the images in the source task are associated with class labels. For the time-evolving target task, we randomly choose 32 images at every time stamp <italic>j</italic> (<italic>j</italic> &#x0003D; 1, &#x022EF; , 35) and rotate them by an angle between 10&#x000B7;<italic>j</italic> and 10&#x000B7;(<italic>j</italic>&#x0002B;1) degrees. In this case, the data distribution of the target task continuously evolves over time; we therefore denote this version of Rotating MNIST as the data set &#x0201C;with continuous evolvement.&#x0201D; In contrast, we also consider dynamic transfer learning scenarios &#x0201C;with large distribution shift,&#x0201D; where the samples at the last 18 time stamps of the target task are randomly shuffled. That is, the target task might not evolve smoothly with respect to the rotation degree.</p></list-item>
<list-item><p>ImageCLEF-DA (Long et al., <xref ref-type="bibr" rid="B20">2017</xref>): ImageCLEF-DA has three image classification tasks: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). Following Wu and He (<xref ref-type="bibr" rid="B34">2022b</xref>), we generate the time-evolving target task by adding random noise and rotation to the original images. For example, if we consider Caltech-256 (C) as the target task, we can generate a time-evolving target task by rotating the original images of Caltech-256 by a degree <italic>O</italic><sub><italic>d</italic></sub>(<italic>j</italic>) (where <italic>j</italic> &#x0003D; 1, 2, &#x022EF; , 5 is the time stamp) and adding random salt-and-pepper noise with magnitude <italic>O</italic><sub><italic>n</italic></sub>(<italic>j</italic>), i.e., <italic>O</italic><sub><italic>d</italic></sub>(<italic>j</italic>) &#x0003D; 15&#x000B7;(<italic>j</italic>&#x02212;1), <italic>O</italic><sub><italic>n</italic></sub>(<italic>j</italic>) &#x0003D; 0.01&#x000B7;(<italic>j</italic>&#x02212;1), and <italic>N</italic> &#x0003D; 4.</p></list-item>
</list>
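<p>The rotation schedule that drives the time-evolving target task, together with the shuffled variant used for the &#x0201C;large distribution shift&#x0201D; setting, can be sketched as below. The sketch only produces rotation angles per time stamp (no actual images are rotated), and the function names are hypothetical.</p>

```python
import random

def target_angle(j, rng=random):
    """Rotation angle (in degrees) for a target image at time stamp j:
    uniform in [10*j, 10*(j+1)], so the target drifts smoothly over time."""
    return rng.uniform(10.0 * j, 10.0 * (j + 1))

def shuffle_last_stamps(stamps, n_tail=18, rng=random):
    """'Large distribution shift' variant: randomly permute the last
    n_tail time stamps so the target no longer evolves smoothly."""
    head, tail = stamps[:-n_tail], stamps[-n_tail:]
    rng.shuffle(tail)  # shuffle the slice copy in place
    return head + tail

angles = [target_angle(j) for j in range(1, 36)]    # 35 target time stamps
shuffled = shuffle_last_stamps(list(range(1, 36)))  # last 18 stamps permuted
```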
<p>Following Bobu et al. (<xref ref-type="bibr" rid="B5">2018</xref>) and Wu and He (<xref ref-type="bibr" rid="B34">2022b</xref>), we report both the classification accuracy on the newest target task (Acc) and the average classification accuracy on the historical target tasks (H-Acc) in the experiments. The comparison baselines used in the experiments include: (1) static transfer learning approaches: SourceOnly, DAN (Long et al., <xref ref-type="bibr" rid="B19">2015</xref>), DANN (Ganin et al., <xref ref-type="bibr" rid="B10">2016</xref>), and MDD (Zhang et al., <xref ref-type="bibr" rid="B38">2019</xref>); and (2) dynamic transfer learning approaches: CUA (Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>), GST (Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>), L2E (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), and our proposed <monospace>L2S</monospace> framework. For a fair comparison, all the methods use the same base models for feature extraction, e.g., LeNet for Rotating MNIST and ResNet-18 (He et al., <xref ref-type="bibr" rid="B13">2016</xref>) for ImageCLEF-DA. In addition, we set &#x003B7; &#x0003D; 1, &#x003B2; &#x0003D; 0.01, and the number of inner epochs in <italic>M</italic><sub><italic>ij</italic></sub>(&#x003B8;) to 1. All the experiments are performed on a Windows machine with four 3.80 GHz Intel cores, 64 GB RAM, and two NVIDIA Quadro RTX 5000 GPUs.</p>
</sec>
<sec>
<title>6.2. Results</title>
<p><xref ref-type="fig" rid="F3">Figures 3</xref>, <xref ref-type="fig" rid="F4">4</xref> show the distribution shift in the dynamic transfer learning tasks, where &#x0201C;S-T&#x0201D; denotes the distribution difference <inline-formula><mml:math id="M104"><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> between the source and the target at every time stamp, and &#x0201C;T-T&#x0201D; denotes the distribution difference <inline-formula><mml:math id="M105"><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> of the target at consecutive time stamps. Here we use the maximum mean discrepancy (MMD) (Gretton et al., <xref ref-type="bibr" rid="B12">2012</xref>) to measure the distribution difference across tasks. 
We see that when the target task is continuously evolving over time, <inline-formula><mml:math id="M106"><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is small. This enables gradual knowledge transfer within the target task. If there exists a large distribution shift at some time stamps, i.e., <inline-formula><mml:math id="M107"><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mrow><mml:mi mathvariant="script">D</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is large, the strategy of gradual knowledge transfer might fail. In <xref ref-type="fig" rid="F3">Figures 3</xref>, <xref ref-type="fig" rid="F4">4</xref>, the large distribution shift occurs at time stamps 17&#x02013;35 on Rotating MNIST and at time stamp 1 on I &#x02192; C/P.</p>
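<p>As a concrete sketch of the distribution-shift measurement above, a biased estimate of the squared MMD under an RBF kernel can be computed as follows. The kernel bandwidth <monospace>gamma</monospace> and the function name are assumptions; the exact estimator and kernel choice in our implementation may differ.</p>

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between sample
    sets X and Y under the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    def gram(A, B):
        # Pairwise squared Euclidean distances, then the kernel matrix.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, (200, 2))  # samples from one task
b = rng.normal(0.0, 1.0, (200, 2))  # same distribution, fresh draw
c = rng.normal(3.0, 1.0, (200, 2))  # shifted distribution
# mmd2_rbf(a, b) stays near zero, while mmd2_rbf(a, c) is clearly positive.
```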
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Rotating MNIST with <bold>(A)</bold> continuous evolvement and <bold>(B)</bold> large distribution shift.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1052972-g0003.tif"/>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>I &#x02192; C on ImageCLEF-DA with <bold>(A)</bold> continuous evolvement, <bold>(B)</bold> large distribution shift. I &#x02192; P on ImageCLEF-DA with <bold>(C)</bold> continuous evolvement, <bold>(D)</bold> large distribution shift.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1052972-g0004.tif"/>
</fig>
<p><xref ref-type="table" rid="T1">Tables 1</xref>, <xref ref-type="table" rid="T2">2</xref> provide the experimental results of <monospace>L2S</monospace> as well as the baselines on the Rotating MNIST and ImageCLEF-DA data sets. We have the following observations from the results. On the one hand, when the target task is continuously evolving over time, most dynamic transfer learning baselines can achieve satisfactory performance on both the newest and historical target tasks. The baseline GST (Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>) fails on Rotating MNIST, because the self-training approach is more likely to accumulate classification errors when the target task evolves for a long time. On the other hand, the performance of CUA (Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>) and L2E (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>) drops significantly when there is a large distribution shift within the target task at some time stamp. In contrast, by adaptively selecting the meta-pairs of tasks, the proposed <monospace>L2S</monospace> framework can mitigate the issue of a potential large distribution shift in the target task. Specifically, compared to L2E (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>), <monospace>L2S</monospace> improves the performance by a large margin. This confirms the efficacy of the proposed progressive meta-pair scheduler.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Results of dynamic transfer learning on Rotating MNIST.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>With continuous evolvement</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>With large distribution shift</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>Acc</bold></th>
<th valign="top" align="center"><bold>H-Acc</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
<th valign="top" align="center"><bold>H-Acc</bold></th>
</tr>
</thead>
<tbody><tr>
<td valign="top" align="left">SourceOnly</td>
<td valign="top" align="center">1.0000</td>
<td valign="top" align="center">0.4393</td>
<td valign="top" align="center">0.3437</td>
<td valign="top" align="center">0.4393</td>
</tr>
<tr>
<td valign="top" align="left">DAN (Long et al., <xref ref-type="bibr" rid="B19">2015</xref>)</td>
<td valign="top" align="center">1.0000</td>
<td valign="top" align="center">0.4518</td>
<td valign="top" align="center">0.5625</td>
<td valign="top" align="center">0.4830</td>
</tr>
<tr>
<td valign="top" align="left">DANN (Ganin et al., <xref ref-type="bibr" rid="B10">2016</xref>)</td>
<td valign="top" align="center">1.0000</td>
<td valign="top" align="center">0.3884</td>
<td valign="top" align="center">0.3750</td>
<td valign="top" align="center">0.4000</td>
</tr>
<tr>
<td valign="top" align="left">MDD (Zhang et al., <xref ref-type="bibr" rid="B38">2019</xref>)</td>
<td valign="top" align="center">1.0000</td>
<td valign="top" align="center">0.4250</td>
<td valign="top" align="center">0.4063</td>
<td valign="top" align="center">0.4482</td>
</tr>
<tr>
<td valign="top" align="left">CUA (Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>)</td>
<td valign="top" align="center">0.9375</td>
<td valign="top" align="center">0.9277</td>
<td valign="top" align="center">0.4375</td>
<td valign="top" align="center">0.8259</td>
</tr>
<tr>
<td valign="top" align="left">GST (Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>)</td>
<td valign="top" align="center">0.0625</td>
<td valign="top" align="center">0.1062</td>
<td valign="top" align="center">0.1250</td>
<td valign="top" align="center">0.2259</td>
</tr>
<tr>
<td valign="top" align="left">L2E (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>)</td>
<td valign="top" align="center">0.9688</td>
<td valign="top" align="center">0.9795</td>
<td valign="top" align="center">0.6250</td>
<td valign="top" align="center">0.7179</td>
</tr>
<tr>
<td valign="top" align="left"><monospace>L2S</monospace> </td>
<td valign="top" align="center"><bold>1.0000</bold></td>
<td valign="top" align="center"><bold>0.9991</bold></td>
<td valign="top" align="center"><bold>0.9687</bold></td>
<td valign="top" align="center"><bold>0.9116</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best results are indicated in bold.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Results of dynamic transfer learning on ImageCLEF-DA.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>With continuous evolvement</bold></th>
<th valign="top" align="center" colspan="4" style="border-bottom: thin solid #000000;"><bold>With large distribution shift</bold></th>
</tr>
<tr>
<td/>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>I</bold>&#x02192;<bold>C</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>I</bold>&#x02192;<bold>P</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>I</bold>&#x02192;<bold>C</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>I</bold>&#x02192;<bold>P</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="center"><bold>Acc</bold></th>
<th valign="top" align="center"><bold>H-Acc</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
<th valign="top" align="center"><bold>H-Acc</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
<th valign="top" align="center"><bold>H-Acc</bold></th>
<th valign="top" align="center"><bold>Acc</bold></th>
<th valign="top" align="center"><bold>H-Acc</bold></th>
</tr>
</thead>
<tbody><tr>
<td valign="top" align="left">SourceOnly</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4250</td>
<td valign="top" align="center">0.2812</td>
<td valign="top" align="center">0.3938</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4125</td>
<td valign="top" align="center">0.2187</td>
<td valign="top" align="center">0.2562</td>
</tr>
<tr>
<td valign="top" align="left">DAN (Long et al., <xref ref-type="bibr" rid="B19">2015</xref>)</td>
<td valign="top" align="center">0.2500</td>
<td valign="top" align="center">0.4000</td>
<td valign="top" align="center">0.2187</td>
<td valign="top" align="center">0.2688</td>
<td valign="top" align="center">0.3750</td>
<td valign="top" align="center">0.3750</td>
<td valign="top" align="center">0.2500</td>
<td valign="top" align="center">0.2625</td>
</tr>
<tr>
<td valign="top" align="left">DANN (Ganin et al., <xref ref-type="bibr" rid="B10">2016</xref>)</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4438</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4188</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4125</td>
<td valign="top" align="center">0.1875</td>
<td valign="top" align="center">0.2750</td>
</tr>
<tr>
<td valign="top" align="left">MDD (Zhang et al., <xref ref-type="bibr" rid="B38">2019</xref>)</td>
<td valign="top" align="center">0.3437</td>
<td valign="top" align="center">0.4750</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4562</td>
<td valign="top" align="center">0.3125</td>
<td valign="top" align="center">0.4062</td>
<td valign="top" align="center">0.2500</td>
<td valign="top" align="center">0.3188</td>
</tr>
<tr>
<td valign="top" align="left">CUA (Bobu et al., <xref ref-type="bibr" rid="B5">2018</xref>)</td>
<td valign="top" align="center">0.4063</td>
<td valign="top" align="center">0.5125</td>
<td valign="top" align="center">0.5312</td>
<td valign="top" align="center">0.5438</td>
<td valign="top" align="center">0.4375</td>
<td valign="top" align="center">0.4625</td>
<td valign="top" align="center">0.3437</td>
<td valign="top" align="center">0.4000</td>
</tr>
<tr>
<td valign="top" align="left">GST (Kumar et al., <xref ref-type="bibr" rid="B16">2020</xref>)</td>
<td valign="top" align="center">0.5000</td>
<td valign="top" align="center">0.5312</td>
<td valign="top" align="center">0.4375</td>
<td valign="top" align="center">0.4312</td>
<td valign="top" align="center">0.2812</td>
<td valign="top" align="center">0.3062</td>
<td valign="top" align="center">0.2500</td>
<td valign="top" align="center">0.2562</td>
</tr>
<tr>
<td valign="top" align="left">L2E (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>)</td>
<td valign="top" align="center"><bold>0.5625</bold></td>
<td valign="top" align="center"><bold>0.6875</bold></td>
<td valign="top" align="center">0.5625</td>
<td valign="top" align="center">0.5875</td>
<td valign="top" align="center">0.3750</td>
<td valign="top" align="center">0.4812</td>
<td valign="top" align="center">0.3750</td>
<td valign="top" align="center"><bold>0.4812</bold></td>
</tr>
<tr>
<td valign="top" align="left"><monospace>L2S</monospace> </td>
<td valign="top" align="center"><bold>0.5625</bold></td>
<td valign="top" align="center">0.6125</td>
<td valign="top" align="center"><bold>0.6562</bold></td>
<td valign="top" align="center"><bold>0.6188</bold></td>
<td valign="top" align="center"><bold>0.4375</bold></td>
<td valign="top" align="center"><bold>0.5500</bold></td>
<td valign="top" align="center"><bold>0.4375</bold></td>
<td valign="top" align="center"><bold>0.4812</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The best results are indicated in bold.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.3. Analysis</title>
<p>We provide an ablation study of our <monospace>L2S</monospace> framework with respect to the number of inner training epochs. The results on the newest target task of Rotating MNIST are shown in <xref ref-type="fig" rid="F5">Figure 5</xref>, where we use 1 or 5 inner epochs for our meta-learning framework. We see that using more inner epochs can improve the convergence of <monospace>L2S</monospace>, but it sacrifices classification accuracy on the historical target tasks. This is because <monospace>L2S</monospace> with more inner epochs forces the fine-tuned model to be more task-specific. Thus, we set the number of inner epochs to 1 in our experiments.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Ablation study with different number of inner epochs. <bold>(A)</bold> Training loss. <bold>(B)</bold> Acc. <bold>(C)</bold> H-Acc.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1052972-g0005.tif"/>
</fig></sec>
</sec>
<sec sec-type="conclusions" id="s7">
<title>7. Conclusion</title>
<p>In this paper, we study the problem of dynamic transfer learning from a labeled source task to an unlabeled, time-evolving target task. We start by deriving generalization error bounds for dynamic transfer learning in which the meta-pairs of tasks are assigned different weights. This allows us to provide a tighter error bound when there is a large distribution shift of the target task at some time stamp. We then develop <monospace>L2S</monospace>, a novel meta-learning framework with a progressive meta-task scheduler for dynamic transfer learning. Extensive experiments on several image data sets demonstrate the effectiveness of the proposed <monospace>L2S</monospace> framework over state-of-the-art baselines.</p>
</sec>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>; further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s9">
<title>Author contributions</title>
<p>JW and JH worked together to develop a new theoretical understanding and algorithms for dynamic transfer learning. Both authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>This work was supported by the National Science Foundation under Award Nos. IIS-1947203, IIS-2117902, and IIS-2137468, and by Agriculture and Food Research Initiative (AFRI) Grant No. 2020-67021-32799/project accession no. 1024178 from the USDA National Institute of Food and Agriculture.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s12">
<title>Author disclaimer</title>
<p>The views and conclusions are those of the authors and should not be interpreted as representing the official policies of the funding agencies or the government.</p>
</sec>
</body>
<back><sec sec-type="supplementary-material" id="s13">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2022.1052972/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2022.1052972/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Acar</surname> <given-names>D. A. E.</given-names></name> <name><surname>Zhu</surname> <given-names>R.</given-names></name> <name><surname>Saligrama</surname> <given-names>V.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Memory efficient online meta learning,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>32</fpage>&#x02013;<lpage>42</lpage>.</citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Acuna</surname> <given-names>D.</given-names></name> <name><surname>Zhang</surname> <given-names>G.</given-names></name> <name><surname>Law</surname> <given-names>M. T.</given-names></name> <name><surname>Fidler</surname> <given-names>S.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;<italic>f</italic>-domain adversarial learning: theory and algorithms,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>66</fpage>&#x02013;<lpage>75</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ben-David</surname> <given-names>S.</given-names></name> <name><surname>Blitzer</surname> <given-names>J.</given-names></name> <name><surname>Crammer</surname> <given-names>K.</given-names></name> <name><surname>Kulesza</surname> <given-names>A.</given-names></name> <name><surname>Pereira</surname> <given-names>F.</given-names></name> <name><surname>Vaughan</surname> <given-names>J. W.</given-names></name></person-group> (<year>2010</year>). <article-title>A theory of learning from different domains</article-title>. <source>Mach. Learn</source>. <volume>79</volume>, <fpage>151</fpage>&#x02013;<lpage>175</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-009-5152-4</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bitarafan</surname> <given-names>A.</given-names></name> <name><surname>Baghshah</surname> <given-names>M. S.</given-names></name> <name><surname>Gheisari</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Incremental evolving domain adaptation</article-title>. <source>IEEE Trans. Knowl. Data Eng</source>. <volume>28</volume>, <fpage>2128</fpage>&#x02013;<lpage>2141</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2016.2551241</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bobu</surname> <given-names>A.</given-names></name> <name><surname>Tzeng</surname> <given-names>E.</given-names></name> <name><surname>Hoffman</surname> <given-names>J.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Adapting to continuously shifting domains,&#x0201D;</article-title> in <source>International Conference on Learning Representations Workshop</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>H.-Y.</given-names></name> <name><surname>Chao</surname> <given-names>W.-L.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Gradual domain adaptation without indexed intermediate domains,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 34</source>, <fpage>8201</fpage>&#x02013;<lpage>8214</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Finn</surname> <given-names>C.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Model-agnostic meta-learning for fast adaptation of deep networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>1126</fpage>&#x02013;<lpage>1135</lpage>.<pub-id pub-id-type="pmid">35653901</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Finn</surname> <given-names>C.</given-names></name> <name><surname>Xu</surname> <given-names>K.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Probabilistic model-agnostic meta-learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), Vol. <volume>31</volume>.</citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Finn</surname> <given-names>C.</given-names></name> <name><surname>Rajeswaran</surname> <given-names>A.</given-names></name> <name><surname>Kakade</surname> <given-names>S.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Online meta-learning,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>1920</fpage>&#x02013;<lpage>1930</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ganin</surname> <given-names>Y.</given-names></name> <name><surname>Ustinova</surname> <given-names>E.</given-names></name> <name><surname>Ajakan</surname> <given-names>H.</given-names></name> <name><surname>Germain</surname> <given-names>P.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Laviolette</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Domain-adversarial training of neural networks</article-title>. <source>J. Mach. Learn. Res</source>. <volume>17</volume>, <fpage>2096</fpage>&#x02013;<lpage>2030</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-58347-1_10</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ghifary</surname> <given-names>M.</given-names></name> <name><surname>Balduzzi</surname> <given-names>D.</given-names></name> <name><surname>Kleijn</surname> <given-names>W. B.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Scatter component analysis: a unified framework for domain adaptation and domain generalization</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>39</volume>, <fpage>1414</fpage>&#x02013;<lpage>1430</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2016.2599532</pub-id><pub-id pub-id-type="pmid">28113617</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gretton</surname> <given-names>A.</given-names></name> <name><surname>Borgwardt</surname> <given-names>K. M.</given-names></name> <name><surname>Rasch</surname> <given-names>M. J.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name> <name><surname>Smola</surname> <given-names>A.</given-names></name></person-group> (<year>2012</year>). <article-title>A kernel two-sample test</article-title>. <source>J. Mach. Learn. Res</source>. <volume>13</volume>, <fpage>723</fpage>&#x02013;<lpage>773</lpage>.</citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep residual learning for image recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hoffman</surname> <given-names>J.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name> <name><surname>Saenko</surname> <given-names>K.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Continuous manifold based adaptation for evolving visual domains,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Columbus, OH</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>867</fpage>&#x02013;<lpage>874</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hospedales</surname> <given-names>T. M.</given-names></name> <name><surname>Antoniou</surname> <given-names>A.</given-names></name> <name><surname>Micaelli</surname> <given-names>P.</given-names></name> <name><surname>Storkey</surname> <given-names>A. J.</given-names></name></person-group> (<year>2021</year>). <article-title>Meta-learning in neural networks: a survey</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>44</volume>, <fpage>5149</fpage>&#x02013;<lpage>5169</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2021.3079209</pub-id><pub-id pub-id-type="pmid">33974543</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>A.</given-names></name> <name><surname>Ma</surname> <given-names>T.</given-names></name> <name><surname>Liang</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Understanding self-training for gradual domain adaptation,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>5468</fpage>&#x02013;<lpage>5479</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Haffner</surname> <given-names>P.</given-names></name></person-group> (<year>1998</year>). <article-title>Gradient-based learning applied to document recognition</article-title>. <source>Proc. IEEE</source> <volume>86</volume>, <fpage>2278</fpage>&#x02013;<lpage>2324</lpage>. <pub-id pub-id-type="doi">10.1109/5.726791</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>H.</given-names></name> <name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Learning to adapt to evolving domains,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 33</source>, <fpage>22338</fpage>&#x02013;<lpage>22348</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Jordan</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Learning transferable features with deep adaptation networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Lille</publisher-loc>), <fpage>97</fpage>&#x02013;<lpage>105</lpage>.</citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Zhu</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Deep transfer learning with joint adaptation networks,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>2208</fpage>&#x02013;<lpage>2217</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>A.</given-names></name> <name><surname>Dong</surname> <given-names>F.</given-names></name> <name><surname>Gu</surname> <given-names>F.</given-names></name> <name><surname>Gama</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>G.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning under concept drift: a review</article-title>. <source>IEEE Trans. Knowl. Data Eng</source>. <volume>31</volume>, <fpage>2346</fpage>&#x02013;<lpage>2363</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2018.2876857</pub-id></citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mancini</surname> <given-names>M.</given-names></name> <name><surname>Bulo</surname> <given-names>S. R.</given-names></name> <name><surname>Caputo</surname> <given-names>B.</given-names></name> <name><surname>Ricci</surname> <given-names>E.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Adagraph: unifying predictive and continuous domain adaptation through graphs,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6568</fpage>&#x02013;<lpage>6577</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mansour</surname> <given-names>Y.</given-names></name> <name><surname>Mohri</surname> <given-names>M.</given-names></name> <name><surname>Rostamizadeh</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Domain adaptation: learning bounds and algorithms,&#x0201D;</article-title> in <source>22nd Conference on Learning Theory, COLT 2009</source> (<publisher-loc>Montreal, QC</publisher-loc>).</citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>S. J.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name></person-group> (<year>2009</year>). <article-title>A survey on transfer learning</article-title>. <source>IEEE Trans. Knowl. Data Eng</source>. <volume>22</volume>, <fpage>1345</fpage>&#x02013;<lpage>1359</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2009.191</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shen</surname> <given-names>J.</given-names></name> <name><surname>Qu</surname> <given-names>Y.</given-names></name> <name><surname>Zhang</surname> <given-names>W.</given-names></name> <name><surname>Yu</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Wasserstein distance guided representation learning for domain adaptation,&#x0201D;</article-title> in <source>Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tang</surname> <given-names>S.</given-names></name> <name><surname>Su</surname> <given-names>P.</given-names></name> <name><surname>Chen</surname> <given-names>D.</given-names></name> <name><surname>Ouyang</surname> <given-names>W.</given-names></name></person-group> (<year>2021</year>). <article-title>Gradient regularized contrastive learning for continual domain adaptation</article-title>. <source>Proc. AAAI Conf. Artif. Intell</source>. <volume>35</volume>, <fpage>2665</fpage>&#x02013;<lpage>2673</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v35i3.16370</pub-id></citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Taufique</surname> <given-names>A. M. N.</given-names></name> <name><surname>Jahan</surname> <given-names>C. S.</given-names></name> <name><surname>Savakis</surname> <given-names>A.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Unsupervised continual learning for gradually varying domains,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3740</fpage>&#x02013;<lpage>3750</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tripuraneni</surname> <given-names>N.</given-names></name> <name><surname>Jordan</surname> <given-names>M.</given-names></name> <name><surname>Jin</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;On the theory of transfer learning: the importance of task diversity,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 33</source>, <fpage>7852</fpage>&#x02013;<lpage>7862</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name></person-group> (<year>2022</year>). <article-title>&#x0201C;Understanding gradual domain adaptation: improved analysis, optimal path and beyond,&#x0201D;</article-title> in <source>Proceedings of the 39th International Conference on Machine Learning</source> (<publisher-loc>Baltimore, MD</publisher-loc>), <fpage>22784</fpage>&#x02013;<lpage>22801</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>He</surname> <given-names>H.</given-names></name> <name><surname>Katabi</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Continuously indexed domain adaptation,&#x0201D;</article-title> in <source>Proceedings of the 37th International Conference on Machine Learning</source>, <fpage>9898</fpage>&#x02013;<lpage>9907</lpage>.</citation>
</ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Cai</surname> <given-names>Q.</given-names></name> <name><surname>Yang</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;On the global optimality of model-agnostic meta-learning,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>9837</fpage>&#x02013;<lpage>9846</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Continuous transfer learning with label-informed distribution alignment</article-title>. <source>arXiv preprint arXiv:2006.03230</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2006.03230</pub-id></citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2022a</year>). <article-title>&#x0201C;Domain adaptation with dynamic open-set targets,&#x0201D;</article-title> in <source>Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Washington, DC</publisher-loc>), <fpage>2039</fpage>&#x02013;<lpage>2049</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2022b</year>). <article-title>&#x0201C;A unified meta-learning framework for dynamic transfer learning,&#x0201D;</article-title> in <source>Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22</source> (<publisher-loc>Vienna</publisher-loc>), <fpage>3573</fpage>&#x02013;<lpage>3579</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;Indirect invisible poisoning attacks on domain adaptation,&#x0201D;</article-title> in <source>Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source>, <fpage>1852</fpage>&#x02013;<lpage>1862</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wulfmeier</surname> <given-names>M.</given-names></name> <name><surname>Bewley</surname> <given-names>A.</given-names></name> <name><surname>Posner</surname> <given-names>I.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Incremental adversarial domain adaptation for continually changing environments,&#x0201D;</article-title> in <source>2018 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Brisbane, QLD</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4489</fpage>&#x02013;<lpage>4495</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yao</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name> <name><surname>Wei</surname> <given-names>Y.</given-names></name> <name><surname>Zhao</surname> <given-names>P.</given-names></name> <name><surname>Mahdavi</surname> <given-names>M.</given-names></name> <name><surname>Lian</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>&#x0201C;Meta-learning with an adaptive task scheduler,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems, Vol. 34</source>, <fpage>7497</fpage>&#x02013;<lpage>7509</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>T.</given-names></name> <name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Jordan</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Bridging theory and algorithm for domain adaptation,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>7404</fpage>&#x02013;<lpage>7413</lpage>.</citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Y.</given-names></name> <name><surname>Ma</surname> <given-names>F.</given-names></name> <name><surname>Gao</surname> <given-names>J.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2019b</year>). <article-title>&#x0201C;Optimizing the wisdom of the crowd: inference, learning, and teaching,&#x0201D;</article-title> in <source>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &#x00026; Data Mining</source>, eds A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis (<publisher-loc>Anchorage, AK</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>3231</fpage>&#x02013;<lpage>3232</lpage>. <pub-id pub-id-type="doi">10.1145/3292500.3332277</pub-id></citation>
</ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Y.</given-names></name> <name><surname>Ying</surname> <given-names>L.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2019a</year>). <article-title>Multi-task crowdsourcing via an optimization framework</article-title>. <source>ACM Trans. Knowl. Discov. Data</source> <volume>13</volume>, <fpage>1</fpage>&#x02013;<lpage>26</lpage>. <pub-id pub-id-type="doi">10.1145/3310227</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Y.</given-names></name> <name><surname>Ying</surname> <given-names>L.</given-names></name> <name><surname>He</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;MultiC<sup>2</sup>: an optimization framework for learning from task and worker dual heterogeneity,&#x0201D;</article-title> in <source>Proceedings of the 2017 SIAM International Conference on Data Mining</source> (<publisher-loc>Houston, TX</publisher-loc>: <publisher-name>SIAM</publisher-name>), <fpage>579</fpage>&#x02013;<lpage>587</lpage>. <pub-id pub-id-type="doi">10.1137/1.9781611974973.65</pub-id></citation>
</ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>It can also be generalized to scenarios (Wu and He, <xref ref-type="bibr" rid="B34">2022b</xref>) in which knowledge is transferred from a dynamic source task to a dynamic target task.</p></fn>
<fn id="fn0002"><p><sup>2</sup>Here we assume that the same number of examples is generated at every time stamp, i.e., <inline-formula><mml:math id="M43"><mml:msup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>m</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>m</mml:mi></mml:math></inline-formula>, but the theoretical results also generalize to scenarios with different numbers of samples in the source and target tasks.</p></fn>
</fn-group>
</back>
</article>