<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2022.1029307</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Auto-GNN: Neural architecture search of graph neural networks</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Zhou</surname> <given-names>Kaixiong</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1976668/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Huang</surname> <given-names>Xiao</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/2081657/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Song</surname> <given-names>Qingquan</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/880499/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chen</surname> <given-names>Rui</given-names></name>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1068766/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Hu</surname> <given-names>Xia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1518368/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>DATA Lab, Department of Computer Science, Rice University</institution>, <addr-line>Houston, TX</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Computing, The Hong Kong Polytechnic University, Kowloon</institution>, <addr-line>Hong Kong SAR</addr-line>, <country>China</country></aff>
<aff id="aff3"><sup>3</sup><institution>LinkedIn</institution>, <addr-line>Sunnyvale, CA</addr-line>, <country>United States</country></aff>
<aff id="aff4"><sup>4</sup><institution>Samsung Research America</institution>, <addr-line>Silicon Valley, CA</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Yao Ma, New Jersey Institute of Technology, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Xiao Wang, Beijing University of Posts and Telecommunications (BUPT), China; Senzhang Wang, Central South University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Xia Hu <email>xia.hu&#x00040;rice.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>17</day>
<month>11</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>1029307</elocation-id>
<history>
<date date-type="received">
<day>27</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>10</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Zhou, Huang, Song, Chen and Hu.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Zhou, Huang, Song, Chen and Hu</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Graph neural networks (GNNs) have been widely used in various graph analysis tasks. As graph characteristics vary significantly across real-world systems, the architecture parameters need to be tuned carefully to identify a suitable GNN for a given scenario. Neural architecture search (NAS) has shown its potential in discovering effective architectures for learning tasks in image and language modeling. However, existing NAS algorithms cannot be applied efficiently to the GNN search problem for two reasons. First, the large-step exploration in the traditional controller fails to learn the sensitive performance variations caused by slight architecture modifications in GNNs. Second, the search space is composed of heterogeneous GNNs, which prevents the direct adoption of parameter sharing among them to accelerate the search. To tackle these challenges, we propose an automated graph neural network (AGNN) framework, which aims to find the optimal GNN architecture efficiently. Specifically, a reinforced conservative controller is designed to explore the architecture space with small steps. To accelerate validation, a novel constrained parameter sharing strategy is presented to regularize weight transfer among GNNs. It avoids training from scratch and saves computation time. Experimental results on benchmark datasets demonstrate that the architecture identified by AGNN achieves the best performance and search efficiency compared with existing human-invented models and traditional search methods.</p></abstract>
<kwd-group>
<kwd>graph neural networks</kwd>
<kwd>automated machine learning</kwd>
<kwd>neural architecture search</kwd>
<kwd>deep and scalable graph analysis</kwd>
<kwd>reinforcement learning</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="6"/>
<equation-count count="3"/>
<ref-count count="40"/>
<page-count count="12"/>
<word-count count="7740"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Graph neural networks (GNNs) (Micheli, <xref ref-type="bibr" rid="B22">2009</xref>) have emerged as predominant tools to model graph data in various domains, such as social media (Grover and Leskovec, <xref ref-type="bibr" rid="B15">2016</xref>) and bioinformatics (Zitnik and Leskovec, <xref ref-type="bibr" rid="B38">2017</xref>). Following the message passing strategy (Hamilton et al., <xref ref-type="bibr" rid="B16">2017</xref>), GNNs learn a node&#x00027;s representation <italic>via</italic> recursively aggregating the representations of its neighbors and of the node itself. The learned node representations can then be employed to handle various downstream tasks efficiently.</p>
<p>The success of GNNs is usually accompanied by careful architecture parameter tuning, aiming to adapt GNNs to different types of graph data. For example, the attention heads in the graph attention networks (Velickovic et al., <xref ref-type="bibr" rid="B28">2017</xref>) are selected separately for the citation networks and the protein&#x02013;protein interaction data. These human-invented architectures not only require manual trials in selecting the architecture parameters, but also tend to yield suboptimal performance when transferred to other graph data. Based on these observations, we investigate how to automatically identify the optimal architecture for each scenario.</p>
<p>Neural architecture search (NAS) has attracted increasing research interest (Elsken et al., <xref ref-type="bibr" rid="B11">2018</xref>). Its goal is to find the optimal neural architecture in a predefined search space to maximize the model performance on a given task. It has been widely reported that architectures discovered by NAS algorithms outperform the human-invented ones in many domains, such as image classification (Zoph and Le, <xref ref-type="bibr" rid="B39">2016</xref>) and semantic image segmentation (Liu et al., <xref ref-type="bibr" rid="B21">2019</xref>). Motivated by this success, we investigate whether an efficient and effective NAS framework can be developed for network analytics problems.</p>
<p>However, the direct application of existing NAS algorithms to find GNN architectures is non-trivial due to two challenges. <italic>First, the traditional search controller of NAS is inefficient at discovering a well-performing GNN architecture</italic>. GNNs are specified by a sequence of modules, including aggregation, combination, and activation. Considering the node classification task, the classification performance of a GNN varies significantly with a slight modification of a single module. For example, with graph convolutional networks (GCN) (Kipf and Welling, <xref ref-type="bibr" rid="B18">2017</xref>), the test accuracy drops even if we only change the aggregation function from sum to mean pooling. The traditional controller samples the whole module sequence to formulate a new architecture at each search step. After validating the new architecture, the controller is updated based on the result of these mixed module modifications. It is therefore hard for the traditional controller to learn which part of the architecture modification improves or degrades the model performance. <italic>Second, the widely adopted technique in NAS such as parameter sharing (Pham et al.</italic>, <xref ref-type="bibr" rid="B23"><italic>2018</italic></xref><italic>) is not suitable for GNN architectures</italic>. Parameter sharing trains common weights and transfers them to every newly sampled architecture, aiming to avoid training from scratch and to measure the new architecture quickly. But it fails to share weights between two heterogeneous GNN architectures, which have distinct output statistics. The output statistics of a model are defined by the mean, variance, or value interval of its neuron activation values. Suppose that we have weights deeply trained in a GNN architecture with the Sigmoid activation function, bounding the neural outputs within the interval [0, 1]. If we transfer the weights to another architecture with a Linear activation and the unbounded interval [&#x02212;&#x0221E;, &#x0002B;&#x0221E;], the neural output values may be too large to be back-propagated stably by the gradient descent optimizer.</p>
<p>We propose the automated graph neural networks (AGNN) framework to tackle the aforementioned challenges. Specifically, it can be framed as answering two research questions. (i) How do we design a search controller tailored to explore well-performing GNN architectures efficiently? (ii) Given the heterogeneous GNN architectures emerging during the search process, how do we make parameter sharing feasible? In summary, our contributions are as follows:</p>
<list list-type="bullet">
<list-item><p>We build up the most comprehensive search space to cover the <italic>elementary, deep</italic> and <italic>scalable</italic> GNN architectures. The search space incorporates the recent techniques, such as skip connections and batch training, to explore the promising models on large-scale graphs.</p></list-item>
<list-item><p>We design an <italic>efficient</italic> controller by incorporating a key property of GNN architectures into the search process&#x02014;the variation of node distinguishing power under slight architecture modifications.</p></list-item>
<list-item><p>We define the heterogeneous GNN architectures in the context of parameter sharing. A constrained parameter sharing strategy is proposed to enhance the functional <italic>effectiveness</italic> of transferred weights in the new architecture.</p></list-item>
<list-item><p>We conduct extensive experiments to search elementary, deep, and scalable GNNs, which deliver superior results on both small- and large-scale graph datasets. Compared with existing NAS methods, AGNN wins on both search efficiency and effectiveness.</p></list-item>
</list></sec>
<sec id="s2">
<title>2. Related work</title>
<sec>
<title>2.1. Graph neural networks</title>
<p>The core idea of GNNs is to learn node embedding representations recursively from the representations at the previous layer. The graph convolutions at each layer are realized by a series of manipulations, including message passing and self-updating. A variety of GNNs based on spatial graph convolutions have been developed, including GNN models with different aggregation mechanisms (Hamilton et al., <xref ref-type="bibr" rid="B16">2017</xref>; Corso et al., <xref ref-type="bibr" rid="B7">2020</xref>) and different attention mechanisms (Vaswani et al., <xref ref-type="bibr" rid="B27">2017</xref>; Velickovic et al., <xref ref-type="bibr" rid="B28">2017</xref>). Recently, deep GNNs have been widely studied to learn the high-order neighborhood structures of nodes (Chen et al., <xref ref-type="bibr" rid="B4">2020</xref>; Zhou et al., <xref ref-type="bibr" rid="B37">2020</xref>). Given the large-scale graphs in real-world applications, several scalable GNNs have been proposed by applying batch training (Chang and Lin, <xref ref-type="bibr" rid="B3">2010</xref>; Zeng et al., <xref ref-type="bibr" rid="B33">2019</xref>).</p></sec>
<sec>
<title>2.2. Neural architecture search</title>
<p>NAS has been widely explored to automate the design and selection of good neural architectures. Most NAS frameworks are built on reinforcement learning (RL) (Baker et al., <xref ref-type="bibr" rid="B1">2016</xref>; Zoph and Le, <xref ref-type="bibr" rid="B39">2016</xref>). RL-based approaches adopt a recurrent controller to generate variable-length strings describing neural architectures. The controller is updated with policy gradients after evaluating the sampled architecture on the validation set. To tackle the time-cost bottleneck of NAS, parameter sharing (Pham et al., <xref ref-type="bibr" rid="B23">2018</xref>) was proposed to transfer previously well-trained weights to a newly sampled architecture, avoiding training from scratch.</p></sec>
<sec>
<title>2.3. Graph NAS</title>
<p>As far as we know, the first work combining research on GNNs and NAS is GraphNAS (Gao et al., <xref ref-type="bibr" rid="B13">2019</xref>). To be specific, GraphNAS directly applies the reinforcement learning search method and the traditional parameter sharing. Following this pioneering work, recent efforts on graph NAS either modify the search space for their specific downstream tasks (Ding et al., <xref ref-type="bibr" rid="B9">2020</xref>; You et al., <xref ref-type="bibr" rid="B32">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B34">2020a</xref>; Cai et al., <xref ref-type="bibr" rid="B2">2021</xref>; Wei et al., <xref ref-type="bibr" rid="B29">2021</xref>), or apply different search methods (Li and King, <xref ref-type="bibr" rid="B20">2020</xref>; Shi et al., <xref ref-type="bibr" rid="B25">2020</xref>; Zhao et al., <xref ref-type="bibr" rid="B36">2020b</xref>). For example, targeting the graph classification problem, previous work (Cai et al., <xref ref-type="bibr" rid="B2">2021</xref>; Wei et al., <xref ref-type="bibr" rid="B29">2021</xref>) incorporates the operations of feature filtration or graph pooling into the search space. Besides reinforcement learning-based search, several differentiable search frameworks have been developed to improve search efficiency. For example, Zhao et al. (<xref ref-type="bibr" rid="B36">2020b</xref>) and Ding et al. (<xref ref-type="bibr" rid="B10">2021</xref>) relax the discrete search space to be continuous, where each graph module is represented by a probabilistic combination of candidate functions. The evolutionary algorithm has also been used to generate architectures with the genetic operations of crossover and mutation (Shi et al., <xref ref-type="bibr" rid="B25">2020</xref>).</p></sec></sec>
<sec id="s3">
<title>3. Search space</title>
<p>Before going into the technical details, we first unify the terminology used in the graph NAS framework. We use the term &#x0201C;architecture&#x0201D; to refer to a complete graph neural network that can be applied to the downstream application. Specifically, a GNN architecture is characterized by multiple independent dimensions, such as the aggregation function and the hidden units. Along each architecture dimension, a series of candidate modules is provided to support automated architecture engineering. For example, we have candidates {SUM, MEAN, MAX} at the dimension of aggregation function. The search space is then constructed as the Cartesian product of all the dimensions, which contains a large number of candidate architectures. NAS iteratively samples the next architecture, moving as close as possible to the optimal architecture in the search space (Chen et al., <xref ref-type="bibr" rid="B5">2021</xref>).</p>
<p>Following the popular message passing strategy (Gilmer et al., <xref ref-type="bibr" rid="B14">2017</xref>), GNNs are built by stacking a series of graph convolutional layers. Formally, the graph convolutions at the <italic>k</italic>-th layer are:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable columnalign='left'><mml:mtr><mml:mtd><mml:msubsup><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mtext>AGGRE</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x0007B;</mml:mo><mml:msubsup><mml:mi>a</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:msubsup><mml:mi>x</mml:mi><mml:mi>j</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>:</mml:mo><mml:mi>j</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi mathvariant='script'>N</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>i</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x0007D;</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>;</mml:mo></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mtext>COM</mml:mtext><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>W</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msup><mml:msubsup><mml:mi>x</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mn>1</mml:mn><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mi>h</mml:mi><mml:mi>i</mml:mi><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:mi>k</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msubsup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>)</mml:mo><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><inline-formula><mml:math id="M3"><mml:msubsup><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>d</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:msup></mml:math></inline-formula> denotes the representation embedding of node <italic>i</italic> learned at the <italic>k</italic>-th layer. <inline-formula><mml:math id="M4"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">N</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes the set of neighbors adjacent to node <italic>i</italic>. <italic>W</italic><sup>(<italic>k</italic>)</sup>&#x02208;&#x0211D;<sup><italic>d</italic></sup><sup>(<italic>k</italic>)</sup>&#x000D7;<italic>d</italic><sup>(<italic>k</italic>&#x02212;1)</sup> is trainable weight. <inline-formula><mml:math id="M5"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> denotes the edge weight between nodes <italic>i</italic> and <italic>j</italic>. Functions AGGRE and COM are applied to aggregate neighbor embeddings and combine them with the node itself, respectively. &#x003C3; denotes the activation function. Based on above equation, we define the comprehensive search space to support the searches of elementary, deep and scalable GNNs for various applications. We categorize the search dimensions as following, and list their candidate modules in <xref ref-type="supplementary-material" rid="SM1">Supplementary Section S2</xref>.</p>
<list list-type="bullet">
<list-item><p><bold>Elementary dimensions</bold>. We use the term &#x0201C;elementary GNNs&#x0201D; to represent the widely applied models in literature, which often contain less than three layers. The elementary dimensions are: (<bold>I</bold>) hidden units specifying <italic>d</italic><sup>(<italic>k</italic>)</sup>; (<bold>II</bold>) attention function used to compute <inline-formula><mml:math id="M6"><mml:msubsup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula>; (<bold>III</bold>) number of attention heads; (<bold>IV</bold>) aggregation function; (<bold>V</bold>) combination function; and (<bold>VI</bold>) activation function.</p></list-item>
<list-item><p><bold>Deep dimensions</bold>. We include the dimension of (<bold>VII</bold>) skip connections to allow the stacking of deep GNNs. To be specific, at layer <italic>k</italic>, the embeddings of up to <italic>k</italic> &#x02212; 1 previous layers can be sampled and combined into the current layer&#x00027;s output.</p></list-item>
<list-item><p><bold>Scalable dimensions</bold>. The dimension of (<bold>VIII</bold>) batch size is included to facilitate the computation on large-scale graphs.</p></list-item>
</list>
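<p>To make the search space concrete, the following minimal Python sketch enumerates one possible set of candidate modules per dimension and counts the resulting architectures. Apart from the aggregation candidates {SUM, MEAN, MAX} mentioned above, the candidate lists and dimension names are illustrative assumptions rather than the exact lists of <xref ref-type="supplementary-material" rid="SM1">Supplementary Section S2</xref>.</p>
<preformat>
import random

# Illustrative candidate modules per search dimension (I-VIII).
# Only the aggregation candidates come from the text; the other lists
# are hypothetical placeholders.
SEARCH_SPACE = {
    "hidden_units":    [8, 16, 32, 64],                  # (I)
    "attention":       ["const", "gat", "cos"],          # (II)
    "attention_heads": [1, 2, 4, 8],                     # (III)
    "aggregation":     ["sum", "mean", "max"],           # (IV)
    "combination":     ["add", "concat"],                # (V)
    "activation":      ["relu", "elu", "tanh"],          # (VI)
    "skip_connection": ["none", "residual", "dense"],    # (VII) deep
    "batch_size":      [512, 1024, 2048],                # (VIII) scalable
}

def architecture_count(space, num_layers):
    """Size of the Cartesian-product search space for a GNN with
    num_layers layers, choosing every layer independently."""
    per_layer = 1
    for candidates in space.values():
        per_layer *= len(candidates)
    return per_layer ** num_layers

def random_architecture(space, num_layers, rng=random):
    """Sample one architecture: a list of per-layer module choices."""
    return [{dim: rng.choice(cands) for dim, cands in space.items()}
            for _ in range(num_layers)]
</preformat>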
<p>We highlight that most existing search spaces only cover the elementary dimensions (Gao et al., <xref ref-type="bibr" rid="B13">2019</xref>). In particular, although the batch size is contained in the search space of You et al. (<xref ref-type="bibr" rid="B32">2020</xref>), it is used for graph classification instead of the node classification problem concerned in this work. The technical implementation of batch sampling for these two problems is significantly different: While graph classification samples independent graphs as a batch, similar to traditional machine learning tasks, node classification samples dependent nodes to formulate a subgraph. We are aware that skip connections are searched in Zhao et al. (<xref ref-type="bibr" rid="B34">2020a</xref>), but it only optimizes the connection choice at the last layer of a three-layer GNN and fails to explore the deep and scalable models (Chiang et al., <xref ref-type="bibr" rid="B6">2019</xref>; Chen et al., <xref ref-type="bibr" rid="B4">2020</xref>), which have recently been widely explored to boost GNNs&#x00027; performances.</p></sec>
<sec id="s4">
<title>4. Reinforced conservative controller</title>
<p>In the traditional search controller of reinforcement learning (RL)-based NAS, a recurrent neural network (RNN) encoder is applied to specify the neural architecture strings (Zoph and Le, <xref ref-type="bibr" rid="B39">2016</xref>; Pham et al., <xref ref-type="bibr" rid="B23">2018</xref>). At each search step, the RNN encoder samples the string elements one by one and uses them to formulate a new architecture. After validating the new architecture, a scalar reward is used to update the RNN encoder. However, it is problematic to directly apply this traditional controller to find well-performing GNN architectures. The main reason is that GNNs&#x00027; performances may vary significantly with slight modifications along a single dimension (e.g., the aggregation function). It is hard for the traditional controller to learn which part of the architecture modification contributes more or less to the performance improvement, and it thus fails to identify the powerful modules of each dimension in the subsequent search process.</p>
<p>In order to search GNN architectures efficiently, we propose a new controller named reinforced conservative neural architecture search (RCNAS), as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. It is built upon RL-based exploration boosted with conservative exploitation. To be specific, there are three key components: (1) a conservative exploiter, which keeps track of the best architecture found so far; (2) a guided architecture explorer, which slightly modifies the modules of certain dimensions in the preserved best architecture; and (3) a reinforcement learning trainer, which learns the relationship between the slight architecture modifications and the model performance change.</p>
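<p>Before detailing each component, the following minimal Python sketch illustrates how the three components interact in one search step. All function arguments (evaluate, sample_modification, apply_modification, reinforce_update) are hypothetical placeholders for the procedures described in Sections 4.1&#x02013;4.3, not part of a released implementation.</p>
<preformat>
def rcnas_search(init_architecture, evaluate, sample_modification,
                 apply_modification, reinforce_update, num_steps, s=1):
    """High-level RCNAS loop: conservative exploitation of the best
    architecture found so far, plus guided exploration that modifies
    only s dimensions per step."""
    best_arch = init_architecture
    best_score = evaluate(best_arch)              # validation performance M_b
    for _ in range(num_steps):
        # Guided explorer: choose s dimensions and sample new modules for them.
        dims, new_modules = sample_modification(best_arch, s)
        offspring = apply_modification(best_arch, dims, new_modules)
        # Constrained parameter sharing is assumed to happen inside evaluate()
        # when the offspring is warm-started from the preserved architecture.
        offspring_score = evaluate(offspring)     # validation performance M_o
        # RL trainer: the reward is the performance change R = M_o - M_b.
        reinforce_update(dims, new_modules, offspring_score - best_score)
        # Conservative exploiter: keep whichever of parent/offspring is better.
        best_arch, best_score = max(
            [(best_arch, best_score), (offspring, offspring_score)],
            key=lambda pair: pair[1])
    return best_arch, best_score
</preformat>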
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Illustration of AGNN with a three-layer GNN searched along the elementary dimensions. The controller takes the best architecture as input and applies an RNN encoder to sample alternative modules for each dimension. We select the dimension (e.g., activation function) that deserves to be explored, and modify the preserved best architecture with the alternative modules.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1029307-g0001.tif"/>
</fig>
<sec>
<title>4.1. Conservative exploiter</title>
<p>The conservative exploiter is applied to keep the best architecture found so far. In this way, the subsequent architecture modifications are performed upon a reliable parent architecture, which ensures fast exploitation toward better offspring architectures in the huge search space. If the offspring GNN outperforms its parent, the best neural architecture is updated; otherwise, the current best is kept and reused to generate the next offspring GNN.</p></sec>
<sec>
<title>4.2. Guided architecture explorer</title>
<p>The guided architecture explorer is proposed to modify the best architecture, <italic>via</italic> choosing the dimensions that deserve exploration. As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, we use the example of the activation function being selected. Correspondingly, the activation function modules in a three-layer GNN architecture are changed to ELU, ReLU, and Tanh, respectively. The details are introduced as follows.</p>
<sec>
<title>4.2.1. RNN encoders</title>
<p>As shown in the middle part of <xref ref-type="fig" rid="F1">Figure 1</xref>, for each dimension <italic>c</italic>, an RNN encoder is implemented to sample a series of new modules. These modules are candidates for correspondingly updating the <italic>n</italic> layers of the preserved GNN. First, a subarchitecture string is generated by removing the original modules of the concerned dimension. This subarchitecture represents the input state whose removed modules need to be filled in. Second, following an embedding layer, the <italic>n</italic> new modules are sampled layer by layer.</p>
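<p>A minimal PyTorch-style sketch of one such per-dimension encoder is given below. The use of a GRU cell, the embedding of module tokens, and the tensor shapes are our assumptions for exposition rather than the released implementation; the returned entropies feed the modification guider of Section 4.2.2.</p>
<preformat>
import torch
import torch.nn as nn

class DimensionEncoder(nn.Module):
    """One RNN encoder per search dimension c: it reads the subarchitecture
    string (with the modules of dimension c removed) and samples one
    replacement module for each of the n layers."""

    def __init__(self, vocab_size, num_candidates, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)  # module-token embedding
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_candidates)  # m candidate modules

    def forward(self, subarch_tokens, num_layers):
        # Encode the subarchitecture string token by token (1-D LongTensor).
        h = torch.zeros(1, self.cell.hidden_size)
        for tok in subarch_tokens:
            h = self.cell(self.embed(tok).unsqueeze(0), h)
        samples, log_probs, entropies = [], [], []
        # Decode one new module per layer, k = 1..n.
        for _ in range(num_layers):
            probs = torch.softmax(self.head(h), dim=-1)    # P_k^c over m candidates
            dist = torch.distributions.Categorical(probs)
            module = dist.sample()
            samples.append(module.item())
            log_probs.append(dist.log_prob(module))
            entropies.append(dist.entropy())
            h = self.cell(self.embed(module), h)
        return samples, log_probs, entropies
</preformat>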
<p>Specifically, RNN encoder of dimension <italic>c</italic> decides the sampling probability distribution at layer <italic>k</italic> as: <inline-formula><mml:math id="M7"><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext mathvariant="bold">P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">P</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>|</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>&#x02264;</mml:mo><mml:mi>k</mml:mi><mml:mo>&#x02264;</mml:mo><mml:mi>n</mml:mi></mml:math></inline-formula>. &#x003B8;<sup><italic>c</italic></sup> denotes the trainable parameters, and <italic>m</italic> denotes the module cardinality. The module at layer <italic>k</italic> is randomly sampled based on distribution <inline-formula><mml:math id="M8"><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext mathvariant="bold">P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. Reusing the example of activation function dimension in <xref ref-type="fig" rid="F1">Figure 1</xref>, {ELU, ReLU, Tanh} are generated and prepared to modify the preserved architecture.</p></sec>
<sec>
<title>4.2.2. Modification guider</title>
<p>The modification guider is responsible for choosing the architecture dimensions used to modify the preserved GNN. We note that NAS is encouraged to explore the search space along directions with a great amount of uncertainty. The uncertainty of a dimension can be defined by the entropy of its sampling probabilities. Formally, the decision entropy of dimension <italic>c</italic> is: <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext mathvariant="bold">P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup><mml:mo class="qopname">log</mml:mo><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext mathvariant="bold">P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. The larger <italic>E</italic><sub><italic>c</italic></sub> is, the higher the probability of exploring the uncertain dimension <italic>c</italic>.</p>
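<p>Assuming each per-dimension encoder exposes its per-layer sampling distributions, the guider can be sketched as follows. Treating the decision entropies as unnormalized sampling weights is our reading of &#x0201C;randomly chooses&#x0201D; in the next paragraph and may differ from the exact selection rule.</p>
<preformat>
import torch

def decision_entropy(per_layer_probs):
    """E_c: summed entropy of the per-layer sampling distributions P_k^c."""
    return sum(-(p * torch.log(p + 1e-12)).sum() for p in per_layer_probs)

def choose_dimensions(probs_by_dim, s=1):
    """Pick s dimensions to modify, favoring high-entropy (uncertain) ones."""
    dims = list(probs_by_dim)
    weights = torch.stack([decision_entropy(probs_by_dim[d]) for d in dims])
    idx = torch.multinomial(weights, num_samples=s, replacement=False)
    return [dims[int(i)] for i in idx]
</preformat>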
<p>Given the decision entropies {&#x022EF;&#x02009;, <italic>E</italic><sub><italic>c</italic></sub>, &#x022EF;&#x02009;} of all the dimensions, the modification guider randomly chooses dimensions <inline-formula><mml:math id="M10"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x022EF;</mml:mo><mml:mspace width="0.3em" class="thinspace"/><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> with size <inline-formula><mml:math id="M11"><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>=</mml:mo><mml:mi>s</mml:mi></mml:math></inline-formula>. We use the default value of <italic>s</italic> &#x0003D; 1 to achieve the goal of minimum architecture modification. We provide the hyperparameter study of <italic>s</italic> in the Appendix, which shows that the model performance generally decreases as size <italic>s</italic> increases. This validates our motivation to explore the search space of GNN architectures with small steps.</p></sec>
<sec>
<title>4.2.3. Architecture modifier</title>
<p>We modify the modules of the preserved best architecture according to list <inline-formula><mml:math id="M12"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:math></inline-formula>. For each dimension within list <inline-formula><mml:math id="M13"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:math></inline-formula>, the corresponding original modules are replaced with the newly sampled ones. Considering the case of <inline-formula><mml:math id="M14"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mtext>activation&#x000A0;function</mml:mtext></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula> in <xref ref-type="fig" rid="F1">Figure 1</xref>, the sampled modules {ELU, ReLU, Tanh} are applied as the activation functions in the three-layer GNN, while keeping the other modules in the preserved architecture unchanged. After the architecture modification, the offspring GNN is evaluated to estimate its model performance.</p>
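<p>The substitution itself can be sketched as follows, assuming the list-of-dicts architecture representation used in the earlier search-space sketch, where new_modules maps each chosen dimension to its <italic>n</italic> newly sampled modules.</p>
<preformat>
import copy

def apply_modification(best_arch, dims, new_modules):
    """Replace, for every chosen dimension, the per-layer modules of the
    preserved best architecture with the newly sampled ones; all other
    dimensions are kept unchanged."""
    offspring = copy.deepcopy(best_arch)   # leave the parent architecture intact
    for dim in dims:
        for layer, module in zip(offspring, new_modules[dim]):
            layer[dim] = module
    return offspring
</preformat></sec></sec>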
<sec>
<title>4.3. Reinforcement learning trainer</title>
<p>We use the REINFORCE rule (Sutton et al., <xref ref-type="bibr" rid="B26">2000</xref>) to update the RNN encoders. For each modified dimension <inline-formula><mml:math id="M15"><mml:mi>c</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:math></inline-formula>, we compute the gradients of parameters &#x003B8;<sub><italic>c</italic></sub> by the following rule (Zoph and Le, <xref ref-type="bibr" rid="B39">2016</xref>):</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M16"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mi>J</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mi>&#x1D53C;</mml:mi><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>B</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo class="qopname">log</mml:mo><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>P</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> represents the probability of sampled module at layer <italic>k</italic>, given by the corresponding element from vector <inline-formula><mml:math id="M18"><mml:msubsup><mml:mrow><mml:mstyle class="text"><mml:mtext mathvariant="bold">P</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula>. <italic>R</italic> denotes the reward (i.e., validation performance) by evaluating the new offspring architecture. <italic>B</italic><sub><italic>c</italic></sub> denotes the reward baseline of dimension <italic>c</italic> for variance reduction in reinforcement learning. Let <italic>M</italic><sub><italic>b</italic></sub> and <italic>M</italic><sub><italic>o</italic></sub> denote the model performances of the preserved best architecture and the new offspring, respectively. We propose the following reward shaping: <italic>R</italic> &#x0003D; <italic>M</italic><sub><italic>o</italic></sub> &#x02212; <italic>M</italic><sub><italic>b</italic></sub>, which represents the model performance variation due to the architecture modification. Using the same reward, RNN encoders of all the <italic>s</italic> dimensions within list <inline-formula><mml:math id="M19"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">C</mml:mi></mml:mrow></mml:math></inline-formula> are updated based on Eq. (2).</p>
<p>The proposed RCNAS solves the inefficiency problem of the conventional controller by utilizing a small value of <italic>s</italic>. The conventional controller used in GraphNAS (Gao et al., <xref ref-type="bibr" rid="B13">2019</xref>) generates modules of all the dimensions to formulate a new architecture each time, which is mathematically equivalent to <italic>s</italic>&#x0226B;1 in RCNAS. Reward <italic>R</italic> is then obtained as the result of mixed architecture modifications on all the dimensions. When updating a specific RNN encoder, the REINFORCE rule therefore introduces noise derived from the other dimensions. It is hard to distinguish the contribution of the module samples of each dimension to the final model performance, and the RNN encoder fails to accurately learn the relationship between the model performance and the module selection of a given dimension. In contrast, by applying the extreme case of <italic>s</italic> &#x0003D; 1 in RCNAS, reward <italic>R</italic> is estimated by slightly modifying one architecture dimension, and the REINFORCE rule only updates the corresponding RNN encoder, which learns the above relationship exclusively. This helps the controller identify the powerful modules of each dimension and explore well-performing offspring architectures.</p></sec></sec>
<sec id="s5">
<title>5. Constrained parameter sharing</title>
<p>Compared with training from scratch, parameter sharing reduces the computation cost by forcing the explored neural architectures to share common weights (Pham et al., <xref ref-type="bibr" rid="B23">2018</xref>). The transferred weights should work effectively in the new architecture and estimate its performance as accurately as training from scratch. However, the traditional strategy cannot share weights among heterogeneous GNN architectures stably, for a few reasons. We say that two neural architectures are heterogeneous if they have significantly distinct output statistics. For example, the output intervals of the Sigmoid and Linear activation functions are [0, 1] and [&#x02212;&#x0221E;, &#x0002B;&#x0221E;], respectively. The activation values of the Linear function may be overly large in the offspring architecture if its weights are transferred from an ancestor equipped with the Sigmoid activation function. The resulting output explosion would lead to unstable training of the offspring architecture. Furthermore, the trainable weights in connection layers, such as batch normalization and skip connections, are deeply coupled in the ancestor architecture to connect specific successive layers. These weights are hard to transfer to the offspring to bridge a different pair of successive layers well.</p>
<p>To tackle the above challenges, we propose a constrained parameter sharing strategy, as illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>. The trainable weights are transferred in a layer-wise fashion. Each layer in the new offspring architecture shares weights from a suitable ancestor only when three constraints are satisfied (a code sketch follows the list below):</p>
<list list-type="bullet">
<list-item><p>The ancestor and offspring architectures have the same shapes of trainable weights, in order to enable the transferred weights being used directly. The weight shapes are specified by both the hidden units and attention heads.</p></list-item>
<list-item><p>The ancestor and offspring architectures have the same attention and activation functions. The attention function collects the relevant neighbors, and the activation function squashes the output to a specific interval. Both of them largely determine the output statistics of a layer.</p></list-item>
<list-item><p>The weights of the connection layers are not shared. The connection layers contain the batch normalization and skip connections. We train each offspring architecture for a few epochs (e.g., 5 or 20 epochs in our experiments) to adapt these connection weights to the new successive layers.</p></list-item>
</list>
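<p>A layer-wise sketch of the three constraints is given below. The per-layer representation (a spec dictionary of module choices plus a params dictionary of trainable weights) and the key names follow the hypothetical conventions of the earlier sketches; after transfer, the offspring is fine-tuned for a few epochs so that the unshared connection layers adapt to their new neighbors.</p>
<preformat>
import copy

CONNECTION_KEYS = ("batch_norm", "skip_connection")   # constraint 3: never shared

def can_share(ancestor_spec, offspring_spec):
    """Constraints 1-2: identical weight shapes (hidden units, attention heads)
    and identical attention/activation functions."""
    return (ancestor_spec["hidden_units"] == offspring_spec["hidden_units"]
            and ancestor_spec["attention_heads"] == offspring_spec["attention_heads"]
            and ancestor_spec["attention"] == offspring_spec["attention"]
            and ancestor_spec["activation"] == offspring_spec["activation"])

def transfer_weights(ancestor_layers, offspring_layers):
    """Layer-wise constrained sharing between an ancestor and an offspring."""
    for anc, off in zip(ancestor_layers, offspring_layers):
        if not can_share(anc["spec"], off["spec"]):
            continue                              # this layer is trained from scratch
        for name, weight in anc["params"].items():
            if name not in CONNECTION_KEYS:       # skip BN and skip-connection weights
                off["params"][name] = copy.deepcopy(weight)
    return offspring_layers
</preformat>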
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Illustration of the constrained parameter sharing strategy between the ancestor and offspring architectures in layer 2. The trainable weights of a layer are shared when they have the same shapes (constraint 1), attention and activation functions (constraint 2). Constraint 3 avoids sharing in the batch normalization (BN) and skip connection (SC).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1029307-g0002.tif"/>
</fig>
</sec>
<sec id="s6">
<title>6. Experiments</title>
<p>We experiment on the node classification task with the goal of answering five research questions. <bold>Q1:</bold> How does the elementary GNN architecture discovered by AGNN compare with the human-invented models and the ones searched by other methods? <bold>Q2:</bold> How effective is AGNN at building deep architectures? <bold>Q3:</bold> How scalable is AGNN in exploring superior models on large-scale graphs? <bold>Q4:</bold> How efficient is the proposed RCNAS controller compared with the controllers of other search methods? <bold>Q5:</bold> Does the constrained parameter sharing transfer weights effectively to the offspring architectures? We provide an explanation of the discovered architectures in <xref ref-type="supplementary-material" rid="SM1">Supplementary Sections S4&#x02013;S6</xref>.</p>
<sec>
<title>6.1. Datasets</title>
<p>To study the neural architecture search of elementary and deep GNNs, we use the benchmark node classification datasets of Cora, Citeseer, and Pubmed (Sen et al., <xref ref-type="bibr" rid="B24">2008</xref>) under the transductive setting, and apply PPI under the inductive setting (Zitnik and Leskovec, <xref ref-type="bibr" rid="B38">2017</xref>). To search the scalable GNNs, we use large-scale graphs of Reddit (Hamilton et al., <xref ref-type="bibr" rid="B16">2017</xref>) and ogbn-products (Hu et al., <xref ref-type="bibr" rid="B17">2020</xref>). Their dataset statistics are in <xref ref-type="supplementary-material" rid="SM1">Supplementary Section S1</xref>.</p></sec>
<sec>
<title>6.2. Baseline methods</title>
<list list-type="bullet">
<list-item><p><bold>Human-invented GNNs</bold>: The message-passing-based GNNs of the form in Eq. (1) are considered for a fair comparison, except those combined with pooling layers or other advanced techniques. For the elementary GNNs, we apply the baseline models Chebyshev (Defferrard et al., <xref ref-type="bibr" rid="B8">2016</xref>), GCN (Kipf and Welling, <xref ref-type="bibr" rid="B18">2017</xref>), GraphSAGE (Hamilton et al., <xref ref-type="bibr" rid="B16">2017</xref>), GAT (Velickovic et al., <xref ref-type="bibr" rid="B28">2017</xref>), and LGCN (Gao et al., <xref ref-type="bibr" rid="B12">2018</xref>). For deep GNNs, we consider the state-of-the-art (SOTA) models PairNorm (Zhao and Akoglu, <xref ref-type="bibr" rid="B35">2019</xref>), SGC (Wu et al., <xref ref-type="bibr" rid="B30">2019</xref>), JKNet (Xu et al., <xref ref-type="bibr" rid="B31">2018</xref>), and APPNP (Klicpera et al., <xref ref-type="bibr" rid="B19">2018</xref>). For scalable GNNs, we use the baseline models GraphSAGE (Hamilton et al., <xref ref-type="bibr" rid="B16">2017</xref>), Cluster-GCN (Chiang et al., <xref ref-type="bibr" rid="B6">2019</xref>), and GraphSAINT (Zeng et al., <xref ref-type="bibr" rid="B33">2019</xref>).</p></list-item>
<list-item><p><bold>NAS approaches:</bold> We note that most existing NAS methods cannot be applied directly to search deep and scalable GNNs. We use GraphNAS (Gao et al., <xref ref-type="bibr" rid="B13">2019</xref>), the most popular NAS model based on reinforcement learning, as the baseline to search elementary GNNs. Random search, which samples architectures uniformly at random, serves as a strong baseline to evaluate the efficiency and effectiveness of sophisticated NAS methods.</p></list-item>
</list></sec>
<sec>
<title>6.3. Training details</title>
<p>Following the previous configurations (Velickovic et al., <xref ref-type="bibr" rid="B28">2017</xref>; Gao et al., <xref ref-type="bibr" rid="B12">2018</xref>), we search two-layer and three-layer elementary GNNs for the transductive and inductive learning settings, respectively. The layer numbers for deep model search and scalable model search are 16 and 3, respectively. A total of 1,000 architectures are explored iteratively during the search process. The classification accuracies are averaged over 10 random initializations of the optimal architecture. The details of the training hyperparameter settings are listed in <xref ref-type="supplementary-material" rid="SM1">Supplementary Section S3</xref>.</p></sec>
<sec>
<title>6.4. Results</title>
<sec>
<title>6.4.1. Search of elementary GNNs</title>
<p>We search the elementary GNNs with 2&#x02013;3 layers, and compare with the human-invented GNNs and NAS methods to answer question <bold>Q1</bold>. The test performances of human-invented GNNs are reported directly from their papers. <xref ref-type="table" rid="T1">Tables 1</xref>, <xref ref-type="table" rid="T2">2</xref> summarize the classification results and parameter sizes for the transductive and inductive learning, respectively. We make the following observations.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Classification accuracy (in percent) under transductive learning.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Framework</bold></th>
<th valign="top" align="left"><bold>Model</bold></th>
<th/>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Cora</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Citeseer</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Pubmed</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center"><bold>&#x00023;Layers</bold></th>
<th valign="top" align="center"><bold>&#x00023;Params</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>&#x00023;Params</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>&#x00023;Params</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GNNs</td>
<td valign="top" align="left">Chebyshev</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.09M</td>
<td valign="top" align="center">81.2</td>
<td valign="top" align="center">0.09M</td>
<td valign="top" align="center">69.8</td>
<td valign="top" align="center">0.09M</td>
<td valign="top" align="center">74.4</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">GCN</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.02M</td>
<td valign="top" align="center">81.5</td>
<td valign="top" align="center">0.05M</td>
<td valign="top" align="center">70.3</td>
<td valign="top" align="center">0.02M</td>
<td valign="top" align="center">79.0</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">GAT</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.09M</td>
<td valign="top" align="center">83.0 &#x000B1; 0.7</td>
<td valign="top" align="center">0.23M</td>
<td valign="top" align="center">72.5 &#x000B1; 0.7</td>
<td valign="top" align="center">0.03M</td>
<td valign="top" align="center">79.0 &#x000B1; 0.3</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">LGCN</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.05M</td>
<td valign="top" align="center">81.6 &#x000B1; 0.4</td>
<td valign="top" align="center">0.12M</td>
<td valign="top" align="center">70.4 &#x000B1; 1.1</td>
<td valign="top" align="center">0.02M</td>
<td valign="top" align="center">77.3 &#x000B1; 1.2</td>
</tr>
<tr>
<td valign="top" align="left">NAS</td>
<td valign="top" align="left">GraphNAS-w/o share</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.09M</td>
<td valign="top" align="center">82.7 &#x000B1; 0.4</td>
<td valign="top" align="center">0.23M</td>
<td valign="top" align="center">73.5 &#x000B1; 1.0</td>
<td valign="top" align="center">0.03M</td>
<td valign="top" align="center">78.8 &#x000B1; 0.5</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">GraphNAS-with share</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.07M</td>
<td valign="top" align="center">83.3 &#x000B1; 0.6</td>
<td valign="top" align="center">1.91M</td>
<td valign="top" align="center">72.4 &#x000B1; 1.3</td>
<td valign="top" align="center">0.07M</td>
<td valign="top" align="center">78.1 &#x000B1; 0.8</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Random-w/o share</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.37M</td>
<td valign="top" align="center">81.4 &#x000B1; 1.1</td>
<td valign="top" align="center">0.95M</td>
<td valign="top" align="center">72.9 &#x000B1; 0.2</td>
<td valign="top" align="center">0.13M</td>
<td valign="top" align="center">77.9 &#x000B1; 0.5</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Random-with share</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">2.95M</td>
<td valign="top" align="center">82.3 &#x000B1; 0.5</td>
<td valign="top" align="center">0.95M</td>
<td valign="top" align="center">69.9 &#x000B1; 1.7</td>
<td valign="top" align="center">0.13M</td>
<td valign="top" align="center">77.9 &#x000B1; 0.4</td>
</tr>
<tr>
<td valign="top" align="left">AGNN</td>
<td valign="top" align="left">AGNN-w/o share</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.05M</td>
<td valign="top" align="center"><bold>83.6</bold> <bold>&#x000B1;0.3</bold></td>
<td valign="top" align="center">0.71M</td>
<td valign="top" align="center"><bold>73.8</bold> <bold>&#x000B1;0.7</bold></td>
<td valign="top" align="center">0.07M</td>
<td valign="top" align="center"><bold>79.7</bold> <bold>&#x000B1;0.4</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">AGNN-with share</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.37M</td>
<td valign="top" align="center">82.7 &#x000B1; 0.6</td>
<td valign="top" align="center">1.90M</td>
<td valign="top" align="center">72.7 &#x000B1; 0.4</td>
<td valign="top" align="center">0.03M</td>
<td valign="top" align="center">79.0 &#x000B1; 0.5</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Datasets are based on full splitting: all nodes except those in the validation and test sets will be used for training. The bold values indicate the highest accuracy at each column.</p>
<p>&#x00023; Number of parameters.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Test accuracy of the human-invented and searched architectures under the inductive learning.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Framework</bold></th>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Layers</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>PPI</bold></th>
</tr>
<tr>
<th/>
<th/>
<th/>
<th valign="top" align="center"><bold>Params</bold></th>
<th valign="top" align="center"><bold>F1 score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GNNs</td>
<td valign="top" align="left">GraphSAGE</td>
<td valign="top" align="center">2</td>
<td valign="top" align="center">0.39M</td>
<td valign="top" align="center">0.612</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">GAT</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">0.89M</td>
<td valign="top" align="center">0.973 &#x000B1; 0.002</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">LGCN</td>
<td valign="top" align="center">4</td>
<td valign="top" align="center">0.85M</td>
<td valign="top" align="center">0.772 &#x000B1; 0.002</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">GraphNAS-w/o share</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">4.1M</td>
<td valign="top" align="center">0.985 &#x000B1; 0.004</td>
</tr>
<tr>
<td valign="top" align="left">NAS</td>
<td valign="top" align="left">GraphNAS-with share</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">1.4M</td>
<td valign="top" align="center">0.960 &#x000B1; 0.036</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Random-w/o share</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">1.4M</td>
<td valign="top" align="center">0.984 &#x000B1; 0.004</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Random-with share</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">1.4M</td>
<td valign="top" align="center">0.977 &#x000B1; 0.011</td>
</tr>
<tr>
<td valign="top" align="left">AGNN</td>
<td valign="top" align="left">AGNN-w/o share</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">4.6M</td>
<td valign="top" align="center"><bold>0.992</bold> <bold>&#x000B1;0.001</bold></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">AGNN-with share</td>
<td valign="top" align="center">3</td>
<td valign="top" align="center">1.6M</td>
<td valign="top" align="center">0.991 &#x000B1; 0.001</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The bold values indicate the highest accuracy at each column.</p>
</table-wrap-foot>
</table-wrap>
<p>&#x02776; <italic>The neural architectures discovered by AGNN without parameter sharing achieve the highest accuracies on all the benchmarks</italic>. Compared with the human-invented GNNs, AGNN without parameter sharing delivers an average improvement of 2.8%, owing to the careful selection along each architecture dimension. Compared with GraphNAS and random search, our AGNN is more effective at exploring outperforming models. At each search step, the whole neural architecture is sampled and reconstructed in GraphNAS and random search. In contrast, our AGNN exploits the best architecture to provide a reliable start, and explores the search space by only modifying a specific module class. Therefore, it provides a good trade-off between exploitation and exploration in pursuing outperforming models.</p>
<p>&#x02777; <italic>AGNN with parameter sharing generally outperforms the human-invented GNNs</italic>. Although parameter sharing causes some performance deterioration, it accelerates the search by avoiding training each sampled model from scratch. We provide a trade-off study between model performance and computation time below.</p></sec>
<sec>
<title>6.4.2. Search of deep GNNs</title>
<p>To answer question <bold>Q2</bold>, we search for deep GNNs with 16 layers and compare with other search algorithms as well as SOTA models. We also include an evolutionary algorithm as another strong baseline, where the best architecture found so far is preserved for the subsequent random mutation.</p>
<p>We note that GraphNAS cannot be directly applied to search deep models due to its simplified search space. The classification accuracies are listed in <xref ref-type="table" rid="T3">Table 3</xref>. &#x02778; <italic>The results show that the novel deep architectures identified by AGNN consistently deliver superior accuracies</italic>. Compared with the human-invented models, AGNN optimizes the skip connections at each layer to tackle the over-smoothing issue, which is the key bottleneck in developing deep GNNs (Zhou et al., <xref ref-type="bibr" rid="B37">2020</xref>). Due to the large search space of skip connections, random search may be inefficient at exploring well-performing architectures within a given number of search steps.</p>
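<p>As a minimal illustration of this architecture axis, the PyTorch sketch below (our own simplification, not the searched model itself) shows how per-layer skip connections selected by the search can be wired into a deep GNN; the dense, row-normalized adjacency and the simple propagation rule are assumptions made for brevity.</p>
<preformat>
import torch
import torch.nn as nn

class DeepGNN(nn.Module):
    """A deep GNN whose skip connections follow a searched configuration."""
    def __init__(self, in_dim, hid_dim, out_dim, num_layers, skips):
        super().__init__()
        self.skips = skips  # e.g., {8: [2, 5]} feeds layer 8 from layers 2 and 5
        dims = [in_dim] + [hid_dim] * (num_layers - 1) + [out_dim]
        self.linears = nn.ModuleList(
            nn.Linear(dims[l], dims[l + 1]) for l in range(num_layers)
        )

    def forward(self, x, adj_norm):
        # adj_norm: dense, row-normalized adjacency with self-loops (N x N)
        hidden = [x]
        for l, lin in enumerate(self.linears):
            h = lin(adj_norm @ hidden[-1])        # aggregate, then transform
            for src in self.skips.get(l, []):     # searched skip connections
                if hidden[src].shape == h.shape:  # only matching dimensions
                    h = h + hidden[src]
            if l + 1 != len(self.linears):        # no activation on the output
                h = torch.relu(h)
            hidden.append(h)
        return hidden[-1]

# e.g., a 16-layer model for Cora-sized inputs (1,433 features, 7 classes):
# model = DeepGNN(1433, 64, 7, 16, skips={8: [2, 5], 12: [4]})
</preformat>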
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Classification accuracy (in percent) of 16-layer GNNs.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Cora</bold></th>
<th valign="top" align="center"><bold>Citeseer</bold></th>
<th valign="top" align="center"><bold>Pubmed</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GCN</td>
<td valign="top" align="center">22.02 &#x000B1; 6.24</td>
<td valign="top" align="center">19.78 &#x000B1; 1.95</td>
<td valign="top" align="center">37.94 &#x000B1; 0.53</td>
</tr>
<tr>
<td valign="top" align="left">PairNorm</td>
<td valign="top" align="center">44.23 &#x000B1; 7.26</td>
<td valign="top" align="center">27.45 &#x000B1; 7.22</td>
<td valign="top" align="center">68.59 &#x000B1; 7.30</td>
</tr>
<tr>
<td valign="top" align="left">SGC</td>
<td valign="top" align="center">72.10 &#x000B1; 0.00</td>
<td valign="top" align="center">71.03 &#x000B1; 1.18</td>
<td valign="top" align="center">70.20 &#x000B1; 0.00</td>
</tr>
<tr>
<td valign="top" align="left">JKNet</td>
<td valign="top" align="center">74.54 &#x000B1; 3.72</td>
<td valign="top" align="center">54.33 &#x000B1; 7.74</td>
<td valign="top" align="center">69.98 &#x000B1; 6.26</td>
</tr>
<tr>
<td valign="top" align="left">APPNP</td>
<td valign="top" align="center">79.38 &#x000B1; 0.62</td>
<td valign="top" align="center">72.13 &#x000B1; 0.53</td>
<td valign="top" align="center">77.07 &#x000B1; 0.66</td>
</tr>
<tr>
<td valign="top" align="left">Random</td>
<td valign="top" align="center">83.76 &#x000B1; 0.42</td>
<td valign="top" align="center">71.55 &#x000B1; 0.94</td>
<td valign="top" align="center">79.01 &#x000B1; 0.47</td>
</tr>
<tr>
<td valign="top" align="left">AGNN</td>
<td valign="top" align="center"><bold>84.06</bold> <bold>&#x000B1;0.29</bold></td>
<td valign="top" align="center"><bold>72.04</bold> <bold>&#x000B1;0.89</bold></td>
<td valign="top" align="center"><bold>79.51</bold> <bold>&#x000B1;0.32</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Datasets use the public splits. The bold values indicate the highest accuracy in each column.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>6.4.3. Search of scalable GNNs</title>
<p>To answer question <bold>Q3</bold>, we further search the training batch size of GNNs on two large-scale graphs: Reddit and ogbn-products. We note that none of the other NAS frameworks supports scalable optimization on graphs with more than 200K nodes. Comparing against scalable GNNs and random search, we list the classification accuracies in <xref ref-type="table" rid="T4">Table 4</xref>. &#x02779; <italic>By searching the appropriate batch size, skip connections, etc., AGNN discovers outperforming scalable GNNs for each large-scale dataset</italic>.</p>
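<p>The sketch below gives a hedged, library-free impression of why the batch size becomes a searchable dimension on such graphs: each candidate is trained on mini-batches of target nodes with a fixed neighbor fanout, so memory stays bounded regardless of graph size. The adjacency dictionary and fanout sampling are our own simplifications of the actual pipeline.</p>
<preformat>
import random

def sample_block(adj, target_nodes, fanout):
    """Sample at most `fanout` neighbors for each target node (1-hop block)."""
    return {v: random.sample(adj.get(v, []), min(fanout, len(adj.get(v, []))))
            for v in target_nodes}

def minibatches(train_nodes, batch_size):
    """Shuffle the training nodes and yield batches of the searched size."""
    nodes = list(train_nodes)
    random.shuffle(nodes)
    for start in range(0, len(nodes), batch_size):
        yield nodes[start:start + batch_size]

# Example: a toy graph with a searched batch size of 2 and a fanout of 2.
adj = {0: [1, 2, 3], 1: [0], 2: [0, 3], 3: [0, 2]}
for batch in minibatches([0, 1, 2, 3], batch_size=2):
    print(batch, sample_block(adj, batch, fanout=2))
</preformat>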
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Node classification accuracies (in percent) on large-scale graphs.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Reddit</bold></th>
<th valign="top" align="center"><bold>ogbn-products</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">GraphSASE</td>
<td valign="top" align="center">95.96 &#x000B1; 0.03</td>
<td valign="top" align="center">78.70 &#x000B1; 0.36</td>
</tr>
<tr>
<td valign="top" align="left">ClusterGCN</td>
<td valign="top" align="center">95.94 &#x000B1; 0.05</td>
<td valign="top" align="center">78.97 &#x000B1; 0.33</td>
</tr>
<tr>
<td valign="top" align="left">GrarphSAINT</td>
<td valign="top" align="center">95.46 &#x000B1; 0.08</td>
<td valign="top" align="center">79.08 &#x000B1; 0.24</td>
</tr>
<tr>
<td valign="top" align="left">Random</td>
<td valign="top" align="center">95.90 &#x000B1; 0.04</td>
<td valign="top" align="center">79.13 &#x000B1; 0.58</td>
</tr>
<tr>
<td valign="top" align="left">AGNN</td>
<td valign="top" align="center"><bold>96.47</bold> <bold>&#x000B1;0.04</bold></td>
<td valign="top" align="center"><bold>79.37</bold> <bold>&#x000B1;0.69</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>The bold values indicate the highest accuracy at each column.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>6.4.4. Search efficiency comparison</title>
<p>To answer question <bold>Q4</bold>, we compare the search efficiency of AGNN, GraphNAS, and random search. Given a total of 1,000 search steps, search efficiency is represented by the progression of the average performance of the top-10 architectures found so far. From <xref ref-type="fig" rid="F3">Figure 3</xref>, we make the following observation.</p>
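<p>The efficiency metric itself is straightforward; a short sketch (with random placeholder scores instead of real validation results) shows how the top-10 average is tracked after every search step.</p>
<preformat>
import random

def top_k_progression(scores, k=10):
    """scores: validation score of the architecture sampled at each search step."""
    best, curve = [], []
    for s in scores:
        best = sorted(best + [s], reverse=True)[:k]   # keep the k best so far
        curve.append(sum(best) / len(best))           # running top-k average
    return curve

steps = [random.random() for _ in range(1000)]        # placeholder search run
print(top_k_progression(steps)[-1])
</preformat>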
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>The progression of top-10 averaged performance of AGNN, GraphNAS, and random search. <bold>(A)</bold> PPI and <bold>(B)</bold> Citeseer.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1029307-g0003.tif"/>
</fig>
<p>&#x0277A; <italic>AGNN identifies the well-performing architectures much faster during the search</italic>. At each step, the top-10 architectures discovered by AGNN have a better average performance than those of GraphNAS and random search. The remarkable efficiency of AGNN stems from the exploitation and conservative exploration abilities of the RCNAS controller: the best architecture is preserved for modification, and only the smallest amendment to one module class is applied. As explained before, the RCNAS controller optimizes the module samples accurately to push the search toward good architectures.</p></sec>
<sec>
<title>6.4.5. Parameter sharing effectiveness</title>
<p>We study the effectiveness of the proposed constrained parameter sharing to answer question <bold>Q5</bold>. Effectively transferred weights should couple into the new architectures so that they are estimated as accurately as when trained from scratch. Within the AGNN framework, we compare the constrained sharing with the relaxed sharing of GraphNAS as well as training from scratch. The cumulative distribution of classification performances over the 1,000 sampled architectures is shown in <xref ref-type="fig" rid="F4">Figure 4</xref>, from which we make the following observation.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>The cumulative distribution of validation accuracies of the 1,000 sampled architectures for AGNN under the constrained/relaxed/without parameter sharing. <bold>(A)</bold> PPI and <bold>(B)</bold> Citeseer.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1029307-g0004.tif"/>
</fig>
<p>In the cumulative distribution, the <italic>X</italic>-axis denotes the cumulative performance and the <italic>Y</italic>-axis the corresponding probability, so a lower curve indicates a better estimation of the sampled architectures. &#x0277B; <italic>The constrained parameter sharing strategy estimates the new architectures close to the ground truth of training from scratch</italic>. Compared with the relaxed sharing, the cumulative distribution curves of the constrained sharing are much closer to those of training from scratch. Given a certain cumulative probability on the <italic>Y</italic>-axis, the constrained sharing reaches neural architectures with better model performances than the relaxed one. That is because parameter sharing is allowed only among the homogeneous architectures with similar output statistics. Combined with a few epochs to warm up the weights in the connection layers, these constraints ensure that the shared weights remain effective in the newly sampled architectures.</p>
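<p>The following sketch illustrates, under our own assumptions, how such a constrained sharing rule can be checked before weights are transferred: the parent and child must agree on the shape-determining choices and have close layer output statistics, and the transferred child is still warmed up for a few epochs. The dictionary layout and the training callback are hypothetical stand-ins for the real implementation.</p>
<preformat>
def is_homogeneous(parent_arch, child_arch, shape_keys=("hidden_dim", "num_heads")):
    """Homogeneous architectures agree on every shape-determining choice."""
    return all(parent_arch[k] == child_arch[k] for k in shape_keys)

def similar_statistics(parent_stats, child_stats, tol=0.1):
    """Compare recorded per-layer output statistics (e.g., means) of both models."""
    rel_err = [abs(p - c) / (abs(p) + 1e-8) for p, c in zip(parent_stats, child_stats)]
    return not any(err > tol for err in rel_err)

def transfer_and_warm_up(parent, child, warm_up_epochs, train_one_epoch):
    """Share weights only under the constraints above, then warm up briefly."""
    if is_homogeneous(parent["arch"], child["arch"]) and \
            similar_statistics(parent["stats"], child["stats"]):
        child["weights"] = dict(parent["weights"])   # constrained sharing
    for _ in range(warm_up_epochs):                  # warm up connection layers
        train_one_epoch(child)
</preformat></sec>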
<sec>
<title>6.4.6. Influence of architecture modification</title>
<p>We study how different scales of architecture modification affect the search efficiency of AGNN. While the preserved architecture is modified at the minimum level when the modification size <italic>s</italic> &#x0003D; 1, the architecture string is completely resampled, as in a traditional controller, when <italic>s</italic> &#x0003D; 6. Specifically, considering <italic>s</italic> &#x0003D; 1, 3, and 6, we show the top-10 architecture progressions on PPI and Citeseer in <xref ref-type="fig" rid="F5">Figure 5</xref>. Since the modification scale affects how well the shared weights couple into the child architectures, AGNN is evaluated under parameter sharing.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The progression of top-10 averaged performance for AGNN under modification size <italic>s</italic> &#x0003D; 1, 3, and 6 on PPI <bold>(left)</bold> and Citeseer <bold>(right)</bold>.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fdata-05-1029307-g0005.tif"/>
</fig>
<p>&#x0277C; <italic>It is observed that the progression efficiency decreases as</italic> <italic>s</italic> <italic>increases</italic>. Specifically, we re-evaluate the best architectures identified with different modification sizes on PPI. The F1 scores for <italic>s</italic> &#x0003D; 1, 3, and 6 are 0.991, 0.987, and 0.982, respectively, which likewise decrease with <italic>s</italic>. The benefits of a smaller <italic>s</italic> in searching neural architectures stem from the following two facts. First, the offspring tends to have a structure and output statistics similar to the preserved input architecture, since it is generated with only a slight modification. The shared weights are therefore more effective in the offspring for estimating the model performance accurately. Second, compared with <italic>s</italic> &#x0003E; 1, the controller with <italic>s</italic> &#x0003D; 1 changes a specific module class and observes the corresponding performance variation. By removing the interference from the other module classes, the class-specific RNN encoder learns exactly the relationship between module samples of a certain class and model performance. Trained iteratively, the RNN encoder tends to sample good modules that form well-performing architectures in the subsequent search.</p></sec>
<sec>
<title>6.4.7. Trade-off between performance and time cost</title>
<p>Training from scratch learns the weights of the sampled architectures accurately and thus improves search reliability, which helps approach the optimal architecture. However, it incurs an enormous computational cost. Our constrained sharing provides a trade-off between search performance and computation time. After transferring weights to avoid complete training, the constrained strategy spends a few additional epochs warming up the weights of the sampled architecture so that the model estimation remains reasonably accurate. The more epochs it takes, the better the search performance it may achieve. We study this trade-off by considering warm-up epochs of {5, 20, 50} / {1, 5, 10} for the transductive/inductive settings. The discovered model performance and time cost are shown in <xref ref-type="table" rid="T5">Table 5</xref>. Note that the time cost is estimated under the following environment: PyTorch implementation on a GeForce GTX-1080 Ti with 12GB of GPU memory.</p>
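<p>A compact sketch of the budgeting behind <xref ref-type="table" rid="T5">Table 5</xref> follows; the 200/20 epoch defaults mirror the settings reported in the table, while the training and validation callbacks are hypothetical stand-ins for the actual pipeline.</p>
<preformat>
def evaluate_candidate(child, train_one_epoch, validate, shared=True,
                       full_epochs=200, warm_up_epochs=20):
    """Score a sampled architecture under either training regime.

    With constrained sharing the child only needs a short warm-up, so the
    per-candidate cost scales roughly with the number of epochs spent here.
    """
    epochs = warm_up_epochs if shared else full_epochs
    for _ in range(epochs):          # cost grows roughly linearly with epochs
        train_one_epoch(child)
    return validate(child)           # estimate fed back to the controller
</preformat>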
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>The trade-off between node classification performance and computation time cost for AGNN.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>&#x00023;Epochs</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Cora</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Citeseer</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>Pubmed</bold></th>
<th valign="top" align="center" colspan="2" style="border-bottom: thin solid #000000;"><bold>PPI</bold></th>
</tr>
<tr>
<th/>
<th valign="top" align="left"><bold>T/I</bold></th>
<th valign="top" align="center"><bold>Time</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>Time</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>Time</bold></th>
<th valign="top" align="center"><bold>Accuracy</bold></th>
<th valign="top" align="center"><bold>Time</bold></th>
<th valign="top" align="center"><bold>F1 score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AGNN-w/o share</td>
<td valign="top" align="left">200 / 20</td>
<td valign="top" align="center">27.76</td>
<td valign="top" align="center">83.6 &#x000B1; 0.3%</td>
<td valign="top" align="center">23.42</td>
<td valign="top" align="center">73.8 &#x000B1; 0.7%</td>
<td valign="top" align="center">28.85</td>
<td valign="top" align="center">79.7 &#x000B1; 0.4%</td>
<td valign="top" align="center">34.36</td>
<td valign="top" align="center">0.992 &#x000B1; 0.001</td>
</tr>
<tr>
<td valign="top" align="left">AGNN-with share</td>
<td valign="top" align="left">5 / 1</td>
<td valign="top" align="center">1.57</td>
<td valign="top" align="center">41.9 &#x000B1; 2.5%</td>
<td valign="top" align="center">4.60</td>
<td valign="top" align="center">67.7 &#x000B1; 1.3%</td>
<td valign="top" align="center">2.51</td>
<td valign="top" align="center">77.4 &#x000B1; 0.7%</td>
<td valign="top" align="center">3.48</td>
<td valign="top" align="center">0.953 &#x000B1; 0.055</td>
</tr>
<tr>
<td valign="top" align="left">AGNN-with share</td>
<td valign="top" align="left">20 / 5</td>
<td valign="top" align="center">4.34</td>
<td valign="top" align="center">82.7 &#x000B1; 0.6%</td>
<td valign="top" align="center">6.61</td>
<td valign="top" align="center">72.7 &#x000B1; 0.7%</td>
<td valign="top" align="center">7.42</td>
<td valign="top" align="center">79.0 &#x000B1; 0.5%</td>
<td valign="top" align="center">12.57</td>
<td valign="top" align="center">0.991 &#x000B1; 0.001</td>
</tr>
<tr>
<td valign="top" align="left">AGNN-with share</td>
<td valign="top" align="left">50 / 10</td>
<td valign="top" align="center">11.70</td>
<td valign="top" align="center">80.2 &#x000B1; 0.6%</td>
<td valign="top" align="center">11.09</td>
<td valign="top" align="center">72.5 &#x000B1; 0.2%</td>
<td valign="top" align="center">10.42</td>
<td valign="top" align="center">79.1 &#x000B1; 0.2%</td>
<td valign="top" align="center">22.62</td>
<td valign="top" align="center">0.991 &#x000B1; 0.001</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p>Time, in seconds, measures the cost of searching and training one new architecture. Symbols T and I denote transductive and inductive learning, respectively. &#x00023;Epochs <italic>a</italic>/<italic>b</italic> denotes the number of training epochs <italic>a</italic> and <italic>b</italic> used in these two settings.</p>
</table-wrap-foot>
</table-wrap>
<p>Some notable features of the trade-off can be summarized from <xref ref-type="table" rid="T5">Table 5</xref>. First, the warm-up budget of 5/1 epochs is insufficient to train a new architecture well enough to estimate its classification ability accurately, so a worse neural architecture may ultimately be identified and tested on the given task. On Cora in particular, the resulting classification accuracy is even much worse than that of the human-invented neural networks. Second, the warm-up budget of 20/5 epochs achieves a good balance between search performance and time cost: it obtains an accuracy or F1 score comparable to training from scratch with 200 epochs, while consuming only a small amount of time. Third, the model performances are similar for warm-up budgets of 50/10 and 20/5 epochs, although the former takes much more time to train the new architectures. That is because parameter sharing transfers well-trained weights from the ancestor architecture, so a few epochs are enough to adapt these weights to the new architecture and estimate its classification performance.</p></sec>
<sec>
<title>6.4.8. Model transfer</title>
<p>Following the model evaluation of NAS in other domains (Pham et al., <xref ref-type="bibr" rid="B23">2018</xref>; Zoph et al., <xref ref-type="bibr" rid="B40">2018</xref>), we investigate whether the discovered GNNs generalize to different tasks. Specifically, we transfer the architecture identified on Cora to the node classification tasks on Citeseer and Pubmed, using the neural architecture found by AGNN without parameter sharing. <xref ref-type="table" rid="T6">Table 6</xref> shows the classification accuracies of the transferred architecture, compared with the original optimal ones in <xref ref-type="table" rid="T1">Table 1</xref>.</p>
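<p>The transfer protocol is simple: the architecture string discovered on Cora is kept fixed, and only the input/output dimensions are adapted to the target dataset before retraining, as in the hedged sketch below (the builder function and dataset metadata are hypothetical placeholders).</p>
<preformat>
def transfer_architecture(arch, target_dataset, build_model):
    """Reuse a discovered architecture on a new dataset.

    Only the dataset-dependent dimensions change; all searched choices
    (aggregation, attention, activation, etc.) stay as found on Cora.
    """
    arch = dict(arch, in_dim=target_dataset["num_features"],
                out_dim=target_dataset["num_classes"])
    return build_model(arch)   # retrained from scratch on the target graph
</preformat>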
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Test performance comparison of the transferred architecture to the optimal ones on Citeseer and Pubmed.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="center"><bold>Citeseer</bold></th>
<th valign="top" align="center"><bold>Pubmed</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">AGNN-w/o share</td>
<td valign="top" align="center">73.8 &#x000B1; 0.7%</td>
<td valign="top" align="center">79.7 &#x000B1; 0.4%</td>
</tr>
<tr>
<td valign="top" align="left">Transferred model</td>
<td valign="top" align="center">71.8 &#x000B1; 0.7%</td>
<td valign="top" align="center">78.5 &#x000B1; 0.4%</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>&#x0277D; <italic>It is observed that the test accuracies of the transferred architecture are slightly worse than those of the discovered optimal ones</italic>. The experimental results validate our earlier research motivation: no single graph neural network generalizes across a range of different graph-structured data. Given a specific graph analysis task, it is crucial to carefully identify a well-performing architecture to optimize the desired performance.</p></sec></sec></sec>
<sec sec-type="conclusions" id="s7">
<title>7. Conclusion</title>
<p>In this paper, we present AGNN to find the optimal graph neural architecture for a given node classification task. The comprehensive search space, the RCNAS controller, and the constrained parameter sharing strategy are designed jointly for the optimization of elementary, deep, and scalable GNNs. The experimental results show that the discovered neural architectures achieve highly competitive performance on popular benchmark datasets and large-scale graphs. The proposed RCNAS controller searches for well-performing architectures more efficiently, and the constrained sharing makes the transferred weights more effective in the offspring networks.</p></sec>
<sec sec-type="data-availability" id="s8">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary material</xref>, further inquiries can be directed to the corresponding author.</p></sec>
<sec id="s9">
<title>Author contributions</title>
<p>KZ and XHu contributed to the whole framework. XHua, QS, and RC contributed to the paper revising. All authors contributed to the article and approved the submitted version.</p></sec>
<sec sec-type="funding-information" id="s10">
<title>Funding</title>
<p>This work was partially supported by NSF (&#x00023;IIS-1750074, &#x00023;IIS-1900990, and &#x00023;IIS-18490850).</p></sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>Author QS was employed by LinkedIn. Author RC was employed by Samsung Research America. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p></sec>
<sec id="s12">
<title>Author disclaimer</title>
<p>The views, opinions, and/or findings contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.</p></sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s13">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2022.1029307/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2022.1029307/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.PDF" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/></sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baker</surname> <given-names>B.</given-names></name> <name><surname>Gupta</surname> <given-names>O.</given-names></name> <name><surname>Naik</surname> <given-names>N.</given-names></name> <name><surname>Raskar</surname> <given-names>R.</given-names></name></person-group> (<year>2016</year>). <article-title>Designing neural network architectures using reinforcement learning</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1611.02167</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cai</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Zha</surname> <given-names>Z.-J.</given-names></name> <name><surname>Su</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2021</year>). <article-title>Rethinking graph neural architecture search from message-passing,</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Nashville, TN</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6657</fpage>&#x02013;<lpage>6666</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>W.-L.</given-names></name> <name><surname>Lin</surname> <given-names>T.-H.</given-names></name></person-group> (<year>2010</year>). <article-title>A cluster-based approach for automatic social network construction,</article-title> in <source>2010 IEEE Second International Conference on Social Computing</source> (<publisher-loc>Minneapolis, MN</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>601</fpage>&#x02013;<lpage>606</lpage>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Wei</surname> <given-names>Z.</given-names></name> <name><surname>Huang</surname> <given-names>Z.</given-names></name> <name><surname>Ding</surname> <given-names>B.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Simple and deep graph convolutional networks</article-title>. <source>arXiv preprint arXiv:2007.02133</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2007.02133</pub-id></citation>
</ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>Y.-W.</given-names></name> <name><surname>Song</surname> <given-names>Q.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name></person-group> (<year>2021</year>). <article-title>Techniques for automated machine learning</article-title>. <source>ACM SIGKDD Explorat. Newslett</source>. <volume>22</volume>, <fpage>35</fpage>&#x02013;<lpage>50</lpage>. <pub-id pub-id-type="doi">10.1145/3447556.3447567</pub-id></citation>
</ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chiang</surname> <given-names>W.-L.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Si</surname> <given-names>S.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Hsieh</surname> <given-names>C.-J.</given-names></name></person-group> (<year>2019</year>). <article-title>Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks,</article-title> in <source>Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Anchorage, AL</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>257</fpage>&#x02013;<lpage>266</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Corso</surname> <given-names>G.</given-names></name> <name><surname>Cavalleri</surname> <given-names>L.</given-names></name> <name><surname>Beaini</surname> <given-names>D.</given-names></name> <name><surname>Li&#x000F2;</surname> <given-names>P.</given-names></name> <name><surname>Veli&#x0010D;kovi&#x00107;</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>Principal neighbourhood aggregation for graph nets</article-title>. <source>arXiv preprint arXiv:2004.05718</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2004.05718</pub-id></citation>
</ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Defferrard</surname> <given-names>M.</given-names></name> <name><surname>Bresson</surname> <given-names>X.</given-names></name> <name><surname>Vandergheynst</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>Convolutional neural networks on graphs with fast localized spectral filtering,</article-title> in <source>NeuIPS</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>Conference on Neural Information Processing Systems</publisher-name>).</citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Yao</surname> <given-names>Q.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name></person-group> (<year>2020</year>). <article-title>Propagation model search for graph neural networks</article-title>. <source>arXiv preprint arXiv:2010.03250</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2010.03250</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>Y.</given-names></name> <name><surname>Yao</surname> <given-names>Q.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name></person-group> (<year>2021</year>). <article-title>Diffmg: differentiable meta graph search for heterogeneous graph neural networks,</article-title> in <source>Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>279</fpage>&#x02013;<lpage>288</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Elsken</surname> <given-names>T.</given-names></name> <name><surname>Metzen</surname> <given-names>J. H.</given-names></name> <name><surname>Hutter</surname> <given-names>F.</given-names></name></person-group> (<year>2018</year>). <article-title>Neural architecture search: a survey</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.1007/978-3-030-05318-5_3</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Ji</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Large-scale learnable graph convolutional networks,</article-title> in <source>Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery &#x00026; Data Mining</source> (<publisher-loc>London</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1416</fpage>&#x02013;<lpage>1424</lpage>.<pub-id pub-id-type="pmid">33006927</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gao</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>P.</given-names></name> <name><surname>Zhou</surname> <given-names>C.</given-names></name> <name><surname>Hu</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>Graphnas: graph neural architecture search with reinforcement learning</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1904.09981</pub-id></citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gilmer</surname> <given-names>J.</given-names></name> <name><surname>Schoenholz</surname> <given-names>S. S.</given-names></name> <name><surname>Riley</surname> <given-names>P. F.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Dahl</surname> <given-names>G. E.</given-names></name></person-group> (<year>2017</year>). <article-title>Neural message passing for quantum chemistry</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1704.01212</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grover</surname> <given-names>A.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>node2vec: scalable feature learning for networks,</article-title> in <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>San Fransisco, CA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>855</fpage>&#x02013;<lpage>864</lpage>.<pub-id pub-id-type="pmid">27853626</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hamilton</surname> <given-names>W.</given-names></name> <name><surname>Ying</surname> <given-names>Z.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Inductive representation learning on large graphs,</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Long Beach, CA</publisher-loc>), <fpage>1024</fpage>&#x02013;<lpage>1034</lpage>.<pub-id pub-id-type="pmid">34111002</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>W.</given-names></name> <name><surname>Fey</surname> <given-names>M.</given-names></name> <name><surname>Zitnik</surname> <given-names>M.</given-names></name> <name><surname>Dong</surname> <given-names>Y.</given-names></name> <name><surname>Ren</surname> <given-names>H.</given-names></name> <name><surname>Liu</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Open graph benchmark: datasets for machine learning on graphs</article-title>. <source>arXiv preprint arXiv:2005.00687</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2005.00687</pub-id><pub-id pub-id-type="pmid">22854035</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kipf</surname> <given-names>T. N.</given-names></name> <name><surname>Welling</surname> <given-names>M.</given-names></name></person-group> (<year>2017</year>). <article-title>Semi-supervised classification with graph convolutional networks,</article-title> in <source>ICLR</source> (<publisher-loc>Toulon</publisher-loc>: <publisher-name>International Conference on Learning Representations</publisher-name>).</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Klicpera</surname> <given-names>J.</given-names></name> <name><surname>Bojchevski</surname> <given-names>A.</given-names></name> <name><surname>G&#x000FC;nnemann</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Predict then propagate: graph neural networks meet personalized pagerank</article-title>. <source>arXiv preprint arXiv:1810.05997</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1810.05997</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>King</surname> <given-names>I.</given-names></name></person-group> (<year>2020</year>). <article-title>Autograph: automated graph neural network,</article-title> in <source>International Conference on Neural Information Processing</source> (<publisher-loc>Bangkok</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>189</fpage>&#x02013;<lpage>201</lpage>.<pub-id pub-id-type="pmid">35783353</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>C.</given-names></name> <name><surname>Chen</surname> <given-names>L.-C.</given-names></name> <name><surname>Schroff</surname> <given-names>F.</given-names></name> <name><surname>Adam</surname> <given-names>H.</given-names></name> <name><surname>Hua</surname> <given-names>W.</given-names></name> <name><surname>Yuille</surname> <given-names>A. L.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Auto-deeplab: hierarchical neural architecture search for semantic image segmentation,</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>82</fpage>&#x02013;<lpage>92</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Micheli</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>Neural network for graphs: a contextual constructive approach</article-title>. <source>IEEE Trans. Neural Netw</source>. <volume>20</volume>, <fpage>498</fpage>&#x02013;<lpage>511</lpage>. <pub-id pub-id-type="doi">10.1109/TNN.2008.2010350</pub-id><pub-id pub-id-type="pmid">19193509</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pham</surname> <given-names>H.</given-names></name> <name><surname>Guan</surname> <given-names>M. Y.</given-names></name> <name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name> <name><surname>Dean</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Efficient neural architecture search via parameter sharing</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1802.03268</pub-id><pub-id pub-id-type="pmid">33788694</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sen</surname> <given-names>P.</given-names></name> <name><surname>Namata</surname> <given-names>G.</given-names></name> <name><surname>Bilgic</surname> <given-names>M.</given-names></name> <name><surname>Getoor</surname> <given-names>L.</given-names></name> <name><surname>Galligher</surname> <given-names>B.</given-names></name> <name><surname>Eliassi-Rad</surname> <given-names>T.</given-names></name></person-group> (<year>2008</year>). <article-title>Collective classification in network data</article-title>. <source>AI Mag</source>. <volume>29</volume>, <fpage>93</fpage>&#x02013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1609/aimag.v29i3.2157</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>M.</given-names></name> <name><surname>Wilson</surname> <given-names>D. A.</given-names></name> <name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Zhuang</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Evolutionary architecture search for graph neural networks</article-title>. <source>Knowl. Based Syst</source>. <volume>247</volume>, <fpage>108752</fpage>. <pub-id pub-id-type="doi">10.1016/j.knosys.2022.108752</pub-id><pub-id pub-id-type="pmid">34304718</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>McAllester</surname> <given-names>D. A.</given-names></name> <name><surname>Singh</surname> <given-names>S. P.</given-names></name> <name><surname>Mansour</surname> <given-names>Y.</given-names></name></person-group> (<year>2000</year>). <article-title>Policy gradient methods for reinforcement learning with function approximation,</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Denver, CO</publisher-loc>), <fpage>1057</fpage>&#x02013;<lpage>1063</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vaswani</surname> <given-names>A.</given-names></name> <name><surname>Shazeer</surname> <given-names>N.</given-names></name> <name><surname>Parmar</surname> <given-names>N.</given-names></name> <name><surname>Uszkoreit</surname> <given-names>J.</given-names></name> <name><surname>Jones</surname> <given-names>L.</given-names></name> <name><surname>Gomez</surname> <given-names>A. N.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Attention is all you need,</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Long Beach, CA</publisher-loc>) <fpage>5998</fpage>&#x02013;<lpage>6008</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Velickovic</surname> <given-names>P.</given-names></name> <name><surname>Cucurull</surname> <given-names>G.</given-names></name> <name><surname>Casanova</surname> <given-names>A.</given-names></name> <name><surname>Romero</surname> <given-names>A.</given-names></name> <name><surname>Lio</surname> <given-names>P.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2017</year>). <article-title>Graph attention networks</article-title>. <source>arXiv</source> 1. <pub-id pub-id-type="doi">10.48550/arXiv.1710.10903</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wei</surname> <given-names>L.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Yao</surname> <given-names>Q.</given-names></name> <name><surname>He</surname> <given-names>Z.</given-names></name></person-group> (<year>2021</year>). <article-title>Pooling architecture search for graph classification,</article-title> in <source>Proceedings of the 30th ACM International Conference on Information and Knowledge Management</source> (<publisher-loc>Gold Coast, QLD</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>2091</fpage>&#x02013;<lpage>2100</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>F.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Souza Jr</surname> <given-names>A. H. d.</given-names></name> <name><surname>Fifty</surname> <given-names>C.</given-names></name> <name><surname>Yu</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Simplifying graph convolutional networks</article-title>. <source>arXiv preprint arXiv:1902.07153</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1902.07153</pub-id><pub-id pub-id-type="pmid">34899230</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>K.</given-names></name> <name><surname>Li</surname> <given-names>C.</given-names></name> <name><surname>Tian</surname> <given-names>Y.</given-names></name> <name><surname>Sonobe</surname> <given-names>T.</given-names></name> <name><surname>Kawarabayashi</surname> <given-names>K.-I.</given-names></name> <name><surname>Jegelka</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Representation learning on graphs with jumping knowledge networks</article-title>. <source>arXiv preprint arXiv:1806.03536</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1806.03536</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>You</surname> <given-names>J.</given-names></name> <name><surname>Ying</surname> <given-names>Z.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name></person-group> (<year>2020</year>). <article-title>Design space for graph neural networks</article-title>, in <source>Advances in Neural Information Processing Systems, Vol. 33</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zeng</surname> <given-names>H.</given-names></name> <name><surname>Zhou</surname> <given-names>H.</given-names></name> <name><surname>Srivastava</surname> <given-names>A.</given-names></name> <name><surname>Kannan</surname> <given-names>R.</given-names></name> <name><surname>Prasanna</surname> <given-names>V.</given-names></name></person-group> (<year>2019</year>). <article-title>Graphsaint: graph sampling based inductive learning method</article-title>. <source>arXiv preprint arXiv:1907.04931</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1907.04931</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Wei</surname> <given-names>L.</given-names></name> <name><surname>Yao</surname> <given-names>Q.</given-names></name></person-group> (<year>2020a</year>). <article-title>Simplifying architecture search for graph neural network,</article-title> in <source>International Conference on Information and Knowledge Management</source> (<publisher-loc>Galway</publisher-loc>).</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>L.</given-names></name> <name><surname>Akoglu</surname> <given-names>L.</given-names></name></person-group> (<year>2019</year>). <article-title>Pairnorm: tackling oversmoothing in gnns</article-title>. <source>arXiv preprint arXiv:1909.12223</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1909.12223</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Gao</surname> <given-names>X.</given-names></name> <name><surname>Mullins</surname> <given-names>R.</given-names></name> <name><surname>Lio</surname> <given-names>P.</given-names></name> <name><surname>Jamnik</surname> <given-names>M.</given-names></name></person-group> (<year>2020b</year>). <article-title>Probabilistic dual network architecture search on graphs</article-title>. <source>arXiv preprint arXiv:2003.09676</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2003.09676</pub-id></citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>K.</given-names></name> <name><surname>Huang</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Zha</surname> <given-names>D.</given-names></name> <name><surname>Chen</surname> <given-names>R.</given-names></name> <name><surname>Hu</surname> <given-names>X.</given-names></name></person-group> (<year>2020</year>). <article-title>Towards deeper graph neural networks with differentiable group normalization</article-title>. <source>arXiv preprint arXiv:2006.06972</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2006.06972</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zitnik</surname> <given-names>M.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Predicting multicellular function through multi-layer tissue networks</article-title>. <source>Bioinformatics</source> <volume>33</volume>, <fpage>i190</fpage>-i198. <pub-id pub-id-type="doi">10.1093/bioinformatics/btx252</pub-id><pub-id pub-id-type="pmid">28881986</pub-id></citation></ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2016</year>). <article-title>Neural architecture search with reinforcement learning</article-title>. <source>arXiv</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1611.01578</pub-id><pub-id pub-id-type="pmid">34460412</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zoph</surname> <given-names>B.</given-names></name> <name><surname>Vasudevan</surname> <given-names>V.</given-names></name> <name><surname>Shlens</surname> <given-names>J.</given-names></name> <name><surname>Le</surname> <given-names>Q. V.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning transferable architectures for scalable image recognition,</article-title> in <source>CVPR</source>, <fpage>8697</fpage>&#x02013;<lpage>8710</lpage>.</citation>
</ref>
</ref-list> 
</back>
</article> 