<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2019.00022</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Link Definition Ameliorating Community Detection in Collaboration Networks</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Dilmaghani</surname> <given-names>Saharnaz</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/713554/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Brust</surname> <given-names>Matthias R.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/617270/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Piyatumrong</surname> <given-names>Apivadee</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/738071/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Danoy</surname> <given-names>Gr&#x000E9;goire</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/756675/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Bouvry</surname> <given-names>Pascal</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Interdisciplinary Centre for Security, Reliability, and Trust (SnT), University of Luxembourg</institution>, <addr-line>Esch-sur-Alzette</addr-line>, <country>Luxembourg</country></aff>
<aff id="aff2"><sup>2</sup><institution>National Electronics and Computer Technology Center, A Member of NSTDA</institution>, <addr-line>Bangkok</addr-line>, <country>Thailand</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Andrea Tagarelli, University of Calabria, Italy</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Domenico Mandaglio, University of Calabria, Italy; Pasquale De Meo, University of Messina, Italy</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Saharnaz Dilmaghani <email>saharnaz.dilmaghani&#x00040;uni.lu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>26</day>
<month>06</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>2</volume>
<elocation-id>22</elocation-id>
<history>
<date date-type="received">
<day>02</day>
<month>04</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>04</day>
<month>06</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2019 Dilmaghani, Brust, Piyatumrong, Danoy and Bouvry.</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Dilmaghani, Brust, Piyatumrong, Danoy and Bouvry</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Collaboration networks are defined as a set of individuals who come together and collaborate on particular tasks such as publishing a paper. The analysis of such networks makes it possible to extract knowledge on the structure and patterns of communities. The link definition and network extraction have a high impact on the analysis of collaboration networks. Previous studies model the connectivity in a network as a binary problem, considering only whether a collaboration exists between individuals. However, such data contain a wide variety of features that describe the quality of the interaction, such as the amount each individual contributed. In this paper, we propose a method to extract collaboration networks using the corresponding features in a dataset. We define a <italic>collaboration score</italic> to quantify the collaboration between collaborators. To validate our proposed method, we use a dataset from a scientific research institute in which researchers are co&#x02013;authors involved in the production of papers, prototypes, and intellectual properties (IP). We evaluated the generated networks, produced with different thresholds on the <italic>collaboration score</italic>, by employing a set of network analysis metrics such as clustering coefficient, network density, and centrality measures. We further analyzed the obtained networks using a community detection algorithm to discuss the impact of our model on community detection. The outcome shows that the quality of the resulting communities on the extracted collaboration networks can differ significantly based on the choice of the linkage threshold.</p></abstract> <kwd-group>
<kwd>network interactions</kwd>
<kwd>data-to-network</kwd>
<kwd>collaboration network</kwd>
<kwd>data analysis</kwd>
<kwd>community detection analysis</kwd>
</kwd-group>
<counts>
<fig-count count="3"/>
<table-count count="1"/>
<equation-count count="1"/>
<ref-count count="22"/>
<page-count count="6"/>
<word-count count="3799"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Collaboration networks are social structures that capture the relationships between collaborators who work on the same tasks. Collaboration is an essential component of the success of today&#x00027;s knowledge sharing ecosystem (Huang et al., <xref ref-type="bibr" rid="B9">2008</xref>) and of the establishment of innovation. In collaboration networks, nodes represent individuals (aka collaborators) and links between them imply a collaboration. The analysis of collaboration networks can reveal information about the most likely behavior of individuals and groups in the network (Jamali and Abolhassani, <xref ref-type="bibr" rid="B10">2006</xref>), such as the interaction patterns (Akbas et al., <xref ref-type="bibr" rid="B1">2013</xref>; Long et al., <xref ref-type="bibr" rid="B12">2014</xref>; Dilmaghani et al., <xref ref-type="bibr" rid="B7">2019</xref>), the evolution of collaboration communities (Kibanov et al., <xref ref-type="bibr" rid="B11">2013</xref>), and predictive models of the productivity and longevity of collaborations (Chakraborty et al., <xref ref-type="bibr" rid="B6">2015</xref>).</p>
<p>One prominent property studied in the context of collaboration networks is the community structure of nodes (Pan et al., <xref ref-type="bibr" rid="B17">2014</xref>). The discovery of communities, with dense intra-community connections and comparatively sparse inter-community connections, can be beneficial for various applications such as discovering common research areas of potential collaborators (Bedi and Sharma, <xref ref-type="bibr" rid="B2">2016</xref>). Various network-based community detection algorithms are used for this purpose, e.g., the <italic>Louvain</italic> algorithm (Blondel et al., <xref ref-type="bibr" rid="B3">2008</xref>) and the Label Propagation Algorithm (LPA) (Zhu and Ghahramani, <xref ref-type="bibr" rid="B22">2002</xref>).</p>
<p>Most collaboration data are stored in relational databases, from which collaboration networks are extracted for network analysis. The study of scientific collaboration networks was initiated by Newman (<xref ref-type="bibr" rid="B14">2001a</xref>) and Newman (<xref ref-type="bibr" rid="B15">2001b</xref>). There, the network is defined such that researchers are represented as nodes and a link is constructed between two researchers if they have published at least one paper together. Other studies such as Chakraborty et al. (<xref ref-type="bibr" rid="B6">2015</xref>) have followed a similar generative approach to construct the collaboration network from the dataset. In a recent study (Sharma and Bhavani, <xref ref-type="bibr" rid="B20">2019</xref>), a weighted scientific collaboration network has been proposed such that links are weighted by the number of papers. One drawback of previous studies is that they discard other potential features that describe the collaborations (e.g., date, number of citations). The information attached to the data can substantially impact the underlying network representation and, therefore, the outcomes of network analysis (e.g., community detection). Thus, the appropriate use of network analysis substantially depends on choosing the right network representation (Scholtes, <xref ref-type="bibr" rid="B19">2017</xref>), i.e., the definition of nodes and links (Butts, <xref ref-type="bibr" rid="B5">2009</xref>). Besides, in some cases, the definition of the link also requires determining a <italic>threshold</italic>, which can significantly alter network properties, e.g., network density (Faust, <xref ref-type="bibr" rid="B8">2007</xref>).</p>
<p>In this paper, we investigate the fundamental research question of which network representation to choose for a given set of data and how to construct it. Previous studies only consider the existence of a collaboration between individuals to connect them in the network. In contrast, our work proposes a standardized method to produce networks from large and complex datasets. We define a method to construct scientific collaboration networks from the data considering different features describing the collaboration. Furthermore, we use the scientific collaboration dataset of the <italic>National Electronics and Computer Technology Center</italic> (NECTEC) to examine our method. Interestingly, our results indicate that identifying a network construction model leads to a less noisy network with a well&#x02013;shaped community structure and a high modularity score.</p></sec>
<sec id="s2">
<title>2. Dataset</title>
<p>We use a collaboration database provided by the <italic>National Electronics and Computer Technology Center</italic> (NECTEC) that records different projects and collaborations in the area of R&#x00026;D<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref>. The database serves as a knowledge management resource for projects and their deliverables, in which the key information is the contributors to each project and their contributions. The database consists of three datasets, each representing a particular type of deliverable: <italic>PAPER, PROTOTYPE</italic>, and <italic>IP</italic> (intellectual property), produced between July 2013 and July 2018.</p>
<p>Combined, the research team datasets consist of approximately 8,000 records, which correspond to more than 2,300 projects. Detailed statistical information regarding each dataset is provided in <xref ref-type="table" rid="T1">Table 1</xref>. Overall, NECTEC has more than 1,000 members who contribute to different deliverables, with certain features that have been evaluated by the organization. For each researcher who collaborated on a contribution, a contribution percentage has been recorded. Another feature, named IC&#x02013;score and designed by NECTEC, evaluates the scientific value and the outcome of contributions. For instance, producing a prototype in an industrial stage has a higher impact than one in the laboratory stage. For each project, the IC&#x02013;score is divided among the contributors according to their individual participation in the project. Overall, each dataset of the deliverables contains (a) project ID, (b) collaborator&#x00027;s ID, (c) contribution percentage of a collaborator for each project, and (d) IC&#x02013;score of a collaborator for each project.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>General overview of the datasets from NECTEC.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Deliverable type</bold></th>
<th valign="top" align="center"><bold>&#x00023; Researchers</bold></th>
<th valign="top" align="center"><bold>&#x00023; Projects</bold></th>
<th valign="top" align="center"><bold>Cont. percentage</bold></th>
<th valign="top" align="center"><bold>IC&#x02013;score</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><italic>PAPER</italic></td>
<td valign="top" align="center">576</td>
<td valign="top" align="center">1717</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 22.22, &#x003C3; &#x0003D; 19.73</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 3.89, &#x003C3; &#x0003D; 4.61</td>
</tr>
<tr>
<td valign="top" align="left"><italic>PROTOTYPE</italic></td>
<td valign="top" align="center">524</td>
<td valign="top" align="center">539</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 15.54, &#x003C3; &#x0003D; 13.73</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 9.41, &#x003C3; &#x0003D; 10.75</td>
</tr>
<tr>
<td valign="top" align="left"><italic>IP</italic></td>
<td valign="top" align="center">489</td>
<td valign="top" align="center">630</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 25.15, &#x003C3; &#x0003D; 24.42</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 4.08, &#x003C3; &#x0003D; 4.63</td>
</tr> <tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Total</td>
<td valign="top" align="center">1,056</td>
<td valign="top" align="center">2,347</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 20.78, &#x003C3; &#x0003D; 19.82</td>
<td valign="top" align="center"><italic>&#x003BC;</italic> &#x0003D; 5.81, &#x003C3; &#x0003D; 7.73</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Contribution percentage (Cont. percentage) and IC&#x02013;score are features extracted from the dataset and describe the collaboration</italic>.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec id="s3">
<title>3. Methodology for Link Construction</title>
<p>We propose a <italic>collaboration score</italic> function that takes into account the combination of features extracted from the dataset. The purpose is to quantify the contribution of researchers considering features describing the collaborations. The collaboration score is the key element to define the link in the network while nodes are co&#x02013;authors. We introduce a <italic>linkage threshold</italic> (<italic>LT</italic>) on obtained collaboration scores. Thus, multiple networks are produced using various <italic>LT</italic> values.</p>
<p>We define the <italic>collaboration score</italic> function based on the features extracted from the NECTEC datasets, which include (a) the number of projects, (b) the contribution percentage of researchers, and (c) the IC&#x02013;score of researchers. Given two researchers <italic>i</italic> and <italic>j</italic> who worked on a common project <italic>p</italic>, i.e., a tuple (<italic>i, j</italic>), let <italic>n</italic> be the number of projects on which <italic>i</italic> and <italic>j</italic> have collaborated, and let <italic>p</italic><sub><italic>k, i</italic></sub> and <italic>p</italic><sub><italic>k, j</italic></sub> represent the contribution percentages of researchers <italic>i</italic> and <italic>j</italic>, respectively, for the <italic>k</italic>th project. Likewise, <italic>s</italic><sub><italic>k, i</italic></sub> and <italic>s</italic><sub><italic>k, j</italic></sub> indicate the IC&#x02013;scores of the two researchers on the <italic>k</italic>th project. Hence, we define the <italic>collaboration score</italic> function as follows.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mrow><mml:msub><mml:mi>f</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mi>n</mml:mi></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>n</mml:mi></mml:munderover><mml:mrow><mml:mo stretchy='false'>(</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>,</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:mstyle></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For any tuple of collaborators, the function averages, over their <italic>n</italic> shared projects, the sum of their mean contribution percentage and their mean IC&#x02013;score. The <italic>LT</italic> is then defined such that it selects different levels of collaboration score in the network. The range of <italic>LT</italic> varies from 0 to 1, which is the normalized range of the collaboration score. In a nutshell, increasing <italic>LT</italic> increases the number of collaborations that are retained as links.</p>
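<p>As a rough illustration of Equation (1), the following Python sketch computes the score for one pair of researchers and rescales a list of scores to the range [0, 1]. The function names, the input layout (one tuple per shared project), and the example values are hypothetical. The normalization scheme is not specified above, so min-max scaling is assumed here.</p>
<preformat><![CDATA[
```python
def collaboration_score(shared_projects):
    """Collaboration score f_ij of Equation (1) for one pair of researchers.

    shared_projects holds one (p_i, p_j, s_i, s_j) tuple per shared project:
    the contribution percentages and IC-scores of researchers i and j.
    """
    n = len(shared_projects)
    if n == 0:
        raise ValueError("the pair must share at least one project")
    contribution_sum = sum(p_i + p_j for p_i, p_j, _, _ in shared_projects)
    ic_sum = sum(s_i + s_j for _, _, s_i, s_j in shared_projects)
    return (0.5 * contribution_sum + 0.5 * ic_sum) / n


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; min-max scaling is an assumption of this sketch."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical pair with two shared projects.
example = [(40.0, 20.0, 4.0, 2.0), (30.0, 10.0, 6.0, 2.0)]
print(collaboration_score(example))
print(min_max_normalize([10.0, 20.0, 30.0]))
```
]]></preformat>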
<p>The threshold determines which collaborations become links between the nodes of the network. We produce a set of networks considering various <italic>LT</italic>s. Algorithm 1 shows the pseudocode of the transformation from data to networks. A relational dataset of collaborations is the input of the algorithm. The researchers are taken as the nodes of the network. For each tuple of researchers, the collaboration score is computed (see line 4). In order to generate a network, links are produced considering a particular <italic>LT</italic> value. All collaborations whose score is less than or equal to the chosen threshold become links in the network (see line 7). Considering various levels of <italic>LT</italic>, the algorithm generates a set of networks, which is examined in section 4.</p>
<table-wrap position="float">
<label>Algorithm 1</label>
<caption><p>Network Extraction from Data</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td align="left" valign="top">&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>Input:</bold> <italic>D</italic>, scientific collaboration dataset<break/> &#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>Output:</bold> <inline-formula><mml:math id="M2"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula>, a vector of generated networks</td>
</tr>
<tr><td align="left" valign="top"><break/>1: &#x000A0;&#x000A0;<bold>procedure</bold> <sc>TRANSFORM-TO-NETWORK</sc>(<italic>D</italic>)</td></tr>
<tr><td align="left" valign="top">2:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<italic>colList</italic> &#x02190; researchers from <italic>D</italic></td></tr>
<tr><td align="left" valign="top">3:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>for</bold> <italic>tuple</italic>(<italic>i, j</italic>) in <italic>colList</italic> <bold>do</bold></td></tr>
<tr><td align="left" valign="top">4:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<italic>f</italic>.append&#x02190;<italic>collaborationScore</italic>(<italic>tuple</italic>(<italic>i, j</italic>))</td></tr>
<tr><td align="left" valign="top">5:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<italic>collaboration</italic>.append&#x02190; Concatenate <italic>tuple</italic>(<italic>i, j</italic>) and <italic>normalize</italic>(<italic>f</italic>)</td></tr>
<tr><td align="left" valign="top">6:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>for</bold> <italic>LT</italic> in <bold>range</bold>(<italic>normalize</italic>(<italic>f</italic>)) <bold>do</bold></td></tr>
<tr><td align="left" valign="top">7:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <bold>if</bold> <italic>collaboration</italic>.<italic>normalize</italic>(<italic>f</italic>) &#x02264; <italic>LT</italic> <bold>then</bold></td></tr>
<tr><td align="left" valign="top">8:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<italic>nodes</italic>.append([<italic>i, j</italic>])</td></tr>
<tr><td align="left" valign="top">9:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<italic>links</italic>.append([<italic>tuple</italic>(<italic>i, j</italic>)])</td></tr>
<tr><td align="left" valign="top">10:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0; <italic>G</italic> &#x02190; <italic>Network</italic>(<italic>nodes, links</italic>)</td></tr>
<tr><td align="left" valign="top">11:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<inline-formula><mml:math id="M3"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula>.append <italic>G</italic></td></tr>
<tr><td align="left" valign="top">12:&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;&#x000A0;<bold>return</bold> <inline-formula><mml:math id="M4"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">G</mml:mi></mml:mrow></mml:math></inline-formula></td></tr>
</tbody>
</table>
</table-wrap>
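<p>Algorithm 1 can be approximated in Python along the following lines. This is only a sketch under stated assumptions, not the implementation used for the experiments: the record layout (project ID, researcher ID, contribution percentage, IC&#x02013;score), the function names, the min-max normalization, and the choice to keep every researcher as a node regardless of <italic>LT</italic> are all assumptions. The default of 41 evenly spaced <italic>LT</italic> values matches the sweep from 0 to 1 in steps of 0.025 reported in section 4.2.</p>
<preformat><![CDATA[
```python
from collections import defaultdict
from itertools import combinations


def pair_scores(records):
    """Collaboration score per researcher pair, following Equation (1).

    records: iterable of (project_id, researcher_id, contribution_pct, ic_score),
    with each researcher listed at most once per project.
    """
    by_project = defaultdict(list)
    for project_id, researcher, pct, ic in records:
        by_project[project_id].append((researcher, pct, ic))
    sums = defaultdict(float)
    counts = defaultdict(int)
    for members in by_project.values():
        # Sorting keeps each pair key in a fixed (smaller id, larger id) order.
        for (i, p_i, s_i), (j, p_j, s_j) in combinations(sorted(members), 2):
            sums[(i, j)] += 0.5 * (p_i + p_j) + 0.5 * (s_i + s_j)
            counts[(i, j)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


def extract_networks(records, steps=41):
    """Return one (LT, nodes, links) triple per threshold value; needs steps >= 2."""
    records = list(records)
    nodes = sorted({researcher for _, researcher, _, _ in records})
    scores = pair_scores(records)
    normalized = {}
    if scores:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
        normalized = {pair: (f - lo) / span for pair, f in scores.items()}
    networks = []
    for step in range(steps):
        lt = step / (steps - 1)
        # A pair becomes a link when its normalized score does not exceed LT.
        links = sorted(pair for pair, f in normalized.items() if f <= lt)
        networks.append((lt, nodes, links))
    return networks


# Hypothetical records: two projects, three researchers.
records = [
    ("P1", "a", 50.0, 4.0), ("P1", "b", 50.0, 4.0),
    ("P2", "b", 20.0, 1.0), ("P2", "c", 10.0, 1.0),
]
for lt, nodes, links in extract_networks(records, steps=3):
    print(lt, nodes, links)
```
]]></preformat>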
</sec>
<sec sec-type="results" id="s4">
<title>4. Results</title>
<p>We applied our proposed method to the different deliverable types of the previously described NECTEC collaboration data. As a result of the extraction process, our method returns a set of corresponding collaboration networks. In the first stage, we examine the distribution of the collaboration score (<italic>f</italic>) within each dataset. Next, we analyze the topology of the extracted networks for the different values of <italic>LT</italic> by measuring a set of network metrics. Furthermore, for each generated network, we identify the communities using the <italic>Louvain</italic> algorithm and evaluate their quality.</p>
<sec>
<title>4.1. Data Processing</title>
<p>We examine the histogram and cumulative distribution function (CDF) of <italic>f</italic> for each dataset of deliverables from NECTEC. <xref ref-type="fig" rid="F1">Figure 1</xref> shows the frequency and distribution of the obtained <italic>f</italic> after normalization. The averages (&#x003BC;) of <italic>f</italic> for <italic>PAPER, PROTOTYPE</italic>, and <italic>IP</italic> are 0.24 (standard deviation &#x003C3; &#x0003D; 0.16), 0.18 (&#x003C3; &#x0003D; 0.12), and 0.3 (&#x003C3; &#x0003D; 0.21), respectively. The figure also shows that the majority of collaborators have a relatively low number of contributions. Nevertheless, a small number of collaborators collaborate strongly in various projects.</p>
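<p>For readers who want to reproduce this kind of summary, the sketch below computes the mean, the standard deviation, and an empirical CDF for a list of normalized scores. The function name and the sample values are hypothetical, and the use of the population standard deviation is an assumption.</p>
<preformat><![CDATA[
```python
from bisect import bisect_right
from statistics import mean, pstdev


def empirical_cdf(scores):
    """Return a function x -> fraction of scores that are less than or equal to x."""
    ordered = sorted(scores)
    n = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / n

    return cdf


# Hypothetical normalized collaboration scores.
scores = [0.1, 0.2, 0.2, 0.6, 1.0]
cdf = empirical_cdf(scores)
print(mean(scores), pstdev(scores))
print(cdf(0.2), cdf(1.0))
```
]]></preformat>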
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>The histogram and cumulative distribution function (CDF) of generated collaboration score (<italic>f</italic>).</p></caption>
<graphic xlink:href="fdata-02-00022-g0001.tif"/>
</fig></sec>
<sec>
<title>4.2. Topological Analysis</title>
<p>We analyze the topology and structure of the networks extracted from each dataset by calculating a set of network metrics: degree, network density, transitivity, clustering coefficient, betweenness centrality, and closeness centrality. <xref ref-type="fig" rid="F2">Figure 2</xref> shows the evolution of these metrics over a set of 41 networks while increasing <italic>LT</italic> from 0 to 1 in steps of 0.025.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Topological analysis of a set of 41 produced networks from each dataset while increasing <italic>LT</italic> from 0 to 1 by 0.025.</p></caption>
<graphic xlink:href="fdata-02-00022-g0002.tif"/>
</fig>
<p>The degree of a node in collaboration networks represents the number of direct collaborations of each individual. The average node degree of the networks obtained from <italic>PAPER</italic> is 6.59, from <italic>PROTOTYPE</italic> is 11.46, and from <italic>IP</italic> is 5.71, which indicates that, on average, teams in <italic>PROTOTYPE</italic> had significantly more collaborations than the others. As illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>, the degree of the extracted networks does not change significantly. The reason is that, beyond a certain value of <italic>LT</italic>, the number of new links added to the network no longer grows significantly while the number of nodes stays constant. A similar scenario occurs when measuring network density. The network density is the ratio of existing links to the number of all possible links in a network, such that a density close to 0 identifies a sparse network while a density equal to 1 identifies a complete network. With <italic>LT</italic> close to zero, the network mostly consists of isolated nodes, which explains why the network density is close to zero in all three datasets. Eventually, the density of the network increases slowly and then remains steady. This is due to the high number of nodes compared to the number of collaborations between them. It indicates that in real-world collaboration networks each collaborator may only collaborate with a small number of others; hence, the networks are rather sparse.</p>
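<p>The two quantities discussed above can be written down in a few lines. The sketch below assumes an undirected network without self-loops or duplicate links, given as a node count and a list of links; the function names and the toy network are hypothetical.</p>
<preformat><![CDATA[
```python
def average_degree(num_nodes, links):
    """Mean number of links per node; every link contributes to two degrees."""
    if num_nodes == 0:
        raise ValueError("the network has no nodes")
    return 2 * len(links) / num_nodes


def density(num_nodes, links):
    """Ratio of existing links to all possible links in an undirected network."""
    if num_nodes < 2:
        return 0.0
    possible = num_nodes * (num_nodes - 1) / 2
    return len(links) / possible


# Hypothetical network: four researchers, one of them isolated.
links = [("a", "b"), ("b", "c"), ("a", "c")]
print(average_degree(4, links))
print(density(4, links))
```
]]></preformat>
<p>Because the number of possible links grows quadratically with the number of nodes, a network with many nodes and a bounded number of collaborations per node necessarily has a density near zero.</p>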
<p>In order to gain insight into the complexity of collaborations in each dataset, we calculate the transitivity and clustering coefficient of the networks. Transitivity refers to the extent to which the relation between linked nodes is transitive, i.e., if <italic>i</italic> collaborates with <italic>j</italic> and <italic>j</italic> collaborates with <italic>k</italic>, then <italic>i</italic> also collaborates with <italic>k</italic>. Thus, it represents the symmetry of collaborations in our networks, which appears as triangles of collaborations. <xref ref-type="fig" rid="F2">Figure 2</xref> shows fluctuations for networks constructed with lower <italic>LT</italic>; however, the transitivity quickly approaches a consistent value.</p>
<p>On the other hand, the clustering coefficient describes the likelihood that nodes in a network tend to cluster together (Watts and Strogatz, <xref ref-type="bibr" rid="B21">1998</xref>). The average clustering coefficient of the produced networks is 0.44 for <italic>PAPER</italic>, 0.61 for <italic>PROTOTYPE</italic>, and 0.45 for <italic>IP</italic>. For a relatively high <italic>LT</italic>, the clustering coefficient approaches approximately 0.7. A possible explanation is that contributions by at least three people happen often in scientific collaboration teams (Newman et al., <xref ref-type="bibr" rid="B16">2001</xref>). Therefore, every collaboration that has three or more co&#x02013;authors increases the clustering coefficient significantly.</p>
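<p>To make the difference between the two measures concrete, the sketch below computes the global transitivity (closed triples divided by all connected triples) and the average of the local clustering coefficients for a small undirected network. The function names and the toy network are hypothetical, and nodes with fewer than two neighbors are assumed to have a local coefficient of 0.</p>
<preformat><![CDATA[
```python
from itertools import combinations


def adjacency(nodes, links):
    """Neighbor sets of an undirected network."""
    adj = {v: set() for v in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def closed_pairs(adj, v):
    """Number of neighbor pairs of v that are themselves linked."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])


def local_clustering(adj, v):
    k = len(adj[v])
    if k < 2:
        return 0.0  # assumption: undefined cases count as 0
    return 2 * closed_pairs(adj, v) / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def transitivity(adj):
    """Closed triples over all connected triples (3 * triangles / triples)."""
    closed = sum(closed_pairs(adj, v) for v in adj)
    triples = sum(len(adj[v]) * (len(adj[v]) - 1) // 2 for v in adj)
    return closed / triples if triples else 0.0


# Hypothetical network: a three-author project (triangle) plus one extra co-author of c.
adj = adjacency("abcd", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
print(average_clustering(adj))
print(transitivity(adj))
```
]]></preformat>
<p>In this toy case a single project with three co&#x02013;authors already contributes a full triangle, which is consistent with the explanation given above.</p>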
<p>Centrality measures indicate the importance of nodes in the network. We measure betweenness centrality and closeness centrality to analyze the datasets. For a node, the betweenness is defined as the total number of shortest paths between every pair of individuals in the network that pass through the node (Brandes, <xref ref-type="bibr" rid="B4">2001</xref>). In other words, it highlights collaborators who act as a bridge between different groups in a network.</p>
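<p>A compact way to compute betweenness on an unweighted, undirected network is the accumulation scheme of Brandes cited above. The sketch below is a minimal, unnormalized version; the function name, the adjacency layout (a dictionary of neighbor sets), and the toy network are hypothetical. When several shortest paths connect a pair, each path contributes a fractional share.</p>
<preformat><![CDATA[
```python
from collections import deque


def betweenness(adj):
    """Unnormalized betweenness for an unweighted, undirected network.

    adj maps every node to the set of its neighbors.
    """
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                      # nodes in order of non-decreasing distance
        preds = {v: [] for v in adj}    # predecessors on shortest paths from source
        sigma = {v: 0 for v in adj}     # number of shortest paths from source
        dist = {v: -1 for v in adj}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] == -1:       # first visit
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to the source.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted once from either endpoint.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical network: b bridges a and c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(betweenness(adj))
```
]]></preformat>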
<p>Moreover, closeness centrality defines the closeness of a node to other nodes by measuring the average shortest path from that node to all other nodes within the network. Hence, the more central a node is, the closer it is to all other nodes (Sabidussi, <xref ref-type="bibr" rid="B18">1966</xref>). All three datasets reach the highest closeness centrality after a certain threshold. However, each dataset shows a considerably different growth pattern: <italic>IP</italic> grows linearly from one network to the next, whereas <italic>PROTOTYPE</italic> and <italic>PAPER</italic> grow exponentially.</p></sec>
<sec>
<title>4.3. Community Detection Analysis</title>
<p>We apply the <italic>Louvain</italic> community detection algorithm to evaluate the effect of <italic>LT</italic> on the <italic>collaboration score</italic>. We extract the communities of each network and measure the modularity and the number of clusters. The modularity of communities measures the strength of the connections between nodes inside the same community compared to a random graph (with the same size and average degree). The higher the modularity, the closer the network is to a well-shaped community structure.</p>
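<p>As an illustration of the quantity being measured, the sketch below evaluates the modularity of a given partition of an unweighted, undirected network. It assumes the commonly used Newman-Girvan form, in which each community contributes its fraction of internal links minus the squared fraction of link ends attached to it. The function name, the input layout, and the toy network are hypothetical, and the sketch only scores a partition; it does not perform the <italic>Louvain</italic> optimization itself.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


def modularity(links, community_of):
    """Modularity of a partition of an unweighted, undirected network.

    links: list of (u, v) pairs; community_of: dict mapping node -> community label.
    """
    m = len(links)
    if m == 0:
        raise ValueError("modularity is undefined for a network without links")
    internal = defaultdict(int)    # links with both ends in the community
    degree_sum = defaultdict(int)  # link ends attached to the community
    for u, v in links:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical network: two triangles joined by a single bridge link c-d.
links = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {v: 0 for v in "abcdef"}
print(modularity(links, two_groups))
print(modularity(links, one_group))
```
]]></preformat>
<p>Splitting the toy network at the bridge yields a clearly positive score, whereas placing all nodes in one community yields 0, which matches the interpretation that higher modularity indicates a better-shaped community structure.</p>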
<p><xref ref-type="fig" rid="F3">Figure 3</xref> shows the average results of 200 experiments on each dataset, including error bars. The figure shows that the modularity of all three datasets converges to a relatively high score of approximately 0.7 after a certain <italic>LT</italic>. This indicates that the produced collaboration networks have a well&#x02013;defined community structure compared to a random network of the same size. As illustrated in this figure, increasing <italic>LT</italic> does not affect the modularity after a particular point. For lower <italic>LT</italic> (&#x0003C;0.4), as also shown in <xref ref-type="fig" rid="F2">Figure 2</xref>, the networks have a considerably lower density and are thus sparse. However, the score increases exponentially and becomes steady for all three datasets for <italic>LT</italic>&#x0003E;0.4. On the other hand, increasing <italic>LT</italic> decreases the number of communities considerably. When networks are sparse (i.e., <italic>LT</italic> &#x02264; 0.2), the number of communities is almost equal to the number of nodes.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Community detection analysis after applying the <italic>Louvain</italic> algorithm to networks produced with different <italic>LT</italic> values. The community modularity score and the number of clusters are the average of 200 experiments for 41 data points. The error bars are not visible because the standard error is very small.</p></caption>
<graphic xlink:href="fdata-02-00022-g0003.tif"/>
</fig>
<p>Moreover, as illustrated in <xref ref-type="fig" rid="F3">Figure 3</xref>, the modularity score increases significantly even for low values of <italic>LT</italic> and reaches its highest value before it decreases and becomes steady. The number of communities, on the other hand, decreases exponentially, so that a network obtained with <italic>LT</italic> &#x0003C;0.2 has an extremely high number of communities. In the particular case of <italic>PROTOTYPE</italic>, the modularity increases and becomes steady with <italic>LT</italic>&#x0003E;0.4, and similarly the number of communities becomes constant (&#x0003D; 22) with <italic>LT</italic>&#x0003E;0.5. Furthermore, considering the growth of the metrics for <italic>PROTOTYPE</italic> in <xref ref-type="fig" rid="F2">Figure 2</xref>, all metrics are constant with <italic>LT</italic>&#x0003E;0.4.</p></sec></sec>
<sec id="s5">
<title>5. Discussion and Conclusion</title>
<p>The approach outlined in this paper infers collaboration networks of researchers within projects of an organization. Our method uses the features describing the collaborations of a research institute and quantifies them by applying a proposed <italic>collaboration score</italic> function.</p>
<p>Our results show that the quality of the communities detected in the extracted collaboration networks can differ significantly depending on the choice of the linkage threshold. It turns out that a greedy increase of links and connections can lead to a noisy network structure in which the <italic>identity</italic> of nodes could be affected by a large number of superfluous connections. Consequently, our future work has to focus on understanding a network's preference toward a rich network while avoiding a noisy structure (Newman, <xref ref-type="bibr" rid="B13">2018</xref>). Moreover, our experiments on the execution time of community detection indicate that increasing <italic>LT</italic> impacts the execution time of the algorithm. Hence, one option is to generate the network with the lowest threshold at which the modularity of the communities is still at its highest possible value.</p>
<p>In this study, we use a set of network metrics and the modularity score to evaluate the communities of the obtained networks. As future work, we plan to advance our collaboration score model for network construction from relational data. Moreover, we intend to identify the optimum <italic>LT</italic> in order to recognize high-quality communities within the obtained networks.</p></sec>
<sec id="s6">
<title>Data Availability</title>
<p>The datasets generated for this study are available on request to the corresponding author.</p></sec>
<sec id="s7">
<title>Author Contributions</title>
<p>SD developed the method and performed the computations and measurements. MB and PB were involved in planning and supervised the work. AP provided the datasets. MB, GD, and AP provided critical feedback.</p>
<sec>
<title>Conflict of Interest Statement</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p></sec></sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akbas</surname> <given-names>M. I.</given-names></name> <name><surname>Brust</surname> <given-names>M. R.</given-names></name> <name><surname>Turgut</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Social network generation and role determination based on smartphone data</article-title>. <source>arXiv preprint</source> arXiv:1305.4133.</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bedi</surname> <given-names>P.</given-names></name> <name><surname>Sharma</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Community detection in social networks</article-title>. <source>Wiley Interdiscip. Rev.</source> <volume>6</volume>, <fpage>115</fpage>&#x02013;<lpage>135</lpage>. <pub-id pub-id-type="doi">10.1002/widm.1178</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blondel</surname> <given-names>V. D.</given-names></name> <name><surname>Guillaume</surname> <given-names>J.-L.</given-names></name> <name><surname>Lambiotte</surname> <given-names>R.</given-names></name> <name><surname>Lefebvre</surname> <given-names>E.</given-names></name></person-group> (<year>2008</year>). <article-title>Fast unfolding of communities in large networks</article-title>. <source>J. Stat. Mech.</source> <volume>2008</volume>:<fpage>P10008</fpage>. <pub-id pub-id-type="doi">10.1088/1742-5468/2008/10/P10008</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brandes</surname> <given-names>U.</given-names></name></person-group> (<year>2001</year>). <article-title>A faster algorithm for betweenness centrality</article-title>. <source>J. Math. Sociol.</source> <volume>25</volume>, <fpage>163</fpage>&#x02013;<lpage>177</lpage>. <pub-id pub-id-type="doi">10.1080/0022250X.2001.9990249</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Butts</surname> <given-names>C. T.</given-names></name></person-group> (<year>2009</year>). <article-title>Revisiting the foundations of network analysis</article-title>. <source>Science</source> <volume>325</volume>, <fpage>414</fpage>&#x02013;<lpage>416</lpage>. <pub-id pub-id-type="doi">10.1126/science.1171022</pub-id><pub-id pub-id-type="pmid">19628855</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chakraborty</surname> <given-names>T.</given-names></name> <name><surname>Ganguly</surname> <given-names>N.</given-names></name> <name><surname>Mukherjee</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>An author is known by the context she keeps: significance of network motifs in scientific collaborations</article-title>. <source>Soc. Netw. Anal. Min.</source> <volume>5</volume>:<fpage>16</fpage>. <pub-id pub-id-type="doi">10.1007/s13278-015-0255-3</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dilmaghani</surname> <given-names>S. E.</given-names></name> <name><surname>Piyatumrong</surname> <given-names>A.</given-names></name> <name><surname>Bouvry</surname> <given-names>P.</given-names></name> <name><surname>Brust</surname> <given-names>M. R.</given-names></name></person-group> (<year>2019</year>). <article-title>Transforming collaboration data into network layers for enhanced analytics</article-title>. <source>arXiv preprint</source> arXiv:1902.09364.</citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Faust</surname> <given-names>K.</given-names></name></person-group> (<year>2007</year>). <article-title>7. very local structure in social networks</article-title>. <source>Sociol. Methodol.</source> <volume>37</volume>, <fpage>209</fpage>&#x02013;<lpage>256</lpage>. <pub-id pub-id-type="doi">10.1111/j.1467-9531.2007.00179.x</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>J.</given-names></name> <name><surname>Zhuang</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Giles</surname> <given-names>C. L.</given-names></name></person-group> (<year>2008</year>). <article-title>Collaboration over time: characterizing and modeling network evolution</article-title>, in <source>Proceedings of the International Conference on Web Search and Data Mining</source> (<publisher-loc>Palo Alto, CA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>107</fpage>&#x02013;<lpage>116</lpage>.</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jamali</surname> <given-names>M.</given-names></name> <name><surname>Abolhassani</surname> <given-names>H.</given-names></name></person-group> (<year>2006</year>). <article-title>Different aspects of social network analysis</article-title>, in <source>2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI&#x00027;06)</source> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>66</fpage>&#x02013;<lpage>72</lpage>.</citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kibanov</surname> <given-names>M.</given-names></name> <name><surname>Atzmueller</surname> <given-names>M.</given-names></name> <name><surname>Scholz</surname> <given-names>C.</given-names></name> <name><surname>Stumme</surname> <given-names>G.</given-names></name></person-group> (<year>2013</year>). <article-title>On the evolution of contacts and communities in networks of face-to-face proximity</article-title>, in <source>2013 IEEE International Conference on Green Computing and Communications and IEEE Internet of Things and IEEE Cyber, Physical and Social Computing</source> (<publisher-loc>IEEE</publisher-loc>), <fpage>993</fpage>&#x02013;<lpage>1000</lpage>.</citation></ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>J. C.</given-names></name> <name><surname>Cunningham</surname> <given-names>F. C.</given-names></name> <name><surname>Carswell</surname> <given-names>P.</given-names></name> <name><surname>Braithwaite</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Patterns of collaboration in complex networks</article-title>. <source>BMC Health Services Res.</source> <volume>14</volume>:<fpage>225</fpage>. <pub-id pub-id-type="doi">10.1186/1472-6963-14-225</pub-id><pub-id pub-id-type="pmid">24885971</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Newman</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Network structure from rich but noisy data</article-title>. <source>Nat. Phys.</source> <volume>14</volume>:<fpage>542</fpage>. <pub-id pub-id-type="doi">10.1038/s41567-018-0076-1</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Newman</surname> <given-names>M. E.</given-names></name></person-group> (<year>2001a</year>). <article-title>Scientific collaboration networks. I. Network construction and fundamental results</article-title>. <source>Phys. Rev. E</source> <volume>64</volume>:<fpage>016131</fpage>. <pub-id pub-id-type="doi">10.1103/PhysRevE.64.016131</pub-id><pub-id pub-id-type="pmid">11461355</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Newman</surname> <given-names>M. E.</given-names></name></person-group> (<year>2001b</year>). <article-title>Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality</article-title>. <source>Phys. Rev. E</source> <volume>64</volume>:<fpage>016132</fpage>. <pub-id pub-id-type="doi">10.1103/PhysRevE.64.016132</pub-id><pub-id pub-id-type="pmid">11461356</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Newman</surname> <given-names>M. E.</given-names></name> <name><surname>Strogatz</surname> <given-names>S. H.</given-names></name> <name><surname>Watts</surname> <given-names>D. J.</given-names></name></person-group> (<year>2001</year>). <article-title>Random graphs with arbitrary degree distributions and their applications</article-title>. <source>Phys. Rev. E</source> <volume>64</volume>:<fpage>026118</fpage>. <pub-id pub-id-type="doi">10.1103/PhysRevE.64.026118</pub-id><pub-id pub-id-type="pmid">11497662</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>G.</given-names></name> <name><surname>Zhang</surname> <given-names>W.</given-names></name> <name><surname>Wu</surname> <given-names>Z.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <article-title>Online community detection for large complex networks</article-title>. <source>PLoS ONE</source> <volume>9</volume>:<fpage>e102799</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0102799</pub-id><pub-id pub-id-type="pmid">25061683</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sabidussi</surname> <given-names>G.</given-names></name></person-group> (<year>1966</year>). <article-title>The centrality index of a graph</article-title>. <source>Psychometrika</source> <volume>31</volume>, <fpage>581</fpage>&#x02013;<lpage>603</lpage>. <pub-id pub-id-type="doi">10.1007/BF02289527</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Scholtes</surname> <given-names>I.</given-names></name></person-group> (<year>2017</year>). <article-title>When is a network a network? Multi-order graphical model selection in pathways and temporal networks</article-title>, in <source>Proceedings of the ACM SIGKDD</source> (<publisher-loc>ACM</publisher-loc>), <fpage>1037</fpage>&#x02013;<lpage>1046</lpage>. <pub-id pub-id-type="doi">10.1145/3097983.3098145</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sharma</surname> <given-names>A.</given-names></name> <name><surname>Bhavani</surname> <given-names>S. D.</given-names></name></person-group> (<year>2019</year>). <article-title>A network formation model for collaboration networks</article-title>, in <source>International Conference on Distributed Computing and Internet Technology</source> (<publisher-loc>Bhubaneswar</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>279</fpage>&#x02013;<lpage>294</lpage>.</citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watts</surname> <given-names>D. J.</given-names></name> <name><surname>Strogatz</surname> <given-names>S. H.</given-names></name></person-group> (<year>1998</year>). <article-title>Collective dynamics of &#x02018;small-world&#x00027; networks</article-title>. <source>Nature</source> <volume>393</volume>:<fpage>440</fpage>. <pub-id pub-id-type="doi">10.1038/30918</pub-id><pub-id pub-id-type="pmid">9623998</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>X.</given-names></name> <name><surname>Ghahramani</surname> <given-names>Z.</given-names></name></person-group> (<year>2002</year>). <source>Learning From Labeled and Unlabeled Data With Label Propagation</source>. <publisher-name>Technical report</publisher-name>, <publisher-loc>Citeseer</publisher-loc>.</citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>National Electronics and Computer Technology Center (NECTEC) (<ext-link ext-link-type="uri" xlink:href="https://www.nectec.or.th/en/">https://www.nectec.or.th/en/</ext-link>).</p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work is partially funded by the research programme UL/SnT-ILNAS on Digital Trust for Smart-ICT.</p>
</fn>
</fn-group>
</back>
</article>