<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Big Data</journal-id>
<journal-title>Frontiers in Big Data</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Big Data</abbrev-journal-title>
<issn pub-type="epub">2624-909X</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fdata.2019.00027</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Big Data</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Simultaneous Parameter Learning and Bi-clustering for Multi-Response Models</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Yu</surname> <given-names>Ming</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/686742/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Natesan Ramamurthy</surname> <given-names>Karthikeyan</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/565845/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Thompson</surname> <given-names>Addie</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/740396/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Lozano</surname> <given-names>Aur&#x000E9;lie C.</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/566287/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Booth School of Business, The University of Chicago</institution>, <addr-line>Chicago, IL</addr-line>, <country>United States</country></aff>
<aff id="aff2"><sup>2</sup><institution>IBM Research</institution>, <addr-line>Yorktown Heights, NY</addr-line>, <country>United States</country></aff>
<aff id="aff3"><sup>3</sup><institution>Department of Plant, Soil and Microbial Sciences, Michigan State University</institution>, <addr-line>East Lansing, MI</addr-line>, <country>United States</country></aff>
<aff id="aff4"><sup>4</sup><institution>Department of Agronomy, Purdue University</institution>, <addr-line>West Lafayette, IN</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Ranga Raju Vatsavai, North Carolina State University, United States</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Lidan Shou, Zhejiang University, China; Chao Lan, University of Wyoming, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Aur&#x000E9;lie C. Lozano <email>aclozano&#x00040;us.ibm.com</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data</p></fn></author-notes>
<pub-date pub-type="epub">
<day>14</day>
<month>08</month>
<year>2019</year>
</pub-date>
<pub-date pub-type="collection">
<year>2019</year>
</pub-date>
<volume>2</volume>
<elocation-id>27</elocation-id>
<history>
<date date-type="received">
<day>05</day>
<month>04</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>07</month>
<year>2019</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2019 Yu, Natesan Ramamurthy, Thompson and Lozano.</copyright-statement>
<copyright-year>2019</copyright-year>
<copyright-holder>Yu, Natesan Ramamurthy, Thompson and Lozano</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>We consider multi-response and multi-task regression models, where the parameter matrix to be estimated is expected to have an unknown grouping structure. The groupings can be along tasks, or features, or both, the last indicating a bi-cluster or &#x0201C;checkerboard&#x0201D; structure. Discovering this grouping structure along with parameter inference makes sense in several applications, such as multi-response Genome-Wide Association Studies (GWAS). By inferring this additional structure we can obtain valuable information on the underlying data mechanisms (e.g., relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations to simultaneously learn the parameter matrix and its group structures, based on convex regularization penalties. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Extensive experiments demonstrate much better clustering quality than competing methods, and our approaches are further validated on real datasets concerning the phenotypes and genotypes of plant varieties.</p></abstract>
<kwd-group>
<kwd>high-throughput phenotyping</kwd>
<kwd>multitask learning</kwd>
<kwd>convex clustering</kwd>
<kwd>bi-clustering</kwd>
<kwd>sparse linear regression</kwd>
<kwd>genome-wide association studies</kwd>
</kwd-group>
<contract-num rid="cn001">DE-AR0000593</contract-num>
<contract-sponsor id="cn001">Advanced Research Projects Agency - Energy<named-content content-type="fundref-id">10.13039/100006133</named-content></contract-sponsor>
<counts>
<fig-count count="9"/>
<table-count count="4"/>
<equation-count count="11"/>
<ref-count count="22"/>
<page-count count="11"/>
<word-count count="7621"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>We consider multi-response and multi-task regression models, which generalize single-response regression to learn predictive relationships between multiple input and multiple output variables, also referred to as tasks (Borchani et al., <xref ref-type="bibr" rid="B1">2015</xref>). The parameters to be estimated form a matrix instead of a vector. In several applications, there exist joint group relationships between inputs and outputs. A motivating example is that of multi-response Genome-Wide Association Studies (GWAS) (Schifano et al., <xref ref-type="bibr" rid="B20">2013</xref>), where for instance a group of Single Nucleotide Polymorphisms or SNPs (input variables or features) might influence a group of phenotypes (output variables or tasks) in a similar way, while having little or no effect on another group of phenotypes. As another example, the stock values of related companies can affect the future values of a group of stocks in a similar way. In such cases, the model parameters belonging to the same input-output group tend to be close to each other, and it is desirable to <italic>uncover</italic> and <italic>exploit</italic> these structures in estimating the parameter matrix. See <xref ref-type="fig" rid="F1">Figure 1</xref> for an example.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Multi-response GWAS: the simultaneous grouping relationship between phenotypic traits and SNPs manifests as a block structure (row &#x0002B; column groups) in the parameter matrix. The row and column groups are special cases of the more general block structure. Our proposed approach infers the parameter matrix as well as the group structures.</p></caption>
<graphic xlink:href="fdata-02-00027-g0001.tif"/>
</fig>
<sec>
<title>1.1. Contributions</title>
<p>In this work, we develop formulations that <italic>simultaneously</italic> learn: (a) the parameters of multi-response/task regression models, and (b) the grouping structure in the parameter matrix (row or column or both) that reflects the group relationship between inputs and outputs. We present optimization approaches to efficiently solve the resulting convex problems, and show their numerical convergence. We describe and justify several hyperparameter choices we make during this optimization. Our methods are validated empirically on synthetic data and on real-world datasets concerning phenotypes and genotypes of plant varieties. From the synthetic data experiments, we find that our methods provide a much better and more stable (i.e., lower standard error) recovery of the underlying group structure. In real-world data experiments, our approaches reveal natural groupings of phenotypes and <italic>checkerboard</italic> patterns of phenotype-SNP groups that inform us of the joint relationship between them.</p>
<p>We emphasize that the parameters as well as the grouping structures are <italic>fully unknown a priori</italic>, and inferring them simultaneously is our major contribution. This is in contrast to the naive approach of estimating the parameters first and then clustering them. The naive approach risks propagating the estimation error into the clustering, particularly in high dimensions, where the estimator is usually inaccurate due to the lack of sufficient samples. Moreover, the clustering step of the naive approach does not use the full information in the data. The joint estimation-clustering procedure we propose naturally promotes sharing of information within groups. Our formulations adopt the convex bi-clustering cost function (Chi et al., <xref ref-type="bibr" rid="B3">2014</xref>) as the regularizer to encourage groupings between columns (tasks) and rows (features) of the parameter matrix. Note that Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>) assume that the data matrix to be bi-clustered is known a priori, which is not the case in our setting. As a result, Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>) can deal only with fixed data and cannot estimate unknown model parameters, while our approaches simultaneously estimate the parameters and discover the clustering structure in them.</p>
<p>To the best of our knowledge, this is the first method that can simultaneously cluster and estimate the parameters efficiently in a unified optimization. <italic>We emphasize that our main goal is to discover the underlying parameter bi-cluster structure without compromising estimation accuracy</italic>. Experiments show that our clusterings are better than those of other methods, while the estimation accuracy is no worse, and sometimes even better.</p>
</sec>
<sec>
<title>1.2. Related Work</title>
<p>The premise in multi-task learning is that appropriate sharing of information can benefit all the tasks (Caruana, <xref ref-type="bibr" rid="B2">1998</xref>; Obozinski et al., <xref ref-type="bibr" rid="B18">2006</xref>; Yu et al., <xref ref-type="bibr" rid="B22">2018</xref>). Assuming all tasks to be closely related can be excessive, as it ignores the underlying specificity of the mappings. There have been several extensions to multi-task learning that address this problem. The authors in Jalali et al. (<xref ref-type="bibr" rid="B10">2010</xref>) propose a <italic>dirty</italic> model for feature sharing among tasks, wherein a linear superposition of two sets of parameters&#x02014;one that is common to all tasks, and one that is task-specific&#x02014;is used. Kim and Xing (<xref ref-type="bibr" rid="B12">2010</xref>) leverage a <italic>predefined</italic> tree structure among the output tasks (e.g., obtained using hierarchical agglomerative clustering) and impose group regularizations on the task parameters based on this tree. The approach proposed in Kumar and Daume (<xref ref-type="bibr" rid="B13">2012</xref>) learns to share by defining a set of <italic>basis task parameters</italic> and posing the task-specific parameters as sparse linear combinations of these. Jacob et al. (<xref ref-type="bibr" rid="B9">2009</xref>) and Kang et al. (<xref ref-type="bibr" rid="B11">2011</xref>) assume that the tasks are clustered into groups and proceed to learn the group structure along with the task parameters using a convex program and an integer quadratic program, respectively. However, these approaches do not consider joint clustering of the features. In addition, the mixed integer program of Kang et al. (<xref ref-type="bibr" rid="B11">2011</xref>) is computationally intensive and greatly limits the maximum number of tasks that can be considered. Another pertinent approach is the Network Lasso formulation presented in Hallac et al. (<xref ref-type="bibr" rid="B5">2015</xref>). This formulation, however, is limited to settings where only clustering among the tasks is needed.</p>
<p>As mentioned before, the convex bi-clustering method (Chi et al., <xref ref-type="bibr" rid="B3">2014</xref>) aims at grouping the observations and features of a known data matrix, whereas our approaches discover groupings in the parameter matrix of multi-response regression models while jointly estimating that matrix; the discovered groupings reflect groupings among features and responses.</p>
</sec>
<sec>
<title>1.3. Roadmap</title>
<p>In section 2, we discuss the proposed joint estimation-clustering formulations; in section 3, we present the optimization approaches. The choice of hyperparameters and their significance are discussed in section 4. We illustrate the solution path for one of the formulations in section 4.3. We provide results for estimation with synthetic data, and two case studies using multi-response GWAS with real data, in sections 5 and 6, respectively. We conclude in section 7. Additional details and convergence proofs for the optimization approaches are provided in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
</sec>
</sec>
<sec id="s2">
<title>2. Problem Statement and Proposed Methods</title>
<p>We now motivate and propose two distinct formulations for simultaneous parameter learning and clustering with general supervised models involving matrix-valued parameters. Our formulations are developed around multi-task regression in this paper. We are interested in accurate parameter estimation as well as in understanding the <italic>bi-cluster</italic> or <italic>checkerboard</italic> structure of the parameter matrix. More formally, denote by <italic>Z</italic> the observed data, &#x00398; the model parameters to be estimated, <italic>L</italic>(<italic>Z</italic>; &#x00398;) a general loss function, and <italic>R</italic>(&#x00398;) the regularizer.</p>
<p>In multi-task regression, <italic>Z</italic> &#x0003D; {<italic>X, Y</italic>} where <inline-formula><mml:math id="M1"><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi><mml:mo>&#x000D7;</mml:mo><mml:mi>p</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are the design matrices and <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula> are the response vectors for each task <italic>s</italic> &#x0003D; 1, &#x02026;, <italic>k</italic>. &#x00398; is a matrix in &#x0211D;<sup><italic>p</italic> &#x000D7; <italic>k</italic></sup> containing the regression coefficients for each task. A popular choice for <italic>L</italic> is the squared loss: <inline-formula><mml:math id="M3"><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Z</mml:mi><mml:mo>;</mml:mo><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mi>Y</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula>. For regularization <italic>R</italic>(&#x00398;), the &#x02113;<sub>1</sub> norm, denoted as ||&#x00398;||<sub>1</sub>, is commonly used. Here we wish to discover the bi-cluster structure among features and responses, respectively the rows and columns of &#x00398;.</p>
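<p>As a concrete sketch, the squared multi-task loss above can be evaluated with a few lines of NumPy (the function name <monospace>multitask_squared_loss</monospace> and the list-of-arrays representation are illustrative choices of ours, not code from this article):</p>

```python
import numpy as np

def multitask_squared_loss(Xs, Ys, Theta):
    """L(Z; Theta) = sum_s ||Y_s - X_s Theta_s||_2^2, with Theta_s = Theta[:, s].

    Xs    : list of k design matrices, each of shape (n, p)
    Ys    : list of k response vectors, each of shape (n,)
    Theta : (p, k) coefficient matrix, one column per task
    """
    return sum(
        float(np.sum((Y - X @ Theta[:, s]) ** 2))
        for s, (X, Y) in enumerate(zip(Xs, Ys))
    )
```

<p>When the same design matrix is shared by all tasks, this reduces to the Frobenius-norm fit term used in the instantiations below.</p>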
<sec>
<title>2.1. Formulation 1: &#x0201C;Hard Fusion&#x0201D;</title>
<p>We begin with the simpler formulation, which, as we shall see, is a special case of the second formulation.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M4"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow></mml:munder></mml:mstyle><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>Z</mml:mi><mml:mo>;</mml:mo><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">[</mml:mo></mml:mstyle><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo class="qopname">&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">]</mml:mo></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here <italic>L</italic>(<italic>Z</italic>; &#x00398;) is the loss function, <italic>R</italic>(&#x00398;) is a regularizer, and <inline-formula><mml:math id="M5"><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> and &#x00398;<sub>&#x000B7;<italic>i</italic></sub> is the <italic>i</italic>th column of &#x00398;. &#x003A9;<sub><italic>W</italic></sub>(&#x00398;) is inspired by the convex bi-clustering objective (Chi et al., <xref ref-type="bibr" rid="B3">2014</xref>, Equation 1) and it encourages sparsity in the differences between columns of &#x00398;. Similarly, <inline-formula><mml:math id="M6"><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> encourages sparsity in the differences between the rows of &#x00398;. 
When the overall objective is optimized, we can expect to see a checkerboard pattern in the model parameter matrix. Note that <italic>W</italic> and <inline-formula><mml:math id="M7"><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula> are non-negative weights that reflect our prior belief on the closeness of the rows and columns of &#x00398;.</p>
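<p>The fusion penalty translates directly into code; below is a minimal NumPy sketch (the function name is ours, and the row term is obtained by passing the transpose together with the row weights):</p>

```python
import numpy as np
from itertools import combinations

def fusion_penalty(Theta, W):
    """Omega_W(Theta) = sum_{i<j} w_ij * ||Theta[:, i] - Theta[:, j]||_2.

    W holds the non-negative pairwise weights w_ij (symmetric, shape (k, k)).
    fusion_penalty(Theta.T, W_tilde) gives the row term Omega_{W~}(Theta^T).
    """
    k = Theta.shape[1]
    return sum(
        W[i, j] * np.linalg.norm(Theta[:, i] - Theta[:, j])
        for i, j in combinations(range(k), 2)
    )
```

<p>A weight <italic>w</italic><sub><italic>ij</italic></sub> of zero simply removes the pull between columns <italic>i</italic> and <italic>j</italic>, encoding the prior belief that they need not be close.</p>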
<p>The degree of <italic>sharing</italic> of parameters, and hence that of bi-clustering, is controlled by the tuning parameter &#x003BB;<sub>2</sub>. When &#x003BB;<sub>2</sub> is small, each element of &#x00398; will be its own bi-cluster. As &#x003BB;<sub>2</sub> increases, more elements of &#x00398; <italic>fuse</italic> together and the number of rectangles in the checkerboard pattern decreases. See <xref ref-type="fig" rid="F2">Figure 2</xref> for the change of the checkerboard structure as &#x003BB;<sub>2</sub> increases. Further, by varying &#x003BB;<sub>2</sub> we obtain a solution path instead of just a point estimate of &#x00398; (see section 4.3). In the rest of the paper, we will use the same design matrix <italic>X</italic> across all tasks for simplicity, without loss of generality.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Evolution of the bi-clustering structure of model coefficient matrix &#x00398; as regularization parameter &#x003BB;<sub>2</sub> increases.</p></caption>
<graphic xlink:href="fdata-02-00027-g0002.tif"/>
</fig>
<p>For sparse multi-task linear regression, we have <italic>L</italic>(<italic>Z</italic>; &#x00398;) &#x0003D; <italic>L</italic>(<italic>X, Y</italic>; &#x00398;) and formulation 1 can be instantiated as,</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow></mml:munder></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:mi>Y</mml:mi><mml:mo>-</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00398;</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">[</mml:mo></mml:mstyle><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo class="qopname">&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">]</mml:mo></mml:mstyle><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here the rows of &#x00398; correspond to the features, i.e., the columns of <italic>X</italic>, and the columns of &#x00398; correspond to the tasks, i.e., the columns of <italic>Y</italic>. Therefore, the checkerboard pattern in &#x00398; provides us insights on the groups of features that go together with the groups of tasks.</p>
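<p>Putting the terms together, the objective of Equation (2) can be evaluated as below (a sketch with illustrative names; any convex solver could then minimize this quantity over &#x00398;):</p>

```python
import numpy as np

def hard_fusion_objective(X, Y, Theta, lam1, lam2, W, W_tilde):
    """Value of Equation (2): squared fit + l1 sparsity + bi-cluster fusion."""
    def omega(M, Wts):  # sum_{i<j} w_ij * ||M[:, i] - M[:, j]||_2
        k = M.shape[1]
        return sum(Wts[i, j] * np.linalg.norm(M[:, i] - M[:, j])
                   for i in range(k) for j in range(i + 1, k))

    fit = np.linalg.norm(Y - X @ Theta, 'fro') ** 2
    sparsity = lam1 * np.abs(Theta).sum()          # sum_i ||Theta[:, i]||_1
    fusion = lam2 * (omega(Theta, W) + omega(Theta.T, W_tilde))
    return fit + sparsity + fusion
```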
</sec>
<sec>
<title>2.2. Formulation 2: &#x0201C;Soft Fusion&#x0201D;</title>
<p>Formulation 1 is natural and simple, but it forces the parameters belonging to the same row or column cluster to be exactly equal, which may be limiting. To relax this requirement, we introduce a surrogate parameter matrix &#x00393; that is used for bi-clustering and is mandated to remain close to &#x00398;. For multitask regression this yields the objective</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x00398;</mml:mo><mml:mo>,</mml:mo><mml:mo>&#x00393;</mml:mo></mml:mrow></mml:munder></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:mi>Y</mml:mi><mml:mo>-</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00398;</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p><bold>Remark 1</bold>. To interpret this more carefully, let us assume that <inline-formula><mml:math id="M11"><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> in Equation (3). In other words, &#x00398;<sub>&#x000B7;<italic>i</italic></sub> has a global component <inline-formula><mml:math id="M12"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula>, and the component &#x00393;<sub>&#x000B7;<italic>i</italic></sub> that participates in the clustering. As &#x003BB;<sub>2</sub> &#x02192; &#x0221E;, <inline-formula><mml:math id="M13"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>&#x02192;</mml:mo><mml:mn>0</mml:mn></mml:math></inline-formula>, and hence &#x00398;<sub>&#x000B7;<italic>i</italic></sub> &#x02192; &#x00393;<sub>&#x000B7;<italic>i</italic></sub>. Now, formulation 2 reduces to formulation 1. 
Further, if &#x003BB;<sub>1</sub> and &#x003BB;<sub>2</sub> are held constant while only &#x003BB;<sub>3</sub> increases, <inline-formula><mml:math id="M14"><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02192;</mml:mo><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>&#x0002B;</mml:mo><mml:mo>&#x00393;</mml:mo></mml:math></inline-formula>, since &#x00393;<sub>&#x000B7;<italic>i</italic></sub> &#x02192; &#x00393; for all <italic>i</italic>. The key difference between formulations 2 and 1 is the presence of a task-specific global component <inline-formula><mml:math id="M15"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula>, which lends additional flexibility in modeling the individual tasks even when &#x003BB;<sub>3</sub> &#x02192; &#x0221E;. In contrast, in Equation (2), when &#x003BB;<sub>2</sub> &#x02192; &#x0221E;, all the components of &#x00398;<sub>&#x000B7;<italic>i</italic></sub> take the same value for all <italic>i</italic>, and the tasks are forced to share the same coefficients without any flexibility.</p>
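<p>For concreteness, the soft-fusion objective of Equation (3) can be sketched as follows (illustrative names of ours; relative to formulation 1, the changes are the &#x003BB;<sub>2</sub> coupling term and the fusion penalties being applied to the surrogate &#x00393; rather than to &#x00398;):</p>

```python
import numpy as np

def soft_fusion_objective(X, Y, Theta, Gamma, lam1, lam2, lam3, W, W_tilde):
    """Value of Equation (3): fit + l1 + Theta/Gamma coupling + fusion on Gamma."""
    def omega(M, Wts):  # sum_{i<j} w_ij * ||M[:, i] - M[:, j]||_2
        k = M.shape[1]
        return sum(Wts[i, j] * np.linalg.norm(M[:, i] - M[:, j])
                   for i in range(k) for j in range(i + 1, k))

    fit = np.linalg.norm(Y - X @ Theta, 'fro') ** 2
    sparsity = lam1 * np.abs(Theta).sum()
    coupling = lam2 * np.sum((Theta - Gamma) ** 2)  # sum_i ||Theta_.i - Gamma_.i||_2^2
    fusion = lam3 * (omega(Gamma, W) + omega(Gamma.T, W_tilde))
    return fit + sparsity + coupling + fusion
```

<p>Setting &#x00393; = &#x00398; zeroes the coupling term and recovers the value of the hard-fusion objective with &#x003BB;<sub>3</sub> playing the role of &#x003BB;<sub>2</sub> in Equation (2).</p>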
<p><bold>Remark 2</bold>. In certain applications, it might make sense to cluster together features/tasks whose effects have the same amplitude but different signs. This can be accommodated by considering <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x0003C;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:munder><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>c</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula> where <italic>c</italic><sub><italic>i,j</italic></sub> &#x02208; {&#x02212;1, 1} are predefined constants reflecting whether the features or tasks are expected to be negatively or positively correlated.</p>
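<p>As a concrete sketch of the signed variant in Remark 2 (hypothetical helper and variable names, not taken from the authors' code), the penalty can be evaluated directly:</p>

```python
import numpy as np

def signed_cluster_penalty(Theta, W, C):
    """Compute sum_{i<j} w_ij * ||Theta[:, i] - c_ij * Theta[:, j]||_2.

    Theta : (p, k) coefficient matrix
    W     : (k, k) nonnegative similarity weights
    C     : (k, k) predefined signs in {-1, +1} encoding whether columns
            are expected to be negatively or positively correlated
    """
    k = Theta.shape[1]
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            total += W[i, j] * np.linalg.norm(Theta[:, i] - C[i, j] * Theta[:, j])
    return total
```

<p>With c<sub>ij</sub> &#x0003D; &#x02212;1, two columns of equal amplitude but opposite sign incur zero penalty and are thus encouraged to merge.</p>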
</sec>
</sec>
<sec id="s3">
<title>3. Optimization Algorithms for the Proposed Formulations</title>
<p>We describe the optimization procedures to solve the two proposed formulations. Note that as long as the loss function <italic>L</italic>(<italic>X, Y</italic>; &#x00398;) and the regularization <italic>R</italic>(&#x00398;) are convex, our formulations are also convex in &#x00398; and &#x00393;, and hence can be solved using modern convex optimization approaches. Here we adopt two computationally efficient approaches.</p>
<sec>
<title>3.1. Optimization for Formulation 1</title>
<p>For our formulation 1 we use the proximal decomposition method introduced in Combettes and Pesquet (<xref ref-type="bibr" rid="B4">2008</xref>). This is an efficient algorithm for minimizing the sum of several convex functions. Our general objective function (1) involves three such functions: <italic>f</italic><sub>1</sub> being <italic>L</italic>(<italic>X, Y</italic>; &#x00398;), <italic>f</italic><sub>2</sub> being <italic>R</italic>(&#x00398;), and <italic>f</italic><sub>3</sub> being the term that multiplies &#x003BB;<sub>2</sub>. At a high level, the algorithm iteratively applies proximal updates with respect to these functions until convergence.</p>
<p>We stack the regression matrix &#x00398; into a column vector <inline-formula><mml:math id="M17"><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>;</mml:mo><mml:mo>&#x02026;</mml:mo><mml:mo>;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02208;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x0211D;</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msup></mml:math></inline-formula>. The proximal operator is given by: <inline-formula><mml:math id="M18"><mml:msub><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">prox</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>f</mml:mi></mml:mrow></mml:msub><mml:mi>b</mml:mi><mml:mo>=</mml:mo><mml:munder><mml:mrow><mml:mstyle class="text"><mml:mtext class="textrm" mathvariant="normal">argmin</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:munder><mml:mo>(</mml:mo><mml:mi>f</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mo>&#x02016;</mml:mo><mml:mi>b</mml:mi><mml:mo>-</mml:mo><mml:mi>a</mml:mi><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mstyle mathsize="1.19em"><mml:mo>)</mml:mo></mml:mstyle></mml:math></inline-formula>, where <italic>a</italic> and <italic>b</italic> are <italic>pk</italic>-dimensional vectors. The proximal operator of the regularized loss can be computed according to the specific <italic>L</italic> and <italic>R</italic> functions. 
The overall optimization procedure is given in <bold>Algorithm 1</bold>, with the following update rules.</p>
<list list-type="bullet">
<list-item><p>Update for <inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x02016;</mml:mo><mml:mi>Y</mml:mi><mml:mo>-</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00398;</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:math></inline-formula>: Let (<italic>a</italic><sub>1</sub>; &#x02026;; <italic>a</italic><sub><italic>k</italic></sub>) &#x0003D; prox<sub>&#x003C3;<italic>f</italic><sub>1</sub></sub>(<italic>b</italic><sub>1</sub>; &#x02026;; <italic>b</italic><sub><italic>k</italic></sub>). For each <italic>s</italic> &#x02208; {1, &#x02026;, <italic>k</italic>}, we have
<disp-formula id="E4"><mml:math id="M20"><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:msubsup><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003C3;</mml:mi><mml:msubsup><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:msub><mml:mrow><mml:mi>b</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:math></disp-formula>
This step corresponds to the closed-form formula of a ridge regression problem. For very large <italic>p</italic> we can employ efficient approaches, such as Lu and Foster (<xref ref-type="bibr" rid="B14">2014</xref>) and McWilliams et al. (<xref ref-type="bibr" rid="B15">2014</xref>).</p></list-item>
<list-item><p>Update for <inline-formula><mml:math id="M21"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:math></inline-formula>: Let (<italic>a</italic><sub>1</sub>; &#x02026;; <italic>a</italic><sub><italic>k</italic></sub>) &#x0003D; prox<sub>&#x003C3;<italic>f</italic><sub>2</sub></sub>(<italic>b</italic><sub>1</sub>; &#x02026;; <italic>b</italic><sub><italic>k</italic></sub>). For each <italic>s</italic> &#x02208; {1, &#x02026;, <italic>k</italic>}, <italic>j</italic> &#x02208; {1, &#x02026;, <italic>p</italic>},
<disp-formula id="E6"><mml:math id="M22"><mml:mrow><mml:msub><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x02212;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>&#x003BB;</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>&#x0007C;</mml:mo></mml:mrow></mml:mfrac></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow><mml:mo>+</mml:mo></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:msub><mml:mi>b</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mi>j</mml:mi></mml:msub><mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p></list-item>
<list-item><p>Updates for <inline-formula><mml:math id="M23"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mstyle mathsize="1.61em"><mml:mo stretchy="false">[</mml:mo></mml:mstyle><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle mathsize="1.61em"><mml:mo stretchy="false">]</mml:mo></mml:mstyle></mml:math></inline-formula>: This is the standard bi-clustering problem on &#x00398; and can be solved efficiently using the COnvex BiclusteRing Algorithm (COBRA) introduced in Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>), and described in Algorithm 3 (<xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>) for completeness.</p></list-item>
</list>
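<p>Assuming a least-squares loss, the first two proximal updates above can be sketched as follows (a minimal illustration with hypothetical names and a dense solve; not the authors' implementation):</p>

```python
import numpy as np

def prox_ridge(X, y, b, sigma):
    """Prox of sigma * ||y - X a||_2^2 at b: the per-task ridge update
    a_s = (sigma X^T X + I/2)^{-1} (sigma X^T y + b/2)."""
    p = X.shape[1]
    A = sigma * X.T @ X + 0.5 * np.eye(p)
    return np.linalg.solve(A, sigma * X.T @ y + 0.5 * b)

def prox_l1(b, lam, sigma):
    """Prox of sigma * lam * ||a||_1: elementwise soft-thresholding,
    [a]_j = [1 - lam*sigma / |[b]_j|]_+ * [b]_j."""
    return np.sign(b) * np.maximum(np.abs(b) - lam * sigma, 0.0)
```

<p>For very large <italic>p</italic>, the dense solve in <italic>prox_ridge</italic> would be replaced by one of the scalable approaches cited above.</p>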
<table-wrap position="float">
<label>Algorithm 1:</label>
<caption><p>Proximal decomposition for formulation 1.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td><inline-graphic xlink:href="fdata-02-00027-i0001.tif"/></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.2. Optimization for Formulation 2</title>
<p>For our formulation 2 we use an alternating minimization method on &#x00398; and &#x00393;; i.e., we alternately minimize over one of them with the other held fixed. The first alternating step estimates &#x00398; while fixing &#x00393;. This minimization problem is separable across columns, and each sub-problem can be written as a standard Lasso problem:</p>
<disp-formula id="E7"><label>(4)</label><mml:math id="M24"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo class="qopname">&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mover accent="false"><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo class="qopname">&#x0007E;</mml:mo></mml:mover><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where we define <inline-formula><mml:math id="M25"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msqrt><mml:msub><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M26"><mml:mover accent="false"><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mi>X</mml:mi><mml:mo>,</mml:mo><mml:msqrt><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msqrt><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula>; hence each column's sub-problem can be solved efficiently and in parallel. In the second step, we fix &#x00398; and optimize for &#x00393;. The optimization is</p>
<disp-formula id="E8"><label>(5)</label><mml:math id="M27"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mtext class="textrm" mathvariant="normal">minimize</mml:mtext></mml:mrow><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow></mml:msub><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="true">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mi>W</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>which is a standard bi-clustering problem, now on &#x00393;, and can be solved efficiently using the COnvex BiclusteRing Algorithm (COBRA) introduced in Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>), and described in Algorithm 3 (<xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>) for completeness. The overall procedure is given in <bold>Algorithm 2</bold>.</p>
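<p>The &#x00398;-update via the stacked Lasso of Equation (4) can be sketched as follows (a minimal illustration; the ISTA inner solver is a stand-in for any Lasso routine, and all names are hypothetical):</p>

```python
import numpy as np

def theta_step_column(X, y_i, gamma_i, lam1, lam2, n_iter=500):
    """Solve min_t ||y_i - X t||^2 + lam2 ||t - gamma_i||^2 + lam1 ||t||_1
    by stacking y_tilde = [y_i; sqrt(lam2) gamma_i], X_tilde = [X; sqrt(lam2) I],
    then running ISTA on the resulting standard Lasso problem."""
    n, p = X.shape
    X_t = np.vstack([X, np.sqrt(lam2) * np.eye(p)])
    y_t = np.concatenate([y_i, np.sqrt(lam2) * gamma_i])
    t = np.zeros(p)
    L = 2 * np.linalg.norm(X_t, 2) ** 2  # Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = 2 * X_t.T @ (X_t @ t - y_t)   # gradient step ...
        z = t - grad / L
        t = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)  # ... then soft-threshold
    return t
```

<p>Because the sub-problems are independent across <italic>i</italic>, this step parallelizes trivially over the <italic>k</italic> columns.</p>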
<table-wrap position="float">
<label>Algorithm 2:</label>
<caption><p>Alternating minimization for formulation 2.</p></caption>
<table frame="hsides" rules="groups">
<tbody>
<tr>
<td><inline-graphic xlink:href="fdata-02-00027-i0002.tif"/></td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>3.3. Numerical Convergence</title>
<p>We establish the following convergence result for our algorithms, when the loss function <italic>L</italic>(<italic>X, Y</italic>; &#x00398;) is convex in &#x00398;. The proofs are given in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
<p>Proposition 1. <italic>The algorithm described in section 3.1 converges to the global minimizer.</italic></p>
<p>Proposition 2. <italic>The algorithm described in section 3.2 converges to the global minimizer.</italic></p>
<p>For both formulations 1 and 2, the computational complexity is dominated by the COBRA algorithm, which solves a sequence of convex clustering problems. The subroutine used to solve each convex clustering subproblem scales in storage and computational operations as O(<italic>kpq</italic>), where <italic>k</italic> is the number of tasks, <italic>p</italic> is the number of features, and <italic>q</italic> is the number of non-zero weights. In our case <italic>q</italic> is much smaller than <italic>p</italic><sup>2</sup> and <italic>k</italic><sup>2</sup>: indeed, as we shall see in section 4.1, our weights are based on &#x003BA; &#x0003D; 5 nearest neighbors.</p>
</sec>
</sec>
<sec id="s4">
<title>4. Hyperparameter Choices, Solution Path, and Variations</title>
<p>We describe and justify the various hyperparameter choices for formulations 1 and 2.</p>
<sec>
<title>4.1. Weights and Sparsity Regularization</title>
<p>The choice of the column and row similarity weights <italic>W</italic> and <inline-formula><mml:math id="M28"><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula> can affect the quality of the clustering results, and we follow the suggestion in Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>) to set them. In their setting, however, the data matrix to be clustered is fixed and known, whereas in ours the coefficient matrix &#x00398; to be clustered is unknown. We therefore obtain a rough estimate <inline-formula><mml:math id="M29"><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> by minimizing the regularized loss function <italic>L</italic>(<italic>Z</italic>; &#x00398;) &#x0002B; &#x003BB;<sub>1</sub><italic>R</italic>(&#x00398;). For example, with multi-task regression in Equations (2) and (3), we can solve</p>
<disp-formula id="E9"><label>(6)</label><mml:math id="M30"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">min</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow></mml:munder></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:mi>Y</mml:mi><mml:mo>-</mml:mo><mml:mi>X</mml:mi><mml:mo>&#x00398;</mml:mo><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mi>F</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003BB;<sub>1</sub> is tuned by cross-validation (CV) and re-used in the rest of the algorithm. From our multi-task regression experiment, we find that the clustering results are robust to the choice of &#x003BB;<sub>1</sub>.</p>
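<p>Since Equation (6) separates across the <italic>k</italic> columns, the rough estimate can be computed column by column, e.g., with a simple ISTA loop (an illustrative sketch with hypothetical names, not the authors' implementation; any Lasso solver could be substituted):</p>

```python
import numpy as np

def rough_estimate(X, Y, lam1, n_iter=300):
    """Column-wise Lasso: min_Theta ||Y - X Theta||_F^2 + lam1 sum_i ||Theta_.i||_1.
    The objective separates over the k columns of Theta, so each column is
    solved independently (and could be solved in parallel across tasks)."""
    n, p = X.shape
    k = Y.shape[1]
    Theta = np.zeros((p, k))
    L = 2 * np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the smooth part
    for i in range(k):
        t = np.zeros(p)
        for _ in range(n_iter):
            z = t - 2 * X.T @ (X @ t - Y[:, i]) / L
            t = np.sign(z) * np.maximum(np.abs(z) - lam1 / L, 0.0)
        Theta[:, i] = t
    return Theta
```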
<p>With the estimated <inline-formula><mml:math id="M31"><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> we are ready to compute <italic>W</italic> and <inline-formula><mml:math id="M32"><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula>. The weight for columns <italic>i</italic> and <italic>j</italic> is computed as <inline-formula><mml:math id="M33"><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BA;</mml:mi></mml:mrow></mml:msubsup><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">exp</mml:mo><mml:mstyle mathsize="1.19em"><mml:mo stretchy="true">(</mml:mo></mml:mstyle><mml:mo>-</mml:mo><mml:mi>&#x003D5;</mml:mi><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mstyle mathsize="1.19em"><mml:mo stretchy="true">)</mml:mo></mml:mstyle></mml:math></inline-formula>, where <inline-formula><mml:math id="M34"><mml:msubsup><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003BA;</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> is 1 if <italic>j</italic> is among <italic>i</italic>&#x00027;s &#x003BA;-nearest-neighbors or vice versa, and 0 otherwise. Here &#x003D5; is non-negative, and &#x003D5; &#x0003D; 0 corresponds to uniform weights. In our synthetic and real data experiments we fix &#x003D5; &#x0003D; 20. <inline-formula><mml:math id="M35"><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:math></inline-formula> is computed analogously. It is important to keep the two penalty terms &#x003A9;<sub><italic>W</italic></sub>(&#x00398;) and <inline-formula><mml:math id="M36"><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> on the same scale; otherwise the row or column objective will dominate the solution. We normalize so that the column weights sum to <inline-formula><mml:math id="M37"><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula> and the row weights sum to <inline-formula><mml:math id="M38"><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msqrt><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>.</p>
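<p>A sketch of the column-weight construction (hypothetical names; the final rescaling so the weights sum to the target constant is omitted and would simply divide the returned matrix):</p>

```python
import numpy as np

def knn_gaussian_weights(Theta_hat, kappa=5, phi=20.0):
    """Column similarity weights: w_ij is the indicator that j is among the
    kappa nearest neighbors of i (or vice versa) times
    exp(-phi * ||Theta_hat[:, i] - Theta_hat[:, j]||_2^2)."""
    k = Theta_hat.shape[1]
    D2 = np.zeros((k, k))  # squared distances between columns
    for i in range(k):
        for j in range(k):
            D2[i, j] = np.sum((Theta_hat[:, i] - Theta_hat[:, j]) ** 2)
    W = np.exp(-phi * D2)
    # kappa-nearest-neighbor mask, symmetrized (i~j if either is a NN of the other)
    mask = np.zeros((k, k), dtype=bool)
    for i in range(k):
        order = np.argsort(D2[i])          # self comes first (distance 0)
        mask[i, order[1:kappa + 1]] = True
    mask = mask | mask.T
    W = W * mask
    np.fill_diagonal(W, 0.0)
    return W
```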
</sec>
<sec>
<title>4.2. Penalty Multiplier Tuning</title>
<p>We set the penalty multipliers (&#x003BB;<sub>1</sub>, &#x003BB;<sub>2</sub>, and &#x003BB;<sub>3</sub>) for both the formulations using a CV approach. We randomly split our samples into a training set and a hold-out validation set, fitting the models on the training set and then evaluating the root-mean-squared error (RMSE) on the validation set to choose the best values. In order to reduce the computational complexity, we estimate the multipliers greedily, one or two at a time. From our simulations, we determined that this is a reasonable choice. We recognize that these can be tuned further on a case-by-case basis.</p>
<p>&#x003BB;<sub>1</sub> is set to the reasonable value as determined in section 4.1 for both formulations, since the clustering results are robust to this choice. For formulation 1, we estimate the best &#x003BB;<sub>2</sub> by CV using Equation (1). For formulation 2, the tuning process is similar, but we pick a sequence of &#x003BB;<sub>2</sub> and &#x003BB;<sub>3</sub>. We estimate both <inline-formula><mml:math id="M46"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula> and <inline-formula><mml:math id="M47"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula>, but calculate RMSE with <inline-formula><mml:math id="M48"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:math></inline-formula>, since it is used in the clustering objective. When the path of bi-clusterings is computed, we fix &#x003BB;<sub>2</sub> to the CV estimate and vary only &#x003BB;<sub>3</sub>.</p>
</sec>
<sec>
<title>4.3. Solution Paths</title>
<p>One can obtain the entire solution paths for the estimated coefficients &#x00398; by varying the penalty multipliers. Here we provide an example using a synthetic dataset generated as follows. We consider the multi-task regression model: <italic>Y</italic> &#x0003D; <italic>X</italic>&#x00398;&#x0002A; &#x0002B; <italic>E</italic> with <inline-formula><mml:math id="M49"><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. All entries of the design matrix <italic>X</italic> are generated i.i.d. from <italic>N</italic>(0, 1). The true regression parameter &#x00398;<sup>&#x0002A;</sup> has a bi-cluster (checkerboard) structure. To simulate sparsity, we set the coefficients within many of the blocks in the checkerboard to 0. For the non-zero blocks, we follow the generative model recommended in Chi et al. 
(<xref ref-type="bibr" rid="B3">2014</xref>): the coefficients within each cluster are generated as &#x003B8;<sub><italic>ij</italic></sub> &#x0003D; &#x003BC;<sub><italic>rc</italic></sub> &#x0002B; &#x003F5;<sub><italic>ij</italic></sub> with <inline-formula><mml:math id="M50"><mml:msub><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> to make them close but not identical, where &#x003BC;<sub><italic>rc</italic></sub> is the mean of the cluster defined by the <italic>rth</italic> row partition and <italic>cth</italic> column partition. We set <italic>n</italic> &#x0003D; 50, <italic>p</italic> &#x0003D; 20, and <italic>k</italic> &#x0003D; 15. For the non-zero blocks, we set &#x003BC;<sub><italic>rc</italic></sub> &#x0007E; Uniform{&#x02212;2, &#x02212;1, 1, 2} and set &#x003C3;<sub>&#x003F5;</sub> &#x0003D; 0.25. We set &#x003C3; &#x0003D; 1.5. We use relatively small values for <italic>p</italic> and <italic>k</italic> since there will be a total of <italic>pk</italic> solution paths to visualize.</p>
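<p>The generative setup just described can be sketched as follows (the block counts and the zeroed fraction are illustrative choices, not values stated in the text):</p>

```python
import numpy as np

def make_checkerboard(n=50, p=20, k=15, row_blocks=4, col_blocks=3,
                      sigma_eps=0.25, sigma=1.5, zero_frac=0.5, seed=0):
    """Synthetic multi-task data Y = X Theta* + E with a sparse bi-cluster
    (checkerboard) Theta*: block means mu_rc drawn from {-2, -1, 1, 2},
    within-block noise N(0, sigma_eps^2), a fraction of blocks zeroed."""
    rng = np.random.default_rng(seed)
    rows = np.array_split(np.arange(p), row_blocks)
    cols = np.array_split(np.arange(k), col_blocks)
    Theta = np.zeros((p, k))
    for r in rows:
        for c in cols:
            if rng.random() < zero_frac:
                continue  # sparse (zero) block
            mu = rng.choice([-2.0, -1.0, 1.0, 2.0])
            Theta[np.ix_(r, c)] = mu + sigma_eps * rng.standard_normal((len(r), len(c)))
    X = rng.standard_normal((n, p))          # iid N(0, 1) design
    Y = X @ Theta + sigma * rng.standard_normal((n, k))
    return X, Y, Theta
```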
<p>We begin with the solution paths for formulation 1. We first fix a reasonable &#x003BB;<sub>1</sub> and vary &#x003BB;<sub>2</sub> to get solution paths for all the coefficients. In our experiment, we chose &#x003BB;<sub>1</sub> based on cross-validation as described in section 4.1. These paths are shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. We can see that as &#x003BB;<sub>2</sub> increases, the coefficients begin to merge and eventually for large enough &#x003BB;<sub>2</sub> they are all equal. The solution paths are smooth in &#x003BB;<sub>2</sub>. Similarly, we fix &#x003BB;<sub>2</sub> based on the cross-validation scheme described in section 4.2 and vary &#x003BB;<sub>1</sub> to get solution paths for all the coefficients. This is shown in <xref ref-type="fig" rid="F4">Figure 4</xref>. It is well-known that the solution paths for LASSO are piecewise linear (Rosset and Zhu, <xref ref-type="bibr" rid="B19">2007</xref>), when <italic>L</italic> is least squares loss. Here, we see that the solution paths are not piecewise linear, but rather a smoothed version of it. This smoothness is imparted by the convex clustering regularization, the third term in Equation (2).</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Solution paths for formulation 1, fixing &#x003BB;<sub>1</sub> and varying &#x003BB;<sub>2</sub>. Each line indicates a distinct coefficient.</p></caption>
<graphic xlink:href="fdata-02-00027-g0003.tif"/>
</fig>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Solution paths for formulation 1, fixing &#x003BB;<sub>2</sub> and varying &#x003BB;<sub>1</sub>. Each line indicates a distinct coefficient.</p></caption>
<graphic xlink:href="fdata-02-00027-g0004.tif"/>
</fig>
<p>We can obtain the solution paths for formulation 2 as functions of two variables. We first fix a reasonable &#x003BB;<sub>1</sub> and vary &#x003BB;<sub>2</sub> and &#x003BB;<sub>3</sub> to get solution paths for all the coefficients. These paths are shown in <xref ref-type="fig" rid="F5">Figure 5</xref>; they are smooth in &#x003BB;<sub>2</sub> and &#x003BB;<sub>3</sub>. Similarly, we fix a reasonable &#x003BB;<sub>2</sub> and vary &#x003BB;<sub>1</sub> and &#x003BB;<sub>3</sub>; the resulting paths, shown in <xref ref-type="fig" rid="F6">Figure 6</xref>, are smooth in &#x003BB;<sub>1</sub> and &#x003BB;<sub>3</sub>. The reasonable values are chosen by cross-validation.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Solution paths for formulation 2, fixing &#x003BB;<sub>1</sub> and varying &#x003BB;<sub>2</sub>, &#x003BB;<sub>3</sub>.</p></caption>
<graphic xlink:href="fdata-02-00027-g0005.tif"/>
</fig>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Solution paths for formulation 2, fixing &#x003BB;<sub>2</sub> and varying &#x003BB;<sub>1</sub>, &#x003BB;<sub>3</sub>.</p></caption>
<graphic xlink:href="fdata-02-00027-g0006.tif"/>
</fig>
</sec>
<sec>
<title>4.4. Bi-clustering Thresholds</title>
<p>It is well-known that LASSO tends to select too many variables (Meinshausen and Yu, <xref ref-type="bibr" rid="B16">2009</xref>). Hence ||&#x00398;<sub>&#x000B7;<italic>i</italic></sub> &#x02212; &#x00398;<sub>&#x000B7;<italic>j</italic></sub>||<sub>2</sub> may not be exactly zero in most cases, and we may end up identifying too many clusters as well. In Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>) the authors defined the measure <italic>v</italic><sub><italic>ij</italic></sub> &#x0003D; ||&#x00398;<sub>&#x000B7;<italic>i</italic></sub> &#x02212; &#x00398;<sub>&#x000B7;<italic>j</italic></sub>||<sub>2</sub> and placed the <italic>ith</italic> and <italic>jth</italic> columns in the same group if <italic>v</italic><sub><italic>ij</italic></sub> &#x02264; &#x003C4; for some threshold &#x003C4;, inspired by Meinshausen and Yu (<xref ref-type="bibr" rid="B16">2009</xref>). In our formulation 1, after selecting the best tuning parameters and estimating &#x00398;, we place the <italic>ith</italic> and <italic>jth</italic> rows in the same group if ||&#x00398;<sub><italic>i</italic>&#x000B7;</sub> &#x02212; &#x00398;<sub><italic>j</italic>&#x000B7;</sub>||<sub>2</sub> &#x02264; &#x003C4;<sub><italic>r</italic></sub>. Similarly, if ||&#x00398;<sub>&#x000B7;<italic>i</italic></sub> &#x02212; &#x00398;<sub>&#x000B7;<italic>j</italic></sub>||<sub>2</sub> &#x02264; &#x003C4;<sub><italic>c</italic></sub> we place the <italic>ith</italic> and <italic>jth</italic> columns in the same group. For formulation 2, we repeat the same approach using &#x00393; instead of &#x00398;.</p>
<p>To compute the thresholds &#x003C4;<sub><italic>r</italic></sub> and &#x003C4;<sub><italic>c</italic></sub>, we first calculate [<italic>v</italic><sub><italic>col</italic></sub>]<sub><italic>ij</italic></sub> &#x0003D; ||&#x00398;<sub>&#x000B7;<italic>i</italic></sub> &#x02212; &#x00398;<sub>&#x000B7;<italic>j</italic></sub>||<sub>2</sub> and stack this matrix to vector <italic>v</italic><sub><italic>col</italic></sub>; similarly we calculate [<italic>v</italic><sub><italic>row</italic></sub>]<sub><italic>ij</italic></sub> &#x0003D; ||&#x00398;<sub><italic>i</italic>&#x000B7;</sub> &#x02212; &#x00398;<sub><italic>j</italic>&#x000B7;</sub>||<sub>2</sub> and stack to vector <italic>v</italic><sub><italic>row</italic></sub>. In the case of sparse linear regression, &#x003C4; should be on the order of the noise (Meinshausen and Yu, <xref ref-type="bibr" rid="B16">2009</xref>): <inline-formula><mml:math id="M51"><mml:mi>&#x003C4;</mml:mi><mml:mo>&#x0221D;</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:msqrt><mml:mrow><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msqrt></mml:math></inline-formula>, where &#x003C3; is typically estimated using the standard deviation of residuals. In general we could set &#x003C4; proportional to the standard deviation of <italic>v</italic><sub><italic>row</italic></sub> or <italic>v</italic><sub><italic>col</italic></sub>.</p>
<p>However in our case, we have an additional regression loss term for estimating the parameters and hence there are two sources of randomness, the regression residual and the error in <italic>v</italic>. Taking these into account, we set <inline-formula><mml:math id="M52"><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">[</mml:mo></mml:mstyle><mml:mi>&#x003C3;</mml:mi><mml:msqrt><mml:mrow><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msqrt><mml:mo>&#x0002B;</mml:mo><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>c</mml:mi><mml:mi>o</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">]</mml:mo></mml:mstyle></mml:math></inline-formula> and <inline-formula><mml:math id="M53"><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">[</mml:mo></mml:mstyle><mml:mi>&#x003C3;</mml:mi><mml:msqrt><mml:mrow><mml:mo class="qopname">log</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>n</mml:mi></mml:mrow></mml:msqrt><mml:mo>&#x0002B;</mml:mo><mml:mi>s</mml:mi><mml:mi>t</mml:mi><mml:mi>d</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>r</mml:mi><mml:mi>o</mml:mi><mml:mi>w</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mstyle mathsize="1.61em"><mml:mo stretchy="true">]</mml:mo></mml:mstyle></mml:math></inline-formula>. We set the multiplier to <inline-formula><mml:math id="M54"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:math></inline-formula>, following the usual conservative heuristics.</p>
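The thresholding rule above can be sketched in a few lines of numpy. This is an illustration under our reading of the rule: pairs with distance at most &#x003C4; are merged transitively into groups (via a small union-find); the toy matrix, &#x003C3;, <italic>n</italic>, and <italic>p</italic> are hypothetical values, not the paper's data.

```python
import numpy as np

def tau_threshold(v, sigma, n, p):
    # tau = (1/2) [ sigma * sqrt(log(p)/n) + std(v) ], as in the text
    return 0.5 * (sigma * np.sqrt(np.log(p) / n) + v.std())

def group_by_threshold(vecs, tau):
    """Merge items i, j whenever ||vecs[i] - vecs[j]||_2 <= tau (transitively)."""
    m = len(vecs)
    parent = list(range(m))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    D = np.linalg.norm(vecs[:, None, :] - vecs[None, :, :], axis=2)
    for i in range(m):
        for j in range(i + 1, m):
            if D[i, j] <= tau:
                parent[find(i)] = find(j)
    labels = np.array([find(i) for i in range(m)])
    v = D[np.triu_indices(m, 1)]          # stacked pairwise-distance vector
    return labels, v

# columns of a small 4 x 6 coefficient matrix form two well-separated clusters
Theta = np.column_stack([np.full(4, 1.0), np.full(4, 1.01), np.full(4, 1.02),
                         np.full(4, -1.0), np.full(4, -1.01), np.full(4, -1.02)])
_, v_col = group_by_threshold(Theta.T, tau=0.0)   # first pass: just the distances
tau_c = tau_threshold(v_col, sigma=1.5, n=50, p=4)
col_labels, _ = group_by_threshold(Theta.T, tau_c)
```

Here `tau_c` lands between the small within-cluster distances and the large between-cluster ones, so the two column groups are recovered; the row rule with &#x003C4;<sub><italic>r</italic></sub> is the same code applied to `Theta` instead of `Theta.T`.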
</sec>
<sec>
<title>4.5. Specializing to Column- or Row-Only Clustering (a.k.a. <italic>Uni-clustering</italic>)</title>
<p>Although formulations 1 and 2 have been developed for row-column bi-clustering, they can be easily specialized to clustering columns or rows alone, by respectively using only &#x003A9;<sub><italic>W</italic></sub>(&#x00398;) or <inline-formula><mml:math id="M55"><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> in Equation (2), or using only &#x003A9;<sub><italic>W</italic></sub>(&#x00393;) or <inline-formula><mml:math id="M56"><mml:msub><mml:mrow><mml:mo>&#x003A9;</mml:mo></mml:mrow><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mi>W</mml:mi></mml:mrow><mml:mo>&#x0007E;</mml:mo></mml:mover></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mo>&#x00393;</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> in Equation (3).</p>
</sec>
</sec>
<sec id="s5">
<title>5. Synthetic Data Experiments</title>
<p>We demonstrate our approach on the problem of multi-task learning using experiments with synthetic data. <italic>As emphasized before, our main focus is on the bi-clustering results rather than on parameter estimation</italic>. We begin by describing the performance measures used to evaluate clustering and estimation performance.</p>
<sec>
<title>5.1. Performance Measures</title>
<p>Assessing the clustering quality can be hard. In this paper, we use the following three measures to evaluate the quality of clustering: the adjusted Rand index (Hubert and Arabie, <xref ref-type="bibr" rid="B8">1985</xref>) (ARI), the F-1 score (F-1), and the Jaccard index (JI).</p>
<p>Assume <inline-formula><mml:math id="M57"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">B</mml:mi></mml:mrow></mml:math></inline-formula> is the true clustering of a set <italic>S</italic> of elements, and define <italic>TP</italic> to be the number of pairs of elements in <italic>S</italic> that are in the same subset in the estimated clustering <inline-formula><mml:math id="M58"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula> and in the same subset in <inline-formula><mml:math id="M59"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">B</mml:mi></mml:mrow></mml:math></inline-formula>. This is the number of true positives; similarly, we define <italic>TN</italic>, <italic>FN</italic>, and <italic>FP</italic> as the numbers of true negatives, false negatives, and false positives, respectively. With <inline-formula><mml:math id="M60"><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula> and <inline-formula><mml:math id="M61"><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, the F-1 score is defined as:</p>
<disp-formula id="E10"><label>(7)</label><mml:math id="M62"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">F-1</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:mo>&#x000B7;</mml:mo><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>i</mml:mi><mml:mi>s</mml:mi><mml:mi>i</mml:mi><mml:mi>o</mml:mi><mml:mi>n</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>r</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi><mml:mi>a</mml:mi><mml:mi>l</mml:mi><mml:mi>l</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Using the same notation as for the F-1 score, the Jaccard index is defined as:</p>
<disp-formula id="E11"><label>(8)</label><mml:math id="M63"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:mtext class="textrm" mathvariant="normal">JI</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>P</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>F</mml:mi><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>For all three measures, a value of 1 corresponds to the best possible performance, and a value near 0 indicates poor performance. To compute ARI, F-1, and JI, we choose the multiplier &#x003BB;<sub>2</sub> in formulation 1, and {&#x003BB;<sub>2</sub>, &#x003BB;<sub>3</sub>} in formulation 2, using the approach described in section 4.2, and obtain the estimated clusterings.</p>
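The pair-counting definitions above translate directly to code; a minimal sketch (ARI is not reimplemented here, since a standard implementation exists as `adjusted_rand_score` in scikit-learn):

```python
from itertools import combinations

def pair_counts(pred, true):
    """Count element pairs by whether the two clusterings co-assign them."""
    TP = FP = FN = TN = 0
    for i, j in combinations(range(len(true)), 2):
        same_pred, same_true = pred[i] == pred[j], true[i] == true[j]
        if same_pred and same_true:
            TP += 1
        elif same_pred:
            FP += 1
        elif same_true:
            FN += 1
        else:
            TN += 1
    return TP, FP, FN, TN

def f1_and_jaccard(pred, true):
    TP, FP, FN, _ = pair_counts(pred, true)
    precision, recall = TP / (TP + FP), TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    ji = TP / (TP + FP + FN)
    return f1, ji
```

For example, `f1_and_jaccard([0, 0, 0, 1], [0, 0, 1, 1])` gives approximately (0.4, 0.25), while a clustering that matches the truth up to relabeling scores (1.0, 1.0) on both measures.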
<p>The estimation accuracy is measured by calculating the RMSE on an independent test set, and also the parameter recovery accuracy, <inline-formula><mml:math id="M64"><mml:mo>&#x02016;</mml:mo><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>&#x02016;</mml:mo><mml:mo>/</mml:mo><mml:mo>&#x02016;</mml:mo><mml:msup><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>&#x02016;</mml:mo></mml:math></inline-formula> where <inline-formula><mml:math id="M65"><mml:msub><mml:mrow><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>e</mml:mi><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula> and &#x00398;<sup>&#x0002A;</sup> are the estimated and true coefficient matrices.</p>
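The two estimation measures are straightforward to compute; a small sketch using Frobenius norms:

```python
import numpy as np

def estimation_metrics(Theta_hat, Theta_star, X_test, Y_test):
    """Test-set RMSE and relative parameter recovery error."""
    rmse = np.sqrt(np.mean((Y_test - X_test @ Theta_hat) ** 2))
    rec_err = (np.linalg.norm(Theta_hat - Theta_star)
               / np.linalg.norm(Theta_star))          # Frobenius norms
    return rmse, rec_err
```

With a perfect estimate on noiseless test data, both quantities are exactly zero; in the noisy settings below, the RMSE is bounded from below by the oracle value &#x003C3;.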
</sec>
<sec>
<title>5.2. Simulation Setup and Results</title>
<p>We focus on multi-task regression: <italic>Y</italic> &#x0003D; <italic>X</italic>&#x00398;&#x0002A; &#x0002B; <italic>E</italic> with <inline-formula><mml:math id="M66"><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. All the entries of design matrix <italic>X</italic> are generated as iid from <italic>N</italic>(0, 1). The true regression parameter &#x00398;<sup>&#x0002A;</sup> has a bi-cluster (checkerboard) structure. To simulate sparsity, we set the coefficients within many of the blocks in the checkerboard to 0. For the non-zero blocks, we follow the generative model recommended in Chi et al. 
(<xref ref-type="bibr" rid="B3">2014</xref>): the coefficients within each cluster are generated as &#x003B8;<sub><italic>ij</italic></sub> &#x0003D; &#x003BC;<sub><italic>rc</italic></sub> &#x0002B; &#x003F5;<sub><italic>ij</italic></sub> with <inline-formula><mml:math id="M67"><mml:msub><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003F5;</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> to make them close but not identical, where &#x003BC;<sub><italic>rc</italic></sub> is the mean of the cluster defined by the <italic>r</italic><sup><italic>th</italic></sup> row partition and <italic>c</italic><sup><italic>th</italic></sup> column partition. We set <italic>n</italic> &#x0003D; 200, <italic>p</italic> &#x0003D; 500, and <italic>k</italic> &#x0003D; 250 in our experiment. For the non-zero blocks, we set &#x003BC;<sub><italic>rc</italic></sub> &#x0007E; Uniform{&#x02212;2, &#x02212;1, 1, 2} and set &#x003C3;<sub>&#x003F5;</sub> &#x0003D; 0.25. We try the low-noise setting (&#x003C3; &#x0003D; 1.5), where it is relatively easy to estimate the clusters, and the high-noise setting (&#x003C3; &#x0003D; 3), where it is harder to obtain them.</p>
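The generative model above can be sketched as follows. Dimensions, group counts, and the rng seeds are illustrative choices (smaller than the experiment's <italic>n</italic> = 200, <italic>p</italic> = 500, <italic>k</italic> = 250), not the paper's exact setup.

```python
import numpy as np

def make_checkerboard(p, k, n_row_grp, n_col_grp, zero_frac=0.6,
                      mu_vals=(-2, -1, 1, 2), sigma_eps=0.25, seed=0):
    """Theta with theta_ij = mu_rc + eps_ij inside non-zero blocks,
    where (r, c) is the (row-group, column-group) block of entry (i, j)."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, n_row_grp, size=p)             # row partition
    cols = rng.integers(0, n_col_grp, size=k)             # column partition
    mu = rng.choice(np.array(mu_vals, float), size=(n_row_grp, n_col_grp))
    mu[rng.random((n_row_grp, n_col_grp)) < zero_frac] = 0.0   # zero blocks
    block_mean = mu[rows][:, cols]
    eps = sigma_eps * rng.standard_normal((p, k))
    return block_mean + eps * (block_mean != 0)           # zero blocks stay 0

rng = np.random.default_rng(1)
n, p, k, sigma = 50, 20, 15, 1.5
Theta_star = make_checkerboard(p, k, n_row_grp=4, n_col_grp=3)
X = rng.standard_normal((n, p))                           # iid N(0, 1) design
Y = X @ Theta_star + sigma * rng.standard_normal((n, k))  # noisy responses
```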
<p>We compare our formulations 1 and 2 with a 2-step <italic>estimate-then-cluster</italic> approach: (a) estimate <inline-formula><mml:math id="M68"><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> first using the LASSO, and (b) perform convex bi-clustering on <inline-formula><mml:math id="M69"><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula>. <inline-formula><mml:math id="M70"><mml:mover accent="false"><mml:mrow><mml:mo>&#x00398;</mml:mo></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula> is estimated by solving (6) with the best &#x003BB;<sub>1</sub> selected as discussed in section 4.1, and the convex bi-clustering step is implemented using the COBRA algorithm of Chi et al. (<xref ref-type="bibr" rid="B3">2014</xref>). Our <italic>baseline</italic> clustering performance is the best of: (a) letting each coefficient be its own group, and (b) imposing a single group for all coefficients.</p>
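A sketch of the 2-step baseline on toy data: scikit-learn's `Lasso` for step (a), and, since a COBRA implementation is not assumed available here, Ward hierarchical clustering of the rows and columns of the estimate as a stand-in for the convex bi-clustering of step (b). All sizes and penalties are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
n, p, k = 100, 30, 12
# a 2 x 2 checkerboard of coefficient blocks plus small within-block noise
Theta_star = (np.kron(np.array([[2.0, -1.0], [-2.0, 1.0]]),
                      np.ones((p // 2, k // 2)))
              + 0.25 * rng.standard_normal((p, k)))
X = rng.standard_normal((n, p))
Y = X @ Theta_star + 0.5 * rng.standard_normal((n, k))

# step (a): sparse estimate of Theta (sklearn's Lasso accepts multi-output Y)
Theta_hat = Lasso(alpha=0.05).fit(X, Y).coef_.T          # coef_ is (k, p)

# step (b): cluster rows and columns of the estimate (stand-in for COBRA)
row_labels = fcluster(linkage(Theta_hat, "ward"), t=2, criterion="maxclust")
col_labels = fcluster(linkage(Theta_hat.T, "ward"), t=2, criterion="maxclust")
```

The point of the comparison in the text is that any estimation error in step (a) is inherited, uncorrected, by step (b), whereas the joint formulations let the clustering penalty regularize the estimate itself.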
<p>The average clustering quality over 50 replicates is shown in <xref ref-type="table" rid="T1">Table 1</xref> for the low and high noise settings. Most performance measures are reported in the format <italic>mean</italic>&#x000B1;<italic>std</italic>.<italic>dev</italic>. For each noise setting, the first, second, and third blocks correspond to the performance of row clustering, column clustering, and row-column bi-clustering, respectively. We optimize only for bi-clusterings; the row and column clusterings are obtained as by-products.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Clustering performance for the low and high noise settings.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Noise</bold></th>
<th valign="top" align="center"><bold>Metric</bold></th>
<th valign="top" align="center"><bold>Baseline</bold></th>
<th valign="top" align="center"><bold>2-step</bold></th>
<th valign="top" align="center"><bold>F1</bold></th>
<th valign="top" align="center"><bold>F2</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">ARI</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.679 &#x000B1; 0.157</td>
<td valign="top" align="center">0.869 &#x000B1; 0.069</td>
<td valign="top" align="center">0.900 &#x000B1; 0.046</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">F-1</td>
<td valign="top" align="center">0.446</td>
<td valign="top" align="center">0.757 &#x000B1; 0.128</td>
<td valign="top" align="center">0.907 &#x000B1; 0.052</td>
<td valign="top" align="center">0.931 &#x000B1; 0.022</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">JI</td>
<td valign="top" align="center">0.287</td>
<td valign="top" align="center">0.625 &#x000B1; 0.161</td>
<td valign="top" align="center">0.834 &#x000B1; 0.081</td>
<td valign="top" align="center">0.871 &#x000B1; 0.042</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Low</td>
<td valign="top" align="center">ARI</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.877 &#x000B1; 0.043</td>
<td valign="top" align="center">0.914 &#x000B1; 0.020</td>
<td valign="top" align="center">0.915 &#x000B1; 0.013</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">F-1</td>
<td valign="top" align="center">0.446</td>
<td valign="top" align="center">0.908 &#x000B1; 0.037</td>
<td valign="top" align="center">0.933 &#x000B1; 0.023</td>
<td valign="top" align="center">0.934 &#x000B1; 0.012</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">JI</td>
<td valign="top" align="center">0.287</td>
<td valign="top" align="center">0.847 &#x000B1; 0.048</td>
<td valign="top" align="center">0.876 &#x000B1; 0.031</td>
<td valign="top" align="center">0.887 &#x000B1; 0.025</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">Low</td>
<td valign="top" align="center">ARI</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.708 &#x000B1; 0.118</td>
<td valign="top" align="center">0.841 &#x000B1; 0.059</td>
<td valign="top" align="center">0.863 &#x000B1; 0.035</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">F-1</td>
<td valign="top" align="center">0.172</td>
<td valign="top" align="center">0.734 &#x000B1; 0.110</td>
<td valign="top" align="center">0.857 &#x000B1; 0.052</td>
<td valign="top" align="center">0.877 &#x000B1; 0.026</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="center">JI</td>
<td valign="top" align="center">0.094</td>
<td valign="top" align="center">0.591 &#x000B1; 0.134</td>
<td valign="top" align="center">0.753 &#x000B1; 0.077</td>
<td valign="top" align="center">0.781 &#x000B1; 0.035</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">High</td>
<td valign="top" align="center">ARI</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.577 &#x000B1; 0.163</td>
<td valign="top" align="center">0.803 &#x000B1; 0.104</td>
<td valign="top" align="center">0.804 &#x000B1; 0.096</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="center">F-1</td>
<td valign="top" align="center">0.446</td>
<td valign="top" align="center">0.674 &#x000B1; 0.138</td>
<td valign="top" align="center">0.874 &#x000B1; 0.093</td>
<td valign="top" align="center">0.874 &#x000B1; 0.075</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="center">JI</td>
<td valign="top" align="center">0.287</td>
<td valign="top" align="center">0.525 &#x000B1; 0.159</td>
<td valign="top" align="center">0.793 &#x000B1; 0.097</td>
<td valign="top" align="center">0.792 &#x000B1; 0.098</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">High</td>
<td valign="top" align="center">ARI</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.734 &#x000B1; 0.132</td>
<td valign="top" align="center">0.905 &#x000B1; 0.077</td>
<td valign="top" align="center">0.905 &#x000B1; 0.046</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="center">F-1</td>
<td valign="top" align="center">0.446</td>
<td valign="top" align="center">0.799 &#x000B1; 0.107</td>
<td valign="top" align="center">0.924 &#x000B1; 0.054</td>
<td valign="top" align="center">0.933 &#x000B1; 0.039</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="center">JI</td>
<td valign="top" align="center">0.287</td>
<td valign="top" align="center">0.689 &#x000B1; 0.120</td>
<td valign="top" align="center">0.872 &#x000B1; 0.078</td>
<td valign="top" align="center">0.867 &#x000B1; 0.065</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">High</td>
<td valign="top" align="center">ARI</td>
<td valign="top" align="center">0</td>
<td valign="top" align="center">0.555 &#x000B1; 0.187</td>
<td valign="top" align="center">0.801 &#x000B1; 0.125</td>
<td valign="top" align="center">0.812 &#x000B1; 0.105</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="center">F-1</td>
<td valign="top" align="center">0.172</td>
<td valign="top" align="center">0.586 &#x000B1; 0.152</td>
<td valign="top" align="center">0.824 &#x000B1; 0.104</td>
<td valign="top" align="center">0.821 &#x000B1; 0.086</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="center">JI</td>
<td valign="top" align="center">0.094</td>
<td valign="top" align="center">0.437 &#x000B1; 0.179</td>
<td valign="top" align="center">0.714 &#x000B1; 0.118</td>
<td valign="top" align="center">0.713 &#x000B1; 0.104</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>For each noise setting, the first block is for row clustering; the second block is for column clustering; and the third block is for row-column biclustering</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>From <xref ref-type="table" rid="T1">Table 1</xref> we see that both formulations 1 and 2 give better results than the 2-step procedure for row clustering, column clustering, and row-column bi-clustering. Moreover, the clustering results given by our formulations are more stable, with a smaller spread in performance.</p>
<p>It is also instructive to note that the performance obtained for columns is substantially better than that obtained for rows. This could be for two reasons: (a) the columns of &#x00398; have a one-to-one correspondence with the columns of the task responses <italic>Y</italic>, and hence any relationship between the tasks is easily inherited; (b) the rows of &#x00398; can be noisier than the columns, since each row contributes to all the tasks.</p>
<p>The performance gain over the 2-step approach is much larger in the high-noise setting than in the low-noise setting. This makes sense: when the noise level is low, the estimation step of the 2-step approach is accurate and the error propagated into the clustering step is relatively small. At high noise levels, however, the estimation can be inaccurate; this error propagates into the clustering step and makes the clustering results of the 2-step approach unreliable. Since our formulations perform estimation and clustering jointly, they obtain more reliable and stable results.</p>
<p>The RMSEs evaluated on the test set and the parameter recovery accuracy are provided in <xref ref-type="table" rid="T2">Table 2</xref>. The oracle RMSE (with &#x00398; known) is 1.5 for the low noise setting and 3.0 for the high noise setting in <xref ref-type="table" rid="T2">Table 2</xref>, and we can see that the proposed methods improve over the alternatives. We also observe improvements in parameter recovery accuracy. Although the improvement is marginal, it demonstrates that imposing the bi-clustering structure does not sacrifice estimation accuracy.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>RMSE and parameter recovery accuracy of the estimation schemes for low noise (&#x003C3; &#x0003D; 1.5) and high noise (&#x003C3; &#x0003D; 3) settings.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Noise</bold></th>
<th valign="top" align="left"><bold>Accuracy metric</bold></th>
<th valign="top" align="center"><bold>Lasso</bold></th>
<th valign="top" align="center"><bold>2-step</bold></th>
<th valign="top" align="center"><bold>Form1</bold></th>
<th valign="top" align="center"><bold>Form2</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="left">RMSE</td>
<td valign="top" align="center">1.627 &#x000B1; 0.02</td>
<td valign="top" align="center">1.622 &#x000B1; 0.02</td>
<td valign="top" align="center">1.613 &#x000B1; 0.02</td>
<td valign="top" align="center">1.612 &#x000B1; 0.02</td>
</tr>
<tr>
<td valign="top" align="left">Low</td>
<td valign="top" align="left">Rec. acc.</td>
<td valign="top" align="center">0.234 &#x000B1; 0.03</td>
<td valign="top" align="center">0.231 &#x000B1; 0.03</td>
<td valign="top" align="center">0.223 &#x000B1; 0.03</td>
<td valign="top" align="center">0.222 &#x000B1; 0.03</td>
</tr>
<tr style="border-top: thin solid #000000;">
<td valign="top" align="left">High</td>
<td valign="top" align="left">RMSE</td>
<td valign="top" align="center">3.34 &#x000B1; 0.02</td>
<td valign="top" align="center">3.30 &#x000B1; 0.02</td>
<td valign="top" align="center">3.23 &#x000B1; 0.02</td>
<td valign="top" align="center">3.16 &#x000B1; 0.02</td>
</tr>
<tr>
<td valign="top" align="left">High</td>
<td valign="top" align="left">Rec. acc.</td>
<td valign="top" align="center">0.364 &#x000B1; 0.06</td>
<td valign="top" align="center">0.362 &#x000B1; 0.06</td>
<td valign="top" align="center">0.327 &#x000B1; 0.05</td>
<td valign="top" align="center">0.325 &#x000B1; 0.06</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s6">
<title>6. Real Data Experiments</title>
<p>We demonstrate the proposed approaches using real datasets obtained from experiments with Sorghum crops (Tuinstra, <xref ref-type="bibr" rid="B21">2016</xref>). We consider two specific problems from this pipeline: (a) predictive modeling of plant traits using features from remote sensed data (section 6.1), (b) GWAS using the reference traits (section 6.2).</p>
<sec>
<title>6.1. Phenotypic Trait Prediction From Remote Sensed Data</title>
<p>The experimental data was obtained from 18 Sorghum varieties planted in 6 replicate plot locations, and we considered the trait of plant height. The 18 variety names are given in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
<p>From the RGB and hyperspectral images of each plot, we extract a feature vector of length 206. Hence <italic>n</italic> &#x0003D; 6, <italic>p</italic> &#x0003D; 206, and the number of tasks <italic>k</italic> &#x0003D; 18 for each trait considered. Having many varieties with far fewer replicates than predictors poses a major challenge: building a separate model for each variety is unrealistic, while a single model does not fit all varieties well. Our proposed simultaneous estimation and clustering approach addresses this by sharing information among tasks, enabling learning at the requisite level of robustness. Note that here we use the column-only clustering variant of formulation 1.</p>
<p>The dendrogram for task clusters obtained by sweeping the penalty multiplier &#x003BB;<sub>2</sub> is given in <xref ref-type="fig" rid="F7">Figure 7</xref>. This provides some interesting insights from a plant science perspective. As highlighted in <xref ref-type="fig" rid="F7">Figure 7</xref>, the predictive models (columns of &#x00398;) for thicker medium dark plants are grouped together. Similar grouping is seen for thinner tall dark plants, and thick tall plants with many light leaves.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Tree structure of tasks (varieties) inferred using our approach for plant height.</p></caption>
<graphic xlink:href="fdata-02-00027-g0007.tif"/>
</fig>
<p>To compute RMSE, we perform 6-fold CV, where each fold contains at least one example from each variety. Since we only have <italic>n</italic> &#x0003D; 6 samples per variety (i.e., per task), it is unrealistic to learn a separate model for each variety. For each CV split, we first learn a grouping using one of the compared methods, treat all the samples within a group as i.i.d., and estimate their regression coefficients using Lasso. The methods compared with our approach are: (a) <italic>single model</italic>, which learns a single predictive model using Lasso, treating all the varieties as i.i.d.; (b) <italic>No group multitask learning</italic>, which learns a traditional multitask model using Group Lasso, where each variety forms a separate group; and (c) Kang et al. (<xref ref-type="bibr" rid="B11">2011</xref>), which uses a mixed integer program to learn shared feature representations among tasks while simultaneously determining &#x0201C;with whom&#x0201D; each task should share. Results reported in <xref ref-type="table" rid="T3">Table 3</xref> indicate that our groupings yield superior predictive accuracy.</p>
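The per-split fitting step described above (pool the samples of all tasks in a group, treat them as i.i.d., fit one Lasso per group) can be sketched as follows. The helper and the toy data are hypothetical; in the actual experiment each variety has its own plot samples, which is what the per-task lists model.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_groupwise(X_per_task, y_per_task, task_groups, alpha=0.1):
    """Pool the samples of all tasks sharing a group label, treat them as
    i.i.d., and fit one Lasso per group; returns {group: coefficient vector}."""
    task_groups = np.asarray(task_groups)
    coefs = {}
    for g in np.unique(task_groups):
        tasks = np.where(task_groups == g)[0]
        Xg = np.vstack([X_per_task[t] for t in tasks])        # pooled design
        yg = np.concatenate([y_per_task[t] for t in tasks])   # pooled responses
        coefs[g] = Lasso(alpha=alpha).fit(Xg, yg).coef_
    return coefs

# toy setup: 4 tasks with 6 samples each, p = 10 features, grouping {0,1}/{2,3}
rng = np.random.default_rng(0)
p, n_per = 10, 6
X_per_task = [rng.standard_normal((n_per, p)) for _ in range(4)]
beta = np.zeros(p); beta[:3] = [1.0, -1.0, 2.0]
y_per_task = [X @ beta + 0.1 * rng.standard_normal(n_per) for X in X_per_task]
coefs = fit_groupwise(X_per_task, y_per_task, [0, 0, 1, 1])
```

Pooling turns 6 samples per task into 12 per group here, which is the point: the grouping buys effective sample size at the cost of assuming tasks within a group share one model.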
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>RMSE for plant height prediction.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="center"><bold>RMSE</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Single model</td>
<td valign="top" align="center">44.39 &#x000B1; 6.55</td>
</tr>
<tr>
<td valign="top" align="left">No group multitask learning</td>
<td valign="top" align="center">36.94 &#x000B1; 6.10</td>
</tr>
<tr>
<td valign="top" align="left">Kang et al.</td>
<td valign="top" align="center">37.55 &#x000B1; 7.60</td>
</tr>
<tr>
<td valign="top" align="left">Proposed</td>
<td valign="top" align="center"><bold>33.31</bold> &#x000B1; <bold>5.10</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Bold values indicate the best result</italic>.</p>
</table-wrap-foot>
</table-wrap>
</sec>
<sec>
<title>6.2. Multi-Response GWAS</title>
<p>We apply our approach to a multi-response Genome-Wide Association Study (GWAS). While traditional GWAS focuses on associations with single phenotypes, we aim to automatically learn the grouping structure among the phenotypes as well as the features (columns and rows of &#x00398;) using our proposed method. We use the proposed formulations 1 and 2 (bi-clustering variant) in this experiment.</p>
<p>The design matrix <italic>X</italic> consists of SNPs of Sorghum varieties. We consider <italic>n</italic> &#x0003D; 911 varieties and over 80,000 SNPs. We remove duplicate SNPs, as well as SNPs that do not have significant correlation with at least one response variable, and retain <italic>p</italic> &#x0003D; 2,937 SNPs. The output data <italic>Y</italic> contains the following 6 response variables (columns) for all the <italic>n</italic> varieties, collected by hand measurements:</p>
<list list-type="order">
<list-item><p><italic>Height to panicle</italic> (h1): The height of the plant up to the panicle.</p></list-item>
<list-item><p><italic>Height to top collar</italic> (h2): The height of the plant up to the topmost leaf collar.</p></list-item>
<list-item><p><italic>Diameter top collar</italic> (d1): The diameter of the stem at the topmost leaf collar.</p></list-item>
<list-item><p><italic>Diameter at 5 cm from base</italic> (d2): The diameter of the stem at 5 cm from the base of the plant.</p></list-item>
<list-item><p><italic>Leaf collar count</italic> (l1): The number of leaf collars in the plant.</p></list-item>
<list-item><p><italic>Green leaf count</italic> (l2): The total number of green leaves. This is at most l1, since some leaves may have senesced and are no longer green.</p></list-item>
</list>
<p>For each variety, each trait can be an average of measurements from up to four plants.</p>
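The SNP preprocessing described above (dropping duplicate columns, then keeping only SNPs correlated with at least one trait) can be sketched in numpy. Toy shapes and the 0.05 cutoff are hypothetical stand-ins for the actual data and significance threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p_raw, q = 911, 500, 6                  # varieties, raw SNPs (toy), traits
X = rng.integers(0, 3, size=(n, p_raw)).astype(float)   # SNPs coded 0/1/2
X[:, 1] = X[:, 0]                          # plant one duplicate column
Y = rng.normal(size=(n, q))

# 1) remove duplicate SNP columns, preserving original order
_, keep = np.unique(X, axis=1, return_index=True)
X = X[:, np.sort(keep)]

# 2) keep SNPs with high absolute correlation to at least one trait
Xc = (X - X.mean(0)) / (X.std(0) + 1e-12)
Yc = (Y - Y.mean(0)) / (Y.std(0) + 1e-12)
corr = np.abs(Xc.T @ Yc) / n               # (SNPs x traits) correlation magnitudes
X = X[:, corr.max(axis=1) > 0.05]          # hypothetical threshold
```

On the real data this two-step filter reduces the more than 80,000 raw SNPs to the <italic>p</italic> = 2,937 columns used in the study.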
<p>The coefficient matrices given by our formulations are visualized in <xref ref-type="fig" rid="F8">Figure 8</xref>. To make the figure easier to interpret, we exclude the rows with all zero coefficients and take the average over the coefficients within each bi-cluster. The light yellow regions are coefficients close to zero; red and blue areas are positive and negative coefficients, respectively. The rows and columns are reordered to best show the checkerboard patterns. We wish to emphasize again that these checkerboard patterns in the coefficient matrices are automatically discovered using our proposed procedures, and are not readily evident or trivially discoverable from the data.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Smoothed coefficient matrix obtained from formulations 1 (left) and 2 (right), revealing the bi-clustering structure.</p></caption>
<graphic xlink:href="fdata-02-00027-g0008.tif"/>
</fig>
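The smoothing and reordering used to produce Figure 8 can be sketched as follows. The coefficient matrix, cluster labels, and sizes here are random placeholders, not the fitted output of formulations 1 or 2:

```python
import numpy as np

rng = np.random.default_rng(2)
Theta = rng.normal(size=(40, 6))             # toy coefficient matrix (SNPs x traits)
Theta[rng.random(40) < 0.4] = 0.0            # make some rows entirely zero
row_lab = rng.integers(0, 4, size=40)        # row (SNP) cluster labels
col_lab = np.array([0, 0, 1, 1, 2, 2])       # column (trait) cluster labels

nz = np.any(Theta != 0, axis=1)              # exclude all-zero rows
Theta, row_lab = Theta[nz], row_lab[nz]

# replace each entry by its bi-cluster mean, then reorder by cluster label
smoothed = Theta.copy()
for r in np.unique(row_lab):
    for c in np.unique(col_lab):
        block = np.ix_(row_lab == r, col_lab == c)
        smoothed[block] = Theta[block].mean()
img = smoothed[np.argsort(row_lab)][:, np.argsort(col_lab)]
```

Rendering `img` with a diverging colormap (blue-yellow-red) yields the checkerboard display of Figure 8, with each constant block corresponding to one bi-cluster.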
<p>The two formulations reveal similar bi-clustering patterns up to reordering. Among the column clusters, the plant height tasks (h1 and h2), the stem diameter tasks (d1 and d2), and the leaf tasks (l1 and l2) group together. Moreover, the stem diameter and leaf tasks are more closely related to each other than to the height tasks. The bi-clustering patterns reveal groups of SNPs that influence similar phenotypic traits. Coefficients for height features in the GWAS study (<xref ref-type="fig" rid="F9">Figure 9</xref>) show SNPs with strong effects coinciding with the locations of the Dwarf 3 (Multani et al., <xref ref-type="bibr" rid="B17">2003</xref>) and especially Dwarf 1 (Hilley et al., <xref ref-type="bibr" rid="B6">2016</xref>) genes, which are known to control plant height and are segregating and significant in this population. The lack of any effect at the Dwarf 2 (Hilley et al., <xref ref-type="bibr" rid="B7">2017</xref>) locus supports previous work indicating that this gene is not a strong contributing factor in this population. This demonstrates that we are able to recover known factors. We also identify potentially new SNPs for further investigation and biological validation, since many coefficients align with loci outside of the previously identified height genes.</p>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Distribution of coefficients for height traits for all SNPs. The x-axis shows the positions of genetic variants on the chromosomes. The y-axis shows the values of the coefficients for the discovered associations with the height trait. The red lines mark loci of known height genes, i.e., genes known to be associated with height, and the black and gray dots correspond to coefficients of formulations 1 and 2, respectively. Some correspond to known locations; others correspond to new locations of associated SNPs.</p></caption>
<graphic xlink:href="fdata-02-00027-g0009.tif"/>
</fig>
<p>To evaluate predictive accuracy, we split our data set into three parts: 70% training, 15% validation, and 15% test. We estimate the coefficient matrices by optimizing our formulations on the training set, select the tuning parameters based on the validation set (sections 4.2, 4.4), and then calculate the RMSE on the test set, reported in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
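The split-tune-test loop above can be sketched in a few lines of numpy. This is a schematic on synthetic data: a closed-form ridge fit stands in for optimizing formulations 1 and 2, and the candidate penalty grid is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 911, 50                               # varieties, features (toy)
X, beta = rng.normal(size=(n, p)), rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

idx = rng.permutation(n)                     # 70% / 15% / 15% split
n_tr, n_va = int(0.7 * n), int(0.15 * n)
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

def ridge(A, b, lam):                        # closed-form stand-in for the formulations
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

def rmse(b, rows):
    return np.sqrt(np.mean((X[rows] @ b - y[rows]) ** 2))

lams = [0.01, 0.1, 1.0, 10.0]                # hypothetical tuning grid
best = min(lams, key=lambda l: rmse(ridge(X[tr], y[tr], l), va))
test_rmse = rmse(ridge(X[tr], y[tr], best), te)
```

Running this protocol once per method (Lasso, 2-step, formulations 1 and 2) yields the test RMSEs compared in Table 4.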
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Comparison of test RMSE on the multi-response GWAS dataset.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>Lasso</bold></th>
<th valign="top" align="center"><bold>2-step</bold></th>
<th valign="top" align="center"><bold>Form1</bold></th>
<th valign="top" align="center"><bold>Form2</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">RMSE</td>
<td valign="top" align="center">2.181</td>
<td valign="top" align="center">2.206</td>
<td valign="top" align="center">2.105</td>
<td valign="top" align="center">2.119</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We also compare the RMSE of the proposed formulations with that of a simple Lasso model and a 2-step procedure, as shown in <xref ref-type="table" rid="T4">Table 4</xref>. The RMSE of our formulations is slightly lower than that of the Lasso and the 2-step procedure. Hence, for comparable estimation performance, we are able to discover additional interesting structure in the input-output relationship using our proposed methods.</p>
</sec>
</sec>
<sec id="s7">
<title>7. Concluding Remarks</title>
<p>In this paper we introduced and studied formulations for joint estimation and clustering (row or column or both) of the parameter matrix in multi-response models. By design, our formulations imply that coefficients belonging to the same (bi-)cluster are close to one another. By incorporating different notions of closeness between the coefficients, we can tremendously increase the scope of applications in which similar formulations can be used. Some future applications could include sparse subspace clustering and community detection.</p>
<p>Recently, there has been considerable research on non-convex optimization formulations, from both theoretical and empirical perspectives. It would be of interest to study the performance of our formulations with non-convex loss functions. Another extension would be to construct confidence intervals and perform hypothesis testing for the coefficients in each cluster.</p>
</sec>
<sec id="s8">
<title>Data Availability</title>
<p>The raw data supporting the conclusions of this manuscript will be made available by the authors, without undue reservation, to any qualified researcher.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>MY discussed the idea, wrote the code, ran the experiments, and wrote the second part of the paper. KN discussed the idea and wrote the first part of the paper. AT provided the real data and interpreted the real data experiments. AL proposed the original research idea, and wrote the first part of the paper.</p>
<sec>
<title>Conflict of Interest Statement</title>
<p>KN and AL were employed by IBM. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ack><p>The authors acknowledge the contributions of the Purdue and IBM project teams for field work, data collection, processing, and discussions. They thank Dr. Mitchell Tuinstra and Dr. Clifford Weil for leading and coordinating the planning, experimental design, planting, management, and data collection portions of the project. They thank Dr. Naoki Abe for scientific discussions.</p>
</ack>
<sec sec-type="supplementary-material" id="s10">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fdata.2019.00027/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fdata.2019.00027/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Borchani</surname> <given-names>H.</given-names></name> <name><surname>Varando</surname> <given-names>G.</given-names></name> <name><surname>Bielza</surname> <given-names>C.</given-names></name> <name><surname>Larra&#x000F1;aga</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>A survey on multi-output regression</article-title>. <source>Wiley Interdiscip. Rev. Data Mining Knowl. Discov.</source> <volume>5</volume>, <fpage>216</fpage>&#x02013;<lpage>233</lpage>. <pub-id pub-id-type="doi">10.1002/widm.1157</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Caruana</surname> <given-names>R.</given-names></name></person-group> (<year>1998</year>). <article-title>Multitask learning</article-title>, in <source>Learning to Learn</source> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>Kluwer Academic Publishers</publisher-name>), <fpage>95</fpage>&#x02013;<lpage>133</lpage>.</citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chi</surname> <given-names>E. C.</given-names></name> <name><surname>Allen</surname> <given-names>G. I.</given-names></name> <name><surname>Baraniuk</surname> <given-names>R. G.</given-names></name></person-group> (<year>2014</year>). <article-title>Convex biclustering</article-title>. <source>Biometrics</source> <volume>73</volume>, <fpage>10</fpage>&#x02013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1111/biom.12540</pub-id><pub-id pub-id-type="pmid">27163413</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Combettes</surname> <given-names>P. L.</given-names></name> <name><surname>Pesquet</surname> <given-names>J.-C.</given-names></name></person-group> (<year>2008</year>). <article-title>A proximal decomposition method for solving convex variational inverse problems</article-title>. <source>Inverse Probl.</source> <volume>24</volume>:<fpage>065014</fpage>. <pub-id pub-id-type="doi">10.1088/0266-5611/24/6/065014</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hallac</surname> <given-names>D.</given-names></name> <name><surname>Leskovec</surname> <given-names>J.</given-names></name> <name><surname>Boyd</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Network lasso: clustering and optimization in large graphs</article-title>, in <source>Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Sydney, NSW</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>387</fpage>&#x02013;<lpage>396</lpage>.</citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hilley</surname> <given-names>J.</given-names></name> <name><surname>Truong</surname> <given-names>S.</given-names></name> <name><surname>Olson</surname> <given-names>S.</given-names></name> <name><surname>Morishige</surname> <given-names>D.</given-names></name> <name><surname>Mullet</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Identification of dw1, a regulator of sorghum stem internode length</article-title>. <source>PLoS ONE</source> <volume>11</volume>:<fpage>e0151271</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0151271</pub-id><pub-id pub-id-type="pmid">26963094</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hilley</surname> <given-names>J. L.</given-names></name> <name><surname>Weers</surname> <given-names>B. D.</given-names></name> <name><surname>Truong</surname> <given-names>S. K.</given-names></name> <name><surname>McCormick</surname> <given-names>R. F.</given-names></name> <name><surname>Mattison</surname> <given-names>A. J.</given-names></name> <name><surname>McKinley</surname> <given-names>B. A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Sorghum dw2 encodes a protein kinase regulator of stem internode length</article-title>. <source>Sci. Rep.</source> <volume>7</volume>:<fpage>4616</fpage>. <pub-id pub-id-type="doi">10.1038/s41598-017-04609-5</pub-id><pub-id pub-id-type="pmid">28676627</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hubert</surname> <given-names>L.</given-names></name> <name><surname>Arabie</surname> <given-names>P.</given-names></name></person-group> (<year>1985</year>). <article-title>Comparing partitions</article-title>. <source>J. Classif.</source> <volume>2</volume>, <fpage>193</fpage>&#x02013;<lpage>218</lpage>. <pub-id pub-id-type="doi">10.1007/BF01908075</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jacob</surname> <given-names>L.</given-names></name> <name><surname>Vert</surname> <given-names>J.-P.</given-names></name> <name><surname>Bach</surname> <given-names>F. R.</given-names></name></person-group> (<year>2009</year>). <article-title>Clustered multi-task learning: a convex formulation</article-title>, in <source>Advances in Neural Information Processing Systems</source>, eds <person-group person-group-type="editor"><name><surname>Koller</surname> <given-names>D.</given-names></name> <name><surname>Schuurmans</surname> <given-names>D.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name></person-group> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>745</fpage>&#x02013;<lpage>752</lpage>.</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jalali</surname> <given-names>A.</given-names></name> <name><surname>Sanghavi</surname> <given-names>S.</given-names></name> <name><surname>Ruan</surname> <given-names>C.</given-names></name> <name><surname>Ravikumar</surname> <given-names>P. K.</given-names></name></person-group> (<year>2010</year>). <article-title>A dirty model for multi-task learning</article-title>, in <source>Advances in Neural Information Processing Systems</source>, eds <person-group person-group-type="editor"><name><surname>Lafferty</surname> <given-names>J. D.</given-names></name> <name><surname>Williams</surname> <given-names>C. K. I.</given-names></name> <name><surname>Shawe-Taylor</surname> <given-names>J.</given-names></name> <name><surname>Zemel</surname> <given-names>R. S.</given-names></name> <name><surname>Culotta</surname> <given-names>A.</given-names></name></person-group> (<publisher-loc>Vancouver, BC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>964</fpage>&#x02013;<lpage>972</lpage>.</citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kang</surname> <given-names>Z.</given-names></name> <name><surname>Grauman</surname> <given-names>K.</given-names></name> <name><surname>Sha</surname> <given-names>F.</given-names></name></person-group> (<year>2011</year>). <article-title>Learning with whom to share in multi-task feature learning</article-title>, in <source>Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source> (<publisher-loc>Bellevue, WA</publisher-loc>), <fpage>521</fpage>&#x02013;<lpage>528</lpage>.</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>S.</given-names></name> <name><surname>Xing</surname> <given-names>E. P.</given-names></name></person-group> (<year>2010</year>). <article-title>Tree-guided group lasso for multi-task regression with structured sparsity</article-title>, in <source>Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source> (<publisher-loc>Haifa</publisher-loc>), <fpage>543</fpage>&#x02013;<lpage>550</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="web"><person-group person-group-type="author"><name><surname>Kumar</surname> <given-names>A.</given-names></name> <name><surname>Daume III</surname> <given-names>H.</given-names></name></person-group> (<year>2012</year>). <article-title>Learning task grouping and overlap in multi-task learning</article-title>, in <source>Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML&#x00027;12</source> (Edinburgh: Omnipress), <fpage>1723</fpage>&#x02013;<lpage>1730</lpage>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://dl.acm.org/citation.cfm?id=3042573.3042793">http://dl.acm.org/citation.cfm?id=3042573.3042793</ext-link></citation></ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lu</surname> <given-names>Y.</given-names></name> <name><surname>Foster</surname> <given-names>D. P.</given-names></name></person-group> (<year>2014</year>). <article-title>Fast ridge regression with randomized principal component analysis and gradient descent</article-title>. <source>arXiv [Preprint]. arXiv:1405.3952</source>.</citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McWilliams</surname> <given-names>B.</given-names></name> <name><surname>Heinze</surname> <given-names>C.</given-names></name> <name><surname>Meinshausen</surname> <given-names>N.</given-names></name> <name><surname>Krummenacher</surname> <given-names>G.</given-names></name> <name><surname>Vanchinathan</surname> <given-names>H. P.</given-names></name></person-group> (<year>2014</year>). <article-title>Loco: distributing ridge regression with random projections</article-title>. <source>Stat</source> <volume>1050</volume>:<fpage>26</fpage>. <italic>arXiv [Preprint]. arXiv:1406.3469</italic></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Meinshausen</surname> <given-names>N.</given-names></name> <name><surname>Yu</surname> <given-names>B.</given-names></name></person-group> (<year>2009</year>). <article-title>Lasso-type recovery of sparse representations for high-dimensional data</article-title>. <source>Ann. Stat.</source> <volume>37</volume>, <fpage>246</fpage>&#x02013;<lpage>270</lpage>. <pub-id pub-id-type="doi">10.1214/07-AOS582</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Multani</surname> <given-names>D. S.</given-names></name> <name><surname>Briggs</surname> <given-names>S. P.</given-names></name> <name><surname>Chamberlin</surname> <given-names>M. A.</given-names></name> <name><surname>Blakeslee</surname> <given-names>J. J.</given-names></name> <name><surname>Murphy</surname> <given-names>A. S.</given-names></name> <name><surname>Johal</surname> <given-names>G. S.</given-names></name></person-group> (<year>2003</year>). <article-title>Loss of an MDR transporter in compact stalks of maize br2 and sorghum dw3 mutants</article-title>. <source>Science</source> <volume>302</volume>, <fpage>81</fpage>&#x02013;<lpage>84</lpage>. <pub-id pub-id-type="doi">10.1126/science.1086072</pub-id><pub-id pub-id-type="pmid">14526073</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Obozinski</surname> <given-names>G.</given-names></name> <name><surname>Taskar</surname> <given-names>B.</given-names></name> <name><surname>Jordan</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <source>Multi-Task Feature Selection</source>. <publisher-loc>Berkeley, CA</publisher-loc>: <publisher-name>Statistics Department, UC Berkeley, Tech</publisher-name>. Rep, 2.</citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rosset</surname> <given-names>S.</given-names></name> <name><surname>Zhu</surname> <given-names>J.</given-names></name></person-group> (<year>2007</year>). <article-title>Piecewise linear regularized solution paths</article-title>. <source>Ann. Stat.</source> <volume>35</volume>, <fpage>1012</fpage>&#x02013;<lpage>1030</lpage>. <pub-id pub-id-type="doi">10.1214/009053606000001370</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schifano</surname> <given-names>E. D.</given-names></name> <name><surname>Li</surname> <given-names>L.</given-names></name> <name><surname>Christiani</surname> <given-names>D. C.</given-names></name> <name><surname>Lin</surname> <given-names>X.</given-names></name></person-group> (<year>2013</year>). <article-title>Genome-wide association analysis for multiple continuous secondary phenotypes</article-title>. <source>Am. J. Hum. Genet.</source> <volume>92</volume>, <fpage>744</fpage>&#x02013;<lpage>759</lpage>. <pub-id pub-id-type="doi">10.1016/j.ajhg.2013.04.004</pub-id><pub-id pub-id-type="pmid">23643383</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tuinstra</surname> <given-names>M. R.</given-names></name></person-group> (<year>2016</year>). <article-title>Automated sorghum phenotyping and trait development platform</article-title>, in <source>Proceedings of KDD Workshop on Data Science for Food, Energy, and Water</source> (<publisher-loc>San Francisco, CA</publisher-loc>).</citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Gupta</surname> <given-names>V.</given-names></name> <name><surname>Kolar</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach</article-title>. <source>arXiv [Preprint]. arXiv:1802.06967</source></citation>
</ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under the Award Number DE-AR0000593. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.</p>
</fn>
</fn-group>
</back>
</article>