Simultaneous Parameter Learning and Bi-clustering for Multi-Response Models

We consider multi-response and multi-task regression models in which the parameter matrix to be estimated is expected to have an unknown grouping structure. The grouping can be along tasks, along features, or both, the last case corresponding to a bi-cluster or "checkerboard" structure. Discovering this grouping structure alongside parameter inference is natural in several applications, such as multi-response Genome-Wide Association Studies (GWAS). By inferring this additional structure we can obtain valuable information on the underlying data mechanisms (e.g., relationships among genotypes and phenotypes in GWAS). In this paper, we propose two formulations, based on convex regularization penalties, for simultaneously learning the parameter matrix and its group structure. We present optimization approaches to solve the resulting problems and provide numerical convergence guarantees. Extensive experiments demonstrate substantially better clustering quality than competing methods, and our approaches are further validated on real datasets of phenotypes and genotypes of plant varieties.


SUPPLEMENTARY MATERIAL
This material provides additional details to support the main paper.

COnvex BiclusteRing Algorithm (COBRA) of Chi et al. (2014)
We provide the details of the COnvex BiclusteRing Algorithm (COBRA) introduced in Chi et al. (2014) for completeness.
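As a rough illustration (this is a sketch, not the authors' code), COBRA can be viewed as a Dykstra-like proximal algorithm that alternates a convex-clustering prox over the rows with one over the columns, carrying correction terms $P$ and $Q$ between the two half-steps. In the sketch below the prox is approximated by a fixed number of AMA iterations in the spirit of Chi and Lange (2015); the function names, the weight matrices `W_row` and `W_col`, the step size, and the iteration counts are all illustrative assumptions.

```python
import numpy as np

def prox_convex_cluster(X, lam, weights, step=0.05, n_iters=200):
    """Approximate prox of the convex-clustering penalty on the ROWS of X:
        argmin_U 0.5*||X - U||_F^2 + lam * sum_{i<j} w_ij * ||U_i - U_j||_2,
    computed with a fixed number of AMA iterations (cf. Chi and Lange, 2015).
    A small fixed step size is used for stability."""
    n = X.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if weights[i, j] > 0]
    Lam = np.zeros((len(edges), X.shape[1]))  # one dual vector per fused pair
    U = X.copy()
    for _ in range(n_iters):
        # primal update: each row equals its data row plus signed dual sums
        U = X.copy()
        for l, (i, j) in enumerate(edges):
            U[i] += Lam[l]
            U[j] -= Lam[l]
        # dual update: gradient step, then project onto the ball of radius lam*w_ij
        for l, (i, j) in enumerate(edges):
            Lam[l] -= step * (U[i] - U[j])
            radius = lam * weights[i, j]
            nrm = np.linalg.norm(Lam[l])
            if nrm > radius:
                Lam[l] *= radius / nrm
    return U

def cobra(X, lam, W_row, W_col, n_outer=20):
    """Dykstra-like proximal algorithm: alternate the row-fusion and
    column-fusion proxes, carrying the correction terms P and Q."""
    Y, P, Q = X.copy(), np.zeros_like(X), np.zeros_like(X)
    for _ in range(n_outer):
        U = prox_convex_cluster(Y + P, lam, W_row)         # fuse similar rows
        P = Y + P - U
        Y = prox_convex_cluster((U + Q).T, lam, W_col).T   # fuse similar columns
        Q = U + Q - Y
    return Y
```

In practice, sparse weight matrices (e.g., k-nearest-neighbor weights) keep the edge set, and hence the cost per AMA iteration, small.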

Proof of Proposition 1
PROOF. We need to check that the conditions of Theorem 3.4 in Combettes and Pesquet (2008) are satisfied in our case. Let $\mathcal{H}$ be the domain of $\Theta$, which can be taken to be $\mathbb{R}^{p \times k}$. For a nonempty convex subset $C$ of $\mathcal{H}$, the strong relative interior of $C$ is
$$\mathrm{sri}(C) = \left\{\Theta \in C \;:\; \mathrm{cone}(C - \Theta) = \overline{\mathrm{span}}(C - \Theta)\right\}, \qquad \text{where } \mathrm{cone}(C) = \bigcup_{\lambda > 0} \{\lambda \Theta \mid \Theta \in C\}$$
and $\overline{\mathrm{span}}(C)$ denotes the closure of $\mathrm{span}(C)$.

Now we check the conditions. For (i), $\|\Theta\| \to \infty$ implies that $\|\Theta_s\| \to \infty$ for some $s$, and hence $f_2 \to \infty$; that is, $f_2$ is coercive. Therefore (i) holds. For (ii), we do not impose any restriction on $\Theta$, so the right-hand side is simply $\mathrm{sri}(\mathbb{R}^{p \times k}) = \mathbb{R}^{p \times k}$, hence (ii) holds. The proposition then follows from Theorem 3.4 of Combettes and Pesquet (2008).

Proof of Proposition 2
PROOF. The optimization step for $\Gamma$ is solved by COBRA, and it converges to the global minimizer of that subproblem according to Proposition 4.1 in Chi et al. (2014). Since $\nabla_{\Theta} f(\Theta, \Gamma) = 2\lambda_2(\Theta - \Gamma)$ and $\nabla_{\Gamma} f(\Theta, \Gamma) = 2\lambda_2(\Gamma - \Theta)$, it is clear that $\nabla_{\Theta} f(\Theta, \Gamma)$ and $\nabla_{\Gamma} f(\Theta, \Gamma)$ are Lipschitz continuous in $\Theta$ and in $\Gamma$, respectively. Since the optimization step for $\Theta$ is also assumed to find a global minimizer, Theorem 3.9 in Beck (2015) guarantees that our algorithm in Section 3.2 converges to the global minimizer.
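For intuition, the following is a minimal sketch (not the authors' implementation) of the two-block scheme the proposition analyzes, assuming a plain least-squares loss with the quadratic coupling $f(\Theta, \Gamma) = \lambda_2 \|\Theta - \Gamma\|_F^2$ that matches the gradients above; the function names and the `gamma_step` hook are illustrative assumptions, and the actual objective in Section 3.2 may contain further terms.

```python
import numpy as np

def theta_step(X, Y, Gamma, lam2):
    """Closed-form global minimizer of ||Y - X @ Theta||_F^2 + lam2*||Theta - Gamma||_F^2:
    setting the gradient to zero gives (X^T X + lam2*I) Theta = X^T Y + lam2*Gamma."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam2 * np.eye(p), X.T @ Y + lam2 * Gamma)

def alternating_minimization(X, Y, lam2, gamma_step, n_iters=50):
    """Two-block alternating minimization: the Theta-block is solved in closed
    form; the Gamma-block is solved by COBRA, passed in here as `gamma_step`
    (e.g., the cobra() sketch above applied to the current Theta)."""
    p, k = X.shape[1], Y.shape[1]
    Theta = np.zeros((p, k))
    Gamma = np.zeros((p, k))
    for _ in range(n_iters):
        Theta = theta_step(X, Y, Gamma, lam2)
        Gamma = gamma_step(Theta)  # biclustering prox fuses rows/columns of Theta
    return Theta, Gamma
```

Each $\Theta$-update attains a global minimizer in closed form and each $\Gamma$-update is a convex biclustering subproblem, which is exactly the setting in which Theorem 3.9 of Beck (2015) applies.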

Simulation experiments on uni-clustering
We evaluate the performance of our proposed method on synthetic datasets and compare it with the following existing methods:
• Single task: a baseline approach in which the tasks are learned separately via the Lasso (a minimal sketch follows this list).
• Tree-guided group Lasso (Kim and Xing, 2010): employs a structured penalty function induced by a predefined tree structure over the responses, which encourages multiple correlated responses to share a similar set of covariates. We used the code provided by the authors of Kim and Xing (2010), where the tree structure is obtained by running hierarchical agglomerative clustering on the responses.
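For reference, a minimal sketch of the single-task baseline using scikit-learn's Lasso; the regularization weight `alpha` is an illustrative placeholder that would in practice be tuned, e.g., by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def single_task_baseline(X, Y, alpha=0.1):
    """Fit each of the k responses independently with a Lasso; the resulting
    coefficient matrix Theta_hat (p x k) shares no structure across tasks."""
    p, k = X.shape[1], Y.shape[1]
    Theta_hat = np.zeros((p, k))
    for s in range(k):
        Theta_hat[:, s] = Lasso(alpha=alpha, max_iter=10000).fit(X, Y[:, s]).coef_
    return Theta_hat
```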
Since all of these methods focus on estimation accuracy, we compare only the root mean square error (RMSE). The parameter settings are the same as before. Columns 2-4 of Table 5 show the comparison in terms of RMSE, the standard deviation of the RMSE, and running time. From Table 5 we see that we obtain a slightly improved RMSE (recall that the grouping quality is significantly improved), and our algorithm remains efficient in high dimensions.
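For concreteness, one common convention for the estimation RMSE, assuming it is computed entrywise between the estimated coefficient matrix $\widehat{\Theta}$ and the true matrix $\Theta^{*}$ (the paper's exact convention, e.g., predictive RMSE on held-out data, may differ), is
$$\mathrm{RMSE}(\widehat{\Theta}) = \sqrt{\frac{1}{pk}\sum_{j=1}^{p}\sum_{s=1}^{k}\big(\widehat{\Theta}_{js} - \Theta^{*}_{js}\big)^2}.$$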
To further demonstrate the gain of our proposed formulation, we consider a slightly different setting where we set $\epsilon = 0.05$ so that the parameters within a group are closer to each other. We also increase the number of nonzero components in the true coefficient matrix so that the estimation problem becomes considerably harder (i.e., it yields a larger RMSE). The last column of Table 5 shows the RMSE under this setting; our proposed formulations again improve the estimation accuracy.

Real data setting
The varieties and their names used in the trait prediction experiments (Section 6.1) are given in Table 6.