METHODS article

Front. Syst. Biol.

Sec. Data and Model Integration

Volume 5 - 2025 | doi: 10.3389/fsysb.2025.1589079

This article is part of the Research Topic: Big Data in Systems Biology

Learning Gaussian Graphical Models from Correlated Data

Provisionally accepted
  • 1 School of Medicine, Tufts University, Boston, Massachusetts, United States
  • 2 Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, Boston, United States
  • 3 New York Genome Center, New York, New York, United States
  • 4 Section of Computational Biomedicine, School of Medicine, Boston University, Boston, Massachusetts, United States
  • 5 Department of Biostatistics, School of Public Health, Boston University, Boston, Massachusetts, United States
  • 6 Data Intensive Study Center, Tufts University, Medford, Massachusetts, United States

The final, formatted version of the article will be published soon.

Gaussian Graphical Models (GGMs) are network models that use partial correlation, rather than correlation, to represent complex relationships among multiple variables. Partial correlation shows the relation between two variables after “adjusting” for the effects of the other variables, which leads to more parsimonious and interpretable models. There are well-established procedures for building GGMs from a sample of independent and identically distributed observations. However, many studies include clustered and longitudinal data that result in correlated observations, and ignoring this correlation among observations can lead to inflated Type I error. In this paper, we propose a cluster-based bootstrap algorithm to infer GGMs from correlated data. We use extensive simulations of correlated data from family-based studies to show that the proposed bootstrap method does not inflate the Type I error while retaining statistical power compared to alternative solutions when there is a sufficient number of clusters. We apply our method to learn the GGM that represents complex relations between 47 Polygenic Risk Scores generated using genome-wide genotype data from the Long Life Family Study. By comparing it to conventional methods that ignore within-cluster correlation, we show that our method controls the Type I error well without power loss.
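The two ideas in the abstract, partial correlation and resampling by cluster, can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details are not given on this page. It assumes a generic percentile-interval rule for keeping an edge, and the function names (`partial_correlations`, `cluster_bootstrap_edges`) and parameters (`n_boot`, `alpha`, `seed`) are hypothetical.

```python
import numpy as np


def partial_correlations(X):
    """Partial correlation matrix from the inverse sample covariance.

    X has one row per observation and one column per variable.
    """
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(precision))
    # Standard identity: pcor_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor


def cluster_bootstrap_edges(X, cluster_ids, n_boot=500, alpha=0.05, seed=0):
    """Illustrative cluster bootstrap (an assumption, not the paper's method).

    Whole clusters (e.g. families) are resampled with replacement so that
    within-cluster correlation is preserved in every replicate. An edge is
    kept when the percentile interval of its partial correlation excludes 0.
    """
    rng = np.random.default_rng(seed)
    cluster_ids = np.asarray(cluster_ids)
    clusters = np.unique(cluster_ids)
    rows_by_cluster = [np.flatnonzero(cluster_ids == c) for c in clusters]

    p = X.shape[1]
    boot = np.empty((n_boot, p, p))
    for b in range(n_boot):
        # Draw as many clusters as observed, with replacement.
        drawn = rng.integers(0, len(clusters), size=len(clusters))
        rows = np.concatenate([rows_by_cluster[k] for k in drawn])
        boot[b] = partial_correlations(X[rows])

    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    edges = (lower > 0) | (upper < 0)
    np.fill_diagonal(edges, False)  # no self-loops
    return edges, lower, upper
```

Calling `cluster_bootstrap_edges(X, family_ids)` on an observations-by-variables array returns a boolean adjacency matrix together with the interval bounds. Resampling clusters rather than individual rows is the design choice that matters here: a row-level bootstrap would treat relatives as independent and would tend to produce intervals that are too narrow. The sketch also assumes there are enough observations for the sample covariance to be invertible in every replicate.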

Keywords: Gaussian graphical models, Correlated data, Bootstrap, Polygenic risk score, Partial correlation

Received: 06 Mar 2025; Accepted: 09 Jun 2025.

Copyright: © 2025 Song, Gunn, Monti, Peloso, Liu, Lunetta and Sebastiani. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Zeyuan Song, School of Medicine, Tufts University, Boston, 02111, Massachusetts, United States

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.