Technology and Code ARTICLE
sumrep: a summary statistic framework for immune receptor repertoire comparison and model validation
- 1University of Washington, United States
- 2Fred Hutchinson Cancer Research Center, United States
- 3Department of Biological Sciences, Institute of Structural and Molecular Biology, University of London, United Kingdom
- 4Vaccine Research Center (NIAID), United States
- 5Skolkovo Institute of Science and Technology, Russia
- 6Institute of Bioorganic Chemistry (RAS), Russia
- 7Genentech, Inc., United States
- 8Pirogov Russian National Research Medical University, Russia
The adaptive immune system generates an incredible diversity of antigen receptors for B and T cells to keep dangerous pathogens at bay. The DNA sequences coding for these receptors arise by a complex recombination process followed by a series of productivity-based filters, as well as affinity maturation for B cells, giving considerable diversity to the circulating pool of receptor sequences. Although adaptive immune receptor repertoire sequencing (AIRR-seq) datasets hold considerable promise for medical and public health applications, their complex structure makes analysis difficult. In this paper we introduce sumrep, an R package that efficiently performs a wide variety of repertoire summaries and comparisons, and show how sumrep can be used to perform model validation. We find that summaries vary in their ability to differentiate between datasets, although many are able to distinguish between covariates such as donor, timepoint, and cell type for BCR and TCR repertoires. We show that deletion and insertion lengths resulting from V(D)J recombination tend to be more discriminative characterizations of a repertoire than summaries that describe the amino acid composition of the CDR3 region. We also find that state-of-the-art generative models excel at recapitulating gene usage and recombination statistics in a given experimental repertoire, but struggle to capture many physicochemical properties of real repertoires.
Keywords: Repertoire comparison, model validation, Rep-Seq, B cell receptor, T cell receptor, summary statistics
Received: 19 Jul 2019; Accepted: 11 Oct 2019.
Copyright: © 2019 Olson, Moghimi, Schramm, Obraztsova, Ralph, Vander Heiden, Shugay, Shepherd, Lees and Matsen IV. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Correspondence:
Mx. Branden J. Olson, University of Washington, Seattle, United States, email@example.com
Mx. Frederick A. Matsen IV, Fred Hutchinson Cancer Research Center, Seattle, 19024, Washington, United States, firstname.lastname@example.org