AUTHOR=Loher Phillipe, Karathanasis Nestoras

TITLE=Machine Learning Approaches Identify Genes Containing Spatial Information From Single-Cell Transcriptomics Data

JOURNAL=Frontiers in Genetics

VOLUME=Volume 11 - 2020

YEAR=2021

URL=https://www.frontiersin.org/journals/genetics/articles/10.3389/fgene.2020.612840

DOI=10.3389/fgene.2020.612840

ISSN=1664-8021

ABSTRACT=The development of single-cell sequencing technologies has allowed researchers to gain important new knowledge about the expression profiles of genes in thousands of individual cells of a model organism or tissue. A common disadvantage of this technology is the loss of the 3D arrangement of the cells. Consequently, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) organized the Single Cell Transcriptomics Challenge, in which we participated, with the aim of addressing the following two problems: a) to identify the top 60, 40, and 20 genes of the D. melanogaster embryo that contain the most spatial information, and b) to reconstruct the 3D arrangement of the embryo using information from those genes.
We developed two independent techniques, leveraging Lasso and deep neural network models, that are applied to high-dimensional single-cell sequencing data to accurately identify genes that contain spatial information. Our first technique, Lasso.TopX, combines the Lasso with ranking statistics and lets a user specify the number of features they are interested in. The neural network approach utilizes weak supervision for linear regression to accommodate uncertain or probabilistic training labels. We show, for each technique individually, that we can identify a user-defined number of important, stable genes containing the most spatial information.
Both techniques achieve high performance when reconstructing spatial information in D. melanogaster and also generalize to zebrafish (Danio rerio). Furthermore, we identified novel D. melanogaster genes that carry important positional information and were not previously suspected of doing so. We also show how the indirect use of the full dataset’s information can lead to data leakage and to overestimation of the model’s performance. Lastly, we discuss the applicability of our approaches to other feature selection problems outside the realm of single-cell sequencing and the importance of being able to handle probabilistic training labels. Our source code and detailed documentation are available at https://github.com/TJU-CMC-Org/SingleCell-DREAM/.
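
The idea of asking the Lasso for a user-defined number of features can be illustrated with a minimal sketch. This is not the authors' Lasso.TopX implementation (see the repository above for that); it is a generic illustration that ranks features by the order in which they enter a Lasso regularization path and keeps the first X. The function names (`top_x_features`, `lasso_cd`), the synthetic data, and the choice of a single numeric target `y` (for example, one spatial coordinate) are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_cd(X, y, alpha, w0, n_sweeps=50):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, p = X.shape
    w = w0.copy()
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y - X @ w
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue                      # constant column: keep weight at zero
            resid += X[:, j] * w[j]           # remove feature j's contribution
            rho = X[:, j] @ resid / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
            resid -= X[:, j] * w[j]           # add the updated contribution back
    return w


def top_x_features(X, y, x, n_alphas=40):
    """Return indices of the first x features to enter the Lasso path."""
    std = X.std(axis=0)
    std[std == 0.0] = 1.0                     # avoid division by zero
    Xs = (X - X.mean(axis=0)) / std           # standardize features
    yc = y - y.mean()
    n, p = Xs.shape
    alpha_max = np.abs(Xs.T @ yc).max() / n   # penalty at which all weights are zero
    alphas = alpha_max * np.logspace(0, -3, n_alphas)
    entry = np.full(p, np.inf)                # path step at which each feature enters
    w = np.zeros(p)
    for step, alpha in enumerate(alphas):
        w = lasso_cd(Xs, yc, alpha, w)        # warm start from the previous penalty
        entry[(w != 0.0) & np.isinf(entry)] = step
    # Primary key: entry step; ties broken by larger final |weight|.
    order = np.lexsort((-np.abs(w), entry))
    return order[:x].tolist()


# Synthetic demo (hypothetical data): 200 "cells", 30 "genes",
# of which genes 3, 7 and 12 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 30))
y_demo = (3.0 * X_demo[:, 3] - 2.0 * X_demo[:, 7] + 1.5 * X_demo[:, 12]
          + rng.normal(scale=0.1, size=200))
top3 = top_x_features(X_demo, y_demo, 3)
print(sorted(top3))
```

To avoid the data leakage described in the abstract, a selection step like this would need to be run only on the training portion of each cross-validation fold, never on the full dataset before splitting.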