Factorized discriminant analysis for genetic signatures of neuronal phenotypes

Navigating the complex landscape of single-cell transcriptomic data presents significant challenges. Central to these challenges is the identification of a meaningful representation of high-dimensional gene expression patterns that sheds light on the structural and functional properties of cell types. In pursuit of model interpretability and computational simplicity, we often look for a linear transformation of the original data that aligns with key phenotypic features of cells. In response to this need, we introduce factorized linear discriminant analysis (FLDA), a novel method for linear dimensionality reduction. The crux of FLDA lies in identifying a linear function of gene expression levels that is highly correlated with one phenotypic feature while minimizing the influence of others. We augment this method with a sparsity-based regularization algorithm, which selects a subset of genes pivotal to a specific phenotypic feature or a combination thereof. To illustrate the effectiveness of FLDA, we apply it to transcriptomic datasets from neurons in the Drosophila optic lobe. We demonstrate that FLDA not only captures the inherent structural patterns aligned with phenotypic features but also uncovers key genes associated with each phenotype.


Introduction
The analysis of gene expression data in single cells presents an intriguing and complex problem. Each cell's gene expression data can be viewed as a high-dimensional vector, allowing each cell to be represented as a single point in the vast space of gene expression. Clusters form within this space, each identifiable and associated with a particular cell type, as validated by molecular markers of cell types [1-5].
When the phenotypic traits of each cell type are known, either from past studies or from direct measurement [6-9], we can label each cell type according to its unique characteristics. For example, neuronal cell types can be differentiated by analyzing a variety of features, such as dendritic and axonal laminations, electrophysiological properties, and connectivity [6,7,10]. These features are often categorical in nature.
A critical challenge arises when we attempt to factorize the high-dimensional gene expression data into modules that align with these phenotypes. In simple terms, we aim to find a low-dimensional embedding of gene expression where each axis signifies a single factor. This factor might correspond to a specific phenotypic feature or, potentially, the combination of several.
Ideally, variation along one axis in the embedding space would exclusively affect one phenotypic feature. However, due to inevitable noise in the data, this is challenging to achieve. As a workaround, we allow data projected along one axis to vary primarily with one phenotypic feature and minimally with others. Simultaneously, we want to preserve cell type identities in the low-dimensional space. This means that cells of the same type should remain in close proximity within the embedding space, while cells of different types remain distinct.
In order to address this issue, we propose the method of factorized linear discriminant analysis (FLDA). This is a supervised dimensionality reduction technique rooted in the concepts of multi-way analysis of variance (ANOVA) [11]. FLDA enables the factorization of data into components that correspond to phenotypic features and their combinations. It then seeks a linear transformation that is highly variable with one component yet stable with the others. The power of this approach lies in its simplicity and interpretability. To further strengthen our analysis, we introduce a sparse variant of this method. This variant restricts the number of non-zero elements contributing to each linear projection, thereby identifying a subset of genes crucial to each phenotype. The efficacy of FLDA is demonstrated through its application to a single-cell RNA-Seq dataset of T4/T5 neurons in Drosophila [12], focusing particularly on two phenotypes: dendritic location and axonal lamination.

Factorized Linear Discriminant Analysis (FLDA)
Let's consider a situation where each cell type can be characterized by two phenotypic features, both of which are categorical. This essentially means that the sample space for cell types is the Cartesian product of the sample spaces of the two phenotypic features, I and J:

$$(i, j) \in I \times J$$

In this equation, i and j represent different categories of the two phenotypic features. Suppose we have observed $n_{ij}$ cells for each cell type (i, j). This information can be visualized with a contingency table, as shown in Figures 1A and 1B. Note that here we account for the scenario where the table might be only partially filled.
We denote the expression values of g genes measured in the kth cell of the cell type (i, j) as $x_{ijk} \in \mathbb{R}^g$. Our task is to find linear projections that align with features i and j respectively (see Figure 1C).
To address this, we explored whether we could factorize, for example, the projected coordinate $y_{ijk} = u^T x_{ijk}$ into components dependent on features i and j. By employing the principles of linear factor models from multi-way ANOVA and the concept of variance partitioning, we formulated the following objective, and seek the u that maximizes it (for a detailed analysis, refer to Appendix A):

$$u^* = \arg\max_u \frac{u^T N_A u}{u^T M_e u} \tag{2}$$
With a complete table, where a and b are the numbers of categories for features i and j, we have:

$$N_A = M_A - \lambda_1 M_B - \lambda_2 M_{AB} \tag{3}$$

Here, $M_A$, $M_B$, and $M_{AB}$ denote the covariance matrices explained by feature i, feature j, and their combination respectively. The hyper-parameters $\lambda_1$ and $\lambda_2$ determine the relative weights of $M_B$ and $M_{AB}$ in comparison to $M_A$. The residual covariance matrix, $M_e$, represents variance within cell type clusters and signifies noise in gene expressions. With cluster means $\bar{x}_{ij\cdot} = \frac{1}{n_{ij}} \sum_k x_{ijk}$, $\tilde{x}_{i\cdot\cdot} = \frac{1}{b} \sum_j \bar{x}_{ij\cdot}$, $\tilde{x}_{\cdot j\cdot} = \frac{1}{a} \sum_i \bar{x}_{ij\cdot}$, and $\tilde{x}_{\cdots} = \frac{1}{ab} \sum_{ij} \bar{x}_{ij\cdot}$, the formal definitions of these terms are as follows:

$$M_A = \frac{b}{a-1} \sum_i (\tilde{x}_{i\cdot\cdot} - \tilde{x}_{\cdots})(\tilde{x}_{i\cdot\cdot} - \tilde{x}_{\cdots})^T$$

$$M_B = \frac{a}{b-1} \sum_j (\tilde{x}_{\cdot j\cdot} - \tilde{x}_{\cdots})(\tilde{x}_{\cdot j\cdot} - \tilde{x}_{\cdots})^T$$

$$M_{AB} = \frac{1}{(a-1)(b-1)} \sum_{ij} (\bar{x}_{ij\cdot} - \tilde{x}_{i\cdot\cdot} - \tilde{x}_{\cdot j\cdot} + \tilde{x}_{\cdots})(\bar{x}_{ij\cdot} - \tilde{x}_{i\cdot\cdot} - \tilde{x}_{\cdot j\cdot} + \tilde{x}_{\cdots})^T$$

$$M_e = \frac{1}{N - ab} \sum_{ijk} (x_{ijk} - \bar{x}_{ij\cdot})(x_{ijk} - \bar{x}_{ij\cdot})^T$$

Analogously, the linear projections v for feature j and w for the combination of both features i and j can be determined by similar formulas. By applying the same rationale to a partial table, we can derive u or v as the linear projection for feature i or j (see Appendix B for a detailed mathematical discussion).
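To make these definitions concrete, the covariance matrices can be assembled in a few lines of NumPy. This is a minimal sketch under the assumptions of a complete table (every (i, j) combination observed) and a cells × genes expression matrix; flda_matrices is a hypothetical helper name, not the authors' released code.

```python
import numpy as np

def flda_matrices(X, i_labels, j_labels, lam1=1.0, lam2=1.0):
    """Build N_A = M_A - lam1*M_B - lam2*M_AB and M_e for FLDA."""
    i_labels, j_labels = np.asarray(i_labels), np.asarray(j_labels)
    i_cats, j_cats = np.unique(i_labels), np.unique(j_labels)
    a, b, g = len(i_cats), len(j_cats), X.shape[1]
    N = X.shape[0]

    # Cluster means (x̄_ij.) and within-cluster residual covariance M_e.
    xbar = np.zeros((a, b, g))
    M_e = np.zeros((g, g))
    for ii, ic in enumerate(i_cats):
        for jj, jc in enumerate(j_cats):
            cells = X[(i_labels == ic) & (j_labels == jc)]
            xbar[ii, jj] = cells.mean(axis=0)
            R = cells - xbar[ii, jj]
            M_e += R.T @ R
    M_e /= N - a * b

    x_i = xbar.mean(axis=1)            # x̃_i..
    x_j = xbar.mean(axis=0)            # x̃_.j.
    x_all = xbar.mean(axis=(0, 1))     # x̃_...

    M_A = (b / (a - 1)) * (x_i - x_all).T @ (x_i - x_all)
    M_B = (a / (b - 1)) * (x_j - x_all).T @ (x_j - x_all)
    inter = (xbar - x_i[:, None, :] - x_j[None, :, :] + x_all).reshape(-1, g)
    M_AB = inter.T @ inter / ((a - 1) * (b - 1))

    return M_A - lam1 * M_B - lam2 * M_AB, M_e
```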
Note that $N_A$ is symmetric and $M_e$ is positive definite, transforming the optimization problem into a generalized eigenvalue problem [13]. When $M_e$ is invertible, $u^*$ is the eigenvector associated with the highest eigenvalue of $M_e^{-1} N_A$. Generally, if we aim to embed $x_{ijk}$ into a d-dimensional subspace aligned with feature i (d < a), we take the eigenvectors corresponding to the d largest eigenvalues of $M_e^{-1} N_A$, which we term the top d factorized linear discriminant components (FLDs).
In situations where the number of genes greatly exceeds the number of cells, $M_e$ becomes singular and non-invertible. In such cases, we resort to solutions suggested in [14-16] that use a diagonal estimate of $M_e$: $\mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_g^2)$, where $\sigma_i^2$ is the ith diagonal element of $M_e$. This solution has been employed in multiple computational biology studies [17-19].
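Given N_A and M_e, the top FLDs follow from a standard symmetric generalized eigensolver. The sketch below (flda_components is again a hypothetical helper) includes the diagonal fallback described above, which assumes every gene has non-zero within-cluster variance:

```python
import numpy as np
from scipy.linalg import eigh

def flda_components(N_A, M_e, d, diag_fallback=False):
    """Top-d FLDs: eigenvectors of N_A u = λ M_e u with the largest λ."""
    if diag_fallback:
        # Diagonal estimate diag(σ_1², ..., σ_g²) of a singular M_e.
        M_e = np.diag(np.diag(M_e))
    # eigh handles the symmetric-definite pencil (N_A, M_e) and returns
    # eigenvalues in ascending order.
    vals, vecs = eigh(N_A, M_e)
    order = np.argsort(vals)[::-1][:d]
    return vecs[:, order], vals[order]
```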
As multi-way ANOVA can handle contingency tables with more than two dimensions, our analysis can easily be extended to more than two phenotypic features [20]. In summary, FLDA is well suited to data whose labels form a Cartesian product of multiple features.

Sparse Regularization of FLDA
In computational biology applications, we are often interested in identifying a small subset of genes that effectively characterizes a specific phenotypic feature. This corresponds to identifying axes with only a few non-zero elements. To find such a sparse solution, we address the following optimization problem:

$$u^* = \arg\max_u \frac{u^T N_A u}{u^T M_e u} \quad \text{subject to} \quad \|u\|_0 \le l$$

where the number of non-zero elements of $u^*$ is constrained to be less than or equal to l.
This problem, also known as a sparse generalized eigenvalue problem, presents three challenges [21]:
• For extremely high-dimensional data, $M_e$ can be singular and non-invertible;
• The normalization term $u^T M_e u$ restricts the application of many sparse eigenvalue solutions;
• Maximizing a convex objective over a nonconvex set is known to be NP-hard.
To overcome these challenges, we employ the truncated Rayleigh flow (Rifle) method, which was designed specifically for solving sparse generalized eigenvalue problems. The Rifle algorithm is a two-step process [21]. First, it acquires an initial vector $u_0$ that is close to $u^*$; for this, we use the non-sparse FLDA solution as the initial estimate. Second, it iteratively performs a gradient ascent step on the objective function, followed by a truncation step that retains the l largest-magnitude entries of u and sets the remaining entries to zero. The step-by-step process of applying the Rifle method to our problem is outlined in the following pseudo-code:

    procedure RIFLE(N_A, M_e, u_0, l, η)              ▷ η is the step size
        t ← 1                                         ▷ t indicates the iteration number
        while not converged do                        ▷ converge when u_t ≃ u_{t−1}
            ρ ← (u_{t−1}^T N_A u_{t−1}) / (u_{t−1}^T M_e u_{t−1})
            u_t ← normalize((I + (η/ρ)(N_A − ρ M_e)) u_{t−1})   ▷ gradient ascent
            u_t ← normalize(truncate(u_t, l))         ▷ keep the l largest-magnitude entries
            t ← t + 1
        return u_{t−1}

As previously demonstrated in [21], the Rifle method converges to the unique sparse leading generalized eigenvector, assuming it exists, at the optimal statistical rate of convergence. The computational complexity of the second step in each iteration is O(lg + g), indicating that the Rifle algorithm scales linearly with g, the number of genes in the input data.
In terms of hyperparameter selection, the step size η should be small enough to ensure convergence, specifically $\eta \lambda_{\max}(M_e) < 1$, where $\lambda_{\max}(M_e)$ is the largest eigenvalue of $M_e$. This is akin to taking small steps to ensure that we do not overshoot the optimal solution. The other hyperparameter, l, determines the number of genes to be preserved and is chosen empirically based on the design of the subsequent experiment; it can be adjusted to the specific requirements of a biological study.
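For concreteness, the iteration can be written in a few lines of NumPy. This is a minimal sketch, not a reference implementation: it assumes symmetric N_A and M_e, a positive leading generalized eigenvalue, and a default step size chosen so that ηλ_max(M_e) < 1.

```python
import numpy as np

def rifle(N_A, M_e, u0, l, eta=None, max_iter=1000, tol=1e-8):
    """Truncated Rayleigh flow (Rifle) sketch for max u'N_A u / u'M_e u
    subject to at most l non-zero entries in u."""
    if eta is None:
        eta = 0.9 / np.linalg.eigvalsh(M_e).max()   # ensures eta * λ_max < 1
    u = u0 / np.linalg.norm(u0)
    for _ in range(max_iter):
        rho = (u @ N_A @ u) / (u @ M_e @ u)         # Rayleigh quotient
        # Gradient ascent: u ← (I + (η/ρ)(N_A − ρ M_e)) u, renormalized.
        v = u + (eta / rho) * (N_A @ u - rho * (M_e @ u))
        v /= np.linalg.norm(v)
        # Truncation: keep the l largest-magnitude entries, zero the rest.
        small = np.argsort(np.abs(v))[:-l]
        v[small] = 0.0
        v /= np.linalg.norm(v)
        if np.linalg.norm(v - u) < tol:             # converged: u_t ≃ u_{t-1}
            return v
        u = v
    return u
```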

Related Work: Dimensionality Reduction
FLDA is one method for linear dimensionality reduction [22]. In formal terms, linear dimensionality reduction can be defined as follows: given n data points, each of g dimensions, $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{g \times n}$, and a chosen reduced dimensionality r < g, an objective function f(·) is optimized to produce a linear projection $U \in \mathbb{R}^{r \times g}$. The result is a low-dimensional transformed dataset $Y = UX \in \mathbb{R}^{r \times n}$.

Unsupervised Methods for Linear Dimensionality Reduction
Unsupervised linear dimensionality reduction methods, including PCA [25], ICA [26], and FA [27], project data into a low-dimensional space without the use of supervision labels. These methods are crucial in the initial stages of single-cell data analysis to reduce dimensionality and noise, and have been used in numerous studies to identify subpopulations of cells and understand the variance structure of the data [23,19]. The shortcoming of these unsupervised methods is that the axes of the low-dimensional space often fail to represent the underlying structure of the data, rendering them uninterpretable. This issue is particularly pronounced with gene expression data due to its high dimensionality (usually encompassing tens of thousands of genes) and the noisy expression of many genes. These noisy expressions result in significant variance among individual cells, albeit without a structured pattern. In the absence of supervisory signals from phenotypic features, unsupervised methods tend to select these genes to construct the low-dimensional space, which provides neither the desired alignment nor an effective separation of cell type clusters. To illustrate this, we compared the performance of PCA on the gene expression data with that of FLDA. In brief, we solved the following objective to find the linear projection:

$$u^* = \arg\max_u \frac{u^T M u}{u^T u}$$

where M is the total covariance matrix of the data. The results of this comparison are detailed in the Results section.
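For reference, this PCA baseline amounts to taking the top eigenvectors of the total covariance matrix; a minimal sketch:

```python
import numpy as np

def pca_components(X, d):
    """Top-d principal axes of a cells x genes matrix X, i.e. the u
    maximizing u'Mu / u'u for the total covariance matrix M."""
    Xc = X - X.mean(axis=0)
    M = Xc.T @ Xc / (X.shape[0] - 1)
    vals, vecs = np.linalg.eigh(M)      # ascending eigenvalues
    return vecs[:, ::-1][:, :d]
```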

Supervised Methods for Linear Dimensionality Reduction
Supervised linear dimensionality reduction techniques, such as LDA [28,29] and CCA [30,31], can overcome the aforementioned issues. By incorporating supervised signals of phenotypic features, genes whose expressions do not inform on the phenotypes can be de-emphasized.

Linear Discriminant Analysis (LDA)
LDA models the differences among data organized in pre-determined classes. Formally, the optimization problem solved by LDA is as follows:

$$u^* = \arg\max_u \frac{u^T \Sigma_b u}{u^T \Sigma_e u}$$

where $\Sigma_b$ and $\Sigma_e$ are estimates of the between-class and within-class covariance matrices, respectively.
Unlike FLDA, LDA does not explicitly formulate the representation of these classes as a contingency table composed of multiple features. As a result, when applied to an example problem where cell types are organized into a two-dimensional contingency table with phenotypic features i and j, the axes from LDA are generally not aligned with these two phenotypic features.
However, it is possible to perform two separate LDAs for the two features. This modification allows the axes from each LDA to align with its specific feature. We refer to this approach as "2LDAs". There are two main limitations of this approach: first, it discards information about the component that depends on the combination of the two features; second, it explicitly maximizes the segregation of cells with different feature levels, which is not always consistent with a good separation of cell type clusters. Detailed comparisons between LDA, "2LDAs", and FLDA are provided in the Results section.
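Both baselines are straightforward to set up with scikit-learn. In the sketch below, X, i_labels, and j_labels are randomly generated placeholders standing in for the real expression matrix and feature labels:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))             # placeholder expression matrix
i_labels = rng.integers(0, 2, size=1000)    # placeholder feature i labels
j_labels = rng.integers(0, 2, size=1000)    # placeholder feature j labels

# LDA on the joint cell-type label: separates clusters, but its axes are
# generally not aligned with either individual feature.
joint = [f"{i}-{j}" for i, j in zip(i_labels, j_labels)]
Y_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, joint)

# "2LDAs": one LDA per feature. Each axis aligns with its own feature,
# but the component depending on the feature combination is discarded.
Y_i = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, i_labels)
Y_j = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, j_labels)
```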

Canonical Correlation Analysis (CCA)
CCA projects two datasets $X_a \in \mathbb{R}^{g \times n}$ and $X_b \in \mathbb{R}^{d \times n}$ to $Y_a \in \mathbb{R}^{r \times n}$ and $Y_b \in \mathbb{R}^{r \times n}$, such that the correlation between $Y_a$ and $Y_b$ is maximized. Formally, for each pair of projections $u_a$ and $u_b$, it tries to maximize this objective:

$$\max_{u_a, u_b} \frac{u_a^T \Sigma_{ab} u_b}{\sqrt{u_a^T \Sigma_{aa} u_a}\sqrt{u_b^T \Sigma_{bb} u_b}}$$

where $\Sigma_{ab}$ is the cross-covariance matrix between $X_a$ and $X_b$, and $\Sigma_{aa}$ and $\Sigma_{bb}$ are the covariance matrices of $X_a$ and $X_b$. To apply CCA to our problem, we designate $X_a$ as the gene expression matrix, and $X_b$ as the matrix of d phenotypic features (d = 2 for the two features demonstrated later). Unlike FLDA, CCA identifies a transformation of gene expressions that is aligned with a linear combination of phenotypic features, instead of a factorization of gene expressions corresponding to individual phenotypic features. The differences in these approaches are quantified and discussed in the Results section.
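A corresponding CCA baseline can be set up as below, again on placeholder data, with the two phenotypic features entering as a two-column design matrix:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X_a = rng.normal(size=(1000, 50))                 # cells x genes
i_labels = rng.integers(0, 2, size=1000)
j_labels = rng.integers(0, 2, size=1000)
X_b = np.stack([i_labels, j_labels], axis=1).astype(float)  # cells x d, d = 2

# CCA aligns projections of the expression data with a *linear
# combination* of the phenotypic features, not a per-feature factorization.
cca = CCA(n_components=2).fit(X_a, X_b)
Y_a, Y_b = cca.transform(X_a, X_b)
```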

Non-linear Dimensionality Reduction Methods
Apart from linear dimensionality reduction, non-linear methods have emerged as popular choices for analyzing single-cell transcriptomic datasets due to their ability to capture complex, non-linear relationships inherent in the data [23]. Notable among these methods are t-Distributed Stochastic Neighbor Embedding (t-SNE) [32] and Uniform Manifold Approximation and Projection (UMAP) [33]. Unlike linear methods, these algorithms can unravel intricate structures in the data by modeling non-linear manifold structures.
t-SNE minimizes the divergence between two distributions over pairs of data points, one in the high-dimensional space and one in the low-dimensional space, to create a map that reflects the structure of the data. UMAP assumes that the data are uniformly distributed on a locally connected Riemannian manifold and seeks a similar uniform distribution in lower dimensions.
The comparison of FLDA with t-SNE and UMAP hinges on the trade-off between linear and non-linear dimensionality reduction. While non-linear methods excel at capturing complex data structures and modeling dropout effects [34], and often produce visually appealing embeddings, they exhibit certain limitations compared to linear methods:
• Interpretability: Linear methods offer a clear and direct relationship between the original features and the reduced dimensions, which facilitates interpretability. In contrast, the embeddings produced by non-linear methods are often challenging to interpret due to the complex, non-linear transformation functions involved.
• Computational Efficiency: Linear methods are generally more computationally efficient compared to non-linear methods, which can become computationally intensive, especially as the size of the dataset increases.
In single-cell transcriptomics applications, the choice between linear and non-linear dimensionality reduction hinges on balancing the capture of complex data structures with the maintenance of interpretability and computational efficiency. In the context of this paper, our proposed FLDA method is designed to address the challenges associated with single-cell data by offering a structured and interpretable low-dimensional space aligned with neuronal phenotypes. Therefore, we constrained our comparisons to other linear dimensionality reduction methods that share the objective of interpretability.

Experimental Design

Datasets
To quantitatively evaluate FLDA against other linear dimensionality reduction methods such as PCA, CCA, LDA, and the "2LDAs" approach, we initially opted for synthetic datasets. The primary rationale behind this choice lay in the controlled environment synthetic data afford, enabling a precise and standardized comparison of the methods under varying conditions. These datasets consisted of four types of cells, each containing 250 examples, generated from a 2x2 Cartesian product of two features i and j (Figure 2A). We generated expressions of 1000 genes for each cell, with gene levels being either purely noise-driven or correlated with feature i, feature j, or the combination of both. Detailed information about the data generation can be found in Appendix C.
To bridge the gap between the controlled synthetic environment and real-world biological scenarios, we employed a dataset of Drosophila T4/T5 neurons [12] to demonstrate the applicability and advantages of FLDA in analyzing single-cell transcriptomic datasets. T4 and T5 neurons, while similar in general morphology and physiological properties, differ in the location of their dendrites in the medulla and lobula, which are two separate brain regions. Both T4 and T5 neurons comprise four subtypes, each pair demonstrating axonal lamination in a specific layer within the lobula plate (Figure 3A). Thus, we identified these neurons using two phenotypic features: feature i indicating the dendritic location in either the medulla or the lobula, and feature j signifying axonal lamination at one of the four layers (a/b/c/d) (Figure 3B). In this study, we concentrated on a dataset containing expression data for 17492 genes from 3833 cells, all collected at a predefined time during brain development.

Data Preprocessing
The preprocessing of the T4/T5 neuron dataset adhered to previously documented procedures [3,5,35,12]. Briefly, the transcript counts within each column of the count matrix (genes × cells) were normalized to equal the median number of transcripts per cell, yielding normalized counts, or transcripts-per-million ($TPM_{gc}$), for gene g in cell c. We used the log-transformed expression data, denoted by $E_{gc} = \ln(TPM_{gc} + 1)$, for subsequent analysis. We selected highly variable genes for FLDA based on a common approach in single-cell RNA-Seq studies, which establishes a relationship between the mean and the coefficient of variation [36,37,12]. For this particular experiment, we set the hyper-parameters λ in Equation (3) to 1.
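A minimal sketch of this normalization and log transform (omitting the mean versus coefficient-of-variation gene selection step, which follows [36,37,12]) might look like:

```python
import numpy as np

def normalize_log(counts):
    """Median-normalize a genes x cells count matrix and log-transform:
    E_gc = ln(TPM_gc + 1), where TPM_gc equalizes sequencing depth to the
    median number of transcripts per cell."""
    per_cell = counts.sum(axis=0)                  # transcripts per cell
    tpm = counts * (np.median(per_cell) / per_cell)
    return np.log(tpm + 1.0)
```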

Evaluation Metrics
The dimensionality reduction process should satisfy two primary goals: (1) identify axes that efficiently segregate distinct cell types, and (2) discover axes that are well aligned with the respective labels. Consequently, to evaluate the effectiveness of FLDA and various alternative methodologies, we implemented the following metrics (detailed information on implementing these metrics can be found in Appendix D):
• Signal-to-Noise Ratio (SNR): This metric measures the efficacy of each discriminant axis in distinguishing distinct cell types. A higher SNR suggests better separation of different cell types. This metric is relevant to the first goal.
• Explained Variance (EV): This metric gauges the proportion of variance of the feature i or j that a discriminant axis explains. A higher EV indicates that the dimensionality reduction method effectively encapsulates the feature information. This metric is relevant to the second goal.
• Mutual Information (MI): This metric calculates the association between each discriminant axis and each feature, providing insight into how much information an axis provides about a specific feature. A higher MI score suggests a better ability of the dimensionality reduction method to capture essential characteristics. This metric is relevant to the second goal.
• Modularity Score: This metric assesses whether each axis is predominantly dependent on a single feature [38]. A higher modularity score indicates successful disentanglement of features, which is crucial for interpreting biological data. This metric is relevant to the second goal.
• Silhouette Score: This metric computes the average Silhouette value of all samples, which is a measure of how similar a cell is to its own cluster compared to other clusters. A higher Silhouette score indicates better cluster separation and tighter clustering. This metric is relevant to the first goal.
In addition, we evaluated the execution times of FLDA and alternative methodologies.

Comparative Analysis of FLDA with Other Linear Dimensionality Reduction Methods
To provide a quantitative comparison between FLDA and other dimensionality reduction methods such as PCA, CCA, LDA, and "2LDAs", we measured the proposed metrics on the synthesized datasets shown in Figure 2A. Given that the synthesized data was organized in a 2x2 table, each LDA of the "2LDAs" approach could only identify one dimension for the specific feature i or j. Therefore, as a fair comparison, we only included the corresponding dimensions of FLDA (FLD_i and FLD_j) and the top two components of PCA, CCA, and LDA. The overall SNR values, normalized by that of LDA, and the modularity scores across different noise levels are depicted in Figure 2B,C. The performance of PCA is the worst due to its unsupervised approach, which cannot effectively mitigate the impact of noise on the signal. While supervised approaches generally demonstrate superior SNR, LDA and CCA suffer from low modularity scores. This outcome aligns with our expectation, as LDA maximizes cell type cluster separation without necessarily aligning axes to the individual features i or j, and CCA maximizes the correlation to a linear combination of phenotypic features rather than to individual ones. Conversely, "2LDAs" achieves the highest modularity scores but exhibits the lowest SNR among the supervised approaches, as it aims to maximize the separation of cells with different feature levels, which does not necessarily coincide with maximizing cell type segregation. Both the SNR and the modularity score of FLDA approach optimal values because it considers both the alignment of axes to different features and the constraint of variance within cell types. Consistent with the SNR metric, the average Silhouette score for FLDA is close to those of LDA and CCA, outperforms "2LDAs", and significantly surpasses PCA, as detailed in Table 1. Consistent with the modularity score, a robust axis alignment to either feature i or j is observed for FLDA and "2LDAs", but not for the other methods, as shown in a representative plot of the EV and MI metrics across these models in Figure 5.
We further analyzed the execution times of FLDA and the other models and summarized the findings in Table 2. The execution time of FLDA is on par with that of LDA, albeit longer than PCA's, attributable to the handling of the covariance matrix in the denominator. In contrast, the execution times for "2LDAs" and CCA are considerably longer, nearly doubling those of FLDA and LDA. This increase is because "2LDAs" requires two LDA operations, while CCA necessitates the computation of covariance matrices for both the input and the phenotypic features, thereby doubling the execution time.

Real-world Application in Computational Biology
A significant question in biology is whether diverse cell type phenotypes are generated by modular transcriptional programs, and if so, what the gene signature of each program is. To demonstrate the potential of our approach in addressing this question, we applied FLDA to the Drosophila T4/T5 neuron dataset.
Given that the data is organized in a 2x4 contingency table, we chose to project the expression data into a seven-dimensional subspace. This subspace was structured such that one FLD was aligned with dendritic location i (FLD_i), three FLDs were aligned with axonal termination j (FLD_j1-3), and the remaining three were tailored to represent the combination of both phenotypes (FLD_ij1-3). Ranking these axes based on their SNR metrics revealed that FLD_j1, FLD_i, and FLD_j2 had considerably higher SNRs than the others (Figure 3C). Indeed, data representations in the subspace comprising these three dimensions clearly separated the eight neuronal cell types (Figure 3D). As expected, FLD_i differentiated T4 from T5 neurons, which have dendrites located in different brain regions (Figure 3E). Interestingly, FLD_j1 separated T4/T5 neurons into two groups, a/b vs. c/d, according to the upper or lower lobula plate, while FLD_j2 divided them into another two groups, a/d vs. b/c, indicating whether their axons laminated at the middle or lateral part of the lobula plate (Figure 3E,F). Among these three dimensions, FLD_j1 has a much higher SNR than FLD_i and FLD_j2, suggesting a hierarchical structure in the genetic organization of T4/T5 neurons: they are first separated into either a/b or c/d types, and subsequently divided into each of the eight subtypes. In fact, this matches the sequence of their cell fate determination, as revealed in a previous genetic study [39]. Lastly, the final discriminant axis of the axonal feature, FLD_j3, separates the group a/c from b/d, suggesting its role in fine-tuning the axonal depth within the upper or lower lobula plate (Figure 3G).
To identify gene signatures for the discriminant components in FLDA, we applied sparsity-based regularization to constrain the number of genes with non-zero weight coefficients. We set the number to 20, a reasonable number of candidate genes that could be tested in a follow-up biological study.
We extracted a list of 20 genes each for the axes FLD_i and FLD_j1. The relative importance of these genes to each axis is directly informed by their weight values (Figure 4A,C). Alongside, we plotted the expression profiles of these genes in the eight neuronal cell types (Figure 4B,D). For both axes, the genes critical in separating cells with different feature levels are differentially expressed in the corresponding cell types. Finally, FLDA allowed us to examine the component that depends on the combination of both features and identify its gene signature, providing insights into the transcriptional regulation of gene expression in the T4/T5 neuronal cell types (Figures 6 and 7).

Perturbation Analysis
As FLDA, like other supervised methods, relies on accurate phenotype labels to extract meaningful information, we sought to investigate how it might behave in real-world scenarios where inaccuracies are bound to occur. If the phenotypes are annotated incorrectly, can we use FLDA to raise a flag? To address this, we propose a perturbation analysis of FLDA, based on the assumption that among possible phenotype annotations, the projection of gene expression data with correct labels leads to better metric measurements than with incorrect ones. As detailed in Appendix E, we deliberately generated three kinds of incorrect labels for the T4/T5 neuron dataset, simulating common errors that could occur during labeling: the phenotypes of a cell type were mislabeled with those of another type; a single phenotypic category was incorrectly split into two; or two phenotypic categories were incorrectly merged into one. We applied FLDA to the gene expressions of T4/T5 neurons using these perturbed annotations, and found that the proposed metrics, such as the SNR and the modularity score, were best when the labels were correct (Figure 8), suggesting that this type of perturbation analysis can be used to flag potential errors in labeling.
In summary, our findings demonstrate that FLDA is a powerful tool for identifying and interpreting gene expressions that correspond to particular phenotypic features, even in the face of potential data mislabeling. This makes it a valuable tool for understanding complex biological systems. The perturbation analysis provides a robust method for validating the accuracy of phenotype annotations, thereby increasing the reliability of subsequent analyses and conclusions.

Discussion
We have introduced FLDA, a novel dimensionality reduction method that linearly projects high-dimensional data, such as gene expressions, into a low-dimensional space. The axes of this space are aligned with predefined features like phenotypes, making the representation intuitive. Furthermore, we incorporated sparse regularization into FLDA, allowing us to select a small set of critical genes that are most informative about the phenotypes. Our application of FLDA in a computational biology context, particularly the analysis of gene expression data from Drosophila T4/T5 neurons with two phenotypic labels, not only illuminated data structures aligned with the phenotypic labels but also unveiled previously unreported genes associated with each phenotype. A comparison of our gene lists with those from the previous study [12] revealed consistent genes, including indicator genes for dendritic location such as TfAP-2, dpr2, dpr3, twz, CG34155, and CG12065, and indicator genes for axonal lamination such as klg, bi, pros, mav, beat-IIIb, and Fas2. Remarkably, we also identified genes not reported in the previous study. For example, our results suggest that the gene pHCl-1 is important to the dendritic phenotype, and that the gene Lac is critical to axonal lamination. These genes are promising targets for subsequent experimentation.
FLDA's potential extends beyond the dataset explored in this study. In a separate work, we applied FLDA to another real-world single-cell transcriptomic dataset, showcasing its ability to discern a low-dimensional representation of neuronal types aligned with phenotypic and species attributes, thereby revealing evolutionary counterparts of primate retinal ganglion cells [20]. This further substantiates FLDA's applicability across diverse datasets and its promise in unveiling biologically meaningful insights.
The method could also play a role in the discovery of cell types. For example, the known phenotypes in a population might only form a partial table with missing entries (Figure 1B). Just as the empty cells in Mendeleev's periodic table led to the prediction of new elements, these gaps could predict new cell types [40]. FLDA can help pinpoint the region of the gene expression space that corresponds to a predicted new type, potentially revealing rare cell populations that might otherwise be overlooked.
Beyond computational biology, FLDA can be applied to any labeled dataset whose labels form a Cartesian product of multiple attributes. This ability to separate attribute-specific factors makes FLDA valuable for creating disentangled representations [38,41], and its performance can be optimized for such diverse applications.
While our work offers significant advancements, it is not without limitations. The inherent linearity of FLDA, though providing an explicit and easily interpretable model, presupposes a linear relationship between the input features and the embedding, which may not always hold true. Future work could involve a non-linear version of FLDA. For example, the input features could be projected into an embedding space using a neural network, with the axes aligned with each label attribute.

Appendix

A. Objective functions
Here we derive the objective functions used in our analysis. Again, if $x_{ijk}$ ($k \in \{1, 2, \ldots, n_{ij}\}$) represents the expression values of g genes in each cell ($x_{ijk} \in \mathbb{R}^g$), we seek to find a linear projection $y_{ijk} = u^T x_{ijk}$ that is aligned with the feature i.

Inspiration from ANOVA
We asked what is the best way to factorize $y_{ijk}$. Inspired by multi-way ANOVA [11], we identified three components: one depending on the feature i, another depending on the feature j, and the last depending on the combination of both features. We therefore followed the procedures of ANOVA to partition sums of squares and factorize $y_{ijk}$ into these three components.
Let us first assume that all cell types defined by i and j contain the same number of cells. With cell types represented by a complete contingency table (Figure 1A), $y_{ijk}$ can be linearly factorized using the model of two crossed factors. Formally, the linear factorization is the following:

$$y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \epsilon_{ijk}$$

where $y_{ijk}$ represents the coordinate of the kth cell in the category defined by i and j; µ is the average level of y; $\alpha_i$ is the component that depends on the feature i, and $\beta_j$ is the component that depends on the feature j; $(\alpha\beta)_{ij}$ describes the component that depends on the combination of both features i and j; $\epsilon_{ijk} \sim N(0, \sigma^2)$ is the residual of this factorization.
Let us say that the features i and j fall into a and b discrete categories respectively. Then without loss of generality, we can require:

$$\sum_i \alpha_i = 0, \quad \sum_j \beta_j = 0, \quad \sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$$

Corresponding to these, there are three null hypotheses:

$$H_{01}: \alpha_i = 0 \text{ for all } i, \quad H_{02}: \beta_j = 0 \text{ for all } j, \quad H_{03}: (\alpha\beta)_{ij} = 0 \text{ for all } i, j$$

Here we want to reject $H_{01}$ while accepting $H_{02}$ and $H_{03}$, so that $y_{ijk}$ is aligned with the feature i.
Next, we partition the total sum of squares. If the number of cells within each cell type category is n, and the total number of cells is N, then we have

$$\sum_{ijk} (y_{ijk} - \bar{y}_{\cdots})^2 = \sum_{ijk} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdots})^2 + \sum_{ijk} (\bar{y}_{\cdot j\cdot} - \bar{y}_{\cdots})^2 + \sum_{ijk} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot j\cdot} + \bar{y}_{\cdots})^2 + \sum_{ijk} (y_{ijk} - \bar{y}_{ij\cdot})^2 \tag{23}$$

where $\bar{y}$ is the average of $y_{ijk}$ over the indices indicated by the dots. Equation (23) can be written as

$$SS_{total} = SS_A + SS_B + SS_{AB} + SS_e$$

with each term having degrees of freedom N − 1, a − 1, b − 1, (a − 1)(b − 1), and N − ab respectively. Here $SS_A$, $SS_B$, $SS_{AB}$, and $SS_e$ are the partitioned sums of squares for the factors $\alpha_i$, $\beta_j$, $(\alpha\beta)_{ij}$, and the residual.
ANOVA rejects or accepts a null hypothesis by comparing its mean square (the partitioned sum of squares normalized by the degrees of freedom) to that of the residual. This is done by constructing F-statistics for each factor as shown below:

$$F_A = \frac{SS_A/(a-1)}{SS_e/(N-ab)}, \quad F_B = \frac{SS_B/(b-1)}{SS_e/(N-ab)}, \quad F_{AB} = \frac{SS_{AB}/((a-1)(b-1))}{SS_e/(N-ab)}$$

Under the null hypotheses, the F-statistics follow the F-distribution. Therefore, a null hypothesis is rejected when we observe a value of an F-statistic above a certain threshold calculated from the F-distribution. Here we want $F_A$ to be large enough that we can reject $H_{01}$, but $F_B$ and $F_{AB}$ to be small enough for us to accept $H_{02}$ and $H_{03}$. In other words, we want to maximize $F_A$ while minimizing $F_B$ and $F_{AB}$. Therefore, we propose maximizing an objective L:

$$L = F_A - \lambda_1 F_B - \lambda_2 F_{AB}$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters determining the relative weights of $F_B$ and $F_{AB}$ compared with $F_A$.

Objective functions under a complete contingency table
When the numbers of cells within categories defined by i and j ($n_{ij}$) are not all the same, the total sum of squares cannot be partitioned as in Equation (23). However, if we only care about distinctions between cell types rather than individual cells, we can use the mean value of each cell type cluster ($\bar{y}_{ij\cdot}$) to estimate the overall average value ($\tilde{y}_{\cdots}$) and the average value of each category i ($\tilde{y}_{i\cdot\cdot}$) or j ($\tilde{y}_{\cdot j\cdot}$). Therefore, Equation (23) can be modified as the following:

$$\sum_{ij} (\bar{y}_{ij\cdot} - \tilde{y}_{\cdots})^2 = \sum_{ij} (\tilde{y}_{i\cdot\cdot} - \tilde{y}_{\cdots})^2 + \sum_{ij} (\tilde{y}_{\cdot j\cdot} - \tilde{y}_{\cdots})^2 + \sum_{ij} (\bar{y}_{ij\cdot} - \tilde{y}_{i\cdot\cdot} - \tilde{y}_{\cdot j\cdot} + \tilde{y}_{\cdots})^2 \tag{29}$$

If we describe Equation (29) as

$$SS'_{total} = SS'_A + SS'_B + SS'_{AB}$$

then, following the same arguments, we want to maximize an objective function in the following format:

$$L = \frac{SS'_A/(a-1)}{SS_e/(N-ab)} - \lambda_1 \frac{SS'_B/(b-1)}{SS_e/(N-ab)} - \lambda_2 \frac{SS'_{AB}/((a-1)(b-1))}{SS_e/(N-ab)}$$

Objective functions under a partial contingency table
When we have a representation of a partial table, we can no longer separate out the component that depends on the combination of both features. Therefore, we use another model, a linear model of two nested factors, to factorize $y_{ijk}$, which has the following format:

$$y_{ijk} = \mu + \alpha_i + \beta_{j(i)} + \epsilon_{ijk}$$

Note that we now have $\beta_{j(i)}$ instead of $\beta_j + (\alpha\beta)_{ij}$. In this model, we identify a primary factor, for instance the feature denoted by i, which falls into a categories; the other (indexed by j) becomes a secondary factor, the number of whose levels $b_i$ depends on the level of the primary factor. We merge the component depending on the combination of both features into that of the secondary factor as $\beta_{j(i)}$.
Similarly, we have

$$\sum_{ijk} (y_{ijk} - \bar{y}_{\cdots})^2 = \sum_{ijk} (\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdots})^2 + \sum_{ijk} (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot})^2 + \sum_{ijk} (y_{ijk} - \bar{y}_{ij\cdot})^2$$

which can be written as

$$SS_{total} = SS_A + SS_{B(A)} + SS_e$$

with degrees of freedom N − 1, a − 1, M − a, and N − M for each of the terms, where M is:

$$M = \sum_i b_i \tag{39}$$

Therefore, we want to maximize the following objective:

$$L = \frac{SS_A/(a-1)}{SS_e/(N-M)} - \lambda \frac{SS_{B(A)}/(M-a)}{SS_e/(N-M)}$$

B. FLDA with a partial contingency table
Here we provide the mathematical details of FLDA under the representation of a partial table. When we have a partial table, if the feature i is the primary feature with a levels, and the feature j is the secondary feature with $b_i$ levels, then $N_A$ in Equation (2) is defined as follows:

$$N_A = M_A - \lambda M_{B(A)}$$

where

$$M_A = \frac{1}{a-1} \sum_i b_i (\tilde{x}_{i\cdot\cdot} - \tilde{x}_{\cdots})(\tilde{x}_{i\cdot\cdot} - \tilde{x}_{\cdots})^T, \quad M_{B(A)} = \frac{1}{M-a} \sum_{ij} (\bar{x}_{ij\cdot} - \tilde{x}_{i\cdot\cdot})(\bar{x}_{ij\cdot} - \tilde{x}_{i\cdot\cdot})^T$$

and M is defined as in Equation (39). Correspondingly, $M_e$ in Equation (2) is defined as:

$$M_e = \frac{1}{N - M} \sum_{ijk} (x_{ijk} - \bar{x}_{ij\cdot})(x_{ijk} - \bar{x}_{ij\cdot})^T$$

The remaining mathematical arguments are the same as those for the complete table. In this scenario, because we do not observe all possible combinations of features i and j, we cannot find the linear projection for the combination of both features.

C. Implementation details of data synthesis
To quantitatively compare FLDA with alternative approaches, we synthesized data of four cell types, each of which contained 250 cells. The four cell types were generated from a Cartesian product of two features i and j, where i ∈ {0, 1} and j ∈ {0, 1}. Expressions of 1000 genes were generated for each cell. The expression value of the hth gene in the kth cell of the cell type (i, j), $x^h_{ijk}$, was defined as the following:

$$x^{1:100}_{ijk} = i + \epsilon_{ijk} \qquad x^{101:200}_{ijk} = j + \epsilon_{ijk}$$
$$x^{201:300}_{ijk} = i \wedge j + \epsilon_{ijk} \qquad x^{301:400}_{ijk} = i \vee j + \epsilon_{ijk}$$
$$x^{401:500}_{ijk} = 2i + \epsilon_{ijk} \qquad x^{501:600}_{ijk} = 2j + \epsilon_{ijk}$$
$$x^{601:700}_{ijk} = 2(i \wedge j) + \epsilon_{ijk} \qquad x^{701:800}_{ijk} = 2(i \vee j) + \epsilon_{ijk}$$
$$x^{801:900}_{ijk} = \epsilon_{ijk} \qquad x^{901:1000}_{ijk} = \epsilon_{ijk}$$

where $i \wedge j$ and $i \vee j$ were combinations of the two features. Here $\epsilon_{ijk}$ represents Gaussian noise, namely $\epsilon_{ijk} \sim N(0, \sigma^2)$. We generated synthetic datasets under varying levels of Gaussian noise by adjusting the σ value across a set of 5 distinct levels (σ ∈ {2, 4, 6, 8, 10}). For each of these σ values, we created 10 distinct datasets in a repetitive manner, leading to a total of 50 synthetic datasets. The evaluation metrics for each σ value were then computed as the average across the respective 10 repetitions, providing a robust evaluation of the proposed FLDA method across different noise levels.
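A generator consistent with this description is sketched below. The exact assignment of the combination terms (here i AND j and i OR j) to specific gene blocks is an illustrative assumption:

```python
import numpy as np

def synthesize(sigma, n_per_type=250, rng=None):
    """Synthetic 2x2 dataset: ten blocks of 100 genes carry feature i,
    feature j, their combinations, the same signals at double amplitude,
    or pure noise, plus Gaussian noise of scale sigma.
    The block-to-signal assignment is an illustrative assumption."""
    rng = rng if rng is not None else np.random.default_rng()
    X, li, lj = [], [], []
    for i in (0, 1):
        for j in (0, 1):
            means = [i, j, i & j, i | j,
                     2 * i, 2 * j, 2 * (i & j), 2 * (i | j), 0, 0]
            signal = np.repeat(means, 100).astype(float)   # 1000 genes
            X.append(signal + rng.normal(0.0, sigma, (n_per_type, 1000)))
            li += [i] * n_per_type
            lj += [j] * n_per_type
    return np.vstack(X), np.array(li), np.array(lj)
```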

D. Implementation details of the metrics used in the study
We measured the following metrics in our experiments:

Signal-to-Noise Ratio (SNR)
Because we care about the separation of cell types, we define the SNR metric as the ratio of the variance between cell types over the variance of the noise, which is estimated from the within-cluster variance. For the entire embedding space, given q cell types, if the coordinate of each cell is indicated by x, then we define the overall SNR metric as the following:

$$\mathrm{SNR} = \frac{\sum_p n_p \|\bar{x}_{p\cdot} - \bar{x}_{\cdot\cdot}\|^2 / (q-1)}{\sum_p \sum_k \|x_{pk} - \bar{x}_{p\cdot}\|^2 / (N-q)}$$

where $\bar{x}_{p\cdot}$ is the center of each cell type cluster, and $\bar{x}_{\cdot\cdot}$ is the center of all data points. Let x denote the embedded coordinate along a specific dimension. The SNR metric for that axis is therefore:

$$\mathrm{SNR} = \frac{\sum_p n_p (\bar{x}_{p\cdot} - \bar{x}_{\cdot\cdot})^2 / (q-1)}{\sum_p \sum_k (x_{pk} - \bar{x}_{p\cdot})^2 / (N-q)}$$
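A sketch of the overall metric consistent with the formula above:

```python
import numpy as np

def overall_snr(X, labels):
    """Between-cluster variance over within-cluster variance for an
    embedding X (cells x dims) with cell-type labels."""
    labels = np.asarray(labels)
    types = np.unique(labels)
    q, N = len(types), X.shape[0]
    grand = X.mean(axis=0)
    between = within = 0.0
    for t in types:
        cluster = X[labels == t]
        center = cluster.mean(axis=0)
        between += len(cluster) * np.sum((center - grand) ** 2)
        within += np.sum((cluster - center) ** 2)
    return (between / (q - 1)) / (within / (N - q))
```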

Explained Variance (EV)
We want to know whether the variation along a specific dimension is strongly explained by that of a specific feature. Therefore, we measure, for each axis, how much of the total variance is explained by the variance of the feature i or j. Formally, given the embedded coordinate $x_{ijk}$, we calculate the EV for feature i (and analogously for feature j) as the following:

$$EV_i = \frac{\sum_{ijk} (\bar{x}_{i\cdot\cdot} - \bar{x}_{\cdots})^2}{\sum_{ijk} (x_{ijk} - \bar{x}_{\cdots})^2}$$

where $\bar{x}$ is the average of $x_{ijk}$ over the indices indicated by the dots.

Mutual Information (MI)
The MI between a discriminant axis u and a feature quantifies how much information about the feature is obtained by observing data projected along that axis. It is calculated as the MI between the data representations along the axis, $y = u^T X$, and the feature labels of the data, f, where X is the original gene expression matrix:

$$MI(y, f) = H(y) + H(f) - H(y, f)$$

Here H indicates entropy. To calculate H(y) and H(y, f), we discretize y into 10 bins.
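A sketch of this computation, discretizing the projected coordinate into 10 bins (MI in nats):

```python
import numpy as np

def axis_feature_mi(y, f, bins=10):
    """MI between a projected coordinate y and categorical labels f."""
    edges = np.histogram_bin_edges(y, bins=bins)[1:-1]   # interior edges
    y_bin = np.digitize(y, edges)                        # 0 .. bins-1
    f_idx = np.unique(f, return_inverse=True)[1]
    joint = np.zeros((bins, f_idx.max() + 1))
    for yb, fb in zip(y_bin, f_idx):
        joint[yb, fb] += 1                               # joint histogram
    p = joint / joint.sum()
    py = p.sum(axis=1, keepdims=True)                    # marginal of y
    pf = p.sum(axis=0, keepdims=True)                    # marginal of f
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / (py @ pf)[nz])))
```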

Modularity
Ridgeway and Mozer (2018) argued that in a modular representation, each axis should depend on at most a single feature. Following the arguments in their paper, the modularity score is computed as follows: we first calculate the MI between each feature and each axis ($m_{if}$ denotes the MI between one axis i and one feature f). If an axis is perfectly modular, it will have high mutual information for only one feature and zeros for the others. We therefore compute a template $t_{if}$ as the following:

$$t_{if} = \begin{cases} \theta_i & \text{if } f = \arg\max_g m_{ig} \\ 0 & \text{otherwise} \end{cases}$$

where $\theta_i = \max_g m_{ig}$. We then calculate the deviation from the template as:

$$\delta_i = \frac{\sum_f (m_{if} - t_{if})^2}{\theta_i^2 (N - 1)}$$

where N is the number of features. The modularity score for the axis i is $1 - \delta_i$. The mean of $1 - \delta_i$ over i is defined as the overall modularity score.
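In code, given an axes × features matrix of MI values (ties and all-zero rows are assumed away for simplicity):

```python
import numpy as np

def modularity_score(mi):
    """Mean of 1 - δ_i over axes, following Ridgeway and Mozer (2018)."""
    mi = np.asarray(mi, dtype=float)
    n_features = mi.shape[1]
    theta = mi.max(axis=1, keepdims=True)            # θ_i per axis
    template = np.where(mi == theta, theta, 0.0)     # t_if
    delta = ((mi - template) ** 2).sum(axis=1) / (
        theta[:, 0] ** 2 * (n_features - 1))         # δ_i
    return float(np.mean(1.0 - delta))
```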

Silhouette Score
The Silhouette score can be used to measure the separation between clusters. The formula for the Silhouette score of one sample point is given by:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where a(i) is the average distance from the ith sample to the other samples in the same cluster, and b(i) is the smallest average distance from the ith sample to the samples in each of the other clusters.
The Silhouette score ranges from -1 to 1. A score close to 1 means that the sample is appropriately clustered; a score close to -1 means that the sample would be more appropriately assigned to its neighbouring cluster. The overall Silhouette score is the average Silhouette score of all samples, serving as a measure of how appropriately the data have been separated and clustered.
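In practice the overall score is a one-liner with scikit-learn, shown here on placeholder data:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 3))              # placeholder embedded coordinates
labels = rng.integers(0, 4, size=100)      # placeholder cell-type labels
score = silhouette_score(Y, labels)        # averages s(i) over all samples
```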

E. Implementation details of annotation perturbation
We conducted an annotation perturbation analysis to assess the impact of incorrect phenotypic labeling on the performance of our FLDA method. For this, we utilized the T4/T5 neuron dataset and introduced three distinct types of perturbations to the original labels. First, we interchanged the phenotype labels of T4a neurons with those of one of the seven remaining types (T4b, T4c, T4d, T5a, T5b, T5c, T5d). In this case, the phenotype labels of two cell types were erroneous, while the total number of cell type clusters remained the same. We maintained two levels for the dendritic phenotype (T4/T5) and four levels for the axonal phenotype (a/b/c/d). As per our methodological setup, we used one dimension to represent the dendritic feature and three dimensions for the axonal feature.
Second, we combined the axonal phenotypic level a with another level (b/c/d) to create an inaccurate new level (a+b/a+c/a+d). Under this scenario, there were three axonal phenotypes, so we employed two dimensions to represent the axonal feature.
Lastly, we arbitrarily divided each of the four axonal lamination labels (a/b/c/d) into two distinct levels. For example, among neurons with the original axonal level a, we designated some of them with a level a1, while the rest were assigned a level a2. This resulted in eight axonal phenotypes (a1/a2/b1/b2/c1/c2/d1/d2), and following our methodology, we designated seven dimensions to account for the axonal feature.
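The three perturbations can be generated as sketched below; which categories are swapped, merged, or split here is an arbitrary illustrative choice:

```python
import numpy as np

def perturb_labels(axonal, kind, rng=None):
    """Perturb per-cell axonal labels drawn from {'a','b','c','d'}."""
    rng = rng if rng is not None else np.random.default_rng()
    labels = np.array(axonal, dtype=object)
    if kind == "swap":        # mislabel one type as another
        mask_a, mask_b = labels == "a", labels == "b"
        labels[mask_a], labels[mask_b] = "b", "a"
    elif kind == "merge":     # merge two categories into one
        labels[np.isin(labels, ["a", "b"])] = "a+b"
    elif kind == "split":     # split each category into two arbitrary levels
        for lvl in ("a", "b", "c", "d"):
            idx = np.where(labels == lvl)[0]
            labels[idx] = lvl + "1"
            labels[rng.permutation(idx)[: len(idx) // 2]] = lvl + "2"
    return labels
```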
In each scenario, we applied FLDA to the T4/T5 neuron dataset using the altered annotations and measured the respective metrics. These metrics were subsequently compared with those derived from the original annotation to assess the robustness of our method to mislabeling.

Figure 1:
Figure 1: Illustration of our approach. (A,B) Here, cell types are represented by two phenotypic features, labeled with i and j respectively. If only some combinations of the two features are observed, we have a partial contingency table (B) rather than a complete one (A). (C) We aim to find linear projections of the data that separate the cell types in a manner factorized according to the two features. In this diagram, u, v, and w are aligned with Feature 1, Feature 2, and their combination respectively, with the projected coordinates y, z, and s.

Figure 2:
Figure 2: Quantitative comparison between FLDA and other models. (A) Illustration of data synthesis. For implementation details, see Appendix C. The color bar represents the expression values of the 1000 generated genes. (B) Normalized overall Signal-to-Noise Ratio (SNR) metric for each analysis, normalized with respect to that of LDA. The normalized SNR metric of PCA is below 0.8. (C) Overall modularity score for each analysis. The error bars in (B,C) denote standard errors calculated from 10 repeated simulations.

Figure 3:
Figure 3: Application of FLDA to the dataset of T4/T5 neurons. (A) T4/T5 neuronal cell types and their dendritic and axonal location phenotypes. (B) The organization of T4/T5 neurons in a complete contingency table, where i indicates dendritic location and j indicates axonal termination. (C) SNR metric for each discriminant axis. (D) Data projection into the three-dimensional space consisting of the discriminant axis for feature i (FLD_i) and the first and second discriminant axes for feature j (FLD_j1 and FLD_j2). (E-G) Data projection into the two-dimensional space comprised of FLD_i and FLD_j1 (E), FLD_j1 and FLD_j2 (F), or FLD_j2 and FLD_j3 (the third discriminant axis for feature j) (G). Different cell types are represented by different colors as depicted in Panels (A) and (D).

Figure 4:
Figure 4: Critical genes extracted from the sparse algorithm. (A) Weight vector of the 20 genes selected for the dendritic phenotype (FLD_i). The weight value is indicated in the color bar with color indicating direction (red: positive and green: negative) and saturation indicating magnitude. (B) Expression patterns of the 20 genes from (A) in eight types of T4/T5 neurons. Dot size indicates the percentage of cells in which the gene was expressed, and color represents average scaled expression. (C) Weight vector of the 20 genes selected for the axonal phenotype (FLD_j1). Legend as in (A). (D) Expression patterns of the 20 genes from (C) in eight types of T4/T5 neurons. Legend as in (B).

Figure 5:
Figure 5: Representative plots (at σ = 6) of EV and MI metrics for FLDA and other models. (A,B) EV (A) and MI (B) metrics of FLDA. FLD_i and FLD_j indicate the factorized linear discriminants for features i and j. (C,D) EV (C) and MI (D) metrics of 2LDAs. LD_i and LD_j indicate the linear discriminant components for features i and j. (E,F) EV (E) and MI (F) metrics of LDA. LD_1 and LD_2 indicate the first two linear discriminant components. (G,H) EV (G) and MI (H) metrics of CCA. CCA_1 and CCA_2 indicate the first two canonical correlation axes. (I,J) EV (I) and MI (J) metrics of PCA. PC_1 and PC_2 indicate the first two principal components. EV_i and EV_j are the explained variance of features i and j along an axis, and MI_i and MI_j indicate the mutual information between an axis and features i and j respectively. Values of the EV and MI metrics are also indicated by the color bars on the right side.

Figure 6:

Figure 7:
Figure 7: Additional plots for critical genes extracted from the sparse algorithm. (A) Weight vector of the 20 genes selected for the combination of the dendritic and axonal features (FLD_ij1). The weight value is indicated in the color bar with color indicating direction (red: positive and green: negative) and saturation indicating magnitude. (B) Expression patterns of the 20 genes from (A) in eight types of T4/T5 neurons. Dot size indicates the percentage of cells in which the gene was expressed, and color represents average scaled expression.

Figure 8:

Table 2:
Table 2: Average execution time (in seconds) for FLDA and other models.