Patterns, Profiles, and Parsimony: dissecting transcriptional signatures from minimal single-cell RNA-seq output with SALSA

Single-cell RNA sequencing (scRNA-seq) technologies have precipitated the development of bioinformatic tools to reconstruct cell lineage specification and differentiation processes with single-cell precision. However, start-up costs and data volumes currently required for statistically reproducible insight remain prohibitively expensive, preventing scRNA-seq technologies from becoming mainstream. Here, we introduce single-cell amalgamation by latent semantic analysis (SALSA), a versatile workflow to address those issues from a data science perspective. SALSA is an integrative and systematic methodology that introduces matrix focusing, a parametric frequentist approach to identify fractions of statistically significant and robust data within single-cell expression matrices. SALSA then transforms the focused matrix into an imputable mix of data-positive and data-missing information, projects it into a latent variable space using generalized linear modelling, and extracts patterns of enrichment. Last, SALSA leverages multivariate analyses, adjusted for rates of library-wise transcript detection and cluster-wise gene representation across latent patterns, to assign individual cells under distinct transcriptional profiles via unsupervised hierarchical clustering. In SALSA, cell type assignment relies exclusively on genes expressed both robustly, relative to sequencing noise, and differentially, among latent patterns, which represent best-candidates for confirmatory validation assays. To benchmark how SALSA performs in experimental settings, we used the publicly available 10X Genomics PBMC 3K dataset, a pre-curated silver standard comprising 2,700 single-cell barcodes from human frozen peripheral blood with transcripts aligned to 16,634 genes. SALSA identified at least 7 distinct transcriptional profiles in PBMC 3K based on <500 differentially expressed Profiler genes determined agnostically, which matched expected frequencies of dominant cell types in peripheral blood. We confirmed that each transcriptional profile inferred by SALSA matched known expression signatures of blood cell types based on surveys of 15 landmark genes and other supplemental markers. SALSA was able to resolve transcriptional profiles from only ∼9% of the total count data accrued, spread across <0.5% of the PBMC 3K expression matrix real estate (16,634 genes × 2,700 cells). In conclusion, SALSA amalgamates scRNA-seq data in favor of reproducible findings. Furthermore, by extracting statistical insight at lower experimental costs and computational workloads than previously reported, SALSA represents an alternative bioinformatics strategy to make single-cell technologies affordable and widespread.

stable expression of lineage markers, which may be inconsistent with their physiological 23 reality, and may distort the transcriptional dynamics of cell populations in vivo [11]. 24 Single-cell transcriptomics circumvents many of these obstacles, and recent 25 breakthroughs in RNAseq technologies now allow reconstruction of cell lineage 26 specification processes in embryonic tissues at the level of individual cells [12][13][14][15][16][17][18]. A 27 diverse catalogue of single cell RNAseq (scRNA-seq) platforms and workflows are 28 currently available to interrogate transcriptomes from individual cells.
Using 29 bioinformatic tools, individual cells are sorted and classified by their gene expression 30 similarities, and the identity of each cell is deduced based on their transcriptional and 31 functional ontology [19][20][21][22]. Broadening the scope of samples that can undergo scRNA-32 seq workflows, multiple groups have introduced chemical treatments for freshly excised 33 specimens, akin to whole-tissue mounting, that allow long-term storage of specimens 34 whole and conserving their endogenous transcripts. These modifications make a wider 35 range of collected and banked tissues, including clinical samples, compatible with 36 scRNA-seq protocols designed for freshly excised and dissociated specimens. Some of 37 6 UMIs per detected barcode and total UMIs per aligned gene. This is followed by 115 quantile regression of best-fit parametric scale and shape factors, and a partial 116 contribution rate from one of the two constituent P C and P D distributions. Parametric 117 scale factors of the P C -P D mixture model are used to estimate critical inflection points at 118 transition regions in the eCDF between low, transient, or maximal UMI counts per 119 barcode, whereas parametric shape factors gauge the steepness of these transitions. 120 Thus, when used in combination, P C -P D parametric factors help demarcate the range of 121 UMI coverages corresponding to single cells (Fig. S1). Using a similar logic, the same 122 frequentist approach can be used to distinguish facultative genes from rare or 123 constitutively expressed ones ( Fig. 1C and D). From here on in the workflow, only data 124 from "best-guess" single-cell barcodes, trimmed to retain facultative genes exclusively, 125 is needed for downstream differential expression analysis of scRNA-seq datasets (Fig.  126 1E). 127 The SALSA workflow 128 At its core, the SALSA methodology examines total UMI counts to infer transitions in 129 UMI coverage between artefactual (baseline), single-cell (signal) and multi-cell Therefore, we refer to such a minimum required number of pairwise differences 208 suggestive of mutual exclusivity as the cardinal number. The resulting subset of 209 differentially expressed genes (DEGs) is interrogated further to tease out if they have a 210 cardinal number of pairwise-significant differences greater than SNR=1 between cell 211 majors (∆ Log2FC > 95% TI SSR ). Genes passing this criteria are referred to as DEGs with 212 reproducible expectation estimates (DEGREEs) because of their prospective resilience 213 to technical noise, analogous to their bulk RNA-seq counterparts in the LSTNR method 214 [45]. Finally, to extract candidate marker genes that are minimally impacted by 215 representation differences between inferred cell populations, we perform unweighed 216 log-fold expression ANOVA of DEGREEs across cell majors, and only retain those that 217 exhibit cardinal pairwise significant differences greater than noise levels. We refer to 218 this final subset of genes as Profilers, since they are facultative genes that lie within 219 sequencing dynamic range based on their mean UMI coverage across single cells 220 (SGs), statistically distinct and beyond technical noise from the average of all cells 221 combined (LSTNRs), with statistically significant expression differences between cell 222 (~14%) of data-positive fields comprised a much smaller share of 3,929 (or 23.6% of) 275 detected genes. These genes had 4 or more counts, with up to 419 total mappings for 276 the single-most abundant barcode×gene combination that aligned to the FTH1 locus. 277 We also observed that ~0.27M (or 84%) of all multi-count data (4 or more counts) was 278 made of alignments to only 166 (1%) of detected genes. Among the 166 279 "overrepresented" genes in this fraction of the PBMC 3K library we found 8 protein-280 coding mtDNA genes, 75 ribosomal protein subunits, 8 HLA chains, and housekeeping 281 genes like β -actin, GAPDH, and vimentin (Table S1). 282 Due to the sparsity and low-count UMI tallies in gene-cell matrices from scRNA-seq 283 data, when developing SALSA we opted for imputation-leveraging algorithms like SVD 284 and IRLBA that are designed to handle sparsity dynamically [44]. If missing-data fields 285 are explicitly replaced with values of zero instead, file sizes increase dramatically, 286 approximately ~20-fold in the case of the PBMC 3K data set. To analyse such a 287 dataset, large volumes of dynamic memory storage and computational processing 288 power for data post-processing are required to queue and retain all the additional zeros 289 that were artificially imposed on the gene-cell matrix. In practical terms, leveraging 290 sparsity-capable algorithms to analyse scRNA-seq datasets can represent the 291 difference between performing expression analysis for thousands of single cells using a 292 laptop vs. having to rely on high-performance parallel processing networks to complete 293 the same analysis. 294 count data in a stacked format. Although this stack represents 100% of data-positive 297 gene-cell fields, it occupies only ~5.1% of the footprint from a traditional zero-filled 298 gene-cell expression matrix. Next, total-per-barcode UMI counts (i.e. per-cell coverage) 299 were recorded for calculating normalized barcode×gene expression rates in UMI-per-300 Thousand UMIs (UPT) units later in the workflow. Because in the 10X Genomics 301 curated 3K PBMC dataset each barcode is regarded as a single cell, no barcode 302 filtering by coverage was required in this instance. We tallied and recorded aggregate 303 UMI counts per aligned gene using our P C -P D mixture model to interrogate total-per-304 gene coverages ( Fig. 2A). The SALSA P C -P D mixture model was then used to 305 determine the best candidate subset of genes ranking between low overall detection 306 rates (i.e. rare transcripts) and extraordinarily high counts at "outlier levels" across the 307 board (i.e. constitutive genes). 308 To estimate ranges of "inlier" per-gene coverage scores at different minimum UMI 309 count cut-offs, both the P C -P D mixture model and a heavy-tailed range projection model 310 are iterated from the bottom-up (Fig. 2B). Firstly, all detected genes with 1 or more total 311 UMI counts are admitted for quantile regression. Best-fit parameters are then 312 numerically solved and recorded for further analysis. To calculate this the minimum 313 threshold of total UMI counts for gene admission is raised, and fits are repeated until the 314 entire data stack is discarded. In our implementation, the minimum threshold is updated 315 between iterations in squared steps (e.g. a first minimum cut-off at 1 total UMIs, second 316 at 4, third at 9, etc.) to correspond with the 2-dimensionality of gene-cell matrices. The 317 output from this parametric sweep consists of 7-parameter value sets from the P C -P D 318 mixture model (2 scale factors, 2 shape factors, and a non-zero and non-negative partial contribution rate from either the P C or P D function) and the heavy-tailed range 320 projection model (1 scale factor, 1 shape factor) combined. The 5-parameter set from 321 the P C -P D mixture model described a best-fit distribution that contains most of the 322 observed per-gene coverage values; the 2-parameter set from the heavy tailed 323 projection model is used to estimate an upper bound of admissible "inlier" per-gene 324 coverages ( Fig. 2A  Because of the nature of extreme value distributions, each fit is primarily driven by 327 the make-up of scores closest to baseline, i.e. a minimum discriminant threshold of 328 background vs. signal. As the threshold is raised, the estimated average and projected 329 "inlier" maximum scores of per-gene coverages predicted by the best-fit parameters 330 also rise (Fig. 2B). In a case where the admitted data is dominated by rare transcripts, 331 the projected "inlier maximum coverage" is small. On the other hand, if admitted data is 332 dominated by facultative genes, the projected maximum is larger. However, coverage 333 data and the P C -P D mixture model have different support (discrete vs. continuous 334 values). This mismatch in statistical variable support can be exploited. The SALSA 335 smoothly otherwise. For example, "infinity" shape factors or zero-valued (machine-343 precision) partial contribution rates will both lead to "spikes" appearing in the chart of 344 best-fit P C -P D parameters (Fig. 2C). These "spikes" in the plots of P C -P D parameter 345 values vs. coverage cut-offs serve to highlight transitions between rare, facultative, and 346 constitutive genes based on the ranking of their UMI coverages relative to all other 347 detected genes. As a result, the P C -P D parameters estimates are themselves tied to the 348 empirical coverage data through a parametric approach. Therefore, our filtering strategy 349 used in the SALSA workflow provides a systematic route to back-calculate ranges of 350 total UMI coverages that distinguish facultative genes from all others represented in the 351 data. 352 Based on the P C -P D parametric sweep of the PBMC 3K data set, we partitioned the 353 16,634 detected genes into three categories: 13,077 rarely aligned genes (1 -168 total 354 UMIs each); 3,305 facultative genes (169 -2,799 total UMIs each); and 252 constitutive 355 genes (2,814 -161,685 total UMIs each). The structure of the PBMC 3K data set was 356 notable in that the data-positive fields were unevenly apportioned among the three gene 357 coverage regimes (Table 1    3-valued count fields respectively (Fig. 2D). In principle, these data 386 features would suggest many genes in this subset were detected somewhat frequently terms, these genes are facultative, with their transcripts neither present or absent in all 389 cells at once; instead, some cells express them, some do not, and some express the 390 transcripts at rates that wax and wane. 391 We propose that parametric focusing of gene-cell matrices for the PBMC 3K data 392 set, and arguably, for any scRNA-seq data sets, is a systematic curation strategy that 393 favors retention of diverse blocks of single-cell expression data for subsequent analysis. 394 This strategy for data curation strikes a balance between data volume, computational 395 performance, and statistical variation. Of note, we did not perform parametric focusing 396 at the barcode level on the PBMC 3K dataset because the source files we used only 397 reported single-cell barcodes; even then, parametric focusing on genes alone identified 398 gene subsets to withdraw from further analysis and greatly reduced the computational 399 data load. In the PBMC 3K data set, this initially reduced the computational cost by well 400 over an order of magnitude ahead of performing gene expression and single-cell 401 clustering analyses; in effect, SALSA reduced the data to be analysed to a ~1.3M UMI 402 count stack vs. the original ~45M zero-filled count matrix (Table 1). In practical terms, 403 our parametric focusing approach to pre-processing raw scRNA-seq datasets efficiently 404 distils the informative fraction of expression data from the prominently empty-valued 405 matrix for further analyses. 406 Baseline benchmarking on the PBMC 3K data set with SALSA. Having honed 407 the dataset to a facultative block containing ~1.3M data-positive fields of gene×cell UMI 408 counts, we then performed differential expression analysis using the SALSA workflow 409 (Fig. 1E).
Grand-mean gene expression rates were calculated using "bulked" 410 normalized expression from all single cell barcodes combined. Log 2 -fold normalized expression differences (Log 2 FC) relative to grand-mean gene expression rates were 412 then calculated within single-cell barcodes and used for subsequent inferential tests. 413 Even before expression differences between individual barcodes can be inspected, 414 there are two critical properties of expression measurements that must be demarcated 415 first to produce a reliable statistical analysis: the dynamic range of measurement 416 sensitivity, and the precision of expression measurements. To estimate a useful 417 dynamic range of detectable gene expression over library background, we 418 characterized the distribution of log-transformed grand-mean gene expression rates. 419 We found that a thresholded normal distribution was the best-fit probabilistic model From the facultative block (1.3M data-positive fields, 57.2% of stack) we identified 436 ~76% of genes as SGs (2,519 of the 3,305 genes; Table 1 and Fig. 3A). These genes 437 accounted for the subset of facultative genes that, once resolution and within-cluster 438 representation rates were accounted for in combination, showed statistically significant 439 variation among the 7 prospective clusters (Fig. 1E) values. Interestingly, this LSTNR stratum contained each of the single-cells and genes 447 found in the SG stratum before effect size and measurement error thresholding. 448 Therefore, even though SALSA is extracting an increasingly robust data subset when 449 moving from SGs to LSTNRs, a classification which demands more statistical 450 stringency, it did so without barcode or gene dropouts (Table 1). Conversely, since the 451 resulting LSTNR block was of the same size as its SG predecessor (2,519×2,700), our 452 findings also implied that filtering for gene×cell count data "above precision baseline" 453 meant some data-positive fields were cleared out, which lowered effective data respectively (Table 1). This resulted in a net 78.4% SG-to-LSTNR data retention rate 456 for the PBMC 3K data set (Table 1). 457 In our view, the progression of admitted data in our SG-to-LSTNR transition posits 458 an important cautionary tale: without a proactive precision benchmarking strategy, one 459 risks inferring cell clusters based on gene-cell expression matrices unduly burdened 460 with measurement noise. In the PBMC 3K data set this means that with a 78.4% 461 retention rate, the LSTNR block from the PBMC 3K data set contains the exact same 462 single cells and barcodes as the SG block, yet it is equivalent to discarding roughly 1 463 out of 5 data points worth of noise before further analysis (Table 1). Both the SG and 464 LSTNR blocks represent minute fractions of the original expression matrix, with a 2.3% 465 and 1.8% matrix span respectively. Additionally, most of the expression matrix real-466 estate is earmarked for overwhelmingly sparse or 1-valued count data from rare 467 transcripts or artefacts, accounting for ~78.6% of the matrix allowance (Fig. 2B). In 468 conclusion, in analysing this "silver" dataset we can clearly see how inefficient and 469 misleading scRNA-seq data wrangling can become without performing statistical efforts 470 devoted to either matrix focusing or precision baseline benchmarking prior to clustering 471 of expression patterns. 472

Stratification of LSTNRs and single-cell clustering in the PBMC 3K data set. 473
The precision-benchmarked subset of 2,419 LSTNR genes represents a diverse 474 catalogue of candidate discriminants to distinguish cells belonging different clusters by 475 their transcriptional signatures. Still, at this stage, these cell groupings are prospective 476 and relatively unrefined: considering that LSTNRs represents roughly 3 out of every 4 477 facultative genes in the PBMC 3K data, one could counter that the initial single-cell clustering, which included non-LSTNR data, could be disproportionately influenced by 479 data-positive fields below their own a priori SNR=1 estimated threshold. To address 480 these concerns, we gathered LSTNR block data and, using the SALSA workflow ( Fig.  481 1E) inferred 7 "cell major" groupings via IRLBA re-clustering of barcodes based on the 482 GLM-projected expression rates, then scored representation rates per LSTNR gene 483 within cell majors (Fig. 1E). In this context representation rates reflected the proportion 484 of cells with detected UMIs aligning to LSTNRs. 485 At this point, we reached a cross-road. On the one hand, we extracted a precision-486 curated catalogue of LSTNR genes, whose statistical variation could be exploited to sort 487 out underlying cell types from a corpus of single-cell transcriptomes through 488 bioinformatic means. On the other hand, in understanding the biology behind the 489 system of interest, it is of little advantage to have over 2,400 LSTNR genes if there is no 490 insight on which the best candidates are to validate bioinformatically inferred cell types 491 using on-the-bench assays like qPCR. Similar concerns can be raised regarding 492 representation weights in SALSA. Utilising representation weights, we successfully 493 enhanced our ability to discern between cell majors, as they account for "yes/no" 494 expression status of genes depending on the single-cell cluster. In fact, when 495 interrogating expression matrices dominated by 1-valued gene×cell count data, 496 representation weights provide perhaps the only differential metric at hand that truly 497 discriminates between inferred cell types.
Yet, representation rates at the 498 transcriptional level are notoriously difficult to validate in a quantitative fashion on the 499 bench. 500 inspecting additional layers of gene stratification within LSTNRs, based on how 502 prospectively can LSTNR subsets reproduce differential expression patterns in 503 experimental assays. We convey gene stratification results hereafter using a short-504 hand graphical aid, the "frosty" plot, that illustrates the transition in data retention as 505 detected genes are sifted through increasingly stringent filters of statistical significance. 506 The frosty plot for the PBMC 3K data using SALSA-based gene stratification, and 507 details on its assembly, are shown in Fig. 3A. 508 As mentioned earlier, the list of LSTNRs is iteratively focused by progressively 509 stringent stratification of statistical significance and post hoc cardinality of differential 510 Log 2 FC-based expression between cell majors (Fig. 1E). By sequentially "stripping" 511 statistical comparisons of resolution weights, followed by representation weights, while 512 demanding mutually exclusive post hoc expression differences between cell majors with 513 statistical and above-noise significance throughout, we narrowed the pool of genes. 514 From a 2,419-member list of LSTNR genes from the PBMC 3K data set we sifted the 515 pool down to: a) 1,244 DEGs whose Log 2 FC pairwise differences between cell majors 516 are statistically significant and mutually exclusive (DEGs); of which b) 464 showed 517 differences greater than the SNR=1 noise benchmark (DEGREEs); including c) 462 518 exhibiting statistically distinct and mutually exclusive differences that are insensitive to 519 stochastic incidence of transcripts, or representation, among cell types (Profilers). 520 Notably, even though the number of retained gene×cell count data fields dropped as the 521 number of genes decreased between strata (Table 1)  Profiler-aligned coverage per cell ranging from 4 -289 UMIs. Our stratification 524 approach led to a substantial improvement on information maximization rates: the 525 462×2,700 Profiler block represents ~2.8% of the gene-cell matrix allowance, and a 526 minute portion of the accrued data overall (9.1% of the accrued data stack, equal to 527 0.5% matrix span). This portion accounts for the single least sparse stratum within 528 facultative gene data (16.8% block span). Incidentally, this span is over 3× more 529 populated as a subset than the gene-cell expression matrix overall (5.1% matrix span), 530 as outlined in Table 1. 531 One underlying assumption of this approach is that LSTNR-based single-cell 532 clustering is reasonably conserved across the stratification process. Thus, to ascertain 533 whether that implied fidelity held true, data from the Profiler block was used in IRLBA re-534 clustering. Here, cell majors at the LSTNR and Profiler strata were compared to one 535 another in terms of membership by cluster-to-cluster concordance, and hierarchical 536 sorting by single-cell ordering within clustering dendrogram. In all, we found that single 537 cells were largely classified into matching clusters before and after LSTNR stratification 538 (contingency R 2 = 0.91, Pearson χ 2 p<1×10 −4 ). These results suggest that facultative 539 gene stratification retains underlying transcriptional profiles of single cells, thereby 540 pointing to SALSA successfully extracting a parsimonious subset of testable, 541 agnostically defined candidate biomarkers. 542 After performing single-cell clustering at the Profiler-stratum level, we first paid 543 attention to the relative density of data in each of the 7 cell majors, originally labelled A 544 through G. Several intricacies became immediately apparent -for example, not only 545 were the number of cells different among cell majors, but also the distribution of total 546 ( Fig. 3B and C), all 7 majors that we defined were grossly derived from three core 548 stems: A through D represented one extreme, F and G were situated opposite, and E 549 constituted a lone-standing intermediate stem. In the first stem, only majors A through 550 C, which accounted for slightly over half of all inferred cells, showed similar total UMI 551 coverage distributions among them (roughly 1,000 -3,000 total UMI per cell; Fig. 3D). 552 Log 2 FC expression levels for A through C were in the low-to-mid quantiles among 553 accrued profiler-stratum count data overall (about 0.7 -1.5 log-fold relative to the bulk-554 wise average; Fig. 3C). Meanwhile, major D was substantially skewed towards lower 555 total UMI coverages, with about 500 -2,500 total UMI per cell (Fig. 3D), and mid-to-556 high Log 2 FC expression quantiles (up to 2.8 log-fold vs. bulk). Interestingly, based on 557 net Log 2 FC values, normalized expression rates in major D were about twice as high as 558 those from the A-to-C ensemble. This is seemingly consistent with the mismatch 559 between both ensembles in total UMI coverage, and to which count data each are 560 normalized against. This observation implies that traditional normalized expression 561 scores at the gene×cell level, such as UPT rates, track with total count coverage rather 562 than intensity of expression. Accordingly, cell major E in the intermediate branch 563 showed lower expression values relative to the A-C ensemble in the first stem by about 564 a factor of 2. This was consistent with cells in major E carrying roughly twice as many 565 total UMIs the A-C ensemble (about 2,000 -5,000 total UMI per cell; Fig. 3D). Put 566 simply, these data demonstrate that single-cell expression metrics by normalized-per-567 coverage rates alone are misleading. 568 Using IRLBA-based single-cell clustering within the SALSA workflow illustrates how 569 interpretation of expression matrices can go awry if based solely on expression scoring 570 without account for representation or precision benchmarking. For the PBMC 3K 571 dataset, these risks are most evident in the F-G stem: based on hierarchical clustering 572 ( Fig. 3B, C), majors F and G are more closely related to each other than to any other 573 cell majors, yet their total UMI coverages are at opposite extremes (500 -2,000 and 574 2,000 -16,000 total UMI per cell in F and G, respectively; Fig. 3D). In a seeming 575 contradiction, data in major F aligns to the most Profiler genes while, at the same time, 576 exhibits the lowest per-cell coverage among all majors; major G data make-up for 577 Profiler genes is exactly the opposite. One explanation for such extreme behaviours 578 may be that per-cell coverages for majors F and G are disproportionately laden with 579 UMIs from constitutive genes, all of which are omitted from clustering tests -suggesting 580 that either the translational (e.g. transcripts for ribosomal subunits) or metabolic (e.g. 581 mtDNA-derived and respirasome-encoding transcripts) states of cells in majors F and G 582 are at odds relative to each other. Another explanation could come from the effect of 583 total coverage itself on gene stratification: for single cells whose data is dominated by 1-584 valued counts per gene, more total UMIs will result in lower UPT scores and a larger 585 proportion of "muted" gene×cell Log 2 FC values below the SNR=1 threshold, which then 586 would be excluded from the expression matrix at the LSTNR stage. 587 Peripheral blood cell types in the PBMC 3K data set inferred using SALSA. To 588 examine whether Profiler-based unsupervised clustering tracked with transcriptional 589 signatures from peripheral blood subpopulations, we focused on expression data from a 590 reference subset of 15 "landmark" genes encoding 14 widely recognized protein 591 markers (Figs. 3C, E and F; Table 2). To do so, we inspected within-cluster 592 distributions of Log 2 FC expression levels relative to the whole-specimen collective. We 593 accounted for "yes/no" representation rates within each cluster for landmark genes; by 594 coloring the background of violin plots grey we highlighted clusters with the highest 595 overall Log2FC expression levels in any given gene and showed the highest 596 representation rates in combination (Fig. 3E). To visualise the clustering data in 2D we 597 devised "topographs" using 2D clustering maps overlayed with a non-parametric 598 quantile heatmap highlighting "weighed gene expression" scores. These scores 599 represent a composite of single-cell Log 2 FC values and major-wise representation rate, 600 defined as the proportion of positive cells per major, for a gene of interest. We use 601 these topographs to simultaneously highlight differences in the intensity and 602 predominance of expressed genes among clusters of like cells (Fig. 3F). clustered with (Fig. 3C). However, cell major B was otherwise discordant when 619 compared to its companions, majors A, C, D and E, based on landmark genes (Fig. 3E). and granulocytes in major B in the clustering map (Fig. 3B) suggests a significant 658 interaction between those two cell types. Physiologically this relationship is widely recognized; subsets of granulocytes are known to perform antigen-presenting roles that 660 help coordinate CD4+ T cell activation [54,63,64]. 661 The second transcriptome category constituted majors F and G. This was puzzling, 662 not only because majors F and G had strikingly different total UMI coverages, as 663 mentioned earlier, but also because we found that expression of blood cell-type markers 664 in major G cells had the strongest similarities with granulocytes (major B) in the first 665 transcriptome category. Both majors G and B shared similar representation rates for 666 landmark genes CD4, CD14, and FCεRI(FCER1G) (Fig. 3E), carried a CD33+/CD36+ 667 myeloid signature (Fig. S2), and had matching logistic regression trends for Profiler 668 genes responsive to microbial infections, such as AP1S2, CNPY3, FPR1 and RGS2 669 [72][73][74][75] (Fig. S3). However, major G cells clustered apart from granulocytes in major B 670 and were also distinct in critical ways; for example, major G cells showed much higher 671 expression levels of monocytic and macrophage-enriched genes CD16(FCGR3A) and 672 MS4A4A, respectively (Fig. 3B, C and E) [56,76,77]. We concluded that cell major G 673 represents a combined pool of monocyte-derived subtypes including monocytes, 674 macrophages, and mono-derived dendritic cells. This conclusion was based on two key 675 factors: an implied myeloid origin, like that of antigen-presenting granulocytes in major 676 B, and transcriptional enrichment for genes involved in pathogen recognition processes 677 that are characteristic of leukocytes from the innate immune system [54,56,66,73,74,76-678 78]. 679 Given this backdrop, we surmised that if clustering proximity in the second 680 transcriptome category evokes converging physiologies between non-myeloid cells from 681 major F and myeloid-derived cells from major G, irrespective of their strikingly different total UMI counts, then majors F and G should both present key ontologies. To test this, 683 we asked whether majors F and G shared a minimal set of abundantly detected Profiler 684 genes which: a) were absent from all other cell majors; and b) significantly enriched for 685 critical innate immunity processes or molecular functions. After surveying weighed 686 expression rates by multinomial logistic regression, we found that both majors F and G 687 prominently expressed the Profiler genes HIST1H2AC, PF4, PPBP, and 688 SDPR(CAVIN2). These genes all participate in anti-parasitic NETosis processes, 689 mediate leukocyte chemoattraction and degranulation, and are regulated by calcium-690 sensitive protein kinase C activity [79][80][81][82][83][84][85]. Importantly, most other cells in the PBMC 3K 691 dataset did not express any of these Profiler genes or exhibit their associated ontologies 692 (Fig. S3). 693 Still, the transcriptomic profiles of cells in major F were evidently distinct from 694 monocytic cell types, not only by their coverage, but also because of critical landmark 695 gene signatures. The genes FCεRI(FCER1G) and LYZ were both detected in almost 2 696 out of every 3 cells from major F, which expressed them at the richest Log 2 FC levels 697 among all cell majors relative to the bulk mean. Both these landmark genes showed 698 their lowest Log 2 FC levels across the dataset among cells from major G, even though 699 transcripts from both FCεRI(FCER1G) and LYZ were detected in almost every single 700 major G cell. In contrast, CD16 expression was all but missing from major F, and most 701 prevalent in major G cells overall. Therefore, we used four defining features to derive 702 the identity of the cells in major F: 1) their strong ontological relationship with monocytic 703 cell types, 2) their relative frequency in the data set of ~9% of the total (247 single cells), innate immunity within PBMC fractions. From these features we deduced that cells in 706 major F correspond to lymphoid-derived natural killer (NK) cells [62, [86][87][88][89][90][91]. In the most commonly used chemistries, universal handles and barcodes are fused 748 to poly-A + insertion sites of mRNA molecules. In this context, bridged paired-end 749 sequencing allows each read from sequenced fragments to serve different purposes. fragments this approach aims to circumvent mappability issues from oligo dT 752 mispriming. The dT regions can anneal randomly to points inside poly-A + tracks in RNA 753 templates, but also to degraded gDNA debris. When oligo dTs that anneal with mRNA 754 molecules fail to anchor by stop codons, single-end sequencing of their end-tagged 755 cDNA copies yields unmappable A/T run-ons. These poly-A + run-ons cannot be skipped 756 in single-end sequencing when using reversible terminator chemistries because each 757 base is called one at a time. However, with paired-ending, library templates are flipped 758 once "outer" barcode sequences have been collected, then sequencing is resumed from 759 the opposite ends outwards. This strategy is more likely to capture mappable 760 information from the "inner" ends of tagged cDNA fragments. 761 After trimming off the barcode and sequencing adapter from the sequenced reads, 762 paired-end short-read sequencing usually yields deplexed "inner" data aligned within 763 <500-bp from the cistromic 3' end of mRNA molecules. In that sense, paired-end short-764 read sequencing of scRNA-seq libraries does without full exome representation. Based 765 on GENCODE annotations, aligning only 500-bp 3'-end mRNA fragments allows 766 sampling the human exome with ~18.0 Mb total information, or 2.7-fold less nucleotides 767 than full-transcript sequencing (~47.9-Mb length, 1,350-bp full-transcript size average, 768 ~36,000 annotated genes) [93],. Furthermore, based on an average ~95% unique 769 mappability rate with ≤ 2 nt mismatches for the protein-coding exome using 100-kmers 770 [94], the total mappable reads required for unambiguous sampling of 3'-ended human 771 exomes can be reduced an additional 5-fold. Overall, these estimates project a Considering all these factors, it is somewhat unsurprising that genes encoding 799 destabilizing cis-regulatory elements in 3' UTRs would be harder to detect by scRNA-800 seq, as high mRNA turnover is at odds with maintaining high steady-state mRNA levels 801 necessary to capture them for cDNA synthesis [102]. Less appreciated is that since 3' Another example of the mapping challenges in scRNA-seq can be seen in the case 840 of FOXP3. Previously known as scurfin, FOXP3 is a critical transcription factor that similar extent, CD8+ T cells towards immunosuppressive roles (known as Tsups) 843 [56,59,65,67,70]. The 3' UTR of FOXP3 in the human reference genome is ~900 bp 844 long and contains one (CTGGGG) n single-repeat element. The repeat is located 845 roughly 650 bp upstream from the only poly-A + insertion site annotated for FOXP3. It 846 lies within 130 bp from the 3' end of the mRNA which extends the longest into the 3' 847 UTR (390 bp) among the Genbank-annotated FOXP3 transcripts. However, the single-848 repeat element within the 3' UTR of FOXP3, unlike those in the 3' UTR of CD56, can be 849 aligned uniquely from sequenced k-mers as small as 36 bp [94,103]. This feature 850 effectively rules out the possibility that low counts for FOXP3 in the PBMC 3K dataset 851 (11 individual cells, 1 UMI each) are due to high dropouts of reads from its 3' UTR that 852 are not uniquely mappable. 853 Instead, we propose that the low detection rate of FOXP3 transcripts may stem from 854 the ongoing regulation of FOXP3 mRNA stability in leukocytes. The 3' UTR of FOXP3 855 is targeted by SMHD1, a constitutively expressed ribonucleolytic enzyme endogenous 856 to immune cells in vivo, in particular resting T cells [104,105]. Critically, these findings 857 suggest that recognizing subsets of CD4+ Tregs or CD8+ Tsups in peripheral blood 858 would be extremely difficult based only on FOXP3 read counts from scRNA-seq data. 859 Instead, predicting FOXP3-positive cells would be best achieved by surveying scRNA-860 seq data for gene transcripts highly correlated with antibody-based detection of FOXP3 861 protein in circulating cells, such as CD25(IL2R) or CD127(IL7R) [56,59,65,67,68,70]. In 862 the PBMC 3K data, which is derived from peripheral blood from a healthy subject, this 863 alternative signature matches to small pockets within the CD4+ and CD8+ populations In developing the SALSA workflow, we sought to define a systematic approach that 867 enhances replicability and reproducibility in the analysis of scRNA-seq data. Currently, 868 most mainstream approaches resolve cell cluster from single cell data by testing the 869 whole contents of the gene-cell matrix; SALSA demonstrates that it is more 870 advantageous to prioritize facultative genes for inferential testing. Implementation of SALSA repurposes the structure of single-cell transcriptomics 885 data into a latent variable extraction problem. In return, this approach confers two major 886 advantages to scRNA-seq data analysis: it reduces the number of genes needed for 887 inferential testing and increases statistical robustness. By handling the data in this way, we introduce expression transformants that lend single-cell DEG extraction with 889 statistical compliance. In addition, with this approach we can utilise highly efficient and 890 widely available multivariate analyses algorithms that rely on linearity and 891 homoscedasticity assumptions, such as ANOVAs and hierarchical clustering. SALSA 892 bypasses the more common step of gene×cell expression matrix assembly where 893 empty gene×cell blocks are artificially filled in with zeros [34,37,43,52]. Considering that 894 these empty blocks amount to well over 90% of all possible gene×cell crossings in any 895 previously reported scRNA-seq benchmark datasets [15,36], doing without aggressively 896 zero-inflated gene×cell expression matrices generates smaller gene expression files. 897 When combined with highly efficient SVD-driven algorithms for sparse matrix imputation 898 and latent variable extraction, the SALSA workflow cuts down on the computational 899 footprint and memory allocation typically required for scRNA-seq data post-processing. 900 Critically, t [27,44,52,92,106]. hese statistical enhancements provide biomarker outputs 901 that are testable on the bench because SALSA prioritises genes that can be verified 902 using conventional validation approaches. 903 SALSA, like other recently reported tools for unguided bioinformatic inferencing of 904 hierarchical associations in biology [107,108], leverages the concepts behind latent 905 semantic analysis (LSA) methods: the concepts of "genes" and "single cells" found in 906 scRNA-seq data can be thought of as interchangeable with the concepts of "terms" and 907 "documents" in natural language processing algorithms [109 -112]. Both SALSA and 908 LSA perform eigenvalue optimization driven by explicit count data in ultra-sparse 909 matrices using "local" measures of relative frequency for gene/term counts in a 910 cell/document, such as UMI-per-thousand (UPT) or count-per-document total scaling. overall frequency of a gene/term vs. all order gene/terms detected anywhere. SALSA 913 and LSA differ in how they compile the preponderance of detected gene/term counts 914 into a useful statistical kernel. In LSA, "global" weights are used to adjust for the 915 incidence of terms throughout a known corpus of documents ahead of inferential 916 testing, with the inverse document frequency being the most commonly used "global" 917 weighting statistic. In the SALSA workflow, UPT values from a sequencing round are 918 estimated from counts of UMIs which must initially be deduplicated, disambiguated, and 919 ascribed to an unknown number of single cells. These values are inferred from a pool of 920 observed barcodes and then must be empirically transformed into linearized and 921 normally distributed metrics via generalized linear modelling [9,45,[113][114][115][116][117][118][119][120]. Critically, 922 such metrics are uniquely pertinent to each experimental dataset being analysed. It 923 follows that by retaining statistical idiosyncrasy for separate scRNA-seq runs, SALSA 924 imposes a replicability constraint: if a facultative gene is to score as a "true" DEG, it 925 must do so across multiple biological specimens despite batch effects. Additionally, a 926 "true" DEG must re-score as a consensus DEG free of replication bias once data from 927 any within-replicate DEGs are re-tested as an all-at-once ensemble. This concept will 928 be explored in more detail in future publications. 929 We show that using SALSA, we can isolate the portions harbouring the most 930 informative subset of DEGs in an expression matrix. Conversely, we show that non-931 facultative genes are detrimental to DEG extraction from both an experimental and 932 bioinformatics perspective. On one side, retaining low-coverage genes risks admitting 933 UMI data from contaminant or anecdotal templates, such as poly-A + tracks from gDNA 934 debris or ambient RNA in cell suspensions. On the other side, including ubiquitously 935 transcribed RNA reads in differential expression analysis, such as reads aligned to 936 rRNA or housekeeping gene mRNAs, biases statistical significance testing for genes 937 with higher total UMI counts but small log-fold expression differences. By introducing 938 matrix focusing into scRNA-seq data processing, the SALSA workflow prioritizes data 939 within the dynamic range of expression measurements. By doing so based on 940 parametric distribution fitting, statistical inferences on focused expression matrices yield 941 results that can be replicated systematically across groups using equivalent datasets. In of these profiles will exhibit a transcriptional signature driven by a core group of DEGs. 955 Using SALSA, we unveiled distinct cell profiles represented at comparable rates, and 956 driven by expressed gene modules unique to each cell; we termed these Profiler genes.

Parametric focusing of gene-cell expression matrices
A diagram of matrix focusing process is depicted in Fig. 1B

Implementation of Single-cell Amalgamation by Latent Semantic Analysis (SALSA)
A diagram of the SALSA statistical workflow is depicted in Fig. 1E Hochberg method for multiple comparisons [122]. All metrics and statistical analyses were carried out using JMP® 13.0.0 64-bit statistical software (SAS, Cary, NC).

Stratification of significantly expressed genes
Significant genes (SGs) were identified as those with significantly double-weighted ANOVA      Figure S1. Schematic of total UMI coverage bound estimates for single-cell barcode regimes using P C -P D mixture model parameters.  Tables   Table S1. List of 166 genes in the PBMC 3K dataset with multi-count UMI alignments (4 or more counts) among gene×cell data-positive expression matrix fields.