MAPPER: An Open-Source, High-Dimensional Image Analysis Pipeline Unmasks Differential Regulation of Drosophila Wing Features

Phenomics requires quantification of large volumes of image data, necessitating high-throughput image processing approaches. Existing image processing pipelines for Drosophila wings, a powerful genetic model for studying the genetics underlying a broad range of cellular and developmental processes, are limited in speed, precision, and functional versatility. To expand the utility of the wing as a phenotypic screening system, we developed MAPPER, an automated machine learning-based pipeline that quantifies high-dimensional phenotypic signatures, with each dimension quantifying a unique morphological feature of the Drosophila wing. MAPPER magnifies the power of Drosophila phenomics by rapidly quantifying subtle phenotypic differences in sample populations. We benchmarked MAPPER’s accuracy and precision in replicating manual measurements to demonstrate its widespread utility. The morphological features extracted using MAPPER reveal variable sexual dimorphism across Drosophila species and unique underlying sex-specific differences in morphogen signaling in male and female wings. Moreover, the length of the proximal-distal axis across species and sexes shows a conserved scaling relationship with respect to wing size. In sum, MAPPER is an open-source tool for rapid, high-dimensional analysis of large imaging datasets. These high-content phenomic capabilities enable rigorous and systematic identification of genotype-to-phenotype relationships in a broad range of screening and drug testing applications and amplify the potential power of multimodal genomic approaches.

to plant/leaf evolution and shape analysis of organs (4). We used Elliptic Fourier Descriptors (EFDs) as an alternative, robust, translation- and rotation-invariant representation of wing shape. EFDs are found by fitting a Fourier series to the periodic function obtained from the closed peripheral contour of the Drosophila wing. The coefficients of the Fourier series act as features, as each of them carries a local shape property. The original EFD formulation is a scale-, translation-, and rotation-invariant shape descriptor (4). We modified the original algorithm so that it is sensitive to size changes. Details about this implementation can be found in the design section of the main text. To find the appropriate number of terms for representing the closed contour of the Drosophila wing blade, we varied the number of harmonics and measured the error between the EFD reconstruction of the wing and the actual boundary points (Figure S5). Boundary points were extracted from the segmentation mask generated by U-Net/ILASTIK. As the number of terms increases, the error decreases and saturates to 0 due to overfitting. Further manual inspection showed that 20 harmonics are sufficient for representing any wing blade. In summary, EFDs allow us to measure specific local changes within the wing corresponding to a genetic or pharmacological perturbation.
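The harmonic-selection procedure described above can be illustrated with a minimal sketch. It is not MAPPER's implementation: it assumes the standard elliptic Fourier coefficient formulas for a closed polygon, simply skips the size normalization step so that the coefficients remain sensitive to scale, and uses a synthetic toy contour in place of a segmented wing boundary. The function names (`efd_fit`, `efd_reconstruct`, `reconstruction_error`) and the toy contour are hypothetical.

```python
import numpy as np

def efd_fit(contour, n_harmonics):
    """Elliptic Fourier coefficients of a closed polygon, without normalization.

    contour: (K, 2) boundary points, first point not repeated and no
    consecutive duplicates (assumption). Row n-1 of coeffs is (a_n, b_n, c_n, d_n).
    """
    closed = np.vstack([contour, contour[:1]])        # close the loop
    d = np.diff(closed, axis=0)                       # segment vectors
    dt = np.hypot(d[:, 0], d[:, 1])                   # segment lengths
    t = np.concatenate([[0.0], np.cumsum(dt)])        # arc length at each vertex
    T = t[-1]                                         # perimeter
    n = np.arange(1, n_harmonics + 1)
    phi = 2.0 * np.pi * np.outer(n, t) / T            # shape (N, K + 1)
    dcos = np.diff(np.cos(phi), axis=1)               # shape (N, K)
    dsin = np.diff(np.sin(phi), axis=1)
    scale = T / (2.0 * np.pi**2 * n**2)
    ux, uy = d[:, 0] / dt, d[:, 1] / dt               # unit tangents per segment
    coeffs = scale[:, None] * np.column_stack(
        [dcos @ ux, dsin @ ux, dcos @ uy, dsin @ uy])
    # Zeroth-order term: arc-length-weighted mean position of the polygon.
    mid = 0.5 * (closed[:-1] + closed[1:])
    centre = (mid * dt[:, None]).sum(axis=0) / T
    return coeffs, centre, t[:-1], T

def efd_reconstruct(coeffs, centre, t, T):
    """Evaluate the truncated Fourier series at arc-length positions t."""
    n = np.arange(1, len(coeffs) + 1)
    phi = 2.0 * np.pi * np.outer(n, t) / T
    cos, sin = np.cos(phi), np.sin(phi)
    x = centre[0] + coeffs[:, 0] @ cos + coeffs[:, 1] @ sin
    y = centre[1] + coeffs[:, 2] @ cos + coeffs[:, 3] @ sin
    return np.column_stack([x, y])

def reconstruction_error(contour, n_harmonics):
    """Mean distance between boundary points and their EFD reconstruction."""
    coeffs, centre, t, T = efd_fit(contour, n_harmonics)
    approx = efd_reconstruct(coeffs, centre, t, T)
    return float(np.mean(np.linalg.norm(approx - contour, axis=1)))

# Toy closed contour standing in for a wing boundary (not real data).
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
r = 1.0 + 0.15 * np.cos(3 * theta)
toy_contour = np.column_stack([2.0 * r * np.cos(theta), r * np.sin(theta)])

errors = {n: reconstruction_error(toy_contour, n) for n in (1, 5, 20)}
```

Because no size normalization is applied in this sketch, scaling the contour by a constant scales every coefficient by the same constant, which is one simple way to keep the descriptor sensitive to size.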

Full documentation and a user manual for MAPPER are available on GitHub. The user manual provides guidelines for file formats, folder organization, image pre-processing, input parameter specifications, user operations, troubleshooting inquiries, and understanding MAPPER's output.

S5 Statistical Analysis of Samarkand strain Drosophila wings
We processed 128 adult wing images of Drosophila melanogaster from the Samarkand strain (Figure 5, Figure S10) (5). In this validation test, MAPPER was used to highlight the shape and size differences between the male and female populations. Geometric features and EFD features were analyzed separately to highlight the application of EFDs in estimating local shape changes in the wing. Principal Component Analysis (PCA) (6) carried out on the geometric features revealed that most of the variance in the data was captured by the first two principal components (89.4%) (Figure 5D, Figure S10A, S10B). PCA also revealed that the total area of the wing and total trichome density had the maximum loadings on Principal Component 1 (PC1), with total wing area having a negative PC1 loading towards the female population and total trichome density having a positive PC1 loading towards the male population (Figure S10B, S10C). Interestingly, the distance between longitudinal veins L3 and L4, d(L3-L4), did not correlate with other wing features for either population (Figure S10B). Further, d(L3-L4) did not have a positive or negative loading for PC1, which explained most of the variance in the data. This contrasts with the other biological axis measurements, the AP and PD lengths, which had negative PC1 loadings towards the female population. The differences between males and females were further characterized by plotting the known result that the wings of Drosophila females are larger than those of males (Figure 5E, Figure S10E). PCA and comparison of the standardized total wing areas confirmed this (p < 0.001). Conversely, males had a larger standardized total trichome density (Figure S10F, p < 0.001), as revealed by both direct comparison and PCA.

Taken together, this suggests that the high-dimensional data provided by MAPPER, analyzed with PCA, can reveal features that differ between populations. Clustering carried out using the first two principal components revealed the presence of two distinct clusters representing the male and female populations (Figure 5D). Further, t-distributed stochastic neighbor embedding (t-SNE) analysis revealed similar clustering distinctions between the male and female populations (Figure S10D). High-dimensional analyses, such as PCA and t-SNE, allow for systematic screening of phenotypic changes between two different populations and can even be extended to multiple populations.

A similar approach was followed in the analysis of the EFD-based features. PCA was applied to the EFD coefficients extracted for each wing. In this case, about 97% of the total variance was captured by PC1 alone (Figure 5F). This indicates that the variation was mainly distributed along one direction of linear combinations of the EFD coefficients. Thus, the high-dimensional output of MAPPER coupled with PCA can reveal wing shape features that are distinct between two populations. To investigate the importance of PC1 for wing geometry, we implemented reverse PCA on the principal components. Reverse PCA is carried out by adding the mean vector of the features to the matrix product of the PCA projections (scores) and the transpose of the eigenvectors (Equation 1). This process enables mapping of the influence of principal components back to the original data.

$$\hat{X} = S\,W^{\top} + \mu \qquad (1)$$

Here, $S$ is the matrix of PCA scores, $W$ is the matrix of eigenvectors, and $\mu$ is the mean vector of the features.

In our analysis, the first eleven principal components were used to reconstruct the EFD coefficients via reverse PCA. Since we are interested in understanding the importance of PC1, which explained most of the variance in the data, the standard deviation along this PC was calculated. Mean scores of the first 11 PCs were then used to reconstruct a representative wing for the male and female populations (Figure 5G). We then added and subtracted 1.5 times the calculated standard deviation along PC1 to and from the representative wing to observe the effect of PC1 on wing shape. Reverse PCA was then used to approximate the EFD coefficients with varying PC1, and reverse EFD was carried out to estimate the contour of the wing blade. The reconstructed contours highlight that the major differences between the wings are mainly due to an overall change in wing blade area (Figure 5H). Nonetheless, the reconstructed contour can be used as a mean representation of the samples. This approach is useful when analyzing subtle changes in shape resulting from dysregulation of genetic pathways. These subtle differences are further characterized by clustering using Gaussian Mixture Models (GMMs) (7), which reveals the presence of two distinct clusters (Figure 5F). Both of these high-dimensional plots reveal that there are unique and identifiable distinctions between the male and female populations with regard to wing shape.
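The reverse PCA step can be sketched as follows. This is an illustrative example under stated assumptions rather than the code used in the study: PCA is computed with a singular value decomposition in NumPy, the feature matrix is randomly generated toy data standing in for EFD coefficients (64 hypothetical wings per group, 80 features), and the names `pca_fit` and `reverse_pca` are hypothetical. Only the use of the first eleven components and the shift of 1.5 standard deviations along PC1 are taken from the text.

```python
import numpy as np

def pca_fit(X):
    """PCA via SVD. Returns the feature mean, eigenvectors (as columns),
    scores, and the fraction of variance explained by each component."""
    mu = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt.T
    scores = (X - mu) @ W
    explained = s**2 / np.sum(s**2)
    return mu, W, scores, explained

def reverse_pca(scores, W, mu, k):
    """Map scores back to feature space using the first k components:
    scores times the transposed eigenvectors, plus the mean vector."""
    return scores[..., :k] @ W[:, :k].T + mu

# Toy data standing in for EFD coefficients (not real measurements).
rng = np.random.default_rng(0)
n_per_group, n_features, k = 64, 80, 11
direction = rng.normal(size=n_features)
female = rng.normal(size=(n_per_group, n_features)) + 5.0 * direction
male = rng.normal(size=(n_per_group, n_features)) - 5.0 * direction
X = np.vstack([female, male])

mu, W, scores, explained = pca_fit(X)

# Representative wing per group: reverse PCA of the group's mean scores.
female_mean_scores = scores[:n_per_group].mean(axis=0)
female_repr = reverse_pca(female_mean_scores, W, mu, k)

# Shift the representative by +/- 1.5 standard deviations along PC1.
sd_pc1 = scores[:, 0].std(ddof=1)
plus, minus = female_mean_scores.copy(), female_mean_scores.copy()
plus[0] += 1.5 * sd_pc1
minus[0] -= 1.5 * sd_pc1
female_plus = reverse_pca(plus, W, mu, k)
female_minus = reverse_pca(minus, W, mu, k)
```

In a full workflow, the reconstructed coefficient vectors would then be passed to an inverse EFD step to obtain wing contours. Using all components instead of the first eleven recovers the original feature matrix exactly, which is a convenient sanity check.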

S6 Tests for measuring statistical significance
One-way analysis of variance (ANOVA) was first used to test whether the means of the compared groups were equal (8). However, ANOVA alone cannot be used to comment on the statistical significance of comparisons between any two subgroups. Therefore, we further used a multiple group comparison test on the statistics generated by ANOVA. Tukey's honestly significant difference procedure was used for this task to identify differences in means among all subgroups (9). The Bonferroni-Holm correction was then applied to the p-values generated by the previous test in order to adjust for Type I error in statistical testing (10). All the steps were carried out using MATLAB. Additionally, a Bartlett test was used to compare the variances between any

The slope parameter of the fit was found to be statistically different from a value of 1.00 (p = 6.7 × 10⁻³).
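One way such a comparison can be made is sketched below. The excerpt does not state the type of fit or the test that was used, so this example assumes an ordinary least-squares line and a two-sided t-test of the slope against a reference value of 1, computed with SciPy. The data are synthetic, and the function name `slope_vs_reference` is hypothetical.

```python
import numpy as np
from scipy import stats

def slope_vs_reference(x, y, reference=1.0, confidence=0.95):
    """Two-sided t-test of H0: slope == reference for a least-squares line,
    together with a confidence interval for the slope."""
    fit = stats.linregress(x, y)
    dof = len(x) - 2                                  # residual degrees of freedom
    t_stat = (fit.slope - reference) / fit.stderr
    p_value = 2.0 * stats.t.sf(abs(t_stat), dof)
    half_width = stats.t.ppf(0.5 + confidence / 2.0, dof) * fit.stderr
    return fit.slope, p_value, (fit.slope - half_width, fit.slope + half_width)

# Synthetic data with a true slope of 0.8 (assumption for illustration only).
rng = np.random.default_rng(1)
x_toy = np.linspace(0.0, 1.0, 50)
y_toy = 0.2 + 0.8 * x_toy + rng.normal(scale=0.05, size=x_toy.size)

slope, p_value, ci = slope_vs_reference(x_toy, y_toy)
```

If the confidence interval excludes the reference value, the two-sided test at the matching significance level rejects equality of the slope with that value.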

The 95% CI of the slope parameter is