This document is part of the supplementary material of “A practical guide to descriptive and statistical analysis of R. solanacearum infection data using R”. In this document, datasets from Lowe, et al., MPMI, 2015 and Lowe-Power, et al., mBio, 2016 are prepared for analysis. The two datasets are analyzed separately in parts II and III, respectively.
The disease index score is based on the severity of symptoms. A symptom read-out (usually “% wilted leaves”) is turned into a numeric score. This score ranges from 0 to 4, and each number reflects a certain interval of disease severity. The value of the disease index can never be below 0 or above 4, and it takes only whole-number values. Therefore, the disease index is not a continuous response variable, but an ordinal one.
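As a minimal sketch of how such a score could be derived, the snippet below bins a “% wilted leaves” read-out into a 0-4 index. The cut-offs (0, 25, 50, 75, 100) and the function name `score_disease_index` are assumptions made for illustration; the scale actually used for the published datasets should be taken from the original papers.

```r
# Hypothetical cut-offs (assumption, not taken from the datasets):
# 0 = no wilting, 1 = up to 25 %, 2 = up to 50 %, 3 = up to 75 %, 4 = up to 100 %
score_disease_index <- function(pct_wilted) {
  stopifnot(all(pct_wilted >= 0 & pct_wilted <= 100, na.rm = TRUE))
  # cut() with labels = FALSE returns the interval number (1..5); shift to 0..4
  cut(pct_wilted, breaks = c(-Inf, 0, 25, 50, 75, 100), labels = FALSE) - 1L
}

pct <- c(0, 10, 25, 40, 75, 100)
di  <- score_disease_index(pct)
di  # 0 1 1 2 3 4

# Storing the score as an ordered factor makes its ordinal nature explicit
di_ordinal <- factor(di, levels = 0:4, ordered = TRUE)
```

Keeping the score as an ordered factor is one way to prevent it from being treated as a continuous measurement by accident.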
Specify the working directory (if running from the console).
###Install all required packages:
#install.packages(c("tidyr","dplyr","ggplot2","MESS","lme4","lmerTest","multcomp","survival","psm","coxme","stargazer","survcomp","rms","modelr","party","stringr"))
###Define Working Directory and set it
###Note for the Markdown version: R-Markdown cannot set the working directory
###R markdown will always use the directory the .Rmd file is located in
###In the .Rmd file this code section is not actually evaluated and only serves illustrative purposes.
wd <- c("~/My_Data/DataDirectory/")
setwd(wd)
Specify the input tables.
###Name of the file to be read
mpmi2015 <- c("S3_LOWE_MPMI_2015.csv")
mBio2016 <- c("S3_Lowe-Power_mBio_2016.csv")
These datasets from the research group of Caitilyn Allen at the University of Wisconsin have previously been published.
The data can be read into R and stored in data frames. Here, the data frames are called di_mpmi and di_mbio.
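To illustrate what the `sep` and `dec` arguments do during import, here is a small sketch on an invented two-row table. The `Weight` column and its values are made up purely to show the decimal comma; they are not part of the published datasets.

```r
# Toy input (invented): ';' separates fields and ',' is the decimal mark
toy_csv <- "DPI;A;B;Strain;Weight\n1;0;0;GMI1000;1,5\n2;0;1;GMI1000;2,25"

toy <- read.table(text = toy_csv, header = TRUE,
                  sep = ";",   # field separator
                  dec = ",",   # decimal mark
                  stringsAsFactors = TRUE)
str(toy)
```

`read.csv2()` uses `sep = ";"` and `dec = ","` as its defaults, so it could serve as a shorthand for files in this format.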
di_mpmi <- read.table(mpmi2015, header=TRUE,
  sep=";", ###Sets the separator of the csv file
  dec=",", ###Sets the decimal operator of the csv file
  stringsAsFactors=TRUE) ###Strain is read as a factor; contrasts are set on it later
di_mbio <- read.table(mBio2016, header=TRUE,
  sep=";", ###Sets the separator of the csv file
  dec=",", ###Sets the decimal operator of the csv file
  stringsAsFactors=TRUE) ###Strain is read as a factor; contrasts are set on it later
Using str() one can see if the tables were properly imported.
str(di_mpmi)
## 'data.frame': 164 obs. of 19 variables:
## $ DPI : int 1 2 3 4 5 6 7 8 9 10 ...
## $ A : int 0 0 0 0 0 0 1 2 3 4 ...
## $ B : int 0 0 0 0 0 2 4 4 4 4 ...
## $ C : int 0 0 0 0 2 4 4 4 4 4 ...
## $ D : int 0 0 0 0 0 0 0 0 0 0 ...
## $ E : int 0 0 0 0 0 0 2 4 3 4 ...
## $ F : int 0 0 0 0 0 1 1 4 4 4 ...
## $ G : int 0 0 0 0 0 4 4 4 4 4 ...
## $ H : int 0 0 0 0 0 0 0 0 1 2 ...
## $ I : int 0 0 0 0 0 3 4 4 4 4 ...
## $ J : int 0 0 0 2 3 4 4 4 4 4 ...
## $ K : int 0 0 0 0 0 0 0 0 2 3 ...
## $ L : int 0 0 0 0 0 0 4 4 4 4 ...
## $ M : int 0 0 0 0 1 3 4 4 4 4 ...
## $ N : int 0 0 0 0 0 0 1 3 3 4 ...
## $ O : int NA NA NA NA NA NA NA NA NA NA ...
## $ P : int NA NA NA NA NA NA NA NA NA NA ...
## $ Strain: Factor w/ 3 levels "","GMI1000","GMI1000_fcsmut": 2 2 2 2 2 2 2 2 2 2 ...
## $ Batch : int 1 1 1 1 1 1 1 1 1 1 ...
str(di_mbio)
## 'data.frame': 280 obs. of 13 variables:
## $ DPI : int 1 2 3 4 5 6 7 8 9 10 ...
## $ A : int 0 0 0 0 0 0 0 0 0 0 ...
## $ B : int 0 0 0 0 0 0 0 1 3 4 ...
## $ C : int 0 0 0 0 0 1 2 4 4 4 ...
## $ D : int 0 0 0 0 0 1 1 4 4 4 ...
## $ E : int 0 0 0 0 0 0 0 0 0 0 ...
## $ F : int 0 0 0 0 0 0 0 0 0 0 ...
## $ G : int 0 0 0 0 0 0 0 0 0 0 ...
## $ H : int 0 0 0 0 0 0 0 0 0 0 ...
## $ I : int 0 0 0 3 4 4 4 4 4 4 ...
## $ J : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Strain: Factor w/ 5 levels "comp_nagGH","comp_nagOP",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Batch : int 1 1 1 1 1 1 1 1 1 1 ...
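These tables are in wide format, with one column per plant. They need to be converted to long format, with one row per observation (one plant on one day).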
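What the wide-to-long conversion does can be sketched on an invented three-day table with two plants. This version uses base R only so that it runs without extra packages; it is an illustration of the reshaping idea, not a replacement for the `gather()` call used in this document.

```r
# Toy wide table (invented values): one column per plant
wide <- data.frame(DPI = 1:3, A = c(0, 1, 4), B = c(0, 0, 2),
                   Strain = "GMI1000", Batch = 1)
plant_cols <- c("A", "B")

# Repeat the identifying columns once per plant, then stack the plant columns
long <- cbind(
  wide[rep(seq_len(nrow(wide)), times = length(plant_cols)),
       c("DPI", "Strain", "Batch")],
  Plant = rep(plant_cols, each = nrow(wide)),
  DI    = unlist(wide[plant_cols], use.names = FALSE)
)
rownames(long) <- NULL
long
```

The result has one row per plant and day, with the former column name stored in `Plant` and the score stored in `DI`.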
library("tidyr")
mpmi2015_long <- gather(di_mpmi, key=Plant,
  value=DI,
  c(A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P),
  na.rm=F)
mbio2016_long <- gather(di_mbio, key=Plant,
  value=DI,
  c(A, B, C, D, E, F, G, H, I, J),
  na.rm=F)
Now, it is time to generate the “subject” variable, unique for each subject. A unique identifier can be generated by combining information from Strain, Plant and Batch.
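The behaviour of `interaction()` when building such an identifier can be sketched on a toy data frame. The strain name `mutant` is a placeholder. Note that `interaction()` creates a level for every possible combination, including ones that never occur in the data, which is why a subject factor can have more levels than there are actual plants.

```r
toy <- data.frame(Strain = c("GMI1000", "GMI1000", "mutant", "mutant"),
                  Plant  = c("A", "A", "A", "A"),
                  Batch  = c(1, 2, 1, 1))

# One level per Strain x Plant x Batch combination (2 x 1 x 2 = 4 levels)
toy$subject <- interaction(toy$Strain, toy$Plant, toy$Batch)
nlevels(toy$subject)       # 4

# Only three combinations are actually observed
toy$subject_used <- droplevels(toy$subject)
nlevels(toy$subject_used)  # 3
```

Passing `drop = TRUE` to `interaction()` would remove the unused levels directly.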
mpmi2015_long$subject <- interaction(mpmi2015_long$Strain,
mpmi2015_long$Plant,
mpmi2015_long$Batch)
mbio2016_long$subject <- interaction(mbio2016_long$Strain,
mbio2016_long$Plant,
mbio2016_long$Batch)
###Factorize batch information
mpmi2015_long$Batch <- as.factor(mpmi2015_long$Batch)
mbio2016_long$Batch <- as.factor(mbio2016_long$Batch)
A crucial step that will influence all outputs of the statistical analysis is setting the “contrasts”. Contrasts specify the “baseline” for each of the variables. “Treatment” contrasts specify that the first level of a factor (by default the first in alphabetical order) will be used as a reference for all others (see Strain below), while a “sum” contrast means that the reference value is the mean across all levels of that variable.
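The difference between the two coding schemes can be inspected directly with the base R functions `contr.treatment()` and `contr.sum()`. The sketch below uses a made-up three-level factor; `mutA` and `mutB` are placeholder names.

```r
strain <- factor(c("GMI1000", "mutA", "mutB"))

# Treatment coding: the first level is the all-zero row, i.e. the reference
ct <- contr.treatment(levels(strain))
ct

# Sum coding: columns sum to zero, so effects are deviations from the overall mean
cs <- contr.sum(levels(strain))
cs

# The reference of a treatment contrast can be changed by reordering the levels
strain2 <- relevel(strain, ref = "mutA")
levels(strain2)
```

If the first alphabetical level is not a sensible baseline, `relevel()` is one way to choose another before the contrasts are set.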
####Specify what should be "appropriate" contrasts
###Note: "contr.Treatment" (capital T) is provided by the car package, which must be loaded when models are fit; the base R equivalent is "contr.treatment"
contrasts(mpmi2015_long$Strain) <- "contr.Treatment" ###First alphabetical strain will be the baseline
contrasts(mbio2016_long$Strain) <- "contr.Treatment" ###First alphabetical strain will be the baseline
contrasts(mpmi2015_long$Batch) <- "contr.sum" ###Batches will be averaged to generate the baseline!
contrasts(mbio2016_long$Batch) <- "contr.sum"
The code below adds the variable that marks which observations are relevant for disease development.
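The windowing rule applied in the loops that follow can be illustrated on a single plant. The helper `flag_useful` is a hypothetical name introduced for this sketch, and the time course is invented. It marks the readings from the last day with DI = 0 up to the first subsequent day with DI = 4, which is the rule described in the code comments of this section.

```r
# Flag the informative window of one subject's time course:
# from the last DI == 0 reading up to the first DI == 4 reading after it.
flag_useful <- function(dpi, di) {
  start <- if (any(di == 0)) max(dpi[di == 0]) else min(dpi)
  after <- dpi >= start
  end   <- if (any(di[after] == 4)) min(dpi[after & di == 4]) else max(dpi)
  ifelse(dpi >= start & dpi <= end, "Yes", "No")
}

# Invented time course: symptom-free until day 6, dead from day 10 onwards
dpi <- 1:12
di  <- c(0, 0, 0, 0, 0, 0, 1, 2, 3, 4, 4, 4)
useful <- flag_useful(dpi, di)
data.frame(dpi, di, useful)
```

For a plant that never wilts, only the final reading is flagged, because every earlier day is followed by another DI = 0 reading.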
##Censoring for both datasets.
di_long <- na.omit(mpmi2015_long)
###Slightly modified with as.numeric
oldw <- getOption("warn")
options(warn = -1)
###Interesting R-related observation, the below does not work when subsetting is done with filter(),
###because filter does not retain rownames!
for (i in 1:max(as.numeric(di_long$subject))) { ###Go by subject
  ###Create a first dummy object, a subset of the full data containing the current subject
  dummy1 <- di_long[as.numeric(di_long$subject)==i,]
  if (min(dummy1$DI) == 0) {
    ###Remove those observations that are before disease onset, except the one directly before disease onset.
    dummy1 <- dummy1[dummy1$DPI %in% (max(dummy1$DPI[dummy1$DI==0]):max(dummy1$DPI)),]
  }
  if (max(dummy1$DI) == 4) { ###If this subject dies at some point
    ###Make a new dummy object that only contains those recordings where DI=4
    dummy2 <- dummy1[dummy1$DI==4,]
    ###Generate data subset "NEW", which contains those observations for a subject that lie between (and include) the last recording where DI=0 and the first recording where DI=4.
    NEW <- di_long[as.numeric(di_long$subject)==i & (di_long$DPI %in% min(dummy1$DPI):min(dummy2$DPI)),]
  } else { ###If this subject does not die
    NEW <- dummy1 ###NEW is the same as dummy1
  }
  ###All rows that are part of the "NEW" object are useful and receive status "Yes" in column "Useful"
  di_long$Useful[rownames(di_long) %in% rownames(NEW)] <- c("Yes")
}
###Those that are not "Yes" (and therefore are NA) become "No"
di_long$Useful[which(is.na(di_long$Useful))] <- c("No")
rm(dummy1,dummy2)
mpmi2015_long <- di_long
##Same for mBio dataset
di_long <- na.omit(mbio2016_long)
###Interesting R-related observation, the below does not work when subsetting is done with filter(),
###because filter does not retain rownames!
for (i in 1:max(as.numeric(di_long$subject))) { ###Go by subject
  ###Create a first dummy object, a subset of the full data containing the current subject
  dummy1 <- di_long[as.numeric(di_long$subject)==i,]
  if (min(dummy1$DI) == 0) {
    ###Remove those observations that are before disease onset, except the one directly before disease onset.
    dummy1 <- dummy1[dummy1$DPI %in% (max(dummy1$DPI[dummy1$DI==0]):max(dummy1$DPI)),]
  }
  if (max(dummy1$DI) == 4) { ###If this subject dies at some point
    ###Make a new dummy object that only contains those recordings where DI=4
    dummy2 <- dummy1[dummy1$DI==4,]
    ###Generate data subset "NEW", which contains those observations for a subject that lie between (and include) the last recording where DI=0 and the first recording where DI=4.
    NEW <- di_long[as.numeric(di_long$subject)==i & (di_long$DPI %in% min(dummy1$DPI):min(dummy2$DPI)),]
  } else { ###If this subject does not die
    NEW <- dummy1 ###NEW is the same as dummy1
  }
  ###All rows that are part of the "NEW" object are useful and receive status "Yes" in column "Useful"
  di_long$Useful[rownames(di_long) %in% rownames(NEW)] <- c("Yes")
}
###Those that are not "Yes" (and therefore are NA) become "No"
di_long$Useful[which(is.na(di_long$Useful))] <- c("No")
rm(dummy1,dummy2)
mbio2016_long <- di_long
options(warn = oldw)
library("broom")
str(mpmi2015_long)
## 'data.frame': 2216 obs. of 7 variables:
## $ DPI : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Strain : Factor w/ 3 levels "","GMI1000","GMI1000_fcsmut": 2 2 2 2 2 2 2 2 2 2 ...
## ..- attr(*, "contrasts")= chr "contr.Treatment"
## $ Batch : Factor w/ 6 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "contrasts")= chr "contr.sum"
## $ Plant : chr "A" "A" "A" "A" ...
## $ DI : int 0 0 0 0 0 0 1 2 3 4 ...
## $ subject: Factor w/ 288 levels ".A.1","GMI1000.A.1",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Useful : chr "No" "No" "No" "No" ...
## - attr(*, "na.action")=Class 'omit' Named int [1:408] 161 162 163 164 325 326 327 328 489 490 ...
## .. ..- attr(*, "names")= chr [1:408] "161" "162" "163" "164" ...
tidy(mpmi2015_long) ###This can be used to assess the descriptive statistics of the data.
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## column n mean sd median trimmed mad min max
## 1 DPI 2216 7.283394 3.9354859 7 7.251973 4.4478 1 14
## 2 Strain* 2216 2.490975 0.5000314 2 2.488726 0.0000 2 3
## 3 Batch* 2216 3.469314 1.6069703 4 3.461669 1.4826 1 6
## 4 Plant* 2216 NaN NA NA NaN NA Inf -Inf
## 5 DI 2216 1.965704 1.9246991 2 1.957159 2.9652 0 4
## 6 subject* 2216 141.037906 78.3927509 148 142.549605 100.8168 2 270
## 7 Useful* 2216 NaN NA NA NaN NA Inf -Inf
## range skew kurtosis se
## 1 13 0.03461304 -1.174779 0.08360139
## 2 1 0.03608253 -1.999600 0.01062215
## 3 5 -0.12466164 -1.101610 0.03413681
## 4 -Inf NA NA NA
## 5 4 0.03913449 -1.934082 0.04088631
## 6 268 -0.18866940 -1.149750 1.66529442
## 7 -Inf NA NA NA
str(mbio2016_long)
## 'data.frame': 2800 obs. of 7 variables:
## $ DPI : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Strain : Factor w/ 5 levels "comp_nagGH","comp_nagOP",..: 5 5 5 5 5 5 5 5 5 5 ...
## ..- attr(*, "contrasts")= chr "contr.Treatment"
## $ Batch : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "contrasts")= chr "contr.sum"
## $ Plant : chr "A" "A" "A" "A" ...
## $ DI : int 0 0 0 0 0 0 0 0 0 0 ...
## $ subject: Factor w/ 200 levels "comp_nagGH.A.1",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Useful : chr "No" "No" "No" "No" ...
tidy(mbio2016_long) ###This can be used to assess the descriptive statistics of the data.
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## column n mean sd median trimmed mad min max
## 1 DPI 2800 7.500000 4.031849 7.5 7.500000 5.1891 1 14
## 2 Strain* 2800 3.000000 1.414466 3.0 3.000000 1.4826 1 5
## 3 Batch* 2800 2.500000 1.118234 2.5 2.500000 1.4826 1 4
## 4 Plant* 2800 NaN NA NA NaN NA Inf -Inf
## 5 DI 2800 1.556786 1.877546 0.0 1.445982 0.0000 0 4
## 6 subject* 2800 100.500000 57.744618 100.5 100.500000 74.1300 1 200
## 7 Useful* 2800 NaN NA NA NaN NA Inf -Inf
## range skew kurtosis se
## 1 13 0.0000000 -1.213584 0.07619478
## 2 4 0.0000000 -1.301214 0.02673090
## 3 3 0.0000000 -1.361171 0.02113263
## 4 -Inf NA NA NA
## 5 4 0.4611584 -1.718249 0.03548228
## 6 199 0.0000000 -1.201345 1.09127070
## 7 -Inf NA NA NA
write.csv(mbio2016_long, file = "S3_mBio2016_long.csv")
write.csv(mpmi2015_long, file = "S3_mpmi2015_long.csv")
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] broom_0.4.2 tidyr_0.6.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.9 knitr_1.15.1 magrittr_1.5 mnormt_1.5-5
## [5] lattice_0.20-34 R6_2.2.0 stringr_1.2.0 plyr_1.8.4
## [9] dplyr_0.5.0 tools_3.3.2 parallel_3.3.2 grid_3.3.2
## [13] nlme_3.1-131 psych_1.6.12 DBI_0.5-1 htmltools_0.3.5
## [17] yaml_2.1.14 lazyeval_0.2.0 assertthat_0.1 rprojroot_1.2
## [21] digest_0.6.12 tibble_1.2 reshape2_1.4.2 evaluate_0.10
## [25] rmarkdown_1.3 stringi_1.1.2 backports_1.0.5 foreign_0.8-67