This document is part of the supplementary material of “A practical guide to descriptive and statistical analysis of R. solanacearum infection data using R”. In this document, the analysis of the Lowe_mBio_2016 dataset is performed, based on the formatted file generated in section I of material S3.
The below is retained so packages can be installed if necessary.
###Install all required packages:
#install.packages(c("MESS","lme4","lmerTest","multcomp","survival","rms","coxme","stargazer","survcomp","tidyverse","rcompanion"))
###Define Working Directory and set it
###Note for the Markdown version: R-Markdown cannot set the working directory
###R markdown will always use the directory the .Rmd file is located in
###In the .Rmd file this code section is not actually evaluated and only serves illustratory purpose.
wd <- c("~/My_Data/DataDirectory/")
setwd(wd)Import the mBio2016 long table. This was generated in part I.
###Name of the file to be read
di_long <- read.table("S3_mBio2016_long.csv", sep=",", header=T)
di_long$DPI <- di_long$DPI
###Redifine the variables
di_long$Batch <- as.factor(di_long$Batch)
contrasts(di_long$Strain) <- "contr.Treatment"###First alphabetical strain will be the baseline!ant) <- "contr.Treatment"###First alphabetical plant will be the baseline!
contrasts(di_long$Batch) <- "contr.sum" ###Batches will be averaged to generate the baseline!Everything from here on out is very similar to the analysis performed on sim_data.csv. Changes are explained. Change in below code: Removed “Plant” from group_by, since Plant here has a different meaning than in sim_data.csv (here it is a batch specific subject, in the other dataset it is a plant genotype)
library("dplyr")di_summary <- di_long %>% group_by(Strain, Batch, DPI) %>%
summarise(mean(DI),sd(DI),sd(DI)/sqrt(length(DI))) ##calculate within Batch mean, sd, and se for each Plant/Strain combination.
##Averages within each replicate. This summary table is mainly helpful for plotting, not used for analysis...
colnames(di_summary) <- c("Strain", "Batch", "DPI", "mean", "sd", "se") ###Assign correct columnnamesUsing these summaries, one can take a look at the averaged disease progression, using
library("ggplot2")ggplot(filter(di_summary, Batch==di_summary$Batch[[1]])) + ###Use Only Batch A (too busy otherwise)
aes(x=DPI,y=mean,color=Strain) + ###Color by Strain, specify x and y.
geom_area(aes(fill=Strain),position="identity",alpha=0.15) + ###Area plot, colored by Strain
##geom_errorbar(aes(x=DPI, ymax=mean+se, ymin=mean-se), size=0.25)+ ###This line adds SE.
##geom_errorbar(aes(x=DPI, ymax=mean+sd, ymin=mean-sd), size=0.25)+ ###This line adds SD. Don't use both.
facet_wrap(~Strain) + ###One plot per strain
labs(x = "Days post infection", y = "Avg. Disease Index") + #Labels
ggtitle("Figure 1\nDisease Areas,\nper strain, for batch A") #TitleHere, from a plot of the Areas, it seems these strains are not obviously different regarding the disease.
Change: Modified subject selection, as here the subjects are based on interaction, hence have character names. Each level of this corresponds to a number, so this can be used for iteration in the loop below.
library("MESS")####Build a table of AUDCPs, per subject
auc_df <- data.frame() ###Make auc_df data frame
for (i in 1:max(as.numeric(di_long$subject))) { ##Go by subject
temp <- di_long[as.numeric(di_long$subject)==i,] ###Subset full table into the subject table
temp <- droplevels(temp) ###Drop levels, so levels works properly below
auc_df[i,1] <- i ###Subject number
auc_df[i,2] <- levels(temp$Strain) ###Strain
auc_df[i,3] <- levels(temp$Plant) ###Plant
auc_df[i,4] <- levels(temp$Batch) ###Batch
auc_df[i,5] <- auc(temp$DPI,temp$DI) ###AUC; i assume that trapezoid rule is fine here.
### Additionally, auc calculation starts with the lowest x (DPI). I think this is sensible
### I assume that if one specifies "from=0", the curve is expanded by a triangle that covers the range from
### 0 to whatever is the value at the first observation. I think the first observation should ideally be 0
### if data was recorded from the beginning..
}
colnames(auc_df) <- c("subject","Strain","Plant","Batch","AUC") ###Name columns in AUC datafarame
auc_df$Strain <- as.factor(auc_df$Strain) #refactor
auc_df$Plant <- as.factor(auc_df$Plant) #refactor
auc_df$Batch <- as.factor(auc_df$Batch) #refactor
str(auc_df)## 'data.frame': 200 obs. of 5 variables:
## $ subject: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Strain : Factor w/ 5 levels "comp_nagGH","comp_nagOP",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Plant : Factor w/ 10 levels "A","B","C","D",..: 1 1 1 1 2 2 2 2 3 3 ...
## $ Batch : Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
## $ AUC : num 31 40 39 31 26 2 23 34 0 41 ...
The auc data.frame contains one area under the curve per subject and all other subject specific variables as stored in the original table.
An initial assessment of strain or batch specific differences in AUDPC can be performed visually, for example by generating boxpots.
ggplot(auc_df) + geom_boxplot(
aes(x=Strain, y=AUC, color=Strain), #Plot boxplots of AUCs, by strains
notch=F) +
labs(y="AUDPC") +
facet_wrap(~Batch) + ###Individual plots per batch (and plant if applicable)
ggtitle("Figure 2\nArea Under the Disease Progression Curve per strain across batches")Here, it can be seen that comp_nagOP performed poorly in replicated 3.
Next, one can use the area under the disease progression curve, to build a linear model, or a linear mixed effects model.
library("lme4")
library("lmerTest")summary(lm(AUC~Batch,data=auc_df)) ###Can be used to identify batch effects. If there are none, including batch as a random factor below is not necessary (but also not necessarily wrong).##
## Call:
## lm(formula = AUC ~ Batch, data = auc_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.63 -11.74 -0.41 11.37 25.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.410 1.868 10.391 < 2e-16 ***
## Batch2 9.220 2.642 3.490 0.000596 ***
## Batch3 -4.170 2.642 -1.579 0.116058
## Batch4 -1.730 2.642 -0.655 0.513313
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.21 on 196 degrees of freedom
## Multiple R-squared: 0.1305, Adjusted R-squared: 0.1172
## F-statistic: 9.805 on 3 and 196 DF, p-value: 4.682e-06
auc_lmer <- lmer(
AUC ~ Strain + (1|Batch), ### AUC modeled as a function of strain, random effects of batch
data=auc_df ) #Linear mixed effects model.
auc_lm <- lm(AUC~Strain+Batch,data=auc_df) #Linear model.
AIC(auc_lm,auc_lmer) #Lower AIC, better fit, linear model is slightly better.## df AIC
## auc_lm 9 1613.138
## auc_lmer 7 1602.462
ggplot(data=auc_df, aes(y=AUC, x=Strain)) +geom_boxplot(aes(colour=Batch)) + ggtitle("Boxplot of AUDCP per strain by experimental batch")The model can be explored using various functions, such as summary.
library("broom")
summary(auc_lmer) ### A model summary, containing information on the model.## Linear mixed model fit by REML t-tests use Satterthwaite approximations
## to degrees of freedom [lmerMod]
## Formula: AUC ~ Strain + (1 | Batch)
## Data: auc_df
##
## REML criterion at convergence: 1588.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.97773 -0.93604 -0.00379 0.84847 1.87636
##
## Random effects:
## Groups Name Variance Std.Dev.
## Batch (Intercept) 30.66 5.537
## Residual 177.44 13.321
## Number of obs: 200, groups: Batch, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 20.8250 3.4787 5.9900 5.986 0.000982 ***
## Straincomp_nagOP -1.5250 2.9786 192.0000 -0.512 0.609249
## Strainko_nagGH 0.0125 2.9786 192.0000 0.004 0.996656
## Strainko_nagOP -1.5875 2.9786 192.0000 -0.533 0.594671
## StrainWT 0.1750 2.9786 192.0000 0.059 0.953210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Strnc_OP Str_GH Strnk_OP
## Strncmp_nOP -0.428
## Strank_ngGH -0.428 0.500
## Strank_ngOP -0.428 0.500 0.500
## StrainWT -0.428 0.500 0.500 0.500
tidy(auc_lmer) ### A cleaner display using tidy.## term estimate std.error statistic group
## 1 (Intercept) 20.825000 3.478750 5.986346424 fixed
## 2 Straincomp_nagOP -1.525000 2.978597 -0.511986023 fixed
## 3 Strainko_nagGH 0.012500 2.978597 0.004196607 fixed
## 4 Strainko_nagOP -1.587500 2.978597 -0.532969056 fixed
## 5 StrainWT 0.175000 2.978597 0.058752494 fixed
## 6 sd_(Intercept).Batch 5.537392 NA NA Batch
## 7 sd_Observation.Residual 13.320690 NA NA Residual
###The tidy output explained:
#Term: A discription: (Intercept) overall intercept. Intercept depends on the contrasts set initially. Here treatment contrasts are used, so Intercept = First alphabetical strain (Strain1).
#StrainStrain2: Difference in the estimate (slope), between Strain2 and the (Intercept)
##Estimate: The estimated slopeFor example, we see the estimated slopes (Estimate) and standard errors, together with a t- and corresponding p-value in the output of the summary function. Note, that the above only contains information on differences between different levels and the “baseline”“, which is called (Intercept). The baseline is determined by the contrast settings that were specified earlier.
But, it may be quite relevant to know how individual strains compare to each other (and not just how each strain compares to Strain1). This can be analyzed using a generalized linear hypothesis test, while ajusting for multiple comparisons using Tukey’s method.
library("multcomp")
library("rcompanion")
library("stringr")tidy(summary(glht(auc_lmer, linfct=mcp(Strain="Tukey"))))## lhs rhs estimate std.error statistic p.value
## 1 comp_nagOP - comp_nagGH 0 -1.5250 2.978597 -0.511986023 0.9862088
## 2 ko_nagGH - comp_nagGH 0 0.0125 2.978597 0.004196607 1.0000000
## 3 ko_nagOP - comp_nagGH 0 -1.5875 2.978597 -0.532969056 0.9839683
## 4 WT - comp_nagGH 0 0.1750 2.978597 0.058752494 0.9999973
## 5 ko_nagGH - comp_nagOP 0 1.5375 2.978597 0.516182629 0.9857804
## 6 ko_nagOP - comp_nagOP 0 -0.0625 2.978597 -0.020983034 1.0000000
## 7 WT - comp_nagOP 0 1.7000 2.978597 0.570738517 0.9793118
## 8 ko_nagOP - ko_nagGH 0 -1.6000 2.978597 -0.537165663 0.9834899
## 9 WT - ko_nagGH 0 0.1625 2.978597 0.054555888 0.9999980
## 10 WT - ko_nagOP 0 1.7625 2.978597 0.591721551 0.9763634
This information can, for example, be integrated into a boxplot of the individual disease areas. For example, using a compact letter display, in conbination with AUDPC, combines statistical and visual information. First, however, the grouping letters need to be calculated. Then, these letters are added to the boxplot generated earlier.
auc_cld <- cld(glht(auc_lmer, linfct=mcp(Strain="Tukey"))) ###Save letters
auc_cld <- cbind(levels(auc_df$Strain),auc_cld$mcletters$Letters) ###bind letters to columns
colnames(auc_cld) <- c("Strain","Letter") ###Name columns
auc_cld <- as.data.frame(auc_cld) #Coerce to dataframe
###Integrate letters into auc_df##
###Some extra scripting to make the mean and CI plot.
auc_CI <- as.data.frame(tidy(confint(auc_lmer)))## Computing profile confidence intervals ...
auc_CI <- auc_CI[3:nrow(auc_CI),] ##Drop sig01, sigma
auc_CI$Strain <- levels(auc_df$Strain)
###Mean relative to Strain1 (except strain1 that one is absolute)
for (i in 1:nrow(auc_CI)){
if (i==1){
auc_CI$mean[i] <- mean(c(auc_CI$X2.5..[i],auc_CI$X97.5..[i]))
auc_CI$upr[i] <- auc_CI$X97.5..[i]
auc_CI$lwr[i] <- auc_CI$X2.5..[i]
} else {
auc_CI$mean[i] <- c(auc_CI$mean[1]+mean(c(auc_CI$X2.5..[i],auc_CI$X97.5..[i])))
auc_CI$upr[i] <- c(auc_CI$mean[1]+auc_CI$X97.5..[i])
auc_CI$lwr[i] <- c(auc_CI$mean[1]+auc_CI$X2.5..[i])
}
}
####Generate plot of meanCI of AUDCP with significance letters and raw data as jittered points
ggplot(aes(x=Strain, y=AUC, color=Strain),data=auc_df) + ###Plot the auc_df
geom_crossbar(data = auc_CI, aes(x = Strain, y = mean, ymin = lwr, ymax = upr,fill=Strain), alpha=0.3) +
geom_jitter(aes(shape=Batch)) + ###with jitter overplotted, symbol shape defined by batch
geom_text(aes(x=Strain, y=-3, label=Letter),color="black", data=auc_cld) + ###Get the letters from auc_cld
#and write those to position y=-3
labs(y="AUDPC") + #Y-Axis label
ggtitle("AUDPC raw values and mean from the LMM\nwith 95% CI per strain and grouping letters") #TitleOne can plot the pairwise difference in AUDPC means with confidence intervals for the linear model and the linear mixed effects model.
pairwise_confint_AUDPC_lm <- as.data.frame(confint(glht(auc_lm, mcp(Strain = "Tukey")))$confint)
pairwise_confint_AUDPC_lm$Comparison <- rownames(pairwise_confint_AUDPC_lm)
pairwise_confint_AUDPC_lmer <- as.data.frame(confint(glht(auc_lmer, mcp(Strain = "Tukey")))$confint)
pairwise_confint_AUDPC_lmer$Comparison <- rownames(pairwise_confint_AUDPC_lmer)
###Plot the comparisons, below may not be the most straight-foward way to plot this the way I want it, but it works.
ggplot(pairwise_confint_AUDPC_lm, aes(x = Comparison, y = Estimate, ymin = lwr, ymax = upr, color = Comparison)) + ###Plot Comparison on x, estimate on y
scale_x_discrete(limits = rev(levels(as.factor(pairwise_confint_AUDPC_lm$Comparison)))) + ###Rescale x, so the order is inverted
geom_errorbar() + geom_point() + ###Draw data
coord_flip() + theme(legend.position="none") + xlab("") +###Invert X and Y, hide legend
ggtitle("Difference in means of the AUDPC \nin the linear model with 95% confidence interval") ##Add a title####Plot of the comparisons in the LMM. Here the bars are colored by their absolute value, instead of by their name.
ggplot(pairwise_confint_AUDPC_lmer, aes(x = Comparison, y = Estimate, ymin = lwr, ymax = upr, color = abs(Estimate))) + ###Plot Comparison on x, estimate on y
scale_x_discrete(limits = rev(levels(as.factor(pairwise_confint_AUDPC_lmer$Comparison)))) + ###Rescale x, so the order is inverted
geom_errorbar() + geom_point() + ###Draw data
coord_flip() + theme(legend.position="none") + xlab("") +###Invert X and Y, hide legend
ggtitle("Difference in means of the AUDPC \nin the LMM with 95% confidence interval") ##Add a titleEstimates of lm and lmm are very similar. As a rule of thumb, comparisons where “0” is not part of the 95% confidence interval are likely to produce a signficant p-value (assuming significance is denoted by p<0.05).
In this section, the data is analyzed using a linear mixed effect model. While such models have been used already in the previous section, to test for a strain specific influence on the area under the disease progression curve, different data is used to build the model in this section. As mentioned, AUDPC summarizes disease incidence and time into a single variable, the area. However, in certain cases the AUDCP could be very similar, while the actual disease progression is different. Taking a look at Figure 1, we see that Strain2 and Strain3 display different disease progressions. For Strain 2, the Disease Index continually rises over time. For Strain3, the the disease appears to set on very quickly, in fact before the observations started, but from there it does not increase in severity. However, if we compare the impact of those two strains on the AUDCP (Fig2, 3), it becomes evident, that the AUDPCs for these strains are not statistically different, both strains belong to group “a”. Other methods may be more sensitive to such differences. Below a different approach using linear mixed effect models is taken.
This model is the same as the model in the other files.
###Define contrasts for lmer
contrasts(di_long$Strain) <- "contr.treatment"###First alphabetical strain will be the baseline!
#contrasts(di_long$Plant) <- "contr.treatment"###First alphabetical plant will be the baseline!
contrasts(di_long$Batch) <- "contr.poly" ###Batches will be averaged
###Drop things that are not "Useful"
di_long.useful <- filter(di_long, Useful=="Yes")
str(di_long.useful)## 'data.frame': 577 obs. of 8 variables:
## $ X : int 14 28 33 34 35 36 45 46 64 65 ...
## $ DPI : int 14 14 5 6 7 8 3 4 8 9 ...
## $ Strain : Factor w/ 5 levels "comp_nagGH","comp_nagOP",..: 5 3 1 1 1 1 4 4 2 2 ...
## ..- attr(*, "contrasts")= chr "contr.treatment"
## $ Batch : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## ..- attr(*, "contrasts")= chr "contr.poly"
## $ Plant : Factor w/ 10 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ DI : int 0 0 0 2 3 4 0 4 0 1 ...
## $ subject: Factor w/ 200 levels "comp_nagGH.A.1",..: 161 81 1 1 1 1 121 121 41 41 ...
## $ Useful : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
#Fixed effect from strain, and a fixed interaction between Strain and time. Random effects individual and replicate.
disease_lmer <- lmer(DI ~ Strain + Strain:DPI + (1 | subject) + (1 | Batch), di_long.useful)The model can be investigated using summary functions. The pairwise comparisons may be plotted with 95%CI to assess how different two strains are.
###Check model summary
summary(disease_lmer)## Linear mixed model fit by REML t-tests use Satterthwaite approximations
## to degrees of freedom [lmerMod]
## Formula: DI ~ Strain + Strain:DPI + (1 | subject) + (1 | Batch)
## Data: di_long.useful
##
## REML criterion at convergence: 2218.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.3965 -0.9884 -0.3194 1.1454 1.6227
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 0.00000 0.0000
## Batch (Intercept) 0.02989 0.1729
## Residual 2.62645 1.6206
## Number of obs: 577, groups: subject, 200; Batch, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.238657 0.450959 350.400000 2.747 0.00633 **
## Straincomp_nagOP 0.485533 0.613427 566.700000 0.792 0.42898
## Strainko_nagGH 0.080397 0.614028 564.800000 0.131 0.89587
## Strainko_nagOP 0.341338 0.635535 566.000000 0.537 0.59142
## StrainWT -0.127323 0.593857 565.400000 -0.214 0.83031
## Straincomp_nagGH:DPI 0.060127 0.048290 566.500000 1.245 0.21360
## Straincomp_nagOP:DPI -0.004877 0.047431 558.100000 -0.103 0.91813
## Strainko_nagGH:DPI 0.060755 0.048223 566.800000 1.260 0.20823
## Strainko_nagOP:DPI 0.010235 0.052332 536.100000 0.196 0.84502
## StrainWT:DPI 0.079425 0.043684 561.700000 1.818 0.06957 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Strnc_OP Str_GH Strnk_OP StrnWT Strnc_GH:DPI
## Strncmp_nOP -0.701
## Strank_ngGH -0.704 0.517
## Strank_ngOP -0.678 0.503 0.503
## StrainWT -0.727 0.537 0.534 0.514
## Strnc_GH:DPI -0.928 0.674 0.678 0.651 0.700
## Strnc_OP:DPI -0.008 -0.646 0.001 -0.007 -0.003 0.010
## Strnk_GH:DPI -0.004 -0.003 -0.648 -0.007 0.000 0.004
## Strnk_OP:DPI -0.008 -0.008 -0.004 -0.672 0.002 0.010
## StranWT:DPI -0.007 -0.008 0.002 0.001 -0.621 0.007
## Strnc_OP:DPI Strnk_GH:DPI Strnk_OP:DPI
## Strncmp_nOP
## Strank_ngGH
## Strank_ngOP
## StrainWT
## Strnc_GH:DPI
## Strnc_OP:DPI
## Strnk_GH:DPI 0.007
## Strnk_OP:DPI 0.018 0.014
## StranWT:DPI 0.016 0.005 0.005
###E.g. plot confints
###Make pairwise confints and plot them, with a flipped coordinate system
pairwise_confint <- as.data.frame(confint(glht(disease_lmer, mcp(Strain = "Tukey", interaction_average=T)))$confint)
pairwise_confint$Comparison <- rownames(pairwise_confint)
ggplot(pairwise_confint, aes(x = Comparison, y = Estimate, ymin = lwr, ymax = upr, color = abs(Estimate))) + ###Plot Comparison on x, estimate on y
scale_x_discrete(limits = rev(levels(as.factor(pairwise_confint$Comparison)))) + ###Rescale x, so the order is inverted
geom_errorbar() + geom_point() + ###Draw data
coord_flip() + theme(legend.position="none") + xlab("") +###Invert X and Y, hide legend
ggtitle("Difference in means with 95% confidence interval \ncolored by absolute estimated difference") ##Add a titleconfint_model <- as.data.frame(tidy(confint(disease_lmer)))## Computing profile confidence intervals ...
## Warning in nextpar(mat, cc, i, delta, lowcut, upcut): unexpected decrease
## in profile: using minstep
## Warning in FUN(X[[i]], ...): non-monotonic profile for .sig01
## Warning in nextpar(mat, cc, i, delta, lowcut, upcut): unexpected decrease
## in profile: using minstep
## Warning in profile.merMod(object, which = parm, signames = oldNames, ...):
## non-monotonic profile for (Intercept)
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit
## for .sig01: falling back to linear interpolation
## Warning in confint.thpr(pp, level = level, zeta = zeta): bad spline fit for
## (Intercept): falling back to linear interpolation
confint_slopes <- confint_model[ (1 + 3 + nlevels( di_long$Strain ) ) : ( 3 + 2*nlevels( di_long$Strain ) ) , 2:3 ]
colnames(confint_slopes) <- c("lwr","upr")
confint_slopes$Estimate <- rowMeans(confint_slopes)
confint_slopes$Strain <- levels(di_long$Strain)
###Plot the estimates, below may not be the most straight-foward way to plot this the way I want it, but it works.
ggplot(confint_slopes, aes(x = Strain, y = Estimate, ymin = lwr, ymax = upr, color = abs(Estimate))) + ###Plot Comparison on x, estimate on y
scale_x_discrete(limits = rev(levels(as.factor(confint_slopes$Strain)))) + ###Rescale x, so the order is inverted
geom_errorbar() + geom_point() + ###Draw data
coord_flip() + theme(legend.position="none") + xlab("") +###Invert X and Y, hide legend
ggtitle("Absolute slopes with 95%CI") ##Add a title#Intercepts are treatment contrasted
confint_icep <- confint_model[4:( 3 + nlevels( di_long$Strain )),]
confint_icep$Strain <- levels(di_long$Strain)
for (i in 1:nrow(confint_icep)){
if (i==1){
confint_icep$Estimate[i] <- mean(c(confint_icep$X2.5..[i],confint_icep$X97.5..[i]))
confint_icep$upr[i] <- confint_icep$X97.5..[i]
confint_icep$lwr[i] <- confint_icep$X2.5..[i]
} else {
confint_icep$Estimate[i] <- c(confint_icep$Estimate[1]+mean(c(confint_icep$X2.5..[i],confint_icep$X97.5..[i])))
confint_icep$upr[i] <- c(confint_icep$Estimate[1]+confint_icep$X97.5..[i])
confint_icep$lwr[i] <- c(confint_icep$Estimate[1]+confint_icep$X2.5..[i])
}
}
ggplot(confint_icep, aes(x = Strain, y = Estimate, ymin = lwr, ymax = upr, color = abs(Estimate))) + ###Plot Comparison on x, estimate on y
scale_x_discrete(limits = rev(levels(as.factor(confint_icep$Strain)))) + ###Rescale x, so the order is inverted
geom_errorbar() + geom_point() + ###Draw data
coord_flip() + theme(legend.position="none") + xlab("") +###Invert X and Y, hide legend
ggtitle("Absolute intercept with 95%CI") ##Add a titleUsing the linear model and raw data, different displays can be plotted, for example, a boxplot of the “Useful” data-points combined with the predictions (extrapolations) of the linear model.
##Add predictions to full dataset
library("modelr")##
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
##
## bootstrap
##Add predictions to full dataset
di_long <- add_predictions(di_long,disease_lmer,var="lmer.pred")## Warning: contrasts dropped from factor Strain
###
ggplot(data=di_long, aes(x=DPI,y=DI))+
geom_boxplot(aes(color=Strain,group=DPI),data=filter(di_long, Useful=="Yes"))+
geom_smooth(aes(y=lmer.pred,color=Strain), method="lm", alpha=0.6) +
facet_wrap(~Strain)+
ggtitle("Boxplots and linear fit for individual strains")Finally, of the methods discussed here, those relying on a linear model of raw observations over time, can be used for other kinds of observations, e.g. bacterial titers.
At least briefly, plot residuals of the lmm to assess if this appears normal (random)
di_long$resid[di_long$Useful=="Yes"] <- resid(disease_lmer)
ggplot(di_long[di_long$Useful=="Yes",], aes(x=DPI,y=resid)) + geom_jitter(aes(color=Strain)) Survival analysis is different in several aspects from the earlier approaches. Survival analysis, also known that time-to-event analysis, builds on a different dataset, which can be generated from raw disease indices. Survival analysis provides a way to analyze the time-to-event recordings within one population. In the case of the data used here, what is of interest is an event that can be referred to as death. Correctly defining death in this context is crucial for the outcome of the analysis. Death, here, is defined as a subject reaching a certain disease index, from which it cannot recover. In this script the disease index that defines the threashold to death is called “cutoff”.
###Set the cutoff by assigning a number to the variable cutoff
cutoff <- c(4) All of the observations that are above the dotted line in the plot below are dead, at the day they cross that line. Those that never cross the line are alive until the end of observations, and are “right censored”, meaning that their event was not observed during the time this subject was observed.
ggplot(data=di_long) +
geom_jitter(aes(x=DPI, y=DI,color=Strain,shape=Batch)) +
geom_segment(aes(x=0, xend=max(DPI)+0.5,y=cutoff, yend=cutoff), linetype="dotted") +
labs(x = "Days post infection", y = "Disease indices", title="Scatterplot of disease indices\n cutoff plotted as dotted line") +
coord_cartesian(xlim=c(-0.1,10))A “survival table” can be generated using the following code. This code works on the long table, generated in the beginning, and the cutoff variable defined above.
###Generate survival table
library("stringr")
surv_from_DI <- data.frame(Subject=levels(di_long$subject),
Strain=str_split_fixed(levels(di_long$subject), "\\.",3)[,1],
Plant=str_split_fixed(levels(di_long$subject), "\\.",3)[,2],
Batch=str_split_fixed(levels(di_long$subject), "\\.",3)[,3])
###Fill survival table based on the di_long table. This generates warnings. These can be ignored and come from the min()
for (i in 1:max(as.numeric(di_long$subject))) { #Go by subject
dummy <- di_long[as.numeric(di_long$subject)==i,] #generate dummy for the subject
if (is.infinite(min(dummy$DPI[which(dummy$DI >= cutoff)]))) { #If none of the DI is greater than the cutoff (this is where warnings are generated, min on an empty object returns infinite and a warning!)
surv_from_DI$End[i] <- max(dummy$DPI) #Generate a observation, censoring at the maximum DPI recorded
surv_from_DI$Death[i] <- 0 #Still alive, because it did not pass the cutoff
} else { #If more than zero DI are greater than the cutoff
surv_from_DI$End[i] <- min(dummy$DPI[which(dummy$DI >= cutoff)]) #Use the lowest DPI where condition is met
surv_from_DI$Death[i] <- 1 #record as dead
}
}
rm(dummy)Kaplan-Meier estimates of survival are the basic tool of survival analysis. These can be estimated using the survfit function from the “survival”" package.
library("survival")surv_DI_fit <- survfit(Surv(End, Death) ~Strain +strata(Batch), data=surv_from_DI)The survminer package provides the ggsurvplot() function. This works nicely on datasets with few treatments. However, for the data presented here, I think it is easier to initially generate a data frame that contains the whole fit and plot with “normal” ggplot2
library("stringr")
###Strata dummy generation, modified from kmiddleton / rexamples
strata_dummy <-NULL
for(i in 1:length(surv_DI_fit$strata)){
# add vector for one strata according to number of rows of strata
strata_dummy <- c(strata_dummy, rep(names(surv_DI_fit$strata)[i], surv_DI_fit$strata[i]))
}
###Data frame generation inspired by a post by Hadley Wickham to the ggplot2 Googlegroup
surv_DI_fit.df <- data.frame(
time = surv_DI_fit$time,
n.risk = surv_DI_fit$n.risk,
n.event = surv_DI_fit$n.event,
surv = surv_DI_fit$surv,
strata = strata_dummy,
upper = surv_DI_fit$upper,
lower = surv_DI_fit$lower
)
zeros <- data.frame(time = 0, surv = 1, strata = names((surv_DI_fit$strata)),
upper = 1, lower = 1)
surv_DI_fit.df <- plyr::rbind.fill(zeros, surv_DI_fit.df) ###I dont want to load plyr because i guess it will interfere with dplyr...
rm(strata_dummy)
rm(zeros)
###Some stuff to rename other stuff, this needs to be adapted if other variables are used.
surv_DI_fit.df$Batch <- as.factor( str_split_fixed(
matrix( nrow=length(surv_DI_fit.df$strata),ncol=2, unlist(strsplit(as.character(surv_DI_fit.df$strata),", ")), byrow=T )[,2],"=",2)[,2])
surv_DI_fit.df$Strain <- as.factor( str_split_fixed(
matrix( nrow=length(surv_DI_fit.df$strata),ncol=2, unlist(strsplit(as.character(surv_DI_fit.df$strata),", ")), byrow=T )[,1],"=",2)[,2])
###End of data frame generation
###Start plotting
ggplot(surv_DI_fit.df, aes(time, surv, colour = Strain)) +
geom_step(aes(y = surv*100,linetype=Batch)) +
facet_wrap(~Strain) +
ggtitle("Survival estimates for all Batches")Comparing the Kaplan-Meier survival estimates can be done in different ways. ###Logrank testing The below produces all pairwise comparisons of the Kaplan Meier estimate of survival using a logrank test.
###Make a table of pairwise chisq pvalues, for the logrank test.
#Based on a post to the R Mailing list by T. Therneau
pw_logrank_test_type <- 0 ###0 for logrank, 1 for peto and peto
pw_logrank <- matrix(0., nlevels(surv_from_DI$Strain),nlevels(surv_from_DI$Strain))
for (i in 1:nlevels(surv_from_DI$Strain)) {
for (j in (1:nlevels(surv_from_DI$Strain))[-i]) {
datasubset <- droplevels(subset( surv_from_DI,
surv_from_DI$Strain %in% (unique(surv_from_DI$Strain))[c(i,j)]))
temp <- survdiff(Surv(End, Death)~Strain+strata(Batch), data=datasubset, rho=pw_logrank_test_type)
pw_logrank[i,j] <- pchisq(temp$chisq, df=1, lower=F) ##df will always be 1 because this is pairwise
}
}
colnames(pw_logrank) <- levels(surv_from_DI$Strain)
rownames(pw_logrank) <- levels(surv_from_DI$Strain)
#Make dummy adjustment table
pw_logrank_adjBon <- pw_logrank
#Fill adjusted pvalue table.
for (i in 1:ncol(pw_logrank)) {
pw_logrank_adjBon[,i] <- cbind(p.adjust(pw_logrank[,i], method="bonferroni"))
}stargazer::stargazer(pw_logrank_adjBon,type="html",title="Pairwise Chisq p-values (Bonferroni adjusted)")| comp_nagGH | comp_nagOP | ko_nagGH | ko_nagOP | WT | |
| comp_nagGH | 0 | 1 | 1 | 1 | 1 |
| comp_nagOP | 1 | 0 | 1 | 1 | 1 |
| ko_nagGH | 1 | 1 | 0 | 1 | 1 |
| ko_nagOP | 1 | 1 | 1 | 0 | 1 |
| WT | 1 | 1 | 1 | 1 | 0 |
Generally a survival regression does not assume proportionality of hazards. A survival regression is fit to a distribution, defined by dist=“”.
library("survcomp")
library("rms")
library("modelr")####Survival Regression###
###This is done using functions from rms.
###psm is a survival::survreg wrapper. but the output is more handle-able.
ddist <- datadist(surv_from_DI)
options(datadist="ddist")
psurv_gaus <- psm(Surv(End, Death) ~Strain, data=surv_from_DI, dist="gaussian")
psurv_logistic <- psm(Surv(End, Death) ~Strain, data=surv_from_DI, dist="logistic")
psurv_lnorm <- psm(Surv(End, Death) ~Strain, data=surv_from_DI, dist="lognormal")
psurv_wei <- psm(Surv(End, Death) ~Strain, data=surv_from_DI, dist="weibull")
###Same with survreg()
s_reg_gaus <- survreg(Surv(End, Death) ~Strain, data=surv_from_DI, dist="gaussian")
s_reg_logistic <- survreg(Surv(End, Death) ~Strain, data=surv_from_DI, dist="logistic")
s_reg_lnorm <- survreg(Surv(End, Death) ~Strain, data=surv_from_DI, dist="lognormal")
s_reg_wei <- survreg(Surv(End, Death) ~Strain, data=surv_from_DI, dist="weibull")
aic.scores.psurv <- rbind(
extractAIC(psurv_wei),
extractAIC(psurv_gaus),
extractAIC(psurv_logistic),
extractAIC(psurv_lnorm))
###Make useable AIC table
rownames(aic.scores.psurv) <- c("Weibull", "Gaussian", "Logist", "Lognorm")
colnames(aic.scores.psurv) <- c("df", "AIC")
###Call tablestargazer::stargazer(aic.scores.psurv,type="html",title="AIC Scores")| df | AIC | |
| Weibull | 6 | 959.200 |
| Gaussian | 6 | 974.861 |
| Logist | 6 | 983.390 |
| Lognorm | 6 | 938.701 |
From the table above, the model with the lowest AIC score can be chosen. For this analysis, this is the lognormal model, but this does not have to apply to other experiments. Then, one can inspect that model for significance
summary(glht(psurv_lnorm,linfct=mcp(Strain="Tukey")))##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: psm(formula = Surv(End, Death) ~ Strain, data = surv_from_DI,
## dist = "lognormal")
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## comp_nagOP - comp_nagGH == 0 0.029127 0.111312 0.262 0.999
## ko_nagGH - comp_nagGH == 0 -0.036242 0.110422 -0.328 0.997
## ko_nagOP - comp_nagGH == 0 0.022685 0.111094 0.204 1.000
## WT - comp_nagGH == 0 -0.048696 0.110089 -0.442 0.992
## ko_nagGH - comp_nagOP == 0 -0.065368 0.111246 -0.588 0.977
## ko_nagOP - comp_nagOP == 0 -0.006442 0.111865 -0.058 1.000
## WT - comp_nagOP == 0 -0.077823 0.110944 -0.701 0.956
## ko_nagOP - ko_nagGH == 0 0.058926 0.111027 0.531 0.984
## WT - ko_nagGH == 0 -0.012455 0.110019 -0.113 1.000
## WT - ko_nagOP == 0 -0.071381 0.110717 -0.645 0.968
## (Adjusted p values reported -- single-step method)
Again, one can then inspect the differences in the model, for example using pairwise comparisons of means.
pairwise_confint_sreg <- as.data.frame(confint(glht(psurv_lnorm, mcp(Strain = "Tukey")))$confint)
pairwise_confint_sreg$Comparison <- rownames(pairwise_confint_sreg)
###Plot the comparisons, below may not be the most straight-foward way to plot this the way I want it, but it works.
ggplot(pairwise_confint_sreg, aes(x = Comparison, y = Estimate, ymin = lwr, ymax = upr, color = Comparison)) + ###Plot Comparison on x, estimate on y
scale_x_discrete(limits = rev(levels(as.factor(pairwise_confint$Comparison)))) + ###Rescale x, so the order is inverted
geom_errorbar() + geom_point() + ###Draw data
coord_flip() + theme(legend.position="none") + xlab("") +###Invert X and Y, hide legend
ggtitle("Difference in means with 95% Confidence interval") ##Add a titleIt is possible, but not really easy, to plot the generated curves. These curves are the result of fitting the data to a distribution in the earlier section. Doing this in a manner that is compatible with ggplot2 is not straightforward. Below is code to generate plots of the KM estimates per batch and the generated regression. This is performed for the four distributions above, and can be adapted to different distributions if necessary.
library("tidyr")##
## Attaching package: 'tidyr'
## The following object is masked from 'package:Matrix':
##
## expand
###Step 1, extract the coefficients. These are relative to Strain1 because Strain is treatment contrasted.
for (i in 1:nlevels(surv_DI_fit.df$Strain)) { #For loop through strains
if(i==1) { #Strain1 is relative to itself, so no change
coef_wei <- list()
coef_logistic <- list()
coef_gaus <- list()
coef_lnorm <- list()
coef_wei[i] <- coef(s_reg_wei)[i]
coef_logistic[i] <- coef(s_reg_logistic)[i]
coef_gaus[i] <- coef(s_reg_gaus)[i]
coef_lnorm[i] <- coef(s_reg_lnorm)[i]
} else { ###Other strains are relative to 1
coef_wei[i] <- coef(s_reg_wei)[1] + coef(s_reg_wei)[i]
coef_logistic[i] <- coef(s_reg_logistic)[1] + coef(s_reg_logistic)[i]
coef_gaus[i] <- coef(s_reg_gaus)[1] + coef(s_reg_gaus)[i]
coef_lnorm[i] <- coef(s_reg_lnorm)[1] + coef(s_reg_lnorm)[i]
}
}
##Step 2
####Store the coefficients and the scale in a new data frame, of parameters
### Keep in mind that survreg.distributions$weibull is different from rweibull, hence the difference in names.
sregparams <- data.frame(
Strain = rep(levels(surv_from_DI$Strain),4 ), #Fill with strains
scale.wei = exp(unlist(coef_wei)), #weibull fit scale parameters
scale.logistic = rep(s_reg_logistic$scale, nlevels(surv_from_DI$Strain)), #fill with logis scales
scale.gaus = rep(s_reg_gaus$scale, nlevels(surv_from_DI$Strain)), #fill with gaus scales
scale.lnorm = rep(s_reg_lnorm$scale, nlevels(surv_from_DI$Strain)), #fill with lnorm scale
shape.wei = rep(1/s_reg_wei$scale, nlevels(surv_from_DI$Strain)), #shape for weibull
shape.logistic = unlist(coef_logistic), #shape for logistic
shape.gaus = unlist(coef_gaus), #shape for gaus
shape.lnorm = unlist(coef_lnorm) #shape for lnorm
)
##Step 3
###Calculate the "daily" value of each curve
for (i in 1:nlevels(surv_DI_fit.df$Strain)){
if(i==1) {
wei <- list()
logis <- list()
gaus <- list()
lnorm <- list()
}
x <- levels(surv_DI_fit.df$Strain)[i]
n <- c(1:max(surv_from_DI$End))
data <- filter(sregparams, Strain==x)
time <- n
wei <- cbind(wei, pweibull(
q=n,
scale=data$scale.wei,
shape=data$shape.wei,
lower.tail=FALSE))
logis <- cbind(logis,plogis(
q=n,
scale=data$scale.logistic,
location=data$shape.logistic,
lower.tail=FALSE ))
gaus <- cbind(gaus,pnorm(
q=n,
sd=data$scale.gaus,
mean=data$shape.gaus,
lower.tail = F))
lnorm <- cbind(lnorm, plnorm(
q=n,
sd=data$scale.lnorm,
mean=data$shape.lnorm,
lower.tail=F))
}
##Step 4
###Put all the curves into a data.frame that contains information on "time" and also "Strain", for compatibility with other data.frames
sreg_curves <- data.frame(
wei.sreg = cbind(unlist(wei)),
logis.sreg = cbind(unlist(logis)),
gaus.sreg = cbind(unlist(gaus)),
lnorm.sreg = cbind(unlist(lnorm)),
Strain = rep(unlist(levels(surv_DI_fit.df$Strain)),each=max(surv_from_DI$End)),
time = rep(c(1:max(surv_from_DI$End)), nlevels(surv_DI_fit.df$Strain))
)
##Step 5
###Turn that data.frame into a long data.frame (not used here but for other figures.)
sreg_long <- sreg_curves %>% gather(., key="Distribution",values = c(lnorm.sreg, wei.sreg,gaus.sreg,logis.sreg) )
sreg_long$Distribution <- as.factor(sreg_long$Distribution)
##Levels: gaus.sreg lnorm.sreg logis.sreg wei.sreg
levels(sreg_long$Distribution) <- c("Gaussian","Lognormal","Loglogistic","Weibull")Now, these can be plotted and inspected visually..
###Plot of KM+Weibull
ggplot(surv_DI_fit.df, aes(time, surv, colour = Strain)) +
geom_step(aes(linetype=Batch)) +
geom_line(data=sreg_curves,aes(y=wei.sreg),color="black") +
facet_wrap(~Strain) +
ggtitle("Kaplan-Meier estimates and fit to\nWeibull distribution")###Plot of KM+Logistic
ggplot(surv_DI_fit.df, aes(time, surv, colour = Strain)) +
geom_step(aes(linetype=Batch)) +
geom_line(data=sreg_curves,aes(y=logis.sreg),color="black") +
facet_wrap(~Strain) +
ggtitle("Kaplan-Meier estimates and fit to\n Logistic distribution")###Plot of KM+Gaussian
ggplot(surv_DI_fit.df, aes(time, surv, colour = Strain)) +
geom_step(aes(linetype=Batch)) +
geom_line(data=sreg_curves,aes(y=gaus.sreg),color="black") +
facet_wrap(~Strain) +
ggtitle("Kaplan-Meier estimates and fit to\nGaussian distribution")###Plot of KM+Lognormal
ggplot(surv_DI_fit.df, aes(time, surv, colour = Strain)) +
geom_step(aes(linetype=Batch)) +
geom_line(data=sreg_curves,aes(y=lnorm.sreg),color="black") +
facet_wrap(~Strain) +
ggtitle("Kaplan-Meier estimates and fit to\nLognormal distribution")Different approaches to survival analysis, are based on analysing the hazards. The hazard is the probability of experiencing an event at a given timepoint. Many hazard based analysis assume that hazards are proportional between treatments, meaning that they differ by a fixed factor. Hazards were strongly influenced by Cox and a basic model is the cox proportional hazards model.
###Cox-Proportional hazards####
#Build model
srv_coxph <- coxph(Surv(End, Death) ~Strain + strata(Batch), data=surv_from_DI)
###Check porportionality of hazards
cox.zph(srv_coxph, transform = "log")## rho chisq p
## Straincomp_nagOP -0.0952 1.398 0.237
## Strainko_nagGH -0.0636 0.618 0.432
## Strainko_nagOP -0.0564 0.505 0.477
## StrainWT -0.0281 0.120 0.729
## GLOBAL NA 1.641 0.801
If the proportionality of hazards is violated within the dataset can be seen from the p-value(s) returned by cox.zph. One can then continue with a comparison of hazard ratios, however hazard ratios are not appropriate if the assumption of proportional hazards is violated.
####Hazard ratio
haz_rats <- hazard.ratio(x= surv_from_DI$Strain,
surv.time = surv_from_DI$End,
surv.event = surv_from_DI$Death,
strat = surv_from_DI$Batch,
method.test = "wald" ) ###Overall hazard ratios
###Pairwise hazard ratios / modified from the pairwise chisq calculation
pw_hazrats <- matrix(0., nlevels(surv_from_DI$Strain),nlevels(surv_from_DI$Strain))
for (i in 1:nlevels(surv_from_DI$Strain)) {
for (j in (1:nlevels(surv_from_DI$Strain))[-i]) {
datasubset <- droplevels(subset( surv_from_DI,
surv_from_DI$Strain %in% (unique(surv_from_DI$Strain))[c(i,j)]))
temp <- hazard.ratio(
x= datasubset$Strain,
surv.time = datasubset$End,
surv.event = datasubset$Death,
strat = datasubset$Batch,
method.test = "likelihood.ratio" ###Define test to determine p.
)
pw_hazrats[i,j] <- temp$p.value
}
}
colnames(pw_hazrats) <- levels(surv_from_DI$Strain)
rownames(pw_hazrats) <- levels(surv_from_DI$Strain)stargazer::stargazer(pw_hazrats,type="html",title="Pairwise hazard ratio pvalues")| comp_nagGH | comp_nagOP | ko_nagGH | ko_nagOP | WT | |
| comp_nagGH | 0 | 0.782 | 0.964 | 0.870 | 0.815 |
| comp_nagOP | 0.782 | 0 | 0.750 | 0.752 | 0.431 |
| ko_nagGH | 0.964 | 0.750 | 0 | 0.892 | 0.879 |
| ko_nagOP | 0.870 | 0.752 | 0.892 | 0 | 0.987 |
| WT | 0.815 | 0.431 | 0.879 | 0.987 | 0 |
If the hazards are found to be non-proportional, it might be a good idea to perform survival regression analysis, or pairwise log-rank testing (see earlier) instead of hazard ratio tests. Alternatives may also come from the use of Cox mixed-effects model, similar to linear mixed-effects model, but with a different type of response variable.
library("coxme")## Loading required package: bdsmatrix
##
## Attaching package: 'bdsmatrix'
## The following object is masked from 'package:SparseM':
##
## backsolve
## The following object is masked from 'package:base':
##
## backsolve
cme <- coxme(Surv(End, Death) ~Strain + (1|Batch), data=surv_from_DI)
anova(cme)## Analysis of Deviance Table
## Cox model: response is Surv(End, Death)
## Terms added sequentially (first to last)
##
## loglik Chisq Df Pr(>|Chi|)
## NULL -714.75
## Strain -705.81 17.885 4 0.001299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(glht(cme,linfct=mcp(Strain="Tukey")))##
## Simultaneous Tests for General Linear Hypotheses
##
## Multiple Comparisons of Means: Tukey Contrasts
##
##
## Fit: coxme(formula = Surv(End, Death) ~ Strain + (1 | Batch), data = surv_from_DI)
##
## Linear Hypotheses:
## Estimate Std. Error z value Pr(>|z|)
## comp_nagOP - comp_nagGH == 0 -0.113278 0.264388 -0.428 0.993
## ko_nagGH - comp_nagGH == 0 0.038949 0.254455 0.153 1.000
## ko_nagOP - comp_nagGH == 0 -0.005224 0.262860 -0.020 1.000
## WT - comp_nagGH == 0 0.096273 0.250872 0.384 0.995
## ko_nagGH - comp_nagOP == 0 0.152227 0.264181 0.576 0.979
## ko_nagOP - comp_nagOP == 0 0.108054 0.271187 0.398 0.995
## WT - comp_nagOP == 0 0.209551 0.260259 0.805 0.929
## ko_nagOP - ko_nagGH == 0 -0.044173 0.261807 -0.169 1.000
## WT - ko_nagGH == 0 0.057324 0.250829 0.229 0.999
## WT - ko_nagOP == 0 0.101497 0.259382 0.391 0.995
## (Adjusted p values reported -- single-step method)
An inherent question when analyzing experimental data, is which analysis produced which result and why. Below, the outputs from the three parametric analysis performed are printed, so they can be compared.
summary(auc_lmer)## Linear mixed model fit by REML t-tests use Satterthwaite approximations
## to degrees of freedom [lmerMod]
## Formula: AUC ~ Strain + (1 | Batch)
## Data: auc_df
##
## REML criterion at convergence: 1588.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.97773 -0.93604 -0.00379 0.84847 1.87636
##
## Random effects:
## Groups Name Variance Std.Dev.
## Batch (Intercept) 30.66 5.537
## Residual 177.44 13.321
## Number of obs: 200, groups: Batch, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 20.8250 3.4787 5.9900 5.986 0.000982 ***
## Straincomp_nagOP -1.5250 2.9786 192.0000 -0.512 0.609249
## Strainko_nagGH 0.0125 2.9786 192.0000 0.004 0.996656
## Strainko_nagOP -1.5875 2.9786 192.0000 -0.533 0.594671
## StrainWT 0.1750 2.9786 192.0000 0.059 0.953210
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Strnc_OP Str_GH Strnk_OP
## Strncmp_nOP -0.428
## Strank_ngGH -0.428 0.500
## Strank_ngOP -0.428 0.500 0.500
## StrainWT -0.428 0.500 0.500 0.500
summary(disease_lmer)## Linear mixed model fit by REML t-tests use Satterthwaite approximations
## to degrees of freedom [lmerMod]
## Formula: DI ~ Strain + Strain:DPI + (1 | subject) + (1 | Batch)
## Data: di_long.useful
##
## REML criterion at convergence: 2218.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.3965 -0.9884 -0.3194 1.1454 1.6227
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 0.00000 0.0000
## Batch (Intercept) 0.02989 0.1729
## Residual 2.62645 1.6206
## Number of obs: 577, groups: subject, 200; Batch, 4
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.238657 0.450959 350.400000 2.747 0.00633 **
## Straincomp_nagOP 0.485533 0.613427 566.700000 0.792 0.42898
## Strainko_nagGH 0.080397 0.614028 564.800000 0.131 0.89587
## Strainko_nagOP 0.341338 0.635535 566.000000 0.537 0.59142
## StrainWT -0.127323 0.593857 565.400000 -0.214 0.83031
## Straincomp_nagGH:DPI 0.060127 0.048290 566.500000 1.245 0.21360
## Straincomp_nagOP:DPI -0.004877 0.047431 558.100000 -0.103 0.91813
## Strainko_nagGH:DPI 0.060755 0.048223 566.800000 1.260 0.20823
## Strainko_nagOP:DPI 0.010235 0.052332 536.100000 0.196 0.84502
## StrainWT:DPI 0.079425 0.043684 561.700000 1.818 0.06957 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) Strnc_OP Str_GH Strnk_OP StrnWT Strnc_GH:DPI
## Strncmp_nOP -0.701
## Strank_ngGH -0.704 0.517
## Strank_ngOP -0.678 0.503 0.503
## StrainWT -0.727 0.537 0.534 0.514
## Strnc_GH:DPI -0.928 0.674 0.678 0.651 0.700
## Strnc_OP:DPI -0.008 -0.646 0.001 -0.007 -0.003 0.010
## Strnk_GH:DPI -0.004 -0.003 -0.648 -0.007 0.000 0.004
## Strnk_OP:DPI -0.008 -0.008 -0.004 -0.672 0.002 0.010
## StranWT:DPI -0.007 -0.008 0.002 0.001 -0.621 0.007
## Strnc_OP:DPI Strnk_GH:DPI Strnk_OP:DPI
## Strncmp_nOP
## Strank_ngGH
## Strank_ngOP
## StrainWT
## Strnc_GH:DPI
## Strnc_OP:DPI
## Strnk_GH:DPI 0.007
## Strnk_OP:DPI 0.018 0.014
## StranWT:DPI 0.016 0.005 0.005
summary(psurv_lnorm)## Effects Response : Surv(End, Death)
##
## Factor Low High Diff. Effect S.E.
## Strain - comp_nagOP:comp_nagGH 1 2 NA 0.029127 0.11131
## Survival Time Ratio 1 2 NA 1.029600 NA
## Strain - ko_nagGH:comp_nagGH 1 3 NA -0.036242 0.11042
## Survival Time Ratio 1 3 NA 0.964410 NA
## Strain - ko_nagOP:comp_nagGH 1 4 NA 0.022685 0.11109
## Survival Time Ratio 1 4 NA 1.022900 NA
## Strain - WT:comp_nagGH 1 5 NA -0.048696 0.11009
## Survival Time Ratio 1 5 NA 0.952470 NA
## Lower 0.95 Upper 0.95
## -0.19041 0.24866
## 0.82662 1.28230
## -0.25402 0.18154
## 0.77567 1.19910
## -0.19642 0.24179
## 0.82167 1.27350
## -0.26582 0.16843
## 0.76658 1.18340
An easier comparisons might be accomplished with the compact letter display.
cld(glht(auc_lmer, linfct=mcp(Strain="Tukey")))## comp_nagGH comp_nagOP ko_nagGH ko_nagOP WT
## "a" "a" "a" "a" "a"
#Compact letters for lmerTest objects are a little tricky. This solution comes from the rcompanion.
### Extract lsmeans table
lmerlsm <- difflsmeans(disease_lmer)$diffs.lsmeans.table
Comparison = str_split_fixed(rownames(lmerlsm),"Strain ",2)[,2]
### Produce compact letter display
library(rcompanion)
# cldList(comparison = Comparison,
# p.value = p.adjust(lmerlsm$'p-value',
# method = "bonferroni") ,
# threshold = 0.05)
cld(glht(srv_coxph,linfct=mcp(Strain="Tukey")))## comp_nagGH comp_nagOP ko_nagGH ko_nagOP WT
## "a" "a" "a" "a" "a"
cld(glht(psurv_lnorm,linfct=mcp(Strain="Tukey")))## comp_nagGH comp_nagOP ko_nagGH ko_nagOP WT
## "a" "a" "a" "a" "a"
cld(glht(cme,linfct=mcp(Strain="Tukey")))## comp_nagGH comp_nagOP ko_nagGH ko_nagOP WT
## "a" "a" "a" "a" "a"
sessionInfo()## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] coxme_2.2-5 bdsmatrix_1.3-2 tidyr_0.6.1 rms_5.1-0
## [5] SparseM_1.74 Hmisc_4.0-2 Formula_1.2-1 lattice_0.20-34
## [9] survcomp_1.24.0 prodlim_1.5.9 modelr_0.1.0 stringr_1.2.0
## [13] rcompanion_1.5.0 multcomp_1.4-6 TH.data_1.0-8 MASS_7.3-45
## [17] survival_2.40-1 mvtnorm_1.0-5 broom_0.4.2 lmerTest_2.0-33
## [21] lme4_1.1-12 Matrix_1.2-8 MESS_0.4-3 geepack_1.2-1
## [25] ggplot2_2.2.1 dplyr_0.5.0
##
## loaded via a namespace (and not attached):
## [1] survivalROC_1.0.3 nlme_3.1-131 pbkrtest_0.4-7
## [4] ordinal_2015.6-28 RColorBrewer_1.1-2 rprojroot_1.2
## [7] tools_3.3.3 backports_1.0.5 R6_2.2.0
## [10] KernSmooth_2.23-15 rpart_4.1-10 rmeta_2.16
## [13] nortest_1.0-4 DBI_0.5-1 lazyeval_0.2.0
## [16] mgcv_1.8-17 colorspace_1.3-2 ade4_1.7-5
## [19] nnet_7.3-12 gridExtra_2.2.1 mnormt_1.5-5
## [22] quantreg_5.29 htmlTable_1.9 hermite_1.1.1
## [25] expm_0.999-1 sandwich_2.3-4 labeling_0.3
## [28] scales_0.4.1 checkmate_1.8.2 polspline_1.1.12
## [31] lmtest_0.9-35 psych_1.6.12 mc2d_0.1-18
## [34] multcompView_0.1-7 digest_0.6.12 foreign_0.8-67
## [37] minqa_1.2.4 rmarkdown_1.3 base64enc_0.1-3
## [40] WRS2_0.9-1 htmltools_0.3.5 manipulate_1.0.1
## [43] htmlwidgets_0.8 SuppDists_1.1-9.4 zoo_1.7-14
## [46] acepack_1.4.1 car_2.1-4 magrittr_1.5
## [49] modeltools_0.2-21 Rcpp_0.12.9 DescTools_0.99.19
## [52] munsell_0.4.3 ucminf_1.1-4 stringi_1.1.2
## [55] yaml_2.1.14 plyr_1.8.4 grid_3.3.3
## [58] parallel_3.3.3 stargazer_5.2 splines_3.3.3
## [61] knitr_1.15.1 EMT_1.1 boot_1.3-18
## [64] reshape2_1.4.2 codetools_0.2-15 stats4_3.3.3
## [67] evaluate_0.10 latticeExtra_0.6-28 data.table_1.10.4
## [70] nloptr_1.0.4 bootstrap_2017.2 miscTools_0.6-22
## [73] MatrixModels_0.4-1 gtable_0.2.0 purrr_0.2.2
## [76] reshape_0.8.6 assertthat_0.1 coin_1.1-3
## [79] BSDA_1.01 tibble_1.2 lava_1.4.7
## [82] cluster_2.0.5 maxLik_1.3-4 RVAideMemoire_0.9-63