Prospective validation

As with the previous vignettes, we begin by loading the relevant libraries and data.

library( tidyverse )
library( ABCmonster )
data( MACCSbinary )

Making predictions on test data

The package provides convenient functions to train a single GBM model on the entire training set, and then apply that model to score new samples. We use this functionality to make predictions on our test data. The training and test sets are part of the same data frame MACCSbinary, where the distinction between the two is given by the Label column: test samples have NA in this column. The helper functions, which are named ABCtrain() and ABCpredict(), recognize this convention and limit the provided data frame to the corresponding rows. Thus, all we have to do is apply both functions to MACCSbinary:

m <- ABCtrain( MACCSbinary )      # Will use rows of MACCSbinary where Label is not-NA
P <- ABCpredict( m, MACCSbinary ) # Will use rows of MACCSbinary where Label is NA

Note that ABCpredict() drops the Label column, because it contains NA values for test samples and, thererfore, carries no information. The predictions are instead stored in a new ABCpred column. We can examine them directly by selecting the appropriate columns from the output.

colnames(P) %>% head
P %>% select( ABCpred, Drug, pubchem_id )

ABCpred	Drug	pubchem_id
0.3336071	Beauvericin	3007984
0.0399048	Benomyl	28780
0.0283885	Bisantrene	5351322
0.0274539	Bromopyruvate	70684
0.0487629	Calyculin A	5311365
0.2469249	Camptothecin	24360
0.0530215	Cisplatin	441203
0.1881744	Colchicine	6167
0.0641300	Cycloheximide	6197
0.2442043	Dyclonine	3180
0.1768879	Fenpropimorph	93365
0.7887600	Fluconazole	3365
0.1473230	Imatinib	5291
0.0212638	Latrunculin B	3892
0.0604480	Methotrexate	4112
0.9578105	Miconazole	4189
0.4685228	Mitoxantrone	4212
0.2158389	Pentamidine	4735
0.0507315	Rapamycin	5284616
0.1595335	Staurosporine	44259
0.0494319	Tacrolimus	445643
0.1771366	Tamoxifen	2733526
0.1322911	Tunicamycin	6433557
0.1129077	Valinomycin	441139

Validation data

Now that we made predictions for test samples, we need ground truth to evaluate those predictions against. To that end, we collected dose response data for ABC16 and Parental yeast strains across four doses for all 24 test compounds. Each experiment was repeated in duplicate.

The raw dose response validation data can be loaded directly from this package. Let’s load and examine the first few entries.

data(ABCvaldata)
head(ABCvaldata)

##          Drug pubchem_id Abbrev Replicate  Dose     ABC16  Parental
## 1 Beauvericin    3007984    BEA         1     0 1.0000000 1.0000000
## 2 Beauvericin    3007984    BEA         1 Top/4 0.3689786 0.9675383
## 3 Beauvericin    3007984    BEA         1 Top/2 0.1479589 0.8854351
## 4 Beauvericin    3007984    BEA         1   Top 0.0999338 0.9135777
## 5 Beauvericin    3007984    BEA         2     0 1.0000000 1.0000000
## 6 Beauvericin    3007984    BEA         2 Top/4 0.6023782 1.0142114

Each row corresponds to a single drug / replicate / dose tuple. The last two columns contain growth values that have been normalized to the DMSO controls. We can plot this data directly for visualization (see Figure 4a in the paper).

## Define order of labels for the "Dose" axis
doseLbl <- c("0","Top/4","Top/2","Top")

## Prepare the data for plotting
V <- ABCvaldata %>% gather( Strain, Growth, ABC16, Parental ) %>%
     mutate( Dose = match(Dose, doseLbl), DrugRep = str_c(Abbrev, Replicate) )

## Plot dose response curves
ggplot( V, aes(x=Dose, y=Growth, fill=Strain) ) +
    geom_area( alpha = 0.3, color="black", position="identity" ) +
    scale_fill_manual( values = c("ABC16"="green", "Parental"="gray50") ) +
    scale_x_continuous( breaks = 1:4, labels = doseLbl ) +
    scale_y_continuous( breaks = c(0,0.5,1) ) +
    facet_wrap( ~DrugRep, nrow=6, ncol=8 ) + theme_bw() +
    theme( axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
          strip.text.x = element_text(margin = margin(0.1,0,0.1,0,"cm")) )

We may also be interested in concordance of the two duplicates. Let overall drug response be the area under the dose response curve. Because the growth values have been normalized to DMSO, we approximate this area by simply summing the growth values across all four doses. We then compute the log-fold change in drug response between ABC16 and Parental strains, placing us in a position to directly compare the two replicates for each drug.

## Define the color palette
pal <- c( "Parental"="tomato", "Equal"="black", "ABC16"="steelblue" )

## Compute the overall response as the sum across dose-specific values
S <- ABCvaldata %>% group_by( Drug, Abbrev, Replicate ) %>%
    summarize_at( vars(ABC16, Parental), sum ) %>%
    mutate( Value = -log2(ABC16 / Parental) ) %>% ungroup %>%
    select( -ABC16, -Parental ) %>% spread( Replicate, Value ) %>%
    mutate( Sensitivity = cut(`1`, c(-Inf, -0.1,0.1, Inf), labels=names(pal)) )

## Create a scatter plot that directly compares the two replicates
ggplot( S, aes(x=`1`,y=`2`, color=Sensitivity) ) +
    geom_point() + theme_bw() + coord_fixed() +
    scale_color_manual( values=pal ) +
    ggrepel::geom_text_repel( aes(label = Abbrev), show.legend=FALSE ) +
    geom_abline( slope = 1, linetype = "dashed", color = "gray40" ) +
    xlab( "Replicate 1" ) + ylab( "Replicate 2" )

The plot shows a strong concordance between the two replicates. We can also quantify this observation by explicitly performing the correlation test.

with( S, cor.test(`1`, `2`) )

## 
##  Pearson's product-moment correlation
## 
## data:  1 and 2
## t = 20.659, df = 22, p-value = 6.759e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9425843 0.9893738
## sample estimates:
##       cor 
## 0.9751822

Evaluation of performance

To evaluate predictions from our model, we first use the drug names to match up the predictions (column ABCpred from P) against the experimental dose response summary S. The overall measured response is then computed by averaging across the two replicates.

## Match up predictions and experimental data
M <- P %>% select( ABCpred, Drug ) %>% inner_join(S, by="Drug") %>%
    mutate( Response = map2_dbl( `1`, `2`, ~mean(c(.x,.y)) ) )

## Plot the correspondence between the two
## Include a loess fit to the data points
ggplot( M, aes(x=ABCpred, y=Response, color=Sensitivity) ) + theme_bw() +
    geom_point() + xlab("Predicted P(ABC-16 is more sensitive)") +
    ylab( "Measured response,\n averaged across replicates" ) +
    ggrepel::geom_text_repel( aes(label = Abbrev) ) +
    scale_x_log10() + scale_color_manual( values=pal, guide=FALSE )

While the fit doesn’t look particularly compelling, the correlation between predicted and measured values is nevertheless significant. We can compute the Spearman’s rho and the associated p value between the two.

## Apply correlation test to columns ABCpred and Response in M
with( M, cor.test(ABCpred, Response, method="spearman") )

## 
##  Spearman's rank correlation rho
## 
## data:  ABCpred and Response
## S = 1066, p-value = 0.007682
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5365217

Making predictions on new drugs

SECTION UNDER DEVELOPMENT. Here we will present the scripts that allow users to convert a SMILES representation of their drug to the binary MACCS features, which can then be directly provided to ABCpredict() to infer sensitivity.

Making predictions on test data

Validation data

Evaluation of performance

Making predictions on new drugs

Contents