As with the previous vignettes, we begin by loading the relevant libraries and data.

library( tidyverse )
library( ABCmonster )
data( MACCSbinary )

Making predictions on test data

The package provides convenient functions to train a single GBM model on the entire training set, and then apply that model to score new samples. We use this functionality to make predictions on our test data. The training and test sets are part of the same data frame MACCSbinary, where the distinction between the two is given by the Label column: test samples have NA in this column. The helper functions, which are named ABCtrain() and ABCpredict(), recognize this convention and limit the provided data frame to the corresponding rows. Thus, all we have to do is apply both functions to MACCSbinary:

m <- ABCtrain( MACCSbinary )      # Will use rows of MACCSbinary where Label is not-NA
P <- ABCpredict( m, MACCSbinary ) # Will use rows of MACCSbinary where Label is NA

Note that ABCpredict() drops the Label column, because it contains NA values for test samples and, thererfore, carries no information. The predictions are instead stored in a new ABCpred column. We can examine them directly by selecting the appropriate columns from the output.

colnames(P) %>% head
P %>% select( ABCpred, Drug, pubchem_id )
ABCpred Drug pubchem_id
0.3336071 Beauvericin 3007984
0.0399048 Benomyl 28780
0.0283885 Bisantrene 5351322
0.0274539 Bromopyruvate 70684
0.0487629 Calyculin A 5311365
0.2469249 Camptothecin 24360
0.0530215 Cisplatin 441203
0.1881744 Colchicine 6167
0.0641300 Cycloheximide 6197
0.2442043 Dyclonine 3180
0.1768879 Fenpropimorph 93365
0.7887600 Fluconazole 3365
0.1473230 Imatinib 5291
0.0212638 Latrunculin B 3892
0.0604480 Methotrexate 4112
0.9578105 Miconazole 4189
0.4685228 Mitoxantrone 4212
0.2158389 Pentamidine 4735
0.0507315 Rapamycin 5284616
0.1595335 Staurosporine 44259
0.0494319 Tacrolimus 445643
0.1771366 Tamoxifen 2733526
0.1322911 Tunicamycin 6433557
0.1129077 Valinomycin 441139

Validation data

Now that we made predictions for test samples, we need ground truth to evaluate those predictions against. To that end, we collected dose response data for ABC16 and Parental yeast strains across four doses for all 24 test compounds. Each experiment was repeated in duplicate.

The raw dose response validation data can be loaded directly from this package. Let’s load and examine the first few entries.

data(ABCvaldata)
head(ABCvaldata)
##          Drug pubchem_id Abbrev Replicate  Dose     ABC16  Parental
## 1 Beauvericin    3007984    BEA         1     0 1.0000000 1.0000000
## 2 Beauvericin    3007984    BEA         1 Top/4 0.3689786 0.9675383
## 3 Beauvericin    3007984    BEA         1 Top/2 0.1479589 0.8854351
## 4 Beauvericin    3007984    BEA         1   Top 0.0999338 0.9135777
## 5 Beauvericin    3007984    BEA         2     0 1.0000000 1.0000000
## 6 Beauvericin    3007984    BEA         2 Top/4 0.6023782 1.0142114

Each row corresponds to a single drug / replicate / dose tuple. The last two columns contain growth values that have been normalized to the DMSO controls. We can plot this data directly for visualization (see Figure 4a in the paper).

## Define order of labels for the "Dose" axis
doseLbl <- c("0","Top/4","Top/2","Top")

## Prepare the data for plotting
V <- ABCvaldata %>% gather( Strain, Growth, ABC16, Parental ) %>%
     mutate( Dose = match(Dose, doseLbl), DrugRep = str_c(Abbrev, Replicate) )

## Plot dose response curves
ggplot( V, aes(x=Dose, y=Growth, fill=Strain) ) +
    geom_area( alpha = 0.3, color="black", position="identity" ) +
    scale_fill_manual( values = c("ABC16"="green", "Parental"="gray50") ) +
    scale_x_continuous( breaks = 1:4, labels = doseLbl ) +
    scale_y_continuous( breaks = c(0,0.5,1) ) +
    facet_wrap( ~DrugRep, nrow=6, ncol=8 ) + theme_bw() +
    theme( axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
          strip.text.x = element_text(margin = margin(0.1,0,0.1,0,"cm")) )

We may also be interested in concordance of the two duplicates. Let overall drug response be the area under the dose response curve. Because the growth values have been normalized to DMSO, we approximate this area by simply summing the growth values across all four doses. We then compute the log-fold change in drug response between ABC16 and Parental strains, placing us in a position to directly compare the two replicates for each drug.

## Define the color palette
pal <- c( "Parental"="tomato", "Equal"="black", "ABC16"="steelblue" )

## Compute the overall response as the sum across dose-specific values
S <- ABCvaldata %>% group_by( Drug, Abbrev, Replicate ) %>%
    summarize_at( vars(ABC16, Parental), sum ) %>%
    mutate( Value = -log2(ABC16 / Parental) ) %>% ungroup %>%
    select( -ABC16, -Parental ) %>% spread( Replicate, Value ) %>%
    mutate( Sensitivity = cut(`1`, c(-Inf, -0.1,0.1, Inf), labels=names(pal)) )

## Create a scatter plot that directly compares the two replicates
ggplot( S, aes(x=`1`,y=`2`, color=Sensitivity) ) +
    geom_point() + theme_bw() + coord_fixed() +
    scale_color_manual( values=pal ) +
    ggrepel::geom_text_repel( aes(label = Abbrev), show.legend=FALSE ) +
    geom_abline( slope = 1, linetype = "dashed", color = "gray40" ) +
    xlab( "Replicate 1" ) + ylab( "Replicate 2" )

The plot shows a strong concordance between the two replicates. We can also quantify this observation by explicitly performing the correlation test.

with( S, cor.test(`1`, `2`) )
## 
##  Pearson's product-moment correlation
## 
## data:  1 and 2
## t = 20.659, df = 22, p-value = 6.759e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9425843 0.9893738
## sample estimates:
##       cor 
## 0.9751822

Evaluation of performance

To evaluate predictions from our model, we first use the drug names to match up the predictions (column ABCpred from P) against the experimental dose response summary S. The overall measured response is then computed by averaging across the two replicates.

## Match up predictions and experimental data
M <- P %>% select( ABCpred, Drug ) %>% inner_join(S, by="Drug") %>%
    mutate( Response = map2_dbl( `1`, `2`, ~mean(c(.x,.y)) ) )

## Plot the correspondence between the two
## Include a loess fit to the data points
ggplot( M, aes(x=ABCpred, y=Response, color=Sensitivity) ) + theme_bw() +
    geom_point() + xlab("Predicted P(ABC-16 is more sensitive)") +
    ylab( "Measured response,\n averaged across replicates" ) +
    ggrepel::geom_text_repel( aes(label = Abbrev) ) +
    scale_x_log10() + scale_color_manual( values=pal, guide=FALSE )

While the fit doesn’t look particularly compelling, the correlation between predicted and measured values is nevertheless significant. We can compute the Spearman’s rho and the associated p value between the two.

## Apply correlation test to columns ABCpred and Response in M
with( M, cor.test(ABCpred, Response, method="spearman") )
## 
##  Spearman's rank correlation rho
## 
## data:  ABCpred and Response
## S = 1066, p-value = 0.007682
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5365217

Making predictions on new drugs

SECTION UNDER DEVELOPMENT. Here we will present the scripts that allow users to convert a SMILES representation of their drug to the binary MACCS features, which can then be directly provided to ABCpredict() to infer sensitivity.