Abstract: Developing new, effective and safe drugs to treat cancer remains a challenging and time-consuming task because limited hit rates restrain subsequent development efforts. Although quantitative structure-activity relationship (QSAR) and machine learning-based models for predicting molecule pharmacodynamics and bioactivity have made impressive progress, they have had mixed success at identifying compounds with anticancer properties against multiple cell lines. Here, we have developed a novel predictive tool, pdCSM-cancer, which uses a graph-based signature representation of the chemical structure of a small molecule to accurately predict molecules likely to be active against one or multiple cancer cell lines.
pdCSM-cancer represents the most comprehensive anticancer bioactivity prediction platform developed to date, comprising models trained and validated on experimental growth inhibition concentration (GI50%) data for over 18,000 compounds across 9 tumour types and 74 distinct cancer cell lines. Under 10-fold cross-validation, it achieved Pearson's correlation coefficients of up to 0.74, with comparable performance of up to 0.67 on independent, non-redundant blind tests. Leveraging the insights from these cell line-specific models, we developed a generic predictive model to identify molecules active in at least 60 cell lines. Our final model achieved an Area Under the ROC Curve (AUC) of up to 0.94 on 10-fold cross-validation and up to 0.94 on independent, non-redundant blind tests, outperforming alternative approaches. We believe our predictive tool will provide a valuable resource for optimizing and enriching screening libraries for the identification of effective and safe anticancer molecules.
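For readers less familiar with the evaluation measures quoted above, the following is a minimal Python sketch of how Pearson's correlation coefficient, the area under the ROC curve and a 10-fold split can be computed. It is an illustration only and not the pdCSM-cancer implementation: the function names (`pearson_r`, `roc_auc`, `k_fold_indices`) are hypothetical, and the round-robin fold assignment is an assumption made for simplicity.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation between experimental and predicted values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly.

    Labels are 1 (active) or 0 (inactive); tied scores count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive example")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10):
    """Assign sample indices to k folds round-robin (assumed scheme).

    Each fold serves once as the held-out set; the rest form the training set.
    """
    return [list(range(i, n_samples, k)) for i in range(k)]
```

In a cross-validation loop, each list returned by `k_fold_indices` would be held out in turn, a model fitted on the remaining samples, and `pearson_r` (regression) or `roc_auc` (classification) computed on the held-out predictions.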