Machine Learning and Reasoning for Drug Discovery

A tutorial @ECML-PKDD 2021.

TL;DR: This tutorial reviews recent developments on drug discovery using machine learning methods.


A/Prof Truyen Tran

Dr Thin Nguyen

Tri Nguyen

Applied AI Institute, Deakin University

Slides | Videos

Powered by neural networks, modern machine learning has enjoyed great success in data-intensive domains such as computer vision and language, where humans naturally perform well. Machine learning equipped with reasoning is now accelerating fields that traditionally require deep expertise, such as physics, chemistry and biomedicine. This tutorial provides an overview of how machine learning and reasoning are speeding up drug discovery and lowering its cost. This includes how machine learning can help in a wide range of areas such as novel molecule identification, protein representation, drug-target binding, drug repurposing, generative drug design, chemical reaction prediction, retrosynthesis planning, drug-drug interaction, and safety assessment. We will also discuss relevant machine learning models for graph classification, molecular graph transformation, drug generation using deep generative models and reinforcement learning, and chemical reasoning.

Prerequisite: the tutorial assumes some familiarity with deep learning.


The tutorial is broadly organised into three parts. Part A introduces the drug discovery pipeline, from virtual in silico screening to wet lab experimentation to clinical trials. We will then explain how machine learning and reasoning play a role in each stage of the pipeline.

Part B focuses on representation learning of molecular structures and on predicting biochemical properties from those structures. On representing drugs, we will cover traditional fingerprints, and learning on string and graph representations. On representing proteins, we will discuss recent unsupervised embedding techniques operating on sequences and 2D structures. We then talk about how drugs and proteins interact, and recent deep learning techniques to model and predict their binding. Part B ends with the topic of polypharmacy and predicting drug-drug interaction.
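To make the graph representation concrete, here is a minimal sketch in plain Python. It is an illustration only, not a method from the tutorial: the three-symbol atom vocabulary, the function names, and the unweighted sum aggregation are all simplifying assumptions, and real graph neural networks use learned weights and richer atom and bond features. The molecule is ethanol (SMILES "CCO") with hydrogens left implicit.

```python
# Toy molecular graph for ethanol (SMILES "CCO"), heavy atoms only.
# All names below are hypothetical and chosen for illustration.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # undirected bonds between atom indices


def one_hot(symbol, vocab=("C", "N", "O")):
    """Encode an atom symbol as a one-hot feature vector."""
    return [1.0 if symbol == s else 0.0 for s in vocab]


def build_adjacency(num_atoms, bonds):
    """Turn a bond list into a neighbour list for each atom."""
    neighbours = {i: [] for i in range(num_atoms)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return neighbours


def message_passing_step(features, neighbours):
    """One unweighted step: each atom adds its neighbours' features to its own."""
    updated = []
    for i, own in enumerate(features):
        aggregated = list(own)
        for j in neighbours[i]:
            aggregated = [x + y for x, y in zip(aggregated, features[j])]
        updated.append(aggregated)
    return updated


def readout(features):
    """Sum atom vectors into a single fixed-size molecule vector."""
    return [sum(column) for column in zip(*features)]


features = [one_hot(a) for a in atoms]
neighbours = build_adjacency(len(atoms), bonds)
hidden = message_passing_step(features, neighbours)
graph_vector = readout(hidden)
print(graph_vector)
```

The readout turns a variable-sized graph into a fixed-size vector, which is what a downstream property predictor would consume. Because the readout is a sum, the result does not depend on the order in which atoms are listed.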

Part C covers the optimisation of molecular structures to meet desirable drug properties, and generative models for goal-directed exploration of the drug space. We also talk about the chain of synthesis of target drugs, including reaction prediction and retrosynthesis planning. Finally, we explain reasoning over domain knowledge graphs, with applications to recommendation and drug repurposing.
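As one simple illustration of a "desirable drug property" that a goal-directed optimiser could score against, the sketch below counts violations of Lipinski's rule of five (molecular weight over 500 Da, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors). The tutorial does not prescribe this objective. The `Descriptors` class, the `reward` function, and the two candidate molecules are hypothetical, and the descriptor values are invented for the example rather than measured.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-computed molecular descriptors (hypothetical container)."""
    molecular_weight: float  # daltons
    log_p: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def rule_of_five_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four criteria are violated."""
    checks = [
        d.molecular_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def reward(d: Descriptors) -> float:
    """Toy reward in [0, 1]: 1.0 means no violations (an assumed scoring scheme)."""
    return 1.0 - rule_of_five_violations(d) / 4.0


# Invented descriptor values, for illustration only.
small_candidate = Descriptors(300.0, 2.5, 2, 5)
bulky_candidate = Descriptors(650.0, 6.2, 3, 8)

print(reward(small_candidate), reward(bulky_candidate))
```

In a reinforcement learning or Bayesian optimisation setting, a reward like this would typically be combined with other terms such as predicted binding affinity or synthesisability.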

How is this relevant to the AI/ML/DM community?

Drug discovery is a scientific area with profound impact on humanity. The field is steadily moving from being knowledge-driven towards data-driven, where we now routinely screen hundreds of millions of potential drugs and explore the astronomically large chemical space. Machine learning is making important contributions to the field, finding new drugs for previously undruggable targets. On the one hand, current advances in deep learning coupled with big compute have opened up new opportunities to accelerate the drug discovery pipeline. On the other hand, the domain poses challenges not seen before, and this has motivated the development of new kinds of modelling techniques, especially in the area of graphs and geometric machine learning.

Existing related talks

  • Truyen Tran, “AI for drug discovery” (Slides | Video), an invited talk @VietAI Summit, HCM City, Vietnam, Nov 2019.


Part A: Introduction (30 mins)

  • Drug discovery pipeline
  • Machine learning tasks in drug discovery

Part B: Molecular representation and property prediction (90 mins)

  • Molecular representation learning
    • Fingerprints
    • String representation
    • Graph representation
    • Self-supervised learning of molecules
  • Molecular property prediction
    • Quantum chemistry
    • Graph regression and classification
    • Graph multitask learning
    • Explaining graph prediction
    • Data efficient drug discovery
  • Protein representation learning
    • Embedding, BERT
    • 2D contact map
    • 3D structure
    • Protein folding
  • Drug-target binding prediction
    • Multi-target prediction
    • Drug-protein binding as graph-graph interaction
    • Polypharmacy and drug-drug interaction

Part C: Drug design & synthesis (90 mins)

  • Molecular optimisation
    • Bayesian optimisation in latent space
    • Goal-directed reinforcement learning
  • Generative molecular design
    • Deep generative models for molecules
    • Recurrent models for molecules
  • Reasoning on biomedical knowledge graphs
    • Recommendation
    • Drug repurposing
  • Retrosynthesis
    • Chemical planning
    • Chemical reaction as graph morphism
  • Wrapping up & future


  1. Adhikari, B. (2019). "DEEPCON: Protein contact prediction using dilated convolutional neural networks with dropout". Bioinformatics, 36(2), 470–477.
  2. Agrawal, A., & Choudhary, A. (2016). "Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science". Apl Materials, 4(5), 053208.
  3. Alley, Ethan C., et al. "Unified rational protein engineering with sequence-only deep representation learning." bioRxiv (2019): 589333.
  4. Altae-Tran, Han, et al. "Low data drug discovery with one-shot learning." ACS central science 3.4 (2017): 283-293.
  5. Aspuru-Guzik, Alán, Roland Lindh, and Markus Reiher. "The matter simulation (r) evolution." ACS central science 4.2 (2018): 144-152.
  6. Bepler, Tristan, and Bonnie Berger. "Learning protein sequence embeddings using information from structure." International Conference on Learning Representations. 2018.
  7. Bonner, S., Barrett, I. P., Ye, C., Swiers, R., Engkvist, O., Bender, A., ... & Hamilton, W. (2021). "A review of biomedical datasets relating to drug discovery: A knowledge graph perspective". arXiv preprint arXiv:2102.10062.
  8. Bottou, Léon. "From machine learning to machine reasoning." Machine learning 94.2 (2014): 133-149.
  9. Bradshaw, J., et al. "A model to search for synthesizable molecules." Advances in Neural Information Processing Systems 32 (2019).
  10. Callahan, Tiffany J., et al. "Knowledge-based biomedical data science." Annual review of biomedical data science 3 (2020): 23-41.
  11. Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
  12. Cheng, Feixiong, et al. "Prediction of drug-target interactions and drug repositioning via network-based inference." PLoS computational biology 8.5 (2012): e1002503.
  13. Chithrananda, S., Grand, G., & Ramsundar, B. (2020). "Chemberta: Large-scale self-supervised pretraining for molecular property prediction". Machine Learning for Molecules Workshop NeurIPS 2020
  14. Devlin, J., et al. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  15. Do, Kien, et al. "Attentional Multilabel Learning over Graphs-A message passing approach." Machine Learning, 2019.
  16. Do, Kien, Truyen Tran, and Svetha Venkatesh. "Graph Transformation Policy Network for Chemical Reaction Prediction." KDD’19.
  17. Do, Kien, Truyen Tran, and Svetha Venkatesh. "Knowledge graph embedding with multiple relation projections." 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018.
  18. Do, Kien, Truyen Tran, and Svetha Venkatesh. "Learning deep matrix representations." arXiv preprint arXiv:1703.01454 (2017).
  19. Duvenaud, David K., et al. "Convolutional networks on graphs for learning molecular fingerprints." Advances in neural information processing systems. 2015.
  20. Elnaggar, Ahmed, et al. "ProtTrans: towards cracking the language of Life's code through self-supervised deep learning and high performance computing." arXiv preprint arXiv:2007.06225 (2020).
  21. Gilmer, Justin, et al. "Neural message passing for quantum chemistry." International conference on machine learning. PMLR, 2017.
  22. Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. 2016.
  23. Gómez-Bombarelli, Rafael, et al. "Automatic chemical design using a data-driven continuous representation of molecules." ACS Central Science (2016).
  24. Jin, W., Barzilay, R., & Jaakkola, T. (2018). "Junction Tree Variational Autoencoder for Molecular Graph Generation". ICML’18.
  25. Jin, W., Yang, K., Barzilay, R., & Jaakkola, T. (2019). "Learning multimodal graph-to-graph translation for molecular optimization". ICLR'19.
  26. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., ... & Hassabis, D. (2021). "Highly accurate protein structure prediction with AlphaFold". Nature, 596(7873), 583-589.
  27. Kadurin, Artur, et al. "The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology." Oncotarget 8.7 (2017): 10883.
  28. Kandathil, Shaun M., et al. "Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterised proteins." bioRxiv (2021): 2020-11.
  29. Khardon, Roni, and Dan Roth. "Learning to reason." Journal of the ACM (JACM) 44.5 (1997): 697-725.
  30. Kuhlman, Brian, and Philip Bradley. "Advances in protein structure prediction and design." Nature Reviews Molecular Cell Biology 20.11 (2019): 681-697.
  31. Kusner, Matt J., Brooks Paige, and José Miguel Hernández-Lobato. "Grammar variational autoencoder." International Conference on Machine Learning. PMLR, 2017.
  32. Lee, Wing-Hin, et al. "The potential to treat lung cancer via inhalation of repurposed drugs." Advanced drug delivery reviews 133 (2018): 107-130.
  33. Lim, S., Lu, Y., Cho, C. Y., Sung, I., Kim, J., Kim, Y., ... & Kim, S. (2021). "A review on compound-protein interaction prediction methods: Data, format, representation and model". Computational and Structural Biotechnology Journal, 19, 1541.
  34. Lipinski, Christopher A., et al. "Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings." Advanced drug delivery reviews 23.1-3 (1997): 3-25.
  35. Mahmood, Omar, et al. "Masked graph modeling for molecule generation." Nature communications 12.1 (2021): 1-12.
  36. Mohamed, S. K., Nováček, V., & Nounu, A. (2020). "Discovering protein drug targets using knowledge graph embeddings". Bioinformatics, 36(2), 603-610.
  37. Nguyen, T. M., Nguyen, T., Le, T. M., & Tran, T. (2021). “GEFA: Early Fusion Approach in Drug-Target Affinity Prediction”. IEEE/ACM Transactions on Computational Biology and Bioinformatics
  38. Nguyen, T., Le, H., & Venkatesh, S. (2019). "GraphDTA: prediction of drug–target binding affinity using graph convolutional networks". Bioinformatics, 2021.
  39. Nguyen, Tri Minh, et al. "Counterfactual Explanation with Multi-Agent Reinforcement Learning for Drug Target Prediction." arXiv preprint arXiv:2103.12983 (2021).
  40. Paliwal, S., de Giorgio, A., Neil, D., Michel, J. B., & Lacoste, A. M. (2020). "Preclinical validation of therapeutic targets predicted by tensor factorization on heterogeneous graphs". Scientific reports, 10(1), 1-19.
  41. Penmatsa, Aravind, Kevin H. Wang, and Eric Gouaux. "X-ray structure of dopamine transporter elucidates antidepressant mechanism." Nature 503.7474 (2013): 85-90.
  42. Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena. "Deepwalk: Online learning of social representations." Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 2014.
  43. Pham, T., Tran, T., & Venkatesh, S. (2018). "Relational dynamic memory networks". arXiv preprint arXiv:1808.04247.
  44. Pham, Trang, et al. (2017) "Column Networks for Collective Classification." AAAI.
  45. Pham, Trang, Truyen Tran, and Svetha Venkatesh (2018). "Graph Memory Networks for Molecular Activity Prediction." ICPR’18.
  46. Pushpakom, Sudeep, et al. "Drug repurposing: progress, challenges and recommendations." Nature reviews Drug discovery 18.1 (2019): 41-58.
  47. Qiu, Jiezhong, et al. (2020) "GCC: Graph contrastive coding for graph neural network pre-training." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
  48. Rao, R., et al. (2019). "Evaluating Protein Transfer Learning with TAPE". In Advances in Neural Information Processing Systems.
  49. Rong, Yu, et al. "Self-supervised graph transformer on large-scale molecular data." arXiv preprint arXiv:2007.02835 (2020).
  50. Réda, Clémence, Emilie Kaufmann, and Andrée Delahaye-Duriez. "Machine learning applications in drug development." Computational and structural biotechnology journal 18 (2020): 241-252.
  51. Senior, A., et al. (2020). "Improved protein structure prediction using potentials from deep learning". Nature, pages 1–5.
  52. Shi, Chence, et al. "A graph to graphs framework for retrosynthesis prediction." International Conference on Machine Learning. PMLR, 2020.
  53. Shi, Chence, et al. "GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation." International Conference on Learning Representations. 2019.
  54. Simonovsky, Martin, and Nikos Komodakis. "Graphvae: Towards generation of small graphs using variational autoencoders." International conference on artificial neural networks. Springer, Cham, 2018.
  55. Stokes, Jonathan M., et al. "A deep learning approach to antibiotic discovery." Cell 180.4 (2020): 688-702.
  56. Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.
  57. Veličković, Petar, et al. "Graph Attention Networks." International Conference on Learning Representations. 2018.
  58. Yang, K. K., Wu, Z., Bedbrook, C. N., & Arnold, F. H. (2018). "Learned protein embeddings for machine learning". Bioinformatics, 34(15), 2642-2648.
  59. Ying, Rex, et al. "Gnnexplainer: Generating explanations for graph neural networks." Advances in neural information processing systems 32 (2019): 9240.
  60. You, Jiaxuan, et al. "Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation." NeurIPS (2018).
  61. You, Jiaxuan, et al. "GraphRNN: Generating realistic graphs with deep auto-regressive models." ICML (2018).
  62. Yuan, J., Jin, Z., Guo, H., Jin, H., Zhang, X., Smith, T., & Luo, J. (2020). "Constructing biomedical domain-specific knowledge graph with minimum supervision". Knowledge and Information Systems, 62(1), 317-336.
  63. Zhang, Daniel, et al. "The AI index 2021 annual report." arXiv preprint arXiv:2103.06312 (2021).
  64. Zhang, Rui, et al. "Drug repurposing for COVID-19 via knowledge graph completion." Journal of biomedical informatics 115 (2021): 103696.
  65. Zhou, Zhenpeng, et al. "Optimization of molecules via deep reinforcement learning." Scientific reports 9.1 (2019): 1-10.
  66. Zitnik, M., Agrawal, M., & Leskovec, J. (2018). "Modeling polypharmacy side effects with graph convolutional networks". Bioinformatics, 34(13), i457-i466.