AstraZeneca and NVIDIA spur healthtech revolution via Cambridge-1 supercomputer

NVIDIA is collaborating with AstraZeneca, the biopharmaceutical company headquartered in Cambridge, UK, and with UF Health, the University of Florida’s academic health centre, on new AI research projects using breakthrough transformer neural network technology.

Transformer-based neural network architectures – which have become available only in the last several years – allow researchers to leverage massive datasets through self-supervised training methods, avoiding the need for manually labelled examples during pre-training.
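As a loose illustration of the idea (not any actual framework code), self-supervised pre-training manufactures its own labels from raw data: hide part of a sequence and train the model to predict the hidden part from context. A minimal sketch of building such training pairs from an unlabelled sentence:

```python
def make_masked_pairs(tokens, mask_token="[MASK]"):
    """Turn a raw token sequence into (input, target) training pairs
    by hiding one token at a time -- no human labelling required."""
    pairs = []
    for i, tok in enumerate(tokens):
        masked = list(tokens)
        masked[i] = mask_token
        pairs.append((masked, tok))  # model must recover `tok` from context
    return pairs

sentence = "the cat sat on the mat".split()
pairs = make_masked_pairs(sentence)
print(pairs[1])  # (['the', '[MASK]', 'sat', 'on', 'the', 'mat'], 'cat')
```

Every position in the raw corpus yields a training example for free, which is what lets these models scale to datasets far larger than any hand-labelled collection.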

These models, equally adept at learning the syntactic rules to describe chemistry as they are at learning the grammar of languages, are finding applications across research domains and modalities.

NVIDIA is collaborating with AstraZeneca on a transformer-based generative AI model for the chemical structures used in drug discovery. It will be among the very first projects to run on Cambridge-1, soon to go online at the KAO Data Centre in Harlow as the UK’s largest supercomputer.

The model will be open sourced, available to researchers and developers in the NVIDIA NGC software catalogue and deployable in the NVIDIA Clara Discovery platform for computational drug discovery.

Separately, UF Health is harnessing NVIDIA’s state-of-the-art Megatron framework and BioMegatron pre-trained model – available on NGC – to develop GatorTron, the largest clinical language model to date.

New NGC applications include AtacWorks, a deep learning model that identifies accessible regions of DNA, and MELD, a tool for inferring the structure of biomolecules from sparse, ambiguous or noisy data.

The MegaMolBART drug discovery model being developed by NVIDIA and AstraZeneca is slated for use in reaction prediction, molecular optimisation and de novo molecular generation. 

It’s based on AstraZeneca’s MolBART transformer model and is being trained on the ZINC chemical compound database – using NVIDIA’s Megatron framework to enable massively scaled-out training on supercomputing infrastructure.

The large ZINC database allows researchers to pretrain a model that understands chemical structure, bypassing the need for hand-labeled data. Armed with a statistical understanding of chemistry, the model will be specialised for a number of downstream tasks, including predicting how chemicals will react with each other and generating new molecular structures.
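As a hypothetical sketch of how such label-free pre-training works on chemistry (the model’s actual preprocessing is not described in this article), a BART-style network is trained to reconstruct an original SMILES string from a corrupted copy; the corrupted/original pair is the training example, so the statistical rules of chemical notation are learned without any hand labels:

```python
import random

def corrupt_smiles(smiles, mask_token="<mask>", frac=0.3, seed=42):
    """BART-style noising: replace a random fraction of characters in a
    SMILES string with a mask token. The (corrupted, original) pair forms
    a self-supervised training example -- the model learns chemical
    structure by learning to undo the corruption."""
    rng = random.Random(seed)
    chars = list(smiles)
    n_mask = max(1, int(len(chars) * frac))
    for i in rng.sample(range(len(chars)), n_mask):
        chars[i] = mask_token
    return "".join(chars)

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"  # SMILES string for aspirin
noisy = corrupt_smiles(aspirin)
print(noisy)  # corrupted encoder input; the decoder target is `aspirin`
```

Once the model can reliably restore valid molecules from noise, the same learned representation can be specialised for the downstream tasks named above, such as reaction prediction or generating novel structures.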

“Just as AI language models can learn the relationships between words in a sentence, our aim is that neural networks trained on molecular structure data will be able to learn the relationships between atoms in real-world molecules,” said Ola Engkvist, head of molecular AI, discovery sciences, and R&D at AstraZeneca.

“Once developed, this NLP model will be open source, giving the scientific community a powerful tool for faster drug discovery.”

The model, trained using NVIDIA DGX SuperPOD, gives researchers ideas for molecules that don’t exist in databases but could be potential drug candidates. 

Computational methods, known as in-silico techniques, allow drug developers to search through more of the vast chemical space and optimise pharmacological properties before shifting to expensive and time-consuming lab testing.

This collaboration will use the NVIDIA DGX A100-powered Cambridge-1 and Selene supercomputers to run large workloads at scale. Cambridge-1 is the largest supercomputer in the UK, ranking No. 3 on the Green500 and No. 29 on the TOP500 list of the world’s most powerful systems. 

NVIDIA’s Selene supercomputer topped the most recent Green500 and ranks fifth on the TOP500.
