Accelerating Molecular Discovery with Generative Language Models
A journey through the chemical space
dc.contributor.author
Born, Jannis
dc.contributor.supervisor
Borgwardt, Karsten
dc.contributor.supervisor
Manica, Matteo
dc.contributor.supervisor
Aspuru Guzik, Alán
dc.date.accessioned
2022-12-16T08:16:21Z
dc.date.available
2022-12-15T23:10:01Z
dc.date.available
2022-12-16T08:16:21Z
dc.date.issued
2022
dc.identifier.uri
http://hdl.handle.net/20.500.11850/587547
dc.identifier.doi
10.3929/ethz-b-000587547
dc.description.abstract
The discovery of new molecules and materials with desired properties is pivotal to our success in combatting global challenges such as the climate crisis or emerging diseases. However, navigating the discrete and practically infinite chemical search space while having to respect a cascade of multiproperty objectives is extremely challenging. In the past few decades, the chemical industry has faced not only a decline in productivity, but also ever-rising costs for the research and development of novel materials and molecules. Recently, molecular generative models coupled with virtual screening methods have shown promising results in efficient and systematic chemical space exploration. The hopes are high that such methods can accelerate the molecular discovery process, in particular when coupled with chemical synthesis planning tools and robotic hardware in automated laboratories. However, most generative models are optimized toward simplistic, chemo-centric objectives, disregard system-level information about the target environment of the molecule and thus cannot be applied to generate molecules conditionally for a wide range of objectives. This thesis is about developing conditional molecular generative models that can be queried with a semantic context and flexibly generate molecules for desired conditions without the need for specific optimization. Moreover, this thesis aims to improve the "entanglement" of de novo design and property prediction by developing molecular generative models that possess inductive biases about continuous properties and also excel at predicting such properties. This is achieved by exploiting analogies between natural language and organic chemistry. As a prerequisite for generative modeling, the first part of this thesis is devoted to building predictive models for molecular properties.
The first chapter presents a simple, yet robust and interpretable chemical language model that heavily relies on data augmentation and is shown to exhibit strong performance across a wide range of properties such as toxicity. The next chapter develops proteochemometric language models for protein-ligand binding affinity prediction and demonstrates that by discarding more than 95% of the residues from the protein sequence, the performance of binding affinity prediction for human protein kinases significantly improves. The second part of this thesis focuses on the main goal of developing generative language models for conditional molecular design. Leveraging the property predictors in a reinforcement-learning optimization scheme yields a generative model that can be conditioned on a biomolecular context vector (e.g., a gene expression signature of a malignant tumour or a target protein) and generate molecules with high affinity toward this context. The experiments show that this method generalizes well and can propose molecules with high selectivity for unseen protein targets even in the absence of experimental data for such targets. In a case study on accelerated molecular discovery, the proposed generative model is integrated into a completely autonomous workflow that spans retrosynthesis models, synthesis protocol generation and successful wet-lab synthesis on robotic hardware. The last chapter then proposes a multitask language model that abstracts regression as a conditional sequence modeling problem and thus unifies the previous work on molecular property prediction and conditional generation within the same model. This model not only excels on regression tasks despite relying on a classification loss, it can also be conditioned simultaneously on arbitrary molecular substructures and continuous target properties.
As demonstrated, this model outperforms specialized approaches in conditional molecular design and can decorate seed molecules, proteins or chemical reactions based on a desired property primer without the need for any optimization. This finds particular application in property-driven local exploration of the chemical space and paves the road toward foundation models in material design. Altogether, this thesis may contribute toward accelerated molecular discovery by providing methods to improve the quality of the average hypothesis that is considered for downstream chemical synthesis and wet-lab experimentation.
en_US
dc.format
application/pdf
en_US
dc.language.iso
en
en_US
dc.publisher
ETH Zurich
en_US
dc.rights.uri
http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.subject
deep learning
en_US
dc.subject
Accelerated discovery
en_US
dc.subject
generative modeling
en_US
dc.subject
Computational chemistry
en_US
dc.subject
deep generative models
en_US
dc.subject
Transformer networks
en_US
dc.title
Accelerating Molecular Discovery with Generative Language Models
en_US
dc.type
Doctoral Thesis
dc.rights.license
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
dc.date.published
2022-12-16
ethz.title.subtitle
A journey through the chemical space
en_US
ethz.size
220 p.
en_US
ethz.code.ddc
DDC - DDC::0 - Computer science, information & general works::004 - Data processing, computer science
en_US
ethz.identifier.diss
28807
en_US
ethz.publication.place
Zurich
en_US
ethz.publication.status
published
en_US
ethz.leitzahl
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02060 - Dep. Biosysteme / Dep. of Biosystems Science and Eng.::09486 - Borgwardt, Karsten M. (ehemalig) / Borgwardt, Karsten M. (former)
en_US
ethz.leitzahl.certified
ETH Zürich::00002 - ETH Zürich::00012 - Lehre und Forschung::00007 - Departemente::02060 - Dep. Biosysteme / Dep. of Biosystems Science and Eng.::09486 - Borgwardt, Karsten M. (ehemalig) / Borgwardt, Karsten M. (former)
en_US
ethz.date.deposited
2022-12-15T23:10:01Z
ethz.source
FORM
ethz.eth
yes
en_US
ethz.availability
Open access
en_US
ethz.rosetta.installDate
2022-12-16T08:16:22Z
ethz.rosetta.lastUpdated
2023-02-07T08:53:53Z
ethz.rosetta.exportRequired
true
ethz.rosetta.versionExported
true
ethz.COinS
ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.atitle=Accelerating%20Molecular%20Discovery%20with%20Generative%20Language%20Models&rft.date=2022&rft.au=Born,%20Jannis&rft.genre=unknown&rft.btitle=Accelerating%20Molecular%20Discovery%20with%20Generative%20Language%20Models