Authors:
(1) Clayton W. Kosonocky, Department of Molecular Biosciences, The University of Texas at Austin (clayton.kosonocky@utexas.edu);
(2) Aaron L. Feller, Department of Molecular Biosciences, The University of Texas at Austin (aaron.feller@utexas.edu);
(3) Claus O. Wilke, Department of Integrative Biology, The University of Texas at Austin and Corresponding Author (wilke@austin.utexas.edu);
(4) Andrew D. Ellington, Department of Molecular Biosciences, The University of Texas at Austin (ellingtonlab@gmail.com).
3. Results and Discussion
In-silico drug discovery methods have long relied on chemical similarity searches as a computational tool in pharmaceutical pipelines [22]. Recently, language models have been successful in biochemical prediction tasks for chemical properties such as drug-likeness, protein-ligand interactions, and other metrics [11, 13]. Because a multitude of characteristics can be predicted from chemical language models (CLMs), it is plausible that embeddings from these models contain a summary of molecular properties for a given molecule. Simplified Molecular-Input Line-Entry System (SMILES) representations encode chemical structures as strings and can be used as inputs to CLMs, but due to the nature of chemical connectivity, there are often many valid representations for the same molecule [17]. This multiplicity of input formats causes a CLM to generate different embeddings for the same molecule, which can either be mitigated through string standardization (canonicalization) or exploited as a data augmentation technique that allows CLMs to better span molecular space [25, 26]. The latter reduces overfitting to string artifacts and improves structural comprehension; an analogy in natural language would be that training a model on many languages improves its understanding of language as a whole compared to a model trained only on English [25, 26].
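To make the multiplicity concrete: the same molecule yields a different SMILES string depending on which atom the traversal starts from. The sketch below is a deliberately minimal, hypothetical illustration (it handles only acyclic, single-bonded molecules and omits ring closures, bond orders, and aromaticity, all of which a real toolkit such as RDKit handles); the molecule and `dfs_smiles` helper are constructed solely for this example. Enumerating depth-first traversals of ethanol from each possible root atom produces three distinct but equivalent strings:

```python
# Toy molecule: ethanol (C-C-O) as an adjacency list.
# Atom indices 0..2 carry element symbols; bonds map each atom to its neighbors.
atoms = ["C", "C", "O"]
bonds = {0: [1], 1: [0, 2], 2: [1]}

def dfs_smiles(root, parent=None):
    """Emit a SMILES-like string by depth-first traversal rooted at `root`.

    All neighbor branches except the last are wrapped in parentheses,
    mirroring how SMILES writers serialize branching.
    """
    neighbors = [a for a in bonds[root] if a != parent]
    out = atoms[root]
    for i, nb in enumerate(neighbors):
        sub = dfs_smiles(nb, root)
        out += f"({sub})" if i < len(neighbors) - 1 else sub
    return out

# One valid string per choice of root atom — same molecule, different text.
variants = {dfs_smiles(r) for r in range(len(atoms))}
print(variants)  # e.g. {'CCO', 'C(C)O', 'OCC'}
```

A canonicalization algorithm is, in essence, a deterministic rule for picking exactly one of these traversals; varying the root atom instead (as in the "RDKit Atom n" construction discussed below) is one way to generate the alternative strings used for augmentation.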
However, the combination of alternative canonicalizations and data augmentation may prove advantageous for a chemical similarity search. Recent reports on emergence in the natural language processing literature demonstrated that language models understand Swahili and other languages remarkably well despite such languages constituting <0.01% of the training data [27, 28]. SMILES canonicalization formats often share characters and substructures with one another, the primary difference being the specific grammatical rules used to assemble a given molecule. For CLMs trained on one canonicalization, inputs using unseen canonicalizations would be akin to underrepresented languages in natural language training sets: the underlying string structure of the unseen canonicalization is not entirely unknown to the model, but is novel in the context of the molecule at hand. It seems likely that a vector comparison between embeddings from two different canonicalizations could ignore the differences in string and structural detail and instead rely on whole-molecule properties approximating molecular function, serving as a novel prompt engineering strategy for the discovery of structurally distinct functional analogues. This would be equivalent to querying a predominantly English-trained model with a French phrase to obtain an embedding, and then searching amongst English phrases to find the entry with the closest semantic meaning. To our knowledge this potential behavior is largely unexplored.
To test this hypothesis, we built a CLM-embedding-based chemical similarity search utilizing one canonicalization for the CLM and database, and another canonicalization for the query SMILES strings. A pipeline was developed to perform a chemical similarity search using cosine similarity on CLM embeddings obtained from ChemBERTa, an unsupervised transformer encoder-based model [29]. This pipeline was named the Chemical Semantic Search (CheSS), outlined in Figure 1. ChemBERTa was trained on SMILES strings canonicalized using the default RDKit implementation, herein referred to as "RDKit Atom 0" [29, 31]. We converted ChemBERTa's training set into a molecular database of ∼10M RDKit Atom 0 SMILES strings, upon which CheSS searches were performed. We then explored how the CheSS search results were impacted by three different SMILES canonicalizations. The first was RDKit's default canonicalization, RDKit Atom 0. The second, herein referred to as "RDKit Atom n", was created by varying the root atom index and selecting the SMILES string whose embedding was farthest from the RDKit Atom 0 embedding (Fig. S2). The third used OEChem 2.3.0, a markedly different canonicalization algorithm, and is referred to as "OEChem". All CheSS searches utilizing these three canonicalizations differed only in the representation of the query molecule, with the database and model remaining constant.
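The core of such a pipeline can be sketched compactly. The code below is a hypothetical, self-contained illustration, not the CheSS implementation: `embed` is a deterministic stand-in for a real encoder (in CheSS, the query and database strings would instead be embedded by ChemBERTa), and the database is a handful of placeholder SMILES strings rather than the ∼10M-entry set. The search itself, ranking database entries by cosine similarity to the query embedding, has the same shape regardless of the encoder:

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embed(smiles, dim=64):
    """Stand-in for a CLM encoder: a deterministic pseudo-random vector
    seeded by the string. A real pipeline would call the model here."""
    rng = random.Random(smiles)
    return [rng.gauss(0, 1) for _ in range(dim)]

def chess_search(query_smiles, database, k=3):
    """Return the k database SMILES most cosine-similar to the query.

    The query may use a different canonicalization than the database;
    only the query-side representation changes between search variants.
    """
    q = embed(query_smiles)
    scored = [(cosine_similarity(q, embed(s)), s) for s in database]
    return [s for _, s in sorted(scored, reverse=True)[:k]]

# Placeholder database of canonical SMILES strings.
db = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
hits = chess_search("OCC", db, k=2)  # query in an alternative canonicalization
```

With a real encoder, the interesting question, explored below, is whether the alternative-canonicalization query ("OCC" here, versus the database's "CCO") retrieves functional analogues rather than mere string matches.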
This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.