Hack for Compound Similarity Searching with Neo4j

Nov 20, 2022

As cheminformaticians, we care a great deal about similarity search: teams across the pharmaceutical industry rely on it to find similar compounds by fingerprints, features, pharmacophores and more, in the hope of turning up leads for their projects. Molecular fingerprints have long been used in drug discovery and virtual screening, and they are popular among informaticians because they are easy to use and make substructure and similarity searches fast. A number of approaches to similarity search have been developed on both SQL and NoSQL databases. An RDKit cartridge has also been built to connect Neo4j with RDKit for similarity searching. In this post, however, I try running compound similarity searches without the RDKit plugin, using instead Neo4j's Graph Data Science (GDS) library and its similarity functions such as cosine, Pearson, and Jaccard.

We use RDKit to generate fingerprints for the compounds as bit vectors and store them in the database as a node property. The script is available on my GitHub. The trick I used when inserting them is to convert the bits to the Float datatype; I am not sure exactly why the integer version doesn't work. Once the node property holds floats, you can call the gds.similarity functions to run similarity searches. The ECFP4 and MACCS fingerprints are generated with RDKit as shown below:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import MACCSkeys

# Read the first 10,000 compounds
dataset = pd.read_csv('chembl_smiles.csv', nrows=10000, low_memory=False)

records = []
for i, (ind, row) in enumerate(dataset.iterrows(), start=1):
    smiles = row['canonical_smiles']
    # MolFromSmiles returns None when RDKit cannot parse the SMILES
    mol = Chem.MolFromSmiles(smiles) if isinstance(smiles, str) else None
    if mol is None:
        # Report the compound and carry on with the remaining rows
        print("Could not parse:", row['chembl_id'])
        continue
    # ECFP4 corresponds to a Morgan fingerprint of radius 2, here folded to 2048 bits
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048).ToBitString()
    ms = MACCSkeys.GenMACCSKeys(mol).ToBitString()
    records.append({
        'chembl_id': row['chembl_id'],
        'SMILES': smiles,
        # Store each fingerprint as a comma-separated string of 0/1 characters
        'MACCS': ','.join(ms),
        'ECFP4': ','.join(fp),
    })
    # Progress message every 5000 rows
    if i % 5000 == 0:
        print("Batch : ", i // 5000)

dat = pd.DataFrame(records)
dat.to_csv('chembl_test.csv', sep=',', index=False)

Now we create a connection to the Neo4j instance and load the data with the fingerprints. The trick is to convert each bit to a float so that the fingerprint is stored as an array of floats in a node property.
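
The snippets in this post call `gph_conn.query(...)` without showing how `gph_conn` is built. A minimal sketch of such a wrapper follows, assuming the official `neo4j` Python driver. The class name `Neo4jConnection`, the helper `connect`, and the URI and credentials in the comment are placeholders of mine, not taken from the original script.

```python
class Neo4jConnection:
    """Thin wrapper exposing the query() method used in this post (hypothetical)."""

    def __init__(self, driver):
        # `driver` is expected to behave like neo4j.GraphDatabase.driver(...)
        self._driver = driver

    def query(self, cypher, parameters=None):
        # Run one Cypher statement and return all result records as a list
        with self._driver.session() as session:
            return list(session.run(cypher, parameters or {}))

    def close(self):
        self._driver.close()


def connect(uri, user, password):
    # Imported here so the wrapper class itself does not require the driver
    from neo4j import GraphDatabase
    return Neo4jConnection(GraphDatabase.driver(uri, auth=(user, password)))


# Example with placeholder URI and credentials:
# gph_conn = connect("bolt://localhost:7687", "neo4j", "password")
```

Passing the driver into the constructor keeps the wrapper easy to exercise with a stand-in object when no database is running.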

# Remove any Mol nodes left over from a previous run
gph_conn.query("""MATCH (n:Mol)
DETACH DELETE n""")

# Create one Mol node per CSV row, storing each fingerprint as a list of floats
gph_conn.query("""
               // USING PERIODIC COMMIT 5000
               LOAD CSV WITH HEADERS FROM 'file:///D:/Github/neo4j_book/chembl_test.csv' AS row
               CREATE (m:Mol {CHEMBL_ID: row.chembl_id, SMILES: row.SMILES,
                              ECFP4: [s IN split(row.ECFP4, ",") | toFloat(s)],
                              MACCS: [s IN split(row.MACCS, ",") | toFloat(s)]
               })
               """)

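`LOAD CSV` needs the file to be readable by the Neo4j server. If that is inconvenient, the same nodes could be created by sending the rows as query parameters and unpacking them with `UNWIND`. The sketch below is one way to do that and is not what the original script does; `bits_to_floats`, `load_in_batches`, and the `conn.query(cypher, parameters)` signature are assumptions made for illustration.

```python
import csv

CREATE_MOLS = """
UNWIND $rows AS row
CREATE (m:Mol {CHEMBL_ID: row.chembl_id, SMILES: row.SMILES,
               ECFP4: row.ECFP4, MACCS: row.MACCS})
"""


def bits_to_floats(bit_string):
    # "1,0,1" -> [1.0, 0.0, 1.0], the Python analogue of split() + toFloat()
    return [float(b) for b in bit_string.split(",")]


def load_in_batches(conn, csv_path, batch_size=5000):
    # `conn` is any object with a query(cypher, parameters) method (assumed)
    batch, total = [], 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            batch.append({
                "chembl_id": row["chembl_id"],
                "SMILES": row["SMILES"],
                "ECFP4": bits_to_floats(row["ECFP4"]),
                "MACCS": bits_to_floats(row["MACCS"]),
            })
            if len(batch) == batch_size:
                conn.query(CREATE_MOLS, {"rows": batch})
                total += len(batch)
                batch = []
    if batch:
        # Send the final, possibly partial, batch
        conn.query(CREATE_MOLS, {"rows": batch})
        total += len(batch)
    return total
```

Because the conversion to floats happens in Python, the node properties arrive as float lists without any `toFloat` call in the Cypher.
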
The last step is to run a query with the GDS library. The query below uses Pearson correlation on ECFP4 and cosine similarity on MACCS, but other similarity functions are available. A nice feature is that you can score two feature sets, ECFP4 and MACCS, together in a single query. In my tests, however, gds.similarity.jaccard gave incorrect scores; this may be a problem with the Neo4j implementation, and others have already reported it on the forum. Cosine and Pearson work well, as shown below:

import pandas as pd

# Rank all molecules against CHEMBL6254: Pearson on ECFP4, cosine on MACCS
results = gph_conn.query("""MATCH (n1:Mol {CHEMBL_ID: 'CHEMBL6254'})
                            MATCH (n2:Mol)
                            RETURN n2.CHEMBL_ID, gds.similarity.pearson(n1.ECFP4, n2.ECFP4) AS ECFP4,
                            gds.similarity.cosine(n1.MACCS, n2.MACCS) AS MACCS
                            ORDER BY ECFP4 DESC LIMIT 20;
                         """)
result = pd.DataFrame(results)
# Name the three returned columns
result = result.set_axis(['CHEMBL_ID', 'ECFP4', 'MACCS'], axis=1)
result.head(10)
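
To sanity-check the scores coming back from the database, the same measures can be computed locally on a pair of fingerprints. The sketch below uses plain Python on 0/1 float lists; the function names are mine, and the Tanimoto function is the usual bit-vector definition from cheminformatics. I have not verified how `gds.similarity.jaccard` interprets such lists, so a mismatch with `tanimoto` here would only be a starting point for investigation.

```python
import math


def cosine(a, b):
    # Dot product divided by the product of the vector lengths
    # (undefined when either vector is all zeros)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pearson(a, b):
    # Pearson correlation is the cosine of the mean-centred vectors
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def tanimoto(a, b):
    # Bits set in both fingerprints divided by bits set in either
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either


# Two toy 4-bit fingerprints stored as floats, as in the node properties
fp1 = [1.0, 1.0, 0.0, 0.0]
fp2 = [1.0, 0.0, 1.0, 0.0]
print(cosine(fp1, fp2), pearson(fp1, fp2), tanimoto(fp1, fp2))
```

Comparing these values with the GDS output for one or two known pairs is a quick way to confirm that the float conversion did not change the scores.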

I am sure the computation can be scaled in a variety of ways, and this feature of Neo4j can enable informaticians to track a variety of molecular paths and to search for similar molecular paths at the discovery stage.

It is up to readers to test this on a larger set such as ChEMBL or GDB.

Written by Abhik Seal