SDF Usage

This example is more fit-for-purpose than the previous two. The chembl_downloader.supplier() function ensures that the latest SDF dump is downloaded, then loads it from the gzip file into an rdkit.Chem.ForwardSDMolSupplier. It is used as a context manager so that the underlying file stays open until parsing is done. Like the previous examples, it can also take an explicit version.

from rdkit import Chem

import chembl_downloader

with chembl_downloader.supplier() as suppl:
    data = []
    for mol in suppl:
        # Skip records RDKit could not parse, and molecules with more than 50 atoms
        if mol is None or mol.GetNumAtoms() > 50:
            continue
        fp = Chem.PatternFingerprint(mol, fpSize=1024, tautomerFingerprints=True)
        smi = Chem.MolToSmiles(mol)
        data.append((smi, fp))

This example was adapted from Greg Landrum’s RDKit blog post on generalized substructure search.
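The reason for the context manager can be sketched with the standard library alone. The snippet below is not how chembl_downloader is implemented; record_supplier() and iter_sdf_records() are hypothetical names, and the toy file only mimics the SDF convention that each record ends with a `$$$$` line. It shows that a lazy reader over a gzip file must be consumed while the file is still open:

```python
import gzip
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


def iter_sdf_records(handle) -> Iterator[str]:
    """Lazily yield one record at a time; each record ends with a '$$$$' line."""
    lines = []
    for line in handle:
        if line.strip() == "$$$$":
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)


@contextmanager
def record_supplier(path):
    """Hypothetical stand-in for supplier(): keep the gzip file open while records are read."""
    with gzip.open(path, "rt") as handle:
        yield iter_sdf_records(handle)
    # The file is closed only once the caller's `with` block has exited.


with tempfile.TemporaryDirectory() as directory:
    # Write a tiny two-record toy file so the sketch runs without any download.
    path = Path(directory) / "toy.sdf.gz"
    with gzip.open(path, "wt") as handle:
        handle.write("mol-1\nfirst record\n$$$$\nmol-2\nsecond record\n$$$$\n")

    # All iteration happens inside the block, while the file is still open.
    with record_supplier(path) as records:
        names = [record.splitlines()[0] for record in records]

print(names)  # prints ['mol-1', 'mol-2']
```

Because the records are produced lazily, iterating over them after the `with` block has exited would mean reading from a closed file, which is why the loop in the example above sits inside the block.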

Iterate over SMILES

This example uses chembl_downloader.iterate_smiles(), which relies on chembl_downloader.supplier() and RDKit, to get SMILES strings from the molecules in ChEMBL’s SDF file. If you want direct access to the RDKit molecule objects, use chembl_downloader.supplier() instead.

import chembl_downloader

for smiles in chembl_downloader.iterate_smiles():
    print(smiles)

Get an RDKit substructure library

Building on chembl_downloader.supplier(), the chembl_downloader.get_substructure_library() function makes preparing a substructure library automated and reproducible. It also caches the result: the build takes on the order of tens of minutes but only has to be done once, after which loading from the pickled object takes on the order of seconds.

The implementation was inspired by Greg Landrum’s RDKit blog post, Some new features in the Substruct Library. The following example shows how it can be used to accomplish some of the first tasks presented in the post:

from rdkit import Chem

import chembl_downloader

library = chembl_downloader.get_substructure_library()
query = Chem.MolFromSmarts("[O,N]=C-c:1:c:c:n:c:c:1")
matches = library.GetMatches(query)
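The build-once, load-from-pickle caching described above can be illustrated with a minimal standard-library sketch. This is an assumption about the general pattern, not the actual code in chembl_downloader: load_or_build() and slow_build() are hypothetical helpers, and a small dictionary stands in for the substructure library.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if it exists; otherwise build, pickle, and return it."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        with cache_path.open("rb") as file:
            return pickle.load(file)
    result = build()
    with cache_path.open("wb") as file:
        pickle.dump(result, file, protocol=pickle.HIGHEST_PROTOCOL)
    return result


build_calls = []


def slow_build():
    """Hypothetical stand-in for the expensive library build."""
    build_calls.append(1)
    return {"toy-1": "CCO", "toy-2": "c1ccccc1"}


with tempfile.TemporaryDirectory() as directory:
    cache = Path(directory) / "library.pkl"
    first = load_or_build(cache, slow_build)   # builds and writes the pickle
    second = load_or_build(cache, slow_build)  # loads the pickle, no rebuild

print(len(build_calls), first == second)  # prints 1 True
```

The second call never runs the build function, which is the property that turns a build of tens of minutes into a load of a few seconds.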