SQL Usage
Download an extract the SQLite dump using the following:
import chembl_downloader
path = chembl_downloader.download_extract_sqlite(version="28")
After it’s been downloaded and extracted once, it’s smart and does not need to download
again. It gets stored using pystow automatically in the ~/.data/chembl
directory.
You can modify the previous code slightly by omitting the version keyword argument
to automatically find the latest version of ChEMBL:
import chembl_downloader
path = chembl_downloader.download_extract_sqlite()
The version keyword argument is available for all functions in this package (e.g.,
including chembl_downloader.connect(), chembl_downloader.cursor(), and
chembl_downloader.query()), but will be omitted below for brevity.
Automatic Connection
Inside the archive is a single SQLite database file. Normally, people manually untar this folder then do something with the resulting file. Don’t do this, it’s not reproducible! Instead, the file can be downloaded and a connection can be opened automatically with:
import chembl_downloader
with chembl_downloader.connect() as conn:
with conn.cursor() as cursor:
cursor.execute(...) # run your query string
rows = cursor.fetchall() # get your results
The chembl_downloader.cursor() function provides a convenient wrapper around this
operation:
import chembl_downloader
with chembl_downloader.cursor() as cursor:
cursor.execute(...) # run your query string
rows = cursor.fetchall() # get your results
Run a Query and Get a Pandas DataFrame
The most powerful function is chembl_downloader.query(), which builds on the
previous chembl_downloader.connect() function in combination with
pandas.read_sql() to make a query and load the results into a pandas DataFrame for
any downstream use.
import chembl_downloader
sql = """
SELECT
MOLECULE_DICTIONARY.chembl_id,
MOLECULE_DICTIONARY.pref_name
FROM MOLECULE_DICTIONARY
JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
WHERE molecule_dictionary.pref_name IS NOT NULL
LIMIT 5
"""
df = chembl_downloader.query(sql)
df.to_csv(..., sep="\t", index=False)
Note
Suggestion 1: use pystow to make a reproducible file path that’s portable to
other people’s machines (e.g., it doesn’t have your username in the path).
Suggestion 2: RDKit is now pip-installable with pip install rdkit-pypi, which means most users don’t have to muck around with complicated conda environments and configurations. One of the powerful but understated tools in RDKit is the [rdkit.Chem.PandasTools](https://rdkit.org/docs/source/rdkit.Chem.PandasTools.html) module.
Querying for a Scalar
For SQL queries that return a scalar result, you can use
chembl_downloader.query_scalar(). In the following example, this gets a summary
count over the number of activities in the database.
import chembl_downloader
sql = "SELECT COUNT(activity_id) FROM activities"
count: int = chembl_downloader.query_scalar(sql)