SQL Usage
=========

Download an extract the SQLite dump using the following:

.. code-block:: python

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite(version="28")

After it's been downloaded and extracted once, it's smart and does not need to download
again. It gets stored using :mod:`pystow` automatically in the ``~/.data/chembl``
directory.

You can modify the previous code slightly by omitting the ``version`` keyword argument
to automatically find the latest version of ChEMBL:

.. code-block:: python

    import chembl_downloader

    path = chembl_downloader.download_extract_sqlite()

The ``version`` keyword argument is available for all functions in this package (e.g.,
including :func:`chembl_downloader.connect`, :func:`chembl_downloader.cursor`, and
:func:`chembl_downloader.query`), but will be omitted below for brevity.

Automatic Connection
--------------------

Inside the archive is a single SQLite database file. Normally, people manually untar
this folder then do something with the resulting file. Don't do this, it's not
reproducible! Instead, the file can be downloaded and a connection can be opened
automatically with:

.. code-block:: python

    import chembl_downloader

    with chembl_downloader.connect() as conn:
        with conn.cursor() as cursor:
            cursor.execute(...)  # run your query string
            rows = cursor.fetchall()  # get your results

The :func:`chembl_downloader.cursor` function provides a convenient wrapper around this
operation:

.. code-block:: python

    import chembl_downloader

    with chembl_downloader.cursor() as cursor:
        cursor.execute(...)  # run your query string
        rows = cursor.fetchall()  # get your results

Run a Query and Get a Pandas DataFrame
--------------------------------------

The most powerful function is :func:`chembl_downloader.query`, which builds on the
previous :func:`chembl_downloader.connect` function in combination with
:func:`pandas.read_sql` to make a query and load the results into a pandas DataFrame for
any downstream use.

.. code-block:: python

    import chembl_downloader

    sql = """
    SELECT
        MOLECULE_DICTIONARY.chembl_id,
        MOLECULE_DICTIONARY.pref_name
    FROM MOLECULE_DICTIONARY
    JOIN COMPOUND_STRUCTURES ON MOLECULE_DICTIONARY.molregno == COMPOUND_STRUCTURES.molregno
    WHERE molecule_dictionary.pref_name IS NOT NULL
    LIMIT 5
    """

    df = chembl_downloader.query(sql)
    df.to_csv(..., sep="\t", index=False)

.. note::

    Suggestion 1: use :mod:`pystow` to make a reproducible file path that's portable to
    other people's machines (e.g., it doesn't have your username in the path).

    Suggestion 2: RDKit is now pip-installable with `pip install rdkit-pypi`, which
    means most users don't have to muck around with complicated conda environments and
    configurations. One of the powerful but understated tools in RDKit is the
    [rdkit.Chem.PandasTools](https://rdkit.org/docs/source/rdkit.Chem.PandasTools.html)
    module.

Querying for a Scalar
---------------------

For SQL queries that return a scalar result, you can use
:func:`chembl_downloader.query_scalar`. In the following example, this gets a summary
count over the number of activities in the database.

.. code-block:: python

    import chembl_downloader

    sql = "SELECT COUNT(activity_id) FROM activities"
    count: int = chembl_downloader.query_scalar(sql)