
The nCov-Group Data Repository

AI- and HPC-Enabled Lead Generation for SARS-CoV-2 Drugs

This repository provides access to data, models, and code produced by the nCov-Group in support of research aimed at generating leads for potential SARS-CoV-2 drugs. The data include representations and computed descriptors for around 4.2 billion small molecules: some 60 TB of data in all, although many useful subsets are much smaller.

These data will be updated regularly as the collaboration produces new results. Shared data are hosted on the nCov-Group Data Repository Globus endpoint, from which they can be accessed via Globus. To access the data, users should: 1) log in with existing credentials and 2) access the Globus endpoint.


Note:

A manuscript describing these data and the associated methodology and processing pipelines is under preparation. A link will be posted here when it is available.

Data Processing Pipeline

The data processing pipeline is used to compute different types of features and representations of billions of small molecules. The pipeline first converts the SMILES representation for each molecule to a canonical SMILES form. (De-duplication is in progress.) It then creates, for each molecule, three different types of features: 1) molecular fingerprints that encode the structure of molecules; 2) molecular descriptors (using Mordred); and 3) 2D images of the molecular structure. These features are being used as input to various machine learning and deep learning models that predict important characteristics including docking scores, toxicity, and more.

Figure: Data processing pipeline.
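As a rough, single-molecule illustration of the three feature types, consider the sketch below. It is a sketch only: the production pipeline canonicalizes SMILES with Open Babel and runs in parallel over billions of molecules, and the Morgan fingerprint shown is an assumption (see Methodology and Data Formats below).

from rdkit import Chem
from rdkit.Chem import AllChem, Draw
from mordred import Calculator, descriptors

# Example molecule (a SMILES string taken from the image data shown below).
mol = Chem.MolFromSmiles('CC(=O)OC(CC(=O)O)C[N+](C)(C)C')

# 1) Structural fingerprint (2048 bits; Morgan radius 2 is assumed here).
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# 2) Mordred molecular descriptors (2D subset here; the full set is ~1800 2D+3D values).
desc = Calculator(descriptors, ignore_3D=True)(mol)

# 3) 128x128 2D structure image (a PIL image).
img = Draw.MolToImage(mol, size=(128, 128))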


Dataset Catalog

We have obtained molecule definitions from the following source datasets. For each, we provide a link to the original source, the number of molecules included from the dataset, and the percentage of those molecules that are not found in any other listed dataset.

Key  Description and link                                   # Molecules  % Uniq
BDB  The Binding Database                                     1,813,538    20.4
CAS  CAS COVID-19 Antiviral Candidate Compounds                  49,437    55.5
DBK  DrugBank                                                      9,678    76.1
DCL  DrugCentral Online Drug Compendium                            3,981     2.4
DUD  DUDE database of useful decoys                               99,782    99.7
E15  15.5M-molecule subset of ENA                             15,547,091    99.7
EDB  DrugBank plus Enamine Hit Locator Library 2018              310,782    61.2
EMO  eMolecules                                               25,946,988    93.9
ENA  Enamine REAL Database                                 1,211,723,723    85.9
FFI  CureFFI FDA-approved drugs and CNS drugs                      1,497    12.2
G13  GDB-13 small organic molecules up to 13 atoms            977,468,301    99.5
G17  GDB-17 set, up-to-17-atom extension of GDB-13             50,000,000   100.0
HOP  Harvard Organic Photovoltaic Dataset                            350    83.7
L1K  L1000                                                        10,141     0.0
MOS  Molecular Sets (MOSES)                                    1,936,962    81.3
PCH  PubChem                                                  97,545,266    48.5
QM9  QM9 subset of GDB-17                                        133,885    84.0
REP  Repurposing-related drug/tool compounds                      10,141     0.0
SAV  Synthetically Accessible Virtual Inventory (SAVI)        265,047,097    99.8
SUR  SureChEMBL dataset of molecules from patents              17,915,384     9.8
ZIN  ZINC15                                                 1,225,804,829    85.1
     Total                                                  3,891,378,853

Notes:

  • The key for each dataset may be used in filenames, in place of the full dataset name, in the downloads below.
  • The numbers above may be smaller than those reported at the source, due to conversion failures and/or version differences.
  • These numbers do not account for de-duplication, within or between datasets.

Dataset Downloads

Follow the links below to access canonical SMILES, molecular fingerprints, descriptors, and images (PNG format) for each dataset. The links in the final row provide access to all SMILES, fingerprints, descriptors, and images, respectively.

Methodology and Data Formats

Converting to SMILES Canonical Form

We use Open Babel to convert the simplified molecular-input line-entry system (SMILES) specifications of chemical species obtained from various sources into a consistent canonical SMILES representation:

obabel {filename} -O can_{filename} -ocan -e
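(Here -O names the output file, -ocan selects canonical SMILES output, and -e tells Open Babel to continue with the next molecule after a conversion error.)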

We organize the resulting molecule specifications in one directory per source dataset, each containing one CSV file with format <SOURCE-KEY, IDENTIFIER, SMILES>, where:

  • SOURCE-KEY, as defined above, identifies the source dataset
  • IDENTIFIER is an identifier, either obtained from the source dataset or (if none is available) defined by us. Identifiers are unique within a dataset, but may not be unique across datasets. Thus it is recommended to use the combination SOURCE-KEY+IDENTIFIER to identify molecules, as in the sketch below.
  • SMILES is a canonical SMILES string as produced by Open Babel
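For example, such a file can be read and indexed by the recommended compound key as follows (the file name is hypothetical, and the files are assumed to have no header row):

import csv

# Read a <SOURCE-KEY, IDENTIFIER, SMILES> file and index each molecule by the
# (SOURCE-KEY, IDENTIFIER) combination, which is unique across datasets.
molecules = {}
with open('can_pubchem.csv', newline='') as f:   # hypothetical file name
    for source_key, identifier, smiles in csv.reader(f):
        molecules[(source_key, identifier)] = smiles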

Computing Fingerprints

We use RDKit to compute a 2048-bit fingerprint for each molecule.

We organize these fingerprints in CSV files with format <SOURCE-KEY, IDENTIFIER, SMILES, FINGERPRINT>, where SOURCE-KEY, IDENTIFIER, and SMILES are as above, and FINGERPRINT is a Base64-encoded representation of the fingerprint.
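The fingerprint type is not further specified here; the sketch below assumes RDKit's 2048-bit Morgan fingerprint and shows one way the Base64 encoding could be produced and decoded:

import base64
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles('CC(CN)O')   # example SMILES from the image data below
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

# Encode the 2048-bit vector as Base64 text for storage in the CSV ...
encoded = base64.b64encode(DataStructs.BitVectToBinaryText(fp)).decode('ascii')

# ... and decode it back into an RDKit bit vector.
decoded = DataStructs.CreateFromBinaryText(base64.b64decode(encoded))
assert list(decoded.GetOnBits()) == list(fp.GetOnBits())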

Calculating Descriptors

We generate molecular descriptors using Mordred. The collected descriptors (~1800 for each molecule) include both 2D and 3D descriptors.
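A minimal sketch of descriptor calculation with Mordred follows; embedding a 3D conformer so that the 3D descriptors can be computed is an assumption about the pipeline.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from mordred import Calculator, descriptors

calc = Calculator(descriptors)     # registers both 2D and 3D descriptors

mol = Chem.AddHs(Chem.MolFromSmiles('CC(CN)O'))
AllChem.EmbedMolecule(mol)         # generate a 3D conformer for the 3D descriptors

result = calc(mol)                 # ~1800 descriptor values
values = np.asarray(result.fill_missing(), dtype=float)   # failed descriptors become NaN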

We organize these descriptors in one directory per source dataset, each containing one or more PKL files, each organized internally as a Python dictionary with entries in the form:

{
  SMILES: (
      [IDENTIFIER],
      NumPy array [descriptor1, descriptor2, ..., descriptorN]
  ),
  ...
}

where SMILES and IDENTIFIER are as described above, and the floating point array is the descriptor.
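For example, one such file can be loaded and queried by SMILES like this (the file path is hypothetical):

import pickle

# Load one descriptor file and inspect an arbitrary entry.
with open('/data/pubchem/descriptors/pubchem-0-100.pkl', 'rb') as f:   # hypothetical path
    table = pickle.load(f)

smiles, (identifiers, values) = next(iter(table.items()))
print(smiles, identifiers, values.shape)   # values is a ~1800-element NumPy array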

Creating 2D Molecule Images

We use RDKit to create a 128x128 image of each molecule.
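A minimal sketch of the image generation, assuming RDKit's standard drawing utilities:

from rdkit import Chem
from rdkit.Chem import Draw

mol = Chem.MolFromSmiles('CC(CN)O')
img = Draw.MolToImage(mol, size=(128, 128))   # a 128x128 PIL image
img.save('molecule.png')                      # hypothetical output file name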

We organize these data as PNG-format images in pickle files that can be read in Python as follows:

import pickle
p = pickle.load(open('/data/pubchem/images/pubchem-0-100.pkl', 'rb'))

print(p[:5])

[('PC', '', 'CC(=O)OC(CC(=O)[O-])C[N+](C)(C)C', <PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x7F7E1D5259D0>), 
('PC', '', 'CC(=O)OC(CC(=O)O)C[N+](C)(C)C', <PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x7F7E1D4B8810>), 
('PC', '', 'C1=CC(C(C(=C1)C(=O)O)O)O', <PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x7F7E1D058E90>), 
('PC', '', 'CC(CN)O', <PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x7F7E1D058F90>), 
('PC', '', 'C(C(=O)COP(=O)(O)O)N', <PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x7F7E1D05F090>)]

Code

Code to help users understand the methodology and use the data is included in the Globus Labs Covid Analyses GitHub repository.

Data Extraction from Literature

The data extraction team is working to extract a set of known antiviral molecules that have previously been tested against coronaviruses. This set of molecules will inform future efforts to screen candidates using simulated docking and more. Two efforts are currently underway: a manual extraction effort, and an effort to build a named-entity recognition model that can extract such molecules automatically from a much larger literature corpus.

Manual Extraction of Antivirals from Literature

Coming Soon

Named-Entity Recognition Models for Identification of Antivirals

Coming Soon

Acknowledgements

Research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus Aid, Relief, and Economic Security (CARES) Act.

Data storage and computational resources for this research project have been generously provided by the facilities listed below. The data generated have been prepared as part of the nCov-Group Collaboration, a group of over 200 researchers working to use computational techniques to address various challenges associated with COVID-19.

Petrel Data Service at the Argonne Leadership Computing Facility (ALCF)

This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.


Theta at the Argonne Leadership Computing Facility (ALCF)

This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.


Frontera at the Texas Advanced Computing Center (TACC)


Comet at the San Diego Supercomputing Center (SDSC)


Data and Computing Infrastructure

Many aspects of the data and computing infrastructure have been leveraged from other projects including but not limited to:

Data processing and computation:

Data Tools, Services, and Expertise:

Disclaimer

For All Information

Unless otherwise indicated, this information has been authored by an employee or employees of UChicago Argonne, LLC, operator of Argonne National Laboratory under contract with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies.

While every effort has been made to produce valid data, by using these data the User acknowledges that neither the Government nor UChicago Argonne, LLC makes any warranty, express or implied, of either the accuracy or completeness of this information, or assumes any liability or responsibility for its use. Additionally, this information is provided solely for research purposes and is not provided for purposes of offering medical advice. Accordingly, the U.S. Government and UChicago Argonne, LLC shall not be liable to any user for any loss or damage, whether in contract, tort (including negligence), breach of statutory duty, or otherwise, even if foreseeable, arising under or in connection with use of or reliance on the content displayed on this site.