The nCov-Group Data Repository

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation

This repository provides access to data, models, and code produced by the nCoV Group in support of research aimed at generating leads for potential SARS-CoV-2 drugs. The data include representations and computed descriptors for around 4.2 billion small molecules: some 60 TB of data in all, although many useful subsets are much smaller.

These data will be updated regularly as the collaboration produces new results. Shared data are located on the nCov-Group Data Repository endpoint, from where they can be accessed via Globus. To access the data, users should: 1) log in with existing credentials and 2) access the Globus endpoint.


Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

IMPECCABLE: integrated modeling pipelinE for COVID cure by assessing better LEads

High-Throughput Virtual Screening and Validation of a SARS-CoV-2 Main Protease Noncovalent Inhibitor

Protein-Ligand Docking Surrogate Models: A SARS-CoV-2 Benchmark for Deep Learning Accelerated Virtual Screening


Data Processing Pipeline

The data processing pipeline is used to compute different types of features and representations of billions of small molecules. The pipeline is first used to convert the SMILES representation for each molecule to a canonical SMILES to allow for de-duplication and consistency across data sources. Next, for each molecule, three different types of features are computed: 1) molecular fingerprints that encode the structure of molecules; 2) 2D and 3D molecular descriptors; and 3) 2D images of the molecular structure. These features are being used as input to various machine learning and deep learning models that will be used to predict important characteristics of candidate molecules including docking scores, toxicity, and more.

Figure 1: The computational pipeline that is used to enrich the data collected from included datasets. After collection, each molecule in each dataset has canonical SMILES, 2D and 3D molecular features, fingerprints, and images computed. These enrichments simplify molecule disambiguation, ML-guided compound screening, similarity searching, and neural network training, respectively.

Dataset Catalog

We have obtained molecule definitions from the following source datasets. For each, we provide a link to the original source, the number of molecules included from the dataset, and the percentage of those molecules that are not found in any other listed dataset.

Key    Description and link                                   # Molecules    % Unique
BDB    The Binding Database                                   1,813,538      20.4
CAS    CAS COVID-19 Antiviral Candidate Compounds             49,437         55.5
CHM    CheMBL db of bioactive mols with drug-like properties  1,940,732
DCL    DrugCentral Online Drug Compendium                     3,981          2.4
DUD    DUDE database of useful decoys                         99,782         99.7
E15    15.5M-molecule subset of ENA                           15,547,091     99.7
EDB    DrugBank plus Enamine Hit Locator Library 2018         310,782        61.2
ENA    Enamine REAL Database                                  1,211,723,723  85.9
FFI    CureFFI FDA-approved drugs and CNS drugs               1,497          12.2
G13    GDB-13 small organic molecules up to 13 atoms          977,468,301    99.5
G17    GDB-17-Set up to 17 atom extension of GDB-13           50,000,000     100.0
HOP    Harvard Organic Photovoltaic Dataset                   350            83.7
LIT    COVID-relevant small mols extracted from literature    803
MOS    Molecular Sets (MOSES)                                 1,936,962      81.3
MCU    MCULE compound database                                45,472,755
QM9    QM9 subset of GDB-17                                   133,885        84.0
REP    Repurposing-related drug/tool compounds                13,553         0.0
SAV    Synthetically Accessible Virtual Inventory (SAVI)      283,194,309    99.8
SUR    SureChEMBL dataset of molecules from patents           17,915,384     9.8
Total                                                         4,206,934,042


  • The key for each dataset may be used in filenames in place of the full name in downloads elsewhere.
  • The numbers above may be less than what can be found at the source, due to conversion failures and/or version differences.
  • These numbers do not account for de-duplication, within or between datasets.
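The "% Unique" column described above can be illustrated with a small sketch. It assumes each dataset has already been reduced to a set of canonical SMILES strings; the function name, dataset keys, and molecules below are hypothetical and are not taken from the repository. It is one plausible way to compute such a figure, not necessarily the method used to produce the table.

```python
from typing import Dict, Set


def percent_unique(smiles_by_dataset: Dict[str, Set[str]]) -> Dict[str, float]:
    """For each dataset, the percentage of its molecules found in no other dataset."""
    result = {}
    for key, smiles in smiles_by_dataset.items():
        # Union of canonical SMILES from every *other* dataset.
        others = set()
        for other_key, other_smiles in smiles_by_dataset.items():
            if other_key != key:
                others |= other_smiles
        unique = smiles - others
        result[key] = 100.0 * len(unique) / len(smiles) if smiles else 0.0
    return result


# Toy data, invented for illustration only.
toy = {
    "AAA": {"CCO", "CCN", "c1ccccc1", "CC(=O)O"},
    "BBB": {"CCO", "CCC"},
}
pct = percent_unique(toy)
print(pct)
```

In-memory sets like these would not scale to billions of molecules; at that size the same comparison would more likely be done with sorted files or a database.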

Dataset Downloads

Follow the links below to access canonical SMILES, molecular fingerprints, descriptors, and images (png format) for each dataset. The links in the final row provide access to all SMILES, fingerprints, descriptors, and images, respectively.

Methodology and Data Processing Pipeline

Canonical Molecular Structures

We use Open Babel v3.0 to convert the simplified molecular-input line-entry system (SMILES) specifications of chemical species obtained from various sources into a consistent canonical SMILES representation. We organize the resulting molecule specifications in one directory per source dataset, each containing one CSV file with columns [SOURCE-KEY, IDENTIFIER, SMILES], where SOURCE-KEY identifies the source dataset; IDENTIFIER is an identifier either obtained from the source dataset or, if none such is available, defined internally; and SMILES is a canonical SMILES as produced by Open Babel. Identifiers are unique within a dataset, but may not be unique across datasets. Thus, the combination of (SOURCE-KEY, IDENTIFIER) is needed to identify molecules uniquely. We obtain the canonical SMILES by using the following Open Babel command:

obabel {inputfilename} -O{outputfilename} -ocan -e
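Because identifiers may repeat across datasets, code that merges several of these CSV files should key molecules on the (SOURCE-KEY, IDENTIFIER) pair. The sketch below illustrates this with two in-memory stand-ins for per-dataset files. The dataset keys and rows are invented, and the absence of a header row is an assumption about the file layout.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for two per-dataset CSV files (rows invented for illustration).
# Assumed layout: no header row; columns SOURCE-KEY, IDENTIFIER, SMILES.
FILE_A = "AAA,1,CCO\nAAA,2,CCN\n"
FILE_B = "BBB,1,CCO\nBBB,2,CCC\n"


def read_molecules(text):
    """Yield ((source_key, identifier), smiles) for each CSV row."""
    for source_key, identifier, smiles in csv.reader(io.StringIO(text)):
        yield (source_key, identifier), smiles


molecules = {}                  # (SOURCE-KEY, IDENTIFIER) -> canonical SMILES
by_smiles = defaultdict(list)   # canonical SMILES -> every key that shares it
for text in (FILE_A, FILE_B):
    for key, smiles in read_molecules(text):
        molecules[key] = smiles
        by_smiles[smiles].append(key)

# Identifier "1" occurs in both datasets, yet the composite keys stay distinct.
print(len(molecules))
# Grouping on canonical SMILES exposes the cross-dataset duplicate.
print(by_smiles["CCO"])
```

Keying on IDENTIFIER alone would silently overwrite the first dataset's entry for identifier "1" with the second's.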

Molecular Fingerprints

We use RDKit (version 2019.09.3) to compute a 2048-bit fingerprint for each molecule. We organize these fingerprints in CSV files in which each row has columns [SOURCE-KEY, IDENTIFIER, SMILES, FINGERPRINT], where SOURCE-KEY, IDENTIFIER, and SMILES are as defined in Table 2, and FINGERPRINT is a Base64-encoded representation of the fingerprint. In Figure 2, we show an example of how to load the fingerprint data from a batch file within an individual dataset using Python 3. Further examples of how to use fingerprints are available in the accompanying GitHub repository.

Figure 2: A simple Python code example showing how to load data from a fingerprint file. (This and other examples are accessible in the accompanying GitHub repository.)
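As a rough illustration of the fingerprint file layout described above, the following sketch reads rows and undoes the Base64 encoding using only the Python standard library. It is not the code shown in Figure 2. The function name and the toy row are hypothetical, the absence of a header row is an assumption, and the placeholder payload is not a real fingerprint.

```python
import base64
import csv
import io


def read_fingerprints(handle):
    """Yield (source_key, identifier, smiles, fingerprint_bytes) per CSV row."""
    for source_key, identifier, smiles, encoded in csv.reader(handle):
        # FINGERPRINT is Base64 text; decode it back to the raw bytes.
        yield source_key, identifier, smiles, base64.b64decode(encoded)


# Build one toy row so the sketch runs without repository data.
# The 256-byte payload is a placeholder, not a real fingerprint.
placeholder = bytes(256)
toy_file = io.StringIO(
    "AAA,1,CCO," + base64.b64encode(placeholder).decode("ascii") + "\n"
)

rows = list(read_fingerprints(toy_file))
source_key, identifier, smiles, fp_bytes = rows[0]
print(source_key, identifier, smiles, len(fp_bytes))
```

How the decoded bytes map onto the 2048 bits depends on the serialization used when the files were written, which is not specified here. The decoded payload would need to be handed back to RDKit, or whichever library matches that format, before computing similarities.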

Molecular Descriptors

We generate molecular descriptors using Mordred (version 1.2.0). The collected descriptors (∼1800 for each molecule) include descriptors for both 2D and 3D molecular features. We organize these descriptors in one directory per source dataset, each containing one or more CSV files. Each row in the CSV file has columns [SOURCE-KEY, IDENTIFIER, SMILES, DESCRIPTOR_1… DESCRIPTOR_N]. In Figure 3, we show how to load the data for an individual dataset (e.g., FFI) using Python 3 and explore its shape (Figure 3-left), and create a t-SNE embedding to explore the molecular descriptor space (Figure 3-right).

Figure 3: Molecular descriptor examples: (left) load descriptor data and (right) create a simple t-SNE projection of the FFI dataset.
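The descriptor layout above (three metadata columns followed by N descriptor columns) can be parsed into a numeric matrix along the following lines. This is a standard-library sketch, not the code from Figure 3. The toy rows, the three invented descriptor columns, the absence of a header row, and the choice to map blank or non-numeric cells to NaN are all assumptions.

```python
import csv
import io
import math

META_COLUMNS = 3  # SOURCE-KEY, IDENTIFIER, SMILES precede the descriptors


def to_float(cell):
    """Convert a CSV cell to float, mapping blanks and non-numeric text to NaN."""
    try:
        return float(cell)
    except ValueError:
        return math.nan


def load_descriptors(handle):
    """Return (keys, matrix): composite keys and one row of floats per molecule."""
    keys, matrix = [], []
    for row in csv.reader(handle):
        keys.append((row[0], row[1]))
        matrix.append([to_float(cell) for cell in row[META_COLUMNS:]])
    return keys, matrix


# Toy file with three invented descriptor columns (real files have ~1800).
toy_file = io.StringIO(
    "AAA,1,CCO,46.07,0.5,3\n"
    "AAA,2,CCN,45.08,,2\n"
)
keys, matrix = load_descriptors(toy_file)
print(len(matrix), len(matrix[0]))  # molecules x descriptors
```

Rows or columns containing NaN would need to be dropped or imputed before passing the matrix to an embedding method such as t-SNE.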

Molecular Images

Images for each molecule were generated using a custom script [44] to read the canonical SMILES structure with RDKit, kekulize the structure, handle conformers, draw the molecule with rdkit.Chem.Draw, and save the file as a PNG-format image with size 128×128 pixels. For each dataset, individual pickle files are saved containing batches of 10 000 images for ease of use, with entries in the format (SOURCE, IDENTIFIER, SMILES, image in PIL format). In Figure 4, we show an example of loading and displaying image data from a batch of files from the FFI dataset.

Figure 4: Molecular image examples. The examples show how to (top) load the data and (bottom) display a subset of the images using matplotlib.
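A batch of images stored as described above might be read as in the following sketch, which is not the code from Figure 4. It assumes each pickle file holds a list of (SOURCE, IDENTIFIER, SMILES, image) tuples; the function name and toy batch are hypothetical, and a bytes placeholder stands in for each PIL image so that the example runs without Pillow.

```python
import io
import pickle


def load_image_batch(handle):
    """Load one pickled batch and return its entries as a list of 4-tuples."""
    batch = pickle.load(handle)
    for entry in batch:
        if len(entry) != 4:
            raise ValueError("expected (SOURCE, IDENTIFIER, SMILES, image)")
    return list(batch)


# Toy batch: a bytes placeholder stands in for each PIL image object.
toy_batch = [("AAA", "1", "CCO", b"<image>"), ("AAA", "2", "CCN", b"<image>")]
handle = io.BytesIO(pickle.dumps(toy_batch))

entries = load_image_batch(handle)
smiles_list = [smiles for _source, _identifier, smiles, _image in entries]
print(len(entries), smiles_list)
```

Unpickling executes code embedded in the file, so batches should only be loaded from a trusted source. With real data, Pillow must be installed for the image objects to deserialize, and each image could then be passed to a plotting call such as matplotlib's imshow.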

Data Access

Providing access to such a large quantity of heterogeneous data (currently ~60 TB) is challenging. We use Globus to handle authentication and authorization, and to enable high-speed, reliable access to the data. Access to this data is available to anyone following authentication via institutional credentials, an ORCID profile, a Google account, or many other common identities. Users can access the data through a web user interface shown in Fig. 5, facilitating easy browsing, direct download via HTTPS of smaller files, and high-speed, reliable transfer of larger data files to their laptop or a computing cluster via Globus Connect Personal or an instance of Globus Connect Server. There are more than 20 000 active Globus endpoints distributed around the world. Users may also access the data with a full-featured Python SDK. More details on Globus can be found at

Figure 5: Data access with Globus. All data are stored on Globus endpoints, allowing users to access, move, and share the data through a web interface (pictured above), a REST API, or with a Python client. The user here has just transferred the first three files of descriptors associated with the E15 dataset to an endpoint at UChicago.


Code to help users understand the methodology and use the data is included in the Globus Labs Covid Analyses GitHub repository.

Data Extraction from Literature

The data extraction team is working to extract a set of known antiviral molecules that have been previously tested against coronaviruses. This set of molecules will inform future efforts to screen candidates using simulated docking and more. There are two efforts currently underway: a manual extraction effort, and an effort to build a named-entity recognition model that aims to automatically extract such molecules from a much larger literature corpus.

Named-Entity Recognition Models for Identification of Antivirals

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of coronavirus research. We report here on a project that leverages both human and artificial intelligence to detect references to drug-like molecules in free text. We engage non-expert humans to create a corpus of labeled text, use this labeled corpus to train a named entity recognition model, and employ the trained model to extract 10912 drug-like molecules from the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198875 papers. Performance analyses show that our automated extraction model can achieve performance on par with that of non-expert humans.

AI- and HPC-enabled Lead Generation for SARS-CoV-2: Models and Processes to Extract Druglike Molecules Contained in Natural Language Text

Manual Extraction of Antivirals from Literature

Babuji, Y., Blaiszik, B., Chard, K., Chard, R., Foster, I., Gordon, I., Hong, Z., Karbarz, K., Li, Z., Novak, L., Sarvey, S., Schwarting, M., Smagacz, J., Ward, L., & Orozco White, M. (2020). Lit - A Collection of Literature Extracted Small Molecules to Speed Identification of COVID-19 Therapeutics [Dataset]. Materials Data Facility.

“Lit” Dataset


Research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act.

Data storage and computational support for this research project has been generously supported by the following resources. The data generated have been prepared as part of the nCov-Group Collaboration, a group of over 200 researchers working to use computational techniques to address various challenges associated with COVID-19.

Petrel Data Service at the Argonne Leadership Computing Facility (ALCF)

This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.


Theta at the Argonne Leadership Computing Facility (ALCF)

This research used resources of the Argonne Leadership Computing Facility, a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.


Frontera at the Texas Advanced Computing Center (TACC)


Comet at the San Diego Supercomputer Center (SDSC)


Data and Computing Infrastructure

Many aspects of the data and computing infrastructure have been leveraged from other projects including but not limited to:

Data processing and computation:

Data Tools, Services, and Expertise:


For All Information

Unless otherwise indicated, this information has been authored by an employee or employees of the UChicago Argonne, LLC., operator of the Argonne National Laboratory with the U.S. Department of Energy. The U.S. Government has rights to use, reproduce, and distribute this information. The public may copy and use this information without charge, provided that this Notice and any statement of authorship are reproduced on all copies.

While every effort has been made to produce valid data, by using this data, User acknowledges that neither the Government nor UChicago Argonne LLC. makes any warranty, express or implied, of either the accuracy or completeness of this information or assumes any liability or responsibility for the use of this information. Additionally, this information is provided solely for research purposes and is not provided for purposes of offering medical advice. Accordingly, the U.S. Government and UChicago Argonne LLC. are not to be liable to any user for any loss or damage, whether in contract, tort (including negligence), breach of statutory duty, or otherwise, even if foreseeable, arising under or in connection with use of or reliance on the content displayed on this site.