Prop3D Dataset Organization
CATH-based Dataset
The inherently hierarchical structure of CATH (A) is echoed in the design schema underlying the Prop3D dataset (B), as illustrated above. Prop3D can be accessed through a cloud-based HDF file (HSDS) seeded with the CATH hierarchy for all available superfamilies. For clarity, an example of one such superfamily is the individual H-group 2.60.40.10 (Immunoglobulins) shown here as the orange sector (denoted by an asterisk near 4 o’clock). Each such superfamily is further split into:
The
domaingroups, with datasets provided for each domain (atomic features, residue features, and edge features), as delineated in the upper-half of (B)Pre-calculated data splits (
data_splits), shown in the lower-half of (B), which exist as hard-links (denoted as dashed green lines) to domain groups.Pre-calculated
representativesfor each superfamily, provided by the CATH, and are hard-linked to domain groups.
Each domains is further separated into:
atoms, a dataset for atom-based coordinates and featuresresidues, a dataset for residue-based coordinates and features (useful for coarse-gaining or graph based applications)edges, a dataset for residue-residue interaction graphs with edge features
The ‘sunburst’ style CATH diagram, from http://cathdb.info, is under the Creative Commons Attribution 4.0 International License.
Prop3D-20 Dataset
For the initial release of Prop3D, we include Prop3D-20, a condensed dataset of just 20 superfamilies with high-population density or of interest to our DeepUrfold project. The HSDS H5 is organized as follow:
We organize each superfamily by their CATH hierarchy within the HDF file:
/1
/10
/10
/10
/domains
/data_splits
/representatives
/238
/10
/domains
/data_splits
/representatives
/490
/10
/domains
/data_splits
/representatives
/510
/10
/domains
/data_splits
/representatives
/20
/1260
/10
/domains
/data_splits
/representatives
/2
/30
/30
/100
/domains
/data_splits
/representatives
/40
/50
/140
/domains
/data_splits
/representatives
/60
/40
/10
/domains
/data_splits
/representatives
/3
/10
/20
/30
/domains
/data_splits
/representatives
/30
/230
/10
/domains
/data_splits
/representatives
/300
/20
/domains
/data_splits
/representatives
/310
/60
/domains
/data_splits
/representatives
/1360
/40
/domains
/data_splits
/representatives
/1370
/10
/domains
/data_splits
/representatives
/1380
/10
/domains
/data_splits
/representatives
/40
/50
/300
/domains
/data_splits
/representatives
/720
/domains
/data_splits
/representatives
/80
/10
/10
/domains
/data_splits
/representatives
/90
/79
/10
/domains
/data_splits
/representatives
/420
/10
/domains
/data_splits
/representatives
PDB-based Dataset
Still in progress.
The PDB-based datasets is organized like a single superfamily just domains, data_splits and representatives at the root level. Representatives are taken from the first entry of each mmseqs 30% clusters provided by the PDB.