Explore StructomeDB
A guided tour of the database through questions and data. All statistics are drawn from the full 104.5M-pair dataset computed in April 2026.
How large is StructomeDB?
StructomeDB covers 61,631 representative protein chains from the full experimental PDB.
An exhaustive all-vs-all structural comparison yielded 104,528,817 structurally detected pairs,
defined as pairs for which at least one directional Foldseek search returned a hit. Each pair carries a full feature vector
of structural and sequence metrics.
104.5M unique pair records
61,631 representative chains
7,351 species represented
10,359 Pfam families covered
Which domains of life are represented?
The database spans all three cellular domains of life as well as viruses. Eukaryota and Bacteria together account for
87.8% of chains, with Viruses and Archaea contributing the remaining 12.2%.
All 61,631 chains carry UniProt accessions.
Chain composition by domain of life
Annotation coverage across 61,631 chains
How similar are proteins to each other?
Pairs are classified by structural similarity (Foldseek TM-score) and sequence similarity (BLASTP)
at a threshold of 0.5 on each axis, which yields four quadrants: homologs, convergent candidates,
divergent candidates, and unrelated pairs. Only 5.1% of pairs show meaningful structural or sequence
similarity; the remaining 94.9% are unrelated, the expected result for a structome-scale comparison.
Pair distribution across four quadrants (threshold 0.5 / 0.5)
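The quadrant assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not StructomeDB's own code. The function name `classify_pair`, the mapping of quadrant names to the high/low combinations, the use of `>=` at the threshold, and the treatment of a missing BLASTP hit as a sequence similarity of 0.0 are all assumptions.

```python
def classify_pair(tm_score, seq_similarity, tm_threshold=0.5, seq_threshold=0.5):
    """Assign a pair to one of four quadrants.

    Assumptions (not confirmed by StructomeDB): both scores lie in [0, 1],
    a value equal to the threshold counts as "similar", and a pair with no
    BLASTP hit (seq_similarity is None) is treated as sequence similarity 0.0.
    """
    seq = 0.0 if seq_similarity is None else seq_similarity
    structurally_similar = tm_score >= tm_threshold
    sequence_similar = seq >= seq_threshold

    if structurally_similar and sequence_similar:
        return "homolog"
    if structurally_similar:
        return "convergent candidate"   # similar structure, dissimilar sequence
    if sequence_similar:
        return "divergent candidate"    # similar sequence, dissimilar structure
    return "unrelated"


print(classify_pair(0.72, None))   # convergent candidate
print(classify_pair(0.12, None))   # unrelated
```

Keeping the thresholds as parameters mirrors the idea that users can set their own cut-offs rather than relying on the default 0.5 / 0.5 split.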
Where do pairs concentrate in similarity space?
The 10×10 grid maps structural similarity (X) against sequence similarity (Y) in 0.1-interval bins.
The densest cell lies at low structural and low sequence similarity (pairs with TM-score 0.10–0.15
and no BLASTP hit) and contains 38.8M pairs, or 37.1% of the database.
Pair density across the structural × sequence similarity landscape (log₁₀ scale)
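As a rough sketch of how such a density grid could be built, the Python below bins each pair into a 10×10 grid and counts pairs per cell. The helper names `bin_index` and `density_grid` are hypothetical, and placing pairs without a BLASTP hit in the lowest sequence bin is an assumption rather than documented StructomeDB behaviour.

```python
from collections import Counter


def bin_index(value, n_bins=10):
    """Map a similarity in [0, 1] to a bin index in 0..n_bins-1.

    A value of exactly 1.0 is placed in the last bin instead of overflowing.
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"similarity out of range: {value}")
    return min(int(value * n_bins), n_bins - 1)


def density_grid(pairs, n_bins=10):
    """Count pairs per (structural bin, sequence bin) cell.

    `pairs` is an iterable of (tm_score, seq_similarity) tuples; a sequence
    similarity of None (no BLASTP hit) is counted as 0.0 (assumption).
    """
    counts = Counter()
    for tm_score, seq_similarity in pairs:
        seq = 0.0 if seq_similarity is None else seq_similarity
        counts[(bin_index(tm_score, n_bins), bin_index(seq, n_bins))] += 1
    return counts


grid = density_grid([(0.12, None), (0.14, None), (0.55, 0.62), (1.0, 1.0)])
print(grid[(1, 0)])   # 2 pairs in the TM 0.1-0.2, no-hit cell
```

For plotting on a log₁₀ scale, each non-zero count would be passed through `math.log10` before colouring the cell.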
Do structure and sequence alignments cover the same regions?
Among the 1.5M pairs with a BLASTP hit, StructomeDB retains both the Foldseek and BLASTP
alignment windows, which allows direct comparison of the protein regions each tool aligned.
At ±10 residue tolerance, 71.6% of these pairs are discordant: the structural and sequence
alignments cover different parts of the protein. This signal is highest in divergent candidates (94.4%).
Alignment discordance rate by quadrant (±10 residue tolerance)
Overall discordance at different tolerances
BLASTP similarity distribution (detectable pairs only)
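One plausible reading of the tolerance-based discordance test is sketched below in Python. The exact rule StructomeDB applies is not stated here, so the definition used (a pair is discordant when the start or the end of the two alignment windows differs by more than the tolerance) is an assumption, as are the function names and the premise that both windows are given in the same chain's residue numbering.

```python
def is_discordant(fs_start, fs_end, bl_start, bl_end, tolerance=10):
    """Return True when the Foldseek and BLASTP windows disagree.

    Assumed rule: the windows are discordant if their start positions or
    their end positions differ by more than `tolerance` residues.
    """
    return (abs(fs_start - bl_start) > tolerance
            or abs(fs_end - bl_end) > tolerance)


def discordance_rate(windows, tolerance=10):
    """Fraction of (fs_start, fs_end, bl_start, bl_end) tuples that are discordant."""
    windows = list(windows)
    if not windows:
        return 0.0
    discordant = sum(is_discordant(*w, tolerance=tolerance) for w in windows)
    return discordant / len(windows)


example = [
    (1, 100, 5, 95),      # both ends within 10 residues: concordant
    (1, 100, 40, 100),    # start differs by 39 residues: discordant
    (10, 200, 10, 215),   # end differs by 15 residues: discordant at ±10
]
print(discordance_rate(example, tolerance=10))
print(discordance_rate(example, tolerance=20))
```

Evaluating the same windows at several tolerances, as in the last two lines, is how a curve of overall discordance against tolerance could be produced.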
How many pairs cross kingdom boundaries?
38.1% of unrelated pairs and 38.3% of convergent candidates
span different domains of life (labelled cross-kingdom here), making them potential windows into convergent evolution.
Among homologs, cross-kingdom pairs are enriched for Bacteria × Eukaryota comparisons.
Same-domain vs cross-kingdom pairs per quadrant
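The per-quadrant cross-kingdom share could be computed along the following lines. This is a sketch under stated assumptions: each pair is represented as a hypothetical `(quadrant, domain_a, domain_b)` tuple, and a pair counts as cross-kingdom whenever the two domain labels differ.

```python
from collections import defaultdict


def cross_domain_fraction(pairs):
    """Return {quadrant: fraction of pairs whose two chains differ in domain}.

    `pairs` is an iterable of (quadrant, domain_a, domain_b) tuples; this
    record layout is illustrative and not StructomeDB's actual schema.
    """
    totals = defaultdict(int)
    crossing = defaultdict(int)
    for quadrant, domain_a, domain_b in pairs:
        totals[quadrant] += 1
        if domain_a != domain_b:
            crossing[quadrant] += 1
    return {q: crossing[q] / totals[q] for q in totals}


sample = [
    ("unrelated", "Bacteria", "Eukaryota"),
    ("unrelated", "Bacteria", "Bacteria"),
    ("homolog", "Archaea", "Archaea"),
    ("homolog", "Bacteria", "Eukaryota"),
    ("homolog", "Viruses", "Viruses"),
    ("homolog", "Eukaryota", "Eukaryota"),
]
print(cross_domain_fraction(sample))   # {'unrelated': 0.5, 'homolog': 0.25}
```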
How large are the representative chains?
Representative chains range from 76 to 4,629 residues,
with a mean of 299.9 and median of 245 residues.
73% are fully gapless structures.
76 min chain length (residues)
245 median chain length (residues)
4,629 max chain length (residues)
Gapless vs gapped representative chains
What sets StructomeDB apart?
Existing structural search tools require you to submit a query and run a search; StructomeDB is the
precomputed matrix. All 104.5M structurally detected pairs are already calculated,
annotated, and ready to retrieve. No other resource allows you to simultaneously
interrogate both structural and sequence similarity across the entire experimental
PDB, set your own thresholds, and export reproducible labelled datasets for
downstream analysis or machine learning, all without running a single search.
How StructomeDB compares to related resources
Structural search tools
| Tool | What it does | What it doesn't do |
|---|---|---|
| RCSB PDB | Sequence clustering at preset identity thresholds (30–100%) and on-demand pairwise sequence and structure comparison via the Comparison Tool | Clustering is not a pairwise similarity matrix; on-demand comparisons are query-based; no precomputed structome-wide two-axis landscape; not designed for population-level retrieval or ML dataset export |
| DALI | On-demand structural alignment of a query against the PDB | Not precomputed; no sequence axis; no exportable feature vectors |
| PDBeFold | Secondary-structure-based structural search against the PDB | Not precomputed; no sequence axis; no pairwise matrix |
| Foldseek | Ultra-fast structural search using 3Di alphabet encoding | Query-based only; no precomputed all-vs-all matrix; no integrated sequence axis |
| Structome-Q / Structome-TM | Structome-scale structural search and TM-score ranking tools | Search tools, not a precomputed pairwise database; no integrated sequence similarity axis |
Structural organisation resources
| Resource | What it does | What it doesn't do |
|---|---|---|
| SCOP | Manually curated hierarchical classification of protein structural domains | Classification resource, not a pairwise comparison database; no sequence similarity axis |
| CATH | Automated and curated classification of protein domain structures | Classification resource, not a pairwise comparison database; no sequence similarity axis |
| ECOD | Evolutionary classification of protein domains using both structure and sequence | Classification resource, not a pairwise comparison database; no user-defined thresholds |
Prediction resources
| Resource | What it does | What it doesn't do |
|---|---|---|
| AlphaFold DB | Repository of predicted protein structures for UniProt sequences | Structure repository only; no pairwise comparisons; StructomeDB covers experimental structures exclusively |
| ESMAtlas | Large-scale repository of predicted metagenomic protein structures | Structure repository only; predicted structures; no pairwise comparison matrix |
StructomeDB is the only resource providing a precomputed, threshold-adjustable,
two-axis similarity landscape across both structure and sequence for the full
experimental PDB — with exportable feature vectors ready for machine learning,
evolutionary analysis, or structural bioinformatics pipelines.