How large is StructomeDB?
StructomeDB covers 61,631 representative protein chains from the full experimental PDB. An exhaustive all-vs-all structural comparison returned 104,528,817 structurally detected pairs, meaning pairs for which at least one of the two directional Foldseek searches returned a hit. Each pair carries a full feature vector of structural and sequence metrics.
104.5M unique pair records
61,631 representative chains
7,351 species represented
10,359 Pfam families covered
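For a sense of scale, the detected pairs can be set against the number of chain pairs that could exist at all. The sketch below assumes each unordered pair is counted once and self-pairs are excluded; that counting convention and the variable names are illustrative assumptions, not something StructomeDB states.

```python
from math import comb

N_CHAINS = 61_631              # representative chains (from the text above)
DETECTED_PAIRS = 104_528_817   # structurally detected pairs (from the text above)

# Assumption: unordered pairs without self-pairs, i.e. "n choose 2".
possible_pairs = comb(N_CHAINS, 2)
detected_fraction = DETECTED_PAIRS / possible_pairs

print(f"possible pairs: {possible_pairs:,}")
print(f"detected share: {detected_fraction:.1%}")
```

Under that assumption, roughly 5.5% of all possible chain pairs produced a Foldseek hit in at least one direction.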

Which domains of life are represented?
The database spans all three domains of life, plus viruses. Eukaryota and Bacteria together account for 87.8% of chains, with Viruses and Archaea contributing the remainder. All 61,631 chains carry UniProt accessions.
Chain composition by domain of life
Annotation coverage across 61,631 chains

How similar are proteins to each other?
Pairs are classified into four quadrants by structural similarity (Foldseek TM-score) and sequence similarity (BLASTP), using a threshold of 0.5 on each axis. The vast majority of pairs are unrelated, as expected for a structome-scale comparison. Only 5.1% show meaningful structural or sequence similarity.
Pair distribution across four quadrants (threshold 0.5 / 0.5)
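The quadrant assignment can be sketched as a small function. The quadrant names are the ones used on this page, but their mapping to the two axes is an assumption here (convergent candidate = similar structure with dissimilar sequence, divergent candidate = the reverse). The 0-to-1 sequence scale, the treatment of a missing BLASTP hit as 0.0, and the inclusive cutoff are assumptions too.

```python
def classify_pair(tm_score, seq_similarity, struct_cut=0.5, seq_cut=0.5):
    """Assign a pair to one of four quadrants.

    Assumptions: both scores lie in [0, 1], a pair without a BLASTP hit has
    seq_similarity 0.0, and a score equal to the cutoff counts as similar.
    """
    similar_structure = tm_score >= struct_cut
    similar_sequence = seq_similarity >= seq_cut
    if similar_structure and similar_sequence:
        return "homolog"
    if similar_structure:
        return "convergent candidate"   # assumed: structure only
    if similar_sequence:
        return "divergent candidate"    # assumed: sequence only
    return "unrelated"


print(classify_pair(0.82, 0.10))
print(classify_pair(0.12, 0.0))
```

Making the cutoffs parameters mirrors the threshold-adjustable design: the same pair can move between quadrants when a stricter or looser cutoff is chosen.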

Where do pairs concentrate in similarity space?
The 10×10 grid maps structural similarity (X) against sequence similarity (Y) in 0.1-interval bins. The densest cell lies at low structural and low sequence similarity (pairs with TM-score 0.10–0.15 and no BLASTP hit) and contains 38.8M pairs (37.1% of the database).
Pair density across the structural × sequence similarity landscape (log₁₀ scale)
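A minimal sketch of how pairs could be binned into such a grid follows. It assumes both scores lie in [0, 1], that a pair with no BLASTP hit is given a sequence similarity of 0.0, and that a score of exactly 1.0 belongs to the last bin; the sample pairs are invented for illustration.

```python
from collections import Counter


def grid_cell(tm_score, seq_similarity, n_bins=10):
    """Return (x_bin, y_bin) for scores in [0, 1]; 1.0 falls in the last bin."""
    def to_bin(value):
        return min(int(value * n_bins), n_bins - 1)
    return to_bin(tm_score), to_bin(seq_similarity)


# Hypothetical pairs: (TM-score, sequence similarity), 0.0 = no BLASTP hit.
pairs = [(0.12, 0.0), (0.14, 0.0), (0.62, 0.35), (0.91, 0.88), (1.0, 1.0)]

density = Counter(grid_cell(tm, seq) for tm, seq in pairs)
densest_cell, count = density.most_common(1)[0]
print(densest_cell, count)
```

Taking the base-10 logarithm of each cell count would give values comparable to the log-scaled colouring named in the caption above.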

Do structure and sequence alignments cover the same regions?
Among the 1.5M pairs with a BLASTP hit, StructomeDB retains both the Foldseek and BLASTP alignment windows, which allows direct comparison of the protein regions each tool detected. At a tolerance of ±10 residues, 71.6% of these pairs are discordant: the structural and sequence alignments cover different parts of the protein. Discordance is highest among divergent candidates (94.4%).
Alignment discordance rate by quadrant (±10 residue tolerance)
Overall discordance at different tolerances
BLASTP similarity distribution (detectable pairs only)
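One way to picture the discordance test is as a comparison of window boundaries. The page does not spell out the exact rule, so the sketch below assumes that each alignment window is a (start, end) pair of residue numbers on the same chain and that two windows agree when both boundaries differ by no more than the tolerance. The function names and sample windows are hypothetical.

```python
def is_discordant(foldseek_window, blastp_window, tolerance=10):
    """True if the two alignment windows on one chain disagree.

    Assumption: a window is (start, end) in residue numbers, and the windows
    agree when both boundaries differ by at most `tolerance` residues.
    """
    fs_start, fs_end = foldseek_window
    bl_start, bl_end = blastp_window
    starts_agree = abs(fs_start - bl_start) <= tolerance
    ends_agree = abs(fs_end - bl_end) <= tolerance
    return not (starts_agree and ends_agree)


def discordance_rate(window_pairs, tolerance=10):
    """Fraction of (Foldseek, BLASTP) window pairs that are discordant."""
    flags = [is_discordant(fs, bl, tolerance) for fs, bl in window_pairs]
    return sum(flags) / len(flags)


# Hypothetical windows for three pairs.
windows = [((5, 210), (12, 205)), ((5, 210), (90, 205)), ((1, 150), (40, 260))]
print(discordance_rate(windows, tolerance=10))
print(discordance_rate(windows, tolerance=100))
```

Widening the tolerance can only lower the rate under this rule, which is the kind of comparison the chart of discordance at different tolerances summarises.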

How many pairs cross kingdom boundaries?
38.1% of unrelated pairs and 38.3% of convergent candidates span different domains of life (referred to here as cross-kingdom pairs), making them potential windows into convergent evolution. Among homologs, cross-kingdom pairs are enriched for Bacteria × Eukaryota comparisons.
Same-domain vs cross-kingdom pairs per quadrant

How large are the representative chains?
Representative chains range from 76 to 4,629 residues, with a mean of 299.9 and a median of 245 residues. Of these chains, 73% are fully gapless structures.
76 min chain length (residues)
245 median chain length
4,629 max chain length (residues)
Gapless vs gapped representative chains

What sets StructomeDB apart?
Existing tools require you to submit a query and run a search, whereas StructomeDB is the precomputed matrix. All 104.5M structurally detected pairs are already calculated, annotated, and ready to retrieve. No other resource lets you interrogate structural and sequence similarity simultaneously across the entire experimental PDB, set your own thresholds, and export reproducible labelled datasets for downstream analysis or machine learning, all without running a single search.
How StructomeDB compares to related resources
Structural search tools
Tool | What it does | What it doesn't do
RCSB PDB | Sequence clustering at preset identity thresholds (30–100%) and on-demand pairwise sequence and structure comparison via the Comparison Tool | Clustering is not a pairwise similarity matrix; on-demand comparisons are query-based; no precomputed structome-wide two-axis landscape; not designed for population-level retrieval or ML dataset export
DALI | On-demand structural alignment of a query against the PDB | Not precomputed; no sequence axis; no exportable feature vectors
PDBeFold | Secondary-structure-based structural search against the PDB | Not precomputed; no sequence axis; no pairwise matrix
Foldseek | Ultra-fast structural search using 3Di alphabet encoding | Query-based only; no precomputed all-vs-all matrix; no integrated sequence axis
Structome-Q / Structome-TM | Structome-scale structural search and TM-score ranking tools | Search tools, not a precomputed pairwise database; no integrated sequence similarity axis
Structural organisation resources
Resource | What it does | What it doesn't do
SCOP | Manually curated hierarchical classification of protein structural domains | Classification resource, not a pairwise comparison database; no sequence similarity axis
CATH | Automated and curated classification of protein domain structures | Classification resource, not a pairwise comparison database; no sequence similarity axis
ECOD | Evolutionary classification of protein domains using both structure and sequence | Classification resource, not a pairwise comparison database; no user-defined thresholds
Prediction resources
Resource | What it does | What it doesn't do
AlphaFold DB | Repository of predicted protein structures for UniProt sequences | Structure repository only; no pairwise comparisons; StructomeDB covers experimental structures exclusively
ESMAtlas | Large-scale repository of predicted metagenomic protein structures | Structure repository only; predicted structures; no pairwise comparison matrix
StructomeDB is the only resource providing a precomputed, threshold-adjustable, two-axis similarity landscape of structure and sequence for the full experimental PDB, with exportable feature vectors ready for machine learning, evolutionary analysis, or structural bioinformatics pipelines.
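To illustrate what a threshold-adjustable, reproducible export might look like downstream, the sketch below labels a handful of pair records at user-chosen cutoffs and writes them as CSV. It does not use any StructomeDB API: the record fields, the function name and the chain identifiers are all hypothetical, and the inclusive cutoffs are an assumption.

```python
import csv
import io


def export_labelled_pairs(records, struct_cut, seq_cut, out):
    """Write pairs with a binary label per axis at user-chosen cutoffs."""
    fields = ["chain_a", "chain_b", "tm_score", "seq_similarity",
              "struct_similar", "seq_similar"]
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    # Sort so the same input and cutoffs always yield the same file.
    for rec in sorted(records, key=lambda r: (r["chain_a"], r["chain_b"])):
        writer.writerow({
            **rec,
            "struct_similar": int(rec["tm_score"] >= struct_cut),
            "seq_similar": int(rec["seq_similarity"] >= seq_cut),
        })


# Hypothetical records with made-up chain identifiers.
records = [
    {"chain_a": "2XYZ_B", "chain_b": "3DEF_A", "tm_score": 0.71, "seq_similarity": 0.12},
    {"chain_a": "1ABC_A", "chain_b": "2XYZ_B", "tm_score": 0.31, "seq_similarity": 0.00},
]

buffer = io.StringIO()
export_labelled_pairs(records, struct_cut=0.6, seq_cut=0.3, out=buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Sorting the rows and fixing the column order are what make the export repeatable: rerunning with the same records and cutoffs gives a byte-identical file.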