Explore — StructomeDB

How large is StructomeDB?

StructomeDB covers 61,631 representative protein chains from the full experimental PDB. An exhaustive all-vs-all structural comparison yielded 104,528,817 structurally detected pairs, defined as pairs for which at least one directional Foldseek search returned a hit. Each pair carries a full feature vector of structural and sequence metrics.

104.5M unique pair records

61,631 representative chains

7,351 species represented

10,359 Pfam families covered
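The pair definition above (a hit in at least one search direction) can be sketched as follows. This is a minimal illustration, not the StructomeDB pipeline: the chain identifiers are made up, and it assumes that pairs are unordered and that self-hits are excluded. Under the same assumptions, it also computes how many distinct pairs 61,631 chains could form, which gives a denominator for the detected-pair count.

```python
import math

N_CHAINS = 61_631              # representative chains, from the text above
DETECTED_PAIRS = 104_528_817   # structurally detected pairs, from the text above


def unique_pairs(directional_hits):
    """Collapse directional (query, target) hits into unordered pairs.

    A pair is kept if either direction produced a hit. Self-hits are
    dropped (an assumption; the source does not say how they are handled).
    """
    pairs = set()
    for query, target in directional_hits:
        if query != target:
            pairs.add(tuple(sorted((query, target))))
    return pairs


# Hypothetical chain identifiers, for illustration only.
hits = [
    ("1abc_A", "2xyz_B"),  # A -> B
    ("2xyz_B", "1abc_A"),  # B -> A, same unordered pair
    ("1abc_A", "3def_C"),  # hit in one direction only
    ("3def_C", "3def_C"),  # self-hit, ignored
]
detected = unique_pairs(hits)

# Upper bound on unordered pairs among all representative chains.
possible_pairs = math.comb(N_CHAINS, 2)
detected_fraction = DETECTED_PAIRS / possible_pairs
```

Sorting each pair before adding it to the set makes the record independent of search direction, so a pair found in both directions is stored once.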

Which domains of life are represented?

The database spans all three domains of life as well as viruses. Eukaryota and Bacteria together account for 87.8% of chains, with Viruses and Archaea contributing the remainder. All 61,631 chains carry UniProt accessions.

Chain composition by domain of life

Annotation coverage across 61,631 chains

How similar are proteins to each other?

Pairs are classified by structural similarity (Foldseek TM-score) and sequence similarity (BLASTP) at a threshold of 0.5 on each axis. The vast majority of pairs fall in the unrelated quadrant, as expected for a structome-scale comparison; only 5.1% show meaningful structural or sequence similarity.

Pair distribution across four quadrants (threshold 0.5 / 0.5)
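The two-axis classification described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not StructomeDB's own code: it assumes both scores are scaled to the range 0 to 1, that a missing BLASTP hit counts as zero sequence similarity, that a score exactly at the threshold counts as similar, and that "convergent candidate" means structure-similar only while "divergent candidate" means sequence-similar only.

```python
from collections import Counter

THRESHOLD = 0.5  # threshold on each axis, as stated above


def classify_pair(tm_score, seq_similarity, threshold=THRESHOLD):
    """Assign a pair to one of four quadrants.

    seq_similarity may be None when BLASTP returned no hit; that case is
    treated as zero sequence similarity (an assumption).
    """
    seq = 0.0 if seq_similarity is None else seq_similarity
    structure_similar = tm_score >= threshold
    sequence_similar = seq >= threshold
    if structure_similar and sequence_similar:
        return "homolog"
    if structure_similar:
        return "convergent candidate"
    if sequence_similar:
        return "divergent candidate"
    return "unrelated"


# Made-up (TM-score, sequence similarity) values for illustration.
pairs = [(0.82, 0.91), (0.74, None), (0.31, 0.66), (0.12, None)]
counts = Counter(classify_pair(tm, seq) for tm, seq in pairs)
```

Passing a different `threshold` re-labels the same pairs without recomputing any alignment, which is the practical benefit of storing the raw scores for every pair.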

Where do pairs concentrate in similarity space?

The 10×10 grid maps structural similarity (X) against sequence similarity (Y) in 0.1-interval bins. The densest cell lies at low structural and low sequence similarity (pairs with TM-score 0.10–0.15 and no BLASTP hit) and contains 38.8M pairs (37.1% of the database).

Pair density across the structural × sequence similarity landscape (log₁₀ scale)
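The binning behind such a grid can be sketched as below. This is a minimal example under assumptions: scores lie between 0 and 1, a pair with no BLASTP hit is placed in the lowest sequence bin, and a score of exactly 1.0 is folded into the top bin. The function names are illustrative, not part of any StructomeDB interface.

```python
import math

N_BINS = 10  # 0.1-wide bins on each axis


def bin_index(score, n_bins=N_BINS):
    """Map a score in [0, 1] to a bin index; 1.0 is folded into the top bin."""
    return min(int(score * n_bins), n_bins - 1)


def density_grid(pairs, n_bins=N_BINS):
    """Count pairs per cell; rows are sequence bins (Y), columns are structure bins (X)."""
    grid = [[0] * n_bins for _ in range(n_bins)]
    for tm_score, seq_similarity in pairs:
        seq = 0.0 if seq_similarity is None else seq_similarity  # no hit -> lowest bin (assumption)
        grid[bin_index(seq, n_bins)][bin_index(tm_score, n_bins)] += 1
    return grid


def log10_grid(grid):
    """Convert counts to log10 for display; empty cells become None."""
    return [[math.log10(c) if c > 0 else None for c in row] for row in grid]


# Made-up (TM-score, sequence similarity) values for illustration.
pairs = [(0.12, None), (0.14, None), (0.18, 0.05), (0.55, 0.62), (1.0, 1.0)]
grid = density_grid(pairs)
log_grid = log10_grid(grid)
```

A log scale is useful here because cell counts span several orders of magnitude; empty cells are kept as `None` rather than forced to zero, since the logarithm of zero is undefined.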

Do structure and sequence alignments cover the same regions?

Among the 1.5M pairs with a BLASTP hit, StructomeDB retains both the Foldseek and BLASTP alignment windows, enabling direct comparison of which protein regions each tool detected. At a tolerance of ±10 residues, 71.6% of these pairs are discordant: the structural and sequence alignments cover different parts of the protein. Discordance is highest among divergent candidates (94.4%).

Alignment discordance rate by quadrant (±10 residue tolerance)

Overall discordance at different tolerances
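One way to make the discordance test concrete is sketched below. The source does not state the exact criterion, so this is an assumed definition: two alignment windows agree when both their start and end positions differ by at most the tolerance, and a pair is discordant when the Foldseek and BLASTP windows disagree on either chain. The record layout (dicts with `query` and `target` windows) is hypothetical.

```python
def windows_agree(window_a, window_b, tolerance=10):
    """True if two (start, end) windows differ by at most `tolerance` residues at both ends."""
    (start_a, end_a), (start_b, end_b) = window_a, window_b
    return abs(start_a - start_b) <= tolerance and abs(end_a - end_b) <= tolerance


def is_discordant(foldseek, blastp, tolerance=10):
    """Compare alignment windows on both chains of a pair (assumed criterion)."""
    return not all(
        windows_agree(foldseek[chain], blastp[chain], tolerance)
        for chain in ("query", "target")
    )


def discordance_rate(records, tolerance=10):
    """Fraction of (foldseek, blastp) records that are discordant."""
    flags = [is_discordant(fs, bl, tolerance) for fs, bl in records]
    return sum(flags) / len(flags)


# Made-up alignment windows for illustration.
records = [
    ({"query": (5, 120), "target": (8, 125)}, {"query": (10, 118), "target": (12, 130)}),
    ({"query": (1, 200), "target": (1, 198)}, {"query": (150, 200), "target": (148, 198)}),
    ({"query": (20, 90), "target": (22, 95)}, {"query": (32, 92), "target": (30, 96)}),
]
rate_at_10 = discordance_rate(records, tolerance=10)
rate_at_15 = discordance_rate(records, tolerance=15)
```

Because the tolerance is a parameter, the same stored windows can be re-evaluated at several tolerances; a looser tolerance can only lower the discordance rate or leave it unchanged.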

BLASTP similarity distribution (detectable pairs only)

How many pairs cross kingdom boundaries?

38.1% of unrelated pairs and 38.3% of convergent candidates span different domains of life, offering potential windows into convergent evolution. Among homologs, cross-kingdom pairs are enriched for Bacteria × Eukaryota comparisons.

Same-domain vs cross-kingdom pairs per quadrant

How large are the representative chains?

Representative chains range from 76 to 4,629 residues, with a mean of 299.9 and median of 245 residues. 73% are fully gapless structures.

76 min chain length (residues)

245 median chain length (residues)

4,629 max chain length (residues)

Gapless vs gapped representative chains

What sets StructomeDB apart?

Existing structural search tools require you to submit a query and run a search; StructomeDB is the precomputed matrix. All 104.5M structurally detected pairs are already calculated, annotated, and ready to retrieve. No other resource lets you interrogate structural and sequence similarity simultaneously across the entire experimental PDB, set your own thresholds, and export reproducible labelled datasets for downstream analysis or machine learning, all without running a single search.

How StructomeDB compares to related resources

Structural search tools

Tool	What it does	What it doesn't do
RCSB PDB	Sequence clustering at preset identity thresholds (30–100%) and on-demand pairwise sequence and structure comparison via the Comparison Tool	Clustering is not a pairwise similarity matrix; on-demand comparisons are query-based; no precomputed structome-wide two-axis landscape; not designed for population-level retrieval or ML dataset export
DALI	On-demand structural alignment of a query against the PDB	Not precomputed; no sequence axis; no exportable feature vectors
PDBeFold	Secondary-structure-based structural search against the PDB	Not precomputed; no sequence axis; no pairwise matrix
Foldseek	Ultra-fast structural search using 3Di alphabet encoding	Query-based only; no precomputed all-vs-all matrix; no integrated sequence axis
Structome-Q / Structome-TM	Structome-scale structural search and TM-score ranking tools	Search tools, not a precomputed pairwise database; no integrated sequence similarity axis

Structural organisation resources

Resource	What it does	What it doesn't do
SCOP	Manually curated hierarchical classification of protein structural domains	Classification resource, not a pairwise comparison database; no sequence similarity axis
CATH	Automated and curated classification of protein domain structures	Classification resource, not a pairwise comparison database; no sequence similarity axis
ECOD	Evolutionary classification of protein domains using both structure and sequence	Classification resource, not a pairwise comparison database; no user-defined thresholds

Prediction resources

Resource	What it does	What it doesn't do
AlphaFold DB	Repository of predicted protein structures for UniProt sequences	Structure repository only; no pairwise comparisons; StructomeDB covers experimental structures exclusively
ESMAtlas	Large-scale repository of predicted metagenomic protein structures	Structure repository only; predicted structures; no pairwise comparison matrix

StructomeDB is the only resource providing a precomputed, threshold-adjustable, two-axis similarity landscape across both structure and sequence for the full experimental PDB, with exportable feature vectors ready for machine learning, evolutionary analysis, or structural bioinformatics pipelines.
