StructomeDB

Structome-wide pairwise protein structural and sequence comparison database

Vanessa Robinson  ·  Haoyue Hu  ·  Ashar J. Malik*  ·  David B. Ascher*
School of Chemistry and Molecular Biosciences & Australian Centre for Ecogenomics, The University of Queensland


What is StructomeDB?

StructomeDB is a pre-computed database of 104,528,817 structurally detected pairs from an exhaustive all-vs-all comparison across 61,631 UniProt-linked representative protein chains drawn from the full experimental Protein Data Bank (PDB), snapshotted April 12, 2026. Pairs where neither directional Foldseek search returned a hit are absent from the database — the vast majority of possible chain combinations show no detectable structural similarity.

Each representative chain was selected by excluding chimeric chains, applying structural quality filters (minimum contiguous resolved segment > 75 residues), and retaining a single best-quality chain per UniProt accession using a hierarchical ranking procedure. The resulting set spans 7,351 species across all domains of life including viruses: Eukaryota (46.7%), Bacteria (41.1%), Viruses (6.9%), and Archaea (4.2%).

Pairwise structural similarity was assessed using Foldseek in exhaustive search mode, recording the alignment-normalised TM-score for every structural hit. Pairwise sequence similarity was independently assessed using BLASTP (E-value ≤ 10), recording percentage identity, percentage similarity, E-value, bit-score, and alignment coordinates. Both directions of each comparison were merged into a single bidirectional record per pair, with rotation matrix U and translation vector T preserved to enable instantaneous structural superimposition.

Two-Axis Classification

Each pair is placed on a two-dimensional similarity landscape:

The landscape is divided into a 10×10 grid with 0.1-interval bins on each axis. A user-defined threshold on each axis divides the space into four biologically meaningful quadrants:

QuadrantSymbolStructuralSequenceInterpretation
Homologs++≥ threshold≥ thresholdHigh structural and sequence similarity — likely evolutionarily related
Convergent candidates+−≥ threshold< thresholdSimilar structure, dissimilar sequence — potential convergent evolution
Divergent candidates−+< threshold≥ thresholdDissimilar structure, similar sequence — conserved motif, diverged fold
Unrelated−−< threshold< thresholdLow structural and sequence similarity

At the default threshold of 0.5 on both axes, StructomeDB contains 611,283 homolog pairs (0.6%), 4,583,756 convergent candidates (4.4%), 126,022 divergent candidates (0.1%), and 99,207,756 unrelated pairs (94.9%).

Feature Vector

Every pair record carries a unified feature vector. Fields prefixed fs_ refer to the first-seen chain (query direction); fields prefixed ss_ refer to the second-seen chain (reverse direction).

FieldDescription
hashDeterministic 16-character MD5-based pair identifier
first_seen_id / second_seen_idPDB chain identifiers (PDBID_CHAIN format)
best_alntmscoreBest-direction Foldseek alignment-normalised TM-score
best_blastp_similarityCorresponding BLASTP sequence similarity
best_directionDirection (fs/ss) from which best scores were taken
fs_u / fs_tRotation matrix U and translation vector T for structural superimposition
fs_struct_q/tstart/endFoldseek alignment coordinates on query and target
fs_blastp_q/hstart/endBLASTP alignment coordinates on query and target
fs_blastp_identityBLASTP percentage identity
fs_blastp_evalueBLASTP E-value
fs_blastp_bitscoreBLASTP bit-score
fs_domain / ss_domainBiological domain (Eukaryota, Bacteria, Archaea, Viruses)
fs_pfam / ss_pfamPfam family annotation
fs_scop_label / ss_scop_labelSCOP 2.08 family label
fs_cath_code / ss_cath_codeCATH 4.4.0 domain code
fs_uniprot_id / ss_uniprot_idUniProt accession
fs_tax_id / ss_tax_idNCBI Taxonomy ID
fs_scientific_name / ss_scientific_nameScientific species name

Using the Web Interface

Setting the threshold

The interactive grid on the homepage displays the pair density landscape across the full 10×10 structural × sequence similarity space. Each cell is coloured by the log₁₀ number of pairs it contains — darker cells are more densely populated.

Click or drag anywhere on the grid to move the threshold crosshair. The crosshair snaps to 0.1 increments on both axes. The quadrant counts on the right update automatically within approximately 400ms of releasing the crosshair.

Exporting data

  1. Set your desired threshold using the interactive grid
  2. Enter a random seed in the Export panel (default: 42). The same seed and threshold combination always produces the same sample, making downloads reproducible
  3. Click Export CSV
  4. Wait while the server samples up to 10,000 pairs from each occupied quadrant (maximum 40,000 rows per request) and prepares the file — this typically takes 10–30 seconds
  5. The CSV file downloads automatically when ready

The exported file includes the quadrant label, threshold values, seed, and the complete feature vector for every pair.

Full database access

The complete dataset of 104,528,817 pairs (~26 GB) is available upon request for research purposes. Please contact ashar.malik@uq.edu.au with a brief description of your intended use.

Guided Examples

The following worked examples illustrate the two primary use cases for StructomeDB: generating labelled datasets for machine learning, and performing on-demand pairwise analysis of any two protein chains.

Use case 1: Generating labelled training data for machine learning

Machine learning models for protein function prediction, evolutionary classification, or structural bioinformatics require labelled training data — pairs with known similarity relationships assigned to meaningful categories. StructomeDB is designed for exactly this use case: set your own structural and sequence thresholds, instantly see how many pairs fall in each quadrant, and export a reproducible labelled dataset without running a single search.

StructomeDB landing page showing the sequence-structure similarity landscape, quadrant counts, and export panel

Reading the similarity landscape

The heatmap shows the full 10×10 grid of structural similarity (X axis, Foldseek TM-score) against sequence similarity (Y axis, BLASTP). Cell colour approximates pair count on a log₁₀ scale — darker cells contain more pairs.

Two features of the grid are worth noting. First, the two empty rows at low sequence similarity (roughly 0.15–0.35) reflect the BLASTP detection floor: sequence alignments below approximately 30% similarity are generally not returned, leaving those cells unpopulated. This is a biological reality, not a data gap. Second, the bottom row (sequence similarity 0.0–0.1) appears populated despite falling below the BLASTP detection floor. This is a binning artefact: pairs with no BLASTP hit are assigned a sequence similarity of 0 and therefore fall into the bottom bin. These pairs have Foldseek structural hits across the full TM-score range (0–1), which is why the bottom row is coloured — they are structurally detectable but sequence-invisible.

Setting the threshold

Click or drag anywhere on the grid to move the red crosshair. The crosshair snaps to 0.1 increments on both axes and divides the space into four labelled quadrants: ++ Homologs, +− Convergent candidates, −+ Divergent candidates, and −− Unrelated. The quadrant counts in the right panel update automatically as you move the threshold, showing exactly how many pairs fall into each category at your chosen cutoff.

Exporting labelled data

Once satisfied with your threshold, click Export CSV. The server randomly samples up to 10,000 pairs per quadrant (up to 40,000 rows total), assigning each row a quadrant label of ++, +-, -+, or -- alongside the full feature vector. If a quadrant contains fewer than 10,000 pairs at your chosen threshold — as can happen when the crosshair is set near an extreme corner — all available pairs in that quadrant are exported. Depending on server load, the download may take up to 30 seconds to prepare.

To obtain a different random sample at the same threshold, change the random seed in the Export panel and export again. The same seed and threshold combination always produces the same sample, making your datasets fully reproducible.

To compare any two specific protein chains outside the pre-computed database, use the Compare two proteins panel on the right — see Use case 2 for a full walkthrough.

Use case 2: On-demand comparison of any two chains

The 61,631 representative chains in StructomeDB do not cover every experimentally determined protein structure. If you have a pair of interest that falls outside this set — or simply want to compare any two PDB chains interactively — the Analyse endpoint runs fresh Foldseek and BLASTP comparisons on demand and returns a full interactive result.

As a worked example, we compare the alpha and beta subunits of human deoxyhaemoglobin: 1HV4_A (alpha chain) and 1HV4_B (beta chain). These are paralogous globin chains — evolutionarily related but sequence-diverged — making them a useful illustration of what StructomeDB reports for pairs with high structural similarity and moderate sequence similarity.

Navigate to the Analyse page, enter 1HV4_A as Chain A and 1HV4_B as Chain B, and click Analyse. The job typically completes in 15–30 seconds.

Reading the result page

The result page displays two panels side by side — one for each direction of comparison. This is deliberate: StructomeDB treats A→B and B→A as distinct comparisons because the TM-score is normalised to the query chain length, and both Foldseek and BLASTP may anchor their alignments to different regions of the protein depending on which chain is the query. For closely related chains of similar length (as here), both directions converge on nearly identical scores. For more divergent pairs, the two panels can differ substantially — revealing, for example, that a structural alignment anchors to the N-terminal domain in one direction but the C-terminal domain in the other.

The summary metrics strip across the top reports:

For 1HV4_A vs 1HV4_B, both TM-scores are 0.925 and both BLASTP similarities are 55.6% — indicating very high structural similarity with moderate sequence similarity, consistent with paralogous globin chains that diverged from a common ancestor.

Interpreting the alignment tracks

Below each 3D viewer, a sequence feature track shows the query chain with two alignment windows overlaid:

Foldseek — the region of the query chain covered by the structural alignment
BLASTP — the region of the query chain covered by the sequence alignment

When the two tracks overlap closely, structure and sequence are detecting similarity in the same region — a concordant signal. When they diverge, the tools are highlighting different parts of the protein: a pattern that can indicate domain insertions, deletions, or structural rearrangements between the two chains. For 1HV4_A vs 1HV4_B the tracks are near-fully concordant, as expected for two closely related globin chains.

The screenshot below shows the full result page for this comparison, including the dual-pane 3D superimposition and alignment tracks:

StructomeDB result page for 1HV4_A vs 1HV4_B showing dual-pane Mol* viewer and alignment tracks
The rotation matrix U and translation vector T reported by Foldseek for each direction are applied directly to superimpose the structures shown in the 3D viewer. These same values are stored in the pre-computed database for every pair, enabling instantaneous structural superimposition without re-running any search. They are available in the exported CSV as fs_u, fs_t, ss_u, and ss_t.

API Reference

All endpoints are available under /structome_db/api/. Responses are JSON unless otherwise noted. Rate limits apply per IP address.

GET /api/density

Returns the precomputed 10×10 grid density table used to render the homepage heatmap. Served once on page load.

unlimited

curl https://biosig.lab.uq.edu.au/structome_db/api/density
[
  {"cell_id": "0.05,0.05", "count": 38762936},
  {"cell_id": "0.05,0.15", "count": 21748389},
  ...
]

GET /api/quadrant_counts

Returns the number of pairs in each quadrant for a given threshold.

60 / min · 500 / hr

ParameterTypeDefaultDescription
tmfloat0.5Structural similarity threshold (0–1)
simfloat0.5Sequence similarity threshold (0–1)
curl "https://biosig.lab.uq.edu.au/structome_db/api/quadrant_counts?tm=0.5&sim=0.5"
{"pp": 611283, "pm": 4583756, "mp": 126022, "mm": 99207756}
# Python
import requests
r = requests.get(
    "https://biosig.lab.uq.edu.au/structome_db/api/quadrant_counts",
    params={"tm": 0.5, "sim": 0.5}
)
print(r.json())

GET /api/export

Samples up to 10,000 pairs from each quadrant and returns a CSV file. The same seed and threshold always produce the same sample.

5 / min · 20 / hr

ParameterTypeDefaultDescription
tmfloat0.5Structural similarity threshold (0–1)
simfloat0.5Sequence similarity threshold (0–1)
seedint42Random seed for reproducible sampling
# curl
curl -o structomedb_export.csv \
  "https://biosig.lab.uq.edu.au/structome_db/api/export?tm=0.5&sim=0.5&seed=42"
# Python
import requests

r = requests.get(
    "https://biosig.lab.uq.edu.au/structome_db/api/export",
    params={"tm": 0.5, "sim": 0.5, "seed": 42},
    stream=True
)
with open("structomedb_export.csv", "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)
print("Done")
# R
download.file(
  url = paste0("https://biosig.lab.uq.edu.au/structome_db/api/export",
               "?tm=0.5&sim=0.5&seed=42"),
  destfile = "structomedb_export.csv"
)
df <- read.csv("structomedb_export.csv")

GET /api/health

Returns database connectivity status. Useful for monitoring.

curl https://biosig.lab.uq.edu.au/structome_db/api/health
{
  "status": "ok",
  "databases": {
    "db3": {"ok": true, "pairs": 104528817},
    "db5": {"ok": true, "rows": 104528817}
  }
}

Analyse

What it does

The Analyse page performs an on-demand pairwise comparison between any two PDB chains you specify. Unlike the pre-computed database, this runs fresh computations at request time and works for any experimentally determined chain — not just the 61,631 representative set.

Given two chain identifiers in PDBID_CHAIN format (e.g. 1HV4_A and 4HHB_A), the backend:

  1. Downloads the full mmCIF files from RCSB
  2. Extracts the specified chains
  3. Runs Foldseek in both directions (A→B and B→A) using exhaustive search
  4. Runs BLASTP in both directions
  5. Applies the Foldseek rotation matrix U and translation vector T to superimpose each structure onto the other
  6. Returns an interactive dual-pane viewer with alignment tracks

Jobs are asynchronous — after submission the page polls for completion and redirects automatically. Jobs expire after one hour.

If either chain is not in the representative set, the comparison still runs normally — the Analyse endpoint operates independently of the pre-computed database and fetches structures directly from RCSB.

Metrics

Results are displayed across two panels — left pane shows Chain A as reference with Chain B superimposed; right pane shows Chain B as reference with Chain A superimposed.

Summary strip — top-level metrics across both directions:

0.812TM-score A→B
0.798TM-score B→A
83.2%BLASTP sim A→B
83.2%BLASTP sim B→A
141Chain A length
141Chain B length

Per-pane metrics — direction-specific scores shown above each viewer:

0.812TM-score
138Struct aln len
84.1%Struct aln %id
2.3e-67BLASTP e-value
83.2%BLASTP sim
MetricSourceDescription
TM-scoreFoldseekAlignment-normalised TM-score for this direction. Values closer to 1.0 indicate stronger structural similarity. TM-score is length-normalised and therefore comparable across pairs of different sizes.
Struct aln lenFoldseekNumber of residue pairs included in the structural alignment. The TM-score and structure-guided identity are both computed over this window.
Struct aln %idFoldseekSequence identity computed only within the structural alignment window — identical residues divided by structural alignment length. This is a structure-guided metric: residue pairings come from the 3D superimposition, not from a sequence alignment algorithm. It answers the question "of the positions where these structures agree geometrically, how many are also the same amino acid?" Note this differs from global BLASTP identity.
BLASTP e-valueBLASTPExpect value for the best sequence alignment hit. Lower values indicate more statistically significant sequence similarity.
BLASTP simBLASTPPercentage of aligned positions that are either identical or have a positive BLOSUM62 substitution score (ppos). This is the primary sequence similarity axis used throughout StructomeDB.

Alignment tracks — below each viewer, a Saguaro sequence feature viewer shows the query chain sequence with two alignment tracks overlaid:

Foldseek — the region of the query chain covered by the structural alignment
BLASTP — the region of the query chain covered by the sequence alignment

Comparing the two tracks reveals whether the structural and sequence alignments are co-localised on the protein — a divergence between them can indicate domain-level structural rearrangement or the presence of an inserted/deleted region.

POST /api/analyse/submit

Submit an analysis job. Accepts JSON body or query parameters.

ParameterTypeRequiredDescription
chain_astringyesFirst chain in PDBID_CHAIN format (e.g. 1HV4_A)
chain_bstringyesSecond chain in PDBID_CHAIN format (e.g. 4HHB_A)
curl -X POST https://biosig.lab.uq.edu.au/structome_db/api/analyse/submit \
  -H "Content-Type: application/json" \
  -d '{"chain_a": "1HV4_A", "chain_b": "4HHB_A"}'
{"job_id": "747d8bc4-7e82-41e5-996c-e571572e6e6b", "status": "queued"}
# Python
import requests, time

r = requests.post(
    "https://biosig.lab.uq.edu.au/structome_db/api/analyse/submit",
    json={"chain_a": "1HV4_A", "chain_b": "4HHB_A"}
)
job_id = r.json()["job_id"]
print("Job:", job_id)

GET /api/analyse/status/<job_id>

Poll the status of a running job. Typical job completion time is 15–45 seconds depending on chain length.

Status valueMeaning
queuedJob accepted, not yet started
runningJob in progress — msg field gives current step
doneComplete — result is available
errorJob failed — msg field contains the error
curl https://biosig.lab.uq.edu.au/structome_db/api/analyse/status/747d8bc4-7e82-41e5-996c-e571572e6e6b
{"status": "running", "msg": "Running Foldseek A→B...", "chain_a": "1HV4_A", "chain_b": "4HHB_A"}
# Python — poll until done
while True:
    s = requests.get(
        f"https://biosig.lab.uq.edu.au/structome_db/api/analyse/status/{job_id}"
    ).json()
    print(s["status"], s["msg"])
    if s["status"] in ("done", "error"):
        break
    time.sleep(3)

GET /api/analyse/result/<job_id>

Returns the full result JSON once a job has status done. Jobs expire after one hour.

curl https://biosig.lab.uq.edu.au/structome_db/api/analyse/result/747d8bc4-7e82-41e5-996c-e571572e6e6b
{
  "chain_a": "1HV4_A",
  "chain_b": "4HHB_A",
  "seq_a": "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR",
  "seq_b": "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR",
  "ab": {
    "alntmscore": 0.812,
    "u": [[...], [...], [...]],
    "t": [...],
    "qstart": 1, "qend": 141,
    "tstart": 1, "tend": 141,
    "pident": 84.1,
    "alnlen": 138,
    "evalue": 2.3e-67,
    "bits": 241.0
  },
  "ba": { "..." },
  "blastp_ab": {
    "pident": 83.2,
    "ppos": 83.2,
    "alnlen": 141,
    "qstart": 1, "qend": 141,
    "tstart": 1, "tend": 141,
    "evalue": 2.3e-67,
    "bits": 241.0
  },
  "blastp_ba": { "..." }
}

Rate Limits

EndpointPer minutePer hour
/api/densityunlimitedunlimited
/api/quadrant_counts60500
/api/export520

Exceeding rate limits returns HTTP 429. For bulk programmatic access or the full database please contact the authors.

Licence

StructomeDB is released under the Creative Commons CC0 1.0 Universal licence. The database is freely available to all users with no restrictions on use, redistribution, or incorporation into downstream resources.

Contact

Ashar Malik
ashar.malik@uq.edu.au

BioSig Lab · The University of Queensland