Readme

1. What are language models?

At its core, a language model (LM) is an AI model trained to capture the patterns, grammar, and context of human language. Modern LMs, such as those based on the Transformer architecture, have become advanced enough to generate coherent, human-like text, answer questions, and summarize documents.

2. What are protein language models (pLMs)?

A protein language model (pLM) applies the same principles as a traditional LM to the "language" of biology. By training on large databases of known protein sequences, such as UniProt, which contain millions of entries, a pLM learns the complex relationships between amino acids.

3. What are embeddings and how are they extracted?

An embedding is a rich, numerical "fingerprint" of a token (like a word or an amino acid). Instead of representing "Alanine" as a simple letter 'A', a pLM represents it as a high-dimensional vector. This vector captures the deep contextual meaning of that amino acid within its specific protein sequence.

These embeddings are extracted from the hidden layers of a pre-trained pLM. The power of this approach is that the embedding for an Alanine at the start of a protein will be different from the embedding for an Alanine in the middle of a functional catalytic site, as the vector captures context.

4. What is SA-Prot?

SA-Prot (Structure-Aware Protein) is a structure-aware pLM. While standard pLMs are trained only on 1D amino acid sequences, SA-Prot was trained on both the 1D sequence and the 3D structure, with the structure encoded as Foldseek 3Di tokens. This allows it to learn the relationship between sequence and structure simultaneously.

In SA-Prot, a token is not just an amino acid but a representation of the amino acid within its local 3D structural context. For example, the model learns a different representation for an Alanine found in an alpha-helix versus one found in a beta-sheet. The per-token embedding is a high-dimensional vector (1280 numbers). This dimensionality lets the vector encode not just the amino acid's identity but also its local geometry and its relationship with the surrounding sequence and structure. This makes SA-Prot's embeddings particularly well suited to tasks in structural phylogenetics, as demonstrated by Structome-DeepRoots.
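The idea of a combined sequence-and-structure token can be sketched in a few lines of Python. This is an illustration only: the token layout used here (an upper-case amino acid letter followed by a lower-case 3Di state letter) and the function name combine_tokens are assumptions made for the example, not a description of how the server builds its input.

```python
# Illustrative sketch: pair each residue with its 3Di state to form one token.
# The "Xy" layout (upper-case residue + lower-case 3Di letter) is an assumption.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
THREE_DI_STATES = set("acdefghiklmnpqrstvwy")  # 20 structural states, lower-cased


def combine_tokens(aa_seq, three_di_seq):
    """Return one combined token per residue, e.g. 'A' + 'v' -> 'Av'."""
    if len(aa_seq) != len(three_di_seq):
        raise ValueError("sequence and 3Di string must have the same length")
    tokens = []
    for aa, state in zip(aa_seq.upper(), three_di_seq.lower()):
        if aa not in AMINO_ACIDS:
            raise ValueError(f"non-standard amino acid: {aa!r}")
        if state not in THREE_DI_STATES:
            raise ValueError(f"unknown 3Di state: {state!r}")
        tokens.append(aa + state)
    return tokens


# The same Alanine becomes a different token in a different structural state.
tokens = combine_tokens("MAA", "dvp")
```

Here the two Alanines map to the distinct tokens "Av" and "Ap", which is why the model can learn separate representations for the same residue type in different structural environments.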

5. Why Structome-DeepScore?

Traditional structural comparison uses 3D metrics like Q-score and TM-score. These methods compare the raw 3D coordinates (x,y,z) of protein atoms, allowing for the recovery of structural similarity signals. Structome-DeepScore is built on a simple but powerful question: can we achieve a more nuanced and discriminative comparison by expanding the dimensionality of the data?

6. The DeepScore Metric

To answer this question, we introduce a new metric called DeepScore. Instead of comparing 3D coordinates, DeepScore compares proteins using the cosine similarity of their high-dimensional structure-aware embeddings. After two protein structures are aligned, the embedding for each aligned residue is retrieved from the SA-Prot model. The DeepScore is then calculated from the similarity of these vector pairs.

This approach is powerful because each embedding is derived from a combined token that represents both sequence and local structural information. Therefore, the DeepScore comparison considers not only spatial equivalence but also the implicit evolutionary, chemical, and structural context captured by the protein language model across more than a thousand dimensions.
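A minimal sketch of the comparison step is shown below. It assumes that the per-pair cosine similarities are averaged over all aligned residue pairs; the text above does not state how the pairs are aggregated, so the mean is an assumption, and the names cosine_similarity and deepscore_sketch are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def deepscore_sketch(emb_a, emb_b, aligned_pairs):
    """Average cosine similarity over aligned residue pairs.

    emb_a, emb_b  : per-residue embeddings (lists of vectors) for each protein
    aligned_pairs : (i, j) index pairs taken from the structural alignment
    Averaging is an assumption; the real metric may aggregate differently.
    """
    if not aligned_pairs:
        raise ValueError("no aligned residue pairs")
    sims = [cosine_similarity(emb_a[i], emb_b[j]) for i, j in aligned_pairs]
    return sum(sims) / len(sims)


# Toy 2-dimensional embeddings; real SA-Prot embeddings have 1280 dimensions.
toy_a = [[1.0, 0.0], [0.0, 1.0]]
toy_b = [[1.0, 0.0], [1.0, 0.0]]
score = deepscore_sketch(toy_a, toy_b, [(0, 0), (1, 1)])
```

In the toy example the first pair is identical (similarity 1) and the second is orthogonal (similarity 0), so the averaged score is 0.5.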

7. The User Interface: Input Guidelines

The web server is designed to be straightforward. The following limits and guidelines apply:

* Your dataset can be composed of sequences, structures, or PDB accessions.
* Care must be taken with the input files:
  * Sequences must only contain the standard 20 amino acid one-letter codes.
  * Submitted structures must be complete and have no missing backbone regions (i.e., no residue gaps).
  * Each protein, whether provided as a sequence or a structure, cannot exceed 300 residues in length.
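Inputs can be checked against these guidelines before submission. The sketch below is a local pre-check written for illustration, not the server's own validation: the function names are hypothetical, upper-casing the sequence is an assumption, and detecting gaps from residue numbering alone is a heuristic that may miss cases such as insertion codes.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MAX_RESIDUES = 300  # length limit stated in the guidelines


def validate_sequence(seq):
    """Return a list of problems found; an empty list means the sequence passes."""
    seq = seq.strip().upper()  # assumption: case and outer whitespace are ignored
    problems = []
    if not seq:
        problems.append("sequence is empty")
    if len(seq) > MAX_RESIDUES:
        problems.append(f"length {len(seq)} exceeds {MAX_RESIDUES} residues")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        problems.append("non-standard characters: " + ", ".join(invalid))
    return problems


def find_residue_gaps(residue_numbers):
    """Return (previous, next) pairs where the residue numbering skips."""
    return [
        (prev, nxt)
        for prev, nxt in zip(residue_numbers, residue_numbers[1:])
        if nxt - prev > 1
    ]
```

For example, validate_sequence("ACDXB") reports the non-standard characters B and X, and find_residue_gaps([1, 2, 3, 7, 8]) reports a gap between residues 3 and 7.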

8. Job Submission

Upon successful validation and submission of your data, the server will provide you with a unique Job ID. This ID can be used to access your results once the analysis is complete.

9. The Results Page

A successful job submission leads to an interactive results page, which is composed of several linked components designed to provide a comprehensive analysis of the pairwise structural comparison.

Summary Cards

The top row provides a high-level overview of your job, including the Job ID, the names and lengths of your two proteins, and the final DeepScore and TM-score distances. These scores update dynamically as you toggle the alignment direction.

Structure Viewer

An interactive 3D viewer powered by the PDBe Molstar Component displays the superimposed structural alignment.

Alignment Plot

This panel displays an interactive scatter plot of the residue-level structural equivalences for the selected alignment direction.

Alignment Table

This panel shows the raw data used to generate the scatter plot in a scrollable table. It lists every aligned residue pair and includes their amino acid identity, 3Di state, and Cα-Cα distance. A "Download CSV" button is provided to save this data for offline analysis.
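Once downloaded, the CSV can be summarized offline with the Python standard library. The sketch below is illustrative: the column names aa_1, aa_2, and ca_distance are hypothetical placeholders and should be replaced with the headers in the actual file, and the two example rows are made up purely to show the call.

```python
import csv
import io


def summarize_alignment(csv_text, aa1_col="aa_1", aa2_col="aa_2",
                        dist_col="ca_distance"):
    """Summarize an alignment table given as CSV text.

    The default column names are hypothetical; pass the real headers
    from the downloaded file.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("no aligned residue pairs found")
    identical = sum(1 for row in rows if row[aa1_col] == row[aa2_col])
    mean_distance = sum(float(row[dist_col]) for row in rows) / len(rows)
    return {
        "pairs": len(rows),
        "identity": identical / len(rows),
        "mean_ca_distance": mean_distance,
    }


# Made-up rows for illustration only, not real server output.
example_csv = "aa_1,aa_2,ca_distance\nA,A,1.0\nL,V,2.0\n"
summary = summarize_alignment(example_csv)
```

For a file on disk, read its contents with open(path, newline="") and pass the text to the same function.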

Access

Structome-DeepScore is a web app that will be made publicly available at https://biosig.lab.uq.edu.au/structome_deepscore

Citation

Structome-DeepScore --- coming soon

Additional reading:

Contact

For any issues with the server or functionality, please contact: