Readme

1. What are language models?

A language model (LM) is, at its core, an AI model trained to capture the patterns, grammar, and context of human language. Modern LMs, such as those based on the Transformer architecture, have become advanced enough to generate coherent, human-like text, answer questions, and summarize documents.

2. What are protein language models (pLMs)?

A protein language model (pLM) applies the same principles as a traditional LM to the "language" of biology. By training on enormous databases containing millions of known protein sequences (e.g., UniProt), a pLM learns the complex relationships between amino acids.

3. What are embeddings and how are they extracted?

An embedding is a rich, numerical "fingerprint" of a token (like a word or an amino acid). Instead of representing "Alanine" as a simple letter 'A', a pLM represents it as a high-dimensional vector. This vector captures the deep contextual meaning of that amino acid within its specific protein sequence.

These embeddings are extracted from the hidden layers of a pre-trained pLM. The power of this approach is that the embedding for an Alanine at the start of a protein will be different from the embedding for an Alanine in the middle of a functional catalytic site, as the vector captures context.

4. What is SA-Prot?

SA-Prot (Structure-Aware Protein) is a structure-aware pLM. While standard pLMs are trained only on 1D amino acid sequences, SA-Prot was trained on both the 1D sequence and the 3D structure, with the structure encoded as Foldseek 3Di tokens. This allows it to learn the relationship between sequence and structure simultaneously.

In SA-Prot, a token is not just an amino acid but a representation of the amino acid within its local 3D structural context. For example, the model learns a different representation for an Alanine found in an alpha-helix versus one found in a beta-sheet. The per-token embedding is a high-dimensional vector (1280 numbers). This high dimensionality allows the vector to encode not just the amino acid's identity but also its local geometry and its relationship with the surrounding sequence and structure. This makes SA-Prot's embeddings particularly useful for tasks in structural phylogenetics, as demonstrated by Structome-DeepRoots.
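
The idea of a combined sequence-and-structure token can be illustrated with a minimal sketch. It assumes that each token is written as an uppercase amino acid letter followed by a lowercase 3Di letter, that "#" marks a position without a structural state, and that the 3Di alphabet is written with the same 20 letters as the amino acids. The function name and the example 3Di string are hypothetical and are not real Foldseek output.

```python
from __future__ import annotations

# Assumption: both alphabets are written with the same 20 letters.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
THREE_DI_STATES = set("ACDEFGHIKLMNPQRSTVWY")


def combine_tokens(aa_seq: str, three_di: str) -> list[str]:
    """Pair each residue with its 3Di state to form one combined token."""
    if len(aa_seq) != len(three_di):
        raise ValueError("sequence and 3Di string must have the same length")
    tokens = []
    for aa, state in zip(aa_seq.upper(), three_di.upper()):
        if aa not in AMINO_ACIDS:
            raise ValueError(f"non-standard residue: {aa!r}")
        if state == "#":
            # No structural state available for this position.
            tokens.append(aa + "#")
        elif state in THREE_DI_STATES:
            # Uppercase residue + lowercase structural state, e.g. "Ad".
            tokens.append(aa + state.lower())
        else:
            raise ValueError(f"unknown 3Di state: {state!r}")
    return tokens


# Two alanines in different (made-up) structural states become different tokens.
tokens = combine_tokens("AGA", "dvp")
print(tokens)
```

In this toy input the first and third residues are both Alanine, yet they map to distinct tokens because their structural states differ, which is what lets the model learn separate representations for them.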

5. Why Structome-DeepRoots?

Traditional distance-based phylogenies use 3D structural metrics like Q-score and TM-score (e.g., in Structome-Q and Structome-TM). These methods compare the raw 3D coordinates ($x, y, z$) of protein atoms, allowing for the recovery of deeper evolutionary signals. Structome-DeepRoots is built on a simple but powerful question: if moving from 1D (sequence) to 3D (coordinates) improved the signal, can we achieve an even greater leap in resolution by expanding the dimensionality of the comparison further?

6. The DeepScore Metric

To answer this question, we introduce a new metric called DeepScore. Instead of comparing 3D coordinates, DeepScore compares proteins using the cosine similarity of their high-dimensional structure-aware embeddings. After two protein structures are aligned, the embedding for each aligned residue is retrieved from the SA-Prot model. The DeepScore is then calculated from the similarity of these vector pairs.

This approach is powerful because each embedding is derived from a combined token that represents both sequence and local structural information. Therefore, the DeepScore comparison not only considers spatial equivalence but also the implicit evolutionary, chemical, and structural context captured by the protein language model across thousands of dimensions.
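
The comparison described in this section could be sketched as follows. This is an illustration only: the text above states that DeepScore is calculated from the cosine similarity of aligned embedding pairs, but the choice to take the plain mean over pairs is an assumption, as are the function names and the toy 3-dimensional vectors (real embeddings have 1280 dimensions).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def deepscore_sketch(emb_a, emb_b, aligned_pairs):
    """Mean cosine similarity over aligned residue pairs (i in A, j in B).

    Assumption: averaging is used to combine the per-pair similarities.
    """
    if not aligned_pairs:
        raise ValueError("no aligned residues to compare")
    sims = [cosine_similarity(emb_a[i], emb_b[j]) for i, j in aligned_pairs]
    return sum(sims) / len(sims)


# Toy per-residue embeddings for two proteins.
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
emb_b = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
# Residue i of A is structurally aligned to residue j of B.
pairs = [(0, 0), (1, 1), (2, 2)]

score = deepscore_sketch(emb_a, emb_b, pairs)
print(round(score, 3))
```

Here two of the three aligned pairs have identical embeddings and one pair is orthogonal, so the toy score is about two thirds.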

7. Bootstrap Approximation

To assess the confidence of the branches in the final phylogenetic tree, Structome-DeepRoots employs a bootstrap approximation method based on computationally generated pseudo-replicate datasets. Instead of re-sampling columns from an alignment, Structome-DeepRoots perturbs the high-dimensional embedding of each residue by adding a small amount of controlled, random noise. This yields 100 perturbed sets of embeddings, and a new distance matrix and tree are calculated for each. The support for each clade in the original tree is then determined by the frequency at which it appears in the population of bootstrapped trees.
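
The two ends of this procedure, perturbing the embeddings and counting clade frequencies, can be sketched as below. Several details are assumptions made for illustration: the noise is drawn from a zero-mean Gaussian whose standard deviation plays the role of the noise level, each tree is represented simply as a set of clades (each clade a frozenset of taxon names), and the tree-building step between the two functions is omitted.

```python
import random


def perturb_embeddings(embeddings, noise_level, rng):
    """Return a noisy copy of per-residue embeddings.

    Assumption: zero-mean Gaussian noise with standard deviation noise_level.
    """
    return [[x + rng.gauss(0.0, noise_level) for x in vec] for vec in embeddings]


def clade_support(reference_clades, replicate_clade_sets):
    """Fraction of replicate trees in which each reference clade appears."""
    n = len(replicate_clade_sets)
    if n == 0:
        raise ValueError("at least one replicate tree is required")
    return {
        clade: sum(clade in rep for rep in replicate_clade_sets) / n
        for clade in reference_clades
    }


# Step 1: generate 100 perturbed copies of a toy embedding matrix.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [[0.1, 0.2], [0.3, 0.4]]
replicates = [perturb_embeddings(original, 0.01, rng) for _ in range(100)]

# Step 2 (tree building from each replicate) is not shown here.

# Step 3: count how often each clade of the original tree is recovered.
ab, cd = frozenset({"A", "B"}), frozenset({"C", "D"})
replicate_trees = [
    {ab, cd},
    {ab, cd},
    {ab, frozenset({"D", "E"})},
    {ab, frozenset({"C", "E"})},
]
support = clade_support([ab, cd], replicate_trees)
print(support[ab], support[cd])
```

In the toy data the clade {A, B} is recovered in all four replicate trees and {C, D} in two of them, so their support values are 1.0 and 0.5.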

8. The User Interface: Input Guidelines

The web server is designed to be straightforward. The following limits and guidelines apply:

* A maximum of 50 taxa can be included in a single analysis.
* Your dataset can be composed of sequences, structures, or a mix of both.
* A Noise Level parameter can be set on the input page, which controls the magnitude of the perturbation applied during the bootstrap analysis.
* Care must be taken with the input files:
  * Sequences must only contain the standard 20 amino acid one-letter codes.
  * Submitted structures must be complete and have no missing backbone regions (i.e., no residue gaps).
  * Each protein, whether provided as a sequence or a structure, cannot exceed 300 residues in length.
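
Before uploading, sequence inputs can be checked locally against the limits listed above. The following is a hypothetical pre-check, not the server's own validation code. It assumes sequences are given as uppercase one-letter codes in a name-to-sequence mapping, and it does not cover the structure completeness requirement.

```python
from __future__ import annotations

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")
MAX_TAXA = 50       # limit stated in the guidelines
MAX_RESIDUES = 300  # limit stated in the guidelines


def validate_sequences(sequences: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if len(sequences) > MAX_TAXA:
        problems.append(f"{len(sequences)} taxa supplied, maximum is {MAX_TAXA}")
    for name, seq in sequences.items():
        if len(seq) > MAX_RESIDUES:
            problems.append(
                f"{name}: {len(seq)} residues, maximum is {MAX_RESIDUES}"
            )
        # Assumption: only uppercase standard codes are acceptable.
        bad = sorted(set(seq) - STANDARD_AA)
        if bad:
            problems.append(f"{name}: non-standard characters {''.join(bad)}")
    return problems


example = {"ok": "ACDE", "bad": "ACXZ", "long": "A" * 301}
for problem in validate_sequences(example):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets every issue in a dataset be reported in a single pass.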

9. Job Submission

Upon successful validation and submission of your data, the server will provide you with a unique Job ID. This ID can be used to access your results once the analysis is complete.

10. The Results Page

A successful job submission leads to the interactive results page, which is composed of several linked components designed to provide a comprehensive overview of your structural phylogenetic analysis. It cannot be overstated: please go through all sections to understand the results and what they mean before using the tree.

Job Information

This top panel provides summary statistics for your job, including the Job ID, the number of structures analyzed, and failed comparisons (if any). It also contains a button to download the final Embeddings Tree with support in Newick (.nwk) format.

Phylogenetic Tree Viewer

This panel gives a quick summary tree (radial layout).

* Toggle: You can switch between the tree built from traditional TM-scores, which is the baseline for comparison used by Structome-DeepRoots, and the tree built from DeepScore, which uses the high-dimensional embeddings.
* Interactivity: The tree can be zoomed and panned using your mouse. Clicking anywhere on the viewer will expand it into a large "pop-out" modal for more detailed inspection. Clicking the background returns it to its original size.

Structure Viewer

An interactive 3D viewer powered by the PDBe Molstar Component displays the superimposed structural alignment for the selected pair.

* Display: By default, the target structure is colored green and the query structure is colored yellow.
* Toggle: You can switch the alignment perspective between "A→B" and "B→A". Both directions are available because the distance matrix averages the scores of these two comparisons.
* Interaction: The view updates automatically whenever a new pair is selected from the heatmap.

Distance Matrix Heatmap

The central component is a clickable heatmap representing the all-against-all distance matrix calculated using the DeepScore metric.

* Interaction: Clicking any cell in the heatmap is the primary way to explore your results. It instantly updates the Structure Viewer and the Pairwise Information Panel.
* Toggle: The axes of the heatmap can be sorted alphabetically by name or sorted according to the leaf order of the phylogenetic tree, which clusters related structures together.
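
Two operations behind this heatmap are easy to picture in code: averaging the A→B and B→A values into one symmetric matrix, and reordering the axes. The sketch below is illustrative only. The function names, labels, and numbers are hypothetical, and the leaf order would in practice come from the tree rather than from sorting names.

```python
def symmetrise(directional):
    """Average entry (i, j) with entry (j, i) to obtain a symmetric matrix."""
    n = len(directional)
    return [
        [(directional[i][j] + directional[j][i]) / 2 for j in range(n)]
        for i in range(n)
    ]


def reorder(matrix, labels, new_order):
    """Permute rows and columns so that they follow new_order."""
    idx = [labels.index(name) for name in new_order]
    return [[matrix[i][j] for j in idx] for i in idx]


labels = ["protB", "protA", "protC"]
# Toy directional values: row = query, column = target.
directional = [
    [0.0, 0.25, 0.5],
    [0.75, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]

symmetric = symmetrise(directional)
# Alphabetical ordering of the axes; a tree leaf order could be passed instead.
alphabetical = reorder(symmetric, labels, sorted(labels))
for row in alphabetical:
    print(row)
```

Reordering changes only the presentation: every pairwise value is preserved and the matrix stays symmetric.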

Pairwise Information Panel

This panel provides detailed diagnostics for the currently selected pair.

* It displays the calculated alignment lengths and any warnings generated during the comparison.
* It contains two dot plots that visualize the residue-level structural equivalences found by the alignment algorithm in both directions (A→B and B→A).

Bootstrap Noise Visualization

This final panel allows you to assess the effect of the bootstrap perturbation on the embeddings for any given protein.

* Use the dropdown menu to select a protein from your set.
* The viewer displays a series of images that visualize how the embedding space for that protein's residues was sampled during the bootstrapping process.

Access

Structome-DeepRoots is a web app that will be made publicly available at: https://biosig.lab.uq.edu.au/structome_deeproots

Citation

Structome-DeepRoots --- coming soon

Additional reading:

Structural Phylogenetics with Confidence
Structome-Q
Structome-TM
Structome-AlignViewer

Contact

For any issues with the server or functionality, please contact:

Ashar Malik
Email: ashar.malik@uq.edu.au