Module 3 of 4

Building the Tree

From Geometry to Matrix to Tree.

Here we see the full pipeline of distance-based phylogenetics.

  1. Geometry (Top): The raw physical reality of the structures.
  2. Matrix (Bottom Left): The pairwise distances ($d_{ij}$) derived from geometry.
  3. Tree (Bottom Right): The simplified evolutionary graph.
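
The Geometry to Matrix step can be sketched in a few lines. This is a minimal illustration, assuming plain Euclidean distance between 2D points; the coordinates and the helper name `distance_matrix` are hypothetical and are not taken from the lab itself.

```python
import math

# Hypothetical coordinates for the four draggable points (not from the lab).
points = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (4.0, 3.0), "D": (5.0, 3.0)}

def distance_matrix(points):
    """Return (labels, d) where d[i][j] is the Euclidean distance d_ij."""
    labels = sorted(points)
    d = [[math.dist(points[a], points[b]) for b in labels] for a in labels]
    return labels, d

labels, d = distance_matrix(points)
for name, row in zip(labels, d):
    # One row of the matrix per point, rounded for display.
    print(name, "  ".join(f"{v:5.2f}" for v in row))
```

The matrix is symmetric with zeros on the diagonal, so only the six off-diagonal pairs carry information for four points.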

Interactive Lab

Drag the points (A, B, C, D) in the top plot.

The ◆ black diamond on the tree marks the midpoint root: the point halfway along the longest leaf-to-leaf path in the tree.
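
As a rough sketch of how a midpoint root can be located, the snippet below finds the longest leaf-to-leaf path in a small unrooted tree and walks halfway along it. The tree, its internal node names (`U`, `V`), and its branch lengths are made up for illustration, and the lab's own implementation may differ.

```python
# Hypothetical unrooted tree as an adjacency map: node -> {neighbor: branch length}.
tree = {
    "A": {"U": 1.0},
    "B": {"U": 2.0},
    "C": {"V": 1.0},
    "D": {"V": 4.0},
    "U": {"A": 1.0, "B": 2.0, "V": 3.0},
    "V": {"C": 1.0, "D": 4.0, "U": 3.0},
}

def path_between(tree, start, goal):
    """Return the unique path from start to goal as a list of nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in tree[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")

def midpoint_root(tree):
    """Return (u, v, offset): the root lies on edge u-v, `offset` away from u."""
    leaves = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    best = None
    for i, a in enumerate(leaves):
        for b in leaves[i + 1:]:
            path = path_between(tree, a, b)
            length = sum(tree[u][v] for u, v in zip(path, path[1:]))
            if best is None or length > best[0]:
                best = (length, path)
    length, path = best
    half, walked = length / 2, 0.0
    # Walk along the longest path until the halfway point falls inside an edge.
    for u, v in zip(path, path[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]

print(midpoint_root(tree))
```

In this toy tree the longest path runs from B to D (length 9), so the root lands on the internal edge between `U` and `V`, 2.5 units from `U`.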

Neighbor-Joining (NJ)

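Unlike UPGMA, which assumes a constant molecular clock (every leaf ends up equally far from the root), NJ allows rates of evolution to vary between lineages. NJ itself produces an unrooted tree, which is why a root is added afterwards.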
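
To make the contrast concrete, here is a minimal sketch of the first NJ step using the standard criterion $Q(i,j) = (n-2)\,d_{ij} - \sum_k d_{ik} - \sum_k d_{jk}$. The distance matrix is a hypothetical example in which A and B are sister taxa on branches of unequal length (1 and 2); the function name `nj_first_join` is illustrative only.

```python
# Hypothetical additive distances; A and B are sisters with branch lengths 1 and 2.
labels = ["A", "B", "C", "D"]
d = [
    [0.0, 3.0, 5.0, 8.0],
    [3.0, 0.0, 6.0, 9.0],
    [5.0, 6.0, 0.0, 5.0],
    [8.0, 9.0, 5.0, 0.0],
]

def nj_first_join(labels, d):
    """Return (i, j, length_i, length_j) for the pair with the smallest Q value."""
    n = len(labels)
    r = [sum(row) for row in d]  # row sums: total distance from each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - r[i] - r[j]
            if best is None or q < best[0]:
                best = (q, i, j)  # ties keep the first pair encountered
    _, i, j = best
    # Branch lengths from the joined taxa to their new internal node.
    length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    length_j = d[i][j] - length_i
    return labels[i], labels[j], length_i, length_j

print(nj_first_join(labels, d))
```

NJ assigns A and B the unequal branch lengths 1.0 and 2.0, whereas a clock-based method such as UPGMA would split their distance of 3 evenly into 1.5 and 1.5.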

Common Sources of Error (and how to create them by dragging points)

Distance-based trees are only as good as the distance matrix. Below are a few classic failure modes you can intentionally create in “Shape Space” to see how and why trees go wrong.

  • Weak signal / no clear clusters
    Setup: Arrange A, B, C, D roughly as the corners of a square (all pairwise distances similar). What you’ll see: the distance matrix loses block structure, and small moves can flip the NJ joins.
  • Long-branch attraction (a fast-evolving outlier)
    Setup: Drag A far away from the other three points (A becomes an extreme outlier). What you’ll see: A's branch stretches, the internal joins can become unstable, and the midpoint root shifts toward A. In real datasets, a rapidly evolving taxon can be incorrectly “pulled” toward the wrong group, typically toward another long branch such as a distant outgroup.
  • Star-like topology (insufficient resolution)
    Setup: Place A, B, C, D all at similar distance from each other (near-equal off-diagonal values). What you’ll see: the tree becomes effectively unresolved; multiple topologies fit nearly equally well.
  • Near-ties in the NJ criterion (multiple “best” joins)
    Setup: Make A close to B and also close to C (a “triangle” where AB ≈ AC), while D is moderately far. What you’ll see: tiny drags cause a different first join (A–B vs A–C), even though the matrix looks similar.
  • Hidden structure vs. projected geometry (model mismatch)
    Setup: Create two tight clusters (A,B) and (C,D), then slowly slide the clusters closer until they overlap. What you’ll see: the matrix still contains subtle differences, but the tree may “snap” between plausible explanations. In real structural phylogenetics, different distance definitions (TM-score vs RMSD vs Q-score) can change what the algorithm considers “close”.
  • Noise sensitivity (small measurement error)
    Setup: Put A and B extremely close, and C and D extremely close, then drag one point by a tiny amount. What you’ll see: the heatmap changes only slightly, but the inferred topology can still flip if there is a near-tie.
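
The near-tie behaviour described in the list above can be reproduced numerically. The sketch below assumes Euclidean distances and uses made-up coordinates with AB ≈ AC; `partner_of_A` is a hypothetical helper that reports which point the NJ criterion pairs with A in the first join.

```python
import math

def partner_of_A(points):
    """Return the label that the NJ Q criterion pairs with A in the first join."""
    labels = sorted(points)
    n = len(labels)
    d = {a: {b: math.dist(points[a], points[b]) for b in labels} for a in labels}
    r = {a: sum(d[a].values()) for a in labels}
    # With four taxa, Q(A, X) equals Q of the complementary pair, so scanning
    # only the pairs that contain A is enough to identify the chosen split.
    q = {x: (n - 2) * d["A"][x] - r["A"] - r[x] for x in labels if x != "A"}
    return min(q, key=q.get)

# Hypothetical layout: B and C sit on either side of A, D is moderately far away.
base = {"B": (1.0, 0.0), "C": (-1.0, 0.0), "D": (0.0, 3.0)}

nudged_right = partner_of_A({**base, "A": (0.01, 0.0)})
nudged_left = partner_of_A({**base, "A": (-0.01, 0.0)})
print(nudged_right, nudged_left)
```

Moving A by only 0.02 units changes the first join from A–B to A–C, even though every entry in the distance matrix changes by at most about 0.01.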

[Interactive figure panels: 1. Shape Space (Drag Points); 2. Distance Matrix; 3. NJ Tree (Midpoint Rooted)]