Skip to content

Distance Geometry

Distance Geometry (DG) is the mathematical framework for determining point positions from interpoint distances. In molecular conformer generation, we know approximate distance ranges between all atom pairs from the molecular topology, and we need to find 3D coordinates that satisfy these constraints.

The Core Idea

Given N atoms, we want to find coordinates x1,,xNR3 such that:

lijxixjuiji<j

where lij and uij are the lower and upper distance bounds.

Distance Bounds Sources

The bounds matrix is populated from several sources, each corresponding to the topological distance between atoms:

Topological DistanceSourcePrecision
1 (bonded)UFF bond lengths±0.01 Å
2 (1-3 path)Law of cosines from bond angles±0.04 Å
3 (1-4 path)Torsion angle cis/trans extremescomputed
4 (1-5 path)Chained 1-4 distances±0.08 Å
≥5 (non-bonded)Van der Waals radii0.7–1.0×

Details on each: Bounds Matrix

The Triangle Inequality

For any three points in metric space:

dijdik+dkj

Applied to bounds, this means for all triples (i,j,k):

uijmin(uij,uik+ukj)lijmax(lij,likukj,lkjuik)

These updates are applied iteratively via Floyd-Warshall until convergence. This tightens the bounds and ensures feasibility — if any lij>uij after smoothing, the bounds are inconsistent (the atom arrangement is geometrically impossible).

Distance Picking

Given smoothed bounds [lij,uij], we pick a random distance for each pair:

dij=lij+rij(uijlij)

where rijUniform(0,1) from the MinstdRand LCG:

sn+1=48271snmod(2311)

This is the same RNG as RDKit's boost::minstd_rand, ensuring reproducible, bit-identical outputs for the same seed.

From Distances to Coordinates

The Metric Matrix (Cayley-Menger Transform)

Given a distance matrix D where Dij=dij2, we construct the metric matrix T:

Tij=12(D0i+D0jDij)

where:

D0i=1Nk=1NDik1N2k<lDkl

Intuitively, Tij=xixj when coordinates are centered at the centroid. The metric matrix is the Gram matrix of the coordinate vectors.

Eigendecomposition

If the distances correspond to an exact Euclidean embedding in d dimensions, then T has exactly d positive eigenvalues and the rest are zero.

T=VΛVT

The coordinates are recovered as:

xik=λkvik

where λk are the top d eigenvalues and vik are the corresponding eigenvector components.

Why 4D?

When the molecule has chiral centers (@/@@ in SMILES), we embed in 4 dimensions instead of 3. This gives the optimizer additional freedom to satisfy chiral volume constraints. After bounds minimization, the 4th dimension is collapsed:

  1. Phase 1 bounds FF: wchiral=1.0, w4D=0.1 — establish chirality
  2. Phase 2 bounds FF: wchiral=0.2, w4D=1.0 — collapse 4th dim
  3. Take first 3 columns of the coordinate matrix

Power Iteration

Instead of a full eigendecomposition (O(N3)), sci-form uses power iteration to extract only the top d eigenpairs:

  1. Start with a random vector v(0)
  2. Iterate: v(k+1)=Tv(k)Tv(k)
  3. Eigenvalue: λ=vTTv
  4. Deflate: TTλvvT
  5. Repeat for next eigenpair

This is more efficient for large molecules since we only need 3–4 eigenpairs, not all N.

Rejection Criteria

Not every random distance sample yields a valid embedding. Rejections happen when:

ConditionMeaning
D0i<103 for any iDegenerate metric matrix
λk0 for kdDistances not embeddable in dD
Many consecutive failuresSwitch to random-box fallback

After N/4 consecutive embedding failures, the algorithm falls back to randomly placing atoms in a [5,5]3 box and relying entirely on the force field minimization.

Released under the MIT License.