Distance Geometry

Distance Geometry (DG) is the mathematical framework for determining point positions from interpoint distances. In molecular conformer generation, we know approximate distance ranges between all atom pairs from the molecular topology, and we need to find 3D coordinates that satisfy these constraints.

The Core Idea

Given $N$ atoms, we want to find coordinates $x_{1}, \dots, x_{N} \in \mathbb{R}^{3}$ such that:

$$l_{ij} \leq \lVert x_{i} - x_{j} \rVert \leq u_{ij} \quad \forall\, i < j$$

where $l_{i j}$ and $u_{i j}$ are the lower and upper distance bounds.
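As a concrete reading of this condition, the sketch below checks a candidate set of coordinates against a pair of bounds matrices. The function name `violations`, the tolerance parameter, and the toy numbers are illustrative assumptions, not part of sci-form.

```python
import math

def violations(coords, lower, upper, tol=1e-9):
    """Return (i, j, distance) for every pair i < j outside [lower, upper]."""
    bad = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            if d < lower[i][j] - tol or d > upper[i][j] + tol:
                bad.append((i, j, d))
    return bad

# Toy example: three collinear points spaced 1.5 apart.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
lower = [[0.0, 1.4, 2.0], [1.4, 0.0, 1.4], [2.0, 1.4, 0.0]]
upper = [[0.0, 1.6, 2.8], [1.6, 0.0, 1.6], [2.8, 1.6, 0.0]]

# The 0-2 pair is 3.0 apart, which exceeds its upper bound of 2.8.
result = violations(coords, lower, upper)
```

An empty list means the coordinates satisfy every bound; here only the pair (0, 2) is reported.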

Distance Bounds Sources

The bounds matrix is populated from several sources. Which source applies to a pair depends on the topological distance (the number of bonds on the shortest path) between the two atoms:

| Topological distance | Source | Precision or scaling |
| --- | --- | --- |
| 1 (bonded) | UFF bond lengths | ±0.01 Å |
| 2 (1-3 path) | Law of cosines from bond angles | ±0.04 Å |
| 3 (1-4 path) | Torsion angle cis/trans extremes | computed |
| 4 (1-5 path) | Chained 1-4 distances | ±0.08 Å |
| ≥5 (non-bonded) | Van der Waals radii | 0.7–1.0× |

For details on each source, see Bounds Matrix.
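To make the 1-3 and 1-4 rows concrete, here is a minimal sketch of the underlying geometry. The bond length (1.54 Å) and bond angle (109.5°) are illustrative values chosen for this example rather than UFF lookups, and applying the ±0.04 Å tolerance symmetrically around the ideal 1-3 distance is an assumption.

```python
import math

def distance_13(d12, d23, angle_deg):
    """1-3 distance from two bond lengths and the enclosed bond angle."""
    theta = math.radians(angle_deg)
    return math.sqrt(d12 * d12 + d23 * d23 - 2.0 * d12 * d23 * math.cos(theta))

def distance_14(d12, d23, d34, angle_123_deg, angle_234_deg, torsion_deg):
    """1-4 distance for a given torsion angle (0 = cis, 180 = trans)."""
    a1 = math.radians(angle_123_deg)
    a2 = math.radians(angle_234_deg)
    phi = math.radians(torsion_deg)
    # Atom 2 sits at the origin, atom 3 on the +x axis, atom 1 in the xy-plane.
    p1 = (d12 * math.cos(a1), d12 * math.sin(a1), 0.0)
    p4 = (d23 - d34 * math.cos(a2),
          d34 * math.sin(a2) * math.cos(phi),
          d34 * math.sin(a2) * math.sin(phi))
    return math.dist(p1, p4)

# Illustrative sp3 carbon chain (assumed values, not UFF parameters).
d13 = distance_13(1.54, 1.54, 109.5)
bounds_13 = (d13 - 0.04, d13 + 0.04)  # assumed symmetric tolerance

# The cis and trans torsions give the extreme 1-4 distances.
cis = distance_14(1.54, 1.54, 1.54, 109.5, 109.5, 0.0)
trans = distance_14(1.54, 1.54, 1.54, 109.5, 109.5, 180.0)
bounds_14 = (cis, trans)
```

For these inputs the 1-3 distance comes out near 2.5 Å and the 1-4 range runs from roughly 2.6 Å (cis) to 3.9 Å (trans).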

The Triangle Inequality

For any three points $i$, $j$, $k$ in a metric space:

$$d_{ij} \leq d_{ik} + d_{kj}$$

Applied to bounds, this means for all triples $(i, j, k)$:

$$
\begin{aligned}
u_{ij} & \leftarrow \min (u_{ij},\; u_{ik} + u_{kj}) \\
l_{ij} & \leftarrow \max (l_{ij},\; l_{ik} - u_{kj},\; l_{kj} - u_{ik})
\end{aligned}
$$

These updates are applied iteratively, Floyd-Warshall style, until no bound changes. Smoothing tightens the bounds and exposes infeasibility: if any $l_{ij} > u_{ij}$ after smoothing, the bounds are inconsistent (no arrangement of atoms can satisfy them).
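The update rules can be sketched as a triple loop with the intermediate atom $k$ outermost, repeated until nothing changes. This is an illustrative version operating on symmetric nested lists; the function name `smooth_bounds` and the `eps` guard against floating-point churn are assumptions, and sci-form's actual implementation may be organized differently.

```python
def smooth_bounds(lower, upper, eps=1e-12):
    """Tighten symmetric bounds matrices in place; return False if infeasible."""
    n = len(upper)
    changed = True
    while changed:
        changed = False
        for k in range(n):
            for i in range(n):
                for j in range(i + 1, n):
                    if k == i or k == j:
                        continue
                    # Upper bound: the path i -> k -> j cannot be beaten.
                    u = upper[i][k] + upper[k][j]
                    if u < upper[i][j] - eps:
                        upper[i][j] = upper[j][i] = u
                        changed = True
                    # Lower bound: inverse triangle inequality.
                    l = max(lower[i][k] - upper[k][j], lower[k][j] - upper[i][k])
                    if l > lower[i][j] + eps:
                        lower[i][j] = lower[j][i] = l
                        changed = True
                    if lower[i][j] > upper[i][j]:
                        return False  # inconsistent bounds
    return True

# Pair 0-1 is tightly known, pair 1-2 is tightly known, pair 0-2 is loose.
lower = [[0.0, 3.0, 0.5], [3.0, 0.0, 1.0], [0.5, 1.0, 0.0]]
upper = [[0.0, 3.2, 100.0], [3.2, 0.0, 1.2], [100.0, 1.2, 0.0]]
ok = smooth_bounds(lower, upper)
```

After smoothing, the loose 0-2 interval shrinks from $[0.5, 100]$ to about $[1.8, 4.4]$: the upper bound is $3.2 + 1.2$ and the lower bound is $3.0 - 1.2$.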

Distance Picking

Given smoothed bounds $[l_{ij}, u_{ij}]$, we pick a random distance for each pair:

$$d_{ij} = l_{ij} + r_{ij} \cdot (u_{ij} - l_{ij})$$

where $r_{ij} \sim \mathrm{Uniform}(0, 1)$ is drawn from the MinstdRand LCG:

$$s_{n+1} = 48271 \cdot s_{n} \bmod (2^{31} - 1)$$

This is the same generator as boost::minstd_rand, which RDKit uses, so the same seed yields reproducible, bit-identical outputs.
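The recurrence and the picking rule are short enough to sketch directly. The class and function names below are illustrative. Two details are assumptions not stated above: how the integer state is mapped to the unit interval, and the order in which atom pairs consume random numbers. Both would have to match the reference implementation exactly to reproduce its output bit for bit.

```python
class MinstdRand:
    """Lehmer generator: s <- 48271 * s mod (2^31 - 1)."""
    MULTIPLIER = 48271
    MODULUS = 2**31 - 1

    def __init__(self, seed=1):
        # A state of 0 would stay 0 forever, so map it to 1.
        self.state = seed % self.MODULUS or 1

    def next_int(self):
        self.state = (self.MULTIPLIER * self.state) % self.MODULUS
        return self.state

    def next_unit(self):
        # Assumption: dividing by the modulus gives a value in (0, 1).
        return self.next_int() / self.MODULUS


def pick_distances(lower, upper, rng):
    """Sample d_ij = l_ij + r * (u_ij - l_ij) for every pair i < j."""
    n = len(lower)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):  # assumption: row-major order over i < j
            r = rng.next_unit()
            dist[i][j] = dist[j][i] = lower[i][j] + r * (upper[i][j] - lower[i][j])
    return dist


rng = MinstdRand(seed=1)
first_two = [rng.next_int(), rng.next_int()]

lower = [[0.0, 1.4, 1.8], [1.4, 0.0, 1.0], [1.8, 1.0, 0.0]]
upper = [[0.0, 1.6, 4.4], [1.6, 0.0, 1.2], [4.4, 1.2, 0.0]]
dist = pick_distances(lower, upper, MinstdRand(seed=42))
```

With seed 1 the first state is simply the multiplier, 48271, which makes a convenient sanity check against other implementations of the same generator.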

From Distances to Coordinates

The Metric Matrix (Cayley-Menger Transform)

Given the matrix of squared distances $D$, with $D_{ij} = d_{ij}^{2}$, we construct the metric matrix $T$:

$$T_{ij} = \frac{1}{2} (D_{0i} + D_{0j} - D_{ij})$$

where $D_{0i}$ is the squared distance from atom $i$ to the centroid of all atoms:

$$D_{0i} = \frac{1}{N} \sum_{k=1}^{N} D_{ik} - \frac{1}{N^{2}} \sum_{k < l} D_{kl}$$

Intuitively, $T_{i j} = x_{i} \cdot x_{j}$ when coordinates are centered at the centroid. The metric matrix is the Gram matrix of the coordinate vectors.
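A small numerical check makes the Gram-matrix interpretation tangible. The sketch below builds $T$ from a plain distance matrix using the two formulas above; the function name `metric_matrix` is an illustrative choice.

```python
def metric_matrix(dist):
    """Return (T, d0) where d0[i] is the squared distance of atom i to the centroid."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    # (1 / N^2) * sum over pairs k < l of the squared distances
    pair_term = sum(sq[k][l] for k in range(n) for l in range(k + 1, n)) / n**2
    d0 = [sum(sq[i]) / n - pair_term for i in range(n)]
    T = [[0.5 * (d0[i] + d0[j] - sq[i][j]) for j in range(n)] for i in range(n)]
    return T, d0

# Three collinear points at x = -1, 0, +1 (centroid at the origin).
dist = [[0.0, 1.0, 2.0],
        [1.0, 0.0, 1.0],
        [2.0, 1.0, 0.0]]
T, d0 = metric_matrix(dist)
```

For this input `d0` is approximately `[1, 0, 1]` and `T[0][2]` is approximately `-1`, which matches the dot product of the centered coordinates $(-1) \cdot (+1)$.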

Eigendecomposition

If the distances correspond to an exact Euclidean embedding in $d$ dimensions, then $T$ has exactly $d$ positive eigenvalues and the rest are zero.

$$T = V \Lambda V^{T}$$

The coordinates are recovered as:

$$x_{ik} = \sqrt{\lambda_{k}} \cdot v_{ik}$$

where $\lambda_{k}$ are the top $d$ eigenvalues and $v_{ik}$ are the corresponding eigenvector components.
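The recovery formula follows in one step from the Gram-matrix property. Writing $X$ for the $N \times d$ matrix of centered coordinates (a symbol introduced here for illustration), and $V_{d}$, $\Lambda_{d}$ for the eigenvectors and eigenvalues restricted to the top $d$ pairs:

$$
T = X X^{T}, \qquad
T = V_{d} \Lambda_{d} V_{d}^{T} = \left( V_{d} \Lambda_{d}^{1/2} \right) \left( V_{d} \Lambda_{d}^{1/2} \right)^{T}
\quad \Longrightarrow \quad
X = V_{d} \Lambda_{d}^{1/2}
$$

The coordinates are determined only up to a rotation or reflection about the centroid, since any orthogonal $Q$ gives $(XQ)(XQ)^{T} = X X^{T}$.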

Why 4D?

When the molecule has chiral centers (@/@@ in SMILES), we embed in 4 dimensions instead of 3. This gives the optimizer additional freedom to satisfy chiral volume constraints. After bounds minimization, the 4th dimension is collapsed:

Phase 1 bounds FF: $w_{\text{chiral}} = 1.0$, $w_{\text{4D}} = 0.1$ (establish chirality)
Phase 2 bounds FF: $w_{\text{chiral}} = 0.2$, $w_{\text{4D}} = 1.0$ (collapse the 4th dimension)
Take the first 3 columns of the coordinate matrix

Power Iteration

Instead of a full eigendecomposition ($O(N^{3})$), sci-form uses power iteration to extract only the top $d$ eigenpairs:

Start with a random vector $v^{(0)}$
Iterate: $v^{(k+1)} = \frac{T v^{(k)}}{\lVert T v^{(k)} \rVert}$
Eigenvalue: $\lambda = v^{T} T v$
Deflate: $T \leftarrow T - \lambda v v^{T}$
Repeat for the next eigenpair

This is more efficient for large molecules since we only need 3–4 eigenpairs, not all $N$.
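The steps above can be sketched in plain Python as follows. The function names, the fixed iteration count, and the use of Python's `random` module for the start vector are assumptions made for a self-contained example; they are not taken from sci-form. Note that power iteration converges to the eigenvalue of largest magnitude, so this sketch relies on the dominant eigenvalues being the positive ones.

```python
import math
import random

def mat_vec(T, v):
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def top_eigenpairs(T, d, iters=200, seed=0):
    """Return the d dominant (eigenvalue, eigenvector) pairs of symmetric T."""
    n = len(T)
    T = [row[:] for row in T]  # work on a copy; deflation modifies it
    rng = random.Random(seed)
    pairs = []
    for _ in range(d):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = mat_vec(T, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # T v is numerically zero: nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, mat_vec(T, v)))  # lambda = v^T T v
        pairs.append((lam, v))
        for i in range(n):  # deflate: T <- T - lambda v v^T
            for j in range(n):
                T[i][j] -= lam * v[i] * v[j]
    return pairs

def coordinates(pairs):
    """x_ik = sqrt(lambda_k) * v_ik; raises if an eigenvalue is not positive."""
    if any(lam <= 0.0 for lam, _ in pairs):
        raise ValueError("non-positive eigenvalue: distances not embeddable")
    n = len(pairs[0][1])
    return [[math.sqrt(lam) * v[i] for lam, v in pairs] for i in range(n)]

# Gram matrix of a 4 x 2 rectangle centered at the origin.
pts = [(2.0, 1.0), (2.0, -1.0), (-2.0, 1.0), (-2.0, -1.0)]
T = [[p[0] * q[0] + p[1] * q[1] for q in pts] for p in pts]
pairs = top_eigenpairs(T, d=2)
xy = coordinates(pairs)
```

For the rectangle, the two eigenvalues are 16 and 4 (the squared norms of the x and y coordinate columns), and the recovered `xy` reproduces all pairwise distances of `pts`, possibly mirrored or with axes sign-flipped.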

Rejection Criteria

Not every random distance sample yields a valid embedding. Rejections happen when:

| Condition | Meaning |
| --- | --- |
| $D_{0i} < 10^{-3}$ for any $i$ | Degenerate metric matrix |
| $\lambda_{k} \leq 0$ for any $k \leq d$ | Distances not embeddable in $d$ dimensions |
| Many consecutive failures | Switch to random-box fallback |

After $N/4$ consecutive embedding failures, the algorithm falls back to randomly placing atoms in a $[-5, 5]^{3}$ box and relying entirely on the force field minimization.
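The retry-then-fallback control flow might look like the sketch below. Everything here is hypothetical scaffolding: `try_embed` stands in for one full attempt (distance picking, metric matrix, eigenpairs, rejection checks) and returns `None` on rejection, and rounding $N/4$ down with a minimum of one attempt is an assumption.

```python
import random

def random_box_coords(n_atoms, rng, half_width=5.0):
    """Place atoms uniformly at random in the [-5, 5]^3 box."""
    return [[rng.uniform(-half_width, half_width) for _ in range(3)]
            for _ in range(n_atoms)]

def embed_with_fallback(n_atoms, try_embed, rng):
    """Call try_embed until it succeeds or N/4 consecutive attempts fail."""
    max_failures = max(1, n_atoms // 4)  # assumption: integer division, at least 1
    failures = 0
    while failures < max_failures:
        coords = try_embed()  # hypothetical callback: coordinates or None
        if coords is not None:
            return coords, "metric"
        failures += 1
    return random_box_coords(n_atoms, rng), "random-box"

# Demonstration with an embedding attempt that always gets rejected.
attempts = []

def always_rejected():
    attempts.append(1)
    return None

coords, method = embed_with_fallback(12, always_rejected, random.Random(0))
```

With 12 atoms the callback is tried three times before the random-box coordinates are returned, after which the force field minimization would carry the full burden of producing a sensible geometry.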
