Embedding
This page covers the embedding step in detail: converting a smoothed bounds matrix into initial 3D (or 4D) atomic coordinates.
Overview
The embedding process has four stages:
- Distance matrix construction — pick random distances from bounds
- Metric matrix computation — Cayley-Menger transform
- Eigendecomposition — power iteration for top eigenpairs
- Coordinate extraction — scale eigenvectors by eigenvalue square roots
Step 1: Distance Picking
Given the smoothed bounds
where
This is the same RNG as RDKit's boost::minstd_rand, ensuring bit-identical outputs for the same seed. The choice of LCG over Mersenne Twister for distance picking is deliberate — it's simple, fast, and matches the RDKit reference implementation.
INFO
The distances are sampled for all
Step 2: Metric Matrix (Cayley-Menger Transform)
The metric matrix
where:
Geometric Interpretation
If we place the centroid of all atoms at the origin, then:
This is the Gram matrix. Its eigenvalues correspond to the variance of coordinates along each principal axis, and its eigenvectors give the principal directions.
Positive Definite Check
For exact Euclidean distances in
- Exactly
positive eigenvalues - All remaining eigenvalues are zero
Since our distances are randomly sampled (not exact),
Rejection criterion: If
Step 3: Power Iteration
Instead of computing the full eigendecomposition, we use power iteration with deflation to extract only the needed eigenpairs:
Algorithm
For each eigenpair
Power Iteration Pseudocode
function power_iteration(T, max_iter=200, tol=1e-6):
v = random_unit_vector(N) // seeded from Mt19937
for iter in 1..max_iter:
w = T × v // matrix-vector product
λ = v · w // Rayleigh quotient
v_new = w / ||w|| // normalize
if ||v_new - v|| < tol:
break
v = v_new
return (λ, v)After extracting eigenpair
This removes the contribution of the found eigenvector, so the next power iteration converges to the next-largest eigenvalue.
RNG for Initial Vectors
The initial random vectors for power iteration use Mt19937 (Mersenne Twister), seeded from the same base seed. This differs from the MinstdRand used for distance picking — Mt19937 provides better uniformity for high-dimensional random vectors.
Step 4: Coordinate Extraction
The coordinates are recovered as:
where:
indexes atoms ( ) indexes spatial dimensions ( ) is the -th eigenvalue is the -th component of the -th eigenvector
Validation
After extraction, we check:
- All eigenvalues positive:
for - No NaN/Inf coordinates
- Reasonable scale: coordinates should not be excessively large
If any check fails, we reject this embedding and retry with the next RNG state.
Complete Embedding Example
For a 4-atom molecule with bounds:
Random Box Fallback
After
This always succeeds but produces a poor initial geometry. The bounds force field minimization then moves atoms to satisfy distance constraints from scratch. While slower, this ensures the pipeline never gets stuck on difficult molecules.