Skip to content

Embedding

This page covers the embedding step in detail: converting a smoothed bounds matrix into initial 3D (or 4D) atomic coordinates.

Overview

The embedding process has four stages:

  1. Distance matrix construction — pick random distances from bounds
  2. Metric matrix computation — Cayley-Menger transform
  3. Eigendecomposition — power iteration for top eigenpairs
  4. Coordinate extraction — scale eigenvectors by eigenvalue square roots

Step 1: Distance Picking

Given the smoothed bounds [lij,uij], we construct a complete distance matrix by sampling:

dij=lij+r(uijlij)

where rUniform(0,1) from the MinstdRand linear congruential generator:

sn+1=48271snmod(2311)r=sn+12311

This is the same RNG as RDKit's boost::minstd_rand, ensuring bit-identical outputs for the same seed. The choice of LCG over Mersenne Twister for distance picking is deliberate — it's simple, fast, and matches the RDKit reference implementation.

INFO

The distances are sampled for all (N2) pairs, in the order d01,d02,,d0,N1,d12,d13,, matching RDKit's iteration order.

Step 2: Metric Matrix (Cayley-Menger Transform)

The metric matrix T converts squared interpoint distances into inner products:

Tij=12(D0i+D0jdij2)

where:

D0i=1Nk=1Ndik2d2¯d2¯=1N2k<ldkl2

Geometric Interpretation

If we place the centroid of all atoms at the origin, then:

Tij=xixj

This is the Gram matrix. Its eigenvalues correspond to the variance of coordinates along each principal axis, and its eigenvectors give the principal directions.

Positive Definite Check

For exact Euclidean distances in d dimensions, the metric matrix has:

  • Exactly d positive eigenvalues
  • All remaining eigenvalues are zero

Since our distances are randomly sampled (not exact), T may have small negative eigenvalues. We take the top 3 (or 4 for chiral molecules) positive eigenvalues and ignore the rest.

Rejection criterion: If D0i<103 for any atom i, the metric matrix is degenerate and we reject this sample.

Step 3: Power Iteration

Instead of computing the full eigendecomposition, we use power iteration with deflation to extract only the needed eigenpairs:

Algorithm

For each eigenpair k=1,,d:

Power Iteration Pseudocode
function power_iteration(T, max_iter=200, tol=1e-6):
    v = random_unit_vector(N)  // seeded from Mt19937
    
    for iter in 1..max_iter:
        w = T × v                    // matrix-vector product
        λ = v · w                     // Rayleigh quotient
        v_new = w / ||w||             // normalize
        if ||v_new - v|| < tol:
            break
        v = v_new
    
    return (λ, v)

After extracting eigenpair (λk,vk), we deflate the matrix:

TTλkvkvkT

This removes the contribution of the found eigenvector, so the next power iteration converges to the next-largest eigenvalue.

RNG for Initial Vectors

The initial random vectors for power iteration use Mt19937 (Mersenne Twister), seeded from the same base seed. This differs from the MinstdRand used for distance picking — Mt19937 provides better uniformity for high-dimensional random vectors.

Step 4: Coordinate Extraction

The coordinates are recovered as:

xik=λkvik

where:

  • i indexes atoms (1iN)
  • k indexes spatial dimensions (1kd)
  • λk is the k-th eigenvalue
  • vik is the i-th component of the k-th eigenvector

Validation

After extraction, we check:

  1. All eigenvalues positive: λk>103 for k=1,,d
  2. No NaN/Inf coordinates
  3. Reasonable scale: coordinates should not be excessively large

If any check fails, we reject this embedding and retry with the next RNG state.

Complete Embedding Example

For a 4-atom molecule with bounds:

Random Box Fallback

After N/4 consecutive embedding failures, the algorithm switches to random box placement:

xidUniform(5,5)for d{1,2,3}

This always succeeds but produces a poor initial geometry. The bounds force field minimization then moves atoms to satisfy distance constraints from scratch. While slower, this ensures the pipeline never gets stuck on difficult molecules.

Released under the MIT License.