Embedding

This page covers the embedding step in detail: converting a smoothed bounds matrix into initial 3D (or 4D) atomic coordinates.

Overview

The embedding process has four stages:

1. Distance matrix construction: pick random distances from bounds
2. Metric matrix computation: Cayley-Menger transform
3. Eigendecomposition: power iteration for top eigenpairs
4. Coordinate extraction: scale eigenvectors by eigenvalue square roots

Step 1: Distance Picking

Given the smoothed bounds $[l_{i j}, u_{i j}]$, we construct a complete distance matrix by sampling:

d_{i j} = l_{i j} + r \cdot (u_{i j} - l_{i j})

where $r \sim \mathrm{Uniform}(0, 1)$ is drawn from the MinstdRand linear congruential generator:

s_{n + 1} = 48271 \cdot s_{n} \bmod (2^{31} - 1)

r = \frac{s_{n + 1}}{2^{31} - 1}

This is the same RNG as RDKit's boost::minstd_rand, ensuring bit-identical outputs for the same seed. The choice of LCG over Mersenne Twister for distance picking is deliberate — it's simple, fast, and matches the RDKit reference implementation.

INFO

The distances are sampled for all $\binom{N}{2}$ pairs, in the order $d_{01}, d_{02}, \dots, d_{0, N - 1}, d_{12}, d_{13}, \dots$, matching RDKit's iteration order.
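A minimal Python sketch of the sampling step, assuming the recurrence and pair order given above. The class and function names (`MinstdRand`, `next_uniform`, `pick_distances`) are illustrative, and the handling of a zero seed is an assumption rather than documented behavior.

```python
MODULUS = 2**31 - 1   # 2147483647
MULTIPLIER = 48271


class MinstdRand:
    """Minimal LCG following the recurrence above (a sketch, not RDKit's code)."""

    def __init__(self, seed: int) -> None:
        # Assumption: a seed that reduces to 0 is replaced by 1,
        # because 0 is a fixed point of the recurrence.
        self.state = seed % MODULUS or 1

    def next_uniform(self) -> float:
        """Advance the state once and return r = s / (2^31 - 1)."""
        self.state = (MULTIPLIER * self.state) % MODULUS
        return self.state / MODULUS


def pick_distances(lower, upper, rng):
    """Sample d_ij for all i < j in row-major order; mirror into a full matrix."""
    n = len(lower)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = rng.next_uniform()
            dist[i][j] = dist[j][i] = lower[i][j] + r * (upper[i][j] - lower[i][j])
    return dist
```

Because one random number is consumed per pair, changing the loop order changes every sampled distance, which is why the iteration order matters for reproducibility.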

Step 2: Metric Matrix (Cayley-Menger Transform)

The metric matrix $T$ converts squared interpoint distances into inner products:

T_{i j} = \frac{1}{2} (D_{0 i} + D_{0 j} - d_{i j}^{2})

where:

D_{0 i} = \frac{1}{N} \sum_{k = 1}^{N} d_{i k}^{2} - \bar{d^{2}}

\bar{d^{2}} = \frac{1}{N^{2}} \sum_{k < l} d_{k l}^{2}
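The three formulas above translate directly into code. The following is a sketch only; the function name `metric_matrix` and the list-of-lists layout are assumptions made for illustration.

```python
def metric_matrix(dist):
    """Return (T, d0), where d0[i] is D_0i, the squared distance of atom i to the centroid."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    # Mean term: (1 / N^2) * sum of squared distances over pairs k < l.
    mean_sq = sum(sq[k][l] for k in range(n) for l in range(k + 1, n)) / n**2
    # D_0i = (1 / N) * sum_k d_ik^2 - mean term.
    d0 = [sum(sq[i]) / n - mean_sq for i in range(n)]
    # T_ij = (D_0i + D_0j - d_ij^2) / 2.
    t = [[0.5 * (d0[i] + d0[j] - sq[i][j]) for j in range(n)] for i in range(n)]
    return t, d0
```

As a quick check, three collinear atoms at positions 0, 1, and 2 have their centroid at 1, so the function should return squared centroid distances of 1, 0, and 1 (up to floating-point rounding).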

Geometric Interpretation

If we place the centroid of all atoms at the origin, then:

T_{i j} = x_{i} \cdot x_{j}

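This is the Gram matrix of the centered coordinates. Each eigenvalue equals the sum of squared atomic coordinates along the corresponding principal axis (that is, $N$ times the variance along that axis), and each unit eigenvector gives the atoms' normalized coordinates along that axis.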
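The identity $T_{i j} = x_{i} \cdot x_{j}$ follows in a few lines from the definitions in Step 2, assuming the coordinates are measured from the centroid so that $\sum_{k} x_{k} = 0$:

```latex
d_{ij}^{2} = \lVert x_i \rVert^{2} + \lVert x_j \rVert^{2} - 2\, x_i \cdot x_j

% Averaging over k: the cross term vanishes because \sum_k x_k = 0.
\frac{1}{N} \sum_{k=1}^{N} d_{ik}^{2}
  = \lVert x_i \rVert^{2} + \frac{1}{N} \sum_{k=1}^{N} \lVert x_k \rVert^{2}

% Summing over all pairs k < l with the same cancellation.
\bar{d^{2}} = \frac{1}{N^{2}} \sum_{k<l} d_{kl}^{2}
  = \frac{1}{N} \sum_{k=1}^{N} \lVert x_k \rVert^{2}

% Subtracting the two results:
D_{0i} = \lVert x_i \rVert^{2}

T_{ij} = \frac{1}{2} \left( \lVert x_i \rVert^{2} + \lVert x_j \rVert^{2} - d_{ij}^{2} \right)
  = x_i \cdot x_j
```

In other words, $D_{0 i}$ is the squared distance from atom $i$ to the centroid.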

Positive Definite Check

For exact Euclidean distances in $d$ dimensions, the metric matrix has:

Exactly $d$ positive eigenvalues
All remaining eigenvalues are zero

Since our distances are randomly sampled (not exact), $T$ may have small negative eigenvalues. We take the top 3 (or 4 for chiral molecules) positive eigenvalues and ignore the rest.

Rejection criterion: If $D_{0 i} < 10^{- 3}$ for any atom $i$, the metric matrix is degenerate and we reject this sample.
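The rejection test is a single pass over the $D_{0 i}$ values. A sketch, with the hypothetical names `EIGVAL_TOL` and `is_degenerate` chosen for this example:

```python
EIGVAL_TOL = 1e-3  # threshold quoted in the rejection criterion above


def is_degenerate(d0, tol=EIGVAL_TOL):
    """Return True if any squared centroid distance D_0i falls below the tolerance."""
    return any(value < tol for value in d0)
```

For instance, three collinear atoms at positions 0, 1, and 2 give $D_{0 i}$ values of 1, 0, and 1; the middle atom sits on the centroid, so this sample would be rejected under the criterion as stated.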

Step 3: Power Iteration

Instead of computing the full eigendecomposition, we use power iteration with deflation to extract only the needed eigenpairs:

Algorithm

For each eigenpair $k = 1, \dots, d$:

Power Iteration Pseudocode

function power_iteration(T, max_iter=200, tol=1e-6):
    v = random_unit_vector(N)        // seeded from Mt19937

    for iter in 1..max_iter:
        w = T × v                    // matrix-vector product
        v_new = w / ||w||            // normalize
        converged = ||v_new - v|| < tol
        v = v_new                    // always keep the latest iterate
        if converged:
            break

    λ = v · (T × v)                  // Rayleigh quotient of the returned vector
    return (λ, v)

After extracting eigenpair $(λ_{k}, v_{k})$, we deflate the matrix:

T \leftarrow T - λ_{k} v_{k} v_{k}^{T}

This removes the contribution of the found eigenvector, so the next power iteration converges to the next-largest eigenvalue.
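The loop and the deflation step can be sketched in runnable Python as follows. The helper names (`mat_vec`, `top_eigenpairs`) and the Gaussian starting vector are assumptions made for this example; the reference implementation may initialize and store things differently.

```python
import math
import random


def mat_vec(t, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in t]


def power_iteration(t, rng, max_iter=200, tol=1e-6):
    """Approximate the dominant eigenpair of the symmetric matrix t."""
    v = [rng.gauss(0.0, 1.0) for _ in range(len(t))]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(max_iter):
        w = mat_vec(t, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space; nothing left to extract
        v_new = [x / norm for x in w]
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(v_new, v)))
        v = v_new
        if delta < tol:
            break
    lam = sum(a * b for a, b in zip(v, mat_vec(t, v)))  # Rayleigh quotient
    return lam, v


def top_eigenpairs(t, k, seed=0):
    """Extract k eigenpairs in turn, deflating a working copy after each one."""
    rng = random.Random(seed)
    work = [row[:] for row in t]
    pairs = []
    for _ in range(k):
        lam, v = power_iteration(work, rng)
        pairs.append((lam, v))
        for i, vi in enumerate(v):
            for j, vj in enumerate(v):
                work[i][j] -= lam * vi * vj  # T <- T - lam * v * v^T
    return pairs
```

For the matrix `[[2, 1], [1, 2]]`, `top_eigenpairs(t, 2)` should return eigenvalues close to 3 and 1. Note that the convergence test compares successive iterates directly, so it relies on the dominant eigenvalue being positive; with a negative dominant eigenvalue the iterate flips sign on every step and the loop simply runs to `max_iter`.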

RNG for Initial Vectors

The initial random vectors for power iteration use Mt19937 (Mersenne Twister), seeded from the same base seed. This differs from the MinstdRand used for distance picking — Mt19937 provides better uniformity for high-dimensional random vectors.
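As an illustration of the starting-vector step, the sketch below uses Python's `random.Random`, which is built on the Mersenne Twister (MT19937). How the reference implementation turns generator output into vector components is not specified here, so the Uniform(-1, 1) draw and the name `random_unit_vector` are assumptions, and the resulting numbers should not be expected to match another implementation's stream.

```python
import math
import random


def random_unit_vector(n, rng):
    """Draw n components from Uniform(-1, 1) and normalize to unit length."""
    while True:
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 0.0:  # redraw in the (practically impossible) all-zero case
            return [x / norm for x in v]


# The same base seed reproduces the same starting vector.
base_seed = 42
v_first = random_unit_vector(5, random.Random(base_seed))
v_again = random_unit_vector(5, random.Random(base_seed))
```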

Step 4: Coordinate Extraction

The coordinates are recovered as:

x_{i k} = \sqrt{λ_{k}} \cdot v_{i k}

where:

$i$ indexes atoms ($1 \leq i \leq N$)
$k$ indexes spatial dimensions ($1 \leq k \leq d$)
$λ_{k}$ is the $k$-th eigenvalue
$v_{i k}$ is the $i$-th component of the $k$-th eigenvector
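In code, the extraction is a single scaling of each eigenvector. This sketch assumes the eigenpairs arrive as a list of `(eigenvalue, eigenvector)` tuples; the function name is illustrative.

```python
import math


def extract_coordinates(eigenpairs):
    """Return N rows of d coordinates, with x[i][k] = sqrt(lambda_k) * v_k[i]."""
    n = len(eigenpairs[0][1])
    return [[math.sqrt(lam) * vec[i] for lam, vec in eigenpairs] for i in range(n)]
```

`math.sqrt` raises `ValueError` for a negative eigenvalue, so in this sketch the positivity check has to run before extraction.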

Validation

After extraction, we check:

All eigenvalues positive: $λ_{k} > 10^{- 3}$ for $k = 1, \dots, d$
No NaN/Inf coordinates
Reasonable scale: coordinates should not be excessively large

If any check fails, we reject this embedding and retry with the next RNG state.
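The three checks can be sketched as one function. The eigenvalue tolerance comes from the text above; the scale limit `max_abs` is a hypothetical placeholder, since no concrete bound for "excessively large" is given here.

```python
import math


def validate_embedding(eigenvalues, coords, eig_tol=1e-3, max_abs=1e3):
    """Return True if the embedding passes all three checks."""
    # Check 1: every retained eigenvalue exceeds the tolerance.
    if any(lam <= eig_tol for lam in eigenvalues):
        return False
    for row in coords:
        for x in row:
            # Check 2: no NaN or Inf. Check 3: magnitude within the assumed limit.
            if not math.isfinite(x) or abs(x) > max_abs:
                return False
    return True
```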

Complete Embedding Example

For a 4-atom molecule with bounds:
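As an illustration, the sketch below runs all four steps on hypothetical bounds of 1.4 to 1.6 for every atom pair. These bounds, the function name `embed`, the seed, and the use of Python's built-in generator for the starting vectors are assumptions made for this example; they are not values or APIs from the reference implementation.

```python
import math
import random

MODULUS = 2**31 - 1


def embed(lower, upper, seed, dim=3, tol=1e-3):
    """Return (coords, dist) on success, or None if a rejection check fails."""
    n = len(lower)
    state = seed % MODULUS or 1  # assumption: avoid the zero fixed point

    # Step 1: pick distances with the MinstdRand recurrence, pairs in i < j order.
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            state = (48271 * state) % MODULUS
            r = state / MODULUS
            dist[i][j] = dist[j][i] = lower[i][j] + r * (upper[i][j] - lower[i][j])

    # Step 2: metric matrix.
    sq = [[d * d for d in row] for row in dist]
    mean_sq = sum(sq[k][l] for k in range(n) for l in range(k + 1, n)) / n**2
    d0 = [sum(sq[i]) / n - mean_sq for i in range(n)]
    if any(x < tol for x in d0):
        return None
    t = [[0.5 * (d0[i] + d0[j] - sq[i][j]) for j in range(n)] for i in range(n)]

    # Steps 3 and 4: power iteration with deflation, then scale by sqrt(lambda).
    rng = random.Random(seed)
    coords = [[0.0] * dim for _ in range(n)]
    for k in range(dim):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(200):
            w = [sum(a * b for a, b in zip(row, v)) for row in t]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                return None
            new_v = [x / norm for x in w]
            done = math.dist(new_v, v) < 1e-6
            v = new_v
            if done:
                break
        lam = sum(v[i] * t[i][j] * v[j] for i in range(n) for j in range(n))
        if lam <= tol:
            return None
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
            for j in range(n):
                t[i][j] -= lam * v[i] * v[j]  # deflate
    return coords, dist


lower = [[0.0 if i == j else 1.4 for j in range(4)] for i in range(4)]
upper = [[0.0 if i == j else 1.6 for j in range(4)] for i in range(4)]
result = embed(lower, upper, seed=42)
```

With four atoms the centered metric matrix has rank at most three, so a successful 3D embedding should reproduce the sampled distances up to the power-iteration tolerance. Larger molecules generally do not embed exactly, which is why a later minimization step is needed.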

Random Box Fallback

After $N / 4$ consecutive embedding failures, the algorithm switches to random box placement:

x_{i k} \sim \mathrm{Uniform}(-5, 5) \quad \text{for } k \in \{1, 2, 3\}

This always succeeds but produces a poor initial geometry. The bounds force field minimization then moves atoms to satisfy distance constraints from scratch. While slower, this ensures the pipeline never gets stuck on difficult molecules.
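A sketch of the fallback logic, under stated assumptions: `try_embed` is a hypothetical callable that returns coordinates or `None`, the failure limit is taken as the unrounded value $N / 4$, and Python's `random.Random` stands in for whichever generator the pipeline uses for box placement.

```python
import random


def random_box_coordinates(n_atoms, rng, half_width=5.0, dim=3):
    """Place every atom uniformly at random inside a cube centered on the origin."""
    return [
        [rng.uniform(-half_width, half_width) for _ in range(dim)]
        for _ in range(n_atoms)
    ]


def embed_with_fallback(n_atoms, try_embed, rng):
    """Retry try_embed until N / 4 consecutive failures, then use the random box."""
    max_failures = n_atoms / 4  # assumption: no rounding applied to the threshold
    failures = 0
    while failures < max_failures:
        coords = try_embed()
        if coords is not None:
            return coords
        failures += 1
    return random_box_coordinates(n_atoms, rng)
```

For an 8-atom molecule this sketch makes two embedding attempts before falling back to the box.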
