Benchmarks

Comprehensive comparison of sci-form against RDKit, the gold standard for 3D molecular conformer generation.

Methodology

All comparisons use heavy-atom pairwise-distance RMSD — the root-mean-square deviation of all pairwise distances between non-hydrogen atoms. This metric is alignment-free (no superposition needed) and focuses on the chemically important scaffold geometry.

$$\mathrm{RMSD}_{\text{pairwise}} = \sqrt{\frac{1}{|P|} \sum_{(i,j) \in P} \left( d_{ij}^{\text{sci-form}} - d_{ij}^{\text{RDKit}} \right)^2}$$

where P is the set of all heavy-atom pairs.
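
The metric above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not sci-form's benchmark code: the function name `pairwise_distance_rmsd` is hypothetical, and it assumes both conformers list the same heavy atoms in the same order.

```python
import math
from itertools import combinations


def pairwise_distance_rmsd(coords_a, coords_b):
    """Alignment-free RMSD over all pairwise distances of two conformers.

    coords_a, coords_b: equal-length lists of (x, y, z) tuples for the
    heavy atoms, given in the same atom order (an assumption of this sketch).
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("conformers must have the same number of atoms")
    pairs = list(combinations(range(len(coords_a)), 2))  # the set P
    if not pairs:
        raise ValueError("need at least two atoms")
    sq_sum = 0.0
    for i, j in pairs:
        d_a = math.dist(coords_a[i], coords_a[j])  # d_ij in conformer A
        d_b = math.dist(coords_b[i], coords_b[j])  # d_ij in conformer B
        sq_sum += (d_a - d_b) ** 2
    return math.sqrt(sq_sum / len(pairs))
```

Because only internal distances enter the sum, a rotated and translated copy of a conformer scores 0 against the original without any superposition step.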

Diverse Molecule Benchmark

131 curated molecules spanning 27 chemical categories, from simple alkanes to macrocycles and metal-containing compounds.

Overall Results

| Metric | Value |
| --- | --- |
| Total molecules | 131 |
| Parse success | 100% |
| Embed success | 97.7% (128/131) |
| Geometry quality | 97.7% |
| Throughput | 60 mol/s |

Per-Category Results

Embed Failures

Only 3 molecules fail to embed (out of 131):

| Molecule | Category | Reason |
| --- | --- | --- |
| Pyrene | polycyclic | 4-ring fused polyaromatic |
| Cubane | strained | Extreme 90° angles in 4-rings |
| Fluoranthene | polycyclic | Fused 5-6-6-5 ring system |

These are well-known hard cases for distance geometry due to their extreme geometric constraints.

RDKit Comparison

Heavy-atom pairwise-distance RMSD between sci-form and RDKit conformers. The comparison uses a multi-seed ensemble: 5 seeds per molecule, with the minimum RMSD reported.

Overall Results

| Metric | Value |
| --- | --- |
| Average RMSD | 0.064 Å |
| Median RMSD | 0.011 Å |
| < 0.1 Å | 82.8% |
| < 0.3 Å | 94.4% |
| < 0.5 Å | 98.4% |
| < 1.0 Å | 100% |

RMSD Distribution

Hardest Categories

| Category | Avg RMSD | Description |
| --- | --- | --- |
| silicon | 0.543 Å | Si atom typing differences |
| selenium | 0.507 Å | Se parameter approximations |
| strained | 0.182 Å | Cubane, cyclopropane |
| polycyclic | 0.112 Å | Fused aromatic systems |

GDB-20 Ensemble Comparison

Large-scale validation on 500 molecules from the GDB-20 database (molecules with up to 20 heavy atoms), using an ensemble of 5 sci-form seeds compared against 21 RDKit seeds. The minimum RMSD across all seed combinations is reported.
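
The minimum over all seed combinations can be sketched as follows. This is a hedged illustration rather than the benchmark's actual script: `min_rmsd` is a hypothetical helper, each ensemble is assumed to be a list of conformers (one per seed) with identical atom ordering, and the metric is the pairwise-distance RMSD defined in the Methodology section.

```python
import math
from itertools import combinations, product


def pairwise_distance_rmsd(a, b):
    """Pairwise-distance RMSD between two conformers with the same atom order."""
    pairs = list(combinations(range(len(a)), 2))
    sq = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
             for i, j in pairs)
    return math.sqrt(sq / len(pairs))


def min_rmsd(ensemble_a, ensemble_b):
    """Return (best_rmsd, index_a, index_b) over all conformer pairs.

    ensemble_a, ensemble_b: lists of conformers, one per random seed.
    Every conformer of A is compared with every conformer of B.
    """
    best = (math.inf, -1, -1)
    for (ia, ca), (ib, cb) in product(enumerate(ensemble_a),
                                      enumerate(ensemble_b)):
        r = pairwise_distance_rmsd(ca, cb)
        if r < best[0]:
            best = (r, ia, ib)
    return best
```

With 5 seeds on one side and 21 on the other, this loop evaluates 105 conformer pairs per molecule.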

Results

| Metric | All-atom | Heavy-atom |
| --- | --- | --- |
| Embed success | 100% | 100% |
| Average min-RMSD | 0.063 Å | 0.024 Å |
| > 0.1 Å | 13.00% | 9.20% |
| > 0.3 Å | 6.80% | 1.00% |
| > 0.5 Å | 1.60% | 0.00% |
| > 0.7 Å | 0.00% | 0.00% |

min-RMSD Distribution (all atoms)

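| Range | Count | Share |
| --- | --- | --- |
| 0.00–0.05 Å | 419 | 83.80% |
| 0.05–0.10 Å | 16 | 3.20% |
| 0.10–0.20 Å | 16 | 3.20% |
| 0.20–0.30 Å | 15 | 3.00% |
| 0.30–0.50 Å | 26 | 5.20% |
| 0.50–0.70 Å | 8 | 1.60% |
| > 0.70 Å | 0 | 0.00% |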

Ensemble Rescue Rate

Of molecules with single-seed RMSD > 0.5 Å, the multi-seed ensemble rescued 88.4% (61/69) to below 0.5 Å. Only 8 molecules remain above the threshold after ensemble selection.

ChEMBL 10K Benchmark

A stress test on 10,000 pharmaceutically relevant molecules from the ChEMBL database (up to 100 atoms each).

| Metric | Value |
| --- | --- |
| Parse success | 100% |
| Embed success | 97.54% |
| Geometry quality | 97.18% |
| Throughput | 2.1 mol/s |

Lower throughput is expected for larger molecules due to the $O(N^3)$ scaling of Floyd-Warshall and eigendecomposition.

Performance Scaling

The dominant costs are the Floyd-Warshall triangle smoothing ($O(N^3)$) and the BFGS optimization (each iteration is $O(N^2)$ for the inverse Hessian update).
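
To show where the cubic cost comes from, here is a minimal sketch of triangle smoothing for upper distance bounds using the Floyd-Warshall recurrence. It is an assumption-laden illustration, not sci-form's implementation: `smooth_upper_bounds` is a hypothetical name, the input is a dense symmetric matrix as nested lists, and lower-bound smoothing is omitted.

```python
def smooth_upper_bounds(upper):
    """Tighten upper distance bounds with the triangle inequality.

    upper: N x N symmetric matrix (list of lists) where upper[i][j] is the
    current upper bound on the distance between atoms i and j, with zeros
    on the diagonal. Returns a smoothed copy; the input is left unchanged.

    The three nested loops over N atoms give the O(N^3) cost.
    """
    n = len(upper)
    u = [row[:] for row in upper]
    for k in range(n):            # intermediate atom
        for i in range(n):
            u_ik = u[i][k]
            for j in range(n):
                via_k = u_ik + u[k][j]
                if via_k < u[i][j]:   # d(i,j) <= d(i,k) + d(k,j)
                    u[i][j] = via_k
    return u
```

For a three-atom chain with bonded bounds of 1.5 Å and a loose 1-3 bound of 100 Å, smoothing tightens the 1-3 bound to 3.0 Å, the sum of the two bonded bounds.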

Property Calculation Performance

Conformer Generation

| Dataset | Molecules | Success | Throughput |
| --- | --- | --- | --- |
| Diverse (131 molecules) | 131 | 97.7% | 60 mol/s |
| ChEMBL 10K | 10,000 | 97.5% | 2.1 mol/s |
| GDB-20 (500 sample) | 500 | 100% | ~50 mol/s |

Single-seed mode, no ensemble. Throughput measured on consumer hardware (8-core).

Electronic Structure (EHT)

| Molecule | Atoms | Basis Functions | Time |
| --- | --- | --- | --- |
| H₂O | 3 | 5 | < 1 ms |
| Benzene | 12 | 18 | < 2 ms |
| Naphthalene | 18 | 28 | < 5 ms |
| Drug-like (~30 heavy) | ~40 | ~60 | ~10 ms |

EHT cost scales as $O(N_{\mathrm{AO}}^3)$ for diagonalization. The STO-3G minimal basis keeps $N_{\mathrm{AO}}$ small even for medium molecules.

ESP Grid

| Grid Resolution | Spacing | Typical Size | Time |
| --- | --- | --- | --- |
| Coarse | 1.0 Å | 10³ grid | < 5 ms |
| Standard | 0.5 Å | 20³ grid | ~20 ms |
| Fine | 0.2 Å | 50³ grid | ~300 ms |

ESP evaluation is $O(N_{\text{atoms}} \times N_{\text{grid}})$. The parallel evaluator (`compute_esp_grid_parallel`) gives near-linear speedup on multi-core systems.
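
The atoms-times-grid-points cost is easiest to see in a plain serial loop. The sketch below assumes a simple point-charge Coulomb model with no unit conversion (charges in e, distances in Å); `esp_point_charges` and its parameters are hypothetical and do not reproduce sci-form's `compute_esp_grid_parallel` or its ESP model.

```python
import math
from itertools import product


def esp_point_charges(atoms, origin, spacing, shape):
    """Evaluate a point-charge electrostatic potential on a regular grid.

    atoms:   list of (x, y, z, q) tuples, positions in Å and charges in e
    origin:  (x, y, z) of the first grid point
    spacing: grid spacing in Å
    shape:   (nx, ny, nz) number of points along each axis

    Returns a flat list of sum(q / r) values in x-major order.
    The work is one distance per (grid point, atom) pair.
    """
    nx, ny, nz = shape
    ox, oy, oz = origin
    values = []
    for ix, iy, iz in product(range(nx), range(ny), range(nz)):
        point = (ox + ix * spacing, oy + iy * spacing, oz + iz * spacing)
        v = 0.0
        for x, y, z, q in atoms:
            r = math.dist(point, (x, y, z))
            if r > 1e-9:  # skip the singularity when a point sits on an atom
                v += q / r
        values.append(v)
    return values
```

Each grid point is independent of the others, which is why this kind of loop splits cleanly across cores. For scale, a 20³ grid has 8,000 points and a 50³ grid has 125,000.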

Complete Property Pipeline (single molecule)

Full pipeline (embed + charges + EHT + ESP + DOS + SASA + dipole) on a drug-like molecule (~30 heavy atoms):

| Step | Time |
| --- | --- |
| Conformer generation | ~10 ms |
| Gasteiger charges | < 1 ms |
| EHT calculation | ~5 ms |
| ESP grid (0.5 Å) | ~20 ms |
| DOS computation | < 1 ms |
| SASA | < 1 ms |
| Dipole | < 1 ms |
| Total | ~38 ms |

Released under the MIT License.