SMILES Parsing

The first step of the pipeline converts a SMILES (Simplified Molecular Input Line Entry System) string into a molecular graph suitable for 3D coordinate generation.

From String to Graph

The Molecule Structure

The parser produces a Molecule containing:

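```rust
// Assumption: Graph and Undirected come from the petgraph crate.
use petgraph::graph::Graph;
use petgraph::Undirected;

pub struct Molecule {
    pub graph: Graph<Atom, Bond, Undirected>,
    pub ring_info: Vec<Vec<usize>>,  // SSSR ring membership
}
```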

Atom Properties

Each atom stores the properties needed for 3D generation:

| Property | Type | Description |
|---|---|---|
| element | u8 | Atomic number (1=H, 6=C, 7=N, 8=O, ...) |
| hybridization | Hybridization | SP, SP2, SP3, or Unspecified |
| formal_charge | i8 | Formal charge (typically −2 to +2) |
| aromatic | bool | Whether the atom is aromatic |
| num_implicit_h | u8 | Number of implicit hydrogens added |
| chiral | Option<Chirality> | @ (CCW) or @@ (CW) |
| in_ring | bool | Whether the atom is in any ring |

Bond Properties

| Property | Type | Values |
|---|---|---|
| order | BondOrder | Single, Double, Triple, Aromatic |
| stereo | BondStereo | None, E (trans), Z (cis) |
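The two property tables map naturally onto plain Rust types. The following is a minimal sketch of what such definitions might look like. The field names and types come from the tables, but the enum variant names (for example `CounterClockwise`) and the helper `example_hydroxyl_oxygen` are illustrative assumptions, not the parser's actual definitions.

```rust
// Sketch only: field names follow the tables above; variant names and the
// example helper are assumptions made for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Hybridization { SP, SP2, SP3, Unspecified }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Chirality {
    CounterClockwise, // written `@` in SMILES
    Clockwise,        // written `@@` in SMILES
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BondOrder { Single, Double, Triple, Aromatic }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BondStereo { None, E, Z }

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Atom {
    pub element: u8,                  // atomic number
    pub hybridization: Hybridization,
    pub formal_charge: i8,
    pub aromatic: bool,
    pub num_implicit_h: u8,
    pub chiral: Option<Chirality>,    // None when no stereo mark is present
    pub in_ring: bool,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Bond {
    pub order: BondOrder,
    pub stereo: BondStereo,
}

/// Hypothetical helper: a hydroxyl oxygen such as the one in phenol.
pub fn example_hydroxyl_oxygen() -> Atom {
    Atom {
        element: 8,
        hybridization: Hybridization::SP3,
        formal_charge: 0,
        aromatic: false,
        num_implicit_h: 1,
        chiral: None,
        in_ring: false,
    }
}
```

Keeping `chiral` as an `Option` lets the common case of an atom without a stereo mark be represented without a dedicated "no chirality" variant.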

Hybridization Assignment

After building the graph, hybridization is assigned based on local environment:

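$$
\text{hybridization} =
\begin{cases}
\text{SP3} & \text{if all bonds are single} \\
\text{SP2} & \text{if any bond is double or the atom is aromatic} \\
\text{SP} & \text{if any bond is triple or the atom has two double bonds}
\end{cases}
$$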

Special cases:

  • Aromatic atoms → SP2
  • N with 3 single bonds → SP2 if in an aromatic ring, SP3 otherwise
  • O with 2 single bonds → SP3 (e.g., ether)
  • Terminal atoms (degree 1) → inherit from neighbor
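The general rule can be sketched as a small function over the bond orders incident to one atom. This is an illustration under assumptions: the function name `assign_hybridization` and its signature are hypothetical, the SP case is checked first so that two double bonds are not classified as SP2, and the nitrogen and terminal-atom special cases are left out because they need neighbor information.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Hybridization { SP, SP2, SP3 }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BondOrder { Single, Double, Triple, Aromatic }

/// Hypothetical helper: classify one atom from its incident bond orders.
/// Covers the general rule and the aromatic case only.
pub fn assign_hybridization(bonds: &[BondOrder], aromatic: bool) -> Hybridization {
    let triples = bonds.iter().filter(|b| **b == BondOrder::Triple).count();
    let doubles = bonds.iter().filter(|b| **b == BondOrder::Double).count();
    let has_aromatic_bond = bonds.iter().any(|b| *b == BondOrder::Aromatic);

    if triples >= 1 || doubles >= 2 {
        // Any triple bond, or two double bonds (as in the central carbon of CO2).
        Hybridization::SP
    } else if doubles == 1 || aromatic || has_aromatic_bond {
        Hybridization::SP2
    } else {
        // Only single bonds remain.
        Hybridization::SP3
    }
}
```

For example, a carbon with four single bonds is classified as SP3, while a carbon with two aromatic bonds and one single bond is classified as SP2.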

Implicit Hydrogen Addition

SMILES encodes most hydrogens implicitly. The parser adds them explicitly because 3D coordinate generation requires all atoms:

$$
n_H = v - \sum_{\text{bonds}} \text{order} - |q|
$$

where $v$ is the normal valence and $q$ is the formal charge.

Standard valences used:

| Element | Valences |
|---|---|
| C | 4 |
| N | 3, 5 |
| O | 2 |
| S | 2, 4, 6 |
| P | 3, 5 |
| B | 3 |
| F, Cl, Br, I | 1 |
| Si | 4 |
| Se | 2 |

For aromatic atoms, the aromatic bond contributes 1.5 to the bond order sum, but in practice the parser assigns 1 per aromatic bond and adjusts the hydrogen count to maintain proper valence.
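The hydrogen-count formula and the valence table can be combined into a short function. This is a sketch under stated assumptions: the names `standard_valences` and `implicit_h_count` are hypothetical, and for elements with several valences the sketch picks the smallest one that accommodates the existing bonds, which the text does not specify. The aromatic adjustment is not modeled, so the caller is expected to pass a bond order sum that already accounts for it.

```rust
/// Standard valences from the table above, keyed by atomic number.
/// Returns an empty slice for elements the table does not list.
pub fn standard_valences(element: u8) -> &'static [u8] {
    match element {
        5 => &[3],                // B
        6 => &[4],                // C
        7 => &[3, 5],             // N
        8 => &[2],                // O
        9 | 17 | 35 | 53 => &[1], // F, Cl, Br, I
        14 => &[4],               // Si
        15 => &[3, 5],            // P
        16 => &[2, 4, 6],         // S
        34 => &[2],               // Se
        _ => &[],
    }
}

/// n_H = v - (sum of bond orders) - |q|.
/// Assumption: v is the smallest listed valence that is large enough;
/// if none fits, or the element is unlisted, no hydrogens are added.
pub fn implicit_h_count(element: u8, bond_order_sum: u8, formal_charge: i8) -> u8 {
    let used = bond_order_sum as i16 + (formal_charge as i16).abs();
    standard_valences(element)
        .iter()
        .map(|&v| v as i16)
        .find(|&v| v >= used)
        .map(|v| (v - used) as u8)
        .unwrap_or(0)
}
```

With these rules, a carbon with no bonds receives four hydrogens (methane) and an oxygen with one single bond receives one (a hydroxyl group).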

Ring Detection (SSSR)

The Smallest Set of Smallest Rings (SSSR) is computed after graph construction. This is essential for:

  • Bounds matrix: ring bonds constrain torsion angles
  • SMARTS matching: ring-membership queries (R, r)
  • ETKDG: ring torsion patterns differ from chain torsions
  • Force field: ring planarity enforcement

The SSSR is found using a modified graph traversal:

  1. Compute the cycle rank: $\mu = |E| - |V| + 1$
  2. Find all shortest-path back edges
  3. Extract μ independent rings
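Step 1 is simple enough to sketch directly. The function below is an illustration, not the parser's code: the name `cycle_rank` and the edge-list input are assumptions. It also generalizes the formula to |E| − |V| + c, where c is the number of connected components, so that multi-fragment inputs are handled. For a single connected molecule c = 1 and this reduces to the formula in step 1.

```rust
/// Cycle rank of an undirected graph given as an atom count and a bond list.
/// Uses |E| - |V| + c, with c counted by a small union-find.
/// Panics if a bond refers to an atom index >= num_atoms.
pub fn cycle_rank(num_atoms: usize, bonds: &[(usize, usize)]) -> usize {
    // Find the representative of x, halving the path as we go.
    fn find(parent: &mut [usize], mut x: usize) -> usize {
        while parent[x] != x {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        x
    }

    let mut parent: Vec<usize> = (0..num_atoms).collect();
    let mut components = num_atoms;
    for &(a, b) in bonds {
        let ra = find(&mut parent, a);
        let rb = find(&mut parent, b);
        if ra != rb {
            // This bond joins two previously separate fragments.
            parent[ra] = rb;
            components -= 1;
        }
    }
    // Adding before subtracting keeps the unsigned arithmetic from underflowing.
    bonds.len() + components - num_atoms
}
```

A six-membered ring with six bonds gives a cycle rank of 1, and a fused bicyclic system with ten atoms and eleven bonds, such as the heavy atoms of naphthalene, gives 2.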

Example: Phenol

For the SMILES c1ccccc1O, the parser produces 13 atoms (6 C + 1 O + 6 H), 13 bonds, and 1 SSSR ring of size 6. All ring carbons are SP2 and the oxygen is SP3.
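These counts can be checked by hand. The six hydrogens are the five on the unsubstituted ring carbons plus the one on the hydroxyl oxygen, and the bonds split into ring C–C, C–O, C–H, and O–H bonds:

$$
\begin{aligned}
|V| &= 6 + 1 + 6 = 13 \\
|E| &= 6_{\text{C–C}} + 1_{\text{C–O}} + 5_{\text{C–H}} + 1_{\text{O–H}} = 13 \\
\mu &= |E| - |V| + 1 = 13 - 13 + 1 = 1
\end{aligned}
$$

The cycle rank of 1 agrees with the single six-membered ring reported by the SSSR step.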

Released under the MIT License.