Evaluation objects and pipelines#

%load_ext autoreload
%autoreload 2

import open3d # open3d can occasionally cause issues during imports; importing it first can help alleviate that
import numpy as np
from rdkit import Chem
from shepherd_score.conformer_generation import embed_conformer_from_smiles

from shepherd_score.evaluations.evaluate import ConfEval, UnconditionalEvalPipeline
from shepherd_score.evaluations.evaluate import ConsistencyEvalPipeline, ConditionalEvalPipeline
Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.

Conformer evaluation base class#

The base class used to evaluate conformer validity and get 2D graph properties is ConfEval. Other evaluation classes (other than docking) inherit from ConfEval and the related pipelines utilize these objects.

Let’s run a small experiment where the MMFF94-relaxed molecule is the “generated” molecule – represent it as an atomic point cloud.

rdkit_mol = embed_conformer_from_smiles('c1Cc2ccc(Cl)cc2C(=O)c1c3cc(N1nnc2cc(C)c(Cl)cc2c1=O)ccc3', MMFF_optimize=True)

# get the atomic numbers as an array and the positions of the atoms
atoms = np.array([a.GetAtomicNum() for a in rdkit_mol.GetAtoms()])
positions = rdkit_mol.GetConformer().GetPositions()
conf_eval = ConfEval(atoms, positions, solvent='water') # solvent = None if gas phase
conf_eval.to_pandas() # show the attributes as a pandas Series
xyz_block                   46\n\nC -3.532 -1.388 1.036\nC -4.779 -0.713 1...
mol                          <rdkit.Chem.rdchem.Mol object at 0x145c44717290>
smiles                      Cc1cc2nnn(-c3cccc(C4=CCc5ccc(Cl)cc5C4=O)c3)c(=...
molblock                    \n     RDKit          3D\n\n 46 50  0  0  0  0...
energy                                                             -85.047816
partial_charges             [-0.01924502, -0.08763144, 0.02359121, -0.0418...
solvent                                                                 water
charge                                                                      0
xyz_block_post_opt          46\n\nC           -3.75184451948461       -1.5...
mol_post_opt                 <rdkit.Chem.rdchem.Mol object at 0x145c445a2b20>
smiles_post_opt             Cc1cc2nnn(-c3cccc(C4=CCc5ccc(Cl)cc5C4=O)c3)c(=...
molblock_post_opt           \n     RDKit          3D\n\n 46 50  0  0  0  0...
energy_post_opt                                                    -85.063889
partial_charges_post_opt    [-0.01744142, -0.08851187, 0.02560457, -0.0409...
is_valid                                                                 True
is_valid_post_opt                                                        True
is_graph_consistent                                                      True
SA_score                                                             2.746132
QED                                                                  0.422032
logP                                                                  5.21832
fsp3                                                                 0.083333
morgan_fp                   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
SA_score_post_opt                                                    2.746132
QED_post_opt                                                         0.422032
logP_post_opt                                                         5.21832
fsp3_post_opt                                                        0.083333
morgan_fp_post_opt          [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
strain_energy                                                        0.016073
rmsd                                                                 0.435447
dtype: object

Conformer evaluation pipelines#

Since typically multiple molecules are generated and all need to be evaluated, some pipeline classes used.

Unconditional evaluation#

The UnconditionalEvalPipeline simply iterates over all the generated molecules with ConfEval and stores the full evaluation.

Let’s generate a few test molecules and embed them with RDKit ETKDG. We prepare them for the necessary inputs: a list of tuples containing each the molecule’s corresponding atoms’ atomic numbers and positions as numpy arrays.

smiles_ls = ['CC', 'CCC', 'CCCC']*100
test_mols = [embed_conformer_from_smiles(smi, MMFF_optimize=False) for smi in smiles_ls]

generated_mols = []
for m in test_mols:
    generated_mols.append(
        (np.array([a.GetAtomicNum() for a in m.GetAtoms()]), m.GetConformer().GetPositions())
    )

Initialize and run the pipeline.

uncond_pipe = UnconditionalEvalPipeline(generated_mols=generated_mols, solvent='water')
uncond_pipe.evaluate(verbose=True, num_processes=4, num_workers=1)
Unconditional Eval: 100%|█████████████████████| 300/300 [00:16<00:00, 18.62it/s]
uncond_pipe.evaluate(verbose=True, num_processes=1, num_workers=8)
Unconditional Eval: 100%|█████████████████████| 300/300 [00:13<00:00, 22.26it/s]
properties_df, global_attr = uncond_pipe.to_pandas()
global_attr.head()
generated_mols molblocks molblocks_post_opt strain_energies rmsds SA_scores logPs QEDs fsp3s SA_scores_post_opt logPs_post_opt QEDs_post_opt fsp3s_post_opt
0 ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.76367466358819... \n RDKit 3D\n\n 8 7 0 0 0 0... \n RDKit 3D\n\n 8 7 0 0 0 0... 0.006719 0.001154 2.747568 1.0262 0.372786 1.0 2.747568 1.0262 0.372786 1.0
1 ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.11574... \n RDKit 3D\n\n 11 10 0 0 0 0... \n RDKit 3D\n\n 11 10 0 0 0 0... 0.017008 0.014296 1.754957 1.4163 0.385471 1.0 1.754957 1.4163 0.385471 1.0
2 ([6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [... \n RDKit 3D\n\n 14 13 0 0 0 0... \n RDKit 3D\n\n 14 13 0 0 0 0... 0.012470 0.062969 1.605723 1.8064 0.431024 1.0 1.605723 1.8064 0.431024 1.0
3 ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.74841118384649... \n RDKit 3D\n\n 8 7 0 0 0 0... \n RDKit 3D\n\n 8 7 0 0 0 0... 0.007161 0.011072 2.747568 1.0262 0.372786 1.0 2.747568 1.0262 0.372786 1.0
4 ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.24483... \n RDKit 3D\n\n 11 10 0 0 0 0... \n RDKit 3D\n\n 11 10 0 0 0 0... 0.011083 0.021313 1.754957 1.4163 0.385471 1.0 1.754957 1.4163 0.385471 1.0
properties_df
num_generated_mols                                                       300
solvent                                                                water
num_valid                                                                600
num_valid_post_opt                                                       600
num_consistent_graph                                                     600
frac_valid                                                               2.0
frac_valid_post_opt                                                      2.0
frac_consistent                                                          2.0
frac_unique                                                            0.005
frac_unique_post_opt                                                   0.005
avg_graph_diversity                                                 0.118716
graph_similarity_matrix    [[1.0, 0.2, 0.16666666666666666, 1.0, 0.2, 0.1...
dtype: object

Consistency evaluation#

This is used to evaluate if the jointly generated interaction profiles correspond to the true interaction profile of the generated molecule. The ConsistencyEvalPipeline simply iterates over all the generated molecules with the ConsistencyEval class and stores the full evaluation. In addition to the properties calculated by ConfEval it also does score-based alignment so it is a slower operation.

from shepherd_score.container import Molecule

Prepare the inputs. We pretend that the test smiles are “generated” molecules with their corresponding interaction profiles. ConsistencyEvalPipeline expects this format for the inputs.

smiles_ls = ['CC', 'CCC', 'CCCC']*33
test_mols = [embed_conformer_from_smiles(smi, MMFF_optimize=True) for smi in smiles_ls]

generated_mols = []
generated_surf_points = []
generated_surf_esp = []
generated_pharm_feats = []
for m in test_mols:
    generated_mols.append(
        (np.array([a.GetAtomicNum() for a in m.GetAtoms()]), m.GetConformer().GetPositions())
    )
    # Generate and store each interaction profile as if they were generated.
    # Notably, we use MMFF94 partial charges and ConsistencyEvalPipeline
    #  will compare the ESP to xTB generated partial charges
    molec = Molecule(m, num_surf_points=200, probe_radius=1.2, partial_charges=None, pharm_multi_vector=False)
    generated_surf_points.append(molec.surf_pos)
    generated_surf_esp.append(molec.surf_esp)
    generated_pharm_feats.append(
        (molec.pharm_types, molec.pharm_ancs, molec.pharm_vecs)
    )

Initialize and run the pipeline

consis_eval = ConsistencyEvalPipeline(
    generated_mols = generated_mols,
    generated_surf_points = generated_surf_points,
    generated_surf_esp = generated_surf_esp,
    generated_pharm_feats = generated_pharm_feats,
    probe_radius=1.2,
    pharm_multi_vector=False,
    solvent=None
)
consis_eval.evaluate(num_workers=1, verbose=True)
Consistency Eval: 100%|█████████████████████████| 99/99 [02:35<00:00,  1.57s/it]
consis_eval.evaluate(num_workers=10, verbose=True)
Consistency Eval: 100%|█████████████████████████| 99/99 [00:22<00:00,  4.43it/s]

You can view the saved attributes and properties as a pandas Series for the global (whole set) attributes and a DataFrame for per-sample properties.

properties_df_consis, global_attr_consis = consis_eval.to_pandas()
global_attr_consis.head()
generated_mols generated_surf_points generated_surf_esp generated_pharm_feats molblocks molblocks_post_opt strain_energies rmsds SA_scores logPs ... sims_esp_upper_bound sims_surf_lower_bound sims_esp_lower_bound sims_pharm_lower_bound sims_surf_consistent_relax sims_esp_consistent_relax sims_pharm_consistent_relax sims_surf_consistent_relax_optimal sims_esp_consistent_relax_optimal sims_pharm_consistent_relax_optimal
0 ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.75427424835602... [[-0.5813803, -1.4873081, -2.4721417], [-1.619... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ([3], [[-6.558661541644639e-08, -1.08085862537... \n RDKit 3D\n\n 8 7 0 0 0 0... \n RDKit 3D\n\n 8 7 0 0 0 0... 0.000197 0.005077 2.747568 1.0262 ... 0.977656 NaN NaN NaN 0.976818 0.976816 1.000000 0.977372 0.977369 0.999995
1 ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.23450... [[-1.6486647, -2.8718212, 1.0219024], [-1.1368... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ([3], [[0.004617025132251855, 0.00548270018345... \n RDKit 3D\n\n 11 10 0 0 0 0... \n RDKit 3D\n\n 11 10 0 0 0 0... 0.000214 0.002903 1.754957 1.4163 ... 0.971617 NaN NaN NaN 0.962465 0.962464 0.999997 0.963093 0.963092 0.999997
2 ([6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [... [[-3.3231347, -1.6956815, 0.9845177], [-3.3837... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ([3, 3], [[-1.3538755875425377, 0.129942825536... \n RDKit 3D\n\n 14 13 0 0 0 0... \n RDKit 3D\n\n 14 13 0 0 0 0... 0.000287 0.003050 1.605723 1.8064 ... 0.963071 NaN NaN NaN 0.969413 0.969412 0.999997 0.969788 0.969787 0.999994
3 ([6, 6, 1, 1, 1, 1, 1, 1], [[0.754557611047011... [[0.6699976, -3.0041163, 0.48550072], [0.25443... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ([3], [[-2.2718348313688352e-09, 1.48791622940... \n RDKit 3D\n\n 8 7 0 0 0 0... \n RDKit 3D\n\n 8 7 0 0 0 0... 0.000178 0.004375 2.747568 1.0262 ... 0.975956 NaN NaN NaN 0.976338 0.976336 1.000000 0.977218 0.977217 0.999992
4 ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[1.264255... [[1.5423033, -2.574833, 1.6369876], [1.8832769... [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... ([3], [[-0.012158210270805473, -0.127197368223... \n RDKit 3D\n\n 11 10 0 0 0 0... \n RDKit 3D\n\n 11 10 0 0 0 0... 0.000232 0.003398 1.754957 1.4163 ... 0.971013 NaN NaN NaN 0.966891 0.966890 0.999998 0.967209 0.967207 0.999992

5 rows × 30 columns

properties_df_consis
num_generated_mols                                                                 99
solvent                                                                          None
probe_radius                                                                      1.2
pharm_multi_vector                                                              False
num_valid                                                                         198
num_valid_post_opt                                                                198
num_consistent_graph                                                              198
frac_valid                                                                        2.0
frac_valid_post_opt                                                               2.0
frac_consistent                                                                   2.0
frac_unique                                                                  0.015152
frac_unique_post_opt                                                         0.015152
avg_graph_diversity                                                           0.11912
graph_similarity_matrix             [[1.0, 0.2, 0.16666666666666666, 1.0, 0.2, 0.1...
avg_graph_diversity_post_opt                                                  0.11912
graph_similarity_matrix_post_opt    [[1.0, 0.2, 0.16666666666666666, 1.0, 0.2, 0.1...
dtype: object

Conditional evaluation#

This is used to evaluate if the generated is similar (based on shepherd_score 3D scoring functions) to the target/reference molecule. The ConditionalEvalPipeline simply iterates over all the generated molecules with the ConditionalEval class and stores the full evaluation. In addition to the properties calculated by ConfEval it also does score-based alignment so it is a slower operation.

rdkit_mol = embed_conformer_from_smiles('c1Cc2ccc(Cl)cc2C(=O)c1c3cc(N1nnc2cc(C)c(Cl)cc2c1=O)ccc3', MMFF_optimize=True)

# Again using MMFF94 partial charges
ref_molec = Molecule(rdkit_mol, num_surf_points=200, probe_radius=1.2, pharm_multi_vector=False)
cond_pipe = ConditionalEvalPipeline(ref_molec, generated_mols=generated_mols,
                                    condition='all', num_surf_points=200,
                                    pharm_multi_vector=False, solvent=None)
cond_pipe.evaluate(verbose=True, num_workers=1, mp_context='spawn')
Conditional Eval: 100%|█████████████████████████| 99/99 [02:09<00:00,  1.31s/it]
cond_pipe.evaluate(verbose=True, num_workers=8, mp_context='spawn')
Conditional Eval: 100%|█████████████████████████| 99/99 [00:21<00:00,  4.68it/s]
properties_df_cond, global_attr_cond = cond_pipe.to_pandas()
global_attr_cond.head()
generated_mols molblocks molblocks_post_opt strain_energies rmsds SA_scores logPs QEDs fsp3s SA_scores_post_opt ... sims_surf_target_relax sims_esp_target_relax sims_pharm_target_relax sims_surf_target_relax_optimal sims_esp_target_relax_optimal sims_pharm_target_relax_optimal sims_surf_target_relax_esp_aligned sims_pharm_target_relax_esp_aligned graph_similarities graph_similarities_post_opt
0 ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.75427424835602... \n RDKit 3D\n\n 8 7 0 0 0 0... \n RDKit 3D\n\n 8 7 0 0 0 0... 0.000197 0.005077 2.747568 1.0262 0.372786 1.0 2.747568 ... 0.136365 0.135720 0.010560 0.158949 0.158310 0.075350 0.158950 0.007106 0.011905 0.011905
1 ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.23450... \n RDKit 3D\n\n 11 10 0 0 0 0... \n RDKit 3D\n\n 11 10 0 0 0 0... 0.000214 0.002903 1.754957 1.4163 0.385471 1.0 1.754957 ... 0.166879 0.166159 0.011084 0.207935 0.207234 0.075341 0.207939 0.068363 0.011628 0.011628
2 ([6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [... \n RDKit 3D\n\n 14 13 0 0 0 0... \n RDKit 3D\n\n 14 13 0 0 0 0... 0.000287 0.003050 1.605723 1.8064 0.431024 1.0 1.605723 ... 0.175781 0.175023 0.015852 0.239678 0.238713 0.077309 0.239671 0.069266 0.011494 0.011494
3 ([6, 6, 1, 1, 1, 1, 1, 1], [[0.754557611047011... \n RDKit 3D\n\n 8 7 0 0 0 0... \n RDKit 3D\n\n 8 7 0 0 0 0... 0.000178 0.004375 2.747568 1.0262 0.372786 1.0 2.747568 ... 0.138604 0.137975 0.010560 0.203743 0.203575 0.075350 0.203740 0.073533 0.011905 0.011905
4 ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[1.264255... \n RDKit 3D\n\n 11 10 0 0 0 0... \n RDKit 3D\n\n 11 10 0 0 0 0... 0.000232 0.003398 1.754957 1.4163 0.385471 1.0 1.754957 ... 0.155630 0.154976 0.008322 0.209363 0.208614 0.075337 0.209363 0.005040 0.011628 0.011628

5 rows × 26 columns

properties_df_cond
num_generated_mols                                                               99
solvent                                                                        None
pharm_multi_vector                                                            False
condition                                                                       all
num_surf_points                                                                 200
lam                                                                             0.3
lam_scaled                                                                62.206048
ref_molblock                      \n     RDKit          3D\n\n 46 50  0  0  0  0...
ref_mol_SA_score                                                           2.746132
ref_mol_QED                                                                0.422032
ref_mol_logP                                                                6.78332
ref_mol_fsp3                                                               0.083333
ref_mol_morgan_fp                 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
ref_surf_resampling_scores        [0.8396994276889428, 0.8444193298291229, 0.854...
ref_surf_esp_resampling_scores    [0.8396284268748854, 0.844353854359548, 0.8546...
sims_surf_upper_bound                                                      0.862957
sims_esp_upper_bound                                                       0.862882
num_valid                                                                       198
num_valid_post_opt                                                              198
num_consistent_graph                                                            198
frac_valid                                                                      2.0
frac_valid_post_opt                                                             2.0
frac_consistent                                                                 2.0
frac_unique                                                                0.015152
frac_unique_post_opt                                                       0.015152
avg_graph_diversity                                                        0.988324
dtype: object