Evaluation objects and pipelines#
%load_ext autoreload
%autoreload 2
import open3d # open3d can occasionally cause issues during imports; importing it first can help alleviate that
import numpy as np
from rdkit import Chem
from shepherd_score.conformer_generation import embed_conformer_from_smiles
from shepherd_score.evaluations.evaluate import ConfEval, UnconditionalEvalPipeline
from shepherd_score.evaluations.evaluate import ConsistencyEvalPipeline, ConditionalEvalPipeline
Jupyter environment detected. Enabling Open3D WebVisualizer.
[Open3D INFO] WebRTC GUI backend enabled.
[Open3D INFO] WebRTCWindowSystem: HTTP handshake server disabled.
Conformer evaluation base class#
The base class used to evaluate conformer validity and get 2D graph properties is ConfEval. Other evaluation classes (other than docking) inherit from ConfEval and the related pipelines utilize these objects.
Let’s run a small experiment where the MMFF94-relaxed molecule is the “generated” molecule – represent it as an atomic point cloud.
rdkit_mol = embed_conformer_from_smiles('c1Cc2ccc(Cl)cc2C(=O)c1c3cc(N1nnc2cc(C)c(Cl)cc2c1=O)ccc3', MMFF_optimize=True)
# get the atomic numbers as an array and the positions of the atoms
atoms = np.array([a.GetAtomicNum() for a in rdkit_mol.GetAtoms()])
positions = rdkit_mol.GetConformer().GetPositions()
conf_eval = ConfEval(atoms, positions, solvent='water') # solvent = None if gas phase
conf_eval.to_pandas() # show the attributes as a pandas Series
xyz_block 46\n\nC -3.532 -1.388 1.036\nC -4.779 -0.713 1...
mol <rdkit.Chem.rdchem.Mol object at 0x145c44717290>
smiles Cc1cc2nnn(-c3cccc(C4=CCc5ccc(Cl)cc5C4=O)c3)c(=...
molblock \n RDKit 3D\n\n 46 50 0 0 0 0...
energy -85.047816
partial_charges [-0.01924502, -0.08763144, 0.02359121, -0.0418...
solvent water
charge 0
xyz_block_post_opt 46\n\nC -3.75184451948461 -1.5...
mol_post_opt <rdkit.Chem.rdchem.Mol object at 0x145c445a2b20>
smiles_post_opt Cc1cc2nnn(-c3cccc(C4=CCc5ccc(Cl)cc5C4=O)c3)c(=...
molblock_post_opt \n RDKit 3D\n\n 46 50 0 0 0 0...
energy_post_opt -85.063889
partial_charges_post_opt [-0.01744142, -0.08851187, 0.02560457, -0.0409...
is_valid True
is_valid_post_opt True
is_graph_consistent True
SA_score 2.746132
QED 0.422032
logP 5.21832
fsp3 0.083333
morgan_fp [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
SA_score_post_opt 2.746132
QED_post_opt 0.422032
logP_post_opt 5.21832
fsp3_post_opt 0.083333
morgan_fp_post_opt [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
strain_energy 0.016073
rmsd 0.435447
dtype: object
Conformer evaluation pipelines#
Since typically multiple molecules are generated and all need to be evaluated, some pipeline classes used.
Unconditional evaluation#
The UnconditionalEvalPipeline simply iterates over all the generated molecules with ConfEval and stores the full evaluation.
Let’s generate a few test molecules and embed them with RDKit ETKDG. We prepare them for the necessary inputs: a list of tuples containing each the molecule’s corresponding atoms’ atomic numbers and positions as numpy arrays.
smiles_ls = ['CC', 'CCC', 'CCCC']*100
test_mols = [embed_conformer_from_smiles(smi, MMFF_optimize=False) for smi in smiles_ls]
generated_mols = []
for m in test_mols:
generated_mols.append(
(np.array([a.GetAtomicNum() for a in m.GetAtoms()]), m.GetConformer().GetPositions())
)
Initialize and run the pipeline.
uncond_pipe = UnconditionalEvalPipeline(generated_mols=generated_mols, solvent='water')
uncond_pipe.evaluate(verbose=True, num_processes=4, num_workers=1)
Unconditional Eval: 100%|█████████████████████| 300/300 [00:16<00:00, 18.62it/s]
uncond_pipe.evaluate(verbose=True, num_processes=1, num_workers=8)
Unconditional Eval: 100%|█████████████████████| 300/300 [00:13<00:00, 22.26it/s]
properties_df, global_attr = uncond_pipe.to_pandas()
global_attr.head()
| generated_mols | molblocks | molblocks_post_opt | strain_energies | rmsds | SA_scores | logPs | QEDs | fsp3s | SA_scores_post_opt | logPs_post_opt | QEDs_post_opt | fsp3s_post_opt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.76367466358819... | \n RDKit 3D\n\n 8 7 0 0 0 0... | \n RDKit 3D\n\n 8 7 0 0 0 0... | 0.006719 | 0.001154 | 2.747568 | 1.0262 | 0.372786 | 1.0 | 2.747568 | 1.0262 | 0.372786 | 1.0 |
| 1 | ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.11574... | \n RDKit 3D\n\n 11 10 0 0 0 0... | \n RDKit 3D\n\n 11 10 0 0 0 0... | 0.017008 | 0.014296 | 1.754957 | 1.4163 | 0.385471 | 1.0 | 1.754957 | 1.4163 | 0.385471 | 1.0 |
| 2 | ([6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [... | \n RDKit 3D\n\n 14 13 0 0 0 0... | \n RDKit 3D\n\n 14 13 0 0 0 0... | 0.012470 | 0.062969 | 1.605723 | 1.8064 | 0.431024 | 1.0 | 1.605723 | 1.8064 | 0.431024 | 1.0 |
| 3 | ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.74841118384649... | \n RDKit 3D\n\n 8 7 0 0 0 0... | \n RDKit 3D\n\n 8 7 0 0 0 0... | 0.007161 | 0.011072 | 2.747568 | 1.0262 | 0.372786 | 1.0 | 2.747568 | 1.0262 | 0.372786 | 1.0 |
| 4 | ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.24483... | \n RDKit 3D\n\n 11 10 0 0 0 0... | \n RDKit 3D\n\n 11 10 0 0 0 0... | 0.011083 | 0.021313 | 1.754957 | 1.4163 | 0.385471 | 1.0 | 1.754957 | 1.4163 | 0.385471 | 1.0 |
properties_df
num_generated_mols 300
solvent water
num_valid 600
num_valid_post_opt 600
num_consistent_graph 600
frac_valid 2.0
frac_valid_post_opt 2.0
frac_consistent 2.0
frac_unique 0.005
frac_unique_post_opt 0.005
avg_graph_diversity 0.118716
graph_similarity_matrix [[1.0, 0.2, 0.16666666666666666, 1.0, 0.2, 0.1...
dtype: object
Consistency evaluation#
This is used to evaluate if the jointly generated interaction profiles correspond to the true interaction profile of the generated molecule. The ConsistencyEvalPipeline simply iterates over all the generated molecules with the ConsistencyEval class and stores the full evaluation. In addition to the properties calculated by ConfEval it also does score-based alignment so it is a slower operation.
from shepherd_score.container import Molecule
Prepare the inputs. We pretend that the test smiles are “generated” molecules with their corresponding interaction profiles. ConsistencyEvalPipeline expects this format for the inputs.
smiles_ls = ['CC', 'CCC', 'CCCC']*33
test_mols = [embed_conformer_from_smiles(smi, MMFF_optimize=True) for smi in smiles_ls]
generated_mols = []
generated_surf_points = []
generated_surf_esp = []
generated_pharm_feats = []
for m in test_mols:
generated_mols.append(
(np.array([a.GetAtomicNum() for a in m.GetAtoms()]), m.GetConformer().GetPositions())
)
# Generate and store each interaction profile as if they were generated.
# Notably, we use MMFF94 partial charges and ConsistencyEvalPipeline
# will compare the ESP to xTB generated partial charges
molec = Molecule(m, num_surf_points=200, probe_radius=1.2, partial_charges=None, pharm_multi_vector=False)
generated_surf_points.append(molec.surf_pos)
generated_surf_esp.append(molec.surf_esp)
generated_pharm_feats.append(
(molec.pharm_types, molec.pharm_ancs, molec.pharm_vecs)
)
Initialize and run the pipeline
consis_eval = ConsistencyEvalPipeline(
generated_mols = generated_mols,
generated_surf_points = generated_surf_points,
generated_surf_esp = generated_surf_esp,
generated_pharm_feats = generated_pharm_feats,
probe_radius=1.2,
pharm_multi_vector=False,
solvent=None
)
consis_eval.evaluate(num_workers=1, verbose=True)
Consistency Eval: 100%|█████████████████████████| 99/99 [02:35<00:00, 1.57s/it]
consis_eval.evaluate(num_workers=10, verbose=True)
Consistency Eval: 100%|█████████████████████████| 99/99 [00:22<00:00, 4.43it/s]
You can view the saved attributes and properties as a pandas Series for the global (whole set) attributes and a DataFrame for per-sample properties.
properties_df_consis, global_attr_consis = consis_eval.to_pandas()
global_attr_consis.head()
| generated_mols | generated_surf_points | generated_surf_esp | generated_pharm_feats | molblocks | molblocks_post_opt | strain_energies | rmsds | SA_scores | logPs | ... | sims_esp_upper_bound | sims_surf_lower_bound | sims_esp_lower_bound | sims_pharm_lower_bound | sims_surf_consistent_relax | sims_esp_consistent_relax | sims_pharm_consistent_relax | sims_surf_consistent_relax_optimal | sims_esp_consistent_relax_optimal | sims_pharm_consistent_relax_optimal | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.75427424835602... | [[-0.5813803, -1.4873081, -2.4721417], [-1.619... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | ([3], [[-6.558661541644639e-08, -1.08085862537... | \n RDKit 3D\n\n 8 7 0 0 0 0... | \n RDKit 3D\n\n 8 7 0 0 0 0... | 0.000197 | 0.005077 | 2.747568 | 1.0262 | ... | 0.977656 | NaN | NaN | NaN | 0.976818 | 0.976816 | 1.000000 | 0.977372 | 0.977369 | 0.999995 |
| 1 | ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.23450... | [[-1.6486647, -2.8718212, 1.0219024], [-1.1368... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | ([3], [[0.004617025132251855, 0.00548270018345... | \n RDKit 3D\n\n 11 10 0 0 0 0... | \n RDKit 3D\n\n 11 10 0 0 0 0... | 0.000214 | 0.002903 | 1.754957 | 1.4163 | ... | 0.971617 | NaN | NaN | NaN | 0.962465 | 0.962464 | 0.999997 | 0.963093 | 0.963092 | 0.999997 |
| 2 | ([6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [... | [[-3.3231347, -1.6956815, 0.9845177], [-3.3837... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | ([3, 3], [[-1.3538755875425377, 0.129942825536... | \n RDKit 3D\n\n 14 13 0 0 0 0... | \n RDKit 3D\n\n 14 13 0 0 0 0... | 0.000287 | 0.003050 | 1.605723 | 1.8064 | ... | 0.963071 | NaN | NaN | NaN | 0.969413 | 0.969412 | 0.999997 | 0.969788 | 0.969787 | 0.999994 |
| 3 | ([6, 6, 1, 1, 1, 1, 1, 1], [[0.754557611047011... | [[0.6699976, -3.0041163, 0.48550072], [0.25443... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | ([3], [[-2.2718348313688352e-09, 1.48791622940... | \n RDKit 3D\n\n 8 7 0 0 0 0... | \n RDKit 3D\n\n 8 7 0 0 0 0... | 0.000178 | 0.004375 | 2.747568 | 1.0262 | ... | 0.975956 | NaN | NaN | NaN | 0.976338 | 0.976336 | 1.000000 | 0.977218 | 0.977217 | 0.999992 |
| 4 | ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[1.264255... | [[1.5423033, -2.574833, 1.6369876], [1.8832769... | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... | ([3], [[-0.012158210270805473, -0.127197368223... | \n RDKit 3D\n\n 11 10 0 0 0 0... | \n RDKit 3D\n\n 11 10 0 0 0 0... | 0.000232 | 0.003398 | 1.754957 | 1.4163 | ... | 0.971013 | NaN | NaN | NaN | 0.966891 | 0.966890 | 0.999998 | 0.967209 | 0.967207 | 0.999992 |
5 rows × 30 columns
properties_df_consis
num_generated_mols 99
solvent None
probe_radius 1.2
pharm_multi_vector False
num_valid 198
num_valid_post_opt 198
num_consistent_graph 198
frac_valid 2.0
frac_valid_post_opt 2.0
frac_consistent 2.0
frac_unique 0.015152
frac_unique_post_opt 0.015152
avg_graph_diversity 0.11912
graph_similarity_matrix [[1.0, 0.2, 0.16666666666666666, 1.0, 0.2, 0.1...
avg_graph_diversity_post_opt 0.11912
graph_similarity_matrix_post_opt [[1.0, 0.2, 0.16666666666666666, 1.0, 0.2, 0.1...
dtype: object
Conditional evaluation#
This is used to evaluate if the generated is similar (based on shepherd_score 3D scoring functions) to the target/reference molecule. The ConditionalEvalPipeline simply iterates over all the generated molecules with the ConditionalEval class and stores the full evaluation. In addition to the properties calculated by ConfEval it also does score-based alignment so it is a slower operation.
rdkit_mol = embed_conformer_from_smiles('c1Cc2ccc(Cl)cc2C(=O)c1c3cc(N1nnc2cc(C)c(Cl)cc2c1=O)ccc3', MMFF_optimize=True)
# Again using MMFF94 partial charges
ref_molec = Molecule(rdkit_mol, num_surf_points=200, probe_radius=1.2, pharm_multi_vector=False)
cond_pipe = ConditionalEvalPipeline(ref_molec, generated_mols=generated_mols,
condition='all', num_surf_points=200,
pharm_multi_vector=False, solvent=None)
cond_pipe.evaluate(verbose=True, num_workers=1, mp_context='spawn')
Conditional Eval: 100%|█████████████████████████| 99/99 [02:09<00:00, 1.31s/it]
cond_pipe.evaluate(verbose=True, num_workers=8, mp_context='spawn')
Conditional Eval: 100%|█████████████████████████| 99/99 [00:21<00:00, 4.68it/s]
properties_df_cond, global_attr_cond = cond_pipe.to_pandas()
global_attr_cond.head()
| generated_mols | molblocks | molblocks_post_opt | strain_energies | rmsds | SA_scores | logPs | QEDs | fsp3s | SA_scores_post_opt | ... | sims_surf_target_relax | sims_esp_target_relax | sims_pharm_target_relax | sims_surf_target_relax_optimal | sims_esp_target_relax_optimal | sims_pharm_target_relax_optimal | sims_surf_target_relax_esp_aligned | sims_pharm_target_relax_esp_aligned | graph_similarities | graph_similarities_post_opt | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ([6, 6, 1, 1, 1, 1, 1, 1], [[-0.75427424835602... | \n RDKit 3D\n\n 8 7 0 0 0 0... | \n RDKit 3D\n\n 8 7 0 0 0 0... | 0.000197 | 0.005077 | 2.747568 | 1.0262 | 0.372786 | 1.0 | 2.747568 | ... | 0.136365 | 0.135720 | 0.010560 | 0.158949 | 0.158310 | 0.075350 | 0.158950 | 0.007106 | 0.011905 | 0.011905 |
| 1 | ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[-1.23450... | \n RDKit 3D\n\n 11 10 0 0 0 0... | \n RDKit 3D\n\n 11 10 0 0 0 0... | 0.000214 | 0.002903 | 1.754957 | 1.4163 | 0.385471 | 1.0 | 1.754957 | ... | 0.166879 | 0.166159 | 0.011084 | 0.207935 | 0.207234 | 0.075341 | 0.207939 | 0.068363 | 0.011628 | 0.011628 |
| 2 | ([6, 6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [... | \n RDKit 3D\n\n 14 13 0 0 0 0... | \n RDKit 3D\n\n 14 13 0 0 0 0... | 0.000287 | 0.003050 | 1.605723 | 1.8064 | 0.431024 | 1.0 | 1.605723 | ... | 0.175781 | 0.175023 | 0.015852 | 0.239678 | 0.238713 | 0.077309 | 0.239671 | 0.069266 | 0.011494 | 0.011494 |
| 3 | ([6, 6, 1, 1, 1, 1, 1, 1], [[0.754557611047011... | \n RDKit 3D\n\n 8 7 0 0 0 0... | \n RDKit 3D\n\n 8 7 0 0 0 0... | 0.000178 | 0.004375 | 2.747568 | 1.0262 | 0.372786 | 1.0 | 2.747568 | ... | 0.138604 | 0.137975 | 0.010560 | 0.203743 | 0.203575 | 0.075350 | 0.203740 | 0.073533 | 0.011905 | 0.011905 |
| 4 | ([6, 6, 6, 1, 1, 1, 1, 1, 1, 1, 1], [[1.264255... | \n RDKit 3D\n\n 11 10 0 0 0 0... | \n RDKit 3D\n\n 11 10 0 0 0 0... | 0.000232 | 0.003398 | 1.754957 | 1.4163 | 0.385471 | 1.0 | 1.754957 | ... | 0.155630 | 0.154976 | 0.008322 | 0.209363 | 0.208614 | 0.075337 | 0.209363 | 0.005040 | 0.011628 | 0.011628 |
5 rows × 26 columns
properties_df_cond
num_generated_mols 99
solvent None
pharm_multi_vector False
condition all
num_surf_points 200
lam 0.3
lam_scaled 62.206048
ref_molblock \n RDKit 3D\n\n 46 50 0 0 0 0...
ref_mol_SA_score 2.746132
ref_mol_QED 0.422032
ref_mol_logP 6.78332
ref_mol_fsp3 0.083333
ref_mol_morgan_fp [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
ref_surf_resampling_scores [0.8396994276889428, 0.8444193298291229, 0.854...
ref_surf_esp_resampling_scores [0.8396284268748854, 0.844353854359548, 0.8546...
sims_surf_upper_bound 0.862957
sims_esp_upper_bound 0.862882
num_valid 198
num_valid_post_opt 198
num_consistent_graph 198
frac_valid 2.0
frac_valid_post_opt 2.0
frac_consistent 2.0
frac_unique 0.015152
frac_unique_post_opt 0.015152
avg_graph_diversity 0.988324
dtype: object