schemist package

Submodules

schemist.cleaning module

Chemical structure cleaning routines.

schemist.cleaning.clean_selfies(selfies: str, *args, **kwargs) → str[source]: Sanitize a SELFIES string or list of SELFIES strings.

schemist.cleaning.clean_smiles(smiles: str, *args, **kwargs) → str[source]: Sanitize a SMILES string or list of SMILES strings.

schemist.cleaning.sanitize_smiles_to_mol(s: str) → Mol[source]: Convert a SMILES string to a Mol object, applying sanifix5.

schemist.cli module

Command-line interface for schemist.

schemist.cli.main() → None[source]

schemist.collating module

Tools to collate chemical data files.

schemist.collating.collate_inventory(catalog: DataFrame, root_dir: str | None = None, drop_invalid: bool = True, drop_unmapped: bool = False, catalog_smiles_column: str = 'input_smiles', id_column_name: str | None = None, id_n_digits: int = 8, id_prefix: str = '') → DataFrame[source]

schemist.collating.collate_inventory_from_file(catalog_path: str | TextIO, root_dir: str | None = None, format: str | None = None, *args, **kwargs) → DataFrame[source]

schemist.collating.deduplicate(df: DataFrame, column: str = 'smiles', input_representation: str = 'smiles', index_columns: List[str] | None = None, drop_inchikey: bool = False) → DataFrame[source]

schemist.collating.deduplicate_file(filename: str | TextIO, format: str | None = None, *args, **kwargs) → DataFrame[source]

schemist.converting module

Converting between chemical representation formats.

schemist.converting.convert_string_representation(strings: Iterable[str] | str, input_representation: str = 'smiles', output_representation: Iterable[str] | str = 'smiles', **kwargs) → str | None | Iterable[str | None] | Dict[str, str | None | Iterable[str | None]][source]: Convert between string representations of chemical structures.

schemist.converting.mini_helm2helm(s: str) → List[str][source]

schemist.features module

Tools for generating chemical features.

schemist.features.calculate_2d_features(strings: Iterable[str] | str, normalized: bool = True, histogram_normalized: bool = True, return_dataframe: bool = False, *args, **kwargs) → DataFrame | Tuple[ndarray, ndarray][source]

Calculate 2d features from string representation.

Parameters:

strings (str or iterable of str) – Input string representation(s).
input_representation (str) – Representation type of the input strings.
normalized (bool, optional) – Whether to return normalized features. Default: True.
histogram_normalized (bool, optional) – Whether to return histogram-normalized features (faster). Default: True.
return_dataframe (bool, optional) – Whether to return a pandas DataFrame instead of NumPy arrays. Default: False.

Returns:

If return_dataframe is True, a DataFrame with named feature columns whose final column, “meta_feature_valid”, is the validity indicator. Otherwise, a tuple of two arrays: the feature matrix, followed by the vector of validity indicators.

Return type:

DataFrame or tuple of NumPy arrays

Examples

>>> features, validity = calculate_2d_features(strings='CCC')
>>> features[:,:3]
array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05]])
>>> validity
array([1.])
>>> features, validity = calculate_2d_features(strings=['CCC', 'CCCO'])
>>> features[:,:3]
array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05],
       [7.38891722e-01, 6.00042003e-04, 5.00035002e-05]])
>>> validity
array([1., 1.])
>>> calculate_2d_features(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
CCC     True
CCCO    True
Name: meta_feature_valid, dtype: bool
>>> ## Unusual valence
>>> s = "O=S(=O)(OCC1OC(OC2(COS(=O)(=O)O[AlH3](O)O)OC(COS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C2OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C1OS(=O)(=O)O[AlH3](O)O)O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O"
>>> calculate_2d_features(strings=s)[0].shape
(1, 200)
>>> s = 'CCc1c(C(=O)N2CC(c3nnc4c3CCC4)C2)nc(C)c1C(=O)OC'
>>> calculate_2d_features(strings=s)[1]
array([1.])

schemist.features.calculate_3d_features(strings: Iterable[str] | str, seed: int = 42, return_dataframe: bool = False, *args, **kwargs) → DataFrame | Tuple[ndarray, ndarray][source]

Calculate 3d features from string representation.

Parameters:

strings (str or iterable of str) – Input string representation(s).
input_representation (str) – Representation type of the input strings.
seed (int, optional) – Seed for reproducible randomness. Default: 42.
return_dataframe (bool, optional) – Whether to return a pandas DataFrame instead of NumPy arrays. Default: False.

Returns:

If return_dataframe is True, a DataFrame with named feature columns whose final column, “meta_feature_valid”, is the validity indicator. Otherwise, a tuple of two arrays: the feature matrix, followed by the vector of validity indicators.

Return type:

DataFrame or tuple of NumPy arrays

Examples

>>> features, validity = calculate_3d_features(strings='CCC')
>>> features.shape
(1, 11)
>>> sum(validity)
1
>>> features, validity = calculate_3d_features(strings=['CCC', 'CCCO'])
>>> features.shape
(2, 11)
>>> sum(validity)
2
>>> calculate_3d_features(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
CCC     True
CCCO    True
Name: meta_feature_valid, dtype: bool
>>> ## Unusual valence
>>> s = "O=S(=O)(OCC1OC(OC2(COS(=O)(=O)O[AlH3](O)O)OC(COS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C2OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C1OS(=O)(=O)O[AlH3](O)O)O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O"
>>> calculate_3d_features(strings=s)[0].shape
(1, 0)
>>> s = 'CCc1c(C(=O)N2CC(c3nnc4c3CCC4)C2)nc(C)c1C(=O)OC'
>>> sum(calculate_3d_features(strings=s)[1])
1

schemist.features.calculate_feature(feature_type: str | Iterable[str] = 'all', return_dataframe: bool = False, *args, **kwargs) → DataFrame | Tuple[ndarray, ndarray][source]

Calculate the binary fingerprint or descriptor vector of string representation(s).

Examples

>>> calculate_feature("2d", strings=['CCC', 'CCO'])[0].shape
(2, 200)
>>> calculate_feature("3d", strings=['CCC', 'CCO'])[0].shape
(2, 11)
>>> calculate_feature("fp", on_bits=False, strings=['CCC', 'CCO'])[0].shape
(2, 2048)
>>> calculate_feature("all", on_bits=False, strings=['CCC', 'CCO'])[0].shape
(2, 2259)

schemist.features.calculate_fingerprints(strings: Iterable[str] | str, fp_type: str = 'morgan', radius: int = 2, chiral: bool = True, on_bits: bool = True, return_dataframe: bool = False, *args, **kwargs) → DataFrame | Tuple[ndarray, ndarray][source]

Calculate the binary fingerprint of string representation(s).

Only Morgan fingerprints are allowed.

Parameters:

strings (str or iterable of str) – Input string representation(s).
input_representation (str) – Representation type of the input strings.
fp_type (str, optional) – Which fingerprint type to calculate. Default: ‘morgan’.
radius (int, optional) – Atom radius for fingerprints. Default: 2.
chiral (bool, optional) – Whether to take chirality into account. Default: True.
on_bits (bool, optional) – Whether to return the non-zero indices instead of the full binary vector. Default: True.
return_dataframe (bool, optional) – Whether to return a pandas DataFrame instead of NumPy arrays. Default: False.

Returns:

If return_dataframe is True, a DataFrame with named feature columns whose final column, “meta_feature_valid”, is the validity indicator. Otherwise, a tuple of two arrays: the feature matrix, followed by the vector of validity indicators.

Return type:

DataFrame or tuple of NumPy arrays

Raises:

NotImplementedError – If fp_type is not ‘morgan’.

Examples

>>> bits, validity = calculate_fingerprints(strings='CCC')
>>> bits.tolist()
[['80;294;1057;1344']]
>>> sum(validity)  
1
>>> bits, validity = calculate_fingerprints(strings=['CCC', 'CCCO'])
>>> bits.tolist()
[['80;294;1057;1344'], ['80;222;294;473;794;807;1057;1277']]
>>> sum(validity)  
2
>>> np.sum(calculate_fingerprints(strings=['CCC', 'CCCO'], on_bits=False)[0], axis=-1)
array([4, 8])
>>> calculate_fingerprints(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
CCC     True
CCCO    True
Name: meta_feature_valid, dtype: bool
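
In the examples above, on_bits=True encodes each fingerprint as a semicolon-separated string of set-bit indices. As a minimal sketch, assuming a 2048-bit vector (the width shown for "fp" in the calculate_feature examples), such a string could be expanded into a dense 0/1 list as follows. The helper name on_bits_to_dense is hypothetical and is not part of schemist.

```python
from typing import List


def on_bits_to_dense(on_bits: str, n_bits: int = 2048) -> List[int]:
    """Expand an 'i;j;k' string of set-bit indices into a dense 0/1 list.

    Hypothetical helper for illustration only; not part of schemist.
    The default width of 2048 is an assumption taken from the documented shapes.
    """
    dense = [0] * n_bits
    if on_bits:  # an empty string means no bits are set
        for index in on_bits.split(";"):
            dense[int(index)] = 1
    return dense


dense = on_bits_to_dense("80;294;1057;1344")
print(len(dense), sum(dense))  # 2048 4
```

The four set bits agree with the row sum reported for 'CCC' when on_bits=False.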

schemist.features.smiles_to_3d(smiles: str, seed: int = 42) → ndarray[source]

schemist.generating module

Tools for enumerating compounds. Currently only works with peptides.

schemist.generating.all_peptides_in_length_range(max_length: int, min_length: int = 1, by: int = 1, alphabet: Iterable[str] | None = None, d_aa_only: bool = False, include_d_aa: bool = False, *args, **kwargs) → Iterable[str][source]

schemist.generating.all_peptides_of_one_length(length: int, alphabet: Iterable[str] | None = None, d_aa_only: bool = False, include_d_aa: bool = False) → Iterable[str][source]
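
The signatures above suggest exhaustive enumeration over an amino-acid alphabet. A rough sketch of that idea, using only the standard library, is shown here. The function name enumerate_peptides, the one-letter default alphabet, and the plain-string output are all assumptions for illustration; the source does not state how schemist represents the peptides it yields.

```python
from itertools import product
from typing import Iterable, Iterator

# Assumption: one-letter codes for the 20 canonical amino acids.
DEFAULT_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_peptides(length: int,
                       alphabet: Iterable[str] = DEFAULT_ALPHABET) -> Iterator[str]:
    """Lazily yield every sequence of `length` residues over `alphabet`.

    Hypothetical helper for illustration only; not schemist's implementation.
    """
    for residues in product(tuple(alphabet), repeat=length):
        yield "".join(residues)


dipeptides = list(enumerate_peptides(2, alphabet="AG"))
print(dipeptides)  # ['AA', 'AG', 'GA', 'GG']
```

The number of sequences grows as len(alphabet) ** length, which is why a generator is used here instead of a fully materialized list.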

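schemist.generating.random() → float: Return a random float x in the interval [0, 1). The docstring matches that of the standard library's random.random, so this is likely that function exposed through the module namespace.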

schemist.generating.react(strings: str | Iterable[str], reaction: str = 'N_to_C_cyclization', output_representation: str = 'smiles', **kwargs) → str | Iterable[str][source]

schemist.generating.sample_peptides_in_length_range(max_length: int, min_length: int = 1, by: int = 1, n: float | int | None = None, alphabet: Iterable[str] | None = None, d_aa_only: bool = False, include_d_aa: bool = False, naive_sampling_cutoff: float = 0.005, reservoir_sampling: bool = True, indexes: Iterable[int] | None = None, set_seed: int | None = None, *args, **kwargs) → Iterable[str][source]
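
The reservoir_sampling flag above refers to a family of one-pass sampling techniques. The sketch below shows the classic variant (Algorithm R), which draws a fixed-size uniform sample from a stream without knowing its length in advance. It illustrates the general technique only: the function name reservoir_sample is hypothetical, and the source does not say which variant schemist uses.

```python
import random
from itertools import product
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], n: int,
                     seed: Optional[int] = None) -> List[T]:
    """Draw up to n items uniformly at random from a stream in one pass.

    Hypothetical helper implementing Algorithm R; not schemist's code.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < n:
            reservoir.append(item)  # fill the reservoir with the first n items
        else:
            j = rng.randint(0, i)  # uniform over 0..i, both ends inclusive
            if j < n:
                reservoir[j] = item  # replace a slot with probability n / (i + 1)
    return reservoir


# Sample 3 of the 8 possible tripeptides over a toy two-letter alphabet.
stream = ("".join(p) for p in product("AG", repeat=3))
sample = reservoir_sample(stream, n=3, seed=0)
print(len(sample))  # 3
```

Because the stream is consumed lazily, memory use is bounded by n even when the full set of candidate sequences is too large to hold at once.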

schemist.io module

Tools to facilitate input and output.

schemist.io.read_sdf(filename: str | TextIO)[source]

schemist.io.read_weird_xml(filename: str | TextIO, header: bool = True, namespace: str = '{urn:schemas-microsoft-com:office:spreadsheet}') → DataFrame[source]

schemist.rest_lookup module

Tools for querying PubChem.

schemist.splitting module

Tools for splitting tabular datasets, optionally based on chemical features.

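schemist.splitting.random() → float: Return a random float x in the interval [0, 1). The docstring matches that of the standard library's random.random, so this is likely that function exposed through the module namespace.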

schemist.splitting.split(split_type: str, *args, **kwargs) → DataSplits[source]

schemist.splitting.split_random(strings: str | Iterable[str], train: float = 1.0, test: float = 0.0, chunksize: int | None = None, set_seed: int | None = None, *args, **kwargs) → DataSplits[source]
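
As a minimal sketch of what a seeded random split could look like, the example below shuffles row indices and slices them by fraction. Several things here are assumptions not confirmed by the source: that the validation set is whatever remains after train and test, that the splits hold row indices, and the function name random_split_indices itself. The DataSplits defined here is a local stand-in that only mirrors the documented field order.

```python
import random
from collections import namedtuple
from typing import Optional

# Local stand-in mirroring the documented fields of schemist.typing.DataSplits.
DataSplits = namedtuple("DataSplits", ["train", "test", "validation"])


def random_split_indices(n_items: int, train: float = 0.8, test: float = 0.1,
                         seed: Optional[int] = None) -> DataSplits:
    """Shuffle indices 0..n_items-1 and slice them into three disjoint groups.

    Hypothetical helper for illustration only; not schemist's implementation.
    Assumes the validation fraction is the remainder, 1 - train - test.
    """
    if train < 0 or test < 0 or train + test > 1.0 + 1e-9:
        raise ValueError("train and test must be non-negative and sum to at most 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(train * n_items)
    n_test = round(test * n_items)
    return DataSplits(
        train=indices[:n_train],
        test=indices[n_train:n_train + n_test],
        validation=indices[n_train + n_test:],
    )


splits = random_split_indices(10, train=0.8, test=0.1, seed=0)
print(len(splits.train), len(splits.test), len(splits.validation))  # 8 1 1
```

Slicing a single shuffled list guarantees that the three groups are disjoint and together cover every index.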

schemist.splitting.split_scaffold(strings: str | Iterable[str], train: float = 1.0, test: float = 0.0, chunksize: int | None = None, progress: bool = True, *args, **kwargs) → DataSplits[source]

schemist.tables module

Tools for processing tabular data.

schemist.tables.assign_groups(df: DataFrame, grouper: Callable[[str | Iterable[str]], Dict[str, Tuple[int]]], group_name: str = 'group', column: str = 'smiles', input_representation: str = 'smiles', *args, **kwargs) → Tuple[Dict[str, Tuple[int]], DataFrame][source]

schemist.tables.cleaner(df: DataFrame, column: str = 'smiles', input_representation: str = 'smiles', prefix: str | None = None) → Tuple[Dict[str, int], DataFrame][source]

schemist.tables.converter(df: DataFrame, column: str = 'smiles', input_representation: str = 'smiles', output_representation: str | Iterable[str] = 'smiles', prefix: str | None = None, options: Mapping[str, Any] | None = None) → Tuple[Dict[str, int], DataFrame][source]

schemist.tables.featurizer(df: DataFrame, feature_type: str, column: str = 'smiles', ids: str | Iterable[str] | None = None, input_representation: str = 'smiles', prefix: str | None = None) → Tuple[Dict[str, int], DataFrame][source]

Generate a feature table based on a column of the input dataframe.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1,2,3], "b": ["A", "B", "C"], "smiles": ["C", "CCC", "CCCO"]})
>>> valid, fps = featurizer(df, "fp")
>>> fps.shape
(3, 5)
>>> featurizer(df, "fp", ids="b")[-1].shape
(3, 3)
>>> featurizer(df, "fp", ids=["a", "b"])[-1].shape
(3, 4)
>>> featurizer(df, "2d", ids=["a", "b"])[-1].shape
(3, 203)

schemist.tables.reactor(df: DataFrame, column: str = 'smiles', reaction: str | Iterable[str] = 'N_to_C_cyclization', prefix: str | None = None, *args, **kwargs) → Tuple[Dict[str, int], DataFrame][source]

schemist.tables.splitter(df: DataFrame, split_type: str = 'random', column: str = 'smiles', input_representation: str = 'smiles', *args, **kwargs) → Tuple[Dict[str, int], DataFrame][source]

schemist.typing module

Types used in schemist.

class schemist.typing.DataSplits(train, test, validation)

Bases: tuple

test: Alias for field number 1

train: Alias for field number 0

validation: Alias for field number 2

schemist.utils module

Module contents