schemist package
Submodules
schemist.cleaning module
Chemical structure cleaning routines.
- schemist.cleaning.clean_selfies(selfies: str, *args, **kwargs) -> str
Sanitize a SELFIES string or list of SELFIES strings.
schemist.cli module
Command-line interface for schemist.
schemist.collating module
Tools to collate chemical data files.
- schemist.collating.collate_inventory(catalog: DataFrame, root_dir: str | None = None, drop_invalid: bool = True, drop_unmapped: bool = False, catalog_smiles_column: str = 'input_smiles', id_column_name: str | None = None, id_n_digits: int = 8, id_prefix: str = '') -> DataFrame
- schemist.collating.collate_inventory_from_file(catalog_path: str | TextIO, root_dir: str | None = None, format: str | None = None, *args, **kwargs) -> DataFrame
schemist.converting module
Converting between chemical representation formats.
- schemist.converting.convert_string_representation(strings: Iterable[str] | str, input_representation: str = 'smiles', output_representation: Iterable[str] | str = 'smiles', **kwargs) -> str | None | Iterable[str | None] | Dict[str, str | None | Iterable[str | None]]
Convert between string representations of chemical structures.
schemist.features module
Tools for generating chemical features.
- schemist.features.calculate_2d_features(strings: Iterable[str] | str, normalized: bool = True, histogram_normalized: bool = True, return_dataframe: bool = False, *args, **kwargs) -> DataFrame | Tuple[ndarray, ndarray]
Calculate 2d features from string representation.
- Parameters:
strings (str) – Input string representation(s).
input_representation (str) – Representation type
normalized (bool, optional) – Whether to return normalized features. Default: True.
histogram_normalized (bool, optional) – Whether to return histogram normalized features (faster). Default: True.
return_dataframe (bool, optional) – Whether to return a pandas DataFrame instead of NumPy arrays. Default: False.
- Returns:
If return_dataframe = True, a DataFrame with named feature columns, and the final column called “meta_feature_valid” being the validity indicator. Otherwise returns a tuple of Arrays with the first being the matrix of features and the second being the vector of validity indicators.
- Return type:
DataFrame, Tuple of numpy Arrays
Examples
>>> features, validity = calculate_2d_features(strings='CCC')
>>> features[:,:3]
array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05]])
>>> validity
array([1.])
>>> features, validity = calculate_2d_features(strings=['CCC', 'CCCO'])
>>> features[:,:3]
array([[4.22879602e-01, 1.30009101e-04, 2.00014001e-05],
       [7.38891722e-01, 6.00042003e-04, 5.00035002e-05]])
>>> validity
array([1., 1.])
>>> calculate_2d_features(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
CCC     True
CCCO    True
Name: meta_feature_valid, dtype: bool
>>> ## Unusual valence
>>> s = "O=S(=O)(OCC1OC(OC2(COS(=O)(=O)O[AlH3](O)O)OC(COS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C2OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C1OS(=O)(=O)O[AlH3](O)O)O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O"
>>> calculate_2d_features(strings=s)[0].shape
(1, 200)
>>> s = 'CCc1c(C(=O)N2CC(c3nnc4c3CCC4)C2)nc(C)c1C(=O)OC'
>>> calculate_2d_features(strings=s)[1]
array([1.])
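The tuple return form pairs each row of the feature matrix with one validity indicator. The following is a minimal sketch of dropping rows flagged as invalid, using plain Python lists as stand-ins for the NumPy arrays. The helper name `keep_valid_rows` and the toy values are hypothetical and not part of schemist.

```python
def keep_valid_rows(features, validity):
    """Keep only the feature rows whose validity indicator equals 1."""
    # Plain lists stand in for the (features, validity) arrays here.
    return [row for row, ok in zip(features, validity) if ok == 1.0]


# Toy data: three molecules, the second flagged invalid (values are made up).
features = [[0.42, 0.13], [0.0, 0.0], [0.74, 0.60]]
validity = [1.0, 0.0, 1.0]

valid_rows = keep_valid_rows(features, validity)
print(len(valid_rows))  # 2
```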
- schemist.features.calculate_3d_features(strings: Iterable[str] | str, seed: int = 42, return_dataframe: bool = False, *args, **kwargs) -> DataFrame | Tuple[ndarray, ndarray]
Calculate 3d features from string representation.
- Parameters:
strings (str) – Input string representation(s).
input_representation (str) – Representation type
seed (int) – Seed for reproducible randomness
return_dataframe (bool, optional) – Whether to return a pandas DataFrame instead of NumPy arrays. Default: False.
- Returns:
If return_dataframe = True, a DataFrame with named feature columns, and the final column called “meta_feature_valid” being the validity indicator. Otherwise returns a tuple of Arrays with the first being the matrix of features and the second being the vector of validity indicators.
- Return type:
DataFrame, Tuple of numpy Arrays
Examples
>>> features, validity = calculate_3d_features(strings='CCC')
>>> features.shape
(1, 11)
>>> sum(validity)
1
>>> features, validity = calculate_3d_features(strings=['CCC', 'CCCO'])
>>> features.shape
(2, 11)
>>> sum(validity)
2
>>> calculate_3d_features(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
CCC     True
CCCO    True
Name: meta_feature_valid, dtype: bool
>>> ## Unusual valence
>>> s = "O=S(=O)(OCC1OC(OC2(COS(=O)(=O)O[AlH3](O)O)OC(COS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C2OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C(OS(=O)(=O)O[AlH3](O)O)C1OS(=O)(=O)O[AlH3](O)O)O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O.O[AlH3](O)O"
>>> calculate_3d_features(strings=s)[0].shape
(1, 0)
>>> s = 'CCc1c(C(=O)N2CC(c3nnc4c3CCC4)C2)nc(C)c1C(=O)OC'
>>> sum(calculate_3d_features(strings=s)[1])
1
- schemist.features.calculate_feature(feature_type: str | Iterable[str] = 'all', return_dataframe: bool = False, *args, **kwargs) -> DataFrame | Tuple[ndarray, ndarray]
Calculate the binary fingerprint or descriptor vector of string representation(s).
Examples
>>> calculate_feature("2d", strings=['CCC', 'CCO'])[0].shape
(2, 200)
>>> calculate_feature("3d", strings=['CCC', 'CCO'])[0].shape
(2, 11)
>>> calculate_feature("fp", on_bits=False, strings=['CCC', 'CCO'])[0].shape
(2, 2048)
>>> calculate_feature("all", on_bits=False, strings=['CCC', 'CCO'])[0].shape
(2, 2259)
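The shapes in these examples are consistent with "all" being a column-wise concatenation of the three feature types, since 200 + 11 + 2048 = 2259. The sketch below checks that arithmetic and derives a column slice for each block. The 2d, 3d, fp ordering and the helper name `block_slices` are assumptions; the documentation does not state how the blocks are ordered.

```python
# Column counts taken from the example output shapes above.
FEATURE_WIDTHS = {"2d": 200, "3d": 11, "fp": 2048}


def block_slices(widths):
    """Map each feature type to its column slice in a concatenated matrix.

    Assumes blocks are concatenated in dict order, which is an assumption
    and not something the schemist documentation confirms.
    """
    slices, start = {}, 0
    for name, width in widths.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


slices, total_width = block_slices(FEATURE_WIDTHS)
print(total_width)   # 2259
print(slices["fp"])  # slice(211, 2259, None)
```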
- schemist.features.calculate_fingerprints(strings: Iterable[str] | str, fp_type: str = 'morgan', radius: int = 2, chiral: bool = True, on_bits: bool = True, return_dataframe: bool = False, *args, **kwargs) -> DataFrame | Tuple[ndarray, ndarray]
Calculate the binary fingerprint of string representation(s).
Only Morgan fingerprints are allowed.
- Parameters:
strings (str) – Input string representation(s).
input_representation (str) – Representation type
fp_type (str, optional) – Which fingerprint type to calculate. Default: ‘morgan’.
radius (int, optional) – Atom radius for fingerprints. Default: 2.
chiral (bool, optional) – Whether to take chirality into account. Default: True.
on_bits (bool, optional) – Whether to return the non-zero indices instead of the full binary vector. Default: True.
return_dataframe (bool, optional) – Whether to return a pandas DataFrame instead of NumPy arrays. Default: False.
- Returns:
If return_dataframe = True, a DataFrame with named feature columns, and the final column called “meta_feature_valid” being the validity indicator. Otherwise returns a tuple of Arrays with the first being the matrix of features and the second being the vector of validity indicators.
- Return type:
DataFrame, Tuple of numpy Arrays
- Raises:
NotImplementedError – If fp_type is not ‘morgan’.
Examples
>>> import numpy as np
>>> bits, validity = calculate_fingerprints(strings='CCC')
>>> bits.tolist()
[['80;294;1057;1344']]
>>> sum(validity)
1
>>> bits, validity = calculate_fingerprints(strings=['CCC', 'CCCO'])
>>> bits.tolist()
[['80;294;1057;1344'], ['80;222;294;473;794;807;1057;1277']]
>>> sum(validity)
2
>>> np.sum(calculate_fingerprints(strings=['CCC', 'CCCO'], on_bits=False)[0], axis=-1)
array([4, 8])
>>> calculate_fingerprints(strings=['CCC', 'CCCO'], return_dataframe=True).meta_feature_valid
CCC     True
CCCO    True
Name: meta_feature_valid, dtype: bool
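With on_bits=True, each fingerprint in the examples is a single string of semicolon-separated indices of the set bits. A minimal sketch of expanding such a string into a full binary vector follows. The helper name `on_bits_to_vector` is hypothetical, and the length of 2048 is an assumption based on the fingerprint width shown in the calculate_feature examples.

```python
def on_bits_to_vector(on_bits, n_bits=2048):
    """Expand a ';'-separated string of set-bit indices into a 0/1 list."""
    vector = [0] * n_bits
    if on_bits:  # an empty string means no bits are set
        for index in on_bits.split(";"):
            vector[int(index)] = 1
    return vector


# On-bit string for 'CCC', copied from the example output above.
propane = on_bits_to_vector("80;294;1057;1344")
print(len(propane), sum(propane))  # 2048 4
```

The number of set bits matches the row sums in the on_bits=False example (4 for 'CCC', 8 for 'CCCO').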
schemist.generating module
Tools for enumerating compounds. Currently only works with peptides.
- schemist.generating.all_peptides_in_length_range(max_length: int, min_length: int = 1, by: int = 1, alphabet: Iterable[str] | None = None, d_aa_only: bool = False, include_d_aa: bool = False, *args, **kwargs) -> Iterable[str]
- schemist.generating.all_peptides_of_one_length(length: int, alphabet: Iterable[str] | None = None, d_aa_only: bool = False, include_d_aa: bool = False) -> Iterable[str]
- schemist.generating.random() -> x in the interval [0, 1).
- schemist.generating.react(strings: str | Iterable[str], reaction: str = 'N_to_C_cyclization', output_representation: str = 'smiles', **kwargs) -> str | Iterable[str]
- schemist.generating.sample_peptides_in_length_range(max_length: int, min_length: int = 1, by: int = 1, n: float | int | None = None, alphabet: Iterable[str] | None = None, d_aa_only: bool = False, include_d_aa: bool = False, naive_sampling_cutoff: float = 0.005, reservoir_sampling: bool = True, indexes: Iterable[int] | None = None, set_seed: int | None = None, *args, **kwargs) -> Iterable[str]
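The enumeration functions above carry no description, so the following is only an illustration of the general idea of listing every peptide sequence over an alphabet, not schemist's implementation. The function names, the 20-letter default alphabet, and the choice to treat max_length as inclusive are all assumptions.

```python
from itertools import product

# The 20 canonical amino acids in one-letter code (assumed default alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def peptides_of_one_length(length, alphabet=AMINO_ACIDS):
    """Lazily yield every sequence of `length` residues drawn from `alphabet`."""
    for residues in product(alphabet, repeat=length):
        yield "".join(residues)


def peptides_in_length_range(max_length, min_length=1, by=1, alphabet=AMINO_ACIDS):
    """Yield sequences of length min_length, min_length + by, ... up to max_length.

    Treating max_length as inclusive is an assumption made for this sketch.
    """
    for length in range(min_length, max_length + 1, by):
        yield from peptides_of_one_length(length, alphabet)


peptides = list(peptides_in_length_range(max_length=2))
print(len(peptides))  # 20 + 20**2 = 420
```

Because the number of sequences grows as the alphabet size raised to the length, generators keep memory use flat until the caller materializes the result.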
schemist.io module
Tools to facilitate input and output.
schemist.rest_lookup module
Tools for querying PubChem.
schemist.splitting module
Tools for splitting tabular datasets, optionally based on chemical features.
- schemist.splitting.random() -> x in the interval [0, 1).
- schemist.splitting.split(split_type: str, *args, **kwargs) -> DataSplits
- schemist.splitting.split_random(strings: str | Iterable[str], train: float = 1.0, test: float = 0.0, chunksize: int | None = None, set_seed: int | None = None, *args, **kwargs) -> DataSplits
- schemist.splitting.split_scaffold(strings: str | Iterable[str], train: float = 1.0, test: float = 0.0, chunksize: int | None = None, progress: bool = True, *args, **kwargs) -> DataSplits
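The split functions take train and test fractions and, for split_random, a seed. As a rough illustration of what a seeded random split by fraction can look like, here is a standard-library sketch. It is not schemist's implementation: the function name, the dictionary return value (standing in for DataSplits, whose structure is not documented here), and the rule that leftover items form a "validation" partition are all assumptions.

```python
import random


def random_split_indices(n_items, train=1.0, test=0.0, seed=None):
    """Shuffle item indices and partition them by the given fractions.

    Items not covered by `train` or `test` go to "validation" (an assumption).
    """
    rng = random.Random(seed)  # local generator, so global state is untouched
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_train = round(train * n_items)
    n_test = round(test * n_items)
    return {
        "train": indices[:n_train],
        "test": indices[n_train:n_train + n_test],
        "validation": indices[n_train + n_test:],
    }


splits = random_split_indices(10, train=0.7, test=0.2, seed=0)
print({name: len(members) for name, members in splits.items()})
```

Passing the same seed reproduces the same partition, which is the usual reason to expose a seed parameter.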
schemist.tables module
Tools for processing tabular data.
- schemist.tables.assign_groups(df: DataFrame, grouper: Callable[[str | Iterable[str]], Dict[str, Tuple[int]]], group_name: str = 'group', column: str = 'smiles', input_representation: str = 'smiles', *args, **kwargs) -> Tuple[Dict[str, Tuple[int]], DataFrame]
- schemist.tables.cleaner(df: DataFrame, column: str = 'smiles', input_representation: str = 'smiles', prefix: str | None = None) -> Tuple[Dict[str, int], DataFrame]
- schemist.tables.converter(df: DataFrame, column: str = 'smiles', input_representation: str = 'smiles', output_representation: str | Iterable[str] = 'smiles', prefix: str | None = None, options: Mapping[str, Any] | None = None) -> Tuple[Dict[str, int], DataFrame]
- schemist.tables.featurizer(df: DataFrame, feature_type: str, column: str = 'smiles', ids: str | Iterable[str] | None = None, input_representation: str = 'smiles', prefix: str | None = None) -> Tuple[Dict[str, int], DataFrame]
Generate a feature table based on a column of the input dataframe.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1,2,3], "b": ["A", "B", "C"], "smiles": ["C", "CCC", "CCCO"]})
>>> valid, fps = featurizer(df, "fp")
>>> fps.shape
(3, 5)
>>> featurizer(df, "fp", ids="b")[-1].shape
(3, 3)
>>> featurizer(df, "fp", ids=["a", "b"])[-1].shape
(3, 4)
>>> featurizer(df, "2d", ids=["a", "b"])[-1].shape
(3, 203)
schemist.typing module
Types used in schemist.