Tokenizers (`depthcharge.tokenizers`)

`PeptideTokenizer(residues=None, replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')`

Bases: Tokenizer

A tokenizer for ProForma peptide sequences.

Parse and tokenize ProForma-compliant peptide sequences.

PARAMETER	DESCRIPTION
`residues`	Residues and modifications to add to the vocabulary beyond the standard 20 amino acids. TYPE: `dict[str, float]` DEFAULT: `None`
`replace_isoleucine_with_leucine`	Replace I with L residues, because they are isomeric and often indistinguishable by mass spectrometry. TYPE: `bool` DEFAULT: `False`
`reverse`	Reverse the sequence for tokenization, C-terminus to N-terminus. TYPE: `bool` DEFAULT: `False`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

ATTRIBUTE	DESCRIPTION
`residues`	The residues and modifications and their associated masses. Terminal modifications are indicated by `-`. TYPE: `SortedDict[str, float]`
`index`	The mapping of residues and modifications to integer representations. TYPE: `SortedDict[str, int]`
`reverse_index`	The ordered residues and modifications where the list index is the integer representation for a token. TYPE: `list[None \| str]`
`start_token`	The start token. TYPE: `str`
`stop_token`	The stop token. TYPE: `str`
`start_int`	The integer representation of the start token. TYPE: `int`
`stop_int`	The integer representation of the stop token. TYPE: `int`
`padding_int`	The integer used to represent padding. TYPE: `int`
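The relationship between `index`, `reverse_index`, and `padding_int` can be illustrated with a minimal sketch. This is not the library's implementation: the three-residue vocabulary, the ordering of the stop token, and the use of plain `dict` and `list` objects are assumptions made for illustration. It relies only on what the tables above state, namely that padding is an integer reserved outside the vocabulary and that `reverse_index` may hold `None`.

```python
# Hypothetical residue masses (monoisotopic, in Da); a real vocabulary is larger.
residues = {"G": 57.021464, "A": 71.037114, "S": 87.032028}
stop_token = "$"

# Assumption: integer 0 is reserved for padding, so real tokens start at 1.
tokens = sorted(residues) + [stop_token]
index = {token: i for i, token in enumerate(tokens, start=1)}

# The reverse index is a list whose position is the integer code.
# Position 0 holds None because it stands for padding, not a token.
reverse_index = [None] + tokens

padding_int = 0
stop_int = index[stop_token]

print(index)
print(reverse_index)
```

With this layout, `reverse_index[index[token]] == token` holds for every token, which is what makes decoding a simple list lookup.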

Functions

`calculate_precursor_ions(tokens, charges)`

Calculate the m/z for precursor ions.

PARAMETER	DESCRIPTION
`tokens`	The tokens corresponding to the peptide sequence. TYPE: `torch.Tensor of shape (n_sequences, len_seq)`
`charges`	The charge state for each peptide. TYPE: `torch.Tensor of shape (n_sequences,)`

RETURNS	DESCRIPTION
`Tensor`	The monoisotopic m/z for each charged peptide.
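As a rough illustration of what a monoisotopic precursor m/z is, the sketch below computes it for an unmodified peptide in plain Python. It assumes the usual definition: the neutral mass is the sum of the residue masses plus one water, and each charge adds one proton. The helper `precursor_mz` and the small mass table are hypothetical and operate on strings, whereas the method documented above operates on token tensors.

```python
# Monoisotopic masses in Da for a few residues (illustrative subset).
RESIDUE_MASS = {"G": 57.021464, "A": 71.037114, "S": 87.032028}
WATER = 18.010565   # H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of one charge carrier


def precursor_mz(peptide, charge):
    """Return the monoisotopic m/z of an unmodified peptide string."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    # (M + z * proton) / z, written as M / z + proton.
    return neutral_mass / charge + PROTON


for z in (1, 2):
    print(z, round(precursor_mz("GAS", z), 4))
```

For the peptide `GAS`, the neutral mass is 233.101171 Da, so the singly and doubly charged precursors fall near 234.1084 and 117.5579 m/z respectively.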

`detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)`

Retrieve sequences from tokens.

PARAMETER	DESCRIPTION
`tokens`	The zero-padded tensor of integerized tokens to decode. TYPE: `torch.Tensor of shape (n_sequences, max_length)`
`join`	Whether to join the tokens of each sequence into a single string. TYPE: `bool` DEFAULT: `True`
`trim_start_token`	Remove the start token from the beginning of a sequence. TYPE: `bool` DEFAULT: `True`
`trim_stop_token`	Remove the stop token from the end of a sequence. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list[str] or list[list[str]]`	The decoded sequences, each as a string or a list of strings.
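The decoding step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: it uses nested lists instead of a `torch.Tensor`, a hypothetical four-token vocabulary, and it ignores the start token and the `reverse` option.

```python
def detokenize(batch, reverse_index, stop_token="$", join=True,
               trim_stop_token=True):
    """Decode zero-padded rows of integers back into token sequences."""
    decoded = []
    for row in batch:
        # Integer 0 is padding, so it never maps to a token.
        seq = [reverse_index[i] for i in row if i != 0]
        # Optionally drop a trailing stop token.
        if trim_stop_token and seq and seq[-1] == stop_token:
            seq = seq[:-1]
        decoded.append("".join(seq) if join else seq)
    return decoded


# Hypothetical vocabulary: position in the list is the integer code.
reverse_index = [None, "A", "G", "S", "$"]
batch = [[2, 1, 3, 4], [3, 4, 0, 0]]

print(detokenize(batch, reverse_index))
print(detokenize(batch, reverse_index, join=False))
```

Setting `join=False` keeps token boundaries visible, which matters once tokens are longer than one character, as with modified residues.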

`from_massivekb(replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')` `staticmethod`

Create a tokenizer with the observed peptide modifications.

Modifications are parsed from MassIVE-KB peptide strings and added to the vocabulary.

PARAMETER	DESCRIPTION
`replace_isoleucine_with_leucine`	Replace I with L residues, because they are isobaric and often indistinguishable by mass spectrometry. TYPE: `bool` DEFAULT: `False`
`reverse`	Reverse the sequence for tokenization, C-terminus to N-terminus. TYPE: `bool` DEFAULT: `False`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

RETURNS	DESCRIPTION
`MskbPeptideTokenizer`	A tokenizer for peptides with the observed modifications.

`from_proforma(sequences, replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')` `classmethod`

Create a tokenizer with the observed peptide modifications.

Modifications are parsed from ProForma 2.0-compliant peptide strings and added to the vocabulary.

PARAMETER	DESCRIPTION
`sequences`	The peptides from which to parse modifications. TYPE: `Iterable[str]`
`replace_isoleucine_with_leucine`	Replace I with L residues, because they are isobaric and often indistinguishable by mass spectrometry. TYPE: `bool` DEFAULT: `False`
`reverse`	Reverse the sequence for tokenization, C-terminus to N-terminus. TYPE: `bool` DEFAULT: `False`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

RETURNS	DESCRIPTION
`PeptideTokenizer`	A tokenizer for peptides with the observed modifications.

`split(sequence)`

Split a ProForma peptide sequence.

PARAMETER	DESCRIPTION
`sequence`	The peptide sequence. TYPE: `str`

RETURNS	DESCRIPTION
`list[str]`	The tokens that comprise the peptide sequence.
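To give a feel for what splitting a ProForma string involves, here is a minimal regular-expression sketch. It handles only a small subset of ProForma: single-letter residues, bracketed modifications attached to a residue, and an optional bracketed N-terminal modification followed by `-`. The function name `split_simple` is hypothetical, and the real method may parse more of the specification and may represent modifications differently in its output.

```python
import re

# One residue letter followed by zero or more bracketed modifications.
TOKEN_PATTERN = re.compile(r"[A-Z](?:\[[^\]]*\])*")
# An optional N-terminal modification such as "[Acetyl]-".
N_TERM_PATTERN = re.compile(r"(\[[^\]]*\])-")


def split_simple(sequence):
    """Split a simple ProForma-like string into residue tokens."""
    tokens = []
    n_term = N_TERM_PATTERN.match(sequence)
    if n_term:
        # Keep the trailing "-" to mark the token as terminal.
        tokens.append(n_term.group(1) + "-")
        sequence = sequence[n_term.end():]
    tokens.extend(TOKEN_PATTERN.findall(sequence))
    return tokens


print(split_simple("EM[Oxidation]K"))
print(split_simple("[Acetyl]-PEK"))
```

Because a modification stays attached to its residue, a modified residue such as `M[Oxidation]` becomes a single vocabulary entry distinct from `M`.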

`tokenize(sequences, add_start=False, add_stop=False, to_strings=False)`

Tokenize the input sequences.

PARAMETER	DESCRIPTION
`sequences`	The sequences to tokenize. TYPE: `Iterable[str] or str`
`add_start`	Prepend the start token to the beginning of the sequence. TYPE: `bool` DEFAULT: `False`
`add_stop`	Append the stop token to the end of the sequence. TYPE: `bool` DEFAULT: `False`
`to_strings`	Return each sequence as a list of token strings rather than a tensor. This is useful for debugging. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`torch.Tensor of shape (n_sequences, max_length) or list[list[str]]`	Either a tensor containing the integer values for each token, padded with 0s, or the list of tokens comprising each sequence.
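The padding behavior described in the return value can be sketched without any dependencies. The example below is an assumption-laden illustration, not the library's implementation: it returns nested lists rather than a tensor, treats every character as one token, and uses a hypothetical four-token `index`.

```python
def tokenize(sequences, index, stop_token="$", add_stop=False):
    """Map sequences to equal-length rows of integers, padded with 0."""
    if isinstance(sequences, str):
        sequences = [sequences]  # accept a single sequence as well
    encoded = []
    for seq in sequences:
        tokens = list(seq)  # simplification: one token per character
        if add_stop:
            tokens.append(stop_token)
        encoded.append([index[t] for t in tokens])
    # Right-pad every row with 0 up to the longest row in the batch.
    max_length = max(len(row) for row in encoded)
    return [row + [0] * (max_length - len(row)) for row in encoded]


# Hypothetical vocabulary; 0 is reserved for padding.
index = {"A": 1, "G": 2, "S": 3, "$": 4}
batch = tokenize(["GAS", "S"], index, add_stop=True)
print(batch)
```

The shorter sequence is padded to the length of the longest one, so the batch has shape `(n_sequences, max_length)` as stated above.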

`MoleculeTokenizer(selfies_vocab=None, start_token=None, stop_token='$')`

Bases: Tokenizer

A tokenizer for small molecules.

Tokenize SMILES and SELFIES representations of small molecules. SMILES are internally converted to SELFIES representations.

PARAMETER	DESCRIPTION
`selfies_vocab`	The SELFIES tokens to be considered. TYPE: `Iterable[str]` DEFAULT: `None`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

ATTRIBUTE	DESCRIPTION
`index`	The mapping of tokens to integer representations. TYPE: `SortedDict[str, int]`
`reverse_index`	The ordered tokens where the list index is the integer representation for a token. TYPE: `list[None \| str]`
`start_token`	The start token. TYPE: `str`
`stop_token`	The stop token. TYPE: `str`
`start_int`	The integer representation of the start token. TYPE: `int`
`stop_int`	The integer representation of the stop token. TYPE: `int`
`padding_int`	The integer used to represent padding. TYPE: `int`

Functions

`detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)`

Retrieve sequences from tokens.

PARAMETER	DESCRIPTION
`tokens`	The zero-padded tensor of integerized tokens to decode. TYPE: `torch.Tensor of shape (n_sequences, max_length)`
`join`	Whether to join the tokens of each sequence into a single string. TYPE: `bool` DEFAULT: `True`
`trim_start_token`	Remove the start token from the beginning of a sequence. TYPE: `bool` DEFAULT: `True`
`trim_stop_token`	Remove the stop token and anything following it from the sequence. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list[str] or list[list[str]]`	The decoded sequences, each as a string or a list of strings.

`from_selfies(selfies, start_token=None, stop_token='$')` `classmethod`

Learn the vocabulary from SELFIES strings.

PARAMETER	DESCRIPTION
`selfies`	Create a vocabulary from all unique tokens in these SELFIES strings. TYPE: `Iterable[str] \| str`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

RETURNS	DESCRIPTION
`MoleculeTokenizer`	The tokenizer restricted to the vocabulary present in the input SELFIES strings.

`from_smiles(smiles, start_token=None, stop_token='$')` `classmethod`

Learn the vocabulary from SMILES strings.

PARAMETER	DESCRIPTION
`smiles`	Create a vocabulary from all unique tokens in these SMILES strings. TYPE: `Iterable[str] \| str`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

RETURNS	DESCRIPTION
`MoleculeTokenizer`	The tokenizer restricted to the vocabulary present in the input SMILES strings.

`split(sequence)`

Split a SMILES or SELFIES string into SELFIES tokens.

PARAMETER	DESCRIPTION
`sequence`	The SMILES or SELFIES string representing a molecule. TYPE: `str`

RETURNS	DESCRIPTION
`list[str]`	The SELFIES tokens representing the molecule.
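SELFIES strings are sequences of bracketed symbols, so splitting one into tokens can be sketched with a single regular expression. The helper `split_selfies_simple` below is hypothetical and covers only input that is already SELFIES. The SMILES-to-SELFIES conversion mentioned above is left out, and the real method may rely on a dedicated SELFIES parser instead.

```python
import re

# A SELFIES symbol is a bracketed group; "." separates disconnected fragments.
SELFIES_TOKEN = re.compile(r"\[[^\]]*\]|\.")


def split_selfies_simple(selfies):
    """Split a SELFIES string into its bracketed symbols."""
    return SELFIES_TOKEN.findall(selfies)


# Ethanol written as SELFIES.
print(split_selfies_simple("[C][C][O]"))
```

A vocabulary learned from a set of molecules is then the set of unique symbols returned across all of their SELFIES strings.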

`tokenize(sequences, add_start=False, add_stop=False, to_strings=False)`

Tokenize the input sequences.

PARAMETER	DESCRIPTION
`sequences`	The sequences to tokenize. TYPE: `Iterable[str] or str`
`add_start`	Prepend the start token to the beginning of the sequence. TYPE: `bool` DEFAULT: `False`
`add_stop`	Append the stop token to the end of the sequence. TYPE: `bool` DEFAULT: `False`
`to_strings`	Return each sequence as a list of token strings rather than a tensor. This is useful for debugging. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`torch.Tensor of shape (n_sequences, max_length) or list[list[str]]`	Either a tensor containing the integer values for each token, padded with 0s, or the list of tokens comprising each sequence.

`Tokenizer(tokens, start_token=None, stop_token='$')`

Bases: ABC

An abstract base class for Depthcharge tokenizers.

PARAMETER	DESCRIPTION
`tokens`	The tokens to consider. TYPE: `Sequence[str]`
`start_token`	The start token to use. TYPE: `str` DEFAULT: `None`
`stop_token`	The stop token to use. TYPE: `str` DEFAULT: `'$'`

Functions

`detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)`

Retrieve sequences from tokens.

PARAMETER	DESCRIPTION
`tokens`	The zero-padded tensor of integerized tokens to decode. TYPE: `torch.Tensor of shape (n_sequences, max_length)`
`join`	Whether to join the tokens of each sequence into a single string. TYPE: `bool` DEFAULT: `True`
`trim_start_token`	Remove the start token from the beginning of a sequence. TYPE: `bool` DEFAULT: `True`
`trim_stop_token`	Remove the stop token and anything following it from the sequence. TYPE: `bool` DEFAULT: `True`

RETURNS	DESCRIPTION
`list[str] or list[list[str]]`	The decoded sequences, each as a string or a list of strings.

`split(sequence)` `abstractmethod`

Split a sequence into the constituent string tokens.

`tokenize(sequences, add_start=False, add_stop=False, to_strings=False)`

Tokenize the input sequences.

PARAMETER	DESCRIPTION
`sequences`	The sequences to tokenize. TYPE: `Iterable[str] or str`
`add_start`	Prepend the start token to the beginning of the sequence. TYPE: `bool` DEFAULT: `False`
`add_stop`	Append the stop token to the end of the sequence. TYPE: `bool` DEFAULT: `False`
`to_strings`	Return each sequence as a list of token strings rather than a tensor. This is useful for debugging. TYPE: `bool` DEFAULT: `False`

RETURNS	DESCRIPTION
`torch.Tensor of shape (n_sequences, max_length) or list[list[str]]`	Either a tensor containing the integer values for each token, padded with 0s, or the list of tokens comprising each sequence.
