Tokenizers (depthcharge.tokenizers)

PeptideTokenizer(residues=None, replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')

Bases: Tokenizer

A tokenizer for ProForma peptide sequences.

Parse and tokenize ProForma-compliant peptide sequences.

PARAMETER DESCRIPTION
residues

Residues and modifications to add to the vocabulary beyond the standard 20 amino acids.

TYPE: dict[str, float] DEFAULT: None

replace_isoleucine_with_leucine

Replace I with L residues, because they are isomeric and often indistinguishable by mass spectrometry.

TYPE: bool DEFAULT: False

reverse

Reverse the sequence for tokenization, C-terminus to N-terminus.

TYPE: bool DEFAULT: False

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'
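The effect of `replace_isoleucine_with_leucine` and `reverse` can be pictured on a sequence that has already been split into tokens. The following is a minimal sketch rather than the library's code: the helper `preprocess` is hypothetical, and the order in which the two options are applied is an assumption.

```python
def preprocess(tokens, replace_isoleucine_with_leucine=False, reverse=False):
    """Hypothetical helper illustrating the two constructor options."""
    if replace_isoleucine_with_leucine:
        # Swap the leading residue letter and keep any attached modification.
        tokens = ["L" + t[1:] if t.startswith("I") else t for t in tokens]
    if reverse:
        # Read the peptide from C-terminus to N-terminus.
        tokens = tokens[::-1]
    return tokens


tokens = ["L", "I", "M[Oxidation]", "K"]
print(preprocess(tokens, replace_isoleucine_with_leucine=True, reverse=True))
# ['K', 'M[Oxidation]', 'L', 'L']
```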

ATTRIBUTE DESCRIPTION
residues

The residues and modifications and their associated masses. Terminal modifications are indicated by -.

TYPE: SortedDict[str, float]

index

The mapping of residues and modifications to integer representations.

TYPE: SortedDict[str, int]

reverse_index

The ordered residues and modifications where the list index is the integer representation for a token.

TYPE: list[None | str]

start_token

The start token.

TYPE: str

stop_token

The stop token.

TYPE: str

start_int

The integer representation of the start token.

TYPE: int

stop_int

The integer representation of the stop token.

TYPE: int

padding_int

The integer used to represent padding.

TYPE: int
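The relationship between `index`, `reverse_index`, and the integer attributes can be sketched with plain Python containers. This is an illustration under assumptions: the four-residue vocabulary is made up, and placing the stop token last is a guess. Only the use of 0 for padding comes from the descriptions on this page.

```python
# Hypothetical vocabulary; the real class derives its own from `residues`.
tokens = sorted(["A", "C", "G", "K"])
stop_token = "$"
padding_int = 0

# Position 0 is reserved for padding, so it holds None instead of a token.
reverse_index = [None] + tokens + [stop_token]

# `index` is the inverse mapping, from token string to integer.
index = {tok: i for i, tok in enumerate(reverse_index) if tok is not None}
stop_int = index[stop_token]

print(index["G"], reverse_index[index["G"]], stop_int)
# 3 G 5
```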

Functions

calculate_precursor_ions(tokens, charges)

Calculate the m/z for precursor ions.

PARAMETER DESCRIPTION
tokens

The tokens corresponding to the peptide sequence.

TYPE: torch.Tensor of shape (n_sequences, len_seq)

charges

The charge state for each peptide.

TYPE: torch.Tensor of shape (n_sequences,)

RETURNS DESCRIPTION
Tensor

The monoisotopic m/z for each charged peptide.
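For orientation, the usual monoisotopic relation is m/z = (M + z * m_proton) / z, where M is the sum of the residue masses plus one water. The sketch below applies that relation in plain Python. It is not the library's implementation: `precursor_mz` and the three-residue mass table are illustrative, and the real method works on token tensors.

```python
# Monoisotopic masses in Da; only three residues are listed for brevity.
RESIDUE_MASS = {"G": 57.021464, "A": 71.037114, "K": 128.094963}
WATER = 18.010565
PROTON = 1.007276


def precursor_mz(residues, charge):
    """Hypothetical helper: m/z of a peptide carrying `charge` protons."""
    neutral_mass = sum(RESIDUE_MASS[r] for r in residues) + WATER
    return neutral_mass / charge + PROTON


mz = precursor_mz(["G", "A", "K"], charge=2)
print(round(mz, 6))
# 138.089329
```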

detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)

Retrieve sequences from tokens.

PARAMETER DESCRIPTION
tokens

The zero-padded tensor of integerized tokens to decode.

TYPE: torch.Tensor of shape (n_sequences, max_length)

join

Whether to join the tokens of each sequence into a single string.

TYPE: bool DEFAULT: True

trim_start_token

Remove the start token from the beginning of a sequence.

TYPE: bool DEFAULT: True

trim_stop_token

Remove the stop token from the end of a sequence.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[str] or list[list[str]]

The decoded sequences, each as a string or a list of strings.

from_massivekb(replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$') staticmethod

Create a tokenizer with the observed peptide modifications.

Modifications are parsed from MassIVE-KB peptide strings and added to the vocabulary.

PARAMETER DESCRIPTION
replace_isoleucine_with_leucine

Replace I with L residues, because they are isobaric and often indistinguishable by mass spectrometry.

TYPE: bool DEFAULT: False

reverse

Reverse the sequence for tokenization, C-terminus to N-terminus.

TYPE: bool DEFAULT: False

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'

RETURNS DESCRIPTION
MskbPeptideTokenizer

A tokenizer for peptides with the observed modifications.

from_proforma(sequences, replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$') classmethod

Create a tokenizer with the observed peptide modifications.

Modifications are parsed from ProForma 2.0-compliant peptide strings and added to the vocabulary.

PARAMETER DESCRIPTION
sequences

The peptides from which to parse modifications.

TYPE: Iterable[str]

replace_isoleucine_with_leucine

Replace I with L residues, because they are isobaric and often indistinguishable by mass spectrometry.

TYPE: bool DEFAULT: False

reverse

Reverse the sequence for tokenization, C-terminus to N-terminus.

TYPE: bool DEFAULT: False

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'

RETURNS DESCRIPTION
PeptideTokenizer

A tokenizer for peptides with the observed modifications.

split(sequence)

Split a ProForma peptide sequence.

PARAMETER DESCRIPTION
sequence

The peptide sequence.

TYPE: str

RETURNS DESCRIPTION
list[str]

The tokens that comprise the peptide sequence.
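As a rough picture of the splitting, a regular expression can separate a peptide string into residues with their attached modifications. The sketch below handles only a simplified subset of ProForma (bracketed modifications on residues and an N-terminal modification followed by `-`). The function `split_peptide` is hypothetical, and the library's own parser may behave differently.

```python
import re

# Either an N-terminal modification such as "[Acetyl]-", or one residue
# letter followed by zero or more bracketed modifications.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]-|[A-Z](?:\[[^\]]+\])*")


def split_peptide(sequence):
    """Hypothetical helper: split a simple ProForma string into tokens."""
    return TOKEN_PATTERN.findall(sequence)


print(split_peptide("[Acetyl]-PEPM[Oxidation]K"))
# ['[Acetyl]-', 'P', 'E', 'P', 'M[Oxidation]', 'K']
```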

tokenize(sequences, add_start=False, add_stop=False, to_strings=False)

Tokenize the input sequences.

PARAMETER DESCRIPTION
sequences

The sequences to tokenize.

TYPE: Iterable[str] or str

add_start

Prepend the start token to the beginning of the sequence.

TYPE: bool DEFAULT: False

add_stop

Append the stop token to the end of the sequence.

TYPE: bool DEFAULT: False

to_strings

Return each sequence as a list of token strings rather than a tensor. This is useful for debugging.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
torch.Tensor of shape (n_sequences, max_length) or list[list[str]]

Either a tensor containing the integer values for each token, padded with 0's, or the list of tokens comprising each sequence.
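The padding behavior described above can be sketched with nested lists standing in for the tensor. This is an illustration under assumptions: `tokenize_sketch` and the small `index` are made up, and each character is treated as one token, which ignores modifications.

```python
def tokenize_sketch(sequences, index, add_stop=False, stop_token="$"):
    """Hypothetical helper: integerize sequences and right-pad with 0."""
    if isinstance(sequences, str):
        sequences = [sequences]
    rows = []
    for seq in sequences:
        toks = list(seq) + ([stop_token] if add_stop else [])
        rows.append([index[t] for t in toks])
    # Pad every row to the length of the longest one.
    max_length = max(len(row) for row in rows)
    return [row + [0] * (max_length - len(row)) for row in rows]


index = {"A": 1, "G": 2, "K": 3, "$": 4}
print(tokenize_sketch(["GAK", "K"], index, add_stop=True))
# [[2, 1, 3, 4], [3, 4, 0, 0]]
```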

MoleculeTokenizer(selfies_vocab=None, start_token=None, stop_token='$')

Bases: Tokenizer

A tokenizer for small molecules.

Tokenize SMILES and SELFIES representations of small molecules. SMILES are internally converted to SELFIES representations.

PARAMETER DESCRIPTION
selfies_vocab

The SELFIES tokens to be considered.

TYPE: Iterable[str] DEFAULT: None

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'

ATTRIBUTE DESCRIPTION
index

The mapping of SELFIES tokens to integer representations.

TYPE: SortedDict[str, int]

reverse_index

The ordered SELFIES tokens where the list index is the integer representation for a token.

TYPE: list[None | str]

start_token

The start token.

TYPE: str

stop_token

The stop token.

TYPE: str

start_int

The integer representation of the start token.

TYPE: int

stop_int

The integer representation of the stop token.

TYPE: int

padding_int

The integer used to represent padding.

TYPE: int

Functions

detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)

Retrieve sequences from tokens.

PARAMETER DESCRIPTION
tokens

The zero-padded tensor of integerized tokens to decode.

TYPE: torch.Tensor of shape (n_sequences, max_length)

join

Whether to join the tokens of each sequence into a single string.

TYPE: bool DEFAULT: True

trim_start_token

Remove the start token from the beginning of a sequence.

TYPE: bool DEFAULT: True

trim_stop_token

Remove the stop token and anything following it from the sequence.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[str] or list[list[str]]

The decoded sequences, each as a string or a list of strings.

from_selfies(selfies, start_token=None, stop_token='$') classmethod

Learn the vocabulary from SELFIES strings.

PARAMETER DESCRIPTION
selfies

Create a vocabulary from all unique tokens in these SELFIES strings.

TYPE: Iterable[str] | str

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'

RETURNS DESCRIPTION
MoleculeTokenizer

The tokenizer restricted to the vocabulary present in the input SELFIES strings.
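Learning a vocabulary from SELFIES strings amounts to collecting the unique bracketed symbols. The sketch below does this with a regular expression and no third-party dependency. `learn_vocab` is a hypothetical helper, not the library's method, and it ignores the `.` separator that SELFIES uses between disconnected fragments.

```python
import re


def learn_vocab(selfies):
    """Hypothetical helper: collect the unique SELFIES symbols."""
    if isinstance(selfies, str):
        selfies = [selfies]
    vocab = set()
    for string in selfies:
        # Each SELFIES symbol is enclosed in square brackets.
        vocab.update(re.findall(r"\[[^\]]*\]", string))
    return sorted(vocab)


print(learn_vocab(["[C][C][O]", "[C][=O]"]))
# ['[=O]', '[C]', '[O]']
```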

from_smiles(smiles, start_token=None, stop_token='$') classmethod

Learn the vocabulary from SMILES strings.

PARAMETER DESCRIPTION
smiles

Create a vocabulary from all unique tokens in these SMILES strings.

TYPE: Iterable[str] | str

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'

RETURNS DESCRIPTION
MoleculeTokenizer

The tokenizer restricted to the vocabulary present in the input SMILES strings.

split(sequence)

Split a SMILES or SELFIES string into SELFIES tokens.

PARAMETER DESCRIPTION
sequence

The SMILES or SELFIES string representing a molecule.

TYPE: str

RETURNS DESCRIPTION
list[str]

The SELFIES tokens representing the molecule.

tokenize(sequences, add_start=False, add_stop=False, to_strings=False)

Tokenize the input sequences.

PARAMETER DESCRIPTION
sequences

The sequences to tokenize.

TYPE: Iterable[str] or str

add_start

Prepend the start token to the beginning of the sequence.

TYPE: bool DEFAULT: False

add_stop

Append the stop token to the end of the sequence.

TYPE: bool DEFAULT: False

to_strings

Return each sequence as a list of token strings rather than a tensor. This is useful for debugging.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
torch.Tensor of shape (n_sequences, max_length) or list[list[str]]

Either a tensor containing the integer values for each token, padded with 0's, or the list of tokens comprising each sequence.

Tokenizer(tokens, start_token=None, stop_token='$')

Bases: ABC

An abstract base class for Depthcharge tokenizers.

PARAMETER DESCRIPTION
tokens

The tokens to consider.

TYPE: Sequence[str]

start_token

The start token to use.

TYPE: str DEFAULT: None

stop_token

The stop token to use.

TYPE: str DEFAULT: '$'

Functions

detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)

Retrieve sequences from tokens.

PARAMETER DESCRIPTION
tokens

The zero-padded tensor of integerized tokens to decode.

TYPE: torch.Tensor of shape (n_sequences, max_length)

join

Whether to join the tokens of each sequence into a single string.

TYPE: bool DEFAULT: True

trim_start_token

Remove the start token from the beginning of a sequence.

TYPE: bool DEFAULT: True

trim_stop_token

Remove the stop token and anything following it from the sequence.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
list[str] or list[list[str]]

The decoded sequences, each as a string or a list of strings.
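The decoding rules above (skip zero padding, cut at the stop token and drop anything after it, then optionally join) can be sketched with nested lists in place of a tensor. `detokenize_sketch` and its `reverse_index` are illustrative assumptions, not the library's code, and start-token handling is omitted.

```python
def detokenize_sketch(rows, reverse_index, stop_int, join=True,
                      trim_stop_token=True):
    """Hypothetical helper: decode zero-padded integer rows."""
    decoded = []
    for row in rows:
        if trim_stop_token and stop_int in row:
            # Drop the stop token and everything that follows it.
            row = row[: row.index(stop_int)]
        # 0 is the padding value, so it never maps to a token.
        toks = [reverse_index[i] for i in row if i != 0]
        decoded.append("".join(toks) if join else toks)
    return decoded


reverse_index = [None, "A", "G", "K", "$"]
rows = [[2, 1, 3, 4, 0], [3, 4, 1, 0, 0]]
print(detokenize_sketch(rows, reverse_index, stop_int=4))
# ['GAK', 'K']
```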

split(sequence) abstractmethod

Split a sequence into the constituent string tokens.
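Because `split` is abstract, a concrete tokenizer only needs to say how a sequence breaks into string tokens. The sketch below mimics that pattern with a stand-in base class so it runs without the library. `SketchTokenizer` and `CharTokenizer` are hypothetical names, and the real base class does considerably more.

```python
from abc import ABC, abstractmethod


class SketchTokenizer(ABC):
    """Stand-in for the abstract base class; not the library's code."""

    def __init__(self, tokens, stop_token="$"):
        # Index 0 is reserved for padding.
        self.reverse_index = [None] + sorted(tokens) + [stop_token]
        self.index = {
            t: i for i, t in enumerate(self.reverse_index) if t is not None
        }

    @abstractmethod
    def split(self, sequence):
        """Split a sequence into its constituent string tokens."""

    def tokenize(self, sequence):
        return [self.index[t] for t in self.split(sequence)]


class CharTokenizer(SketchTokenizer):
    """Treat every character as one token."""

    def split(self, sequence):
        return list(sequence)


tok = CharTokenizer("ACGT")
print(tok.tokenize("GAT"))
# [3, 1, 4]
```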

tokenize(sequences, add_start=False, add_stop=False, to_strings=False)

Tokenize the input sequences.

PARAMETER DESCRIPTION
sequences

The sequences to tokenize.

TYPE: Iterable[str] or str

add_start

Prepend the start token to the beginning of the sequence.

TYPE: bool DEFAULT: False

add_stop

Append the stop token to the end of the sequence.

TYPE: bool DEFAULT: False

to_strings

Return each sequence as a list of token strings rather than a tensor. This is useful for debugging.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
torch.Tensor of shape (n_sequences, max_length) or list[list[str]]

Either a tensor containing the integer values for each token, padded with 0's, or the list of tokens comprising each sequence.