Tokenizers (depthcharge.tokenizers)
PeptideTokenizer(residues=None, replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')
Bases: Tokenizer
A tokenizer for ProForma peptide sequences.
Parse and tokenize ProForma-compliant peptide sequences.
PARAMETER | DESCRIPTION
---|---
residues | Residues and modifications to add to the vocabulary beyond the standard 20 amino acids.
replace_isoleucine_with_leucine | Replace I with L residues, because they are isomeric and often indistinguishable by mass spectrometry.
reverse | Reverse the sequence for tokenization, C-terminus to N-terminus.
start_token | The start token to use.
stop_token | The stop token to use.
ATTRIBUTE | DESCRIPTION
---|---
residues | The residues and modifications and their associated masses. Terminal modifications are indicated by …
index | The mapping of residues and modifications to integer representations.
reverse_index | The ordered residues and modifications, where the list index is the integer representation for a token.
start_token | The start token.
stop_token | The stop token.
start_int | The integer representation of the start token.
stop_int | The integer representation of the stop token.
padding_int | The integer used to represent padding.
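A minimal usage sketch (the peptide sequences below are only placeholders):

```python
from depthcharge.tokenizers import PeptideTokenizer

# Standard 20 amino acids, with I replaced by L during tokenization.
tokenizer = PeptideTokenizer(replace_isoleucine_with_leucine=True)

# Integerize a small batch of peptides into a zero-padded tensor.
tokens = tokenizer.tokenize(["LESLIEK", "PEPTIDE"], add_stop=True)
```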
Functions
calculate_precursor_ions(tokens, charges)
Calculate the m/z for precursor ions.
PARAMETER | DESCRIPTION
---|---
tokens | The tokens corresponding to the peptide sequence.
charges | The charge state for each peptide.

RETURNS | DESCRIPTION
---|---
Tensor | The monoisotopic m/z for each charged peptide.
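For example, a sketch of computing precursor m/z values, assuming the tokens come from tokenize on the same tokenizer; the peptides and charges are placeholders:

```python
import torch

from depthcharge.tokenizers import PeptideTokenizer

tokenizer = PeptideTokenizer()

# Pair each tokenized peptide with a charge state.
tokens = tokenizer.tokenize(["LESLIEK", "EDITHR"])
charges = torch.tensor([2, 3])

# Monoisotopic m/z for each (peptide, charge) pair.
mz = tokenizer.calculate_precursor_ions(tokens, charges)
```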
detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)
Retrieve sequences from tokens.
PARAMETER | DESCRIPTION
---|---
tokens | The zero-padded tensor of integerized tokens to decode.
join | Join tokens into strings?
trim_start_token | Remove the start token from the beginning of a sequence.
trim_stop_token | Remove the stop token from the end of a sequence.

RETURNS | DESCRIPTION
---|---
list[str] or list[list[str]] | The decoded sequences, each as a string or a list of strings.
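A round-trip sketch, assuming the tokens were produced by tokenize on the same tokenizer:

```python
from depthcharge.tokenizers import PeptideTokenizer

tokenizer = PeptideTokenizer()

tokens = tokenizer.tokenize(["LESLIEK", "EDITHR"], add_stop=True)

# Map the integer tokens back to peptide strings, dropping the stop token.
sequences = tokenizer.detokenize(tokens, join=True, trim_stop_token=True)
# e.g., ["LESLIEK", "EDITHR"]
```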
from_massivekb(replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')
staticmethod
Create a tokenizer with the observed peptide modifications.
Modifications are parsed from MassIVE-KB peptide strings and added to the vocabulary.
PARAMETER | DESCRIPTION
---|---
replace_isoleucine_with_leucine | Replace I with L residues, because they are isobaric and often indistinguishable by mass spectrometry.
reverse | Reverse the sequence for tokenization, C-terminus to N-terminus.
start_token | The start token to use.
stop_token | The stop token to use.

RETURNS | DESCRIPTION
---|---
MskbPeptideTokenizer | A tokenizer for peptides with the observed modifications.
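A brief construction sketch:

```python
from depthcharge.tokenizers import PeptideTokenizer

# The vocabulary includes the modifications observed in MassIVE-KB.
tokenizer = PeptideTokenizer.from_massivekb(replace_isoleucine_with_leucine=True)
```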
from_proforma(sequences, replace_isoleucine_with_leucine=False, reverse=False, start_token=None, stop_token='$')
classmethod
Create a tokenizer with the observed peptide modifications.
Modifications are parsed from ProForma 2.0-compliant peptide strings and added to the vocabulary.
PARAMETER | DESCRIPTION
---|---
sequences | The peptides from which to parse modifications.
replace_isoleucine_with_leucine | Replace I with L residues, because they are isobaric and often indistinguishable by mass spectrometry.
reverse | Reverse the sequence for tokenization, C-terminus to N-terminus.
start_token | The start token to use.
stop_token | The stop token to use.

RETURNS | DESCRIPTION
---|---
PeptideTokenizer | A tokenizer for peptides with the observed modifications.
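A sketch of learning the vocabulary from ProForma strings; the sequences and the Oxidation modification below are only illustrative:

```python
from depthcharge.tokenizers import PeptideTokenizer

# Modifications found in these ProForma 2.0 strings are added to the vocabulary.
tokenizer = PeptideTokenizer.from_proforma(["LESLIEK", "EDITHM[Oxidation]R"])

# The modified peptide can now be tokenized.
tokens = tokenizer.tokenize(["EDITHM[Oxidation]R"])
```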
split(sequence)
Split a ProForma peptide sequence.
PARAMETER | DESCRIPTION
---|---
sequence | The peptide sequence.

RETURNS | DESCRIPTION
---|---
list[str] | The tokens that comprise the peptide sequence.
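For example, a modified residue stays paired with its modification as a single token (the exact token spelling may differ depending on how the modification is written):

```python
from depthcharge.tokenizers import PeptideTokenizer

tokenizer = PeptideTokenizer.from_proforma(["EDITHM[Oxidation]R"])

print(tokenizer.split("EDITHM[Oxidation]R"))
# e.g., ["E", "D", "I", "T", "H", "M[Oxidation]", "R"]
```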
tokenize(sequences, add_start=False, add_stop=False, to_strings=False)
Tokenize the input sequences.
PARAMETER | DESCRIPTION
---|---
sequences | The sequences to tokenize.
add_start | Prepend the start token to the beginning of the sequence.
add_stop | Append the stop token to the end of the sequence.
to_strings | Return each as a list of token strings rather than a tensor. This is useful for debugging.

RETURNS | DESCRIPTION
---|---
torch.tensor of shape (n_sequences, max_length) or list[list[str]] | Either a tensor containing the integer values for each token, padded with 0's, or the list of tokens comprising each sequence.
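A sketch contrasting the default tensor output with to_strings=True:

```python
from depthcharge.tokenizers import PeptideTokenizer

tokenizer = PeptideTokenizer()

# Default: a zero-padded tensor of integer token indices.
as_tensor = tokenizer.tokenize(["LESLIEK", "EDITHR"], add_stop=True)

# For debugging: keep the tokens as strings instead of integerizing them.
as_strings = tokenizer.tokenize(["LESLIEK"], to_strings=True)
# e.g., [["L", "E", "S", "L", "I", "E", "K"]]
```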
MoleculeTokenizer(selfies_vocab=None, start_token=None, stop_token='$')
Bases: Tokenizer
A tokenizer for small molecules.
Tokenize SMILES and SELFIES representations of small molecules. SMILES are internally converted to SELFIES representations.
PARAMETER | DESCRIPTION
---|---
selfies_vocab | The SELFIES tokens to be considered.
start_token | The start token to use.
stop_token | The stop token to use.
ATTRIBUTE | DESCRIPTION
---|---
index | The mapping of residues and modifications to integer representations.
reverse_index | The ordered residues and modifications, where the list index is the integer representation for a token.
start_token | The start token.
stop_token | The stop token.
start_int | The integer representation of the start token.
stop_int | The integer representation of the stop token.
padding_int | The integer used to represent padding.
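A minimal usage sketch, assuming the default SELFIES vocabulary covers the input (ethanol SMILES used as a placeholder):

```python
from depthcharge.tokenizers import MoleculeTokenizer

# No selfies_vocab given, so a default SELFIES alphabet is assumed.
tokenizer = MoleculeTokenizer()

# SMILES input is converted to SELFIES internally before tokenization.
tokens = tokenizer.tokenize(["CCO"], add_stop=True)
```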
Functions
detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)
Retrieve sequences from tokens.
PARAMETER | DESCRIPTION
---|---
tokens | The zero-padded tensor of integerized tokens to decode.
join | Join tokens into strings?
trim_start_token | Remove the start token from the beginning of a sequence.
trim_stop_token | Remove the stop token from the end of a sequence.

RETURNS | DESCRIPTION
---|---
list[str] or list[list[str]] | The decoded sequences, each as a string or a list of strings.
from_selfies(selfies, start_token=None, stop_token='$')
classmethod
Learn the vocabulary from SELFIES strings.
PARAMETER | DESCRIPTION
---|---
selfies | Create a vocabulary from all unique tokens in these SELFIES strings.
start_token | The start token to use.
stop_token | The stop token to use.

RETURNS | DESCRIPTION
---|---
MoleculeTokenizer | The tokenizer restricted to the vocabulary present in the input SELFIES strings.
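A brief sketch with two SELFIES strings as placeholders:

```python
from depthcharge.tokenizers import MoleculeTokenizer

# The vocabulary is restricted to the tokens found in these SELFIES strings.
tokenizer = MoleculeTokenizer.from_selfies(["[C][C][O]", "[C][O]"])
```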
from_smiles(smiles, start_token=None, stop_token='$')
classmethod
Learn the vocabulary from SMILES strings.
PARAMETER | DESCRIPTION
---|---
smiles | Create a vocabulary from all unique tokens in these SMILES strings.
start_token | The start token to use.
stop_token | The stop token to use.

RETURNS | DESCRIPTION
---|---
MoleculeTokenizer | The tokenizer restricted to the vocabulary present in the input SMILES strings.
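A brief sketch with two SMILES strings as placeholders:

```python
from depthcharge.tokenizers import MoleculeTokenizer

# Each SMILES is converted to SELFIES and its tokens join the vocabulary.
tokenizer = MoleculeTokenizer.from_smiles(["CCO", "c1ccccc1"])
```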
split(sequence)
Split a SMILES or SELFIES string into SELFIES tokens.
PARAMETER | DESCRIPTION
---|---
sequence | The SMILES or SELFIES string representing a molecule.

RETURNS | DESCRIPTION
---|---
list[str] | The SELFIES tokens representing the molecule.
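For example, either input form should yield the same SELFIES tokens (output shown only as an illustration):

```python
from depthcharge.tokenizers import MoleculeTokenizer

tokenizer = MoleculeTokenizer.from_smiles(["CCO"])

print(tokenizer.split("CCO"))        # e.g., ["[C]", "[C]", "[O]"]
print(tokenizer.split("[C][C][O]"))  # e.g., ["[C]", "[C]", "[O]"]
```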
tokenize(sequences, add_start=False, add_stop=False, to_strings=False)
Tokenize the input sequences.
PARAMETER | DESCRIPTION
---|---
sequences | The sequences to tokenize.
add_start | Prepend the start token to the beginning of the sequence.
add_stop | Append the stop token to the end of the sequence.
to_strings | Return each as a list of token strings rather than a tensor. This is useful for debugging.

RETURNS | DESCRIPTION
---|---
torch.tensor of shape (n_sequences, max_length) or list[list[str]] | Either a tensor containing the integer values for each token, padded with 0's, or the list of tokens comprising each sequence.
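A round-trip sketch; because SMILES are converted to SELFIES internally, the detokenized strings should come back in SELFIES form:

```python
from depthcharge.tokenizers import MoleculeTokenizer

tokenizer = MoleculeTokenizer.from_smiles(["CCO"])

tokens = tokenizer.tokenize(["CCO"], add_stop=True)
decoded = tokenizer.detokenize(tokens)
# e.g., ["[C][C][O]"], the SELFIES form of the input SMILES
```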
Tokenizer(tokens, start_token=None, stop_token='$')
Bases: ABC
An abstract base class for Depthcharge tokenizers.
PARAMETER | DESCRIPTION
---|---
tokens | The tokens to consider.
start_token | The start token to use.
stop_token | The stop token to use.
Functions
detokenize(tokens, join=True, trim_start_token=True, trim_stop_token=True)
Retrieve sequences from tokens.
PARAMETER | DESCRIPTION
---|---
tokens | The zero-padded tensor of integerized tokens to decode.
join | Join tokens into strings?
trim_start_token | Remove the start token from the beginning of a sequence.
trim_stop_token | Remove the stop token from the end of a sequence.

RETURNS | DESCRIPTION
---|---
list[str] or list[list[str]] | The decoded sequences, each as a string or a list of strings.
split(sequence)
abstractmethod
Split a sequence into the constituent string tokens.
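A hypothetical subclass sketch, assuming that passing a token vocabulary to the base constructor and implementing split is all a concrete tokenizer needs; the CharacterTokenizer name and behavior are made up for illustration:

```python
from depthcharge.tokenizers import Tokenizer


class CharacterTokenizer(Tokenizer):
    """A hypothetical tokenizer that treats each character as a token."""

    def split(self, sequence):
        """Split a sequence into single-character tokens."""
        return list(sequence)


# Vocabulary of single characters; start/stop tokens follow the base defaults.
tokenizer = CharacterTokenizer(list("ACDEFGHIKLMNPQRSTVWY"))
tokens = tokenizer.tokenize(["ACE"], add_stop=True)
```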
tokenize(sequences, add_start=False, add_stop=False, to_strings=False)
Tokenize the input sequences.
PARAMETER | DESCRIPTION
---|---
sequences | The sequences to tokenize.
add_start | Prepend the start token to the beginning of the sequence.
add_stop | Append the stop token to the end of the sequence.
to_strings | Return each as a list of token strings rather than a tensor. This is useful for debugging.

RETURNS | DESCRIPTION
---|---
torch.tensor of shape (n_sequences, max_length) or list[list[str]] | Either a tensor containing the integer values for each token, padded with 0's, or the list of tokens comprising each sequence.