Datasets (`depthcharge.data`)
spectra_to_df(peak_file, *, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)
Read mass spectra into a Polars DataFrame.
Apache Parquet is a space-efficient, columnar data storage format that is popular in the data science and engineering communities. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

- `peak_file: str`
- `scan_id: int`
- `ms_level: int`
- `precursor_mz: float64`
- `precursor_charge: int8`
- `mz_array: list[float64]`
- `intensity_array: list[float64]`
An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a `scan_id` column with the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following `scan=`, whereas for MGF files it is the zero-indexed offset of the mass spectrum in the file.
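For instance, the integer following `scan=` in an mzML native ID can be recovered with a small helper. This is a minimal stdlib sketch, and the native ID string below is illustrative:

```python
import re


def scan_id_from_native_id(native_id: str) -> int:
    """Extract the integer following 'scan=' from an mzML native ID."""
    match = re.search(r"scan=(\d+)", native_id)
    if match is None:
        raise ValueError(f"no scan number in {native_id!r}")
    return int(match.group(1))


# A typical Thermo-style native ID:
print(scan_id_from_native_id("controllerType=0 controllerNumber=1 scan=501"))  # 501
```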
Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. Each must be a `CustomField`, where the name is the new column and the accessor is a function that extracts a value from the corresponding Pyteomics spectrum dictionary. The PyArrow data type must also be specified.
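As a sketch of the accessor part, the function below pulls a retention time out of a nested spectrum dictionary. The key layout is illustrative of how Pyteomics exposes mzML scan metadata, and the `rt` field name is hypothetical; in depthcharge the accessor would be paired with a column name and a PyArrow data type in a `CustomField`.

```python
# Hypothetical accessor for a custom "rt" field.
def rt_accessor(spectrum: dict) -> float:
    """Return the scan start time from a Pyteomics-style spectrum dict."""
    return float(spectrum["scanList"]["scan"][0]["scan start time"])


# Minimal stand-in for a parsed spectrum dictionary:
spectrum = {"scanList": {"scan": [{"scan start time": 42.7}]}}
print(rt_accessor(spectrum))  # 42.7
```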
PARAMETER | DESCRIPTION
---|---
`peak_file` | The mass spectrometry data file in mzML, mzXML, or MGF format.
`metadata_df` | A DataFrame of additional metadata to add to each mass spectrum; it must contain a `scan_id` column.
`ms_level` | The level(s) of tandem mass spectra to keep.
`preprocessing_fn` | The function(s) used to preprocess the mass spectra.
`valid_charge` | Only consider spectra with the specified precursor charges.
`custom_fields` | Additional fields to extract during peak file parsing.
`progress` | Enable or disable the progress bar.

RETURNS | DESCRIPTION
---|---
`DataFrame` | A DataFrame containing the parsed mass spectra.
spectra_to_parquet(peak_file, *, parquet_file=None, batch_size=100000, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)
Stream mass spectra to Apache Parquet, with preprocessing.
Apache Parquet is a space-efficient, columnar data storage format that is popular in the data science and engineering communities. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

- `peak_file: str`
- `scan_id: int`
- `ms_level: int`
- `precursor_mz: float64`
- `precursor_charge: int8`
- `mz_array: list[float64]`
- `intensity_array: list[float64]`
An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a `scan_id` column with the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following `scan=`, whereas for MGF files it is the zero-indexed offset of the mass spectrum in the file.
Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. Each must be a `CustomField`, where the name is the new column and the accessor is a function that extracts a value from the corresponding Pyteomics spectrum dictionary. The PyArrow data type must also be specified.
PARAMETER | DESCRIPTION
---|---
`peak_file` | The mass spectrometry data file in mzML, mzXML, or MGF format.
`parquet_file` | The output file. If not provided, it is derived from the input file stem.
`batch_size` | The number of mass spectra to process simultaneously.
`metadata_df` | A DataFrame of additional metadata to add to each mass spectrum; it must contain a `scan_id` column.
`ms_level` | The level(s) of tandem mass spectra to keep.
`preprocessing_fn` | The function(s) used to preprocess the mass spectra.
`valid_charge` | Only consider spectra with the specified precursor charges.
`custom_fields` | Additional fields to extract during peak file parsing.
`progress` | Enable or disable the progress bar.

RETURNS | DESCRIPTION
---|---
`Path` | The Parquet file that was written.
spectra_to_stream(peak_file, *, batch_size=100000, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)
Stream mass spectra in an Apache Arrow format, with preprocessing.
Apache Arrow is a space-efficient, columnar data format that is popular in the data science and engineering communities. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

- `peak_file: str`
- `scan_id: int`
- `ms_level: int`
- `precursor_mz: float`
- `precursor_charge: int`
- `mz_array: list[float]`
- `intensity_array: list[float]`
An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a `scan_id` column with the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following `scan=`, whereas for MGF files it is the zero-indexed offset of the mass spectrum in the file.
Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. Each must be a `CustomField`, where the name is the new column and the accessor is a function that extracts a value from the corresponding Pyteomics spectrum dictionary. The PyArrow data type must also be specified.
PARAMETER | DESCRIPTION
---|---
`peak_file` | The mass spectrometry data file in mzML, mzXML, or MGF format.
`batch_size` | The number of mass spectra in each RecordBatch.
`metadata_df` | A DataFrame of additional metadata to add to each mass spectrum; it must contain a `scan_id` column.
`ms_level` | The level(s) of tandem mass spectra to keep.
`preprocessing_fn` | The function(s) used to preprocess the mass spectra.
`valid_charge` | Only consider spectra with the specified precursor charges.
`custom_fields` | Additional fields to extract during peak file parsing.
`progress` | Enable or disable the progress bar.

RETURNS | DESCRIPTION
---|---
Generator of `pyarrow.RecordBatch` | Batches of parsed spectra.
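Conceptually, the stream chunks parsed spectra into fixed-size batches. A stdlib-only sketch of that batching behavior (not the depthcharge implementation) looks like:

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to batch_size items, mirroring RecordBatch sizes."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# 10 spectra with batch_size=4 yield batches of 4, 4, and 2:
print([len(batch) for batch in batched(range(10), 4)])  # [4, 4, 2]
```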
SpectrumDataset(spectra, batch_size, path=None, parse_kwargs=None, **kwargs)
Bases: LanceDataset
Store and access a collection of mass spectra.
Parse and/or add mass spectra to an index in the lance data format. This format enables fast random access to spectra for training. The resulting file is served as a PyTorch IterableDataset, allowing spectra to be accessed efficiently for training and inference. This is accomplished using the Lance PyTorch integration.

The `batch_size` parameter for this class is independent of the `batch_size` of the PyTorch DataLoader. Generally, we only want the former parameter to be greater than 1. Additionally, this dataset should not be used with a DataLoader with `num_workers` > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

If you wish to use an existing lance dataset, use the `from_lance()` method.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`path` | The name and path of the lance dataset. If the path does not contain the `.lance` extension, it is added.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.
ATTRIBUTE | DESCRIPTION
---|---
`peak_files` | The files currently in the lance dataset.
`path` | The path to the underlying lance dataset.
`n_spectra` | The number of spectra in the lance dataset.
`dataset` | The underlying lance dataset.
Attributes
n_spectra: int
property
The number of spectra in the Lance dataset.
path: Path
property
The path to the underlying lance dataset.
peak_files: list[str]
property
The files currently in the lance dataset.
Functions
add_spectra(spectra)
Add mass spectrometry data to the lance dataset.
Note that depthcharge does not verify whether the provided spectra already exist in the lance dataset.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
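Because no duplicate check is performed, callers who need one can track which (peak file, scan) pairs have already been added before calling `add_spectra()`. A set-based sketch (illustrative, not part of the depthcharge API):

```python
def filter_unseen(spectra: list[dict], seen: set[tuple[str, int]]) -> list[dict]:
    """Keep only spectra whose (peak_file, scan_id) pair has not been seen."""
    fresh = []
    for spectrum in spectra:
        key = (spectrum["peak_file"], spectrum["scan_id"])
        if key not in seen:
            seen.add(key)
            fresh.append(spectrum)
    return fresh


seen: set[tuple[str, int]] = set()
batch = [
    {"peak_file": "run1.mzML", "scan_id": 1},
    {"peak_file": "run1.mzML", "scan_id": 1},  # duplicate, dropped
    {"peak_file": "run1.mzML", "scan_id": 2},
]
print(len(filter_unseen(batch, seen)))  # 2
```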
from_lance(path, batch_size, parse_kwargs=None, **kwargs)
classmethod
Load a previously created lance dataset.
PARAMETER | DESCRIPTION
---|---
`path` | The path of the lance dataset.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.

RETURNS | DESCRIPTION
---|---
`SpectrumDataset` | The dataset of mass spectra.
AnnotatedSpectrumDataset(spectra, annotations, tokenizer, batch_size, path=None, parse_kwargs=None, **kwargs)
Bases: SpectrumDataset
Store and access a collection of annotated mass spectra.
Parse and/or add mass spectra to an index in the lance data format. This format enables fast random access to spectra for training. The resulting file is served as a PyTorch IterableDataset, allowing spectra to be accessed efficiently for training and inference. This is accomplished using the Lance PyTorch integration.

The `batch_size` parameter for this class is independent of the `batch_size` of the PyTorch DataLoader. Generally, we only want the former parameter to be greater than 1. Additionally, this dataset should not be used with a DataLoader with `num_workers` > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

If you wish to use an existing lance dataset, use the `from_lance()` method.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
`annotations` | The column name containing the annotations.
`tokenizer` | The tokenizer used to transform the annotations into PyTorch tensors.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`path` | The name and path of the lance dataset. If the path does not contain the `.lance` extension, it is added.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.
|
ATTRIBUTE | DESCRIPTION
---|---
`peak_files` | The files currently in the lance dataset.
`path` | The path to the underlying lance dataset.
`n_spectra` | The number of spectra in the lance dataset.
`dataset` | The underlying lance dataset.
`tokenizer` | The tokenizer for the annotations.
`annotations` | The annotation column in the dataset.
Attributes
n_spectra: int
property
The number of spectra in the Lance dataset.
path: Path
property
The path to the underlying lance dataset.
peak_files: list[str]
property
The files currently in the lance dataset.
Functions
add_spectra(spectra)
Add mass spectrometry data to the lance dataset.
Note that depthcharge does not verify whether the provided spectra already exist in the lance dataset.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
from_lance(path, annotations, tokenizer, batch_size, parse_kwargs=None, **kwargs)
classmethod
Load a previously created lance dataset.
PARAMETER | DESCRIPTION
---|---
`path` | The path of the lance dataset.
`annotations` | The column name containing the annotations.
`tokenizer` | The tokenizer used to transform the annotations into PyTorch tensors.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.

RETURNS | DESCRIPTION
---|---
`AnnotatedSpectrumDataset` | The dataset of annotated mass spectra.
StreamingSpectrumDataset(spectra, batch_size, **parse_kwargs)
Bases: IterableDataset
Stream mass spectra from a file or DataFrame.
While the on-disk dataset provided by depthcharge.data.SpectrumDataset
provides an excellent option for model training, this class provides
a PyTorch Dataset that is more suitable for inference.
When using a StreamingSpectrumDataset
, the order of mass spectra
cannot be shuffled.
The batch_size
parameter for this class indepedent of the batch_size
of the PyTorch DataLoader. Generally, we only want the former parameter to
greater than 1. Additionally, this dataset should not be
used with a DataLoader set to max_workers
> 1, unless specific care is
used to handle the
caveats of a PyTorch IterableDataset
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`**parse_kwargs` | Keyword arguments passed when parsing the spectra.

ATTRIBUTE | DESCRIPTION
---|---
`batch_size` | The batch size to use for loading mass spectra.
AnalyteDataset(tokenizer, sequences, *args)
Bases: TensorDataset
A dataset for peptide sequences.
PARAMETER | DESCRIPTION
---|---
`tokenizer` | A tokenizer specifying how to transform peptide sequences into tokens.
`sequences` | The peptide sequences in a format compatible with your tokenizer. ProForma is preferred.
`*args` | Additional values to include during data loading.
Attributes
tokens: torch.Tensor
property
The peptide sequence tokens.