Datasets (`depthcharge.data`)
spectra_to_df(peak_file, *, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)
Read mass spectra into a Polars DataFrame.
Apache Parquet is a space-efficient, columnar data storage format that is popular in the data science and engineering communities. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

- `peak_file: str`
- `scan_id: int`
- `ms_level: int`
- `precursor_mz: float64`
- `precursor_charge: int8`
- `mz_array: list[float64]`
- `intensity_array: list[float64]`
An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a `scan_id` column with the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following `scan=`, whereas for MGF files it is the zero-indexed offset of the mass spectrum in the file.
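For instance, the integer following `scan=` in an mzML native ID can be recovered with a small helper. This is a minimal stdlib sketch, and the native ID string below is illustrative:

```python
import re


def scan_id_from_native_id(native_id: str) -> int:
    """Extract the integer following 'scan=' from an mzML native ID."""
    match = re.search(r"scan=(\d+)", native_id)
    if match is None:
        raise ValueError(f"no scan number in {native_id!r}")
    return int(match.group(1))


# A typical Thermo-style native ID:
print(scan_id_from_native_id("controllerType=0 controllerNumber=1 scan=501"))  # 501
```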
Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. Each must be a `CustomField`, where the name is the new column and the accessor is a function that extracts a value from the corresponding Pyteomics spectrum dictionary. The PyArrow data type must also be specified.
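As a sketch of the accessor part, the function below pulls a retention time out of a nested spectrum dictionary. The key layout is illustrative of how Pyteomics exposes mzML scan metadata, and the `rt` field name is hypothetical; in depthcharge the accessor would be paired with a column name and a PyArrow data type in a `CustomField`.

```python
# Hypothetical accessor for a custom "rt" field.
def rt_accessor(spectrum: dict) -> float:
    """Return the scan start time from a Pyteomics-style spectrum dict."""
    return float(spectrum["scanList"]["scan"][0]["scan start time"])


# Minimal stand-in for a parsed spectrum dictionary:
spectrum = {"scanList": {"scan": [{"scan start time": 42.7}]}}
print(rt_accessor(spectrum))  # 42.7
```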
PARAMETER | DESCRIPTION
---|---
`peak_file` | The mass spectrometry data file in mzML, mzXML, or MGF format.
`metadata_df` | A DataFrame of additional metadata to add to each mass spectrum; it must contain a `scan_id` column.
`ms_level` | The level(s) of tandem mass spectra to keep.
`preprocessing_fn` | The function(s) used to preprocess the mass spectra.
`valid_charge` | Only consider spectra with the specified precursor charges.
`custom_fields` | Additional fields to extract during peak file parsing.
`progress` | Enable or disable the progress bar.

RETURNS | DESCRIPTION
---|---
`DataFrame` | A DataFrame containing the parsed mass spectra.
spectra_to_parquet(peak_file, *, parquet_file=None, batch_size=100000, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)
Stream mass spectra to Apache Parquet, with preprocessing.
Apache Parquet is a space-efficient, columnar data storage format that is popular in the data science and engineering communities. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

- `peak_file: str`
- `scan_id: int`
- `ms_level: int`
- `precursor_mz: float64`
- `precursor_charge: int8`
- `mz_array: list[float64]`
- `intensity_array: list[float64]`
An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a `scan_id` column with the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following `scan=`, whereas for MGF files it is the zero-indexed offset of the mass spectrum in the file.
Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. Each must be a `CustomField`, where the name is the new column and the accessor is a function that extracts a value from the corresponding Pyteomics spectrum dictionary. The PyArrow data type must also be specified.
PARAMETER | DESCRIPTION
---|---
`peak_file` | The mass spectrometry data file in mzML, mzXML, or MGF format.
`parquet_file` | The output file. If not provided, it is derived from the input file stem.
`batch_size` | The number of mass spectra to process simultaneously.
`metadata_df` | A DataFrame of additional metadata to add to each mass spectrum; it must contain a `scan_id` column.
`ms_level` | The level(s) of tandem mass spectra to keep.
`preprocessing_fn` | The function(s) used to preprocess the mass spectra.
`valid_charge` | Only consider spectra with the specified precursor charges.
`custom_fields` | Additional fields to extract during peak file parsing.
`progress` | Enable or disable the progress bar.

RETURNS | DESCRIPTION
---|---
`Path` | The Parquet file that was written.
spectra_to_stream(peak_file, *, batch_size=100000, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)
Stream mass spectra in an Apache Arrow format, with preprocessing.
Apache Arrow is a space-efficient, columnar data format that is popular in the data science and engineering communities. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

- `peak_file: str`
- `scan_id: int`
- `ms_level: int`
- `precursor_mz: float`
- `precursor_charge: int`
- `mz_array: list[float]`
- `intensity_array: list[float]`
An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a `scan_id` column with the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following `scan=`, whereas for MGF files it is the zero-indexed offset of the mass spectrum in the file.
Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. Each must be a `CustomField`, where the name is the new column and the accessor is a function that extracts a value from the corresponding Pyteomics spectrum dictionary. The PyArrow data type must also be specified.
PARAMETER | DESCRIPTION
---|---
`peak_file` | The mass spectrometry data file in mzML, mzXML, or MGF format.
`batch_size` | The number of mass spectra in each RecordBatch.
`metadata_df` | A DataFrame of additional metadata to add to each mass spectrum; it must contain a `scan_id` column.
`ms_level` | The level(s) of tandem mass spectra to keep.
`preprocessing_fn` | The function(s) used to preprocess the mass spectra.
`valid_charge` | Only consider spectra with the specified precursor charges.
`custom_fields` | Additional fields to extract during peak file parsing.
`progress` | Enable or disable the progress bar.

RETURNS | DESCRIPTION
---|---
Generator of `pyarrow.RecordBatch` | Batches of parsed spectra.
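Conceptually, the stream chunks parsed spectra into fixed-size batches. A stdlib-only sketch of that batching behavior (not the depthcharge implementation) looks like:

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield lists of up to batch_size items, mirroring RecordBatch sizes."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# 10 spectra with batch_size=4 yield batches of 4, 4, and 2:
print([len(batch) for batch in batched(range(10), 4)])  # [4, 4, 2]
```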
SpectrumDataset(spectra, batch_size, path=None, parse_kwargs=None, **kwargs)
Bases: LanceDataset
Store and access a collection of mass spectra.
Parse and/or add mass spectra to an index in the lance data format. This format enables fast random access to spectra for training. The resulting file is served as a PyTorch IterableDataset, allowing spectra to be accessed efficiently for training and inference. This is accomplished using the Lance PyTorch integration.

The `batch_size` parameter for this class is independent of the `batch_size` of the PyTorch DataLoader. Generally, we only want the former parameter to be greater than 1. Additionally, this dataset should not be used with a DataLoader with `num_workers` > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

If you wish to use an existing lance dataset, use the `from_lance()` method.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`path` | The name and path of the lance dataset. If the path does not contain the `.lance` extension, it is added.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.
ATTRIBUTE | DESCRIPTION
---|---
`peak_files` | The files currently in the lance dataset.
`path` | The path to the underlying lance dataset.
`n_spectra` | The number of spectra in the lance dataset.
`dataset` | The underlying lance dataset.
Attributes
n_spectra: int
property
The number of spectra in the Lance dataset.
path: Path
property
The path to the underlying lance dataset.
peak_files: list[str]
property
The files currently in the lance dataset.
Functions
add_spectra(spectra)
Add mass spectrometry data to the lance dataset.
Note that depthcharge does not verify whether the provided spectra already exist in the lance dataset.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
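Because no duplicate check is performed, callers who need one can track which (peak file, scan) pairs have already been added before calling `add_spectra()`. A set-based sketch (illustrative, not part of the depthcharge API):

```python
def filter_unseen(spectra: list[dict], seen: set[tuple[str, int]]) -> list[dict]:
    """Keep only spectra whose (peak_file, scan_id) pair has not been seen."""
    fresh = []
    for spectrum in spectra:
        key = (spectrum["peak_file"], spectrum["scan_id"])
        if key not in seen:
            seen.add(key)
            fresh.append(spectrum)
    return fresh


seen: set[tuple[str, int]] = set()
batch = [
    {"peak_file": "run1.mzML", "scan_id": 1},
    {"peak_file": "run1.mzML", "scan_id": 1},  # duplicate, dropped
    {"peak_file": "run1.mzML", "scan_id": 2},
]
print(len(filter_unseen(batch, seen)))  # 2
```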
from_lance(path, batch_size, parse_kwargs=None, **kwargs)
classmethod
Load a previously created lance dataset.
PARAMETER | DESCRIPTION
---|---
`path` | The path of the lance dataset.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.

RETURNS | DESCRIPTION
---|---
`SpectrumDataset` | The dataset of mass spectra.
AnnotatedSpectrumDataset(spectra, annotations, tokenizer, batch_size, path=None, parse_kwargs=None, **kwargs)
Bases: SpectrumDataset
Store and access a collection of annotated mass spectra.
Parse and/or add mass spectra to an index in the lance data format. This format enables fast random access to spectra for training. The resulting file is served as a PyTorch IterableDataset, allowing spectra to be accessed efficiently for training and inference. This is accomplished using the Lance PyTorch integration.

The `batch_size` parameter for this class is independent of the `batch_size` of the PyTorch DataLoader. Generally, we only want the former parameter to be greater than 1. Additionally, this dataset should not be used with a DataLoader with `num_workers` > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

If you wish to use an existing lance dataset, use the `from_lance()` method.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
`annotations` | The column name containing the annotations.
`tokenizer` | The tokenizer used to transform the annotations into PyTorch tensors.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`path` | The name and path of the lance dataset. If the path does not contain the `.lance` extension, it is added.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.
|
ATTRIBUTE | DESCRIPTION
---|---
`peak_files` | The files currently in the lance dataset.
`path` | The path to the underlying lance dataset.
`n_spectra` | The number of spectra in the lance dataset.
`dataset` | The underlying lance dataset.
`tokenizer` | The tokenizer for the annotations.
`annotations` | The annotation column in the dataset.
Attributes
n_spectra: int
property
The number of spectra in the Lance dataset.
path: Path
property
The path to the underlying lance dataset.
peak_files: list[str]
property
The files currently in the lance dataset.
Functions
add_spectra(spectra)
Add mass spectrometry data to the lance dataset.
Note that depthcharge does not verify whether the provided spectra already exist in the lance dataset.
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
from_lance(path, annotations, tokenizer, batch_size, parse_kwargs=None, **kwargs)
classmethod
Load a previously created lance dataset.
PARAMETER | DESCRIPTION
---|---
`path` | The path of the lance dataset.
`annotations` | The column name containing the annotations.
`tokenizer` | The tokenizer used to transform the annotations into PyTorch tensors.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`parse_kwargs` | Keyword arguments passed when parsing the spectra.
`**kwargs` | Keyword arguments used to initialize the underlying lance dataset.

RETURNS | DESCRIPTION
---|---
`AnnotatedSpectrumDataset` | The dataset of annotated mass spectra.
StreamingSpectrumDataset(spectra, batch_size, **parse_kwargs)
Bases: IterableDataset
Stream mass spectra from a file or DataFrame.
While the on-disk dataset provided by depthcharge.data.SpectrumDataset
provides an excellent option for model training, this class provides
a PyTorch Dataset that is more suitable for inference.
When using a StreamingSpectrumDataset
, the order of mass spectra
cannot be shuffled.
The batch_size
parameter for this class indepedent of the batch_size
of the PyTorch DataLoader. Generally, we only want the former parameter to
greater than 1. Additionally, this dataset should not be
used with a DataLoader set to max_workers
> 1, unless specific care is
used to handle the
caveats of a PyTorch IterableDataset
PARAMETER | DESCRIPTION
---|---
`spectra` | Spectra to add to this collection. These may be a DataFrame parsed with `spectra_to_df()`, or one or more mass spectrometry data files.
`batch_size` | The batch size to use for loading mass spectra. Note that this is independent of the batch size for the PyTorch DataLoader.
`**parse_kwargs` | Keyword arguments passed when parsing the spectra.

ATTRIBUTE | DESCRIPTION
---|---
`batch_size` | The batch size to use for loading mass spectra.
AnalyteDataset(tokenizer, sequences, *args)
Bases: TensorDataset
A dataset for peptide sequences.
PARAMETER | DESCRIPTION
---|---
`tokenizer` | A tokenizer specifying how to transform peptide sequences into tokens.
`sequences` | The peptide sequences in a format compatible with your tokenizer. ProForma is preferred.
`*args` | Additional values to include during data loading.
Attributes
tokens: torch.Tensor
property
The peptide sequence tokens.