
Datasets (depthcharge.data)

spectra_to_df(peak_file, *, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)

Read mass spectra into a Polars DataFrame.

Apache Parquet is a space-efficient, columnar data storage format that is popular in the data science and engineering community. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

    peak_file: str
    scan_id: int
    ms_level: int
    precursor_mz: float64
    precursor_charge: int8
    mz_array: list[float64]
    intensity_array: list[float64]

An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a scan_id column containing the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following scan=, whereas for MGF files this is the zero-indexed offset of the mass spectrum in the file.

Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. This must be a CustomField, where the name is the new column and the accessor is a function to extract a value from the corresponding Pyteomics spectrum dictionary. The pyarrow data type must also be specified.
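As a hedged sketch of this mechanism (the CustomField constructor is assumed here to take the new column name, the accessor, and a pyarrow data type; the field and key names are illustrative):

```python
# Hedged sketch of extracting a custom field. The accessor receives the
# Pyteomics spectrum dictionary; the CustomField constructor is assumed
# to take the new column name, the accessor, and a pyarrow data type.
def get_retention_time(spectrum: dict) -> float:
    """Pull the scan start time from a Pyteomics mzML spectrum dict."""
    scan = spectrum["scanList"]["scan"][0]
    return float(scan["scan start time"])

# Illustrative usage (requires depthcharge and pyarrow):
#   import pyarrow as pa
#   from depthcharge.data import CustomField, spectra_to_df
#   rt_field = CustomField("retention_time", get_retention_time, pa.float64())
#   df = spectra_to_df("spectra.mzML", custom_fields=rt_field)
```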

PARAMETER DESCRIPTION
peak_file

The mass spectrometry data file in mzML, mzXML, or MGF format.

TYPE: PathLike

metadata_df

A polars.DataFrame containing additional metadata for the spectra. This is merged on the scan_id column, which must be present, and optionally on a peak_file column, if present.

TYPE: DataFrame or LazyFrame DEFAULT: None

ms_level

The level(s) of tandem mass spectra to keep. None will retain all spectra.

TYPE: int, list of int, or None DEFAULT: 2

preprocessing_fn

The function(s) used to preprocess the mass spectra. None, the default, filters for the top 200 peaks above m/z 140, square-root transforms the intensities, and scales them to unit norm. See the preprocessing module for details and additional options.

TYPE: Callable or Iterable[Callable] DEFAULT: None

valid_charge

Only consider spectra with the specified precursor charges. If None, any precursor charge is accepted.

TYPE: int or list of int DEFAULT: None

custom_fields

Additional fields to extract during peak file parsing.

TYPE: CustomField or iterable of CustomField DEFAULT: None

progress

Enable or disable the progress bar.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
DataFrame

A dataframe containing the parsed mass spectra.

spectra_to_parquet(peak_file, *, parquet_file=None, batch_size=100000, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)

Stream mass spectra to Apache Parquet, with preprocessing.

Apache Parquet is a space-efficient, columnar data storage format that is popular in the data science and engineering community. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

    peak_file: str
    scan_id: int
    ms_level: int
    precursor_mz: float64
    precursor_charge: int8
    mz_array: list[float64]
    intensity_array: list[float64]

An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a scan_id column containing the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following scan=, whereas for MGF files this is the zero-indexed offset of the mass spectrum in the file.

Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. This must be a CustomField, where the name is the new column and the accessor is a function to extract a value from the corresponding Pyteomics spectrum dictionary. The pyarrow data type must also be specified.

PARAMETER DESCRIPTION
peak_file

The mass spectrometry data file in mzML, mzXML, or MGF format.

TYPE: PathLike

parquet_file

The output file. By default this is the input file stem with a .parquet extension.

TYPE: PathLike DEFAULT: None

batch_size

The number of mass spectra to process simultaneously.

TYPE: int DEFAULT: 100000

metadata_df

A polars.DataFrame containing additional metadata for the spectra. This is merged on the scan_id column, which must be present, and optionally on a peak_file column, if present.

TYPE: DataFrame or LazyFrame DEFAULT: None

ms_level

The level(s) of tandem mass spectra to keep. None will retain all spectra.

TYPE: int, list of int, or None DEFAULT: 2

preprocessing_fn

The function(s) used to preprocess the mass spectra. None, the default, filters for the top 200 peaks above m/z 140, square-root transforms the intensities, and scales them to unit norm. See the preprocessing module for details and additional options.

TYPE: Callable or Iterable[Callable] DEFAULT: None

valid_charge

Only consider spectra with the specified precursor charges. If None, any precursor charge is accepted.

TYPE: int or list of int DEFAULT: None

custom_fields

Additional fields to extract during peak file parsing.

TYPE: CustomField or iterable of CustomField DEFAULT: None

progress

Enable or disable the progress bar.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Path

The Parquet file that was written.
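The default output naming described above (the input file stem with a .parquet extension) can be sketched with pathlib; the helper name is illustrative:

```python
from pathlib import Path

# Sketch of the documented default: the output path is the input file
# stem with a .parquet extension.
def default_parquet_name(peak_file: str) -> Path:
    return Path(peak_file).with_suffix(".parquet")

# Illustrative usage (requires depthcharge):
#   from depthcharge.data import spectra_to_parquet
#   out = spectra_to_parquet("run01.mzML")  # writes run01.parquet by default
```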

spectra_to_stream(peak_file, *, batch_size=100000, metadata_df=None, ms_level=2, preprocessing_fn=None, valid_charge=None, custom_fields=None, progress=True)

Stream mass spectra in an Apache Arrow format, with preprocessing.

Apache Arrow is a space-efficient, columnar data format that is popular in the data science and engineering community. This function reads data from a mass spectrometry data file and extracts each mass spectrum along with its identifying information. By default, the schema is:

    peak_file: str
    scan_id: int
    ms_level: int
    precursor_mz: float
    precursor_charge: int
    mz_array: list[float]
    intensity_array: list[float]

An optional metadata DataFrame can be provided to add additional metadata to each mass spectrum. This DataFrame must contain a scan_id column containing the integer scan identifier for each mass spectrum. For mzML files, this is generally the integer following scan=, whereas for MGF files this is the zero-indexed offset of the mass spectrum in the file.

Finally, custom fields can be extracted from the mass spectrometry data file for advanced use. This must be a CustomField, where the name is the new column and the accessor is a function to extract a value from the corresponding Pyteomics spectrum dictionary. The pyarrow data type must also be specified.

PARAMETER DESCRIPTION
peak_file

The mass spectrometry data file in mzML, mzXML, or MGF format.

TYPE: PathLike

batch_size

The number of mass spectra in each RecordBatch. None will load all of the spectra in a single batch.

TYPE: int or None DEFAULT: 100000

metadata_df

A polars.DataFrame containing additional metadata for the spectra. This is merged on the scan_id column, which must be present, and optionally on a peak_file column, if present.

TYPE: DataFrame or LazyFrame DEFAULT: None

ms_level

The level(s) of tandem mass spectra to keep. None will retain all spectra.

TYPE: int, list of int, or None DEFAULT: 2

preprocessing_fn

The function(s) used to preprocess the mass spectra. None, the default, filters for the top 200 peaks above m/z 140, square-root transforms the intensities, and scales them to unit norm. See the preprocessing module for details and additional options.

TYPE: Callable or Iterable[Callable] DEFAULT: None

valid_charge

Only consider spectra with the specified precursor charges. If None, any precursor charge is accepted.

TYPE: int or list of int DEFAULT: None

custom_fields

Additional fields to extract during peak file parsing.

TYPE: CustomField or iterable of CustomField DEFAULT: None

progress

Enable or disable the progress bar.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Generator of pyarrow.RecordBatch

Batches of parsed spectra.
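Since each RecordBatch holds up to batch_size spectra, the number of batches per file follows directly; a small sketch (the helper name is illustrative):

```python
from math import ceil

# Each yielded RecordBatch holds up to batch_size spectra, so n spectra
# produce ceil(n / batch_size) batches.
def n_batches(n_spectra: int, batch_size: int) -> int:
    return ceil(n_spectra / batch_size)

# Typical consumption (requires depthcharge):
#   from depthcharge.data import spectra_to_stream
#   for batch in spectra_to_stream("spectra.mzML", batch_size=100_000):
#       ...  # each batch is a pyarrow.RecordBatch
```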

SpectrumDataset(spectra, batch_size, path=None, parse_kwargs=None, **kwargs)

Bases: LanceDataset

Store and access a collection of mass spectra.

Parse and/or add mass spectra to an index in the lance data format. This format enables fast random access to spectra for training. This file is then served as a PyTorch IterableDataset, allowing spectra to be accessed efficiently for training and inference. This is accomplished using the Lance PyTorch integration.

The batch_size parameter for this class is independent of the batch_size of the PyTorch DataLoader. Generally, only the former should be greater than 1. Additionally, this dataset should not be used with a DataLoader with num_workers > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

If you wish to use an existing lance dataset, use the from_lance() method.

PARAMETER DESCRIPTION
spectra

Spectra to add to this collection. These may be a DataFrame parsed with depthcharge.spectra_to_df(), parquet files created with depthcharge.spectra_to_parquet(), or a peak file in the mzML, mzXML, or MGF format. Additional spectra can be added later using the .add_spectra() method.

TYPE: polars.DataFrame, PathLike, or list of PathLike

batch_size

The batch size to use for loading mass spectra. Note that this is independent from the batch size for the PyTorch DataLoader.

TYPE: int

path

The name and path of the lance dataset. If the path does not end with the .lance extension, it will be added. If None, a file will be created in a temporary directory.

TYPE: PathLike, optional. DEFAULT: None

parse_kwargs

Keyword arguments passed to depthcharge.spectra_to_stream() for any peak files that are provided. This argument has no effect for DataFrame or Parquet file inputs.

TYPE: dict DEFAULT: None

**kwargs

Keyword arguments to initialize a [lance.torch.data.LanceDataset](https://lancedb.github.io/lance/api/python/lance.torch.html#lance.torch.data.LanceDataset).

TYPE: dict DEFAULT: {}

ATTRIBUTE DESCRIPTION
peak_files

TYPE: list of str

path

TYPE: Path

n_spectra

TYPE: int

dataset

TYPE: LanceDataset

Attributes

n_spectra: int property

The number of spectra in the Lance dataset.

path: Path property

The path to the underlying lance dataset.

peak_files: list[str] property

The files currently in the lance dataset.

Functions

add_spectra(spectra)

Add mass spectrometry data to the lance dataset.

Note that depthcharge does not verify whether the provided spectra already exist in the lance dataset.

PARAMETER DESCRIPTION
spectra

Spectra to add to this collection. These may be a DataFrame parsed with depthcharge.spectra_to_df(), parquet files created with depthcharge.spectra_to_parquet(), or a peak file in the mzML, mzXML, or MGF format.

TYPE: polars.DataFrame, PathLike, or list of PathLike

from_lance(path, batch_size, parse_kwargs=None, **kwargs) classmethod

Load a previously created lance dataset.

PARAMETER DESCRIPTION
path

The path of the lance dataset.

TYPE: PathLike

batch_size

The batch size to use for loading mass spectra. Note that this is independent from the batch size for the PyTorch DataLoader.

TYPE: int

parse_kwargs

Keyword arguments passed to depthcharge.spectra_to_stream() for any peak files that are provided.

TYPE: dict DEFAULT: None

**kwargs

Keyword arguments to initialize a [lance.torch.data.LanceDataset](https://lancedb.github.io/lance/api/python/lance.torch.html#lance.torch.data.LanceDataset).

TYPE: dict DEFAULT: {}

RETURNS DESCRIPTION
SpectrumDataset

The dataset of mass spectra.
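The path handling described for the path parameter (appending .lance when absent) and typical usage can be sketched as follows; the helper name is illustrative and the behavior is an assumption about the documented rule:

```python
from pathlib import Path

# Sketch of the documented rule: append .lance when the supplied path
# lacks it. The exact behavior inside depthcharge may differ.
def resolve_lance_path(path: str) -> Path:
    p = Path(path)
    if p.suffix != ".lance":
        p = p.with_name(p.name + ".lance")
    return p

# Typical usage (requires depthcharge; illustrative):
#   from depthcharge.data import SpectrumDataset
#   dataset = SpectrumDataset("spectra.mzML", batch_size=128)
#   # reopen later without re-parsing:
#   dataset = SpectrumDataset.from_lance(dataset.path, batch_size=128)
```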

AnnotatedSpectrumDataset(spectra, annotations, tokenizer, batch_size, path=None, parse_kwargs=None, **kwargs)

Bases: SpectrumDataset

Store and access a collection of annotated mass spectra.

Parse and/or add mass spectra to an index in the lance data format. This format enables fast random access to spectra for training. This file is then served as a PyTorch IterableDataset, allowing spectra to be accessed efficiently for training and inference. This is accomplished using the Lance PyTorch integration.

The batch_size parameter for this class is independent of the batch_size of the PyTorch DataLoader. Generally, only the former should be greater than 1. Additionally, this dataset should not be used with a DataLoader with num_workers > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

If you wish to use an existing lance dataset, use the from_lance() method.

PARAMETER DESCRIPTION
spectra

Spectra to add to this collection. These may be a DataFrame parsed with depthcharge.spectra_to_df(), parquet files created with depthcharge.spectra_to_parquet(), or a peak file in the mzML, mzXML, or MGF format. Additional spectra can be added later using the .add_spectra() method.

TYPE: polars.DataFrame, PathLike, or list of PathLike

annotations

The column name containing the annotations.

TYPE: str

tokenizer

The tokenizer used to transform the annotations into PyTorch tensors.

TYPE: PeptideTokenizer

batch_size

The batch size to use for loading mass spectra. Note that this is independent from the batch size for the PyTorch DataLoader.

TYPE: int

path

The name and path of the lance dataset. If the path does not end with the .lance extension, it will be added. If None, a file will be created in a temporary directory.

TYPE: PathLike, optional. DEFAULT: None

parse_kwargs

Keyword arguments passed to depthcharge.spectra_to_stream() for any peak files that are provided. This argument has no effect for DataFrame or Parquet file inputs.

TYPE: dict DEFAULT: None

**kwargs

Keyword arguments to initialize a [lance.torch.data.LanceDataset](https://lancedb.github.io/lance/api/python/lance.torch.html#lance.torch.data.LanceDataset).

TYPE: dict DEFAULT: {}

ATTRIBUTE DESCRIPTION
peak_files

TYPE: list of str

path

TYPE: Path

n_spectra

TYPE: int

dataset

TYPE: LanceDataset

tokenizer

The tokenizer for the annotations.

TYPE: PeptideTokenizer

annotations

The annotation column in the dataset.

TYPE: str

Attributes

n_spectra: int property

The number of spectra in the Lance dataset.

path: Path property

The path to the underlying lance dataset.

peak_files: list[str] property

The files currently in the lance dataset.

Functions

add_spectra(spectra)

Add mass spectrometry data to the lance dataset.

Note that depthcharge does not verify whether the provided spectra already exist in the lance dataset.

PARAMETER DESCRIPTION
spectra

Spectra to add to this collection. These may be a DataFrame parsed with depthcharge.spectra_to_df(), parquet files created with depthcharge.spectra_to_parquet(), or a peak file in the mzML, mzXML, or MGF format.

TYPE: polars.DataFrame, PathLike, or list of PathLike

from_lance(path, annotations, tokenizer, batch_size, parse_kwargs=None, **kwargs) classmethod

Load a previously created lance dataset.

PARAMETER DESCRIPTION
path

The path of the lance dataset.

TYPE: PathLike

annotations

The column name containing the annotations.

TYPE: str

tokenizer

The tokenizer used to transform the annotations into PyTorch tensors.

TYPE: PeptideTokenizer

batch_size

The batch size to use for loading mass spectra. Note that this is independent from the batch size for the PyTorch DataLoader.

TYPE: int

parse_kwargs

Keyword arguments passed to depthcharge.spectra_to_stream() for any peak files that are provided.

TYPE: dict DEFAULT: None

**kwargs

Keyword arguments to initialize a [lance.torch.data.LanceDataset](https://lancedb.github.io/lance/api/python/lance.torch.html#lance.torch.data.LanceDataset).

TYPE: dict DEFAULT: {}

RETURNS DESCRIPTION
AnnotatedSpectrumDataset

The dataset of annotated mass spectra.

StreamingSpectrumDataset(spectra, batch_size, **parse_kwargs)

Bases: IterableDataset

Stream mass spectra from a file or DataFrame.

While the on-disk dataset provided by depthcharge.data.SpectrumDataset is an excellent option for model training, this class provides a PyTorch Dataset that is better suited for inference.

When using a StreamingSpectrumDataset, the order of mass spectra cannot be shuffled.

The batch_size parameter for this class is independent of the batch_size of the PyTorch DataLoader. Generally, only the former should be greater than 1. Additionally, this dataset should not be used with a DataLoader with num_workers > 1, unless specific care is taken to handle the caveats of a PyTorch IterableDataset.

PARAMETER DESCRIPTION
spectra

Spectra to add to this collection. These may be a DataFrame parsed with depthcharge.spectra_to_df(), parquet files created with depthcharge.spectra_to_parquet(), or a peak file in the mzML, mzXML, or MGF format.

TYPE: polars.DataFrame, PathLike, or list of PathLike

batch_size

The batch size to use for loading mass spectra. Note that this is independent from the batch size for the PyTorch DataLoader.

TYPE: int

**parse_kwargs

Keyword arguments passed to depthcharge.spectra_to_stream() for any peak files that are provided. This argument has no effect for DataFrame or Parquet file inputs.

TYPE: dict DEFAULT: {}

ATTRIBUTE DESCRIPTION
batch_size

The batch size to use for loading mass spectra.

TYPE: int
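Conceptually, the stream yields fixed-size batches of spectra in file order (no shuffling). A stdlib sketch of that batching pattern, with illustrative names:

```python
from itertools import islice
from typing import Iterable, Iterator

# Yield successive lists of up to batch_size items, preserving order,
# mirroring how the dataset streams batches of spectra.
def batched(spectra: Iterable, batch_size: int) -> Iterator[list]:
    it = iter(spectra)
    while chunk := list(islice(it, batch_size)):
        yield chunk

# Typical usage (requires depthcharge; illustrative):
#   from depthcharge.data import StreamingSpectrumDataset
#   dataset = StreamingSpectrumDataset("spectra.mzML", batch_size=128)
#   for batch in dataset:
#       ...  # run inference on each batch
```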

AnalyteDataset(tokenizer, sequences, *args)

Bases: TensorDataset

A dataset for peptide sequences.

PARAMETER DESCRIPTION
tokenizer

A tokenizer specifying how to transform peptide sequences into tokens.

TYPE: PeptideTokenizer

sequences

The peptide sequences in a format compatible with your tokenizer. ProForma is preferred.

TYPE: Iterable[str]

*args

Additional values to include during data loading.

TYPE: Tensor DEFAULT: ()

Attributes

tokens: torch.Tensor property

The peptide sequence tokens.