Modeling Mass Spectra
We think that the biggest barriers to building deep learning models for mass spectra are (1) parsing the data into a reasonable format, (2) representing the mass spectra efficiently for learning, and (3) constructing models that leverage the structure of the data in a mass spectrum.
Parsing Mass Spectra
Depthcharge supports reading mass spectra from a variety of open formats: mzML, mzXML, and MGF.1 Under the hood, Depthcharge uses the Apache Arrow data format to store mass spectrometry data in a tabular manner, and likewise Apache Parquet to store individual mass spectrometry data files on disk. This means that standard data science tools like Pandas or Polars can be used to interact with our mass spectra after they are parsed by Depthcharge.
For example, we can read an mzML file into a polars.DataFrame using spectra_to_df():
import polars as pl
import depthcharge as dc
mzml_file = "../../data/TMT10-Trial-8.mzML"
# Read an mzML into a DataFrame:
df = dc.data.spectra_to_df(mzml_file, progress=False)
print(df.head())
output
shape: (4, 7)
| peak_file          | scan_id                         | ms_level | precursor_mz | precursor_charge | mz_array                         | intensity_array                  |
| str                | str                             | u8       | f64          | i16              | list[f64]                        | list[f64]                        |
|--------------------|---------------------------------|----------|--------------|------------------|----------------------------------|----------------------------------|
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 804.774963   | 3                | [284.917023, 312.908997, … 157…] | [0.004933, 0.005314, … 0.00807…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1001.6693    | 2                | [311.863007, 330.019012, … 173…] | [0.008258, 0.008424, … 0.00644…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1047.6174    | 3                | [338.669006, 354.175995, … 185…] | [0.011869, 0.010293, … 0.01242…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 800.4349     | 3                | [247.792007, 313.678009, … 194…] | [0.005207, 0.004437, … 0.00915…] |
We can write the mass spectra directly to a Parquet file:
pq_file = dc.data.spectra_to_parquet(mzml_file, progress=False)
print(pl.read_parquet(pq_file).head())
output
shape: (4, 7)
| peak_file          | scan_id                         | ms_level | precursor_mz | precursor_charge | mz_array                         | intensity_array                  |
| str                | str                             | u8       | f64          | i16              | list[f64]                        | list[f64]                        |
|--------------------|---------------------------------|----------|--------------|------------------|----------------------------------|----------------------------------|
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 804.774963   | 3                | [284.917023, 312.908997, … 157…] | [0.004933, 0.005314, … 0.00807…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1001.6693    | 2                | [311.863007, 330.019012, … 173…] | [0.008258, 0.008424, … 0.00644…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1047.6174    | 3                | [338.669006, 354.175995, … 185…] | [0.011869, 0.010293, … 0.01242…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 800.4349     | 3                | [247.792007, 313.678009, … 194…] | [0.005207, 0.004437, … 0.00915…] |
Or we can stream them from the original file in batches:
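A minimal sketch of that step, assuming Depthcharge's spectra_to_stream() generator, which yields Apache Arrow record batches:
# Hedged sketch: stream record batches instead of materializing a DataFrame.
stream = dc.data.spectra_to_stream(mzml_file, progress=False)
for batch in stream:
    print(pl.from_arrow(batch).head())
    break  # only inspect the first batch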
output
shape: (4, 7)
| peak_file          | scan_id                         | ms_level | precursor_mz | precursor_charge | mz_array                         | intensity_array                  |
| str                | str                             | u8       | f64          | i16              | list[f64]                        | list[f64]                        |
|--------------------|---------------------------------|----------|--------------|------------------|----------------------------------|----------------------------------|
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 804.774963   | 3                | [284.917023, 312.908997, … 157…] | [0.004933, 0.005314, … 0.00807…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1001.6693    | 2                | [311.863007, 330.019012, … 173…] | [0.008258, 0.008424, … 0.00644…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1047.6174    | 3                | [338.669006, 354.175995, … 185…] | [0.011869, 0.010293, … 0.01242…] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 800.4349     | 3                | [247.792007, 313.678009, … 194…] | [0.005207, 0.004437, … 0.00915…] |
Spectrum Preprocessing
Preprocessing steps, such as filtering peaks and transforming intensities, are performed during parsing and controlled using the preprocessing_fn parameter in all of the above functions. Depthcharge is closely tied to the spectrum_utils package, and any of the processing methods within spectrum_utils may be applied to spectra in Depthcharge as well. Additionally, custom functions can be specified that accept a depthcharge.MassSpectrum as input and return a depthcharge.MassSpectrum as output.
The preprocessing_fn parameter defines a collection of functions to apply to each mass spectrum in sequence. The default preprocessing_fn is:
[
dc.data.preprocessing.set_mz_range(min_mz=140),
dc.data.preprocessing.filter_intensity(max_num_peaks=200),
dc.data.preprocessing.scale_intensity(scaling="root"),
dc.data.preprocessing.scale_to_unit_norm,
]
However, we can change this process to meet our needs. As an example, let's create a rather useless preprocessing function that sets the intensity of all the peaks to a value of one:
import numpy as np
def scale_to_one(spec):
"""Set intensities to one.
Parameters
----------
spec : depthcharge.MassSpectrum
The mass spectrum to transform.
Returns
-------
depthcharge.MassSpectrum
The transformed mass spectrum.
"""
spec.intensity = np.ones_like(spec.intensity)
return spec
We can then use our preprocessing function, either by itself or in combination with other functions:
df = dc.data.spectra_to_df(
mzml_file,
progress=False,
preprocessing_fn=[scale_to_one]
)
print(df.head())
output
shape: (4, 7)
| peak_file          | scan_id                         | ms_level | precursor_mz | precursor_charge | mz_array                         | intensity_array   |
| str                | str                             | u8       | f64          | i16              | list[f64]                        | list[f64]         |
|--------------------|---------------------------------|----------|--------------|------------------|----------------------------------|-------------------|
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 804.774963   | 3                | [284.917023, 312.908997, … 157…] | [1.0, 1.0, … 1.0] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1001.6693    | 2                | [311.863007, 330.019012, … 173…] | [1.0, 1.0, … 1.0] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1047.6174    | 3                | [338.669006, 354.175995, … 185…] | [1.0, 1.0, … 1.0] |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 800.4349     | 3                | [247.792007, 313.678009, … 194…] | [1.0, 1.0, … 1.0] |
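Along the same lines, spectrum_utils-style methods can be slotted into the list. Here is a hedged sketch, assuming depthcharge.MassSpectrum exposes spectrum_utils' remove_precursor_peak() method:
my_preprocessing_fn = [
    # Assumed spectrum_utils method; removes peaks near the precursor m/z:
    lambda spec: spec.remove_precursor_peak(2, "Da"),
    dc.data.preprocessing.scale_intensity(scaling="root"),
]

df = dc.data.spectra_to_df(
    mzml_file,
    progress=False,
    preprocessing_fn=my_preprocessing_fn,
)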
Extracting Additional Data
By default, Depthcharge only extracts a minimal amount of information from a mass spectrometry data file. Additional fields can be retrieved using the custom_fields parameter in the above parsing functions. However, we have to tell Depthcharge exactly what data we want to extract and how to extract it.
Currently, all mass spectrometry data file parsing is handled using the corresponding parser from Pyteomics, which yields a Python dictionary for each spectrum. The function we define to extract data must operate on this spectrum dictionary. Below, we define a custom field to extract the retention time for each spectrum:
import pyarrow as pa
rt_field = dc.data.CustomField(
# The new column name:
name="ret_time",
# The function to extract the retention time:
accessor=lambda x: x["scanList"]["scan"][0]["scan start time"],
# The expected data type:
dtype=pa.float64(),
)
df = dc.data.spectra_to_df(
mzml_file,
progress=False,
custom_fields=rt_field,
)
print(df.head())
output
shape: (4, 8)
| peak_file          | scan_id                         | ms_level | precursor_mz | precursor_charge | mz_array                         | intensity_array                  | ret_time |
| str                | str                             | u8       | f64          | i16              | list[f64]                        | list[f64]                        | f64      |
|--------------------|---------------------------------|----------|--------------|------------------|----------------------------------|----------------------------------|----------|
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 804.774963   | 3                | [284.917023, 312.908997, … 157…] | [0.004933, 0.005314, … 0.00807…] | 1.069915 |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1001.6693    | 2                | [311.863007, 330.019012, … 173…] | [0.008258, 0.008424, … 0.00644…] | 1.076773 |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1047.6174    | 3                | [338.669006, 354.175995, … 185…] | [0.011869, 0.010293, … 0.01242…] | 1.083688 |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 800.4349     | 3                | [247.792007, 313.678009, … 194…] | [0.005207, 0.004437, … 0.00915…] | 1.09051  |
Adding Additional Outside Data
As a DataFrame or Parquet file, the parsed mass spectra are relatively easy to manipulate using standard data science tools like Polars and Pandas. However, we can also efficiently add new data to our mass spectra during parsing by providing a separate metadata dataframe as the metadata_df parameter. We require that this dataframe have a scan_id field, and it may optionally have a peak_file field; these fields are used to join the metadata table with the parsed spectra.
For example, we could use a metadata_df to pair peptide detections with the spectra from which they were detected:
metadata_df = pl.DataFrame({
"scan_id": [
f"controllerType=0 controllerNumber=1 scan={x}"
for x in (501, 507)
],
"peptide": ["LESLIEK", "EDITHR"]
})
df = dc.data.spectra_to_df(
mzml_file,
progress=False,
metadata_df=metadata_df,
)
print(df.head())
output
shape: (4, 8)
| peak_file          | scan_id                         | ms_level | precursor_mz | precursor_charge | mz_array                         | intensity_array                  | peptide |
| str                | str                             | u8       | f64          | i16              | list[f64]                        | list[f64]                        | str     |
|--------------------|---------------------------------|----------|--------------|------------------|----------------------------------|----------------------------------|---------|
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 804.774963   | 3                | [284.917023, 312.908997, … 157…] | [0.004933, 0.005314, … 0.00807…] | LESLIEK |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1001.6693    | 2                | [311.863007, 330.019012, … 173…] | [0.008258, 0.008424, … 0.00644…] | null    |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 1047.6174    | 3                | [338.669006, 354.175995, … 185…] | [0.011869, 0.010293, … 0.01242…] | EDITHR  |
| TMT10-Trial-8.mzML | controllerType=0 controllerNum… | 2        | 800.4349     | 3                | [247.792007, 313.678009, … 194…] | [0.005207, 0.004437, … 0.00915…] | null    |
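Because the result is still an ordinary Polars DataFrame, standard operations apply downstream. For example, a small sketch that keeps only the spectra that received a peptide annotation:
# Keep only the rows where the metadata join found a peptide:
matched = df.filter(pl.col("peptide").is_not_null())
print(matched.select(["scan_id", "precursor_mz", "peptide"]))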
Building PyTorch Datasets from Mass Spectra
Although the individual mass spectrometry data file parsing features are nice, often we will want to train models on more than one file and at a scale that is unlikely to fit in memory. For this task, Depthcharge provides three dataset classes for mass spectra:
- SpectrumDataset - Use this class for training on mass spectra.
- AnnotatedSpectrumDataset - Use this class for training on annotated mass spectra, such as peptide-spectrum matches.
- StreamingSpectrumDataset - Use this class for running inference on mass spectra.
The SpectrumDataset and AnnotatedSpectrumDataset classes parse spectra into a Lance dataset, which allows for efficient on-disk storage and fast random access of the stored spectra. All of these classes can be created from the same mass spectrometry data formats as above, or from previously parsed mass spectra as dataframes or Parquet files. Furthermore, when doing the former, all of the same features for preprocessing and adding additional data are available using the parse_kwargs parameter.
For example, we can create a dataset for our example file:
from depthcharge.data import SpectrumDataset
parse_kwargs = {"progress": False}
dataset = SpectrumDataset(mzml_file, batch_size=2, parse_kwargs=parse_kwargs)
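For inference, a comparable sketch with the streaming class, assuming StreamingSpectrumDataset accepts the same peak file and batch_size arguments as its siblings:
from depthcharge.data import StreamingSpectrumDataset

# Hedged sketch: stream batches directly from the file without building an index.
streaming = StreamingSpectrumDataset(mzml_file, batch_size=2)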
The SpectrumDataset and AnnotatedSpectrumDataset classes use the native PyTorch integration in Lance, and all of the corresponding parameters are available as keyword arguments. Furthermore, all three of these classes are PyTorch IterableDataset classes, so they are ready to be used directly to train and evaluate deep learning models with PyTorch.
A word of caution: the batch size and any parallelism are best handled by the dataset class itself rather than by the PyTorch DataLoader; hence, we recommend initializing DataLoaders with num_workers <= 1 and batch_size = 1, or simply:
from torch.utils.data import DataLoader
loader = DataLoader(dataset, batch_size=None)
for batch in loader:
print(batch["scan_id"], batch["precursor_mz"])
output
['controllerType=0 controllerNumber=1 scan=501', 'controllerType=0 controllerNumber=1 scan=504'] tensor([ 804.7750, 1001.6693])
['controllerType=0 controllerNumber=1 scan=507', 'controllerType=0 controllerNumber=1 scan=510'] tensor([1047.6174, 800.4349])
Transformer Models for Mass Spectra
Now that we know how to parse mass spectra, we can build a model that uses them. In Depthcharge, we've designed Transformer models specifically for this task: the SpectrumTransformerEncoder. However, the dataset classes and other modules in Depthcharge are fully interoperable with any PyTorch module.
from depthcharge.transformers import SpectrumTransformerEncoder
model = SpectrumTransformerEncoder()
for batch in loader:
out = model(batch["mz_array"], batch["intensity_array"])
print(out[0].shape)
output
torch.Size([2, 119, 128])
torch.Size([2, 108, 128])
Note that by default, our Transformer model only considers the spectrum itself and not any of the precursor information. However, we can add it!
The first element output by each Transformer module in Depthcharge is a global representation of the sequence, which in this case is a mass spectrum. By default, it is set to zeros and ignored. We can change this behavior by creating a child class of our Transformer module and overriding the global_token_hook method. Let's create a hook that will add information about the precursor mass and charge to the global representation:
from depthcharge.encoders import FloatEncoder
class MyModel(SpectrumTransformerEncoder):
"""Our custom model class."""
def __init__(self, *args, **kwargs):
"""Add parameters for the global token hook."""
super().__init__(*args, **kwargs)
self.precursor_mz_encoder = FloatEncoder(self.d_model)
self.precursor_z_encoder = FloatEncoder(self.d_model, 1, 10)
def global_token_hook(self, mz_array, intensity_array, *args, **kwargs):
"""Return a simple representation of the precursor."""
mz_rep = self.precursor_mz_encoder(
kwargs["precursor_mz"].type_as(mz_array)[None, :],
)
charge_rep = self.precursor_z_encoder(
kwargs["precursor_charge"].type_as(mz_array)[None, :],
)
return (mz_rep + charge_rep)[0, :]
Now we can use our new MyModel class with our mass spectra:
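A minimal sketch of that call, assuming the precursor columns from each batch are passed through as keyword arguments so the hook can see them:
model = MyModel()

for batch in loader:
    out = model(
        batch["mz_array"],
        batch["intensity_array"],
        precursor_mz=batch["precursor_mz"],
        precursor_charge=batch["precursor_charge"],
    )
    print(out[0].shape)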
output
torch.Size([2, 119, 128])
torch.Size([2, 108, 128])
These Depthcharge modules are merely an accessible starting point for us to build a model fully customized to our task at hand.
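For instance, here is a hedged sketch of one way to attach a task-specific head to the encoder's global token; the SpectrumClassifier class and its design are illustrative assumptions, not part of Depthcharge:
from torch import nn

class SpectrumClassifier(nn.Module):
    """A toy downstream model built on top of our custom encoder (illustrative only)."""

    def __init__(self, n_classes=2):
        super().__init__()
        self.encoder = MyModel()
        self.head = nn.Linear(self.encoder.d_model, n_classes)

    def forward(self, batch):
        """Embed the spectra and classify each one from its global token."""
        out = self.encoder(
            batch["mz_array"],
            batch["intensity_array"],
            precursor_mz=batch["precursor_mz"],
            precursor_charge=batch["precursor_charge"],
        )
        # out[0] holds the embeddings; the global token sits at position 0:
        return self.head(out[0][:, 0, :])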
- We plan to support the Bruker .d format through timsrust. ↩