Changelog for Depthcharge
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[Unreleased]
[v0.4.7]
Fixed
- Add stop and start tokens for AnnotatedSpectrumDataset, when available.
- When reverse is used for the PeptideTokenizer, automatically reverse the decoded peptide.
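The round trip that the second fix restores can be sketched as follows. ToyReversingTokenizer, its encode/decode methods, and its one-character-per-residue vocabulary are hypothetical stand-ins chosen for illustration; they are not Depthcharge's PeptideTokenizer API.

```python
class ToyReversingTokenizer:
    """Hypothetical stand-in for a tokenizer with a `reverse` option."""

    def __init__(self, residues, reverse=False):
        # Index 0 is reserved for padding in this toy vocabulary.
        self.index = {aa: i + 1 for i, aa in enumerate(residues)}
        self.residue = {i: aa for aa, i in self.index.items()}
        self.reverse = reverse

    def encode(self, peptide):
        """Map residues to integer tokens, reversed if requested."""
        tokens = [self.index[aa] for aa in peptide]
        return tokens[::-1] if self.reverse else tokens

    def decode(self, tokens):
        """Map tokens back to a peptide, undoing the reversal."""
        residues = [self.residue[t] for t in tokens if t != 0]
        if self.reverse:
            residues = residues[::-1]
        return "".join(residues)
```

With reverse=True, decoding the encoded tokens returns the peptide in its original orientation rather than the reversed one.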
[v0.4.6]
Added
- Added support for unsigned modification masses that don't quite conform to the ProForma standard.
[v0.4.5]
Changed
- The scan_id column for parsed spectra is now a string instead of an integer. This is less space efficient, but we ran into issues with Sciex indexing when trying to use only an integer.
[v0.4.4]
Changed
- Partially revert length changes to SpectrumDataset and AnnotatedSpectrumDataset. We removed __len__ from both due to problems with PyTorch Lightning compatibility.
- Simplify dataset code by removing redundancy with lance.pytorch.LanceDataset.
- Improved warning message for skipped spectra.
[v0.4.3]
Changed
- The length of SpectrumDataset and AnnotatedSpectrumDataset now reflects the samples parameter of the lance.pytorch.LanceDataset parent class.
[v0.4.2]
Changed
- The length of SpectrumDataset and AnnotatedSpectrumDataset is now the number of batches, not the number of spectra. This lets tools like PyTorch Lightning create their progress bars properly.
- Parsing a dataset no longer requires reading essentially the whole first file. The schema is now inferred from the first 128 spectra.
[v0.4.1]
Added
- Significant updates to documentation. Add how to model mass spectra.
- Reading and writing from cloud storage on everything!
Changed
- Migrated to Mike for mkdocs to manage multiple versions.
- Moved test GitHub Action from pip to uv.
[v0.4.0]
We have completely reworked the data module. Depthcharge now uses Apache Arrow-based formats instead of HDF5; spectra are either converted to Parquet or streamed with PyArrow, optionally into Lance datasets.
We now also have full support for small molecules, with the MoleculeTokenizer, AnalyteTransformerEncoder, and AnalyteTransformerDecoder classes.
Breaking Changes
- PeptideTransformer* are now AnalyteTransformer*, providing full support for small molecule analytes. Additionally, the interface has been completely reworked.
- Mass spectrometry data parsers now function as iterators, yielding batches of spectra as pyarrow.RecordBatch objects.
- Parsers can now be told to read arbitrary fields from their respective file formats with the custom_fields parameter.
- The parsing functionality of SpectrumDataset and its subclasses has been moved to the spectra_to_* functions in the data module.
- SpectrumDataset and its subclasses now return dictionaries of data rather than a tuple of data. This allows us to incorporate arbitrary additional data.
- SpectrumDataset and its subclasses are now lance.torch.data.LanceDataset subclasses, providing native PyTorch integration.
- Dataset classes no longer have a loader() method.
Added
- Support for small molecules.
- Added the StreamingSpectrumDataset for fast inference.
- Added spectra_to_df, spectra_to_df, spectra_to_stream to the depthcharge.data module.
Changed
- Determining the mass spectrometry data file format is now less fragile. It now looks for known line contents, rather than relying on the extension.
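A minimal sketch of content-based format detection, as opposed to trusting the file extension. The function name sniff_format, the marker strings, and the 50-line limit are assumptions made for illustration; they are not taken from Depthcharge's parsers.

```python
from pathlib import Path

# Hypothetical markers: a real parser may check different content.
_MARKERS = {
    "mzML": ("<mzML", "<indexedmzML"),
    "mzXML": ("<mzXML",),
    "MGF": ("BEGIN IONS",),
}


def sniff_format(path, max_lines=50):
    """Guess the file format from the first few lines of its content."""
    with Path(path).open("r", errors="replace") as handle:
        # zip() stops after max_lines, so large files are not read fully.
        for _, line in zip(range(max_lines), handle):
            for name, markers in _MARKERS.items():
                if any(marker in line for marker in markers):
                    return name
    raise ValueError(f"Could not determine the format of {path}")
```

Because only the content is inspected, an MGF file saved with a .txt extension would still be recognized as MGF in this sketch.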
[v0.3.1] - 2023-08-18
Added
- Support for fine-tuning the wavelengths used for encoding floating point numbers like m/z and intensity to the FloatEncoder and PeakEncoder.
Fixed
- The tgt_mask in the PeptideTransformerDecoder was the incorrect type. Now it is bool as it should be. Thanks @justin-a-sanders!
[v0.3.0] - 2023-06-06
Added
- Providing a proper tokenization class (also resolves #24 and #18)
- First-class support for ProForma peptide annotations, thanks to spectrum_utils and pyteomics.
- Adding primitive dataclasses for peptides, peptide ions, mass spectra ... and even small molecules 🚀
- Adding type hints to everything and stricter linting with Ruff.
- Adding a ton of tests.
- Tight integration with spectrum_utils 💪
Changed
- Moving preprocessing onto parsing instead of data loading (similar to @bittremieux's proposal in #31)
- Combining the SpectrumIndex and SpectrumDataset classes into one.
- Changing peak encodings. Instead of encoding the intensity using a linear projection and summing with the sinusoidal m/z encodings, now the intensity is also sinusoidally encoded and is combined with the sinusoidal m/z encodings using a linear layer.
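The revised peak encoding can be sketched in plain Python as follows. This is a sketch under stated assumptions and not the library's implementation: the function names, the geometric wavelength spacing, the default wavelength range, and the explicit weight matrix standing in for a learned linear layer are all hypothetical.

```python
import math


def sinusoidal_encoding(value, d_model, min_wavelength=0.001, max_wavelength=10000.0):
    """Encode a float as d_model/2 sine and d_model/2 cosine features."""
    if d_model < 4 or d_model % 2:
        raise ValueError("d_model must be an even number >= 4 in this sketch")
    half = d_model // 2
    ratio = max_wavelength / min_wavelength
    # Assumed spacing: wavelengths grow geometrically from min to max.
    waves = [min_wavelength * ratio ** (i / (half - 1)) for i in range(half)]
    sines = [math.sin(2 * math.pi * value / w) for w in waves]
    cosines = [math.cos(2 * math.pi * value / w) for w in waves]
    return sines + cosines


def linear(weights, bias, features):
    """Apply a dense layer: one output per row of weights."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def encode_peak(mz, intensity, weights, bias, d_model):
    """Encode m/z and intensity sinusoidally, then mix them with a linear layer."""
    mz_enc = sinusoidal_encoding(mz, d_model)
    int_enc = sinusoidal_encoding(intensity, d_model)
    # The linear layer sees both encodings concatenated (2 * d_model inputs).
    return linear(weights, bias, mz_enc + int_enc)
```

Here weights would have d_model rows of 2 * d_model entries each. In a trained model these values are learned, which is what lets the layer decide how much of each encoding to keep.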
[v0.2.3] - 2023-08-18
Fixed
- Applied hotfix from v0.3.1
[v0.2.2] - 2023-05-15
Fixed
- Fixed retrieving version information.
[v0.2.1] - 2023-05-13
Changed
- Change target mask from float to boolean.
- Log the number of spectra that are skipped due to an invalid precursor charge.
[v0.2.0] - 2023-03-06
Breaking Changes
- Dropped pytorch-lightning as a dependency.
- Removed SpectrumDataModule
- Removed full-blown models (depthcharge.models)
- Fixed sinusoidal encoders (Issue #27)
- MassEncoder is now FloatEncoder, because it's generally useful for encoding floating-point numbers.
Added
- pre-commit hooks and linting with Ruff.
Changed
- Tensorboard is now an optional dependency.
Removed
- The example de novo peptide sequencing model.
[v0.1.0] - 2022-11-15
Changed
- The
detokenize()
method now returns a list instead of a string.
[v0.0.1] - 2022-09-29
Added
- This is the first release! All changes from this point forward will be recorded in this changelog.