Spectral denoising: formula denoising

The formula_denoising function removes chemical noise ions from MS/MS spectra by evaluating whether each ion could be formed by a chemically plausible subformula loss from the precursor ion.

Basic usage

The formula_denoising function is used to remove noise ions from an annotated spectrum. It therefore needs the molecular information for that spectrum: a SMILES string or molecular formula, and the adduct.

Although the parameter is called ‘SMILES’, it accepts either a molecular formula or a SMILES representation. Using SMILES is recommended whenever you have it in hand or can take it from the top hits of a library search, because it avoids the false negatives caused by adding N2/O to non-aromatic compounds. This affects only a small percentage of cases.

If you are interested in reproducing results from the literature, please make sure SMILES is used.

The mass tolerance is used when searching for the subformula loss that corresponds to a given fragment ion. It is recommended to start with a small value: if the actual mass error, measured on the precursor ion, is larger than the mass tolerance, the tolerance is increased automatically. The precursor ion is searched for within at most +/- 10 mDa.

The formula_db can be found at: Formula_db. It is already sorted by mass.

import numpy as np
import spectral_denoising as sd
from spectral_denoising.noise import *
from spectral_denoising.chem_utils import *

# formula_db must already be loaded as a pandas DataFrame with a 'mass' column (see the link above)
peak = np.array([[69.071, 7.917962], [86.066, 1.021589], [86.0969, 100.0]], dtype=np.float32)
smiles = 'C1=CC=CC=C1'
adduct = '[M+H]+'
pmz = calculate_precursormz(adduct, smiles)
# add 10 synthetic chemical noise ions to the clean spectrum
noise = generate_chemical_noise(pmz, lamda=10, polarity='+', formula_db=formula_db, n=10)
peak_with_noise = add_noise(peak, noise)

# denoise the noisy spectrum, then compare both spectra against the clean reference
peak_denoised = sd.formula_denoising(peak_with_noise, smiles, adduct)
print(f'Entropy similarity of spectrum with noise: {sd.entropy_similairty(peak_with_noise, peak, pmz):.2f}.')
print(f'Entropy similarity of denoised spectrum: {sd.entropy_similairty(peak_denoised, peak, pmz):.2f}.')

The output will be:

Entropy similarity of spectrum with noise: 0.33.
Entropy similarity of denoised spectrum: 1.00.

Want to know the details of the implementation?

Step 0: Modify the master formula based on SMILES and adduct information

The very first step is to derive the molecular formula of the precursor ion from the SMILES string. If the adduct contains atoms other than a proton, the master formula is modified accordingly so that losses involving the adduct are allowed. In addition, 2 nitrogen atoms and 1 oxygen atom are added if a benzene substructure is present, to account for rare adducts formed in the collision cell (More info.)

from spectral_denoising.spectral_denoising import *
smiles = 'O=c1nc[nH]c2nc[nH]c12'
adduct = '[M+Na]+'
print(prep_formula(smiles, adduct))

The output will be:

C5H4N4NaO

Step 1: Get precursor ion information

Next, the precursor statistics are collected. If the real precursor ion exists in the spectrum, the algorithm prefers to use it, because the loss calculation is then free of systematic error. Otherwise, the algorithm uses the computed precursor m/z. The real mass error is also calculated at this step; if it is larger than the mass_tolerance passed to the function, the tolerance is increased slightly to account for it.

For this reason, it is recommended to use a smaller mass tolerance to start with.

At this step, the spectrum is also sliced into 2 parts, the precursor region and the fragment region, using the slice_spectrum function. Since the algorithm focuses on relative losses, only the fragment region is denoised; the precursor region is kept intact and added back at the very end.

import numpy as np
from spectral_denoising.spectral_denoising import *
from spectral_denoising.chem_utils import *
smiles = 'O=c1nc[nH]c2nc[nH]c12'
adduct = '[M+Na]+'
# each row is one peak: [m/z, intensity]
peak = np.array([[48.992496490478516, 154.0],
                [63.006099700927734, 265.0],
                [79.02062225341797, 521.0],
                [159.02373146795, 999]],
                dtype = np.float32)
computed_pmz = calculate_precursormz(adduct, smiles)
pmz, mass_threshold = get_pmz_statistics(peak, computed_pmz, mass_tolerance=0.005)
print(pmz, mass_threshold)

The output will be:

159.02373 0.006996155
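
The precursor handling described in this step can be sketched in plain NumPy. This is only an illustration, not the library's code: the helper name pmz_statistics_sketch is hypothetical, the +/- 10 mDa search window is taken from the basic usage notes, and the widening rule (use 1.75 times the observed deviation whenever that exceeds the starting tolerance) is an assumption chosen to be consistent with the output shown above.

```python
import numpy as np

def pmz_statistics_sketch(msms, computed_pmz, mass_tolerance, search_window=0.01):
    """Return (precursor m/z to use, mass threshold to use). Illustrative only."""
    msms = np.asarray(msms, dtype=np.float64)
    # Peaks close enough to the computed precursor m/z to be the real precursor.
    near = msms[np.abs(msms[:, 0] - computed_pmz) <= search_window]
    if near.shape[0] == 0:
        # Precursor fully fragmented: fall back to the computed value.
        return computed_pmz, mass_tolerance
    # Take the most intense peak in the window as the real precursor.
    real_pmz = near[np.argmax(near[:, 1]), 0]
    deviation = abs(real_pmz - computed_pmz)
    # Assumed widening rule: never go below the starting tolerance.
    return real_pmz, max(mass_tolerance, 1.75 * deviation)

peak = np.array([[48.9925, 154.0], [63.0061, 265.0],
                 [79.0206, 521.0], [159.0237, 999.0]])
pmz, threshold = pmz_statistics_sketch(peak, computed_pmz=159.0277, mass_tolerance=0.005)
print(pmz, round(threshold, 4))
```

Starting from a small tolerance keeps the search strict for well-calibrated spectra, while the widening step only loosens it when the observed precursor shows that the instrument error is larger.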

Step 2: Populate all possible subformulas from master formula

The next step is to populate all possible subformulas (with their masses) from the master formula. This can be done easily with the get_all_subformulas function. The candidate formulas and masses are sorted by mass to speed up the search.

master_formula = prep_formula(smiles, adduct)  # master formula from Step 0
all_possible_candidate_formula, all_possible_mass = get_all_subformulas(master_formula)
print(all_possible_candidate_formula[1:5], all_possible_mass[1:5]) # only show entries 1 to 4 for brevity

The output will be:

['H', 'H2', 'H3', 'H4'] [1.00782503 2.01565006 3.0234751  4.03130013]
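
To make the enumeration concrete, here is a minimal sketch of how subformulas can be generated with itertools.product and sorted by mass. It is an illustration under stated assumptions: the helper name all_subformulas_sketch and the small mass table are hypothetical, and the formula is passed as a plain dictionary instead of being parsed with chemparse as the library does.

```python
import itertools
import numpy as np

# Monoisotopic masses for a few elements (enough for this illustration).
MONO_MASS = {'C': 12.0, 'H': 1.00782503, 'N': 14.00307401,
             'O': 15.99491462, 'Na': 22.98976928}

def all_subformulas_sketch(element_counts):
    """Enumerate every subformula of a formula given as {element: count}."""
    elements = list(element_counts)
    # One range per element: 0..count atoms of that element.
    ranges = [range(element_counts[el] + 1) for el in elements]
    formulas, masses = [], []
    for combo in itertools.product(*ranges):
        formulas.append(''.join(
            el + (str(n) if n > 1 else '')
            for el, n in zip(elements, combo) if n > 0))
        masses.append(sum(MONO_MASS[el] * n for el, n in zip(elements, combo)))
    # Sort by mass so that later lookups can use binary search.
    order = np.argsort(masses)
    return [formulas[i] for i in order], np.array(masses)[order]

formulas, masses = all_subformulas_sketch({'C': 1, 'H': 2, 'O': 1})
print(formulas)
print(masses)
```

In this sketch the empty subformula (mass 0) always sorts first. The number of candidates is the product of (count + 1) over all elements, so it grows quickly for large formulas, which is why sorting once and searching by bisection pays off.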

Step 3: Evaluate if a given ion could be formed from a plausible subformula loss

For any given fragment ion, the algorithm tries to find a plausible subformula loss that could form this ion (function check_candidates).

If such loss can be found, this ion will be given a tag ‘True’, otherwise a ‘False’ tag. This is done through function get_denoise_tag.
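
The lookup behind this step can be sketched with a binary search over the sorted mass array. This is a simplified illustration rather than the library's implementation: find_loss_candidates is a hypothetical helper, the precursor m/z of 100.0 is made up, and the chemical plausibility check that check_candidates applies to the matches is left out.

```python
import numpy as np

def find_loss_candidates(frag_mz, pmz, formulas, sorted_masses, mass_threshold):
    """Return subformulas whose mass matches the loss pmz - frag_mz within the threshold."""
    loss = pmz - frag_mz
    # Binary search for the window [loss - threshold, loss + threshold].
    lo = np.searchsorted(sorted_masses, loss - mass_threshold, side='left')
    hi = np.searchsorted(sorted_masses, loss + mass_threshold, side='right')
    return formulas[lo:hi]

formulas = ['', 'H', 'H2', 'C', 'CH', 'CH2', 'O', 'HO', 'H2O']
masses = np.array([0.0, 1.00782503, 2.01565006, 12.0, 13.00782503,
                   14.01565006, 15.99491462, 17.00273965, 18.01056468])
pmz = 100.0  # made-up precursor m/z for the illustration

# A fragment 18.0106 below the precursor is explained by a water loss ...
water_loss = find_loss_candidates(81.9894, pmz, formulas, masses, 0.005)
# ... while a fragment with no matching loss gets no candidates.
no_match = find_loss_candidates(90.5, pmz, formulas, masses, 0.005)
print(water_loss, no_match)

# A fragment would be tagged True only if at least one candidate survives.
tags = [len(c) > 0 for c in (water_loss, no_match)]
print(tags)
```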

Step 4: Retain True fragment ions and add back the precursor ions

Once the denoise tags have been created, only ions with a ‘True’ tag are kept. The returned spectrum consists of the denoised fragment ions plus the precursor ions (function add_spectra).
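
As a rough illustration of this last step, the filtering and recombination can be written with a boolean mask and a vertical stack. The arrays and tags below are made-up example values, and the library itself uses add_spectra for the recombination.

```python
import numpy as np

# Made-up fragment region, precursor region, and tags from Step 3.
frag_region = np.array([[48.9925, 154.0], [55.0000, 30.0], [79.0206, 521.0]])
precursor_region = np.array([[159.0237, 999.0]])
denoise_tag = [True, False, True]

# Keep only fragments tagged True, then append the untouched precursor region.
kept = frag_region[np.array(denoise_tag, dtype=bool)]
denoised = np.vstack([kept, precursor_region])
print(denoised)
```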

References

spectral_denoising.formula_denoising(msms, smiles, adduct, mass_tolerance=0.005)[source]

Perform formula denoising on the given mass spectrometry data. The function first regenerates the formula based on chemical rules, gets the precursor m/z statistics, and then performs formula denoising. The precursor region is not affected by the denoising process; only the fragment region is denoised.

Parameters:

msms (numpy.array): The mass spectrometry data to be denoised.

smiles (str): The SMILES string representing the molecular structure. A molecular formula is also accepted as input, but this risks false positives because the possibility of forming extra N2/H2O adducts is not incorporated.

adduct (str): The adduct type used in the mass spectrometry.

mass_tolerance (float, optional): The mass tolerance for precursor m/z calculation. Default is 0.005.

Returns:

numpy.ndarray: The denoised mass spectrometry data, or np.nan if the SMILES string or adduct is invalid or if all ions are removed.

spectral_denoising.noise.generate_chemical_noise(pmz, lamda, polarity, formula_db, n=100)[source]

Generate chemical noise for a given mass-to-charge ratio (m/z) and other parameters. The m/z values of the chemical noise are taken from a database of all true possible mass values. Details about this database can be found in the paper: LibGen: Generating High Quality Spectral Libraries of Natural Products for EAD-, UVPD-, and HCD-High Resolution Mass Spectrometers

Args:

pmz (float): The target mass-to-charge ratio (m/z) value.

lamda (float): The lambda parameter for the Poisson distribution used to generate intensities, which serves as both the mean and the variance of the distribution.

polarity (str): The polarity of the adduct, either ‘+’ or ‘-’.

formula_db (pandas.DataFrame): A DataFrame containing a column ‘mass’ with possible mass values.

n (int, optional): The number of noise peaks to generate. Default is 100.

Returns:

np.array: A synthetic spectrum with chemical noise.

Raises:

ValueError: If the polarity is not ‘+’ or ‘-’.
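
The idea of drawing noise m/z values from a mass list and Poisson-distributed intensities can be sketched as below. This is an assumption-laden illustration, not the library's code: chemical_noise_sketch is a hypothetical helper, a plain array stands in for the ‘mass’ column of formula_db, polarity handling is omitted, and restricting the noise to masses below the precursor is an assumption.

```python
import numpy as np

def chemical_noise_sketch(pmz, lamda, candidate_masses, n=100, seed=0):
    """Build an (n, 2) array of [m/z, intensity] noise peaks below the precursor."""
    rng = np.random.default_rng(seed)
    candidate_masses = np.asarray(candidate_masses, dtype=float)
    # Assumption: only masses below the precursor m/z are used as noise.
    allowed = candidate_masses[candidate_masses < pmz]
    mz = rng.choice(allowed, size=n, replace=True)
    # Poisson intensities: mean and variance are both lamda.
    intensity = rng.poisson(lam=lamda, size=n).astype(float)
    noise = np.column_stack([mz, intensity])
    # Return the peaks sorted by m/z, like a regular spectrum.
    return noise[np.argsort(noise[:, 0])]

# A made-up mass grid stands in for the real database here.
noise = chemical_noise_sketch(pmz=79.05, lamda=10,
                              candidate_masses=np.linspace(20.0, 200.0, 500), n=10)
print(noise.shape)
```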

spectral_denoising.spectral_denoising.formula_denoising(msms, smiles, adduct, mass_tolerance=0.005)[source]

Perform formula denoising on the given mass spectrometry data. The function first regenerates the formula based on chemical rules, gets the precursor m/z statistics, and then performs formula denoising. The precursor region is not affected by the denoising process; only the fragment region is denoised.

Parameters:

msms (numpy.array): The mass spectrometry data to be denoised.

smiles (str): The SMILES string representing the molecular structure. A molecular formula is also accepted as input, but this risks false positives because the possibility of forming extra N2/H2O adducts is not incorporated.

adduct (str): The adduct type used in the mass spectrometry.

mass_tolerance (float, optional): The mass tolerance for precursor m/z calculation. Default is 0.005.

Returns:

numpy.ndarray: The denoised mass spectrometry data, or np.nan if the SMILES string or adduct is invalid or if all ions are removed.

spectral_denoising.spectral_denoising.prep_formula(smiles, adduct)[source]

Prepares the molecular formula based on the given SMILES string and adduct.

Args:

smiles (str): The SMILES representation of the molecule.

adduct (str): The adduct string representing the ionization state.

Returns:

str: The calculated molecular formula, or NaN if the formula cannot be determined.

spectral_denoising.spectral_denoising.get_pmz_statistics(msms, c_pmz, mass_tolerance)[source]

Use the real precursor m/z to estimate the mass deviation in a given spectrum.

Parameters:

msms (numpy.ndarray): A 2D array where the first column contains m/z values and the second column contains intensity values.

c_pmz (float): The computed m/z value around which to search for the most intense peak.

mass_tolerance (float): The mass tolerance within which to search for the most intense peak.

Returns:

tuple: A tuple containing:
  • r_pmz (float): The actual precursor m/z. If not found (precursor is fully fragmented), the computed m/z is returned.

  • float: The deviation between computed and actual precursor m/z, scaled by 1.75 if it exceeds the initial mass tolerance.

spectral_denoising.spectral_denoising.get_all_subformulas(raw_formula)[source]

Generate all possible subformulas and their corresponding masses from a given chemical formula.

Args:

raw_formula (str): The input chemical formula, which can be in SMILES format or a standard chemical formula.

Returns:
tuple: A tuple containing:
  • all_possible_candidate_formula (list of str): A list of all possible subformulas derived from the input formula.

  • all_possible_mass (numpy.ndarray): An array of masses corresponding to each subformula.

Notes:
  • If the input formula is in SMILES format, it will be converted to a standard chemical formula.

  • The function uses the chemparse library to parse the chemical formula and itertools.product to generate all possible combinations of subformulas.

  • The resulting subformulas and their masses are sorted in ascending order of mass to speed up the search.

spectral_denoising.spectral_denoising.get_denoise_tag(frag_msms, all_possible_candidate_formula, all_possible_mass, pmz, has_benzene, mass_threshold)[source]

Determine which ions in the fragment region are chemically feasible. This function calculates the mass loss for each fragment in the MS/MS data and searches for candidate formulas within a specified mass threshold. If the has_benzene flag is set, the precursor mass (pmz) is adjusted by adding the mass of N2O to account for rare cases of forming N2/H2O adducts in the collision chamber. An ion is tagged True only if it can be associated with at least 1 chemically feasible subformula of the molecular formula.

Args:

frag_msms (numpy.array): Array of fragment MS/MS data, where each row contains the mass and intensity of a fragment.

all_possible_candidate_formula (list): List of all possible candidate formulas.

all_possible_mass (numpy.ndarray): Sorted array of all possible masses.

pmz (float): Precursor mass.

has_benzene (bool): Flag indicating if benzene is present.

mass_threshold (float): Mass threshold for searching candidate formulas.

Returns:

list: List of denoise tags for each fragment.

spectral_denoising.spectral_denoising.check_candidates(candidates)[source]

Checks a list of candidates to see if any of them meet a certain ratio condition.

Args:

candidates (list): A list of candidate formulas to be checked.

Returns:

bool: True if at least one candidate meets the ratio condition, False otherwise.

spectral_denoising.seven_golden_rules.check_ratio(formula)[source]

Checks the composition of a chemical formula using the ratio checks in the “7 golden rules” (“Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry”).

Args:

formula (str): The chemical formula to be checked.

Returns:

bool: True if the formula passes all checks, False otherwise.

np.NAN: If the formula is invalid due to non-alphanumeric characters at the end.

The function performs the following checks:
  • Checks the number of hydrogen and carbon atoms based on the accurate mass.

  • Checks the number of nitrogen and oxygen atoms.

  • Ensures that it is not a pure carbon/nitrogen loss (except N2).

  • Checks the hydrogen to carbon ratio.

  • Checks the fluorine to carbon ratio.

  • Checks the chlorine to carbon ratio.

  • Checks the bromine to carbon ratio.

  • Checks the nitrogen to carbon ratio.

  • Checks the oxygen to carbon ratio.

  • Checks the phosphorus to carbon ratio.

  • Checks the sulfur to carbon ratio.

  • Checks the silicon to carbon ratio.
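
A simplified sketch of an element-to-carbon ratio check is shown below. It covers only three of the ratios listed above and is not the library's implementation: ratio_check_sketch and parse_formula are hypothetical helpers, the upper bounds are assumed illustrative values, and letting carbon-free formulas pass is a choice made for this sketch.

```python
import re

# Assumed, illustrative upper bounds on element-to-carbon ratios.
RATIO_UPPER = {'H': 3.1, 'N': 1.3, 'O': 1.2}

def parse_formula(formula):
    """Turn 'C5H4N4O' into {'C': 5, 'H': 4, 'N': 4, 'O': 1}."""
    counts = {}
    for element, number in re.findall(r'([A-Z][a-z]?)(\d*)', formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def ratio_check_sketch(formula, upper=RATIO_UPPER):
    """Return True if every checked element-to-carbon ratio is within its bound."""
    counts = parse_formula(formula)
    carbons = counts.get('C', 0)
    if carbons == 0:
        # Ratios to carbon are undefined; this sketch lets such formulas pass.
        return True
    return all(counts.get(el, 0) / carbons <= limit for el, limit in upper.items())

print(ratio_check_sketch('C5H4N4O'))
print(ratio_check_sketch('CH8'))
```

Ratio filters like these are cheap to evaluate, so they work well as a final plausibility screen on the few candidates that survive the mass search.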

spectral_denoising.spectral_operations.slice_spectrum(msms, break_mz)[source]

Slices a mass spectrum into two parts based on a given m/z value.

Parameters:

msms (numpy.ndarray): The mass spectrum data, where each row represents a peak with m/z and intensity values. If an empty spectrum is provided, the function returns NaN.

break_mz (float): The break point at which to slice the spectrum.

Returns:
tuple: A tuple containing two numpy.ndarrays:
  • The first array contains all peaks with m/z values less than the break_mz.

  • The second array contains all peaks with m/z values greater than or equal to the break_mz.
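
The documented behavior can be mirrored by a short boolean-mask sketch. The helper name slice_spectrum_sketch and the break point of 158.0 are hypothetical, and the handling of empty spectra is left out.

```python
import numpy as np

def slice_spectrum_sketch(msms, break_mz):
    """Split peaks into (m/z < break_mz, m/z >= break_mz)."""
    msms = np.asarray(msms, dtype=float)
    below = msms[:, 0] < break_mz
    return msms[below], msms[~below]

peak = np.array([[48.9925, 154.0], [63.0061, 265.0],
                 [79.0206, 521.0], [159.0237, 999.0]])
# Made-up break point just below the precursor ion.
frag_region, precursor_region = slice_spectrum_sketch(peak, break_mz=158.0)
print(frag_region.shape, precursor_region.shape)
```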

spectral_denoising.spectral_operations.add_spectra(msms1, msms2)[source]

Add two spectra together. This function takes two spectra (msms1 and msms2) and combines them. If one of the inputs is a float and the other is not, it returns the non-float input. If both inputs are floats, it returns NaN.

Parameters:

msms1 (numpy.ndarray): The first spectrum.

msms2 (numpy.ndarray): The second spectrum.

Returns:

numpy.ndarray: The combined spectrum if both inputs are not floats, one of the inputs if the other is a float, or NaN if both inputs are floats.

Notes:
  • This function is a very naive mixing of 2 spectra. If you want to adjust the intensities, please do so before using this function.