spectral_denoising package

All API References

spectral_denoising.chem_utils module

spectral_denoising.chem_utils.calculate_precursormz(adduct_string, mol=None, testing=False)[source]

Calculate the precursor m/z (mass-to-charge ratio) for a given molecule and adduct string. Handles a wide variety of adduct strings.

Args:

adduct_string (str): The adduct string representing the ion type and charge state.

mol (str, optional): The molecular formula of the compound. Defaults to None.

testing (bool, optional): If True, use a predefined molecule mass for testing purposes. Defaults to False.

Returns:

precursor_mz (float): The calculated precursor m/z value.

Raises:

Warning: If an unrecognized adduct is encountered in the adduct string, it will be ignored and a warning will be printed.

Notes:

The function uses the Formula class to calculate the mass of the molecule and ions.
The replace_adduct_string, determine_parent_coefs, determine_adduct_charge, and parse_adduct functions are assumed to be defined elsewhere in the codebase.
The electron mass is considered in the calculation to adjust for the loss/gain of electrons.
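The arithmetic described in the notes above can be sketched as follows. This is a minimal illustration, not the package's implementation: the function name precursor_mz_sketch, its parameters, and the hard-coded masses are assumptions made for this example.

```python
# Hypothetical sketch of the precursor m/z arithmetic; not the library's code.
ELECTRON_MASS = 0.00054857990946  # mass of an electron in Da
HYDROGEN_MASS = 1.00782503207     # monoisotopic mass of a hydrogen atom in Da


def precursor_mz_sketch(neutral_mass, adduct_mass, charge, parent_coef=1):
    """Return m/z for parent_coef * M plus the net adduct mass at a given charge.

    adduct_mass is the signed sum of neutral atoms gained or lost,
    e.g. +HYDROGEN_MASS for [M+H]+ and -HYDROGEN_MASS for [M-H]-.
    """
    if charge == 0:
        raise ValueError("charge must be non-zero")
    # A positive ion has lost electrons, a negative ion has gained them.
    ion_mass = parent_coef * neutral_mass + adduct_mass - charge * ELECTRON_MASS
    return ion_mass / abs(charge)


glucose = 180.06339  # monoisotopic mass of C6H12O6
mz_pos = precursor_mz_sketch(glucose, HYDROGEN_MASS, 1)    # [M+H]+
mz_neg = precursor_mz_sketch(glucose, -HYDROGEN_MASS, -1)  # [M-H]-
```

The electron correction is small (about 0.0005 Da per charge) but matters at the mass tolerances used elsewhere in this package.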

spectral_denoising.chem_utils.desalter(input)[source]

Processes the input molecule to remove salts and return an uncharged SMILES string.

Args:

input (str or RDKit Mol): The input molecule, which can be a SMILES string or an RDKit Mol object.

Returns:

uncharged_smiles (str): The uncharged SMILES string of the largest component of the input molecule.

Notes:

If the input is not a valid molecule, the function will attempt to convert it to a SMILES string.
If the input molecule contains multiple components, the largest component will be processed.
If the largest component has a formal charge of +1, acidic hydrogens will be removed.
If the input is NaN, the function will return np.NAN.

spectral_denoising.chem_utils.determine_adduct_charge(adduct_string)[source]

Determine the charge of an adduct based on its string representation. This function processes an adduct string to determine its charge. The adduct string is first standardized using the replace_adduct_string function. The charge is then determined based on the ending character(s) of the string.

Args:

adduct_string (str): The string representation of the adduct.

Returns:

charge (int): The charge of the adduct. Returns a positive integer for positive charges, a negative integer for negative charges, and NaN if the charge cannot be determined.

Notes:

If the adduct string ends with ‘+’, the function checks if the preceding character is a digit and if the charge is enclosed in brackets. If so, it extracts the charge; otherwise, it assumes a charge of +1.
If the adduct string ends with ‘-’, the function performs a similar check for negative charges.
If the adduct string does not end with ‘+’ or ‘-’, the function returns NaN and prints a message indicating that the charge could not be determined.
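The rules in the notes above can be sketched with a regular expression. This is an illustrative approximation under stated assumptions: adduct_charge_sketch is a hypothetical name, and the standardization step performed by replace_adduct_string is not reproduced here.

```python
import math
import re


def adduct_charge_sketch(adduct_string):
    """Read the charge from the end of an adduct string (hypothetical helper)."""
    adduct_string = adduct_string.strip()
    # Bracketed form such as "[M+2H]2+": optional digits, then the sign.
    match = re.search(r"\](\d*)([+-])$", adduct_string)
    if match:
        magnitude = int(match.group(1)) if match.group(1) else 1
        return magnitude if match.group(2) == "+" else -magnitude
    # Unbracketed form such as "M+H+": assume a single charge.
    if adduct_string.endswith("+"):
        return 1
    if adduct_string.endswith("-"):
        return -1
    # No trailing sign: the charge cannot be determined.
    return math.nan
```

For example, adduct_charge_sketch("[M+2H]2+") gives 2, and a string without a trailing sign such as "M+H" gives NaN.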

spectral_denoising.chem_utils.determine_parent_coefs(adduct_string)[source]

Determine the coefficient of the adducts by ignoring the parent ion M.

Args:: adduct_string (str): The adduct string from which to determine the parent molecule coefficient.
Returns:: coefficient (int): The coefficient of the adduct (e.g. M+H, coef =1, M+2H+, coef = 2). If the adduct string does not match the expected pattern, the function prints an error message and returns numpy’s NaN.

spectral_denoising.chem_utils.get_bond_similarity(mol1, mol2)[source]

Calculate the bond similarity between two molecules. Detailed algorithm can be found in the following paper: Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification (Yuanyue Li et al., 2021)

Args:

mol1: The first molecule, which can be in various formats (e.g., SMILES string, RDKit molecule object).

mol2: The second molecule, which can be in various formats (e.g., SMILES string, RDKit molecule object).

Returns:

dict: A dictionary containing the following keys:

“mol1_bond_number”: The number of bonds in the first molecule.
“mol2_bond_number”: The number of bonds in the second molecule.
“common_bond_number”: The number of common bonds between the two molecules.
“bond_difference”: The average bond difference between the two molecules.
“bond_similarity”: The bond similarity score between the two molecules, calculated as (2 * common_bond_number) / (mol1_bond_number + mol2_bond_number).
“minimal_diff”: The minimal difference in bond numbers after removing common bonds.
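The bond_similarity formula above can be illustrated with a small sketch. It assumes each molecule has already been reduced to a list of bond labels (for example "C-C"), which is a simplification chosen for this example; how the package actually enumerates and matches bonds is not described here, and bond_similarity_sketch is a hypothetical name.

```python
from collections import Counter


def bond_similarity_sketch(bonds1, bonds2):
    """Compute (2 * common) / (n1 + n2) from two lists of bond labels."""
    count1, count2 = Counter(bonds1), Counter(bonds2)
    # Multiset intersection: a bond type is shared as often as it occurs in both.
    common = sum((count1 & count2).values())
    n1, n2 = len(bonds1), len(bonds2)
    similarity = 2 * common / (n1 + n2) if (n1 + n2) else 0.0
    return {
        "mol1_bond_number": n1,
        "mol2_bond_number": n2,
        "common_bond_number": common,
        "bond_similarity": similarity,
    }


ethanol = ["C-C", "C-O"]          # heavy-atom bonds only, for illustration
propanol = ["C-C", "C-C", "C-O"]
result = bond_similarity_sketch(ethanol, propanol)
```

Here two bonds are shared out of five in total, so the score is 2 * 2 / 5 = 0.8.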

spectral_denoising.chem_utils.parse_adduct(adduct_string)[source]

Parses an adduct string into its components. This function takes an adduct string and breaks it down into its constituent parts, including the sign, count, and ion type. The adduct string is first processed to replace certain patterns, and then a regular expression is used to capture the different parts of the adduct.

Args:: adduct_string (str): The adduct string to be parsed.
Returns:: list: A list of lists, where each sublist contains the sign (str), count (int), and ion type (str) of each part of the adduct.
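A regular-expression approach like the one described above might look as follows. This is a sketch under assumptions: parse_adduct_sketch is a hypothetical name, the pattern is a guess at a reasonable grammar, and the pre-processing done by replace_adduct_string is omitted.

```python
import re


def parse_adduct_sketch(adduct_string):
    """Split an adduct string into [sign, count, ion] parts (hypothetical helper)."""
    # Keep only the text inside the brackets, if brackets are present.
    inner = re.search(r"\[(.+)\]", adduct_string)
    body = inner.group(1) if inner else adduct_string
    # Drop the parent ion and any multiplier in front of it, e.g. "2M".
    body = re.sub(r"^\d*M", "", body)
    parts = re.findall(r"([+-])(\d*)([A-Z][A-Za-z0-9]*)", body)
    return [[sign, int(count) if count else 1, ion] for sign, count, ion in parts]
```

With this sketch, "[M+H-H2O]+" yields [['+', 1, 'H'], ['-', 1, 'H2O']].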

spectral_denoising.chem_utils.parse_formula(formula)[source]

Parses a chemical formula into its constituent elements and their quantities.

Args:: formula (str): A string representing the chemical formula (e.g., “H2O”, “C6H12O6”).
Returns:: list: A list of lists, where each inner list contains an element and its quantity (e.g., [[‘H’, 2], [‘O’, 1]] for “H2O”).
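The output shape described above can be reproduced with a short sketch. It handles only flat formulas without parentheses or charges, and parse_formula_sketch is a hypothetical stand-in rather than the package's function.

```python
import re


def parse_formula_sketch(formula):
    """Return [[element, quantity], ...] for a flat formula such as "C6H12O6"."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    # A missing count means a single atom of that element.
    return [[element, int(count) if count else 1] for element, count in pairs]
```

For "H2O" this gives [['H', 2], ['O', 1]], matching the example in the Returns description.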

spectral_denoising.chem_utils.replace_adduct_string(adduct_string)[source]

Replaces specific adduct strings with their standardized or chemical formula equivalents. This function takes an adduct string and replaces it with a standardized version or its corresponding chemical formula.

Args:: adduct_string (str): The adduct string to be replaced.
Returns:: adduct_string (str): The replaced adduct string.

spectral_denoising.chem_utils.transpose_formula(lst)[source]

Transpose a parsed formula in a nested list format from [[element, quantity], …] to [[element, …], [quantity, …]].

Args:: lst (list): A list of lists where each sublist represents a row of the matrix.
Returns:: list: A transposed version of the input list of lists, where rows are converted to columns and vice versa.
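The transposition itself is a one-liner in Python. The sketch below uses zip and is only an illustration of the described behavior; transpose_sketch is a hypothetical name.

```python
def transpose_sketch(lst):
    """Turn [[element, quantity], ...] into [[element, ...], [quantity, ...]]."""
    return [list(column) for column in zip(*lst)]


parsed = [["C", 6], ["H", 12], ["O", 6]]
elements, quantities = transpose_sketch(parsed)
```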

spectral_denoising.constant module

A collection of constants widely used in chemistry. Reference: https://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/MS-Adduct-Calculator/

spectral_denoising.denoising_search module

spectral_denoising.denoising_search.denoising_search(msms, pmz, reference_lib, identitiy_search_mass_error=0.01, mass_tolernace=0.005, pmz_col='precursor_mz', smiles_col='smiles', adduct_col='adduct', msms_col='peaks', first_n=1, need_sort=True)[source]

spectral_denoising.denoising_search.denoising_search_batch(msms_query, pmz_query, reference_lib, identitiy_search_mass_error=0.01, mass_tolernace=0.005, pmz_col='precursor_mz', smiles_col='smiles', adduct_col='adduct', msms_col='peaks', first_n='all')[source]

Perform batch denoising search on given MS/MS data and precursor m/z values with parallel processing.

Parameters:

msms_query (list): List of MS/MS spectra to be denoised.

pmz_query (list): List of precursor m/z values corresponding to the MS/MS spectra.

reference_lib (pandas.DataFrame): Reference library containing known spectra for comparison.

identitiy_search_mass_error (float, optional): Mass error tolerance for identity search. Default is 0.01.

mass_tolernace (float, optional): Maximum allowed tolerance for denoising. Note that the keyword is spelled mass_tolernace in the function signature. Default is 0.005.

pmz_col (str, optional): Column name for precursor m/z in the reference library. Default is ‘precursor_mz’.

smiles_col (str, optional): Column name for SMILES in the reference library. Default is ‘smiles’.

adduct_col (str, optional): Column name for adducts in the reference library. Default is ‘adduct’.

msms_col (str, optional): Column name for MS/MS peaks in the reference library. Default is ‘peaks’.

Returns:

pandas.DataFrame: DataFrame containing the results of the denoising search. Each index in the result DataFrame corresponds to the denoising search result of the corresponding input MS/MS spectrum.

spectral_denoising.denoising_search.get_all_master_formulas(pmz_candidates, smiles_col='smiles', adduct_col='adduct')[source]

spectral_denoising.file_io module

spectral_denoising.file_io.check_pattern(input_string)[source]

Helper function for read_df. Uses a regular expression to match pairs of floats in standard or scientific notation, separated by a tab, with each pair on a new line.

Args:: input_string (str): input string to check for the pattern
Returns:: bool: True if the pattern is found, False otherwise
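A pattern of the kind described above could be written as follows. This sketch requires the whole string to match, which may be stricter than the package's check; the names _FLOAT, _PAIR and looks_like_peak_string are hypothetical.

```python
import re

# A float in standard or scientific notation, e.g. "100.0", "-3", "1e3", "2.5E-4".
_FLOAT = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"
# Two floats separated by a tab.
_PAIR = rf"{_FLOAT}\t{_FLOAT}"
# One pair per line, one or more lines.
_PEAK_BLOCK = re.compile(rf"{_PAIR}(?:\n{_PAIR})*")


def looks_like_peak_string(input_string):
    """Return True if the string is made only of tab-separated float pairs."""
    return isinstance(input_string, str) and _PEAK_BLOCK.fullmatch(input_string) is not None
```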

spectral_denoising.file_io.export_denoising_searches(results, save_dir, top_n=10)[source]

Pair function of import_denoising_searches. Exports the results of a denoising search to a JSON file.

Args:

results (list): The list of results from a denoising search.

save_dir (str): The file path where the results should be saved. If the path does not end with ‘.json’, the extension will be appended automatically.

Returns:

None

spectral_denoising.file_io.read_df(path, keep_ms1_only=False)[source]

Pair function of save_df. Reads a CSV file into a DataFrame, selects columns based on a pattern check, and converts MS/MS spectra stored in string format back to 2-D numpy arrays (strings are used to avoid storage issues in CSV files).

Args:

path (str): The file path to the CSV file.

Returns:

pandas.DataFrame: The processed DataFrame with specific columns converted.

Raises:

FileNotFoundError: If the file at the specified path does not exist.

pd.errors.EmptyDataError: If the CSV file is empty.

pd.errors.ParserError: If the CSV file contains parsing errors.

Notes:

The function assumes that the first row of the CSV file contains the column headers.
The check_pattern function is used to determine which columns to process.
The so.str_to_arr function is used to convert the values in the selected columns.

spectral_denoising.file_io.read_msp(file_path)[source]

Reads an MSP file into a pandas DataFrame, then sorts the MS/MS spectra and removes zero-intensity ions.

Args:: file_path (str): Path to the MSP file.
Returns:: pd.DataFrame: DataFrame containing the MS/MS spectra information

spectral_denoising.file_io.save_df(df, save_path)[source]

Pair function of read_df.

Save a DataFrame containing MS/MS spectra to a CSV file, converting any columns containing 2D numpy arrays to string format.

Args:

df (pandas.DataFrame): The DataFrame to be saved.

save_path (str): The file path where the DataFrame should be saved. If the path does not end with ‘.csv’, the extension will be appended automatically.

Returns:

None

Notes:

This function identifies columns in the DataFrame that contain 2D numpy arrays with a second dimension of size 2.
These identified columns are converted to string format before saving to the CSV file.
The function uses tqdm to display a progress bar while processing the rows of the DataFrame.
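The string round trip that save_df and read_df rely on can be sketched as below. The exact text layout used by the package is not documented here; this example assumes one peak per line with a tab between m/z and intensity (the layout check_pattern looks for), uses plain lists instead of numpy arrays, and the helper names are hypothetical.

```python
def peaks_to_str(peaks):
    """Serialize [[mz, intensity], ...] as tab-separated pairs, one per line."""
    return "\n".join(f"{mz!r}\t{intensity!r}" for mz, intensity in peaks)


def str_to_peaks(text):
    """Parse the text produced by peaks_to_str back into a list of pairs."""
    return [[float(value) for value in line.split("\t")]
            for line in text.split("\n") if line]


peaks = [[100.0, 5.0], [200.5, 1000.0]]
encoded = peaks_to_str(peaks)
decoded = str_to_peaks(encoded)
```

Using repr for each float keeps the round trip lossless, which is one reason a string column is convenient for CSV storage.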

spectral_denoising.file_io.standardize_col(df)[source]

Standardizes column names in the given DataFrame based on a provided mapping. Helps when reading and processing MSP files generated by MS-DIAL.

Args:

df (pd.DataFrame): The DataFrame whose column names need to be standardized.

standard_mapping (dict): A dictionary where keys are common variations of the name and values are the standard name.

Returns:

pd.DataFrame: DataFrame with standardized column names.

spectral_denoising.file_io.write_to_msp(df, file_path, msms_col='peaks', normalize=False)[source]

Pair function of read_msp. Exports a pandas DataFrame to an MSP file.

Args:

df (pd.DataFrame): DataFrame containing spectrum information. Should have columns for ‘name’, ‘peaks’, and other metadata.

file_path (str): Destination path for the MSP file.

Returns:

None

spectral_denoising.identifier_utils module

spectral_denoising.identifier_utils.cas_to_smiles(cas)[source]

Convert a CAS (Chemical Abstracts Service) number to a SMILES (Simplified Molecular Input Line Entry System) string.

Args:: cas (str): The CAS number of the chemical compound.
Returns:: str: The SMILES string of the chemical compound if found, otherwise NaN.

spectral_denoising.identifier_utils.create_classyfire_url(smiles_string, if_np=True)[source]: Generates a URL for ClassyFire or NPClassifier based on the provided SMILES string. Just a helper function

spectral_denoising.identifier_utils.everything_to_formula(input)[source]

Converts various chemical input to a molecular formula. This function takes an input which can be in different chemical formats (e.g., SMILES, molecular formula) and converts it to a standardized molecular formula. If the input is already a molecular formula, it is returned as is. If the input is a SMILES string, it is first converted to a molecular object and then to a molecular formula.

Args:

input (str): The chemical input which can be a SMILES string, molecular formula, or other recognizable chemical format.

Returns:

formula (str): The molecular formula of the input chemical. If the input is invalid or cannot be converted, returns NaN.

spectral_denoising.identifier_utils.everything_to_image(molecule, savepath)[source]

Converts a molecular representation to an image and saves it to the specified path.

Args:: molecule (str or RDKit Mol object): The molecular representation, which can be a SMILES string, an RDKit Mol object, or any other format that can be converted to a SMILES string. savepath (str): The file path where the generated image will be saved.
Returns:: None

spectral_denoising.identifier_utils.everything_to_inchikey(input, first_block=True)[source]

Converts various chemical identifiers to an InChIKey or its first block. This function takes an input which can be an InChIKey, a molecule object, a SMILES string, a CAS number, or a chemical name, and converts it to an InChIKey. If the input is already an InChIKey, it can return either the full InChIKey or just the first block of it based on the first_block parameter.

Args:: input (str or RDKit Mol): The chemical identifier to be converted. It can be an InChIKey, a molecule object, a SMILES string, a CAS number, or a chemical name. first_block (bool, optional): If True, returns only the first block of the InChIKey. Defaults to True.
Returns:: inchikey (str): The InChIKey or its first block if first_block is True. Returns NaN if the input is invalid or cannot be converted.

spectral_denoising.identifier_utils.everything_to_mw(mol)[source]

Converts a given molecule representation to its molecular weight (MW). This function first checks if the input is a valid molecule object. If not, it attempts to convert the input to a SMILES string and then to a molecule object. Finally, it calculates and returns the exact molecular weight of the molecule.

Args:: mol: The input molecule representation. This can be a molecule object or another representation that can be converted to a SMILES string.
Returns:: float: The exact molecular weight of the molecule.
Raises:: ValueError: If the input cannot be converted to a valid molecule object.

spectral_denoising.identifier_utils.everything_to_smiles(input)[source]

Convert various chemical identifier formats to a SMILES string. This function takes an input which can be in different chemical identifier formats (SMILES, Mol, InChIKey, CAS number, or chemical name) and converts it to a SMILES string.

Args:: input (str or RDKit Mol): The chemical identifier to be converted. It can be a SMILES string, an RDKit Mol object, an InChIKey, a CAS number, or a chemical name.
Returns:: smiles (str): The corresponding SMILES string if the conversion is successful. Returns NaN if the input is NaN.

spectral_denoising.identifier_utils.get_classyfire(smiles, if_np=False)[source]

Retrieves the ClassyFire classification for a given SMILES string.

Args:

smiles (str): The SMILES string of the molecule to classify.

if_np (bool, optional): A flag indicating whether the molecule is a natural product. Defaults to False.

Returns:

dict: The JSON response from the ClassyFire API if the request is successful, otherwise numpy.NAN.

spectral_denoising.identifier_utils.inchikey_to_smiles(inchikey)[source]

Helper function that looks up the SMILES for an InChIKey using the PubChem database.

Args:: inchikey (str): The inchikey of the molecule to look up.
Returns:: str: The fetched isomeric SMILES code.

spectral_denoising.identifier_utils.is_cas_number(string)[source]

Check if a given string is a valid CAS (Chemical Abstracts Service) number. A CAS number is a unique numerical identifier assigned to every chemical substance described in the open scientific literature. It is formatted as one or more digits, followed by a hyphen, followed by two or more digits, followed by another hyphen, and ending with a single digit.

Args:: string (str): The string to be checked.
Returns:: bool: True if the string is a valid CAS number, False otherwise.

spectral_denoising.identifier_utils.is_formula(s)[source]

Check if a given string is a valid chemical formula. A valid chemical formula starts with an uppercase letter, optionally followed by a lowercase letter (for two-letter elements), and optionally followed by a number (for the count of atoms). This pattern repeats throughout the string.

Args:: s (str): The string to be checked.
Returns:: bool: True if the string is a valid chemical formula, False otherwise.

spectral_denoising.identifier_utils.is_inchikey(string)[source]

Check if a given string is a valid InChIKey using regex. An InChIKey is a 27-character string divided into three blocks by hyphens: - The first block contains 14 uppercase letters. - The second block contains 10 uppercase letters. - The third block contains a single uppercase letter or digit.

Args:: string (str): The string to be checked.
Returns:: bool: True if the string is a valid InChIKey, False otherwise.
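The three format checks described above (is_cas_number, is_formula and is_inchikey) are all stated as patterns, so they can be sketched directly as regular expressions. The expressions below follow the wording of those descriptions; they are not taken from the package source, and the function names are hypothetical.

```python
import re

# One or more digits, a hyphen, two or more digits, a hyphen, one digit.
CAS_PATTERN = re.compile(r"\d+-\d{2,}-\d")
# Repeated units of: uppercase letter, optional lowercase letter, optional count.
FORMULA_PATTERN = re.compile(r"(?:[A-Z][a-z]?\d*)+")
# 14 uppercase letters, 10 uppercase letters, then one uppercase letter or digit.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z0-9]")


def is_cas_number_sketch(string):
    return CAS_PATTERN.fullmatch(string) is not None


def is_formula_sketch(string):
    return FORMULA_PATTERN.fullmatch(string) is not None


def is_inchikey_sketch(string):
    return INCHIKEY_PATTERN.fullmatch(string) is not None
```

Note that a pattern check alone cannot separate every format: a string such as "CCO" satisfies the formula pattern as written, even though it is also a valid SMILES string.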

spectral_denoising.identifier_utils.is_mol(mol)[source]

Check if the given object is an instance of Chem.rdchem.Mol.

Args:: mol: The object to check.
Returns:: bool: True if the object is an instance of Chem.rdchem.Mol, False otherwise.

spectral_denoising.identifier_utils.is_smiles(smiles_string)[source]

Check if a given string is a valid SMILES (Simplified Molecular Input Line Entry System) representation.

Args:

smiles_string (str): The SMILES string to be validated.

Returns:

bool: True if the SMILES string is valid, False otherwise.

Example:

>>> is_smiles("CCO")
True
>>> is_smiles("invalid_smiles")
False

spectral_denoising.identifier_utils.name_to_smiles(name)[source]

Convert a chemical name to its corresponding SMILES (Simplified Molecular Input Line Entry System) representation, with Pubchem as backend.

Args:: name (str): The chemical name to be converted.
Returns:: str: The SMILES representation of the chemical if found, otherwise numpy.nan.

spectral_denoising.identifier_utils.smiles_to_inchikey(smiles)[source]

Helper function that converts a SMILES string to the first block of its InChIKey.

Args:: smiles (str): A SMILES string representing a molecule.
Returns:: inchikey (str): The InChIKey of the molecule, first block only.

spectral_denoising.noise module

spectral_denoising.noise.add_noise(msms, noise)[source]

Add noise to a mass spectrum and process the resulting spectrum. This function takes a mass spectrum and a noise spectrum, standardizes the mass spectrum, adds the noise to it, normalizes the resulting spectrum, and sorts it.

Args:

msms (np.ndarray): The mass spectrum to which noise will be added.

noise (np.ndarray): The noise spectrum to be added to the mass spectrum.

Returns:

np.ndarray: The processed mass spectrum after adding noise, normalization, and sorting.

Notes:

The noise spectrum is generated with intensity as a relative measure (from 0 to 1).
For this reason, the mass spectrum is standardized with the standardize_spectrum function before the noise is added.

spectral_denoising.noise.generate_chemical_noise(pmz, lamda, polarity, formula_db, n=100)[source]

Generate chemical noise for a given mass-to-charge ratio (m/z) and other parameters. The m/z of the chemical noise is taken from a database of all true possible mass values. Details about this database can be found in the paper: LibGen: Generating High Quality Spectral Libraries of Natural Products for EAD-, UVPD-, and HCD-High Resolution Mass Spectrometers

Args:

pmz (float): The target mass-to-charge ratio (m/z) value.

lamda (float): The lambda parameter for the Poisson distribution used to generate intensities, which serves as both the mean and the variance of the distribution.

polarity (str): The polarity of the adduct, either ‘+’ or ‘-’.

formula_db (pandas.DataFrame): A DataFrame containing a column ‘mass’ with possible mass values.

n (int, optional): The number of noise peaks to generate. Default is 100.

Returns:

np.array: A synthetic spectrum with chemical noise.

Raises:

ValueError: If the polarity is not ‘+’ or ‘-’.

spectral_denoising.noise.generate_noise(pmz, lamda, n=100)[source]

Generate synthetic electronic noise for spectral data.

Parameters:

pmz (float): The upper bound for the mass range.

lamda (float): The lambda parameter for the Poisson distribution, which serves as both the mean and the variance of the distribution.

n (int, optional): The number of random noise ions to generate. Defaults to 100.

Returns:

np.array: A synthetic spectrum with electronic noise.
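The idea of a synthetic noise spectrum can be sketched with the standard library alone. Several details here are assumptions made for the example: m/z values are drawn uniformly between 0 and pmz, intensities come from a Poisson sampler using Knuth's multiplication method, and the result is a sorted list of pairs rather than a numpy array. The names poisson_sample and generate_noise_sketch are hypothetical.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) value with Knuth's method (fine for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def generate_noise_sketch(pmz, lamda, n=100, seed=None):
    """Return n random [mz, intensity] pairs with mz below pmz, sorted by mz."""
    rng = random.Random(seed)
    peaks = [[rng.uniform(0.0, pmz), float(poisson_sample(lamda, rng))]
             for _ in range(n)]
    peaks.sort()
    return peaks


noise = generate_noise_sketch(pmz=300.0, lamda=10.0, n=100, seed=0)
```

Passing a seed makes the synthetic spectrum reproducible, which is convenient when benchmarking a denoising method against known noise.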

spectral_denoising.search_utils module

spectral_denoising.search_utils.quick_search_sorted(data_raw, column_name, value_start, value_end)[source]

Perform a quick search on a sorted column of a DataFrame to find rows within a specified range.

Parameters:: data_raw (pd.DataFrame): The input DataFrame containing the data to search. column_name (str): The name of the column to search within. value_start (float): The starting value of the range. value_end (float): The ending value of the range.
Returns:: pd.DataFrame: A DataFrame containing the rows where the values in the specified column fall within the given range.
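Searching a sorted column for a range is a binary-search problem. The sketch below shows the idea on a plain list of dictionaries using the bisect module; the package operates on a pandas DataFrame, and quick_search_sorted_sketch is a hypothetical name. Both range ends are treated as inclusive here, which is an assumption.

```python
from bisect import bisect_left, bisect_right


def quick_search_sorted_sketch(rows, column_name, value_start, value_end):
    """Return the rows whose column value lies in [value_start, value_end].

    rows must already be sorted in ascending order of rows[i][column_name].
    """
    values = [row[column_name] for row in rows]
    first = bisect_left(values, value_start)   # first index with value >= start
    last = bisect_right(values, value_end)     # first index with value > end
    return rows[first:last]


library = [
    {"name": "a", "precursor_mz": 100.00},
    {"name": "b", "precursor_mz": 181.07},
    {"name": "c", "precursor_mz": 181.08},
    {"name": "d", "precursor_mz": 250.10},
]
hits = quick_search_sorted_sketch(library, "precursor_mz", 181.06, 181.08)
```

Two binary searches locate the slice boundaries in O(log n), instead of comparing every row against the range.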

spectral_denoising.search_utils.quick_search_values(data_raw, column_name, value_start, value_end)[source]

Perform a quick search on a DataFrame to find rows where the values in a specified column fall within a given range. Basically sorting the data first followed by quick_search_sorted.

Args:: data_raw (pd.DataFrame): The raw DataFrame to search. column_name (str): The name of the column to search within. value_start (numeric): The starting value of the range. value_end (numeric): The ending value of the range.
Returns:: pd.DataFrame: A DataFrame containing rows where the values in the specified column are within the range [value_start, value_end].

spectral_denoising.search_utils.string_search(data, column_name, item, reset_index=True, reverse=False)[source]

spectral_denoising.seven_golden_rules module

spectral_denoising.seven_golden_rules.check_huristic(formula)[source]

spectral_denoising.seven_golden_rules.check_ratio(formula)[source]

Checks the composition of chemical formula using ratio checks in the “7 golden rules” (‘Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry’).

Args:

formula (str): The chemical formula to be checked.

Returns:

bool: True if the formula passes all checks, False otherwise.

np.NAN: If the formula is invalid due to non-alphanumeric characters at the end.

The function performs the following checks:

Checks the number of hydrogen and carbon atoms based on the accurate mass.
Checks the number of nitrogen and oxygen atoms.
Ensures that it is not a pure carbon/nitrogen loss (except N2).
Checks the hydrogen to carbon ratio.
Checks the fluorine to carbon ratio.
Checks the chlorine to carbon ratio.
Checks the bromine to carbon ratio.
Checks the nitrogen to carbon ratio.
Checks the oxygen to carbon ratio.
Checks the phosphorus to carbon ratio.
Checks the sulfur to carbon ratio.
Checks the silicon to carbon ratio.
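The element-to-carbon ratio checks in the list above share one shape, which the following sketch illustrates for three elements. The upper limits in RATIO_UPPER_LIMITS are illustrative values assumed for this example; the thresholds the package actually applies are not stated in this reference. The names element_counts and passes_ratio_checks are hypothetical.

```python
import re

# Illustrative upper limits for element/carbon ratios (assumed, not the package's).
RATIO_UPPER_LIMITS = {"H": 3.1, "N": 1.3, "O": 1.2}


def element_counts(formula):
    """Count atoms per element in a flat formula such as "C6H12O6"."""
    counts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(count) if count else 1)
    return counts


def passes_ratio_checks(formula, limits=RATIO_UPPER_LIMITS):
    """Return True if every element/carbon ratio is within its upper limit."""
    counts = element_counts(formula)
    carbon = counts.get("C", 0)
    if carbon == 0:
        # Ratios to carbon are undefined; this sketch simply skips the checks.
        return True
    return all(counts.get(element, 0) / carbon <= limit
               for element, limit in limits.items())
```

For example, C6H12O6 has H/C = 2 and O/C = 1 and passes, while a formula such as CH10 has H/C = 10 and fails.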

spectral_denoising.seven_golden_rules.check_senior(formula)[source]

spectral_denoising.spectra_plotter module

spectral_denoising.spectra_plotter.get_color_gradient(c1, c2, n)[source]: Given two hex colors, returns a color gradient with n colors.
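A linear gradient between two hex colors can be sketched as below. The interpolation scheme and the choice to return hex strings are assumptions made for this example; hex_to_rgb and color_gradient_sketch are hypothetical names rather than the package's functions.

```python
def hex_to_rgb(hex_str):
    """Convert "#RRGGBB" to [R, G, B] integers, e.g. "#FFFFFF" -> [255, 255, 255]."""
    return [int(hex_str[i:i + 2], 16) for i in (1, 3, 5)]


def color_gradient_sketch(c1, c2, n):
    """Return n hex colors evenly spaced from c1 to c2 (both ends included)."""
    start, end = hex_to_rgb(c1), hex_to_rgb(c2)
    colors = []
    for step in range(n):
        fraction = step / (n - 1) if n > 1 else 0.0
        rgb = [round(a + (b - a) * fraction) for a, b in zip(start, end)]
        colors.append("#{:02x}{:02x}{:02x}".format(*rgb))
    return colors


gradient = color_gradient_sketch("#000000", "#FFFFFF", 3)
```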

spectral_denoising.spectra_plotter.head_to_tail_plot(msms1, msms2, pmz=None, mz_start=None, mz_end=None, pmz2=None, ms2_error=0.02, title=None, color1=None, color2=None, savepath=None, show=True, publication=False, fontsize=12)[source]

Plots a head-to-tail comparison of two MS/MS spectra.

Parameters:

msms1 (np.array): First mass spectrum data in 2D np.array format, e.g. np.array([[mz1, intensity1], [mz2, intensity2], …]).

msms2 (np.array): Second mass spectrum data. Same format as msms1.

pmz (float or str, optional): Precursor m/z value for the first spectrum. Default is None. If given, precursors will be removed from both spectra and the precursor will be shown as a grey dashed line in the plot.

mz_start (float, optional): Start of the m/z range for plotting, used to zoom in. Default is None.

mz_end (float, optional): End of the m/z range for plotting, used to zoom in. Default is None.

pmz2 (float or str, optional): Precursor m/z value for the second spectrum, for cases where the two spectra have different precursors. Default is None.

ms2_error (float, optional): Error tolerance for m/z values. Default is 0.02.

color1 (str, optional): Color for the first spectrum’s peaks. Default is None.

color2 (str, optional): Color for the second spectrum’s peaks. Default is None.

savepath (str, optional): Path to save the plot image. Default is None.

show (bool, optional): If True, displays the plot. Default is True. Set it to False to save the plot without displaying it.

publication (bool, optional): If True, formats the plot for publication (3 × 2.5 inches, for a single-column figure). Default is False.

fontsize (int, optional): Font size for plot labels. Default is 12.

Returns:

matplotlib.pyplot or None: The plot object if show is True, otherwise None.

spectral_denoising.spectra_plotter.hex_to_RGB(hex_str)[source]: #FFFFFF -> [255,255,255]

spectral_denoising.spectra_plotter.ms2_plot(msms_1, pmz=None, lower=None, upper=None, savepath=None, color='blue')[source]

Plots a single MS/MS spectrum.

Parameters:

msms_1 (numpy.ndarray): MS/MS (or MS1) spectrum in 2D np.array format, e.g. np.array([[mz1, intensity1], [mz2, intensity2], …]).

pmz (float, optional): Precursor m/z value. If provided, the precursor will be removed from the spectrum. Default is None.

lower (float, optional): Lower bound for m/z values to be plotted. Default is None.

upper (float, optional): Upper bound for m/z values to be plotted. Default is None.

savepath (str, optional): Path to save the plot image. If None, the plot will not be saved. Default is None.

color (str, optional): Color of the spectrum lines. Default is ‘blue’.

Returns:

matplotlib.pyplot: The plot object.

spectral_denoising.spectra_plotter.wrap_labels(ax, width, break_long_words=False)[source]

spectral_denoising.spectral_denoising module

spectral_denoising.spectral_denoising.check_candidates(candidates)[source]

Checks a list of candidates to see if any of them meet a certain ratio condition.

Args:: candidates (list): A list of candidate formulas to be checked.
Returns:: bool: True if at least one candidate meets the ratio condition, False otherwise.

spectral_denoising.spectral_denoising.dict_to_formula(candidate, element_dict)[source]

Helper function, to get the chemical formula from a candidate list and element dictionary.

Args:: candidate (list of int): A list where each index corresponds to an element in element_dict and the value at each index represents the count of that element. element_dict (list of str): A list of element symbols where the index corresponds to the element’s position in the candidate list.
Returns:: str: A string representing the chemical formula, where each element symbol is followed by its count if greater than 1.
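The conversion described above can be sketched in a few lines. Skipping elements whose count is zero is an assumption made for this example, and dict_to_formula_sketch is a hypothetical name.

```python
def dict_to_formula_sketch(candidate, element_dict):
    """Build a formula string from parallel lists of counts and element symbols."""
    parts = []
    for symbol, count in zip(element_dict, candidate):
        if count <= 0:
            continue  # assumed: absent elements are left out of the string
        # The count is written only when it is greater than 1.
        parts.append(symbol + (str(count) if count > 1 else ""))
    return "".join(parts)
```

For example, counts [1, 4, 0] with elements ['C', 'H', 'O'] give "CH4".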

spectral_denoising.spectral_denoising.dnl_denoising(msms)[source]

Perform Dynamic noise level estimation denoising on given msms spectra. Details about the algorithm can be found in the paper: A Dynamic Noise Level Algorithm for Spectral Screening of Peptide MS/MS Spectra.

Parameters:

msms (numpy.ndarray): A 2D numpy array with shape (n, 2), where n is the number of peaks. For each peak, the first item is m/z and the second item is intensity.

Returns:

numpy.ndarray: A 2D numpy array containing the denoised mass spectrometry data, sorted and packed. If the input data has only two points and does not meet the criteria, returns NaN.

Notes:

The function assumes that the input data is a numpy array with two columns.
The function uses a linear regression model to predict the signal region.

spectral_denoising.spectral_denoising.electronic_denoising(msms)[source]

Perform electronic denoising on a given mass spectrometry (MS/MS) spectrum. This function processes the input MS/MS spectrum by sorting the peaks based on their intensity, and then iteratively selects and confirms peaks based on a specified intensity threshold. The confirmed peaks are then packed and sorted before being returned.

Parameters:: msms (np.ndarray): The first item is always m/z and the second item is intensity.
Returns:: np.ndarray: The cleaned spectrum with electronic noise removed. If no ions remain, returns np.nan.

spectral_denoising.spectral_denoising.formula_denoising(msms, smiles, adduct, mass_tolerance=0.005)[source]

Perform formula denoising on the given mass spectrometry data. The function first regenerates the formula based on chemical rules, gets the statistics of the precursor m/z, and then performs formula denoising. The precursor region is not affected by the denoising process; only the fragment region is denoised.

Parameters:

msms (numpy.array): The mass spectrometry data to be denoised.

smiles (str): The SMILES string representing the molecular structure. A molecular formula is also accepted, but this risks false positives because the possibility of forming extra N2/H2O adducts is not incorporated.

adduct (str): The adduct type used in the mass spectrometry.

mass_tolerance (float, optional): The mass tolerance for precursor m/z calculation. Default is 0.005.

Returns:

numpy.ndarray: The denoised mass spectrometry data, or np.nan if the SMILES string or adduct is invalid or all ions are removed.

spectral_denoising.spectral_denoising.get_all_subformulas(raw_formula)[source]

Generate all possible subformulas and their corresponding masses from a given chemical formula.

Args:

raw_formula (str): The input chemical formula, which can be in SMILES format or a standard chemical formula.

Returns:

tuple: A tuple containing:

all_possible_candidate_formula (list of str): A list of all possible subformulas derived from the input formula.
all_possible_mass (numpy.ndarray): An array of masses corresponding to each subformula.

Notes:

If the input formula is in SMILES format, it will be converted to a standard chemical formula.
The function uses the chemparse library to parse the chemical formula and itertools.product to generate all possible combinations of subformulas.
The resulting subformulas and their masses are sorted in ascending order of mass for enhancing search process.
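The enumeration described in the notes above can be sketched with itertools.product. To stay self-contained, this example parses the formula with a regular expression instead of chemparse, supports only the four elements in its small mass table, assumes each element appears once in the input, and returns plain lists instead of a numpy array. The name subformulas_sketch is hypothetical.

```python
import itertools
import re

# Monoisotopic masses in Da for the elements supported by this sketch.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
}


def subformulas_sketch(formula):
    """Return (formulas, masses) for every subformula, sorted by ascending mass."""
    parsed = [(element, int(count) if count else 1)
              for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula)]
    elements = [element for element, _ in parsed]
    # Each element may appear 0..count times in a subformula.
    ranges = [range(count + 1) for _, count in parsed]
    candidates = []
    for combo in itertools.product(*ranges):
        mass = sum(n * MONOISOTOPIC_MASS[el] for el, n in zip(elements, combo))
        name = "".join(el + (str(n) if n > 1 else "")
                       for el, n in zip(elements, combo) if n > 0)
        candidates.append((mass, name))
    candidates.sort()
    return [name for _, name in candidates], [mass for mass, _ in candidates]


formulas, masses = subformulas_sketch("H2O")
```

The number of candidates is the product of (count + 1) over all elements, so it grows quickly with formula size; sorting by mass is what allows a binary search over the candidates afterwards.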

spectral_denoising.spectral_denoising.get_denoise_tag(frag_msms, all_possible_candidate_formula, all_possible_mass, pmz, has_benzene, mass_threshold)[source]

Determine which ions in the fragment region are chemically feasible. This function calculates the mass loss for each fragment in the MS/MS data and searches for candidate formulas within a specified mass threshold. If the has_benzene flag is set, the precursor mass (pmz) is adjusted by adding the mass of the N2O isotope to account for rare cases of forming N2/H2O adducts in the collision chamber. An ion is tagged True only if it can be associated with at least one chemically feasible subformula of the molecular formula.

Args:: frag_msms (numpy.array): Array of fragment MS/MS data, where each tuple contains the mass and intensity of a fragment. all_possible_candidate_formula (list): List of all possible candidate formulas. all_possible_mass (numpy.ndarray): Sorted array of all possible masses. pmz (float): Precursor mass. has_benzene (bool): Flag indicating if benzene is present. mass_threshold (float): Mass threshold for searching candidate formulas.
Returns:: list: List of denoise tags for each fragment.
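The search step can be illustrated with a binary search over the sorted mass array. The sketch below is an assumption about how such a lookup might be written, not the package's code: it takes only the fragment m/z values, ignores the has_benzene adjustment, and tags a fragment True when its mass loss (pmz minus fragment m/z) lies within mass_threshold of any candidate mass. The name denoise_tags_sketch is hypothetical.

```python
import numpy as np


def denoise_tags_sketch(frag_mz, all_possible_mass, pmz, mass_threshold):
    """Tag each fragment True if its mass loss matches a candidate mass."""
    all_possible_mass = np.asarray(all_possible_mass, dtype=float)  # must be sorted
    tags = []
    for mz in frag_mz:
        loss = pmz - mz
        # Index range of candidate masses inside [loss - thr, loss + thr].
        lo = np.searchsorted(all_possible_mass, loss - mass_threshold, side="left")
        hi = np.searchsorted(all_possible_mass, loss + mass_threshold, side="right")
        tags.append(bool(hi > lo))
    return tags


candidate_masses = [1.0078, 18.0106, 43.9898]
tags = denoise_tags_sketch([81.9894, 70.0], candidate_masses, pmz=100.0,
                           mass_threshold=0.005)
```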

spectral_denoising.spectral_denoising.get_pmz_statistics(msms, c_pmz, mass_tolerance)[source]

Use the real precursor m/z to estimate the mass deviation in a given spectrum.

Parameters:: msms (numpy.ndarray): A 2D array where the first row contains m/z values and the second row contains intensity values. c_pmz (float): The computed m/z value around which to search for the most intense peak. mass_tolerance (float): The mass tolerance within which to search for the most intense peak.
Returns:: tuple: A tuple containing:

r_pmz (float): The actual precursor m/z. If not found (precursor is fully fragmented), the computed m/z is returned.
float: The deviation between computed and actual precursor m/z, scaled by 1.75 if it exceeds the initial mass tolerance.

spectral_denoising.spectral_denoising.has_benzene(molecule)[source]

Check if the given molecule contains a benzene ring.

Args:

molecule (Union[Chem.Mol, str]): The molecule to check. It can be an RDKit molecule object or a SMILES string.

Returns:

bool: True if the molecule contains a benzene ring, False otherwise.

spectral_denoising.spectral_denoising.ms_reduce(msms, reduce_factor=90)[source]

Reimplementation of the MS-REDUCE algorithm. Details about this algorithm can be found in: MS-REDUCE: an ultrafast technique for reduction of big mass spectrometry data for high-throughput processing

Parameters:: msms (numpy.ndarray): A 2D numpy array with shape (2, n) where n is the number of data points. For each instance, first item is pmz and second item is intensity. reduce_factor (int, optional): The percentage by which to reduce the number of peaks. Default is 90.
Returns:: numpy.ndarray: The reduced MS/MS spectrum as a 2D numpy array, sorted and packed.

spectral_denoising.spectral_denoising.prep_formula(smiles, adduct)[source]

Prepares the molecular formula based on the given SMILES string and adduct.

Args:: smiles (str): The SMILES representation of the molecule. adduct (str): The adduct string representing the ionization state.
Returns:: str: The calculated molecular formula, or NaN if the formula cannot be determined.

spectral_denoising.spectral_denoising.spectral_denoising(msms, smiles, adduct, mass_tolerance=0.005)[source]

Perform spectral denoising on the given mass spectrometry data. The function first performs electronic denoising, followed by formula denoising.

Parameters:

msms (numpy.array): The mass spectrometry data to be denoised. smiles (str): The SMILES representation of the molecule. adduct (str): The adduct type. mass_tolerance (float, optional): The mass tolerance for the denoising process. Default is 0.005.

Returns:

numpy.array: The denoised mass spectrometry data. Returns NaN if the input is invalid or if the denoising process fails.

Notes:

The function first checks if any of the inputs are of type np.nan, which is considered invalid.
It then performs electronic denoising on the msms data.
If electronic denoising results in an empty spectrum (all ions removed), it will return np.nan.
If successful, it proceeds to formula denoising using the electronic denoised data, smiles, adduct, and mass_tolerance.

spectral_denoising.spectral_denoising.spectral_denoising_batch(msms_query, smiles_query, adduct_query, mass_tolerance=0.005)[source]

Perform batch spectral denoising on multiple sets of MS/MS spectra, SMILES strings, and adducts. Uses multiprocessing to parallelize the denoising process.

Parameters:

msms_query (list): A list of MS/MS spectra data.

smiles_query (list): A list of SMILES strings corresponding to the MS/MS spectra.

adduct_query (list): A list of adducts corresponding to the MS/MS spectra.

mass_tolerance (float, optional): The allowed deviation for the denoising process. Default is 0.005.

Returns:

list: A list of denoised MS/MS spectra from the spectral denoising process.

Notes:

The lengths of msms_query, smiles_query, and adduct_query must be the same. If not, the function will print an error message and return an empty tuple.
The function uses multiprocessing to parallelize the denoising process, utilizing 6 processes.

spectral_denoising.spectral_denoising.spectral_denoising_with_master_formulas(msms, master_formula, benzene_tag, query_pmz, mass_tolerance=0.005)[source]

spectral_denoising.spectral_denoising.threshold_denoising(msms, threshold=1)[source]

The most widely used and simplest denoising algorithm, which discards all peaks below a predefined threshold. This function filters out peaks in the mass spectrometry spectrum whose intensity is below a specified threshold percentage of the maximum intensity.

Parameters:: msms (numpy.ndarray): A 2D numpy array with shape (2, n) where n is the number of data points. For each instance, first item is pmz and second item is intensity. threshold (float, optional): The threshold percentage (0-100) of the maximum intensity below which peaks will be removed. Default is 1.
Returns:: numpy.ndarray: denoised spectrum as a 2D numpy array, sorted and packed.
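A minimal sketch of this thresholding rule is shown below. It assumes the spectrum is stored as rows of [m/z, intensity] (the layout described for pack_spectrum), which may differ from the shape stated above, and the function name threshold_denoise_sketch is hypothetical.

```python
import numpy as np


def threshold_denoise_sketch(msms, threshold=1.0):
    """Keep peaks whose intensity is at least `threshold` percent of the base peak."""
    msms = np.asarray(msms, dtype=float)
    cutoff = msms[:, 1].max() * threshold / 100.0
    kept = msms[msms[:, 1] >= cutoff]
    # Return the surviving peaks sorted by m/z.
    return kept[np.argsort(kept[:, 0])]


spectrum = np.array([[100.0, 1000.0], [50.0, 5.0], [75.0, 10.0]])
denoised = threshold_denoise_sketch(spectrum, threshold=1)
```

With a base peak of 1000 and a 1% threshold, the cutoff is 10, so the peak with intensity 5 is dropped and the other two are kept.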

spectral_denoising.spectral_operations module

spectral_denoising.spectral_operations.add_spectra(msms1, msms2)[source]

Add two spectra together. This function takes two spectra (msms1 and msms2) and combines them. If one of the inputs is a float and the other is not, it returns the non-float input. If both inputs are floats, it returns NaN.

Parameters:

msms1 (numpy.ndarray): The first spectrum. msms2 (numpy.ndarray): The second spectrum.

Returns:

numpy.ndarray: The combined spectrum if both inputs are not floats, one of the inputs if the other is a float, or NaN if both inputs are floats.

Notes:

This function is a very naive mixing of two spectra. If you wish to adjust or rescale the intensities, do so before using this function.
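The float handling described above (NaN placeholders are floats) can be sketched as follows. This is an illustration under assumptions: spectra are rows of [m/z, intensity], and "combining" is taken to mean concatenating the peaks and sorting by m/z, without merging peaks that share an m/z. The package may combine peaks differently. The name add_spectra_sketch is hypothetical.

```python
import numpy as np


def add_spectra_sketch(msms1, msms2):
    """Naively combine two spectra, treating a float (e.g. np.nan) as 'missing'."""
    missing1 = isinstance(msms1, float)
    missing2 = isinstance(msms2, float)
    if missing1 and missing2:
        return np.nan
    if missing1:
        return msms2
    if missing2:
        return msms1
    combined = np.vstack([msms1, msms2])
    return combined[np.argsort(combined[:, 0], kind="stable")]


a = np.array([[100.0, 10.0], [200.0, 20.0]])
b = np.array([[150.0, 5.0]])
mixed = add_spectra_sketch(a, b)
```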

spectral_denoising.spectral_operations.arr_to_str(msms)[source]: helper function for read_df and save_df

spectral_denoising.spectral_operations.break_spectrum(spectra)[source]

Breaks down a given spectrum into its mass and intensity components. Not often used.

Parameters:: spectra (numpy.ndarray): The input spectrum data. If a np.nan is provided, it returns two empty lists.
Returns:: numpy.ndarray: An MS/MS spectrum formatted as a 2D array in which each item has the form [pmz, intensity].

spectral_denoising.spectral_operations.compare_spectra(msms1, msms2)[source]

Compare two mass spectra and return the part of the second spectrum that does not overlap with the first. Just a helper function, not actually in use.

Args:

msms1 (numpy.ndarray): The first mass spectrum to compare. msms2 (numpy.ndarray): The second mass spectrum to compare.

Returns:

numpy.ndarray: A packed spectrum of mass and intensity values from msms2 that do not overlap with msms1.

spectral_denoising.spectral_operations.entropy_similairty(msms1, msms2, pmz=None, ms2_error=0.02)[source]

Calculate the entropy similarity between two mass spectrometry spectra.

Parameters:: msms1 (numpy.ndarray): The first mass spectrometry spectrum. If a float is provided, NaN is returned. msms2 (numpy.ndarray): The second mass spectrometry spectrum. If a float is provided, NaN is returned. pmz (float, optional): The precursor m/z value. If provided, precursors in both spectra will be removed. ms2_error (float, optional): The tolerance for matching peaks in the spectra. Default is 0.02.
Returns:: float: The entropy similarity between the two spectra. Returns NaN if either input spectrum is invalid.
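For orientation, the unweighted form of entropy similarity is 1 - (2*S_AB - S_A - S_B) / ln(4), where S_A and S_B are the Shannon entropies of the two normalized spectra and S_AB is the entropy of their 1:1 mixture. The sketch below implements that formula under several assumptions: spectra are rows of [m/z, intensity], peaks are matched greedily to the nearest unused peak within ms2_error, and no precursor removal or intensity weighting is applied. The package's function may clean and weight spectra differently. The names used here are hypothetical.

```python
import numpy as np


def _entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))


def entropy_similarity_sketch(msms1, msms2, ms2_error=0.02):
    """Unweighted entropy similarity between two non-empty spectra."""
    if isinstance(msms1, float) or isinstance(msms2, float):
        return np.nan
    a = np.asarray(msms1, dtype=float)
    b = np.asarray(msms2, dtype=float)
    pa = a[:, 1] / a[:, 1].sum()
    pb = b[:, 1] / b[:, 1].sum()
    used = np.zeros(len(b), dtype=bool)
    mixed = []
    for mz, p in zip(a[:, 0], pa):
        # Find the nearest peak in b that has not been matched yet.
        diff = np.abs(b[:, 0] - mz)
        diff[used] = np.inf
        j = int(np.argmin(diff))
        if diff[j] <= ms2_error:
            used[j] = True
            mixed.append(p + pb[j])  # matched peaks are merged
        else:
            mixed.append(p)
    mixed.extend(pb[~used])  # peaks found only in the second spectrum
    pab = np.asarray(mixed) / 2.0  # 1:1 mixture sums to 1
    return 1.0 - (2.0 * _entropy(pab) - _entropy(pa) - _entropy(pb)) / np.log(4)


spec_a = np.array([[100.0, 1.0], [200.0, 3.0]])
spec_b = np.array([[150.0, 2.0], [250.0, 2.0]])
```

Identical spectra give a similarity of 1, and spectra with no matching peaks give 0, which bounds the score between those two cases.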

spectral_denoising.spectral_operations.msdial_to_array(msms)[source]

spectral_denoising.spectral_operations.normalize_spectrum(msms)[source]

Normalize the intensity values of a given mass spectrum. This function takes a mass spectrum (msms) as input, transposes it, and normalizes the intensity values (second row) by dividing each intensity by the sum of all intensities. The normalized spectrum is then transposed back to its original form and returned.

Parameters:: msms (numpy.ndarray): A 2D numpy array where the first row contains mass-to-charge ratios (m/z) and the second row contains intensity values.
Returns:: numpy.ndarray: A 2D numpy array with the same shape as the input, where the intensity values have been normalized.

spectral_denoising.spectral_operations.normalized_entropy(msms)[source]

spectral_denoising.spectral_operations.pack_spectrum(mass, intensity)[source]

Inverse of break_spectrum. Packs mass and intensity arrays into a single 2D array, which is standardized MS/MS spectrum data format in this project. This function takes two arrays, mass and intensity, and combines them into a single 2D array where each row corresponds to a pair of mass and intensity values. If either of the input arrays is empty, the function returns NaN.

Parameters:: mass (numpy.ndarray): An array of mass values. intensity (numpy.ndarray): An array of intensity values.
Returns:: numpy.ndarray: A 2D array with mass and intensity pairs if both input arrays are non-empty, otherwise NaN.
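The round trip between pack_spectrum and break_spectrum can be sketched as below. This is an illustrative reimplementation under the stated behavior (NaN for empty input, two empty lists when breaking a NaN), with hypothetical function names. It is not the package's source.

```python
import numpy as np


def pack_spectrum_sketch(mass, intensity):
    """Stack mass and intensity into rows of [mass, intensity]; NaN if empty."""
    mass = np.asarray(mass, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    if mass.size == 0 or intensity.size == 0:
        return np.nan
    return np.column_stack([mass, intensity])


def break_spectrum_sketch(msms):
    """Split a packed spectrum back into mass and intensity arrays."""
    if isinstance(msms, float):  # np.nan marks a missing spectrum
        return [], []
    msms = np.asarray(msms, dtype=float)
    return msms[:, 0], msms[:, 1]


packed = pack_spectrum_sketch([100.0, 200.0], [10.0, 20.0])
mass, intensity = break_spectrum_sketch(packed)
```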

spectral_denoising.spectral_operations.remove_precursor(msms, pmz=None)[source]

Removes the precursor ion from the given mass spectrometry/mass spectrometry (MS/MS) spectrum.

Parameters:: msms (numpy.ndarray): A 2D numpy array. pmz (float, optional): The precursor m/z value. If not provided, function will try to guess from the spectrum.
Returns:: numpy.ndarray: The truncated MS/MS spectrum with the precursor ion removed.

spectral_denoising.spectral_operations.remove_zero_ions(msms)[source]

Remove zero intensity ions from a mass spectrometry dataset.

Parameters:: msms (numpy.ndarray or float): MS/MS spectrum in 2D numpy array.
Returns:: numpy.ndarray: A filtered 2D numpy array with rows where the second column (ion intensities) is greater than zero, or np.nan if the input is an empty spectrum.

spectral_denoising.spectral_operations.sanitize_spectrum(msms)[source]

Sanitize the given mass spectrum. This function performs the following operations on the input mass spectrum: 1. If the input is a nan, it returns NaN. 2. Sorts the spectrum using the sort_spectrum function. 3. Removes zero intensity ions using the remove_zero_ions function.

Parameters:: msms (numpy.ndarray): The mass spectrum to be sanitized.
Returns:: numpy.ndarray: The sanitized mass spectrum. If the input is a nan, returns nan.
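The three steps listed above can be sketched in a few lines of NumPy. The example assumes rows of [m/z, intensity] and inlines the sorting and zero-removal steps rather than calling sort_spectrum and remove_zero_ions. The name sanitize_sketch is hypothetical.

```python
import numpy as np


def sanitize_sketch(msms):
    """Sort by m/z, drop zero-intensity peaks, and return NaN if nothing is left."""
    if isinstance(msms, float):  # np.nan marks a missing spectrum
        return np.nan
    msms = np.asarray(msms, dtype=float)
    msms = msms[np.argsort(msms[:, 0])]  # step 2: sort by m/z
    msms = msms[msms[:, 1] > 0]          # step 3: remove zero-intensity ions
    return msms if len(msms) else np.nan


raw = np.array([[200.0, 5.0], [100.0, 0.0], [150.0, 2.0]])
clean = sanitize_sketch(raw)
```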

spectral_denoising.spectral_operations.search_ions(msms, mz, span=3)[source]

Search for ions within a specified mass-to-charge ratio (m/z) range in a given mass spectrum.

Parameters:: msms (numpy.ndarray): The mass spectrum data. mz (float): The target mass-to-charge ratio to search for. span (float, optional): The range around the target m/z to search within. Default is 3.
Returns:: numpy.ndarray: the slice of MS/MS spectra in the given region.
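One way to take such a slice is a pair of binary searches on the m/z column. The sketch below assumes the spectrum is already sorted by m/z, is stored as rows of [m/z, intensity], and that the window is the closed interval [mz - span, mz + span]. Whether the package uses a symmetric, inclusive window is an assumption. The name search_ions_sketch is hypothetical.

```python
import numpy as np


def search_ions_sketch(msms, mz, span=3):
    """Return the peaks whose m/z lies within [mz - span, mz + span]."""
    msms = np.asarray(msms, dtype=float)
    lo = np.searchsorted(msms[:, 0], mz - span, side="left")
    hi = np.searchsorted(msms[:, 0], mz + span, side="right")
    return msms[lo:hi]


spectrum = np.array([[95.0, 1.0], [98.0, 2.0], [100.0, 9.0],
                     [103.0, 4.0], [107.0, 3.0]])
window = search_ions_sketch(spectrum, mz=100.0, span=3)
```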

spectral_denoising.spectral_operations.slice_spectrum(msms, break_mz)[source]

Slices a mass spectrum into two parts based on a given m/z value.

Parameters:

msms (numpy.ndarray): The mass spectrum data, where each row represents a peak with m/z and intensity values. If an empty spectrum is provided, the function returns NaN. break_mz (float): The break point at which to slice the spectrum.

Returns:

tuple: A tuple containing two numpy.ndarrays:

The first array contains all peaks with m/z values less than the break_mz.
The second array contains all peaks with m/z values greater than or equal to the break_mz.
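The split described above (strictly less than break_mz on the left, greater than or equal on the right) corresponds to a left-sided binary search. A minimal sketch, assuming a spectrum sorted by m/z and a hypothetical function name:

```python
import numpy as np


def slice_spectrum_sketch(msms, break_mz):
    """Split a sorted spectrum into (m/z < break_mz, m/z >= break_mz)."""
    if isinstance(msms, float):  # np.nan marks an empty spectrum
        return np.nan
    msms = np.asarray(msms, dtype=float)
    # side="left" places peaks equal to break_mz in the second part.
    idx = np.searchsorted(msms[:, 0], break_mz, side="left")
    return msms[:idx], msms[idx:]


spectrum = np.array([[95.0, 1.0], [98.0, 2.0], [100.0, 9.0], [103.0, 4.0]])
lower, upper = slice_spectrum_sketch(spectrum, break_mz=100.0)
```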

spectral_denoising.spectral_operations.sort_spectrum(msms)[source]

Sorts the spectrum data based on m/z values.

Parameters:: msms (numpy.ndarray): A 2D numpy array.
Returns:: numpy.ndarray: A 2D numpy array with the same shape as the input, but sorted by the m/z values in ascending order.

spectral_denoising.spectral_operations.spectral_entropy(msms, pmz=None)[source]

Calculate the entropy of the given spectrum.

Parameters:

msms (numpy.ndarray): A 2D array in which each item has the form [pmz, intensity].

pmz (float, optional): The precursor m/z value. If provided, the precursor will be removed from the spectrum.

Returns:

float: The entropy of the query MS/MS spectrum.
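Spectral entropy is the Shannon entropy of the intensities after they are normalized to sum to 1. The sketch below shows that calculation only. It assumes rows of [m/z, intensity], uses the natural logarithm, and omits precursor removal, so results may differ from the package's function. The name spectral_entropy_sketch is hypothetical.

```python
import numpy as np


def spectral_entropy_sketch(msms):
    """Shannon entropy (natural log) of the normalized peak intensities."""
    intensity = np.asarray(msms, dtype=float)[:, 1]
    intensity = intensity[intensity > 0]  # zero peaks contribute nothing
    p = intensity / intensity.sum()
    return float(-np.sum(p * np.log(p)))


two_equal_peaks = np.array([[100.0, 5.0], [200.0, 5.0]])
single_peak = np.array([[100.0, 7.0]])
```

A spectrum with a single peak has zero entropy, and two equally intense peaks give ln(2); more evenly spread intensity across more peaks raises the value.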

spectral_denoising.spectral_operations.standardize_spectrum(ms)[source]

Standardizes the intensity values of a given mass spectrum so that the base peak will have intensity of 1.

Parameters:: ms (numpy.ndarray): A 2D array where the first column represents mass values and the second column represents intensity values.
Returns:: numpy.ndarray: A 2D array with the same mass values and standardized intensity values. The intensity values are normalized to the range [0, 1] and rounded to 4 decimal places.
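Base-peak scaling as described above can be sketched like this. The example assumes a non-empty spectrum stored as rows of [mass, intensity] and uses a hypothetical function name.

```python
import numpy as np


def standardize_sketch(ms):
    """Scale intensities so the base peak is 1, rounded to 4 decimal places."""
    ms = np.asarray(ms, dtype=float)
    out = ms.copy()  # leave the caller's array untouched
    out[:, 1] = np.round(ms[:, 1] / ms[:, 1].max(), 4)
    return out


spectrum = np.array([[50.0, 3.0], [100.0, 9.0]])
standardized = standardize_sketch(spectrum)
```

Unlike normalize_spectrum, which divides by the sum of intensities, this divides by the maximum, so the most intense peak always ends up at exactly 1.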

spectral_denoising.spectral_operations.str_to_arr(msms)[source]: helper function for read_df and save_df

spectral_denoising.spectral_operations.truncate_spectrum(msms, max_mz)[source]

Truncate the given mass spectrum to only include peaks with m/z values less than or equal to max_mz.

Parameters:: msms (numpy.ndarray): The mass spectrum to be truncated. If it is an empty spectrum (np.nan), will also return np.nan. max_mz (float): The maximum m/z value to retain in the truncated spectrum.
Returns:: numpy.ndarray: The truncated mass spectrum with m/z values less than or equal to max_mz.