Spectral denoising: useful functions

Here we provide a list of useful functions in the project for reading and writing data in different formats. We also provide functions for manipulating, visualizing, and comparing spectra.

Read data

file_io.read_msp reads any standard MSP file into a pandas.DataFrame and performs basic spectrum cleaning (peaks sorted, zero-intensity ions removed).

spectral_denoising.file_io.read_msp(file_path)[source]

Reads an MSP file into a pandas DataFrame, then sorts the MS/MS spectra and removes zero-intensity ions.

Args:

file_path (str): Path to the MSP file.

Returns:

pd.DataFrame: DataFrame containing the MS/MS spectra information

file_io.read_df reads a pandas.DataFrame from a CSV file and converts any stringified spectra column (m/z and intensity separated by '\t', fragments separated by '\n') to a numpy array.

spectral_denoising.file_io.read_df(path, keep_ms1_only=False)[source]

Pair function of save_df. Reads a CSV file into a DataFrame, processes specific columns based on a pattern check, and converts MS/MS spectra from string format to 2D numpy arrays (strings are used to avoid storage issues in CSV files).

Args:

path (str): The file path to the CSV file.

Returns:

pandas.DataFrame: The processed DataFrame with specific columns converted.

Raises:

FileNotFoundError: If the file at the specified path does not exist.

pd.errors.EmptyDataError: If the CSV file is empty.

pd.errors.ParserError: If the CSV file contains parsing errors.

Notes:
  • The function assumes that the first row of the CSV file contains the column headers.

  • The check_pattern function is used to determine which columns to process.

  • The so.str_to_arr function is used to convert the values in the selected columns.

Write data

file_io.write_to_msp writes a pandas.DataFrame to an MSP file at the specified output location. The spectra column to export must also be specified, since a DataFrame may contain multiple versions of the MS/MS spectra.

spectral_denoising.file_io.write_to_msp(df, file_path, msms_col='peaks', normalize=False)[source]

Pair function of read_msp. Exports a pandas DataFrame to an MSP file.

Args:

df (pd.DataFrame): DataFrame containing spectrum information. Should have columns for ‘name’, ‘peaks’, and other metadata.

file_path (str): Destination path for the MSP file.

Returns:

None

file_io.save_df writes a pandas.DataFrame to a CSV file at the specified output location. All spectra columns are automatically converted to string format for saving.

spectral_denoising.file_io.save_df(df, save_path)[source]

Pair function of read_df.

Saves a DataFrame containing MS/MS spectra to a CSV file, converting any columns containing 2D numpy arrays to string format.

Args:

df (pandas.DataFrame): The DataFrame to be saved.

save_path (str): The file path where the DataFrame should be saved. If the path does not end with ‘.csv’, it will be appended automatically.

Returns:

None

Notes:
  • This function identifies columns in the DataFrame that contain 2D numpy arrays with a second dimension of size 2.

  • These identified columns are converted to string format before saving to the CSV file.

  • The function uses tqdm to display a progress bar while processing the rows of the DataFrame.

Manipulating spectra

spectral_operations.break_spectrum breaks a spectrum stored as a numpy ndarray into two numpy ndarrays of masses and intensities.

spectral_denoising.spectral_operations.break_spectrum(spectra)[source]

Breaks down a given spectrum into its mass and intensity components. Not often used.

Parameters:

spectra (numpy.ndarray): The input spectrum data, a 2D array where each item is formatted as [m/z, intensity]. If np.nan is provided, two empty lists are returned.

Returns:

tuple: The mass values and the intensity values of the spectrum, as two separate arrays.

spectral_operations.pack_spectrum does the reverse operation of break_spectrum. It packs two numpy ndarrays of masses and intensities into a single numpy ndarray with shape [n, 2].

spectral_denoising.spectral_operations.pack_spectrum(mass, intensity)[source]

Inverse of break_spectrum. Packs mass and intensity arrays into a single 2D array, which is the standard MS/MS spectrum data format in this project. Each row of the result corresponds to one pair of mass and intensity values. If either of the input arrays is empty, the function returns NaN.

Parameters:

mass (numpy.ndarray): An array of mass values.

intensity (numpy.ndarray): An array of intensity values.

Returns:

numpy.ndarray: A 2D array with mass and intensity pairs if both input arrays are non-empty, otherwise NaN.
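
The round trip between the packed [n, 2] layout and the separate mass and intensity arrays can be sketched with plain numpy. This is only an illustration of the behavior described above, not the project's source code; the function names break_spectrum_sketch and pack_spectrum_sketch are hypothetical, and the empty-spectrum handling assumes that an empty spectrum is represented by a float NaN.

```python
import numpy as np

def break_spectrum_sketch(spectrum):
    """Split an [n, 2] spectrum into separate mass and intensity arrays."""
    # Assumption: an empty spectrum is passed around as a float NaN.
    if isinstance(spectrum, float) and np.isnan(spectrum):
        return [], []
    spectrum = np.asarray(spectrum, dtype=float)
    return spectrum[:, 0], spectrum[:, 1]

def pack_spectrum_sketch(mass, intensity):
    """Combine mass and intensity arrays into one [n, 2] array, or NaN if empty."""
    if len(mass) == 0 or len(intensity) == 0:
        return np.nan
    return np.column_stack((mass, intensity)).astype(float)

peaks = np.array([[91.05, 120.0], [105.07, 30.0], [119.09, 850.0]])
mass, intensity = break_spectrum_sketch(peaks)
repacked = pack_spectrum_sketch(mass, intensity)  # same values as peaks
```

Packing the two arrays returned by the first helper reproduces the original spectrum, which is what "inverse" means here.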

spectral_operations.normalize_spectrum normalizes the query spectrum so that the intensities sum to 1.

spectral_denoising.spectral_operations.normalize_spectrum(msms)[source]

Normalize the intensity values of a given mass spectrum. This function takes a mass spectrum (msms) as input, transposes it, and normalizes the intensity values (second row) by dividing each intensity by the sum of all intensities. The normalized spectrum is then transposed back to its original form and returned.

Parameters:

msms (numpy.ndarray): A 2D numpy array where the first column contains mass-to-charge ratios (m/z) and the second column contains intensity values.

Returns:

numpy.ndarray: A 2D numpy array with the same shape as the input, where the intensity values have been normalized.

spectral_operations.standardize_spectrum standardizes the query spectrum so that the base peak has an intensity of 1.

spectral_denoising.spectral_operations.standardize_spectrum(ms)[source]

Standardizes the intensity values of a given mass spectrum so that the base peak will have intensity of 1.

Parameters:

ms (numpy.ndarray): A 2D array where the first column represents mass values and the second column represents intensity values.

Returns:

numpy.ndarray: A 2D array with the same mass values and standardized intensity values. The intensity values are normalized to the range [0, 1] and rounded to 4 decimal places.
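
The difference between normalizing (intensities sum to 1) and standardizing (base peak equals 1, rounded to 4 decimal places) is easiest to see side by side. The sketch below re-implements both ideas with numpy under those stated definitions; normalize_sketch and standardize_sketch are hypothetical names, not the library functions.

```python
import numpy as np

def normalize_sketch(spectrum):
    """Scale intensities so that they sum to 1."""
    out = np.asarray(spectrum, dtype=float).copy()
    out[:, 1] = out[:, 1] / out[:, 1].sum()
    return out

def standardize_sketch(spectrum):
    """Scale intensities so that the base (most intense) peak equals 1."""
    out = np.asarray(spectrum, dtype=float).copy()
    out[:, 1] = np.round(out[:, 1] / out[:, 1].max(), 4)
    return out

peaks = np.array([[91.05, 100.0], [105.07, 300.0], [119.09, 600.0]])
normalized = normalize_sketch(peaks)      # intensities 0.1, 0.3, 0.6
standardized = standardize_sketch(peaks)  # intensities 0.1667, 0.5, 1.0
```

Both helpers work on a copy, so the input array is left unchanged.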

spectral_operations.sort_spectrum sorts the query spectrum by m/z.

spectral_denoising.spectral_operations.sort_spectrum(msms)[source]

Sorts the spectrum data based on m/z values.

Parameters:

msms (numpy.ndarray): A 2D numpy array.

Returns:

numpy.ndarray: A 2D numpy array with the same shape as the input, but sorted by the m/z values in ascending order.

spectral_operations.remove_precursor will remove the precursor ion region from the query spectrum. If no pmz is provided, the function will use the max m/z in the query spectrum as the precursor ion.

spectral_denoising.spectral_operations.remove_precursor(msms, pmz=None)[source]

Removes the precursor ion from the given tandem mass spectrometry (MS/MS) spectrum.

Parameters:

msms (numpy.ndarray): A 2D numpy array.

pmz (float, optional): The precursor m/z value. If not provided, the function will try to guess it from the spectrum.

Returns:

numpy.ndarray: The truncated MS/MS spectrum with the precursor ion removed.

spectral_operations.remove_zero_ions will remove ions with zero intensity from the query spectrum.

spectral_denoising.spectral_operations.remove_zero_ions(msms)[source]

Remove zero intensity ions from a mass spectrometry dataset.

Parameters:

msms (numpy.ndarray or float): MS/MS spectrum in 2D numpy array.

Returns:

numpy.ndarray: A filtered 2D numpy array with rows where the second column (ion intensities) is greater than zero, or np.nan if the input is an empty spectrum.

spectral_operations.sanitize_spectrum is just a wrapper function for spectral_operations.remove_zero_ions and spectral_operations.sort_spectrum.

spectral_denoising.spectral_operations.sanitize_spectrum(msms)[source]

Sanitize the given mass spectrum. This function performs the following operations on the input mass spectrum:
  1. If the input is NaN, it returns NaN.

  2. Sorts the spectrum using the sort_spectrum function.

  3. Removes zero intensity ions using the remove_zero_ions function.

Parameters:

msms (numpy.ndarray): The mass spectrum to be sanitized.

Returns:

numpy.ndarray: The sanitized mass spectrum. If the input is a nan, returns nan.
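
The sanitizing steps (NaN check, sort by m/z, drop zero-intensity ions) amount to two numpy indexing operations. The following is a minimal sketch of that sequence, assuming an empty spectrum is a float NaN; sanitize_sketch is a hypothetical name and the code is not taken from the project.

```python
import numpy as np

def sanitize_sketch(spectrum):
    """Sort a spectrum by m/z and drop ions whose intensity is zero."""
    # Step 1: pass NaN (assumed to mean "empty spectrum") straight through.
    if isinstance(spectrum, float) and np.isnan(spectrum):
        return np.nan
    spectrum = np.asarray(spectrum, dtype=float)
    # Step 2: sort rows by the m/z column.
    spectrum = spectrum[np.argsort(spectrum[:, 0])]
    # Step 3: keep only rows with intensity greater than zero.
    return spectrum[spectrum[:, 1] > 0]

raw = np.array([[119.09, 850.0], [91.05, 0.0], [105.07, 30.0]])
clean = sanitize_sketch(raw)  # rows for m/z 105.07 and 119.09, in that order
```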

spectral_operations.truncate_spectrum will truncate the query spectrum up to the max_mz provided. This function is used by remove_precursor.

spectral_denoising.spectral_operations.truncate_spectrum(msms, max_mz)[source]

Truncate the given mass spectrum to only include peaks with m/z values less than or equal to max_mz.

Parameters:

msms (numpy.ndarray): The mass spectrum to be truncated. If it is an empty spectrum (np.nan), np.nan is returned.

max_mz (float): The maximum m/z value to retain in the truncated spectrum.

Returns:

numpy.ndarray: The truncated mass spectrum with m/z values less than or equal to max_mz.
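
Truncation is a boolean mask on the m/z column, and precursor removal can be built on top of it. The sketch below shows one way this could look. The names are hypothetical, and the margin parameter (how far below the precursor m/z the cut is made) is an assumption introduced for illustration, since the width of the precursor region is not stated here.

```python
import numpy as np

def truncate_sketch(spectrum, max_mz):
    """Keep only peaks with m/z less than or equal to max_mz."""
    if isinstance(spectrum, float) and np.isnan(spectrum):
        return np.nan
    spectrum = np.asarray(spectrum, dtype=float)
    return spectrum[spectrum[:, 0] <= max_mz]

def remove_precursor_sketch(spectrum, pmz=None, margin=1.0):
    """Drop the precursor region by truncating just below the precursor m/z.

    margin is a hypothetical parameter, not part of the documented API.
    """
    if isinstance(spectrum, float) and np.isnan(spectrum):
        return np.nan
    spectrum = np.asarray(spectrum, dtype=float)
    if pmz is None:
        # Fall back to the largest m/z in the spectrum as the precursor.
        pmz = spectrum[:, 0].max()
    return truncate_sketch(spectrum, pmz - margin)

peaks = np.array([[91.05, 120.0], [119.09, 850.0], [180.10, 400.0]])
truncated = truncate_sketch(peaks, 150.0)          # first two peaks
no_precursor = remove_precursor_sketch(peaks)      # first two peaks as well
```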

spectral_operations.slice_spectrum will slice the query spectrum at the break_mz.

spectral_denoising.spectral_operations.slice_spectrum(msms, break_mz)[source]

Slices a mass spectrum into two parts based on a given m/z value.

Parameters:

msms (numpy.ndarray): The mass spectrum data, where each row represents a peak with m/z and intensity values. If an empty spectrum is provided, the function returns NaN.

break_mz (float): The m/z value at which to slice the spectrum.

Returns:
tuple: A tuple containing two numpy.ndarrays:
  • The first array contains all peaks with m/z values less than the break_mz.

  • The second array contains all peaks with m/z values greater than or equal to the break_mz.
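
The two halves described above are complementary boolean masks on the m/z column. A minimal numpy sketch, using the hypothetical name slice_sketch and assuming an empty spectrum is a float NaN:

```python
import numpy as np

def slice_sketch(spectrum, break_mz):
    """Split a spectrum into peaks below break_mz and peaks at or above it."""
    if isinstance(spectrum, float) and np.isnan(spectrum):
        return np.nan
    spectrum = np.asarray(spectrum, dtype=float)
    below = spectrum[spectrum[:, 0] < break_mz]
    at_or_above = spectrum[spectrum[:, 0] >= break_mz]
    return below, at_or_above

peaks = np.array([[91.05, 120.0], [105.07, 30.0], [119.09, 850.0]])
low, high = slice_sketch(peaks, 105.07)
# low holds the 91.05 peak; the peak exactly at break_mz goes into high
```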

spectral_operations.add_spectra will naively add 2 spectra together.

Formatting spectra

spectral_operations.arr_to_str will format the numpy ndarray query spectrum into a string, with m/z and intensity separated by '\t' and fragments separated by '\n'. Reverse of str_to_arr.

spectral_denoising.spectral_operations.arr_to_str(msms)[source]

Helper function for read_df and save_df.

spectral_operations.str_to_arr will convert a stringified query spectrum, with m/z and intensity separated by '\t' and fragments separated by '\n', into a numpy ndarray. Reverse of arr_to_str.

spectral_denoising.spectral_operations.str_to_arr(msms)[source]

Helper function for read_df and save_df.
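
To make the string layout concrete, here is a small round-trip sketch. It assumes the separators are the tab character between m/z and intensity and the newline character between fragments. The names spectrum_to_string and string_to_spectrum are hypothetical stand-ins, and the exact number formatting used by the library may differ.

```python
import numpy as np

def spectrum_to_string(spectrum):
    """Serialize an [n, 2] spectrum: tab between m/z and intensity, newline between peaks."""
    return "\n".join(f"{mz}\t{intensity}" for mz, intensity in spectrum)

def string_to_spectrum(text):
    """Parse the string produced by spectrum_to_string back into an [n, 2] array."""
    rows = [[float(value) for value in line.split("\t")] for line in text.split("\n")]
    return np.array(rows, dtype=float)

peaks = np.array([[91.05, 120.0], [119.09, 850.0]])
text = spectrum_to_string(peaks)      # "91.05\t120.0\n119.09\t850.0"
restored = string_to_spectrum(text)   # same values as peaks
```

Storing the spectrum as a single string keeps each spectrum in one CSV cell, which is why the read and save functions convert in both directions.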

Visualizing spectra

spectra_plotter.head_to_tail_plot will plot the head-to-tail plot of the query spectra. If pmz is given, the precursor region will be removed before plotting and precursor m/z will be marked as grey dashed line. If savepath is provided, the plot will be saved to the location.

spectral_denoising.spectra_plotter.head_to_tail_plot(msms1, msms2, pmz=None, mz_start=None, mz_end=None, pmz2=None, ms2_error=0.02, title=None, color1=None, color2=None, savepath=None, show=True, publication=False, fontsize=12)[source]

Plots a head-to-tail comparison of two MS/MS spectra.

Parameters:

msms1 (np.array): First mass spectrum data in 2D np.array format, e.g. np.array([[mz1, intensity1], [mz2, intensity2], …]).

msms2 (np.array): Second mass spectrum data. Same format as msms1.

pmz (float or str, optional): Precursor m/z value for the first spectrum. Default is None. If given, precursors will be removed from both spectra and the precursor will be shown as a grey dashed line in the plot.

mz_start (float, optional): Start of the m/z range for plotting, used to zoom in. Default is None.

mz_end (float, optional): End of the m/z range for plotting, used to zoom in. Default is None.

pmz2 (float or str, optional): Precursor m/z value for the second spectrum, in case the two spectra have different precursors. Default is None.

ms2_error (float, optional): Error tolerance for m/z values. Default is 0.02.

color1 (str, optional): Color for the first spectrum’s peaks. Default is None.

color2 (str, optional): Color for the second spectrum’s peaks. Default is None.

savepath (str, optional): Path to save the plot image. Default is None.

show (bool, optional): If True, displays the plot. Default is True. Turn it off if you want to save the plot without displaying it.

publication (bool, optional): If True, formats the plot for publication (3 × 2.5 inches, for a single-column figure). Default is False.

fontsize (int, optional): Font size for plot labels. Default is 12.

Returns:

matplotlib.pyplot or None: The plot object if show is True, otherwise None.
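
For readers unfamiliar with the layout, a head-to-tail (mirror) plot draws the first spectrum upward and the second downward on a shared m/z axis. The sketch below shows the idea with matplotlib only. head_to_tail_sketch is a hypothetical function, the colors and figure size are arbitrary, and unlike the documented function it only marks the precursor and does not remove it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np

def head_to_tail_sketch(msms1, msms2, pmz=None, color1="tab:blue", color2="tab:red"):
    """Draw msms1 pointing up and msms2 pointing down on a shared m/z axis."""
    top = np.asarray(msms1, dtype=float)
    bottom = np.asarray(msms2, dtype=float)
    # Scale each spectrum so that its base peak is 100 %.
    top_int = top[:, 1] / top[:, 1].max() * 100
    bottom_int = bottom[:, 1] / bottom[:, 1].max() * 100
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.vlines(top[:, 0], 0, top_int, colors=color1)
    ax.vlines(bottom[:, 0], 0, -bottom_int, colors=color2)
    ax.axhline(0, color="black", linewidth=0.8)
    if pmz is not None:
        # Mark the precursor m/z with a grey dashed line.
        ax.axvline(float(pmz), color="grey", linestyle="--")
    ax.set_xlabel("m/z")
    ax.set_ylabel("Relative intensity (%)")
    ax.set_ylim(-105, 105)
    return fig, ax

spec_a = np.array([[91.05, 120.0], [119.09, 850.0]])
spec_b = np.array([[91.05, 300.0], [105.07, 40.0], [119.09, 900.0]])
fig, ax = head_to_tail_sketch(spec_a, spec_b, pmz=180.10)
```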

spectra_plotter.ms2_plot will plot the query spectrum. It takes parameters similar to those of head_to_tail_plot.

spectral_denoising.spectra_plotter.ms2_plot(msms_1, pmz=None, lower=None, upper=None, savepath=None, color='blue')[source]

Plots a single MS/MS spectrum.

Parameters:

msms_1 (numpy.ndarray): MS/MS (or MS1) spectrum in 2D np.array format, e.g. np.array([[mz1, intensity1], [mz2, intensity2], …]).

pmz (float, optional): Precursor m/z value. If provided, the precursor will be removed from the spectrum. Default is None.

lower (float, optional): Lower bound for m/z values to be plotted. Default is None.

upper (float, optional): Upper bound for m/z values to be plotted. Default is None.

savepath (str, optional): Path to save the plot image. If None, the plot will not be saved. Default is None.

color (str, optional): Color of the spectrum lines. Default is ‘blue’.

Returns:

matplotlib.pyplot: The plot object.

Generating noise

noise.generate_noise will generate synthetic noise with m/z from 50 to pmz. The m/z values are drawn at random and the intensities follow a Poisson distribution with the given lamda. The parameter n is the number of noise ions to generate.

spectral_denoising.noise.generate_noise(pmz, lamda, n=100)[source]

Generate synthetic electronic noise for spectral data.

Parameters:

pmz (float): The upper bound for the mass range.

lamda (float): The lambda parameter for the Poisson distribution, which is both the mean and the variance of the distribution.

n (int, optional): The number of random noise ions to generate. Defaults to 100.

Returns:

np.array: A synthetic spectrum with electronic noise.
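
A rough sketch of this kind of noise generator is shown below. It assumes the m/z values are drawn uniformly between 50 and pmz, which is only one reading of "random", and it leaves the Poisson draws as raw counts without any rescaling. The function name and the seed parameter are hypothetical additions for reproducibility and are not part of the documented signature.

```python
import numpy as np

def generate_noise_sketch(pmz, lamda, n=100, seed=None):
    """Return an [n, 2] array of synthetic noise peaks (m/z, intensity)."""
    rng = np.random.default_rng(seed)
    # Assumption: m/z values are uniform on [50, pmz).
    mass = rng.uniform(50.0, pmz, size=n)
    # Poisson-distributed intensities with mean (and variance) lamda.
    intensity = rng.poisson(lam=lamda, size=n).astype(float)
    return np.column_stack((mass, intensity))

noise = generate_noise_sketch(pmz=300.0, lamda=10, n=100, seed=0)
```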

noise.generate_chemical_noise will generate synthetic chemical noise with m/z from 50 to pmz. The m/z values are randomly sampled from the formula_db provided. The intensity is determined in the same manner as in generate_noise.

spectral_denoising.noise.generate_chemical_noise(pmz, lamda, polarity, formula_db, n=100)[source]

Generate chemical noise for a given mass-to-charge ratio (m/z) and other parameters. The m/z of the chemical noise is taken from a database of all true possible mass values. Details about this database can be found in the paper: LibGen: Generating High Quality Spectral Libraries of Natural Products for EAD-, UVPD-, and HCD-High Resolution Mass Spectrometers

Args:

pmz (float): The target mass-to-charge ratio (m/z) value.

lamda (float): The lambda parameter for the Poisson distribution used to generate intensities, which is both the mean and the variance of the distribution.

polarity (str): The polarity of the adduct, either ‘+’ or ‘-’.

formula_db (pandas.DataFrame): A DataFrame containing a column ‘mass’ with possible mass values.

n (int, optional): The number of noise peaks to generate. Default is 100.

Returns:

np.array: A synthetic spectrum with chemical noise.

Raises:

ValueError: If the polarity is not ‘+’ or ‘-’.

noise.add_noise will add noise to the query spectrum. The noise is generated by generate_noise or generate_chemical_noise.

spectral_denoising.noise.add_noise(msms, noise)[source]

Add noise to a mass spectrum and process the resulting spectrum. This function takes a mass spectrum and a noise spectrum, standardizes the mass spectrum, adds the noise to it, normalizes the resulting spectrum, and sorts it.

Args:

msms (np.ndarray): The mass spectrum to which noise will be added.

noise (np.ndarray): The noise spectrum to be added to the mass spectrum.

Returns:

np.ndarray: The processed mass spectrum after adding noise, normalization, and sorting.

Notes:
  • The noise spectrum is generated with intensity as a relative measure (from 0 to 1).

  • Thus, the mass spectrum is standardized using the standardize_spectrum function before the noise is added.
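
The sequence described above (standardize, add noise, normalize, sort) can be sketched in a few lines of numpy. This is an illustration under assumptions rather than the library's code: add_noise_sketch is a hypothetical name, and "adding" is assumed to mean concatenating the noise peaks onto the spectrum without merging peaks that share an m/z.

```python
import numpy as np

def add_noise_sketch(msms, noise):
    """Standardize msms, append the noise peaks, normalize, and sort by m/z."""
    msms = np.asarray(msms, dtype=float).copy()
    noise = np.asarray(noise, dtype=float)
    # Standardize: the base peak of the clean spectrum becomes 1.
    msms[:, 1] = msms[:, 1] / msms[:, 1].max()
    # Assumption: noise peaks are simply stacked under the spectrum peaks.
    combined = np.vstack((msms, noise))
    # Normalize: intensities of the combined spectrum sum to 1.
    combined[:, 1] = combined[:, 1] / combined[:, 1].sum()
    # Sort rows by m/z.
    return combined[np.argsort(combined[:, 0])]

clean = np.array([[100.0, 50.0], [200.0, 100.0]])
noise = np.array([[150.0, 0.25], [120.0, 0.25]])
noisy = add_noise_sketch(clean, noise)
# m/z order: 100, 120, 150, 200 with intensities 0.25, 0.125, 0.125, 0.5
```

Standardizing first puts the spectrum and the noise on the same relative intensity scale, so the noise level is expressed as a fraction of the base peak.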