Analysis

padua.analysis.correlation(df, rowvar=False)

Calculate column-wise Pearson correlations using numpy.ma.corrcoef

Input data is masked so that NaNs are ignored when calculating correlations. The result is returned as a Pandas DataFrame of n_columns x n_columns dimensions, with the column index copied to both axes.

Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame (n_columns x n_columns) of column-wise correlations
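
The masking step described above can be sketched as follows. This is a minimal illustration built directly on numpy.ma.corrcoef, not padua's own implementation; the helper name masked_correlation and the toy data are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def masked_correlation(df):
    """Column-wise correlations with NaNs masked (illustrative helper, not padua's code)."""
    # Mask NaN entries so they are left out of the calculation
    masked = np.ma.masked_invalid(df.to_numpy(dtype=float))
    # rowvar=False: each column is a variable, each row an observation
    corr = np.ma.corrcoef(masked, rowvar=False)
    # Copy the column index to both axes of the result
    return pd.DataFrame(np.ma.filled(corr, np.nan),
                        index=df.columns, columns=df.columns)


# Toy data: column "b" has one missing value
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, np.nan],
    "c": [4.0, 3.0, 2.0, 1.0],
})
result = masked_correlation(df)
print(result)
```

The result is a square DataFrame labelled by the input columns on both axes; columns "a" and "c" contain no missing values and are perfectly anti-correlated.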
padua.analysis.enrichment(df)

Calculate relative enrichment of peptide modifications.

Given a modifiedsitepeptides DataFrame, returns the relative enrichment of the specified modification in the table.

The returned data columns are generated from the input data columns.

Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame of percentage modifications in the supplied data.
padua.analysis.go_enrichment(df, enrichment='function', organism='Homo sapiens', summary=True, fdr=0.05, ids_from=['Proteins', 'Protein IDs'])

Calculate gene ontology (GO) enrichment for a specified set of indices, using the PantherDB GO enrichment service.

Given a processed data DataFrame, calculates the GO enrichment specified by enrichment for the specified organism. Gene IDs are taken from the index levels named in ids_from, whose default is compatible with both proteinGroups and modified peptide tables. The fdr parameter (default 0.05) sets the cut-off used to filter the results. If summary is True (default), the returned DataFrame contains just the ontology summary and FDR.

Parameters:
  • df – Pandas DataFrame to
  • enrichment – str, GO enrichment method to use (one of 'function', 'process', 'cellular_location', 'protein_class', 'pathway')
  • organism – str, organism name (e.g. "Homo sapiens")
  • summary – bool, return the summarised (True) or full (False) dataset
  • fdr – float, FDR cut-off to use for returned GO enrichments
  • ids_from – list of str, the index levels to select IDs from (genes, protein IDs, etc.); default=['Proteins', 'Protein IDs']
Returns:

Pandas DataFrame containing enrichments, sorted by P value.

padua.analysis.modifiedaminoacids(df)

Calculate the number of modified amino acids in supplied DataFrame.

Returns the total of all modifications and the total for each amino acid individually, as an int and a dict of int, keyed by amino acid, respectively.

Parameters: df – Pandas DataFrame containing processed data.
Returns:
  • total_aas – int, the total number of all modified amino acids
  • quants – dict of int keyed by amino acid, giving individual counts for each amino acid
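
The two return values can be illustrated with a small counting sketch. It assumes the modified residue of each site is stored in a column named "Amino acid" (an assumption, not stated above); the helper name and the toy table are likewise hypothetical.

```python
from collections import Counter

import pandas as pd

# Toy table; the "Amino acid" column name is an assumption for illustration
df = pd.DataFrame({"Amino acid": ["S", "T", "S", "Y", "S"]})


def count_modified_amino_acids(df, column="Amino acid"):
    """Return (total, per-residue counts) for the residues listed in `column`."""
    residues = list(df[column])
    # Per-amino-acid counts, keyed by residue letter
    quants = dict(Counter(residues))
    # Total number of modified residues across all amino acids
    total_aas = len(residues)
    return total_aas, quants


total_aas, quants = count_modified_amino_acids(df)
print(total_aas, quants)
```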
padua.analysis.pca(df, n_components=2, mean_center=False, **kwargs)

Principal Component Analysis, based on sklearn.decomposition.PCA

Performs a principal component analysis (PCA) on the supplied dataframe, selecting the first n_components components in the resulting model. The model scores and weights are returned.

For more information on PCA and the algorithm used, see the scikit-learn documentation.

Parameters:
  • df – Pandas DataFrame to perform the analysis on
  • n_components – int, number of components to select
  • mean_center – bool, mean center the data before performing PCA
  • kwargs – additional keyword arguments to sklearn.decomposition.PCA
Returns:

  • scores – DataFrame of PCA scores, n_components x n_samples
  • weights – DataFrame of PCA weights, n_variables x n_components
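
The documented output shapes can be reproduced with a short sketch that calls sklearn.decomposition.PCA directly. It assumes samples are held in columns and variables in rows, which is inferred from the shapes above rather than stated; the helper name pca_sketch, the component labels, and the random toy data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy data: 10 variables (rows) x 6 samples (columns); this layout is an assumption
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(10, 6)),
    index=[f"protein_{i}" for i in range(10)],
    columns=[f"sample_{j}" for j in range(6)],
)


def pca_sketch(df, n_components=2, **kwargs):
    """Illustrative PCA returning (scores, weights); not padua's implementation."""
    model = PCA(n_components=n_components, **kwargs)
    # scikit-learn expects samples in rows, so transpose before fitting
    transformed = model.fit_transform(df.to_numpy().T)  # n_samples x n_components
    labels = [f"PC{i + 1}" for i in range(n_components)]
    # Scores: one row per component, one column per sample
    scores = pd.DataFrame(transformed.T, index=labels, columns=df.columns)
    # Weights: one row per variable, one column per component
    weights = pd.DataFrame(model.components_.T, index=df.index, columns=labels)
    return scores, weights


scores, weights = pca_sketch(df)
print(scores.shape, weights.shape)
```

Note that scikit-learn's PCA centers each feature internally before decomposition, so the sketch does not include a separate mean-centering step.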

padua.analysis.sitespeptidesproteins(df, site_localization_probability=0.75)

Generate summary count of modified sites, peptides and proteins in a processed dataset DataFrame.

Returns the number of sites, peptides and proteins, calculated as follows:

  • sites – count of all sites with a localization probability above the threshold (>0.75, or the specified site localization probability)
  • peptides – the set of Sequence windows in the dataset (unique peptides)
  • proteins – the set of unique leading proteins in the dataset
Parameters:
  • df – Pandas DataFrame of processed data
  • site_localization_probability – float, site localization probability threshold (for sites calculation)
Returns:

tuple of int, containing sites, peptides, proteins
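
The three counts can be sketched on a toy table as follows. The column names "Localization prob", "Sequence window" and "Proteins", the semicolon-separated protein lists, and the helper name summary_counts are all assumptions made for illustration; the sketch also takes the first listed protein as the leading one.

```python
import pandas as pd

# Toy table; all column names and values here are illustrative assumptions
df = pd.DataFrame({
    "Localization prob": [0.90, 0.50, 0.80, 0.76],
    "Sequence window": ["AAASPEK", "AAASPEK", "GGGTPLK", "LLLYPQR"],
    "Proteins": ["P1;P2", "P1;P2", "P3", "P1"],
})


def summary_counts(df, site_localization_probability=0.75):
    """Return (sites, peptides, proteins) counts for a modified-sites table."""
    # Sites: rows whose localization probability exceeds the threshold
    sites = int((df["Localization prob"] > site_localization_probability).sum())
    # Peptides: unique sequence windows
    peptides = set(df["Sequence window"])
    # Proteins: unique leading (first-listed) protein of each row
    proteins = {str(p).split(";")[0] for p in df["Proteins"]}
    return sites, len(peptides), len(proteins)


counts = summary_counts(df)
print(counts)
```

Raising the threshold only changes the sites count; the peptide and protein counts are taken over the whole table in this sketch.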