Analysis

padua.analysis.correlation(df, rowvar=False)

Calculate column-wise Pearson correlations using numpy.ma.corrcoef

NaNs in the input data are masked so that they are ignored when calculating correlations. The result is returned as a Pandas DataFrame of n_columns x n_columns, with the input column index copied to both axes.

Parameters:	df – Pandas DataFrame
Returns:	Pandas DataFrame (n_columns x n_columns) of column-wise correlations
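The masking approach described above can be sketched as follows. This is a minimal illustration that assumes only numpy and pandas, not the library's actual implementation. The helper name `masked_correlation` and the example data are hypothetical.

```python
import numpy as np
import pandas as pd


def masked_correlation(df):
    """Column-wise Pearson correlations that ignore NaN entries (illustrative sketch)."""
    # Mask NaNs so that they are excluded from the calculation.
    masked = np.ma.masked_invalid(df.values)
    # rowvar=False treats each column as a variable and each row as an observation.
    corr = np.ma.corrcoef(masked, rowvar=False)
    # Copy the column index to both axes of the result.
    return pd.DataFrame(np.ma.filled(corr, np.nan), index=df.columns, columns=df.columns)


# Hypothetical example data with one missing value in column "B".
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, np.nan, 8.0],
    "C": [4.0, 3.0, 2.0, 1.0],
})
result = masked_correlation(df)
```

Here `result` is a 3 x 3 DataFrame indexed by `A`, `B` and `C` on both axes.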

padua.analysis.enrichment(df)

Calculate relative enrichment of peptide modifications.

Given a modifiedsitepeptides DataFrame, returns the relative enrichment of the specified modification in the table.

The returned data columns are generated from the input data columns.

Parameters:	df – Pandas `DataFrame`
Returns:	Pandas `DataFrame` of percentage modifications in the supplied data.

padua.analysis.go_enrichment(df, enrichment='function', organism='Homo sapiens', summary=True, fdr=0.05, ids_from=['Proteins', 'Protein IDs'])

Calculate gene ontology (GO) enrichment for a specified set of indices, using the PantherDB GO enrichment service.

Given a DataFrame of processed data, calculates the GO enrichment specified by enrichment for the specified organism. The gene IDs are taken from the index levels named in ids_from, which by default is compatible with both proteinGroups and modified peptide tables. The fdr parameter (default=0.05) sets the cut-off used to filter the results. If summary is True (default), the returned DataFrame contains just the ontology summary and FDR.

Parameters:

df – Pandas DataFrame of processed data
enrichment – str GO enrichment method to use (one of ‘function’, ‘process’, ‘cellular_location’, ‘protein_class’, ‘pathway’)
organism – str organism name (e.g. “Homo sapiens”)
summary – bool return the summarised (True) or full (False) dataset
fdr – float FDR cut-off to use for returned GO enrichments
ids_from – list of str containing the index levels to select IDs from (genes, protein IDs, etc.) default=['Proteins', 'Protein IDs']

Returns:

Pandas DataFrame containing enrichments, sorted by P value.
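The filtering and sorting of results described above might look roughly like the sketch below. It does not call the PantherDB service. The column names (`Term`, `P value`, `FDR`), the inclusive cut-off and the example values are all assumptions made for illustration and are not real output.

```python
import pandas as pd


def filter_enrichments(results, fdr=0.05, summary=True):
    """Keep rows at or below the FDR cut-off, sorted by P value (illustrative sketch)."""
    # Assumption: the cut-off is inclusive; the library may use a strict comparison.
    kept = results[results["FDR"] <= fdr].sort_values("P value")
    if summary:
        # Hypothetical summary view: just the term label and its FDR.
        kept = kept[["Term", "FDR"]]
    return kept.reset_index(drop=True)


# Made-up values purely for illustration; not real PantherDB output.
results = pd.DataFrame({
    "Term": ["kinase activity", "DNA binding", "transport"],
    "P value": [0.0004, 0.03, 0.0001],
    "FDR": [0.01, 0.20, 0.004],
})
significant = filter_enrichments(results)
```

With these example values, the row whose FDR exceeds 0.05 is dropped and the remaining rows are ordered by ascending P value.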

padua.analysis.modifiedaminoacids(df)

Calculate the number of modified amino acids in supplied DataFrame.

Returns the total of all modifications and the total for each amino acid individually, as an int and a dict of int, keyed by amino acid, respectively.

Parameters:	df – Pandas `DataFrame` containing processed data.
Returns:

total_aas – `int` the total number of all modified amino acids
quants – `dict` of `int` keyed by amino acid, giving the individual count for each amino acid
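As a rough sketch of the counting described above, the example below tallies modified residues from a single column. The column name `Amino acid` and the helper name `count_modified_amino_acids` are assumptions for illustration; the library's own implementation may differ.

```python
import pandas as pd


def count_modified_amino_acids(df, column="Amino acid"):
    """Return (total, per-residue counts) for a column of residues (illustrative sketch)."""
    # value_counts() tallies each distinct residue; cast to plain Python ints.
    quants = {aa: int(n) for aa, n in df[column].value_counts().items()}
    total_aas = sum(quants.values())
    return total_aas, quants


# Hypothetical site table: one row per modified site.
sites = pd.DataFrame({"Amino acid": ["S", "T", "S", "Y", "S"]})
total_aas, quants = count_modified_amino_acids(sites)
```

For this example table, `total_aas` is 5 and `quants` is `{"S": 3, "T": 1, "Y": 1}`.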

padua.analysis.pca(df, n_components=2, mean_center=False, **kwargs)

Principal Component Analysis, based on sklearn.decomposition.PCA

Performs a principal component analysis (PCA) on the supplied dataframe, selecting the first n_components components in the resulting model. The model scores and weights are returned.

For more information on PCA and the algorithm used, see the scikit-learn documentation.

Parameters:

df – Pandas `DataFrame` to perform the analysis on
n_components – `int` number of components to select
mean_center – `bool` mean center the data before performing PCA
kwargs – additional keyword arguments to sklearn.decomposition.PCA
Returns:

scores – `DataFrame` of PCA scores, n_components x n_samples
weights – `DataFrame` of PCA weights, n_variables x n_components
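To illustrate the shapes of the returned scores and weights, here is a minimal sketch built directly on sklearn.decomposition.PCA. It assumes that samples are stored as columns and variables as rows (inferred from the documented return dimensions), that the input has no missing values, and it omits the mean_center option. The helper name `pca_sketch` and the random data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_sketch(df, n_components=2, **kwargs):
    """Return (scores, weights) for a variables x samples DataFrame (illustrative sketch)."""
    model = PCA(n_components=n_components, **kwargs)
    # scikit-learn expects samples as rows, so fit on the transpose.
    transformed = model.fit_transform(df.values.T)
    # Scores: one row per component, one column per sample.
    scores = pd.DataFrame(transformed.T, columns=df.columns)
    # Weights: one row per variable, one column per component.
    weights = pd.DataFrame(model.components_.T, index=df.index)
    return scores, weights


# Hypothetical data: 6 variables (rows) measured in 4 samples (columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=[f"protein_{i}" for i in range(6)],
    columns=[f"sample_{j}" for j in range(4)],
)
scores, weights = pca_sketch(df)
```

With 6 variables and 4 samples, `scores` has shape (2, 4) and `weights` has shape (6, 2).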

padua.analysis.sitespeptidesproteins(df, site_localization_probability=0.75)

Generate summary count of modified sites, peptides and proteins in a processed dataset DataFrame.

Returns the number of sites, peptides and proteins, calculated as follows:

sites – count of all sites with a localization probability above the threshold (0.75 by default, or the specified site localization probability)
peptides – number of unique Sequence windows in the dataset (unique peptides)
proteins – number of unique leading proteins in the dataset

Parameters:

df – Pandas `DataFrame` of processed data
site_localization_probability – `float` site localization probability threshold (for sites calculation)
Returns:	`tuple` of `int`, containing sites, peptides, proteins
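The three counts described above could be computed along the lines of the sketch below. The column names (`Localization prob`, `Sequence window`, `Proteins`) are assumptions. So is reading "proteins" as the first-listed entry of a semicolon-separated protein group. The helper name and data are hypothetical.

```python
import pandas as pd


def count_sites_peptides_proteins(df, site_localization_probability=0.75):
    """Return (sites, peptides, proteins) counts (illustrative sketch)."""
    # Sites: rows whose localization probability exceeds the threshold.
    sites = int((df["Localization prob"] > site_localization_probability).sum())
    # Peptides: number of unique sequence windows.
    peptides = int(df["Sequence window"].nunique())
    # Proteins: number of unique leading (first-listed) entries per protein group.
    proteins = int(df["Proteins"].str.split(";").str[0].nunique())
    return sites, peptides, proteins


# Hypothetical processed site table.
df = pd.DataFrame({
    "Localization prob": [0.99, 0.80, 0.50, 0.76, 0.30],
    "Sequence window": ["AAASPKK", "GGGTPRR", "AAASPKK", "LLLYPEE", "MMMSPQQ"],
    "Proteins": ["P1;P2", "P3", "P1;P2", "P3;P1", "P3"],
})
counts = count_sites_peptides_proteins(df)
```

For this example table, `counts` is `(3, 4, 2)`: three sites above 0.75, four unique sequence windows and two unique leading proteins.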