Analysis
-
padua.analysis.correlation(df, rowvar=False)
Calculate column-wise Pearson correlations using numpy.ma.corrcoef.
Input data is masked to ignore NaNs when calculating correlations. Data is returned as a Pandas DataFrame of n_columns x n_columns dimensions, with the column index copied to both axes.
Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame (n_columns x n_columns) of column-wise correlations
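The masking step described above can be sketched without padua. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name masked_correlation and the sample data are hypothetical, and only numpy.ma.masked_invalid, numpy.ma.corrcoef and pandas.DataFrame are real calls.

```python
import numpy as np
import pandas as pd


def masked_correlation(df, rowvar=False):
    """Hypothetical helper: NaN-masked Pearson correlations as a DataFrame."""
    # Mask NaN (and inf) entries so they are ignored by the masked routines.
    data = np.ma.masked_invalid(df.values.astype(float))
    corr = np.ma.corrcoef(data, rowvar=rowvar)
    # With rowvar=False each column is a variable, so label both axes
    # with the column index; otherwise label with the row index.
    labels = df.index if rowvar else df.columns
    return pd.DataFrame(np.ma.filled(corr, np.nan), index=labels, columns=labels)


# Made-up example data; column "c" contains a missing value.
example = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 11.0],
    "c": [5.0, np.nan, 3.0, 2.0, 1.0],
})
result = masked_correlation(example)
```

The result is a square table whose rows and columns both carry the names of the input columns, so it can be passed directly to a heatmap or clustering routine.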
-
padua.analysis.enrichment(df)
Calculate relative enrichment of peptide modifications.
Taking a modifiedsitepeptides DataFrame, returns the relative enrichment of the specified modification in the table. The returned data columns are generated from the input data columns.
Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame of percentage modifications in the supplied data.
-
padua.analysis.go_enrichment(df, enrichment='function', organism='Homo sapiens', summary=True, fdr=0.05, ids_from=['Proteins', 'Protein IDs'])
Calculate gene ontology (GO) enrichment for a specified set of indices, using the PantherDB GO enrichment service.
Provided with a processed data DataFrame, this will calculate the GO enrichment specified by enrichment, for the specified organism. The IDs to use for genes are taken from the field ids_from, which by default is compatible with both proteinGroups and modified peptide tables. The fdr parameter (default=0.05) sets the cut-off used for filtering the results. If summary is True (default), the returned DataFrame contains just the ontology summary and FDR.
Parameters:
- df – Pandas DataFrame of processed data
- enrichment – str GO enrichment method to use (one of 'function', 'process', 'cellular_location', 'protein_class', 'pathway')
- organism – str organism name (e.g. "Homo sapiens")
- summary – bool return the summarised (True) or full (False) dataset
- fdr – float FDR cut-off to use for returned GO enrichments
- ids_from – list of str containing the index levels to select IDs from (genes, protein IDs, etc.), default=['Proteins', 'Protein IDs']
Returns: Pandas DataFrame containing enrichments, sorted by P value.
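The two local steps around the web-service call, picking IDs from an index level and filtering the returned table by FDR, might look roughly like the sketch below. Everything here is an assumption for illustration: the helper names select_ids and filter_by_fdr, the ';'-separated ID format, the result columns 'Term', 'P' and 'FDR', and the use of <= for the cut-off are hypothetical and not taken from padua or PantherDB. No network request is made.

```python
import pandas as pd


def select_ids(df, ids_from=("Proteins", "Protein IDs")):
    """Hypothetical helper: take IDs from the first matching index level."""
    for level in ids_from:
        if level in df.index.names:
            values = df.index.get_level_values(level)
            # Assumption: several IDs may be ';'-separated; keep the first.
            return [str(v).split(";")[0] for v in values]
    raise ValueError("None of the ids_from levels were found in the index")


def filter_by_fdr(results, fdr=0.05, summary=True):
    """Hypothetical helper: keep rows within the FDR cut-off, sorted by P."""
    kept = results[results["FDR"] <= fdr].sort_values("P")
    # Assumption: the summary view keeps only the term and its FDR.
    return kept[["Term", "FDR"]] if summary else kept


# Made-up processed table with IDs stored in the index.
index = pd.MultiIndex.from_tuples(
    [("P1;P2", "G1"), ("P3", "G2")], names=["Protein IDs", "Gene names"]
)
table = pd.DataFrame({"Intensity": [1.0, 2.0]}, index=index)
ids = select_ids(table)

# Made-up enrichment results standing in for a service response.
raw = pd.DataFrame({
    "Term": ["a", "b", "c"],
    "P": [0.03, 0.001, 0.2],
    "FDR": [0.04, 0.01, 0.5],
})
filtered = filter_by_fdr(raw, fdr=0.05)
```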
-
padua.analysis.modifiedaminoacids(df)
Calculate the number of modified amino acids in the supplied DataFrame.
Returns the total of all modifications and the total for each amino acid individually, as an int and a dict of int, keyed by amino acid, respectively.
Parameters: df – Pandas DataFrame containing processed data.
Returns:
- total_aas – int the total number of all modified amino acids
- quants – dict of int keyed by amino acid, giving individual counts for each amino acid.
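The shape of the return value can be illustrated with a small stand-alone sketch. It assumes, without confirmation from the text above, that each row of the table is one modified site and that the residue is stored in a column named 'Amino acid'; the helper name count_modified_amino_acids is hypothetical.

```python
from collections import Counter

import pandas as pd


def count_modified_amino_acids(df, column="Amino acid"):
    """Hypothetical helper: total and per-residue counts of modified sites."""
    # Assumption: one row per modified site, residue letter in `column`.
    residues = list(df[column])
    quants = dict(Counter(residues))
    total_aas = len(residues)
    return total_aas, quants


# Made-up phosphosite table with three serines, one threonine, one tyrosine.
sites = pd.DataFrame({"Amino acid": ["S", "T", "S", "Y", "S"]})
total_aas, quants = count_modified_amino_acids(sites)
```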
-
padua.analysis.pca(df, n_components=2, mean_center=False, **kwargs)
Principal Component Analysis, based on sklearn.decomposition.PCA.
Performs a principal component analysis (PCA) on the supplied DataFrame, selecting the first n_components components in the resulting model. The model scores and weights are returned.
For more information on PCA and the algorithm used, see the scikit-learn documentation.
Parameters:
- df – Pandas DataFrame to perform the analysis on
- n_components – int number of components to select
- mean_center – bool mean center the data before performing PCA
- kwargs – additional keyword arguments to sklearn.decomposition.PCA
Returns:
- scores – DataFrame of PCA scores, n_components x n_samples
- weights – DataFrame of PCA weights, n_variables x n_components
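The documented output shapes suggest that variables are stored as rows and samples as columns. A minimal sketch under that assumption, using sklearn.decomposition.PCA directly, is shown below. The helper name pca_sketch, the component labels and the random data are hypothetical, and the mean_center option is left out because the text does not say which axis is centered. The input must not contain NaNs.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def pca_sketch(df, n_components=2, **kwargs):
    """Hypothetical helper: PCA scores and weights as labelled DataFrames."""
    model = PCA(n_components=n_components, **kwargs)
    # scikit-learn expects samples as rows, so transpose the
    # variables x samples table before fitting.
    transformed = model.fit_transform(df.values.T)
    names = ["PC%d" % (i + 1) for i in range(n_components)]
    # Scores: n_components x n_samples, labelled by sample (column) names.
    scores = pd.DataFrame(transformed.T, index=names, columns=df.columns)
    # Weights: n_variables x n_components, labelled by variable (row) names.
    weights = pd.DataFrame(model.components_.T, index=df.index, columns=names)
    return scores, weights


# Made-up data: 6 variables (rows) measured in 4 samples (columns).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=["v%d" % i for i in range(6)],
    columns=["s%d" % i for i in range(4)],
)
scores, weights = pca_sketch(data, n_components=2)
```

Keeping the sample names on the scores and the variable names on the weights makes it straightforward to plot samples in component space and to look up which variables drive each component.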
-
padua.analysis.sitespeptidesproteins(df, site_localization_probability=0.75)
Generate summary count of modified sites, peptides and proteins in a processed dataset DataFrame.
Returns the number of sites, peptides and proteins, calculated as follows:
- sites (>0.75; or specified site localization probability) count of all sites > threshold
- peptides the set of Sequence windows in the dataset (unique peptides)
- proteins the set of unique leading proteins in the dataset
Parameters:
- df – Pandas DataFrame of processed data
- site_localization_probability – float site localization probability threshold (for sites calculation)
Returns: tuple of int, containing sites, peptides, proteins
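The three counts can be sketched on a toy table as follows. This is an illustration under assumptions, not padua's code: the column names 'Localization prob', 'Sequence window' and 'Proteins', the ';'-separated protein field (of which only the first entry is counted), and the helper name count_sites_peptides_proteins are all hypothetical.

```python
import pandas as pd


def count_sites_peptides_proteins(df, site_localization_probability=0.75):
    """Hypothetical helper: counts of sites, unique peptides and proteins."""
    # Sites: rows whose localization probability is strictly above threshold.
    sites = int((df["Localization prob"] > site_localization_probability).sum())
    # Peptides: number of distinct sequence windows.
    peptides = int(df["Sequence window"].nunique())
    # Proteins: distinct first entries of the ';'-separated protein field.
    leading = df["Proteins"].map(lambda p: str(p).split(";")[0])
    proteins = int(leading.nunique())
    return sites, peptides, proteins


# Made-up site table.
toy = pd.DataFrame({
    "Localization prob": [0.9, 0.5, 0.8, 0.75],
    "Sequence window": ["AAA", "BBB", "AAA", "CCC"],
    "Proteins": ["P1;P2", "P1", "P3", "P3;P4"],
})
counts = count_sites_peptides_proteins(toy)
```

Note that a site with a probability exactly equal to the threshold is not counted here, following the "> threshold" wording above.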