Analysis
-
padua.analysis.anova_1way(df, *args, fdr=0.05)
Perform one-way Analysis of Variance (ANOVA) on the provided DataFrame for the specified groups. Groups for analysis can be specified as individual arguments, e.g.
anova_1way(df, "Group A", "Group B")
anova_1way(df, ("Group A", 5), ("Group B", 5))
At least 2 groups must be provided.
Returns: DataFrame containing the selected groups and the P/T/sig values for the comparisons.
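The per-row statistic behind a one-way ANOVA can be sketched as follows. This is an illustration only, not padua's implementation: the helper name `f_statistic_by_row`, the prefix-based column selection, and the example values are assumptions, and the P value and FDR steps are left out.

```python
import numpy as np
import pandas as pd


def f_statistic_by_row(df, *groups):
    """One-way ANOVA F statistic for each row.

    `groups` are column-name prefixes (a hypothetical convention for this sketch).
    Missing values are not handled here.
    """
    if len(groups) < 2:
        raise ValueError("At least 2 groups must be provided.")
    # One 2-D block (rows x replicates) per group
    blocks = [
        df.loc[:, [c for c in df.columns if c.startswith(g)]].to_numpy(dtype=float)
        for g in groups
    ]
    n_total = sum(b.shape[1] for b in blocks)
    grand_mean = np.hstack(blocks).mean(axis=1)
    # Between-group and within-group sums of squares, computed per row
    ss_between = sum(b.shape[1] * (b.mean(axis=1) - grand_mean) ** 2 for b in blocks)
    ss_within = sum(((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) for b in blocks)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return pd.Series(f, index=df.index, name="F")


# Made-up example: two proteins, three replicates per group
df = pd.DataFrame(
    {
        "Group A 1": [1.0, 5.0], "Group A 2": [2.0, 5.5], "Group A 3": [3.0, 4.5],
        "Group B 1": [4.0, 5.0], "Group B 2": [5.0, 4.5], "Group B 3": [6.0, 5.5],
    },
    index=["P1", "P2"],
)
f = f_statistic_by_row(df, "Group A", "Group B")
```

A large F value (row P1) indicates that the group means differ relative to the within-group spread; identical group means (row P2) give F = 0.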
-
padua.analysis.correlation(df, rowvar=False)
Calculate column-wise Pearson correlations using numpy.ma.corrcoef.
Input data is masked to ignore NaNs when calculating correlations. Data is returned as a Pandas DataFrame of column_n x column_n dimensions, with the column index copied to both axes.
Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame (n_columns x n_columns) of column-wise correlations
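The masking step described above can be sketched with NumPy and pandas directly. The example data and variable names are made up for illustration; the function's actual internals may differ.

```python
import numpy as np
import pandas as pd

# Made-up data: column "c" contains a missing value
df = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0],
        "b": [2.0, 4.0, 6.0, 8.0, 10.0],
        "c": [5.0, np.nan, 1.0, 4.0, 2.0],
    }
)

# Mask NaNs so they are ignored rather than propagated
masked = np.ma.masked_invalid(df.to_numpy(dtype=float))

# rowvar=False treats each column as a variable
corr = np.ma.corrcoef(masked, rowvar=False)

# Copy the column index to both axes of the result
result = pd.DataFrame(np.ma.filled(corr, np.nan), index=df.columns, columns=df.columns)
```

Here `result` is a 3 x 3 DataFrame, and columns "a" and "b" are perfectly correlated because one is a multiple of the other.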
-
padua.analysis.enrichment_from_evidence(dfe, modification='Phospho (STY)')
Calculate relative enrichment of peptide modifications from evidence.txt.
Taking an evidence DataFrame, returns the relative enrichment of the specified modification in the table. The returned data columns are generated from the input data columns.
Parameters: dfe – Pandas DataFrame of evidence
Returns: Pandas DataFrame of percentage modifications in the supplied data.
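The idea of a percentage enrichment can be sketched as the share of rows carrying the modification, per experiment. The column names `Modifications` and `Experiment`, the helper name `percent_modified`, and the row-counting approach are assumptions for this sketch; the library may weight or group the data differently.

```python
import pandas as pd


def percent_modified(evidence, modification="Phospho (STY)"):
    """Percentage of rows per experiment whose modification string contains `modification`."""
    # regex=False: the parentheses in "Phospho (STY)" must be matched literally
    has_mod = evidence["Modifications"].str.contains(modification, regex=False)
    share = has_mod.astype(float).groupby(evidence["Experiment"]).mean()
    return (share * 100).rename("% modified")


# Made-up evidence-like table
evidence = pd.DataFrame(
    {
        "Experiment": ["E1", "E1", "E1", "E1", "E2", "E2", "E2", "E2"],
        "Modifications": [
            "Phospho (STY)", "Unmodified", "2 Phospho (STY)", "Oxidation (M)",
            "Unmodified", "Phospho (STY),Oxidation (M)", "Unmodified", "Oxidation (M)",
        ],
    }
)
result = percent_modified(evidence)
```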
-
padua.analysis.enrichment_from_msp(dfmsp, modification='Phospho (STY)')
Calculate relative enrichment of peptide modifications from modificationSpecificPeptides.txt.
Taking a modificationSpecificPeptides DataFrame, returns the relative enrichment of the specified modification in the table. The returned data columns are generated from the input data columns.
Parameters: dfmsp – Pandas DataFrame of modificationSpecificPeptides
Returns: Pandas DataFrame of percentage modifications in the supplied data.
-
padua.analysis.go_enrichment(df, enrichment='function', organism='Homo sapiens', summary=True, fdr=0.05, ids_from=['Proteins', 'Protein IDs'])
Calculate gene ontology (GO) enrichment for a specified set of indices, using the PantherDB GO enrichment service.
Provided with a processed data DataFrame, will calculate the GO enrichment specified by enrichment, for the specified organism. The IDs to use for genes are taken from the field ids_from, which by default is compatible with both proteinGroups and modified peptide tables. Setting the fdr parameter (default=0.05) sets the cut-off to use for filtering the results. If summary is True (default) the returned DataFrame contains just the ontology summary and FDR.
Parameters:
- df – Pandas DataFrame of processed data
- enrichment – str GO enrichment method to use (one of 'function', 'process', 'cellular_location', 'protein_class', 'pathway')
- organism – str organism name (e.g. "Homo sapiens")
- summary – bool return full, or summarised dataset
- fdr – float FDR cut-off to use for returned GO enrichments
- ids_from – list of str containing the index levels to select IDs from (genes, protein IDs, etc.) default=['Proteins', 'Protein IDs']
Returns: Pandas DataFrame containing enrichments, sorted by P value.
-
padua.analysis.modifiedaminoacids(df)
Calculate the number of modified amino acids in the supplied DataFrame.
Returns the total of all modifications and the total for each amino acid individually, as an int and a dict of int, keyed by amino acid, respectively.
Parameters: df – Pandas DataFrame containing processed data.
Returns:
- total_aas – int the total number of all modified amino acids
- quants – dict of int keyed by amino acid, giving individual counts for each aa.
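The shape of the two return values can be sketched as below. The column name `Amino acid` and the helper name `count_modified_aas` are assumptions for illustration; the library may read the residues from a different column or index level.

```python
import pandas as pd


def count_modified_aas(df, column="Amino acid"):
    """Return (total, per-residue counts) for the modified residues listed in `column`."""
    quants = {aa: int(n) for aa, n in df[column].value_counts().items()}
    total_aas = sum(quants.values())
    return total_aas, quants


# Made-up table with one modified residue per row
sites = pd.DataFrame({"Amino acid": ["S", "T", "S", "Y", "S"]})
total_aas, quants = count_modified_aas(sites)
```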
-
padua.analysis.pca(df, n_components=2, mean_center=False, **kwargs)
Principal Component Analysis, based on sklearn.decomposition.PCA
Performs a principal component analysis (PCA) on the supplied dataframe, selecting the first n_components components in the resulting model. The model scores and weights are returned.
For more information on PCA and the algorithm used, see the scikit-learn documentation.
Parameters:
- df – Pandas DataFrame to perform the analysis on
- n_components – int number of components to select
- mean_center – bool mean center the data before performing PCA
- kwargs – additional keyword arguments to sklearn.decomposition.PCA
Returns:
- scores – DataFrame of PCA scores n_components x n_samples
- weights – DataFrame of PCA weights n_variables x n_components
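The shapes of the returned scores and weights can be illustrated with a small sketch. Note that this uses a plain NumPy SVD in place of sklearn.decomposition.PCA, and it assumes that rows of the input are variables and columns are samples; the helper name `pca_sketch` and the random data are hypothetical.

```python
import numpy as np
import pandas as pd


def pca_sketch(df, n_components=2):
    """PCA via SVD; returns (scores, weights) with the shapes listed above."""
    X = df.to_numpy(dtype=float).T           # samples x variables
    X = X - X.mean(axis=0)                   # centre each variable
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = (U[:, :n_components] * s[:n_components]).T   # n_components x n_samples
    weights = Vt[:n_components].T                          # n_variables x n_components
    labels = [f"PC{i + 1}" for i in range(n_components)]
    return (
        pd.DataFrame(scores, index=labels, columns=df.columns),
        pd.DataFrame(weights, index=df.index, columns=labels),
    )


# Made-up data: 6 variables (rows) measured in 4 samples (columns)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.normal(size=(6, 4)),
    index=[f"protein_{i}" for i in range(6)],
    columns=[f"sample_{j}" for j in range(4)],
)
scores, weights = pca_sketch(data, n_components=2)
```

Each column of `weights` is a unit-length direction in variable space, and `scores` holds the projection of every sample onto those directions.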
-
padua.analysis.plsda(df, a, b, n_components=2, mean_center=False, scale=True, **kwargs)
Partial Least Squares Discriminant Analysis, based on sklearn.cross_decomposition.PLSRegression
Performs a binary group partial least squares discriminant analysis (PLS-DA) on the supplied dataframe, selecting the first n_components components.
Sample groups are defined by the selectors a and b, which are used to select columns from the supplied dataframe. The resulting model is applied to the entire dataset, projecting non-selected samples into the same space.
For more information on PLS regression and the algorithm used, see the scikit-learn documentation.
Parameters:
- df – Pandas DataFrame to perform the analysis on
- a – Column selector for group a
- b – Column selector for group b
- n_components – int number of components to select
- mean_center – bool mean center the data before performing PLS regression
- kwargs – additional keyword arguments to sklearn.cross_decomposition.PLSRegression
Returns:
- scores – DataFrame of PLSDA scores n_components x n_samples
- weights – DataFrame of PLSDA weights n_variables x n_components
-
padua.analysis.plsr(df, v, n_components=2, mean_center=False, scale=True, **kwargs)
Partial Least Squares Regression Analysis, based on sklearn.cross_decomposition.PLSRegression
Performs a partial least squares regression (PLS-R) on the supplied dataframe df against the provided continuous variable v, selecting the first n_components components.
For more information on PLS regression and the algorithm used, see the scikit-learn documentation.
Parameters:
- df – Pandas DataFrame to perform the analysis on
- v – Continuous variable to perform regression against
- n_components – int number of components to select
- mean_center – bool mean center the data before performing PLS regression
- kwargs – additional keyword arguments to sklearn.cross_decomposition.PLSRegression
Returns:
- scores – DataFrame of PLS-R scores n_components x n_samples
- weights – DataFrame of PLS-R weights n_variables x n_components
-
padua.analysis.sitespeptidesproteins(df, site_localization_probability=0.75)
Generate summary count of modified sites, peptides and proteins in a processed dataset DataFrame.
Returns the number of sites, peptides and proteins, calculated as follows:
- sites (>0.75; or specified site localization probability) count of all sites > threshold
- peptides the set of Sequence windows in the dataset (unique peptides)
- proteins the set of unique leading proteins in the dataset
Parameters:
- df – Pandas DataFrame of processed data
- site_localization_probability – float site localization probability threshold (for sites calculation)
Returns: tuple of int, containing sites, peptides, proteins
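The three counts listed above can be sketched as follows. The column names `Localization prob`, `Sequence window` and `Leading proteins`, and the helper name `count_sites_peptides_proteins`, are assumptions for this sketch; the library may read these values from index levels or differently named columns.

```python
import pandas as pd


def count_sites_peptides_proteins(df, site_localization_probability=0.75):
    """Return (sites, peptides, proteins) counts for a sites-like table."""
    # Sites: rows whose localization probability is strictly above the threshold
    sites = int((df["Localization prob"] > site_localization_probability).sum())
    # Peptides: number of distinct sequence windows
    peptides = int(df["Sequence window"].nunique())
    # Proteins: number of distinct leading proteins
    proteins = int(df["Leading proteins"].nunique())
    return sites, peptides, proteins


# Made-up table of four modification sites
table = pd.DataFrame(
    {
        "Localization prob": [0.90, 0.50, 0.80, 0.75],
        "Sequence window": ["AAA", "AAA", "BBB", "CCC"],
        "Leading proteins": ["P1", "P1", "P2", "P2"],
    }
)
counts = count_sites_peptides_proteins(table)
```

With the strict comparison used here, a site at exactly 0.75 is not counted.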