Analysis
-
padua.analysis.correlation(df, rowvar=False)
Calculate column-wise Pearson correlations using numpy.ma.corrcoef.
Input data is masked to ignore NaNs when calculating correlations. Data is returned as a Pandas DataFrame of n_columns x n_columns dimensions, with the column index copied to both axes.
Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame (n_columns x n_columns) of column-wise correlations
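The masking step described above can be sketched with NumPy and pandas alone. This is a minimal illustration, not padua's actual implementation: the function name masked_correlation and the example data are hypothetical, and only numpy.ma.masked_invalid, numpy.ma.corrcoef and numpy.ma.filled are real library calls.

```python
import numpy as np
import pandas as pd


def masked_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: column-wise Pearson correlation with NaNs masked."""
    # Mask NaN entries so they are excluded from the calculation
    data = np.ma.masked_invalid(df.to_numpy(dtype=float))
    # rowvar=False treats each column as a variable
    corr = np.ma.corrcoef(data, rowvar=False)
    # Copy the column index to both axes of the result
    return pd.DataFrame(np.ma.filled(corr, np.nan),
                        index=df.columns, columns=df.columns)


# Example data (made up for illustration); column "b" has a missing value
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, np.nan, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})
result = masked_correlation(df)
print(result)
```

The result is square, with one row and one column per input column. How numpy.ma.corrcoef normalises pairs that contain masked values has varied between NumPy releases, so values for columns with missing data are best checked against the installed version.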
-
padua.analysis.enrichment(df)
Calculate relative enrichment of peptide modifications.
Given a modifiedsitepeptides DataFrame, returns the relative enrichment of the specified modification in the table. The returned data columns are generated from the input data columns.
Parameters: df – Pandas DataFrame
Returns: Pandas DataFrame of percentage modifications in the supplied data.
-
padua.analysis.go_enrichment(df, enrichment='function', organism='Homo sapiens', summary=True, fdr=0.05, ids_from=['Proteins', 'Protein IDs'])
Calculate gene ontology (GO) enrichment for a specified set of indices, using the PantherDB GO enrichment service.
Provided with a processed data DataFrame, calculates the GO enrichment specified by enrichment, for the specified organism. The IDs to use for genes are taken from the field ids_from, which by default is compatible with both proteinGroups and modified peptide tables. Setting the fdr parameter (default=0.05) sets the cut-off to use for filtering the results. If summary is True (default), the returned DataFrame contains just the ontology summary and FDR.
Parameters:
- df – Pandas DataFrame
- enrichment – str GO enrichment method to use (one of 'function', 'process', 'cellular_location', 'protein_class', 'pathway')
- organism – str organism name (e.g. "Homo sapiens")
- summary – bool return the summarised (True) or full (False) dataset
- fdr – float FDR cut-off to use for returned GO enrichments
- ids_from – list of str containing the index levels to select IDs from (genes, protein IDs, etc.) default=['Proteins', 'Protein IDs']
Returns: Pandas DataFrame containing enrichments, sorted by P value.
-
padua.analysis.modifiedaminoacids(df)
Calculate the number of modified amino acids in the supplied DataFrame.
Returns the total of all modifications and the total for each amino acid individually, as an int and a dict of int, keyed by amino acid, respectively.
Parameters: df – Pandas DataFrame containing processed data.
Returns:
- total_aas – int the total number of all modified amino acids
- quants – dict of int keyed by amino acid, giving individual counts for each aa.
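The shape of the return value (an overall total plus a per-residue dict) can be illustrated with a small counting sketch. It assumes the modified residue for each row is stored in a column named "Amino acid"; that column name, the function name and the example data are assumptions made for illustration, not taken from the padua documentation.

```python
from collections import Counter

import pandas as pd


def count_modified_amino_acids(df: pd.DataFrame, column: str = "Amino acid"):
    """Hypothetical sketch: total and per-residue counts of modified amino acids."""
    # Count how often each amino acid appears in the (assumed) residue column
    quants = dict(Counter(df[column]))
    # The overall total is the sum of the individual counts
    total_aas = sum(quants.values())
    return total_aas, quants


# Example data (made up for illustration)
sites = pd.DataFrame({"Amino acid": ["S", "T", "S", "Y", "S"]})
total_aas, quants = count_modified_amino_acids(sites)
print(total_aas, quants)
```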
-
padua.analysis.pca(df, n_components=2, mean_center=False, **kwargs)
Principal Component Analysis, based on sklearn.decomposition.PCA.
Performs a principal component analysis (PCA) on the supplied dataframe, selecting the first n_components components in the resulting model. The model scores and weights are returned.
For more information on PCA and the algorithm used, see the scikit-learn documentation.
Parameters:
- df – Pandas DataFrame to perform the analysis on
- n_components – int number of components to select
- mean_center – bool mean center the data before performing PCA
- kwargs – additional keyword arguments to sklearn.decomposition.PCA
Returns:
- scores – DataFrame of PCA scores n_components x n_samples
- weights – DataFrame of PCA weights n_variables x n_components
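To make the shapes of the returned scores and weights concrete, here is a rough sketch that uses a plain NumPy SVD in place of sklearn.decomposition.PCA. It assumes rows of the input are variables and columns are samples, and it centers each variable (as scikit-learn's PCA does internally). The function name pca_sketch and the "PC1", "PC2" labels are hypothetical and not part of padua.

```python
import numpy as np
import pandas as pd


def pca_sketch(df: pd.DataFrame, n_components: int = 2):
    """Hypothetical SVD-based PCA; assumes rows are variables, columns are samples."""
    X = df.to_numpy(dtype=float).T   # samples x variables
    X = X - X.mean(axis=0)           # center each variable
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    names = [f"PC{i + 1}" for i in range(n_components)]  # hypothetical labels
    # Scores: projection of each sample onto the leading components
    scores = pd.DataFrame((U[:, :n_components] * S[:n_components]).T,
                          index=names, columns=df.columns)
    # Weights: contribution of each variable to each component
    weights = pd.DataFrame(Vt[:n_components].T, index=df.index, columns=names)
    return scores, weights


# Example data (random, for illustration only): 5 variables x 4 samples
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(5, 4)),
                    index=[f"protein_{i}" for i in range(5)],
                    columns=[f"sample_{j}" for j in range(4)])
scores, weights = pca_sketch(data, n_components=2)
print(scores.shape, weights.shape)
```

With this orientation, scores has shape n_components x n_samples and weights has shape n_variables x n_components, matching the return values listed above.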
-
padua.analysis.sitespeptidesproteins(df, site_localization_probability=0.75)
Generate summary count of modified sites, peptides and proteins in a processed dataset DataFrame.
Returns the number of sites, peptides and proteins, calculated as follows:
- sites (>0.75; or specified site localization probability) count of all sites > threshold
- peptides the set of Sequence windows in the dataset (unique peptides)
- proteins the set of unique leading proteins in the dataset
Parameters:
- df – Pandas DataFrame of processed data
- site_localization_probability – float site localization probability threshold (for sites calculation)
Returns: tuple of int, containing sites, peptides, proteins
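The three counts described above can be sketched in pandas as follows. The column names "Localization prob", "Sequence window" and "Proteins" are assumptions about the input table, and "leading" is interpreted here as the first entry of a semicolon-separated ID list; the function name and example data are hypothetical, so treat this as an illustration of the counting logic only.

```python
import pandas as pd


def count_sites_peptides_proteins(df: pd.DataFrame,
                                  site_localization_probability: float = 0.75):
    """Hypothetical sketch of the sites / peptides / proteins summary counts."""
    # Sites: rows whose localization probability exceeds the threshold
    sites = int((df["Localization prob"] > site_localization_probability).sum())
    # Peptides: number of unique sequence windows
    peptides = int(df["Sequence window"].nunique())
    # Proteins: number of unique first-listed IDs in the semicolon-separated field
    proteins = int(df["Proteins"].str.split(";").str[0].nunique())
    return sites, peptides, proteins


# Example data (made up for illustration)
table = pd.DataFrame({
    "Localization prob": [0.90, 0.80, 0.50, 0.99],
    "Sequence window": ["AAA", "BBB", "AAA", "CCC"],
    "Proteins": ["P1;P2", "P1", "P3;P1", "P4"],
})
counts = count_sites_peptides_proteins(table)
strict_counts = count_sites_peptides_proteins(table, site_localization_probability=0.95)
print(counts, strict_counts)
```

Raising the threshold changes only the sites count; the peptide and protein counts do not depend on it.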