Visualize

Visualization tools for proteomic data, using standard Pandas dataframe structures from imported data. These functions make some assumptions about the structure of the data, but generally try to accommodate variations.

Depends on scikit-learn for PCA.

padua.visualize.box(df, s=None, title_from=None, subplots=False, figsize=(18, 6), groups=None, fcol=None, ecol=None, hatch=None, ylabel='', xlabel='')

Generate a box plot from pandas DataFrame with sample grouping.

Plot group mean, median and deviations for specific values (proteins) in the dataset. Plotting is controlled via the s param, which is used as a search string along the y-axis. All matching values will be returned and plotted. Multiple search values can be provided as a list of str and these will be searched as an AND query.

Box fill and edge colors can be controlled on a full-index basis by passing a dict of indexer:color to fcol and ecol respectively. Box hatching can be controlled by passing a dict of indexer:hatch to hatch.

Parameters:
  • df – Pandas DataFrame
  • s – str search y-axis for matching values (case-insensitive)
  • title_from – list of str of index levels to generate title from
  • subplots – bool use subplots to separate plot groups
  • figsize – tuple of int size of resulting figure
  • groups
  • fcol – dict of str indexer:color where color is hex value or matplotlib color code
  • ecol – dict of str indexer:color where color is hex value or matplotlib color code
  • hatch – dict of str indexer:hatch where hatch is matplotlib hatch descriptor
  • ylabel – str ylabel for boxplot
  • xlabel – str xlabel for boxplot
Returns:

list of Figure
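
The search behaviour described for the s param can be sketched without pandas. This is a minimal illustration, assuming that each search term is matched as a case-insensitive substring and that a label must contain every term (the AND query). `match_all` is a hypothetical helper, not part of padua.

```python
def match_all(labels, search):
    """Return the labels that contain every search term, ignoring case."""
    # Accept a single str or a list of str, mirroring the `s` parameter.
    terms = [search] if isinstance(search, str) else list(search)
    terms = [t.lower() for t in terms]
    return [label for label in labels
            if all(t in label.lower() for t in terms)]


labels = ["Histone H2A", "Histone H3.1", "Tubulin alpha-1B chain"]
single = match_all(labels, "histone")          # one search term
both = match_all(labels, ["histone", "h3"])    # two terms, both must match
```

With a single term, both histone entries match; adding a second term narrows the selection to labels that contain both.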

padua.visualize.column_correlations(df, cmap=<Mock id='139759959814384'>)
Parameters:
  • df
  • cmap
Returns:

padua.visualize.comparedist(df1, df2, bins=50)
Compare the distributions of two DataFrames giving visualisations of:
  • individual and combined distributions
  • distribution of non-common values
  • distribution of non-common values vs. each side

Plot distribution as area (fill_between) + mean, median vertical bars.

Parameters:
  • df1 – pandas.DataFrame
  • df2 – pandas.DataFrame
  • bins – int number of bins for histogram
Returns:

Figure

padua.visualize.correlation(df, cm=cm.PuOr_r, vmin=None, vmax=None, labels=None, show_scatter=False)

Generate a column-wise correlation plot from the provided data.

The columns of the supplied dataframes will be correlated (using analysis.correlation) to generate a Pearson correlation plot heatmap. Scatter plots of correlated samples can also be generated over the redundant half of the plot to give a visual indication of the protein distribution.

Parameters:
  • df – pandas.DataFrame
  • cm – Matplotlib colormap (default cm.PuOr_r)
  • vmin – Minimum value for colormap normalization
  • vmax – Maximum value for colormap normalization
  • labels – Index column to retrieve labels from
  • show_scatter – Show overlaid scatter plots for each sample in lower-left half. Note that this is slow for large numbers of samples.
Returns:

matplotlib.Figure generated Figure.
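
The column-wise Pearson correlation behind the heatmap can be sketched in plain Python. This is an illustration only: padua computes the values via analysis.correlation, and `pearson` and `column_correlation_matrix` are hypothetical names introduced here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def column_correlation_matrix(columns):
    """Correlate every pair of columns; `columns` maps sample name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Three samples (columns) measured over four proteins (rows).
samples = {
    "A1": [1.0, 2.0, 3.0, 4.0],
    "A2": [2.0, 4.0, 6.0, 8.0],
    "B1": [4.0, 3.0, 2.0, 1.0],
}
matrix = column_correlation_matrix(samples)
```

The matrix is symmetric, which is why one half of the heatmap is redundant and can be reused for scatter plots.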

padua.visualize.enrichment(df)
Parameters:
  • df
Returns:

padua.visualize.hierarchical(df, cluster_cols=True, cluster_rows=False, n_col_clusters=False, n_row_clusters=False, fcol=None, z_score=0, method='ward', cmap=<Mock id='139759959986128'>, return_clusters=False, rdistance_fn=<Mock id='139759960027824'>, cdistance_fn=<Mock id='139759959917064'>)

Hierarchical clustering of samples or proteins

Perform a hierarchical clustering on a pandas DataFrame and display the resulting clustering as a heatmap. The axis of clustering can be controlled with cluster_cols and cluster_rows. By default clustering is performed along the X-axis; to cluster samples, transpose the DataFrame as it is passed, using df.T.

Samples are z-scored along the 0-axis (y) by default. To override this, set the z_score param to the axis to z-score, or to None to turn it off.

If n_col_clusters or n_row_clusters is specified, this defines the number of clusters to identify and highlight in the resulting heatmap. At least this number of clusters will be selected; in some instances there will be more if two clusters rank equally at the determined cutoff.

If specified fcol will be used to colour the axes for matching samples.

Parameters:
  • df – Pandas DataFrame to cluster
  • cluster_cols – bool if True cluster along column axis
  • cluster_rows – bool if True cluster along row axis
  • n_col_clusters – int the ideal number of highlighted clusters in cols
  • n_row_clusters – int the ideal number of highlighted clusters in rows
  • fcol – dict of label:colors to be applied along the axes
  • z_score – int to specify the axis to Z score or None to disable
  • method – str describing cluster method, default ward
  • cmap – matplotlib colourmap for heatmap
  • return_clusters – bool return clusters in addition to axis
Returns:

matplotlib axis, or axis and cluster data
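
The z-scoring step can be sketched in plain Python. This assumes that axis 0 follows the pandas/NumPy convention of computing statistics down the rows, so each column is standardized, and that the mean and population standard deviation are used; neither detail is confirmed by the text above. `zscore_columns` is a hypothetical helper.

```python
import statistics


def zscore_columns(rows):
    """Z-score each column of a row-major table (list of equal-length rows)."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # Population standard deviation (ddof=0); an assumption of this sketch.
    sds = [statistics.pstdev(c) for c in columns]
    return [[(v - m) / sd for v, m, sd in zip(row, means, sds)]
            for row in rows]


# Two columns on very different scales end up directly comparable.
table = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
scaled = zscore_columns(table)
```

After scaling, every column has mean 0 and unit spread, so the heatmap colours reflect relative rather than absolute differences.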

padua.visualize.modificationlocalization(df)

Plot the % of Class I, II and III localised peptides according to standard thresholds.

Generates a pie chart showing the % of peptides that fall within the Class I, II and III classifications based on localisation probability. These definitions are:

Class I     x > 0.75
Class II    0.50 < x <= 0.75
Class III   0.25 < x <= 0.50

Any peptides with a localisation score of <= 0.25 are excluded.
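
The classification can be sketched as follows, reading the thresholds as x > 0.75 for Class I, 0.50 < x <= 0.75 for Class II and 0.25 < x <= 0.50 for Class III. Both function names are hypothetical, and computing the percentages over classified peptides only is an assumption of this sketch.

```python
def localisation_class(probability):
    """Return 'I', 'II' or 'III' for a localisation probability, else None."""
    if probability > 0.75:
        return "I"
    if probability > 0.50:
        return "II"
    if probability > 0.25:
        return "III"
    return None  # <= 0.25: excluded


def class_percentages(probabilities):
    """Percentage of classified peptides in each class (excluded ones ignored)."""
    classes = [c for c in map(localisation_class, probabilities)
               if c is not None]
    return {name: 100.0 * classes.count(name) / len(classes)
            for name in ("I", "II", "III")}


probs = [0.99, 0.80, 0.75, 0.60, 0.30, 0.10]
shares = class_percentages(probs)
```

Note that a probability of exactly 0.75 falls into Class II under this reading, and the 0.10 entry is dropped before percentages are computed.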

Parameters:
  • df
Returns:

matplotlib axis

padua.visualize.modifiedaminoacids(df, kind='pie')

Generate a plot of relative numbers of modified amino acids in source DataFrame.

Plot a pie or bar chart showing the number and percentage of modified amino acids in the supplied data frame. The amino acids displayed will be determined from the supplied data/modification type.

Parameters:
  • df – processed DataFrame
  • kind – str type of plot; either “pie” or “bar”
Returns:

matplotlib ax
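
The numbers shown in such a chart (count and percentage per modified amino acid) can be sketched with the standard library. The input here is a plain list of single-letter residue codes; how padua extracts these from the DataFrame is not shown, and `residue_shares` is a hypothetical helper.

```python
from collections import Counter


def residue_shares(residues):
    """Map each amino acid to (count, percentage of all modified residues)."""
    counts = Counter(residues)
    total = sum(counts.values())
    # most_common() orders the result from most to least frequent.
    return {aa: (n, 100.0 * n / total) for aa, n in counts.most_common()}


# One single-letter code per modified site (illustrative values only).
sites = ["S", "S", "T", "S", "Y", "S", "T", "S"]
shares = residue_shares(sites)
```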

padua.visualize.pca(df, n_components=2, mean_center=False, fcol=None, ecol=None, marker='o', markersize=40, threshold=None, label_threshold=None, label_weights=None, label_scores=None, return_df=False, show_covariance_ellipse=False, *args, **kwargs)

Perform Principal Component Analysis (PCA) from input DataFrame and generate scores and weights plots.

Principal Component Analysis is a technique for identifying the largest source of variation in a dataset. This function uses the implementation available in scikit-learn. The PCA is calculated via analysis.pca and will therefore give identical results.

Resulting scores and weights plots are generated showing the distribution of samples within the resulting PCA space. Sample color and marker size can be controlled by label, lookup and calculation (lambda) to generate complex plots highlighting sample separation.

For further information see the examples included in the documentation.
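
To make the terms scores and weights concrete, the first principal component can be computed with a dependency-free power iteration. This is not the scikit-learn implementation that the function uses; it is a sketch that always mean-centres the data, and `first_principal_component` is a hypothetical name.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit weight vector, scores) for the first principal component.

    `rows` is a list of samples, each a list of feature values.
    """
    n, p = len(rows), len(rows[0])
    # Mean-centre each feature (column).
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to the eigenvector with the largest eigenvalue, provided
    # the starting vector is not orthogonal to it.
    w = [1.0] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * w[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # Scores are the projections of each centred sample onto the weights.
    scores = [sum(c[j] * w[j] for j in range(p)) for c in centred]
    return w, scores


data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
weights, scores = first_principal_component(data)
```

Here the second feature is exactly twice the first, so the weight vector points along (1, 2) and the scores order the samples along that direction.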

Parameters:
  • df – Pandas DataFrame
  • n_components – int number of Principal components to return
  • mean_center – bool mean center the data before performing PCA
  • fcol – dict of indexers:colors, where colors are hex colors or matplotlib color names
  • ecol – dict of indexers:colors, where colors are hex colors or matplotlib color names
  • marker – str matplotlib marker name (default “o”)
  • markersize – int or callable which returns an int for a given indexer
  • threshold – float weight threshold for plot (horizontal line)
  • label_threshold – float weight threshold over which to draw labels
  • label_weights – list of str
  • label_scores – list of str
  • return_df – bool return the resulting scores, weights as pandas DataFrames
  • show_covariance_ellipse – bool show the covariance ellipse around each group
  • args – additional arguments passed to analysis.pca
  • kwargs – additional arguments passed to analysis.pca
Returns:

padua.visualize.plot_cov_ellipse(cov, pos, nstd=2, **kwargs)

Plots an nstd sigma error ellipse based on the specified covariance matrix (cov). Additional keyword arguments are passed on to the ellipse patch artist.

Parameters:
  • cov – the 2x2 covariance matrix to base the ellipse on
  • pos – the location of the center of the ellipse; expects a 2-element sequence of [x0, y0]
  • nstd – the radius of the ellipse in numbers of standard deviations (defaults to 2)
  • kwargs – additional keyword arguments are passed on to the ellipse patch
Returns:

A matplotlib ellipse artist

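
The geometry of an nstd-sigma ellipse can be derived from the covariance matrix in closed form. The sketch below is an assumption about how such an ellipse is typically constructed, not a copy of padua's code; `cov_ellipse_params` is a hypothetical helper.

```python
import math


def cov_ellipse_params(cov, nstd=2):
    """Return (width, height, angle_degrees) of the nstd-sigma ellipse.

    `cov` is a symmetric 2x2 covariance matrix [[a, b], [b, c]].
    """
    (a, b), (_, c) = cov
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mid = (a + c) / 2.0
    half_diff = math.hypot((a - c) / 2.0, b)
    major, minor = mid + half_diff, mid - half_diff
    # Orientation of the major axis, measured from the x-axis.
    angle = 0.5 * math.degrees(math.atan2(2.0 * b, a - c))
    # Full axis lengths: nstd standard deviations either side of the centre.
    # max() guards against tiny negative values caused by rounding.
    width = 2.0 * nstd * math.sqrt(major)
    height = 2.0 * nstd * math.sqrt(max(minor, 0.0))
    return width, height, angle


width, height, angle = cov_ellipse_params([[4.0, 0.0], [0.0, 1.0]], nstd=2)
```

These three values correspond to the width, height and angle arguments of matplotlib.patches.Ellipse, with pos supplying the centre.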
padua.visualize.plot_point_cov(points, nstd=2, **kwargs)

Plots an nstd sigma ellipse based on the mean and covariance of a point “cloud” (points, an Nx2 array).

Parameters:
  • points – an Nx2 array of the data points
  • nstd – the radius of the ellipse in numbers of standard deviations (defaults to 2)
  • kwargs – additional keyword arguments are passed on to the ellipse patch
Returns:

A matplotlib ellipse artist

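
The mean and covariance of a point cloud, which determine the ellipse centre and shape, can be sketched in plain Python. The n - 1 denominator is an assumption here (it matches the numpy.cov default), and `mean_and_cov` is a hypothetical helper.

```python
def mean_and_cov(points):
    """Return (centre, 2x2 sample covariance) for a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance with an n - 1 denominator.
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), [[sxx, sxy], [sxy, syy]]


# Four corners of a square: equal spread in x and y, no correlation.
cloud = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
centre, cov = mean_and_cov(cloud)
```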
padua.visualize.rankintensity(df, colors=None, labels_from='Protein names', number_of_annotations=3, show_go_enrichment=False, go_ids_from=None, go_enrichment='function', go_max_labels=8, go_fdr=None)

Rank intensity plot, showing intensity order vs. raw intensity value S curve.

Generates a plot showing detected protein intensity plotted against protein intensity rank. A series of colors can be provided to segment the S curve into regions. Gene ontology enrichments (as calculated via analysis.go_enrichment) can be overlaid on the output. Note that since the ranking reflects simple abundance there is little meaning to enrichment (FDR will remove most if not all items) and it is best considered an annotation of the ‘types’ of proteins in that region.

Parameters:
  • df – Pandas DataFrame
  • colors – list of colors to segment the plot into
  • labels_from – Take labels from this column
  • number_of_annotations – Number of protein annotations at each tip
  • show_go_enrichment – Overlay plot with GO enrichment terms
  • go_ids_from – Get IDs for GO enrichment from this column
  • go_enrichment – Type of GO enrichment to show
  • go_max_labels – Maximum number of GO enrichment labels per segment
  • go_fdr – FDR cutoff to apply to the GO enrichment terms
Returns:

matplotlib Axes
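
The ranking that underlies the S curve can be sketched as a simple sort. Whether rank 1 is the most or least intense protein is an assumption here (most intense first), the gene names and values are illustrative only, and both helpers are hypothetical.

```python
def rank_by_intensity(intensities):
    """Return (rank, label, intensity) tuples, rank 1 = most intense.

    `intensities` maps a protein label to its intensity value.
    """
    ordered = sorted(intensities.items(), key=lambda item: item[1],
                     reverse=True)
    return [(rank, label, value)
            for rank, (label, value) in enumerate(ordered, start=1)]


def tips(ranked, number_of_annotations=3):
    """Labels at the two ends of the curve: top and bottom of the ranking."""
    top = [label for _, label, _ in ranked[:number_of_annotations]]
    bottom = [label for _, label, _ in ranked[-number_of_annotations:]]
    return top, bottom


data = {"ACTB": 9.1e9, "GAPDH": 7.4e9, "TP53": 2.0e6,
        "MYC": 8.5e5, "TUBA1B": 3.3e9}
ranked = rank_by_intensity(data)
top, bottom = tips(ranked, number_of_annotations=2)
```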

padua.visualize.sitespeptidesproteins(df, labels=None, colors=None, site_localization_probability=0.75)

Plot the number of sites, peptides and proteins in the dataset.

Generates a plot with sites, peptides and proteins displayed hierarchically in chevrons. The site count is limited to Class I (>0.75 site localization probability) by default but may be altered using the site_localization_probability parameter.

Labels and alternate colours may be supplied as a 3-entry iterable.

Parameters:
  • df – pandas DataFrame to calculate numbers from
  • labels – list/tuple of 3 strings containing labels
  • colors – list/tuple of 3 colours as hex codes or matplotlib color codes
  • site_localization_probability – the cut-off for site inclusion (default=0.75; Class I)
Returns:

padua.visualize.venn(df1, df2, df3=None, labels=None, ix1=None, ix2=None, ix3=None, return_intersection=False, fcols=None)

Plot a 2 or 3-part venn diagram showing the overlap between 2 or 3 pandas DataFrames.

Provided with two or three Pandas DataFrames, this will return a Venn diagram showing the overlap calculated between the DataFrame indexes provided as ix1, ix2, ix3. Labels for each DataFrame can be provided as a list in the same order, while fcols can be used to specify the colors of each section.

Parameters:
  • df1 – Pandas DataFrame
  • df2 – Pandas DataFrame
  • df3 – Pandas DataFrame (optional)
  • labels – List of labels for the provided dataframes
  • ix1 – Index level name of DataFrame 1 to use for comparison
  • ix2 – Index level name of DataFrame 2 to use for comparison
  • ix3 – Index level name of DataFrame 3 to use for comparison
  • return_intersection – Return the intersection of the supplied indices
  • fcols – List of colors for the provided dataframes
Returns:

ax, or ax with intersection
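
The overlap shown in a two-part diagram reduces to set arithmetic on the chosen index values. The sketch below works on plain lists of identifiers; `venn2_counts` is a hypothetical helper and the accession values are illustrative only.

```python
def venn2_counts(index_a, index_b):
    """Sizes of the three regions of a two-set Venn diagram, plus the overlap."""
    a, b = set(index_a), set(index_b)
    intersection = a & b
    counts = {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "both": len(intersection),
    }
    # Sorted so the shared identifiers come back in a stable order.
    return counts, sorted(intersection)


ix1 = ["P04637", "P60709", "P04406", "P68363"]
ix2 = ["P04406", "P68363", "P01106"]
counts, shared = venn2_counts(ix1, ix2)
```

Returning the shared identifiers alongside the counts mirrors the idea behind return_intersection.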

padua.visualize.volcano(df, a, b=None, fdr=0.05, threshold=2, minimum_sample_n=0, estimate_qvalues=False, labels_from=None, labels_for=None, title=None, markersize=64, s0=1e-05, draw_fdr=True, is_log2=False, fillna=None, label_sig_only=True, ax=None, fc='grey')

Volcano plot of two sample groups showing t-test p value vs. log2(fc).

Generates a volcano plot for two sample groups, selected from df using a and b indexers. The mean of each group is calculated along the y-axis (per protein) and used to generate a log2 ratio. If a log2-transformed dataset is supplied set is_log2=True (a warning will be given when negative values are present).

A two-sample independent t-test is performed between each group. If minimum_sample_n is supplied, any values (proteins) without this number of samples will be dropped from the analysis.

Individual data points can be labelled in the resulting plot by passing labels_from with an index name, and labels_for with a list of matching values for which to plot labels.
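
The two quantities plotted for each protein can be sketched with the standard library. Several details are assumptions of this sketch: the ratio is taken as group A over group B, log2 data is handled as a difference of means, and the t statistic uses pooled variance. Converting the statistic to a p value needs Student's t distribution (for example via scipy.stats.ttest_ind), which is left out here. Both function names are hypothetical.

```python
import math
import statistics


def log2_fold_change(a, b, is_log2=False):
    """log2 ratio of the group means; a plain difference if already log2."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    if is_log2:
        return mean_a - mean_b
    return math.log2(mean_a / mean_b)


def t_statistic(a, b):
    """Two-sample independent t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    diff = statistics.fmean(a) - statistics.fmean(b)
    return diff / math.sqrt(pooled * (1 / na + 1 / nb))


# Replicate intensities for one protein in each group.
group_a = [7.0, 8.0, 9.0]
group_b = [1.0, 2.0, 3.0]
fold_change = log2_fold_change(group_a, group_b)
t_value = t_statistic(group_a, group_b)
```

The fold change gives the x position of the point; the p value derived from the t statistic gives its height.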

Parameters:
  • df – Pandas dataframe
  • a – tuple or str indexer for group A
  • b – tuple or str indexer for group B
  • fdr – float false discovery rate cut-off
  • threshold – float log2(fc) ratio cut-off
  • minimum_sample_n – int minimum sample for t-test
  • estimate_qvalues – bool estimate Q values (adjusted P)
  • labels_from – str or int index level to get labels from
  • labels_for – list of str matching labels to show
  • title – str title for plot
  • markersize – int size of markers
  • s0 – float smoothing factor between fdr/fc cutoff
  • draw_fdr – bool draw the fdr/fc curve
  • is_log2 – bool is the data log2 transformed already?
  • fillna – float fill NaN values with value (default: 0)
  • label_sig_only – bool only label significant values
  • ax – matplotlib axis on which to draw
  • fc – str hex or matplotlib color code, default color of points
Returns: