Plotting Tidy Data With Seaborn

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How can I best organize my data tables?

  • How can I make plots from large complex data tables?

Objectives
import seaborn as sns

Seaborn plots using wide-form data

First let’s load our Europe GDP dataset and convert the column labels into integer years as we did previously in the Pandas episode. We’ll also slice out the first five countries to create a manageable subset of the data.

import pandas as pd

data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
# Extract year from last 4 characters of each column name
years = data.columns.str.strip('gdpPercap_')
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)

# Slice out the rows for the first 5 countries.
data_plot = data.iloc[:5]
# Transpose the dataframe so the countries are the columns.
data_plot = data_plot.T

data_plot
country      Albania       Austria       Belgium  Bosnia and Herzegovina      Bulgaria
1952     1601.056136   6137.076492   8343.105127              973.533195   2444.286648
1957     1942.284244   8842.598030   9714.960623             1353.989176   3008.670727
1962     2312.888958  10750.721110  10991.206760             1709.683679   4254.337839
1967     2760.196931  12834.602400  13149.041190             2172.352423   5577.002800
1972     3313.422188  16661.625600  16672.143560             2860.169750   6597.494398
1977     3533.003910  19749.422300  19117.974480             3528.481305   7612.240438
1982     3630.880722  21597.083620  20979.845890             4126.613157   8224.191647
1987     3738.932735  23687.826070  22525.563080             4314.114757   8239.854824
1992     2497.437901  27042.018680  25575.570690             2546.781445   6302.623438
1997     3193.054604  29095.920660  27561.196630             4766.355904   5970.388760
2002     4604.211737  32417.607690  30485.883750             6018.975239   7696.777725
2007     5937.029526  36126.492700  33692.605080             7446.298803  10680.792820

This data is organized in the so-called wide form. The table cells hold values of a single variable (in this case GDP per capita) with two other identifying variables (year and country) encoded in the row and column labels. This may appear convenient and aligns with how we generally organize things in spreadsheets but is actually quite limiting, most critically because it restricts us to storing just three variables. Nonetheless, seaborn will do its best to accept data of this form. Later we will see a different way of organizing dataframes that offers much more flexibility.

sns.pairplot(data_plot, kind='reg')

Pairplot of GDP values by country

sns.relplot(data_plot, kind='line')
sns.catplot(data_plot, kind='box')

Relplot of GDP values by country Catplot of GDP values by country

Seaborn plots using long-form data

An alternative way to organize dataframes is the long form, in which every variable is stored in its own column, with the column label providing its name. Every observation is stored as a separate row. This is often referred to as “tidy data” – you can read more about this concept in the paper Tidy Data by Hadley Wickham. seaborn works best with dataframes in this format, as it provides maximum control over which variables to plot and map to the various visual presentation styles.

Long-form dataframes also support an unlimited number of variables! You can record every variable you think might be useful and decide whether and how to plot it later.

long vs wide form data

melt_test = data.iloc[:3, :2].reset_index()
melt_test
   country         1952         1957
0  Albania  1601.056136  1942.284244
1  Austria  6137.076492  8842.598030
2  Belgium  8343.105127  9714.960623
melt_test.melt(id_vars='country')
   country variable        value
0  Albania     1952  1601.056136
1  Austria     1952  6137.076492
2  Belgium     1952  8343.105127
3  Albania     1957  1942.284244
4  Austria     1957  8842.598030
5  Belgium     1957  9714.960623
melt_test.melt(id_vars='country', var_name='year', value_name='gdpPercap')
   country  year    gdpPercap
0  Albania  1952  1601.056136
1  Austria  1952  6137.076492
2  Belgium  1952  8343.105127
3  Albania  1957  1942.284244
4  Austria  1957  8842.598030
5  Belgium  1957  9714.960623
data_long = data.reset_index().melt(id_vars='country', var_name='year', value_name='gdpPercap')
sns.relplot(data_long, kind='line', x='year', y='gdpPercap', hue='country')
sns.catplot(data_long, y='country', x='gdpPercap', kind='box')

Relplot of GDP values by country using long-form data Catplot of GDP values by country using long-form data

Plots using mass spectrometry data

We have been given a data file containing protein-level quantification from a series of mass spectrometry experiments. We would like to perform a quality control check that the number of unique proteins detected for each treatment condition are comparable and the variance between replicates is low.

prot = pd.read_csv('data/proteomics_protein_level_data.csv')

prot.iloc[:, :5]
        RUN      Protein  LogIntensities                   originalRUN  GROUP
0         1  1433B_HUMAN       15.088586   240805_Sarthy_Acla_chrom_01   Acla
1         2  1433B_HUMAN       14.783678   240805_Sarthy_Acla_chrom_02   Acla
2         3  1433B_HUMAN       15.178893   240805_Sarthy_Acla_chrom_03   Acla
3         4  1433B_HUMAN       15.194030   240805_Sarthy_DMSO_chrom_01   DMSO
4         5  1433B_HUMAN       14.852435   240805_Sarthy_DMSO_chrom_02   DMSO
...     ...          ...             ...                           ...    ...
101236   14   ZZZ3_HUMAN       12.360979   240805_Sarthy_Doxo_chrom_02   Doxo
101237   15   ZZZ3_HUMAN       12.300259   240805_Sarthy_Doxo_chrom_03   Doxo
101238   16   ZZZ3_HUMAN       12.142085  240805_Sarthy_Etopo_chrom_01  Etopo
101239   17   ZZZ3_HUMAN       12.467393  240805_Sarthy_Etopo_chrom_02  Etopo
101240   18   ZZZ3_HUMAN       12.379874  240805_Sarthy_Etopo_chrom_03  Etopo

[101241 rows x 5 columns]
prot.groupby('GROUP')['Protein'].nunique()
GROUP
Acla       6021
DMSO       6222
Dim_dox    6057
Doxo       6073
Etopo      6043
Name: Protein, dtype: int64
prot.groupby(['GROUP', 'RUN'])['Protein'].nunique()
GROUP    RUN
Acla     1      5564
         2      5562
         3      5645
DMSO     4      5644
         5      5660
         6      5678
         7      5589
         8      5630
         9      5652
Dim_dox  10     5629
         11     5634
         12     5586
Doxo     13     5741
         14     5587
         15     5551
Etopo    16     5633
         17     5671
         18     5585
Name: Protein, dtype: int64

Groupby aggregation order of operations

How do these two expressions differ? Is one preferable over the other? Try evaluating the second one without the final ['Protein'] indexing and try to explain what is happening.

prot.groupby('GROUP')['Protein'].nunique()
prot.groupby('GROUP').nunique()['Protein']

Solution

Both evaluate to the same final result, but the second expression computes the number of unique values per group in all columns in the dataframe and then throws away all the computed columns except for Protein. The first expression only ever computes the unique values for the Protein column, so it’s more efficient. Computers are fast so you may not notice the difference until your next dataset is very large.

BONUS: Use the %timeit IPython “magic” to time the performance of both expressions to see the difference for yourself.

trt_unique_proteins = prot.groupby(['GROUP', 'RUN'])['Protein'].nunique().reset_index()
sns.catplot(trt_unique_proteins, x='GROUP', y='Protein', kind='bar')

catplot of unique protein count by treatment condition

Groupby aggregation order of operations

How might we also add a unique color to each bar in the previous plot? HINT: We already used this feature in a previous line plot.

Solution

Add hue='GROUP' to the catplot argument list:

sns.catplot(trt_unique_proteins, x='GROUP', y='Protein', kind='bar', hue='GROUP')

We also have a second data file containing the measurements for the drug treatment conditions normalized by the DMSO carrier control condition. The protein abundance values are reported as the base-2 log of the ratio to the control condition, and p-values have also been computed. We will produce a volcano plot for each treatment condition to identify how many proteins show a strong and statistically significant change relative to the control.

prot_fold = pd.read_csv('data/proteomics_model.csv')

prot_fold.iloc[:, :8]
           Protein         Label    log2FC        SE    Tvalue    DF    pvalue  adj.pvalue
0      1433B_HUMAN     Acla-DMSO  0.044903  0.207659  0.216233  13.0  0.832162    0.986773
1      1433B_HUMAN  Dim_dox-DMSO  0.175012  0.207659  0.842785  13.0  0.414586    0.751113
2      1433B_HUMAN     Doxo-DMSO  0.174600  0.207659  0.840803  13.0  0.415656    0.999459
3      1433B_HUMAN    Etopo-DMSO  0.250530  0.207659  1.206448  13.0  0.249142    0.999273
4      1433E_HUMAN     Acla-DMSO  0.100954  0.176954  0.570509  13.0  0.578061    0.974975
...            ...           ...       ...       ...       ...   ...       ...         ...
25163  ZZEF1_HUMAN    Etopo-DMSO -0.175627  0.250450 -0.701246  13.0  0.495512    0.999273
25164   ZZZ3_HUMAN     Acla-DMSO  0.265786  0.125187  2.123119  12.0  0.055228    0.728978
25165   ZZZ3_HUMAN  Dim_dox-DMSO  0.170327  0.125187  1.360584  12.0  0.198653    0.564757
25166   ZZZ3_HUMAN     Doxo-DMSO  0.037156  0.125187  0.296807  12.0  0.771688    0.999459
25167   ZZZ3_HUMAN    Etopo-DMSO  0.132035  0.125187  1.054707  12.0  0.312333    0.999273

[25168 rows x 8 columns]
import numpy as np
prot_fold['neg_log10_pvalue'] = -np.log10(prot_fold['pvalue'])
sns.relplot(prot_fold, x='log2FC', y='neg_log10_pvalue', col='Label', kind='scatter')

volcano plot of protein abundance

strength_1 = (prot_fold['pvalue'] < 0.05) & (np.abs(prot_fold['log2FC']) > 0.5) & (np.abs(prot_fold['log2FC']) <= 1)
strength_2 = (prot_fold['pvalue'] < 0.05) & (np.abs(prot_fold['log2FC']) > 1)
prot_fold['strength'] = 0
prot_fold.loc[strength_1, 'strength'] = 1
prot_fold.loc[strength_2, 'strength'] = 2
grid = sns.relplot(prot_fold, x='log2FC', y='neg_log10_pvalue', col='Label', kind='scatter', hue='strength')
for ax in grid.axes.flat:
    ax.axhline(-log10(0.05), linestyle='--')

volcano plot of protein abundance with class colors

Key Points