[Python] Visualizing Longitudinal Data (Dot Plots, Box Plots, Violin Plots, Confidence Intervals, Histograms)

1. Purpose
2. Preparation
2.1. Downloading open-visualizations
2.2. Installing the libraries
3. Tutorial 1 (using plotnine)
3.1. Loading the libraries
3.2. Preparing a folder for saved figures
3.3. Loading the data
3.4. Selecting the data
3.5. Dot plot
3.6. Dot plot with connecting lines
3.7. Slightly shifting the position of the dots in each group
3.8. Changing the dot colors
3.9. Box plots
3.10. Violin plots
3.11. Confidence intervals (CI bars)
3.12. Connecting the group means with a line
3.13. Dots, box plots, violin plots, and confidence intervals
4. Tutorial 2 (using matplotlib)
4.1. Loading the libraries
4.2. Preparing a folder for saved figures
4.3. Initializing the data
4.4. Dot plot
4.5. Dot plot with connecting lines
4.6. Slightly shifting the position of the dots in each group
4.8. Confidence intervals (CI bars)
4.9. Violin plots
4.10. Plotting Before and After for two groups
4.11. Adding confidence intervals
4.12. Dots, box plots, violin plots, and confidence intervals
5. Saving in high resolution
5.1. Using plotnine
5.2. Using matplotlib

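1. Purpose

Visualize longitudinal data (before and after) with Python.

The main plot types covered are:

Dot plot
Box plot
Violin plot
Confidence interval (CI bar)
Histogram

The goal is to end up with a single figure that combines several of these elements.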

2. Preparation

2.1. Downloading open-visualizations

Longitudinal data can be visualized easily with open-visualizations.

Download open-visualizations from its remote repository on GitHub.

$ git clone https://github.com/jorvlan/open-visualizations.git

Running git clone creates a folder named open-visualizations. Of its contents, this article uses what is inside the Python folder.

$ ls -1 open-visualizations/
LICENSE
Python
R
README.md
install.R
requirements.txt
runtime.txt

The tutorials are divided into two parts.
Tutorial 1 uses plotnine, and Tutorial 2 uses matplotlib.

$ ls -1 open-visualizations/Python/
README.md
requirements
tutorial_1
tutorial_2

2.2. Installing the libraries

Install the libraries that open-visualizations requires.

$ pip3 install numpy
$ pip3 install pandas
$ pip3 install matplotlib
$ pip3 install scipy
$ pip3 install seaborn
$ pip3 install plotnine

If plotnine is not available in your environment (e.g., Anaconda), install it separately by following the plotnine installation instructions.

3. Tutorial 1 (using plotnine)

Open open-visualizations/Python/tutorial_1/repeated_measures_python.ipynb in jupyter-notebook.

$ jupyter-notebook open-visualizations/Python/tutorial_1/repeated_measures_python.ipynb

3.1. Loading the libraries

from plotnine import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

3.2. Preparing a folder for saved figures

The figures are saved to /repmes_tutorial_1_python/figs under the current working directory.

savefig = True

if savefig:
    
    #Load libraries
    import os
    from os.path import isdir
    
    #Get current working directory, but you can specify your own directory of course.
    cwd = os.getcwd()
    
    if  os.path.exists(cwd + "/repmes_tutorial_1_python/figs"):
        print("Directory already exists")

        #Assign the existing directory to a variable
        fig_dir = cwd + "/repmes_tutorial_1_python/figs"
        
    elif not os.path.exists(cwd + "/repmes_tutorial_1_python/figs"):
        print("Directory does not exist and will be created ......")
        os.makedirs(cwd + "/repmes_tutorial_1_python/figs")
        
        if isdir(cwd + "/repmes_tutorial_1_python/figs"):
            print('Directory was created successfully')
        
        #Assign the created directory to a variable
        fig_dir = cwd + "/repmes_tutorial_1_python/figs" 
   
    else:
        print("Something went wrong")
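
The directory check above can be written more compactly. The following is a minimal sketch, not the notebook's own code: the helper name `ensure_fig_dir` is hypothetical, and it relies on `pathlib.Path.mkdir` with `parents=True` and `exist_ok=True`, which creates any missing parent folders and does nothing if the folder already exists.

```python
from pathlib import Path


def ensure_fig_dir(base_dir, subdir="repmes_tutorial_1_python/figs"):
    """Create base_dir/subdir if needed and return it as a string.

    `ensure_fig_dir` is a hypothetical helper for illustration only.
    """
    fig_dir = Path(base_dir) / subdir
    # parents=True creates intermediate folders; exist_ok=True makes the
    # call safe to repeat when the folder is already there.
    fig_dir.mkdir(parents=True, exist_ok=True)
    return str(fig_dir)
```

In the notebook this could be used as `fig_dir = ensure_fig_dir(os.getcwd())`, after which the rest of the tutorial code can stay unchanged.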

3.3. Loading the data

This tutorial uses the iris dataset, which is widely used in machine learning.

The iris dataset contains the features and species labels of three species of the genus Iris: setosa, versicolor, and virginica.

# Directly from URL

url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/639388c2cbc2120a14dcf466e85730eb8be498bb/iris.csv"
iris_df = pd.read_csv(url)
iris_df

If you already have iris.csv on your own PC, you can load it as follows.

# Locally

path = "/Users/jordyvanlangen/Downloads/iris.csv"

iris_df = pd.read_csv(path)
iris_df

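There are 150 iris samples with four features each. The species column holds the species name.

3.4. Selecting the data

From the 150 iris samples, rows 0 to 49 are stored as before (the first group) and rows 50 to 99 as after (the second group).

Only one feature, the sepal length sepal_length, is used.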

# Create two variables and create a variable n 
before = iris_df['sepal_length'].iloc[0:50]
after = iris_df['sepal_length'].iloc[50:100]

To label the first group as 1 and the second group as 2, the label information is stored in x.

# Create variable x indicating the conditions
n = len(before)
x = np.array([1, 2])
x = np.repeat(x, [n, n])
x = pd.DataFrame(x, columns = ['x'])

The values of the first group are stored in y1_50_df and those of the second group in y51_100_df.

# Create variable y containing the values
y_array = np.array([before, after])

y1_50_df = pd.DataFrame(y_array[0], columns = ['y'])
y51_100_df = pd.DataFrame(y_array[1], columns = ['y'])

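Next, y1_50_df and y51_100_df are concatenated into a single column with pd.concat and stored in y.
At this point, each row of the label x corresponds to the same row of the value y.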

frames = [y1_50_df, y51_100_df]
y = pd.concat((frames), ignore_index = True)

Each sample in the first group and in the second group is given an id from 1 to 50. The dtype is set to category.

# Define the grouping variable 'id'  
s1 = pd.Series(range(1,51), dtype="category")
s2 = pd.Series(range(1,51), dtype="category")
s1_s2 = pd.concat([s1, s2], ignore_index = True)

id_df = pd.DataFrame(s1_s2, columns = ['id'])

The group label x, the value y, and the id are combined into a single dataframe.

# Merge the dataframes together 
d = pd.concat([y,x,id_df], axis=1)

pd.options.display.float_format = '{:.3f}'.format
print("The manipulated dataframe with 3 columns ")
print(d[['y', 'x', 'id']])
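
The same long-format dataframe can be built in a single step. The sketch below is an illustration under assumptions, not the notebook's code: `to_long` is a hypothetical helper, and the numbers are made-up stand-ins so that it runs without downloading iris.csv.

```python
import numpy as np
import pandas as pd


def to_long(before, after):
    """Stack two equally long samples into a long-format frame.

    Columns: y (value), x (1 = before, 2 = after), id (1..n, categorical).
    `to_long` is a hypothetical helper, not part of open-visualizations.
    """
    before = np.asarray(before, dtype=float)
    after = np.asarray(after, dtype=float)
    if before.shape != after.shape:
        raise ValueError("before and after must have the same length")
    n = len(before)
    ids = np.tile(np.arange(1, n + 1), 2)  # 1..n for before, 1..n for after
    return pd.DataFrame({
        "y": np.concatenate([before, after]),
        "x": np.repeat([1, 2], n),
        "id": pd.Categorical(ids),
    })


# Synthetic stand-in values, for illustration only
demo = to_long([5.1, 4.9, 4.7], [7.0, 6.4, 6.9])
print(demo)
```

With the iris data, `to_long(before, after)` would be expected to give the same three columns as d above.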

3.5. Dot plot

Plot the loaded data to visualize it.

# Size of the saved image
w = 6
h = 6
# Axis tick labels
labels = ['Before', 'After']

# Plot figure 
fig1 = (
    ggplot(d) 
    + geom_point(aes(x='x', y = 'y'), color = 'magenta') 
    + theme_classic()
    + scale_x_continuous(breaks=(1, 2), labels = labels, name = 'Conditions', limits = (0.5,2.5))
    + ylab('Value')
    + ggtitle('Figure 1: repeated measures individual datapoints')

)

# Save figure 
if savefig:
    fig1.save(fig_dir + "/figure1.png", width = w, height = h, verbose = False)
    
# Show figure
fig1

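The vertical axis is the feature sepal_length, and the horizontal axis is the category (first group or second group).
The figure is saved to /repmes_tutorial_1_python/figs/figure1.png.

3.6. Dot plot with connecting lines

Starting from the previous plot, connect the dots that share the same id in the first and second groups with a line.

The lines are drawn with geom_line.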

# Plot figure 
fig2 = (
    ggplot(d) 
    + geom_point(aes(x='x', y = 'y'), color = 'magenta') 
    + geom_line(aes(x='x', y = 'y', group='id'), color = 'lightgray')
    + theme_classic()
    + scale_x_continuous(breaks=(1, 2), labels = labels, name = 'Conditions', limits = (0.5,2.5))
    + ylab('Value')
    + ggtitle('Figure 2: repeated measures datapoints connected')
)

# Save figure
if savefig:
    fig2.save(fig_dir + "/figure2.png", width = w, height = h, verbose = False)
    
# Show figure
fig2

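3.7. Slightly shifting the position of the dots in each group

In the figures so far, the dots of each group lie on a single vertical line. Here the dots are shifted horizontally by a small random amount.

First, assign a random x-axis coordinate to each dot. The random numbers are drawn from a normal distribution with np.random.normal.
The first group uses a normal distribution with mean 1 and standard deviation 0.05, and the second group uses mean 2 and standard deviation 0.05.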

# Create two variables indicating jitter 
np.random.seed(321)
xj = np.random.normal(1, 0.05, len(d['x'].iloc[0:50]))

np.random.seed(321)
xj_2 = np.random.normal(2, 0.05, len(d['x'].iloc[0:50]))

These random x-axis coordinates are stored in jit_df. Its column is named xj.

# Create two dataframes of those variables an put them together 
xj_df = pd.DataFrame(xj, columns = ['xj'])
xj_2_df = pd.DataFrame(xj_2, columns = ['xj'])

frames_jit = [xj_df, xj_2_df]

# Use the concat function to concatenate them 
jit_df = pd.concat((frames_jit), ignore_index = True)

The group label x, the value y, the id, and the x-axis coordinate xj are combined and stored in d.

# Merge the jitter dataframes with the other existing dataframes 
d = pd.concat([y,x,id_df,jit_df], axis=1)

pd.options.display.float_format = '{:.3f}'.format
print("The manipulated dataframe with 4 columns including jitter ")
print(d[['y', 'x', 'id','xj']])
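
Two details of the jitter code are easy to miss. The second argument of np.random.normal is the standard deviation, not the variance. And because the seed is reset to 321 before each draw, both groups receive the same offsets, so xj_2 is simply xj shifted by 1. If independent offsets are preferred, one option is a single random generator, as in this sketch; `jittered_positions` is a hypothetical helper and `np.random.default_rng` is used here in place of the notebook's np.random.seed.

```python
import numpy as np


def jittered_positions(n, centers=(1, 2), sd=0.05, seed=321):
    """Return one array of jittered x positions per group center.

    `sd` is the standard deviation of the offsets (not the variance).
    A single generator is used, so each group gets different offsets.
    """
    rng = np.random.default_rng(seed)
    return [rng.normal(loc=c, scale=sd, size=n) for c in centers]


xj, xj_2 = jittered_positions(50)
print(xj[:3], xj_2[:3])
```

Whether shared or independent offsets look better is a matter of taste; shared offsets make every connecting line span exactly the same horizontal distance.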

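Visualize the data d. The only change from the previous figures is in the aes of geom_point (and geom_line): the x variable is switched from the two-valued group label x to the per-dot coordinate xj.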

# Plot figure 
fig3 = (
    ggplot(d) 
    + geom_point(aes(x='xj', y = 'y'), color = 'magenta', size = 2) 
    + geom_line(aes(x='xj', y = 'y', group='id'), color = 'lightgray')
    + theme_classic()
    + scale_x_continuous(breaks=(1, 2), labels = labels, name = 'Conditions', limits = (0.5,2.5))
    + ylab('Value')
    + ggtitle('Figure 3: repeated measures datapoints connected + jitter')
)

# Save figure
if savefig:
    fig3.save(fig_dir + "/figure3.png", width = w, height = h, verbose = False)
    
# Show figure
fig3

3.8. Changing the dot colors

Change the dots of the first group (Before) to blue (dodgerblue) and the dots of the second group (After) to orange (darkorange).

The color is set with the color argument of geom_point.

# Plot figure 
fig4 = (
    ggplot(d)
    + geom_point(d.iloc[:50,:],aes(x='xj', y = 'y'), color = 'dodgerblue', show_legend=False, alpha = .6, size = 3)
    + geom_point(d.iloc[50:100,:],aes(x='xj', y = 'y'), color = 'darkorange', show_legend=False, alpha = .6, size = 3)
    + geom_line(aes(x='xj', y = 'y', group = 'id'), color = 'lightgray')
    + theme_classic()
    + scale_x_continuous(breaks=(1, 2), labels = labels, name = 'Conditions', limits = (0.5,2.5))
    + ylab('Value')
    + ggtitle('Figure 4: repeated measures grouped by color')
)

# Save figure
if savefig:
    fig4.save(fig_dir + "/figure4.png", width = w, height = h, verbose = False)
    
# Show figure
fig4

3.9. Box plots

Add box plots.

Box plots are drawn with geom_boxplot.

# Plot figure 
fig5 = (
    ggplot()
    + geom_point(d.iloc[:50,:],aes(x='xj', y = 'y'), color = 'dodgerblue', show_legend=False, alpha = .6, size = 2)
    + geom_point(d.iloc[50:100,:],aes(x='xj', y = 'y'), color = 'darkorange', show_legend=False, alpha = .6, size = 2)
    + geom_line(d,aes(x='xj', y = 'y', group = 'id'), color = 'lightgray', alpha = .3)
    + geom_boxplot(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), fill = 'dodgerblue', show_legend=False, position = position_nudge(x=-.15),width = .05, alpha = .6)
    + geom_boxplot(d.iloc[50:100,:], aes(x='factor(x)', y = 'y'), fill = 'darkorange', show_legend=False, position = position_nudge(x=0.15),width = .05, alpha = .6)
    + theme_classic()
    + scale_x_discrete(labels=labels, name='Conditions')
    + ylab('Value')
    + ggtitle('Figure 5: repeated measures grouped by color + boxplots')
)

# Save figure
if savefig:
    fig5.save(fig_dir + "/figure5.png", width = w, height = h, verbose = False)
    
# Show figure
fig5

3.10. Violin plots

Add violin plots.

Violin plots are drawn with geom_violin.

# Plot figure 
fig6 = (
    ggplot()
    + geom_point(d.iloc[:50,:],aes(x='xj', y = 'y'), color = 'dodgerblue', show_legend=False, alpha = .6, size = 2)
    + geom_point(d.iloc[50:100,:],aes(x='xj', y = 'y'), color = 'darkorange', show_legend=False, alpha = .6, size = 2)
    + geom_line(d,aes(x='xj', y = 'y', group = 'id'), color = 'lightgray', alpha = .5)
    + geom_boxplot(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), fill = 'dodgerblue', show_legend=False, position = position_nudge(x=-.15),width = .05, alpha = .6)
    + geom_boxplot(d.iloc[50:100,:], aes(x='factor(x)', y = 'y'), fill = 'darkorange', show_legend=False, position = position_nudge(x=0.15),width = .05, alpha = .6)
    + geom_violin(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), color = 'white', fill = 'dodgerblue', show_legend=False, position = position_nudge(x=-.3), width = .2, alpha = .6)
    + geom_violin(d.iloc[50:100,:],aes(x='factor(x)', y = 'y'), color = 'white', fill = 'darkorange', show_legend=False, position = position_nudge(x=.3), width = .2, alpha = .6)
    + theme_classic()
    + scale_x_discrete(labels=labels, name='Conditions')
    + ylab('Value')
    + ggtitle('Figure 6: repeated measures grouped by color with box- and violinplots')

)

# Save figure
if savefig:
    fig6.save(fig_dir + "/figure6.png", width = w, height = h, verbose = False)
    
# Show figure
fig6

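3.11. Confidence intervals (CI bars)

Add the mean and the confidence interval (CI) of each group.

Import scipy.stats to compute the confidence intervals.

# Load scipy.stats 
import scipy.stats as st

For each group, compute the mean score_mean, median score_median, standard deviation score_std, standard error score_se, and confidence interval score_ci of the feature, and store them in summary_df.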

# Calculate some basic statistics 
score_mean_1 = np.mean(d['y'].iloc[0:50])
score_mean_2 = np.mean(d['y'].iloc[50:100])

score_median_1 = np.median(d['y'].iloc[0:50])
score_median_2 = np.median(d['y'].iloc[50:100])

score_std_1 = np.std(d['y'].iloc[0:50])
score_std_2 = np.std(d['y'].iloc[50:100])

score_se_1 = score_std_1/np.sqrt(50) #adjust your n
score_se_2 = score_std_2/np.sqrt(50) #adjust your n

score_ci_1 = st.t.interval(0.95, len(d['y'].iloc[0:50])-1, loc=score_mean_1, scale=st.sem(d['y'].iloc[0:50]))
score_ci_2 = st.t.interval(0.95, len(d['y'].iloc[50:100])-1, loc=score_mean_2, scale=st.sem(d['y'].iloc[50:100]))


# Create dataframe with these variables 
summary_df = pd.DataFrame({'group': ["x", "z"],
                           'N': [50, 50],
                           'score_mean': [score_mean_1, score_mean_2],
                           'score_median': [score_median_1, score_median_2],
                           'score_std': [score_std_1, score_std_2],
                           'score_se': [score_se_1, score_se_2],
                           'score_ci': [(score_ci_1[1]-score_ci_1[0])/2, (score_ci_2[1]-score_ci_2[0])/2]  # half-width, so mean ± score_ci spans the 95% CI
                          })
summary_df
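
The link between the standard error and the confidence interval can be checked with a short sketch. It is not taken from the notebook: `mean_ci` is a hypothetical helper and the sample values are made up. It assumes a t-based interval, as in the st.t.interval calls above, where the half-width is the critical t value times the standard error, so a 95% error bar runs from mean minus half-width to mean plus half-width.

```python
import numpy as np
import scipy.stats as st


def mean_ci(values, confidence=0.95):
    """Return (mean, half_width) of a t-based confidence interval."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    se = st.sem(values)  # sample SD (ddof=1) divided by sqrt(n)
    t_crit = st.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, t_crit * se


# Synthetic values, for illustration only
sample = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]
mean, half = mean_ci(sample)
low, high = st.t.interval(0.95, len(sample) - 1, loc=mean, scale=st.sem(sample))
print(mean - half, mean + half)  # same bounds as (low, high)
```

Note that st.sem uses the sample standard deviation (ddof=1), whereas np.std defaults to ddof=0, so score_se computed from np.std differs slightly from the standard error used inside st.t.interval.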

Visualize summary_df.

The means are drawn with geom_point and the confidence intervals with geom_errorbar.

# Plot figure 
fig7 = (
    ggplot()
    + geom_point(d.iloc[:50,:],aes(x='xj', y = 'y'), color = 'dodgerblue', show_legend=False, alpha = .6, size = 2)
    + geom_point(d.iloc[50:100,:],aes(x='xj', y = 'y'), color = 'darkorange', show_legend=False, alpha = .6, size = 2)
    + geom_line(d,aes(x='xj', y = 'y', group = 'id'), color = 'lightgray', alpha = .5)
    + geom_boxplot(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), fill = 'dodgerblue', show_legend=False, position = position_nudge(x=-.24),width = .05, alpha = .6)
    + geom_boxplot(d.iloc[50:100,:], aes(x='factor(x)', y = 'y'), fill = 'darkorange', show_legend=False, position = position_nudge(x=0.24),width = .05, alpha = .6)
    + geom_point(summary_df.iloc[:1,:],aes(x = 1, y = summary_df.iloc[0,2]), color = 'dodgerblue', show_legend = False, position = position_nudge(x = -.15), alpha = .6, size = 2)
    + geom_point(summary_df.iloc[1:2,:],aes(x = 2, y = summary_df.iloc[1,2]), color = 'darkorange', show_legend = False, position = position_nudge(x = .15), alpha = .6, size = 2)
    + geom_errorbar(summary_df.iloc[:1,:],aes(x = 1, y = summary_df.iloc[0,2], ymin = summary_df.iloc[0,2]-summary_df.iloc[0,6], 
                                              ymax = summary_df.iloc[0,2]+summary_df.iloc[0,6]), color = 'dodgerblue', show_legend = False, position = position_nudge(-.15), width = .05, alpha = .6)
    + geom_errorbar(summary_df.iloc[1:2,:], aes(x = 2, y = summary_df.iloc[1,2], ymin = summary_df.iloc[1,2]-summary_df.iloc[1,6],
                                            ymax = summary_df.iloc[1,2]+summary_df.iloc[1,6]), color = 'darkorange', show_legend = False, position = position_nudge(.15), width = .05, alpha = .6)
    + theme_classic()
    + scale_x_discrete(labels=labels, name='Conditions')
    + ylab('Value')
    + ggtitle('Figure 7: repeated measures grouped by color, boxplots + statistics')
)

# Save figure
if savefig:
    fig7.save(fig_dir + "/figure7.png", width = w, height = h, verbose = False)
    
# Show figure
fig7

3.12. Connecting the group means with a line

Connect the mean of the first group and the mean of the second group with a line.

The line is drawn with geom_line.

# Define the x-axis location of both means, 
# which in this case is: 1 & 'position_nudge(-.15)' = .85 and 2 & position_nudge(.15) = 2.15
x_tick_means = [0.85,2.15]

# Plot figure 
fig8 = (
    ggplot()
    + geom_point(d.iloc[:50,:],aes(x='xj', y = 'y'), color = 'dodgerblue', show_legend=False, alpha = .6, size = 2)
    + geom_point(d.iloc[50:100,:],aes(x='xj', y = 'y'), color = 'darkorange', show_legend=False, alpha = .6, size = 2)
    + geom_line(d,aes(x='xj', y = 'y', group = 'id'), color = 'lightgray', alpha = .5)
    + geom_line(summary_df,aes(x = x_tick_means, y = 'score_mean'), color = 'gray', size = 1)
    + geom_boxplot(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), fill = 'dodgerblue', show_legend=False, position = position_nudge(x=-.24),width = .05, alpha = .6)
    + geom_boxplot(d.iloc[50:100,:], aes(x='factor(x)', y = 'y'), fill = 'darkorange', show_legend=False, position = position_nudge(x=0.24),width = .05, alpha = .6)
    + geom_point(summary_df.iloc[:1,:],aes(x = 1, y = summary_df.iloc[0,2]), color = 'dodgerblue', show_legend = False, position = position_nudge(x = -.15), alpha = .6, size = 2)
    + geom_point(summary_df.iloc[1:2,:],aes(x = 2, y = summary_df.iloc[1,2]), color = 'darkorange', show_legend = False, position = position_nudge(x = .15), alpha = .6, size = 2)
    + geom_errorbar(summary_df.iloc[:1,:],aes(x = 1, y = summary_df.iloc[0,2], ymin = summary_df.iloc[0,2]-summary_df.iloc[0,6], 
                                              ymax = summary_df.iloc[0,2]+summary_df.iloc[0,6]), color = 'dodgerblue', show_legend = False, position = position_nudge(-.15), width = .05, alpha = .6)
    + geom_errorbar(summary_df.iloc[1:2,:], aes(x = 2, y = summary_df.iloc[1,2], ymin = summary_df.iloc[1,2]-summary_df.iloc[1,6],
                                            ymax = summary_df.iloc[1,2]+summary_df.iloc[1,6]), color = 'darkorange', show_legend = False, position = position_nudge(.15), width = .05, alpha = .6)
    + theme_classic()
    + scale_x_discrete(labels=labels, name='Conditions')
    + ylab('Value')
    + ggtitle('Figure 8: repeated measures grouped by color, boxplots + statistics')
)

# Save figure
if savefig:
    fig8.save(fig_dir + "/figure8.png", width = w, height = h, verbose = False)
    
# Show figure
fig8

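3.13. Dots, box plots, violin plots, and confidence intervals

Finally, combine the dots, box plots, violin plots, and confidence intervals from the previous steps into a single figure.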

# Plot figure 
fig9 = (
    ggplot()
    + geom_point(d.iloc[:50,:],aes(x='x', y = 'y'), color = 'dodgerblue', show_legend=False, alpha = .6, size = 2)
    + geom_point(d.iloc[50:100,:],aes(x='x', y = 'y'), color = 'darkorange', show_legend=False, alpha = .6, size = 2)
    + geom_line(d,aes(x='x', y = 'y', group = 'id'), color = 'lightgray', alpha = .5)
    + geom_line(summary_df,aes(x = x_tick_means, y = 'score_mean'), color = 'gray', size = 1)
    + geom_boxplot(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), color = 'gray', fill = 'dodgerblue', show_legend=False, position = position_nudge(x=-.24),width = .05, alpha = .2)
    + geom_boxplot(d.iloc[50:100,:], aes(x='factor(x)', y = 'y'), color = 'gray', fill = 'darkorange', show_legend=False, position = position_nudge(x=0.24),width = .05, alpha = .2)
    + geom_violin(d.iloc[:50,:],aes(x='factor(x)', y = 'y'), color = 'white', fill = 'dodgerblue', show_legend=False, position = position_nudge(x=0), width = .2, alpha = .2)
    + geom_violin(d.iloc[50:100,:],aes(x='factor(x)', y = 'y'), color = 'white', fill = 'darkorange', show_legend=False, position = position_nudge(x=0), width = .2, alpha = .2)
    + geom_point(summary_df.iloc[:1,:],aes(x = 1, y = summary_df.iloc[0,2]), color = 'dodgerblue', show_legend = False, position = position_nudge(x = -.15), alpha = .6, size = 2)
    + geom_point(summary_df.iloc[1:2,:],aes(x = 2, y = summary_df.iloc[1,2]), color = 'darkorange', show_legend = False, position = position_nudge(x = .15), alpha = .6, size = 2)
    + geom_errorbar(summary_df.iloc[:1,:],aes(x = 1, y = summary_df.iloc[0,2], ymin = summary_df.iloc[0,2]-summary_df.iloc[0,6], 
                                              ymax = summary_df.iloc[0,2]+summary_df.iloc[0,6]), color = 'gray', show_legend = False, position = position_nudge(-.15), width = .05, alpha = .6)
    + geom_errorbar(summary_df.iloc[1:2,:], aes(x = 2, y = summary_df.iloc[1,2], ymin = summary_df.iloc[1,2]-summary_df.iloc[1,6],
                                            ymax = summary_df.iloc[1,2]+summary_df.iloc[1,6]), color = 'gray', show_legend = False, position = position_nudge(.15), width = .05, alpha = .6)
    + theme_classic()
    + scale_x_discrete(labels=labels, name='Conditions')
    + ylab('Value')
    + ggtitle('Figure 9: repeated measures grouped by color, boxplots + statistics')
)

# Save figure
if savefig:
    fig9.save(fig_dir + "/figure9.png", width = w, height = h, verbose = False)
    
# Show figure
fig9

4. Tutorial 2 (using matplotlib)

Open open-visualizations/Python/tutorial_2/repeated_measures_python_2.ipynb in jupyter-notebook.

$ jupyter-notebook open-visualizations/Python/tutorial_2/repeated_measures_python_2.ipynb

4.1. Loading the libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

4.2. Preparing a folder for saved figures

The figures are saved to /repmes_tutorial_2_python/figs under the current working directory.

savefig = True

if savefig:
    
    #Load libraries
    import os
    from os.path import isdir
    
    #Get current working directory, but you can specify your own directory of course.
    cwd = os.getcwd()
    
    if  os.path.exists(cwd + "/repmes_tutorial_2_python/figs"):
        print("Directory already exists")

        #Assign the existing directory to a variable
        fig_dir = cwd + "/repmes_tutorial_2_python/figs"
        
    elif not os.path.exists(cwd + "/repmes_tutorial_2_python/figs"):
        print("Directory does not exist and will be created ......")
        os.makedirs(cwd + "/repmes_tutorial_2_python/figs")
        
        if isdir(cwd + "/repmes_tutorial_2_python/figs"):
            print('Directory was created successfully')
        
        #Assign the created directory to a variable
        fig_dir = cwd + "/repmes_tutorial_2_python/figs" 
   
    else:
        print("Something went wrong")

4.3. Initializing the data

Set the sample size of both the Before group and the After group to 30.

N=30

The Before data are random numbers drawn from a normal distribution with mean 0 and variance 1.
The After data are the Before data plus 1.

np.random.seed(3)
data = np.random.normal(size=(N,))

#Create the dataframe in a wide format with 'Before' and 'After ' as columns
df = pd.DataFrame({'Before': data,
                    'After': data+1})

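Set the x coordinate of the dots in each group and store it in df_jitter_1.

# Amount of jitter: 0 keeps all dots of a group at the same x position
# (assumed value; jitter_1 is not defined anywhere else in this article)
jitter_1 = 0
df_jitter_1 = pd.DataFrame(np.random.normal(loc=0, scale=jitter_1, size=df.values.shape), columns=df.columns)

#Update the dataframe with adding a number based on the length on the columns. Otherwise all datapoints would be at the same x-axis location.
df_jitter_1 += np.arange(len(df.columns))

The data of each group are shown below.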

#Inspect the created dataframe
pd.options.display.float_format = '{:.3f}'.format
print("The dataframe with 2 variables ")
print(df[['Before', 'After']])
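
The `+= np.arange(len(df.columns))` step used for the x coordinates relies on broadcasting: a one-dimensional array with one entry per column is added to every row. A minimal sketch with made-up zeros in place of the jitter values shows the effect.

```python
import numpy as np
import pandas as pd

# Two columns of zeros stand in for the (unjittered) offsets of each group
offsets = pd.DataFrame(np.zeros((3, 2)), columns=["Before", "After"])

# np.arange(2) is [0, 1]; it is broadcast across the columns, so every row
# of "Before" gets +0 and every row of "After" gets +1
offsets += np.arange(len(offsets.columns))
print(offsets)
```

This is why the Before dots end up around x = 0 and the After dots around x = 1 in the figures of this tutorial.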

4.4. Dot plot

Plot the data of each group.

The data to plot are passed to ax.plot.

# Output image settings
w = 6
h = 6
title_size = 20
xlab_size = 15
ylab_size = 20
# Axis tick labels
labels = ['Before', 'After']

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df:
    ax.plot(df_jitter_1[col], df[col], 'o', alpha=.6, zorder=2, ms=10, mew=1.5)
   
    #Additional settings
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels((labels), size= xlab_size)
    ax.set_xlim(-1, len(df.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 1: individual datapoints', size = title_size)
    sns.despine()

if savefig:
    plt.savefig(fig_dir + "/figure1.png")  # image size comes from figsize; savefig takes no width/height

4.5. Dot plot with connecting lines

Connect the dots that share the same index in the two groups with a line.

To connect the dots, add the following lines.

for idx in df.index:
    ax.plot(df_jitter_1.loc[idx,['Before','After']], df.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)

With these lines added, the code looks like this.

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df:
    ax.plot(df_jitter_1[col], df[col], 'o', alpha=.6, zorder=2, ms=10, mew=1.5)

for idx in df.index:
    ax.plot(df_jitter_1.loc[idx,['Before','After']], df.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)    
    
    #Additonal settings
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels((labels), size= xlab_size)
    ax.set_xlim(-1, len(df.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 2: individual datapoints with lines', size = title_size)
    sns.despine()

if savefig:
    plt.savefig(fig_dir + "/figure2.png")
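
As an alternative to looping over df.index, matplotlib can draw all connecting lines in one call: when ax.plot receives two-dimensional x and y arrays, it draws one line per column. The sketch below uses synthetic data and the non-interactive Agg backend so that it runs without a display; it illustrates the idea and is not the notebook's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(3)
before = rng.normal(size=5)          # synthetic stand-in data
after = before + 1

x = np.array([[0] * 5, [1] * 5])     # shape (2, n): x of each line's two ends
y = np.vstack([before, after])       # shape (2, n): y of each line's two ends

fig, ax = plt.subplots()
# With 2-D inputs, ax.plot draws one line per column
lines = ax.plot(x, y, color="gray", linestyle="--", alpha=0.3)
ax.plot(x.ravel(), y.ravel(), "o")   # the dots themselves
print(len(lines))
```

The loop version is easier to read when each line needs its own style; the array version is shorter when all lines look the same.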

4.6. Slightly shifting the position of the dots in each group

Shift the dots of each group by a small random amount so that they no longer lie on a single vertical line.
The coordinates of the dots are stored in df_jitter_2.

# Set the amount of jitter and create a dataframe containing the jittered x-axis values
jitter_2 = 0.05
np.random.seed(3)
df_jitter_2 = pd.DataFrame(np.random.normal(loc=0, scale=jitter_2, size=df.values.shape), columns=df.columns)

#Update the dataframe with adding a number based on the length on the columns. Otherwise all datapoints would be at the same x-axis location.
df_jitter_2 += np.arange(len(df.columns))

Replace df_jitter_1 with df_jitter_2 and plot again.

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df:
    ax.plot(df_jitter_2[col], df[col], 'o', alpha=.6, zorder=2, ms=10, mew=1.5)

for idx in df.index:
    ax.plot(df_jitter_2.loc[idx,['Before','After']], df.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)    
    
    #Additonal settings
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels((labels), size= xlab_size)
    ax.set_xlim(-1, len(df.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 3: individual datapoints with lines and jitter', size = title_size)
    sns.despine()

if savefig:
    plt.savefig(fig_dir + "/figure3.png")

4.8. Confidence intervals (CI bars)

Add confidence intervals (CI bars).

The confidence intervals are drawn with seaborn's pointplot function.

sns.pointplot(x='variable', y='value', ci=95, data=df_long, join=False, scale=1.5, color = 'black', capsize = .03) #palette = 'Paired'

With that in place, run the following code.

#Merge dataframe from wide to long for sns.pointplot
df_long = pd.melt(df,  value_vars=['Before','After'])

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df:
    ax.plot(df_jitter_2[col], df[col], 'o', alpha=.6, zorder=2, ms=10, mew=1.5)

for idx in df.index:
    ax.plot(df_jitter_2.loc[idx,['Before','After']], df.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)    
    sns.pointplot(x='variable', y='value', ci=95, data=df_long, join=False, scale=1.5, color = 'black', capsize = .03) #palette = 'Paired'    

    
#Additonal settings
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels((labels), size= xlab_size)
    ax.set_xlim(-1, len(df.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 4: individual datapoints with lines, jitter and statistics', size = title_size)
    sns.despine()
    
    
if savefig:
    plt.savefig(fig_dir + "/figure4.png")
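
The pd.melt call is what turns the wide Before/After table into the long format that seaborn expects, and it is where the column names 'variable' and 'value' come from. A small sketch with made-up numbers:

```python
import pandas as pd

wide = pd.DataFrame({"Before": [0.1, 0.2], "After": [1.1, 1.2]})

# value_vars lists the columns to stack; by default the former column names
# go to a column called "variable" and the numbers to a column called "value"
long = pd.melt(wide, value_vars=["Before", "After"])
print(long)
```

Each row of the long table is one measurement, which is why x='variable' and y='value' are passed to sns.pointplot.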

4.9. Violin plots

Add violin plots.

Violin plots are drawn with seaborn's violinplot function.

sns.violinplot(x='variable', y='value', data=df_long, hue = 'variable', split = True, inner = 'quartile', cut=1)

With that in place, run the following code.

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df:
    ax.plot(df_jitter_2[col], df[col], 'o', alpha=1, zorder=2, ms=10, mew=1.5)

for idx in df.index:
    ax.plot(df_jitter_2.loc[idx,['Before','After']], df.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)    
    sns.pointplot(x='variable', y='value', ci=95, data=df_long, join=False, scale=0.01, color = 'black', capsize = .03)    
    sns.violinplot(x='variable', y='value', data=df_long, hue = 'variable', split = True, inner = 'quartile', cut=1)

    
#Additonal settings
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels((labels), size= xlab_size)
    ax.set_xlim(-1, len(df.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 5: individual datapoints, lines, jitter, statistics, violins', size = title_size)
    ax.legend_.remove()
    sns.despine()
    plt.setp(ax.collections, alpha=.02) 
    
    
if savefig:
    plt.savefig(fig_dir + "/figure5.png")

4.10. Plotting Before and After for two groups

So far, the figures have shown Before and After for a single group.
Next, plot Before and After for two groups, for example a healthy group and a patient group.

#Create a dataframe do display 4 conditions
df_2 = pd.DataFrame({'Before': data,
                    'After': data+1,
                  'Before1': data,
                  'After1': data-1})

df_jitter_3 = pd.DataFrame(np.random.normal(loc=0, scale=jitter_2, size=df_2.values.shape), columns=df_2.columns)
df_jitter_3

#Do an additional step to create a jittered values for the 4 columns.. i.e., jitter values around condition 1 and 2 + jitter values for condition 3 and 4.
df_jitter_3 += np.arange(len(df_2.columns))
df_jitter_3

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df_2:
    ax.plot(df_jitter_3[col], df_2[col], 'o', alpha=.6, zorder=2, ms=10, mew=1.5)
    
for idx in df_2.index:
    ax.plot(df_jitter_3.loc[idx,['Before','After']], df_2.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)    
    ax.plot(df_jitter_3.loc[idx,['Before1','After1']], df_2.loc[idx,['Before1','After1']], color = 'gray', linewidth = 2, linestyle = '--', alpha =.3)

    #Additional settings (df_2 has four columns, so it sets the ticks and limits)
    ax.set_xticks(range(len(df_2.columns)))
    ax.set_xticklabels((['Before', 'After', 'Before', 'After']), size= xlab_size)
    ax.set_xlim(-1, len(df_2.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 6: individual datapoints with lines, jitter: 4 conditions', size = title_size)
    sns.despine()
    plt.setp(ax.collections, alpha=.02) 
    plt.setp(ax, xticks=[0, 1, 2, 3])

    
if savefig:
    plt.savefig(fig_dir + "/figure6.png")

4.11. Adding confidence intervals

Now add confidence intervals as well.

#Merge dataframe from wide to long for sns.pointplot
df_long_2 = pd.melt(df_2,  value_vars=['Before','After', 'Before1', 'After1'])

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df_2:
    ax.plot(df_jitter_3[col], df_2[col], 'o', alpha=.6, zorder=2, ms=10, mew=1.5)
    
for idx in df_2.index:
    ax.plot(df_jitter_3.loc[idx,['Before','After']], df_2.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '--',alpha = .3)    
    ax.plot(df_jitter_3.loc[idx,['Before1','After1']], df_2.loc[idx,['Before1','After1']], color = 'gray', linewidth = 2, linestyle = '--', alpha = .3)
    sns.pointplot(x='variable', y='value', ci=95, data=df_long_2, join=False, scale=1.5, color = 'black', capsize = .03)  
   

    #Additional settings (df_2 has four columns, so it sets the ticks and limits)
    ax.set_xticks(range(len(df_2.columns)))
    ax.set_xticklabels((['Before', 'After', 'Before', 'After']), size= xlab_size)
    ax.set_xlim(-1, len(df_2.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 7: individual datapoints with lines, jitter, statistics: 4 conditions', size = title_size)
    sns.despine() 
    plt.setp(ax, xticks=[0, 1, 2, 3])


if savefig:
    plt.savefig(fig_dir + "/figure7.png")

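4.12. Dots, box plots, violin plots, and confidence intervals

Finally, combine the dots, box plots, violin plots, and confidence intervals from the previous steps into a single figure.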

# Create empty figure and plot the individual datapoints
fig, ax = plt.subplots(figsize=(15,9))

for col in df:
    ax.plot(df_jitter_2[col], df[col], 'o', alpha=.8, zorder=2, ms=10, mew=1.5)
    
for idx in df.index:
    ax.plot(df_jitter_2.loc[idx,['Before','After']], df.loc[idx,['Before','After']], color = 'gray', linewidth = 2, linestyle = '-',alpha = .2)

for value in df_long_2:
    sns.violinplot(x='variable', y='value', data=df_long, hue = 'variable', split = True, inner = 'quartile', cut=1, dodge = True)
    sns.boxplot(x='variable', y='value', data=df_long, hue = 'variable', dodge = True, width = 0.2, fliersize = 2)
    
    #Additonal settings
    ax.set_xticks(range(len(df.columns)))
    ax.set_xticklabels((['Before', 'After']), size= xlab_size)
    ax.set_xlim(-1, len(df.columns))
    ax.set_ylabel('Value', size = ylab_size)
    ax.set_title('Figure 8: individual datapoints with lines, jitter, statistics, box- and violin', size = title_size)
    sns.despine() 
    ax.legend_.remove()
    plt.setp(ax.collections, alpha=.1)


if savefig:
    plt.savefig(fig_dir + "/figure8.png")

5. Saving in high resolution

To save a figure in high resolution, use the .tif extension and set dpi to 600.

5.1. Using plotnine

fig.save("/figure.tif", width = w, height = h, verbose = False, dpi = 600)

5.2. Using matplotlib

plt.savefig("/figure.tif", dpi = 600)
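
With matplotlib, the pixel size of the saved file is the figure size in inches multiplied by dpi, so the physical size is chosen with figsize when the figure is created and the resolution with dpi when it is saved. The sketch below illustrates this under assumptions: `save_high_res` is a hypothetical helper, the output goes to a temporary folder, and the Agg backend is used so that it runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt


def save_high_res(fig, path, dpi=600):
    """Save `fig` at the given resolution and return (width_px, height_px).

    The figure size in inches is set when the figure is created (figsize);
    the pixel size of the output is figsize multiplied by dpi.
    """
    fig.savefig(path, dpi=dpi)
    width_in, height_in = fig.get_size_inches()
    return int(round(width_in * dpi)), int(round(height_in * dpi))


fig, ax = plt.subplots(figsize=(2, 2))   # small figure to keep the file small
ax.plot([0, 1], [0, 1], "o-")
out_path = os.path.join(tempfile.mkdtemp(), "figure.tif")
size_px = save_high_res(fig, out_path)
print(size_px)
```

A 6 x 6 inch figure at dpi 600 would come out at 3600 x 3600 pixels by the same arithmetic, so file sizes grow quickly with figsize.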

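About the author: 斎藤勇哉

Affiliated with the Department of Radiology, Juntendo University Graduate School of Medicine.
Specializes in brain MRI image analysis. Research themes: (1) elucidating the mechanisms of neurodegenerative diseases, (2) developing medical artificial intelligence, (3) harmonizing multi-site data, (4) the effects of speed reading on the brain and on learning, and (5) improving marketing strategy through social media analysis.
Beyond the medical field, the author is skilled in natural language processing, web scraping, data analysis, and web application development, and supports research at companies and other universities.
Main languages: Python, Shell Script, MATLAB, HTML, CSS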
