
How to Master Visualization Charts (with Complete Python Source Code)

Lemonbit CSDN 2019-02-15

Translator | Lemon
Editor | Guo Rui (郭芮)

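This article summarizes the 50 most commonly used charts in Matplotlib and Seaborn. The list lets developers choose which visualization to draw with Python's Matplotlib and seaborn libraries.

An effective chart:

  • Conveys the right and necessary information without distorting the facts.

  • Is simple in design, so it takes little effort to understand.

  • Uses aesthetics to support the information rather than obscure it.

  • Is not overloaded with information.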



    # !pip install brewer2mpl
    import numpy as np
    import pandas as pd
    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings; warnings.filterwarnings(action='once')

    large = 22; med = 16; small = 12
    # 'axes.titlesize' appeared twice in the original dict (large, then med);
    # only the later value took effect, so that one is kept here.
    params = {'legend.fontsize': med,
              'figure.figsize': (16, 10),
              'axes.labelsize': med,
              'axes.titlesize': med,
              'xtick.labelsize': med,
              'ytick.labelsize': med,
              'figure.titlesize': large}
    plt.rcParams.update(params)
    # Matplotlib 3.6+ names this style 'seaborn-v0_8-whitegrid'
    plt.style.use('seaborn-whitegrid')
    sns.set_style("white")
    # Jupyter magic; remove the next line when running as a plain script
    %matplotlib inline

    # Versions used when the code was run
    print(mpl.__version__)  #> 3.0.2
    print(sns.__version__)  #> 0.9.0

Correlation


1. Scatter Plot

A scatter plot is the classic, fundamental chart for studying the relationship between two variables. If the data contains several groups, you may want to draw each group in a different color. In matplotlib, plt.scatter() does this conveniently.

    # Import dataset
    midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

    # Prepare Data
    # Create as many colors as there are unique midwest['category']
    categories = np.unique(midwest['category'])
    colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

    # Draw Plot for Each Category
    plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')
    for i, category in enumerate(categories):
        # color= takes one RGBA tuple per group; cmap= expects a colormap, not a color
        plt.scatter('area', 'poptotal',
                    data=midwest.loc[midwest.category==category, :],
                    s=20, color=colors[i], label=str(category))

    # Decorations
    plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
                  xlabel='Area', ylabel='Population')
    plt.xticks(fontsize=12); plt.yticks(fontsize=12)
    plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
    plt.legend(fontsize=12)
    plt.show()


2. Bubble Plot with Encircling

Sometimes you want to draw a boundary around a group of points to emphasize their importance. In this example, the records to highlight are selected from the data frame and passed to the encircle() function defined in the code below, which draws the boundary.

    from matplotlib import patches
    from scipy.spatial import ConvexHull
    import warnings; warnings.simplefilter('ignore')
    sns.set_style("white")

    # Step 1: Prepare Data
    midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

    # As many colors as there are unique midwest['category']
    categories = np.unique(midwest['category'])
    colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

    # Step 2: Draw Scatterplot with unique color for each category
    fig = plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')
    for i, category in enumerate(categories):
        # s='dot_size' reads the marker sizes from that column of `data`;
        # color= takes one RGBA tuple per group (cmap= expects a colormap)
        plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :],
                    s='dot_size', color=colors[i], label=str(category), edgecolors='black', linewidths=.5)

    # Step 3: Encircling
    # https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
    def encircle(x, y, ax=None, **kw):
        """Draw the convex hull of the points (x, y) as a polygon patch."""
        if ax is None: ax = plt.gca()
        p = np.c_[x, y]
        hull = ConvexHull(p)
        poly = plt.Polygon(p[hull.vertices, :], **kw)
        ax.add_patch(poly)

    # Select data to be encircled
    midwest_encircle_data = midwest.loc[midwest.state=='IN', :]

    # Draw polygon surrounding vertices (a translucent fill, then an outline)
    encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
    encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

    # Step 4: Decorations
    plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
                  xlabel='Area', ylabel='Population')
    plt.xticks(fontsize=12); plt.yticks(fontsize=12)
    plt.title("Bubble Plot with Encircling", fontsize=22)
    plt.legend(fontsize=12)
    plt.show()


3. Scatter Plot with Linear Regression Line of Best Fit

If you want to understand how two variables change with respect to each other, the line of best fit is the usual tool. The plot below shows how the best-fit line differs between groups in the data. To disable grouping and draw a single best-fit line for the whole dataset, remove the hue='cyl' argument from the sns.lmplot() call below.

    # Import Data
    df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
    df_select = df.loc[df.cyl.isin([4,8]), :]

    # Plot (robust=True requires the statsmodels package)
    sns.set_style("white")
    gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                         height=7, aspect=1.6, robust=True, palette='tab10',
                         scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

    # Decorations
    gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
    plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
    plt.show()



Alternatively, each group's best-fit line can be shown in its own column by setting the col=groupingcolumn argument in sns.lmplot(), as follows:

    # Import Data
    df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
    df_select = df.loc[df.cyl.isin([4,8]), :]

    # Each line in its own column
    sns.set_style("white")
    gridobj = sns.lmplot(x="displ", y="hwy",
                         data=df_select,
                         height=7,
                         robust=True,
                         palette='Set1',
                         col="cyl",
                         scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

    # Decorations
    gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
    plt.show()


4. Jittering with stripplot

Often several data points have exactly the same X and Y values, so they are drawn on top of each other and hidden. To avoid this, jitter the points slightly so that each one is visible. seaborn's stripplot() makes this convenient.

    # Import Data
    df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

    # Draw Stripplot (x and y passed by keyword, which also works on newer seaborn)
    fig, ax = plt.subplots(figsize=(16,10), dpi=80)
    sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

    # Decorations
    plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
    plt.show()
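To make the idea of jittering concrete, here is a minimal sketch that adds a small uniform random offset to each value so that coincident points separate. It uses only the standard library. The helper name `jitter_values`, the sample data, and the 0.25 width are illustrative assumptions; this is not a description of what seaborn does internally.

```python
import random

def jitter_values(values, width=0.25, seed=0):
    """Return a copy of values with uniform noise in [-width, width] added."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical data: several observations share the same value
cty = [18, 18, 18, 21, 21]
jittered = jitter_values(cty)

# Each point moves by at most `width`, so it stays close to its true position
max_shift = max(abs(j - c) for j, c in zip(jittered, cty))
```

The offset width is a trade-off: too small and points still overlap, too large and the plot misrepresents the underlying values.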


5. Counts Plot

Another way to handle overlapping points is to scale each marker by the number of observations at that location. The larger the marker, the more points are concentrated there.

    # Import Data
    df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
    # Count the observations at each (hwy, cty) pair
    df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')

    # Draw Stripplot
    # Note: an array-valued `size` worked with seaborn 0.9; newer releases may expect a scalar.
    fig, ax = plt.subplots(figsize=(16,10), dpi=80)
    sns.stripplot(x=df_counts.cty, y=df_counts.hwy, size=df_counts.counts*2, ax=ax)

    # Decorations
    plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
    plt.show()
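The counting step can be illustrated without pandas. The sketch below mirrors the `groupby(['hwy', 'cty']).size()` idea with `collections.Counter`; the sample points are hypothetical, and the factor of 2 simply copies the scaling used in the plotting call above.

```python
from collections import Counter

# Hypothetical (cty, hwy) pairs; repeated pairs would overlap in a scatter plot
points = [(18, 26), (18, 26), (21, 29), (18, 26), (21, 29), (15, 20)]

# Count how many observations fall on each distinct coordinate
counts = Counter(points)

# One marker per distinct point, sized in proportion to its count
xs = [x for x, _ in counts]
ys = [y for _, y in counts]
sizes = [n * 2 for n in counts.values()]
```

Six raw observations collapse into three markers, and the marker size carries the information that overplotting would otherwise hide.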


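6. Marginal Histogram

A marginal histogram adds histograms of the X and Y variables along the axes. It shows both the relationship between X and Y and the univariate distribution of each, and is often used in exploratory data analysis (EDA).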
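No code accompanies this chart here, so the following is a minimal sketch of one way to lay it out with a matplotlib GridSpec. The synthetic data, the 4x4 grid, and all variable names are assumptions made for illustration, not the original article's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated data (an assumption; no dataset is specified here)
rng = np.random.default_rng(42)
x = rng.normal(size=300)
y = 0.6 * x + rng.normal(scale=0.5, size=300)

fig = plt.figure(figsize=(8, 8))
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.3)

ax_main = fig.add_subplot(grid[:-1, :-1])   # central scatter plot
ax_right = fig.add_subplot(grid[:-1, -1])   # histogram of y, to the right
ax_bottom = fig.add_subplot(grid[-1, :-1])  # histogram of x, below

ax_main.scatter(x, y, s=15, alpha=0.6)
ax_bottom.hist(x, bins=30, color="steelblue")
ax_right.hist(y, bins=30, orientation="horizontal", color="steelblue")

ax_main.set(title="Scatter plot with marginal histograms", xlabel="x", ylabel="y")
```

In a notebook, call plt.show() (and drop the Agg backend line) to display the figure.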


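7. Marginal Boxplot

A marginal boxplot serves a similar purpose to the marginal histogram. The boxplots, however, help pinpoint the median and the 25th and 75th percentiles of X and Y.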
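As a rough sketch under the same kind of assumptions (synthetic data, a 4x4 GridSpec, hypothetical variable names), marginal boxplots can be placed beside a scatter plot like this. The percentiles that the boxes mark are also computed explicitly with NumPy.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=200)
y = 2.0 * x + rng.normal(size=200)

fig = plt.figure(figsize=(8, 8))
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.3)
ax_main = fig.add_subplot(grid[:-1, :-1])   # central scatter plot
ax_right = fig.add_subplot(grid[:-1, -1])   # boxplot of y
ax_bottom = fig.add_subplot(grid[-1, :-1])  # boxplot of x

ax_main.scatter(x, y, s=15, alpha=0.6)
ax_bottom.boxplot(x, vert=False)  # horizontal box summarizing x
ax_right.boxplot(y)               # vertical box summarizing y

# The statistics the boxes mark: 25th percentile, median, 75th percentile
x_q1, x_median, x_q3 = np.percentile(x, [25, 50, 75])
y_q1, y_median, y_q3 = np.percentile(y, [25, 50, 75])
```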


8. Correlogram


    # Import Dataset
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

    # Plot
    plt.figure(figsize=(12,10), dpi=80)
    # Correlate the numeric columns only, and compute the matrix once
    corr = df.select_dtypes(include='number').corr()
    sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True)

    # Decorations
    plt.title('Correlogram of mtcars', fontsize=22)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()


9. Pairwise Plot


    # Load Dataset
    df = sns.load_dataset('iris')

    # Plot (pairplot creates its own figure, so no plt.figure() call is needed)
    sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
    plt.show()


    # Load Dataset
    df = sns.load_dataset('iris')

    # Plot with a regression line in each panel
    sns.pairplot(df, kind="reg", hue="species")
    plt.show()


Deviation

10. Diverging Bars

If you want to see how items vary on a single metric, and to visualize the order and size of that variation, diverging bars are a good tool. They quickly differentiate the performance of groups in the data, and are intuitive enough to convey the point at once.

    # Prepare Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
    x = df['mpg']
    df['mpg_z'] = (x - x.mean())/x.std()  # standardize mileage to z-scores
    df['colors'] = ['red' if z < 0 else 'green' for z in df['mpg_z']]
    df.sort_values('mpg_z', inplace=True)
    df.reset_index(inplace=True)

    # Draw plot
    plt.figure(figsize=(14,10), dpi=80)
    plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)

    # Decorations
    plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
    plt.yticks(df.index, df.cars, fontsize=12)
    plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
    plt.grid(linestyle='--', alpha=0.5)
    plt.show()
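The data preparation in this example rests on standardizing the metric so that bars diverge around zero. A minimal standard-library sketch of that step follows; the mileage values are hypothetical, chosen only to show the arithmetic.

```python
from statistics import mean, stdev

# Hypothetical mileage values (assumed for illustration)
mpg = [21.0, 22.8, 18.7, 14.3, 30.4, 15.8]

mu = mean(mpg)
sigma = stdev(mpg)  # sample standard deviation, the same convention as pandas' default std()

# Standardize, then color by sign: below average is red, otherwise green
z_scores = [(v - mu) / sigma for v in mpg]
colors = ["red" if z < 0 else "green" for z in z_scores]
```

Because z-scores are centered on the mean, the sign alone tells whether an item is above or below average, which is what the two bar colors encode.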


11. Diverging Texts

Diverging texts are similar to diverging bars. Use them when you want to show the value of each item in the chart in a neat, presentable way.

    # Prepare Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
    x = df['mpg']
    df['mpg_z'] = (x - x.mean())/x.std()  # standardize mileage to z-scores
    df['colors'] = ['red' if z < 0 else 'green' for z in df['mpg_z']]
    df.sort_values('mpg_z', inplace=True)
    df.reset_index(inplace=True)

    # Draw plot
    plt.figure(figsize=(14,14), dpi=80)
    plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
    # Label each bar with its z-score, placed just outside the bar end
    for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
        t = plt.text(x, y, round(tex, 2), horizontalalignment='right' if x < 0 else 'left',
                     verticalalignment='center', fontdict={'color':'red' if x < 0 else 'green', 'size':14})

    # Decorations
    plt.yticks(df.index, df.cars, fontsize=12)
    plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})
    plt.grid(linestyle='--', alpha=0.5)
    plt.xlim(-2.5, 2.5)
    plt.show()


12. Diverging Dot Plot

A diverging dot plot is also similar to diverging bars. Compared with diverging bars, however, the absence of bars reduces the contrast and disparity between groups.

    # Prepare Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
    x = df['mpg']
    df['mpg_z'] = (x - x.mean())/x.std()  # standardize mileage to z-scores
    df['colors'] = ['red' if z < 0 else 'darkgreen' for z in df['mpg_z']]
    df.sort_values('mpg_z', inplace=True)
    df.reset_index(inplace=True)

    # Draw plot
    plt.figure(figsize=(14,16), dpi=80)
    plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
    # Write the rounded z-score inside each dot
    for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
        t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                     verticalalignment='center', fontdict={'color':'white'})

    # Decorations
    # Lighten borders
    plt.gca().spines["top"].set_alpha(.3)
    plt.gca().spines["bottom"].set_alpha(.3)
    plt.gca().spines["right"].set_alpha(.3)
    plt.gca().spines["left"].set_alpha(.3)

    plt.yticks(df.index, df.cars)
    plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})
    plt.xlabel('$Mileage$')
    plt.grid(linestyle='--', alpha=0.5)
    plt.xlim(-2.5, 2.5)
    plt.show()


13. Diverging Lollipop Chart with Markers



14. Area Chart


    import numpy as np
    import pandas as pd

    # Prepare Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv", parse_dates=['date']).head(100)
    x = np.arange(df.shape[0])
    # Month-over-month percentage change of the personal savings rate
    y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100

    # Plot (green above zero, red below)
    plt.figure(figsize=(16,10), dpi=80)
    plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)
    plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)

    # Annotate
    plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
                 bbox=dict(boxstyle='square', fc='firebrick'),
                 arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')

    # Decorations
    xtickvals = [str(m)[:3].upper()+"-"+str(y) for y,m in zip(df.date.dt.year, df.date.dt.month_name())]
    plt.gca().set_xticks(x[::6])
    plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})
    plt.ylim(-35,35)
    plt.xlim(1,100)
    plt.title("Month Economics Return %", fontsize=22)
    plt.ylabel('Monthly returns %')
    plt.grid(alpha=0.5)
    plt.show()


Ranking

15. Ordered Bar Chart



16. Lollipop Chart


    # Prepare Data
    df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
    # Mean city mileage per manufacturer (select the numeric column before averaging)
    df = df_raw.groupby('manufacturer')[['cty']].mean()
    df.sort_values('cty', inplace=True)
    df.reset_index(inplace=True)

    # Draw plot
    fig, ax = plt.subplots(figsize=(16,10), dpi=80)
    ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
    ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

    # Title, Label, Ticks and Ylim
    # (the plotted column is cty, so the title says City rather than Highway)
    ax.set_title('Lollipop Chart for City Mileage', fontdict={'size':22})
    ax.set_ylabel('Miles Per Gallon')
    ax.set_xticks(df.index)
    ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60, fontdict={'horizontalalignment': 'right', 'size':12})
    ax.set_ylim(0, 30)

    # Annotate
    for row in df.itertuples():
        ax.text(row.Index, row.cty+.5, s=round(row.cty, 2), horizontalalignment='center', verticalalignment='bottom', fontsize=14)

    plt.show()


17. Dot Plot


    # Prepare Data
    df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")
    # Mean city mileage per manufacturer (select the numeric column before averaging)
    df = df_raw.groupby('manufacturer')[['cty']].mean()
    df.sort_values('cty', inplace=True)
    df.reset_index(inplace=True)

    # Draw plot
    fig, ax = plt.subplots(figsize=(16,10), dpi=80)
    ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
    ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)

    # Title, Label, Ticks and Xlim
    # (the plotted column is cty, so the title says City rather than Highway)
    ax.set_title('Dot Plot for City Mileage', fontdict={'size':22})
    ax.set_xlabel('Miles Per Gallon')
    ax.set_yticks(df.index)
    ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})
    ax.set_xlim(10, 27)
    plt.show()


18. Slope Chart



19. Dumbbell Plot



Distribution

20. Histogram for Continuous Variable


    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Prepare data: one list of values per group
    x_var = 'displ'
    groupby_var = 'class'
    df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
    vals = [grp[x_var].values.tolist() for _, grp in df_agg]
    labels = [name for name, _ in df_agg]

    # Draw
    plt.figure(figsize=(16,9), dpi=80)
    colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
    n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors, label=labels)

    # Decoration
    plt.legend()
    plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
    plt.xlabel(x_var)
    plt.ylabel("Frequency")
    plt.ylim(0, 25)
    plt.xticks(ticks=bins[::3], labels=[round(b,1) for b in bins[::3]])
    plt.show()


21. Histogram for Categorical Variable


    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Prepare data: one list of values per group
    x_var = 'manufacturer'
    groupby_var = 'class'
    df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
    vals = [grp[x_var].values.tolist() for _, grp in df_agg]
    labels = [name for name, _ in df_agg]

    # Draw
    plt.figure(figsize=(16,9), dpi=80)
    colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
    n, bins, patches = plt.hist(vals, df[x_var].nunique(), stacked=True, density=False, color=colors, label=labels)

    # Decoration
    plt.legend()
    plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
    plt.xlabel(x_var)
    plt.ylabel("Frequency")
    plt.ylim(0, 40)
    # bins holds one more edge than there are categories, so drop the last edge
    plt.xticks(ticks=bins[:-1], labels=np.unique(df[x_var]).tolist(), rotation=90, horizontalalignment='left')
    plt.show()


22. Density Plot

A density plot is a common tool for visualizing the distribution of a continuous variable. Grouping the densities by a "response" variable lets you inspect the relationship between X and Y. The example below is for illustration only: it shows how the distribution of city mileage varies with the number of cylinders.

    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Draw Plot
    # Note: seaborn 0.11+ uses fill=True in place of shade=True
    plt.figure(figsize=(16,10), dpi=80)
    sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], shade=True, color="g", label="Cyl=4", alpha=.7)
    sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], shade=True, color="deeppink", label="Cyl=5", alpha=.7)
    sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], shade=True, color="dodgerblue", label="Cyl=6", alpha=.7)
    sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], shade=True, color="orange", label="Cyl=8", alpha=.7)

    # Decoration
    plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
    plt.legend()
    plt.show()


23. Density Curves with Histogram


    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Draw Plot
    # Note: distplot is deprecated from seaborn 0.11 onward (see histplot / displot)
    plt.figure(figsize=(13,10), dpi=80)
    sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
    sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
    sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
    plt.ylim(0, 0.35)

    # Decoration
    plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
    plt.legend()
    plt.show()


24. Joy Plot

A joy plot lets the density curves of different groups overlap, which is a good way to visualize the distribution of many groups relative to one another. It is pleasing to look at and conveys the right information clearly. It is easy to build with the joypy package, which is based on matplotlib (note: requires the joypy library).

    # !pip install joypy  (install note added by Python数据之道)
    import joypy

    # Import Data
    mpg = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Draw Plot (joyplot creates its own figure, so no plt.figure() call is needed)
    fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14,10))

    # Decoration
    plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
    plt.show()


25. Distributed Dot Plot



26. Box Plot



    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Draw Plot
    plt.figure(figsize=(13,10), dpi=80)
    sns.boxplot(x='class', y='hwy', data=df, notch=False)

    # Add N Obs inside boxplot (optional)
    def add_n_obs(df, group_col, y):
        """Write the number of observations just above each box's median line."""
        medians = df.groupby(group_col)[y].median()
        n_obs = df.groupby(group_col)[y].size()
        xticklabels = [t.get_text() for t in plt.gca().get_xticklabels()]
        for x, label in enumerate(xticklabels):
            # Look both values up by label so they follow the plotted order
            plt.text(x, medians[label]*1.01, "#obs : "+str(n_obs[label]), horizontalalignment='center', fontdict={'size':14}, color='white')

    add_n_obs(df, group_col='class', y='hwy')

    # Decoration
    plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
    plt.ylim(10, 40)
    plt.show()


27. Dot + Box Plot

A dot + box plot conveys the same information as a grouped box plot. In addition, the dots give a sense of how many data points lie in each group.

    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Draw Plot
    plt.figure(figsize=(13,10), dpi=80)
    sns.boxplot(x='class', y='hwy', data=df, hue='cyl')
    sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)

    # Light vertical separators between the classes
    for i in range(len(df['class'].unique())-1):
        plt.vlines(i+.5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)

    # Decoration
    plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
    plt.legend(title='Cylinders')
    plt.show()


28. Violin Plot


    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Draw Plot
    plt.figure(figsize=(13,10), dpi=80)
    sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')

    # Decoration
    plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)
    plt.show()


29. Population Pyramid


    # Read data
    df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")

    # Draw Plot: one horizontal bar series per group
    plt.figure(figsize=(13,10), dpi=80)
    group_col = 'Gender'
    order_of_bars = df.Stage.unique()[::-1]
    colors = [plt.cm.Spectral(i/float(len(df[group_col].unique())-1)) for i in range(len(df[group_col].unique()))]

    for c, group in zip(colors, df[group_col].unique()):
        sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group, :], order=order_of_bars, color=c, label=group)

    # Decorations
    plt.xlabel("$Users$")
    plt.ylabel("Stage of Purchase")
    plt.yticks(fontsize=12)
    plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
    plt.legend()
    plt.show()


30. Categorical Plots

The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.

    # Load Dataset
    titanic = sns.load_dataset("titanic")

    # Plot
    g = sns.catplot(x="alive", col="deck", col_wrap=4,
                    data=titanic[titanic.deck.notnull()],
                    kind="count", height=3.5, aspect=.8,
                    palette='tab20')

    # The title goes on the figure owned by the returned grid; 'sf' is the placeholder from the original
    g.fig.suptitle('sf')
    plt.show()


    # Load Dataset
    titanic = sns.load_dataset("titanic")

    # Plot
    sns.catplot(x="age", y="embark_town",
                hue="sex", col="class",
                data=titanic[titanic.embark_town.notnull()],
                orient="h", height=5, aspect=1, palette="tab10",
                kind="violin", dodge=True, cut=0, bw=.2)


Composition

31. Waffle Chart

A waffle chart can be created with the pywaffle package and shows the composition of groups within a larger population (note: requires the pywaffle library).

    #! pip install pywaffle
    # Reference: https://stackoverflow.com/questions/41400136/how-to-do-waffle-charts-in-python-square-piechart
    from pywaffle import Waffle

    # Import
    df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Prepare Data
    df = df_raw.groupby('class').size().reset_index(name='counts')
    n_categories = df.shape[0]
    colors = [plt.cm.inferno_r(i/float(n_categories)) for i in range(n_categories)]

    # Draw Plot and Decorate
    fig = plt.figure(
        FigureClass=Waffle,
        plots={
            '111': {
                'values': df['counts'],
                # index=False so n[0] is the class name and n[1] the count
                'labels': ["{0} ({1})".format(n[0], n[1]) for n in df[['class', 'counts']].itertuples(index=False)],
                'legend': {'loc': 'upper left', 'bbox_to_anchor': (1.05, 1), 'fontsize': 12},
                'title': {'label': '# Vehicles by Class', 'loc': 'center', 'fontsize':18}
            },
        },
        rows=7,
        colors=colors,
        figsize=(16, 9)
    )



32. Pie Chart


    # Import
    df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Prepare Data
    df = df_raw.groupby('class').size()

    # Make the plot with pandas
    df.plot(kind='pie', subplots=True, figsize=(8, 8))
    plt.title("Pie Chart of Vehicle Class - Bad")
    plt.ylabel("")
    plt.show()



33. Treemap

A treemap is similar to a pie chart, but does the job better without misleading about each group's contribution (note: requires the squarify library).

    # pip install squarify
    import squarify

    # Import Data
    df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Prepare Data
    df = df_raw.groupby('class').size().reset_index(name='counts')
    # Label each tile as "class\n (count)"; columns are accessed by name
    labels = df.apply(lambda x: str(x['class']) + "\n (" + str(x['counts']) + ")", axis=1)
    sizes = df['counts'].values.tolist()
    colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]

    # Draw Plot
    plt.figure(figsize=(12,8), dpi=80)
    squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)

    # Decorate
    plt.title('Treemap of Vehicle Class')
    plt.axis('off')
    plt.show()


34. Bar Chart

A bar chart is the classic way to visualize items by count or any given metric. The chart below uses a different color for each item, but you will usually want a single color for all items unless you are coloring by group. The color names are stored in all_colors in the code below; change the bar colors by setting the color argument of plt.bar().

    import random

    # Import Data
    df_raw = pd.read_csv("https://github.com/selva86/datasets/raw/master/mpg_ggplot2.csv")

    # Prepare Data
    df = df_raw.groupby('manufacturer').size().reset_index(name='counts')
    n = df['manufacturer'].nunique()  # one color per bar
    all_colors = list(plt.cm.colors.cnames.keys())
    random.seed(100)
    c = random.choices(all_colors, k=n)

    # Plot Bars
    plt.figure(figsize=(16,10), dpi=80)
    plt.bar(df['manufacturer'], df['counts'], color=c, width=.5)
    for i, val in enumerate(df['counts'].values):
        plt.text(i, val, float(val), horizontalalignment='center', verticalalignment='bottom', fontdict={'fontweight':500, 'size':12})

    # Decoration (rotate the tick labels that plt.bar already created)
    plt.xticks(rotation=60, horizontalalignment='right')
    plt.title("Number of Vehicles by Manufacturers", fontsize=22)
    plt.ylabel('# Vehicles')
    plt.ylim(0, 45)
    plt.show()


Change

35. Time Series Plot

A time series plot shows how a given metric changes over time. Here you can see how air passenger traffic changed between 1949 and 1960, the period the AirPassengers dataset covers.

    # Import Data
    df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

    # Draw Plot
    plt.figure(figsize=(16,10), dpi=80)
    plt.plot('date', 'traffic', data=df, color='tab:red')

    # Decoration: one tick per year (every 12th monthly observation)
    plt.ylim(50, 750)
    xtick_location = df.index.tolist()[::12]
    xtick_labels = [x[-4:] for x in df.date.tolist()[::12]]
    plt.xticks(ticks=xtick_location, labels=xtick_labels, rotation=0, fontsize=12, horizontalalignment='center', alpha=.7)
    plt.yticks(fontsize=12, alpha=.7)
    plt.title("Air Passengers Traffic (1949 - 1960)", fontsize=22)
    plt.grid(axis='both', alpha=.3)

    # Remove borders
    plt.gca().spines["top"].set_alpha(0.0)
    plt.gca().spines["bottom"].set_alpha(0.3)
    plt.gca().spines["right"].set_alpha(0.0)
    plt.gca().spines["left"].set_alpha(0.3)
    plt.show()


36. Time Series with Peaks and Troughs Annotated



37. Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plot



    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Import Data
    df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')

    # Draw Plot: ACF on the left, PACF on the right
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,6), dpi=80)
    plot_acf(df.traffic.tolist(), ax=ax1, lags=50)
    plot_pacf(df.traffic.tolist(), ax=ax2, lags=20)

    # Decorate
    # lighten the borders
    ax1.spines["top"].set_alpha(.3); ax2.spines["top"].set_alpha(.3)
    ax1.spines["bottom"].set_alpha(.3); ax2.spines["bottom"].set_alpha(.3)
    ax1.spines["right"].set_alpha(.3); ax2.spines["right"].set_alpha(.3)
    ax1.spines["left"].set_alpha(.3); ax2.spines["left"].set_alpha(.3)

    # font size of tick labels
    ax1.tick_params(axis='both', labelsize=12)
    ax2.tick_params(axis='both', labelsize=12)
    plt.show()


38. Cross Correlation Plot



39. Time Series Decomposition Plot


    from statsmodels.tsa.seasonal import seasonal_decompose
    from dateutil.parser import parse

    # Import Data
    df = pd.read_csv('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
    # Normalize every date to the first day of its month and use it as the index
    dates = pd.DatetimeIndex([parse(d).strftime('%Y-%m-01') for d in df['date']])
    df.set_index(dates, inplace=True)

    # Decompose
    result = seasonal_decompose(df['traffic'], model='multiplicative')

    # Plot
    plt.rcParams.update({'figure.figsize': (10,10)})
    result.plot().suptitle('Time Series Decomposition of Air Passengers')
    plt.show()


40. Multiple Time Series



41. Plotting with Different Scales Using a Secondary Y Axis



42. Time Series with Error Bands



43. Stacked Area Chart



44. Area Chart Unstacked


    # Import Data
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/economics.csv")

    # Prepare Data
    x = df['date'].values.tolist()
    y1 = df['psavert'].values.tolist()
    y2 = df['uempmed'].values.tolist()
    mycolors = ['tab:red', 'tab:blue', 'tab:green', 'tab:orange', 'tab:brown', 'tab:grey', 'tab:pink', 'tab:olive']
    columns = ['psavert', 'uempmed']

    # Draw Plot
    # Each label names the series actually drawn (the original had the two labels swapped)
    fig, ax = plt.subplots(1, 1, figsize=(16,9), dpi=80)
    ax.fill_between(x, y1=y1, y2=0, label=columns[0], alpha=0.5, color=mycolors[1], linewidth=2)
    ax.fill_between(x, y1=y2, y2=0, label=columns[1], alpha=0.5, color=mycolors[0], linewidth=2)

    # Decorations
    ax.set_title('Personal Savings Rate vs Median Duration of Unemployment', fontsize=18)
    ax.set(ylim=[0, 30])
    ax.legend(loc='best', fontsize=12)
    plt.xticks(x[::50], fontsize=10, horizontalalignment='center')
    plt.yticks(np.arange(2.5, 30.0, 2.5), fontsize=10)
    plt.xlim(-10, x[-1])

    # Draw Tick lines
    for y in np.arange(2.5, 30.0, 2.5):
        plt.hlines(y, xmin=0, xmax=len(x), colors='black', alpha=0.3, linestyles="--", lw=0.5)

    # Lighten borders
    plt.gca().spines["top"].set_alpha(0)
    plt.gca().spines["bottom"].set_alpha(.3)
    plt.gca().spines["right"].set_alpha(0)
    plt.gca().spines["left"].set_alpha(.3)
    plt.show()


45. Calendar Heat Map

Compared with a time series plot, a calendar heat map is an alternative, less preferred way to visualize time-based data. It can be visually appealing, but the numeric values are not obvious. It does, however, depict extreme values and holiday effects well (note: requires the calmap library).

    import matplotlib as mpl
    # pip install calmap  (install note added by Python数据之道)
    import calmap

    # Import Data
    df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv", parse_dates=['date'])
    df.set_index('date', inplace=True)

    # Plot (calendarplot creates its own figure from fig_kws)
    # .loc['2014', ...] selects the rows of year 2014 from the DatetimeIndex
    calmap.calendarplot(df.loc['2014', 'VIX.Close'], fig_kws={'figsize': (16,10)}, yearlabel_kws={'color':'black', 'fontsize':14}, subplot_kws={'title':'Yahoo Stock Prices'})
    plt.show()


46. Seasonal Plot



Groups

47. Dendrogram


    import scipy.cluster.hierarchy as shc

    # Import Data
    df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

    # Plot
    plt.figure(figsize=(16, 10), dpi=80)
    plt.title("USArrests Dendrograms", fontsize=22)
    # Ward linkage on the four numeric columns; leaves are labeled by state
    dend = shc.dendrogram(shc.linkage(df[['Murder', 'Assault', 'UrbanPop', 'Rape']], method='ward'), labels=df.State.values, color_threshold=100)
    plt.xticks(fontsize=12)
    plt.show()


48. Cluster Plot

A cluster plot can be used to demarcate points that belong to the same cluster. Below is a representative example that groups the US states into 5 clusters based on the USArrests dataset, using the 'Murder' and 'Assault' columns as the X and Y axes. Alternatively, the first two principal components can be used as the X and Y axes.

    from sklearn.cluster import AgglomerativeClustering
    from scipy.spatial import ConvexHull

    # Import Data
    df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')

    # Agglomerative Clustering
    # (ward linkage uses Euclidean distance by default, so no distance argument is passed)
    cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
    cluster.fit_predict(df[['Murder', 'Assault', 'UrbanPop', 'Rape']])

    # Plot
    plt.figure(figsize=(14, 10), dpi=80)
    plt.scatter(df['Murder'], df['Assault'], c=cluster.labels_, cmap='tab10')

    # Encircle
    def encircle(x, y, ax=None, **kw):
        """Draw the convex hull of the points (x, y) as a polygon patch."""
        if ax is None: ax = plt.gca()
        p = np.c_[x, y]
        hull = ConvexHull(p)
        poly = plt.Polygon(p[hull.vertices, :], **kw)
        ax.add_patch(poly)

    # Draw a convex-hull polygon around each of the 5 clusters
    hull_colors = ["gold", "tab:blue", "tab:red", "tab:green", "tab:orange"]
    for label, fc in enumerate(hull_colors):
        members = df.loc[cluster.labels_ == label]
        encircle(members['Murder'], members['Assault'], ec="k", fc=fc, alpha=0.2, linewidth=0)

    # Decorations
    plt.xlabel('Murder'); plt.xticks(fontsize=12)
    plt.ylabel('Assault'); plt.yticks(fontsize=12)
    plt.title('Agglomerative Clustering of USArrests (5 Groups)', fontsize=22)
    plt.show()
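For the alternative mentioned above, plotting on the first two principal components, one possible sketch uses a plain NumPy SVD. The data here is a synthetic stand-in for the four numeric columns, and the variable names are assumptions; this is not the original article's code.

```python
import numpy as np

# Synthetic stand-in for four numeric columns on different scales (an assumption)
rng = np.random.default_rng(1)
data = rng.normal(size=(50, 4)) * np.array([4.0, 80.0, 14.0, 9.0])

# Center each column, then take the SVD; the rows of vt are the principal directions
centered = data - data.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first two principal components
pcs = centered @ vt[:2].T
pc1, pc2 = pcs[:, 0], pcs[:, 1]  # candidate X and Y axes for the scatter plot
```

The arrays pc1 and pc2 could then replace the two raw columns in the scatter and encircle calls. When the columns are on very different scales, as here, it is common to standardize them before the decomposition so that no single column dominates.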


49. Andrews Curve


    from pandas.plotting import andrews_curves

    # Import
    df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")
    df.drop(['cars', 'carname'], axis=1, inplace=True)  # keep numeric columns only

    # Plot: one curve per row, colored by number of cylinders
    plt.figure(figsize=(12,9), dpi=80)
    andrews_curves(df, 'cyl', colormap='Set1')

    # Lighten borders
    plt.gca().spines["top"].set_alpha(0)
    plt.gca().spines["bottom"].set_alpha(.3)
    plt.gca().spines["right"].set_alpha(0)
    plt.gca().spines["left"].set_alpha(.3)

    plt.title('Andrews Curves of mtcars', fontsize=22)
    plt.xlim(-3,3)
    plt.grid(alpha=0.3)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()


50. Parallel Coordinates


    from pandas.plotting import parallel_coordinates

    # Import Data
    df_final = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/diamonds_filter.csv")

    # Plot: one line per row, colored by cut
    plt.figure(figsize=(12,9), dpi=80)
    parallel_coordinates(df_final, 'cut', colormap='Dark2')

    # Lighten borders
    plt.gca().spines["top"].set_alpha(0)
    plt.gca().spines["bottom"].set_alpha(.3)
    plt.gca().spines["right"].set_alpha(0)
    plt.gca().spines["left"].set_alpha(.3)

    plt.title('Parallel Coordinates of Diamonds', fontsize=22)
    plt.grid(alpha=0.3)
    plt.xticks(fontsize=12)
    plt.yticks(fontsize=12)
    plt.show()


Original article: https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/





