ggplot2 version: 50 commonly used matplotlib visualizations

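I once read a tutorial that collected Python matplotlib code for 50 plots in 7 categories. I am not comfortable with that style of plotting, so I tried redrawing the plots with R ggplot2. The goal of this article is to reproduce the figures from the original tutorial as exactly as possible, as a way of getting familiar with the many tuning options in ggplot2.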

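Notes:

  1. This article only tries to reproduce the figures as closely as possible; it does not endorse them as clear or well designed, and some of them are in fact quite poor. The point is that forcing ggplot2 to draw plots in these specific forms shows which things ggplot2 does well and which it cannot do, and so helps in understanding ggplot2 better.

  2. This article was written in Rmd format in RStudio, and all figures are shown as rendered in Rmd. The same code may render differently in Rgui, in an R graphics window, or when saved to a file, especially the size and position of text, lines, and legends, so the related parameters may need to be adjusted again.

  3. The data come from GitHub or from datasets bundled with R or python. For datasets not bundled with R, the URL or a description is kept so you can find and download them yourself.

  4. Plots are drawn mainly with ggplot2; some features rely on ggplot2 extension packages. Where a plot is drawn with more than one method, the most recommended one is the method with more adjustable parameters or fewer package dependencies.

  5. All code in this article is original; please credit the source when reposting.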
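
As a minimal sketch of note 2, fixing the physical size and resolution when saving makes the output independent of the preview pane. The demo plot, the output path, and the chosen dimensions below are assumptions for illustration, not values used elsewhere in this article.

```r
library(ggplot2)

# Hypothetical demo plot; any ggplot object is saved the same way
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  theme_bw()

# Assumed output path and dimensions; an explicit width, height and dpi
# keep text and line sizes consistent between runs
out_file <- file.path(tempdir(), "demo_plot.png")
ggsave(out_file, plot = p, width = 16, height = 10, units = "cm", dpi = 300)
```

Inside an Rmd document, the knitr chunk options fig.width and fig.height play a similar role for the rendered figure.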

Original matplotlib tutorial:

Top 50 matplotlib Visualizations – The Master Plots (with full python code), https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/

00. Environment setup

Set the working directory and load the packages used throughout.

setwd("D:/#R/learn_ggplot2")

library(ggplot2)
library(data.table)
library(magrittr)

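>>> Correlation -----

Plots in the correlation group visualize the relationship between two or more variables, that is, how one variable changes with respect to another.

01. Scatter plot

A scatter plot is used to study the relationship between two variables. If the data contain several groups, you may want to draw each group in a different color.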

# Data come from GitHub and differ from ggplot2::midwest
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")
df <- fread("data1_midwest.csv")  # downloaded from the URL above and saved locally
df_color <- c("#1578bc", "#2477b4", "#fb810b", "#299f32", "#d5282c", "#d32b29",
              "#9066bf", "#885647", "#ea75c1", "#e376c3", "#7c7e7e", "#c1bd1f", 
              "#19c0ca", "#18becc")

ggplot(df, aes(x = area, y = poptotal, color = category))+
  geom_point(size = 1.3)+
  scale_x_continuous(breaks = seq(0, 0.1, by = 0.02), limits = c(0, 0.1), expand = c(0,0,0,0))+
  scale_y_continuous(breaks = seq(0, 90000, by = 10000), limits = c(0, 90000), expand = c(0,0))+
  scale_color_manual(values = df_color)+
  labs(x = "Area", y = "Population", color = NULL, title = "Scatterplot of Midwest Area vs Population")+
  theme_bw()+
  theme(aspect.ratio = 1/1.7,
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        legend.text = element_text(size = 7),
        legend.spacing.x = unit(0, "cm"),
        legend.key.height = unit(0.35, "cm"),
        legend.key.width = unit(0.5, "cm"),
        legend.position = c(0.94, 0.72), # figure proportions differ in R Markdown; adjust as needed
        legend.background = element_blank(), # background removed so it looks better in R Markdown
        plot.title = element_text(hjust = 0.5)) 

02. Bubble plot with Encircling

An outline can be drawn around a group of points to emphasize their importance.

# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")
df <- fread("data1_midwest.csv")
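# Make category a factor with a fixed (sorted) level order
df_class <- unique(df$category) %>% .[order(.)]
df[, category:=factor(category, levels = df_class)]
# Sizes for the points and the legend keys: min-max (0-1) scaling, then a multiplier and an offset set the relative size
# I do not know which algorithm matplotlib uses internally, so the per-category maximum is used here
circsize <- df[,max(dot_size), by=category] %>% setorder(category) %>% 
  .[, V1 := (V1 - min(.$V1))/(max(.$V1) - min(.$V1))*4+2] %>% 
  set_colnames(c("category", "size")) 
# Outline around the points with state == "IN": convex hull, with the first vertex repeated to close the path
stateIN <- df[state == "IN", c("area", "poptotal")] %>% 
  .[c(chull(.), chull(.)[1]),]
# Colors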
df_color <- c("#1578bc", "#2477b4", "#fb810b", "#299f32", "#d5282c", "#d32b29",
              "#9066bf", "#885647", "#ea75c1", "#e376c3", "#7c7e7e", "#c1bd1f", 
              "#19c0ca", "#18becc")

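# Note: for fillable shapes such as shape = 21, stroke sets the width of the outline (the part drawn with color)
# To match the original figure, a polygon layer draws the background fill and a path layer draws the outline; the polygon has an outline of its own, but it lies beneath the points
# Point sizes scale somewhat differently from python, possibly because of the linear transformation used here
# ggalt::geom_encircle can also draw an enclosing line, but that line is smoothed rather than straight segments between vertices
# Line widths use linewidth (ggplot2 >= 3.4.0), consistent with element_line(linewidth = ...) used later
ggplot()+
  geom_polygon(data = stateIN, mapping = aes(x = area, y = poptotal, group = 1), fill = "#fffae5", color = "#ba413f", linewidth = 0.4, alpha = 1)+
  geom_point(data = df, mapping = aes(x = area, y = poptotal, fill = category, size = dot_size), color = "black", shape = 21, stroke = 0.3)+
  geom_path(data = stateIN, mapping = aes(x = area, y = poptotal, group = 1), color = "#ba413f", linewidth = 0.4)+  # the line layer sits above the point layer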
  scale_size(range = c(min(circsize$size), max(circsize$size)))+
  scale_fill_manual(values = df_color)+
  scale_x_continuous(breaks = seq(0, 0.1, by = 0.02), limits = c(0, 0.1), expand = c(0,0,0,0.005))+
  scale_y_continuous(breaks = seq(0, 90000, by = 10000), limits = c(0, 90000), expand = c(0,0))+
  guides(fill = guide_legend(override.aes = list(size = circsize$size)), size = "none")+
  labs(x = "Area", y = "Population", fill = NULL, size = NULL, title = "Bubble plot with Encircling")+
  theme_bw()+
  theme(aspect.ratio = 1/1.7,
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5),
        legend.text = element_text(size = 7),
        legend.spacing.x = unit(0, "cm"),
        legend.key.height = unit(0.35, "cm"),
        legend.key.width = unit(0.5, "cm"),
        legend.position = c(0.95, 0.71), # figure proportions differ in R Markdown; adjust as needed
        legend.background = element_blank()) # background removed so it looks better in R Markdown
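
The encircling step relies on chull() returning row indices of the convex hull vertices, with the first index appended again so the outline closes. A minimal sketch on made-up toy points (not the midwest data) shows the idea:

```r
# Toy data, made up for illustration: four corners of a square plus one interior point
pts <- data.frame(x = c(0, 1, 1, 0, 0.5),
                  y = c(0, 0, 1, 1, 0.5))

# Row indices of the points on the convex hull; the interior point is left out
hull_idx <- chull(pts$x, pts$y)

# Repeat the first vertex at the end so a path drawn through the rows is closed
closed <- pts[c(hull_idx, hull_idx[1]), ]

nrow(closed)
```

Passing a closed data frame like this to geom_path() draws straight segments between hull vertices, which is the effect used in the plot above.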

03. Scatter plot with linear regression line of best fit

To see how two variables change with respect to each other, add a line of best fit.

Plot 1: categories shown in different colors

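df <- as.data.table(ggplot2::mpg)  # work on a copy rather than converting the package dataset by reference
df_select <- df[cyl %in% c(4,8),] %>% .[,cyl:=as.factor(cyl)]
# Colors
cyl_color <- c("#1f77b4", "#ff983e")

# geom_smooth can only extend the fit over the data range or the full plot range (fullrange), and the se band is filled only in the vertical direction, which differs from python
ggplot(df_select, aes(displ, hwy, color = cyl, fill = cyl))+        
  geom_point(color = "black", shape = 21, size = 2.3, stroke = 0.2)+
  geom_smooth(formula = y ~ x, method = "lm", fullrange = TRUE, alpha = 0.2, show.legend = FALSE)+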
  scale_x_continuous(breaks = 1:7, limits = c(0.5, 7.5), expand = c(0,0,0,0.005))+
  scale_y_continuous(breaks = seq(0, 45, by = 5), limits = c(0, 45), expand = c(0,0))+
  scale_color_manual(values = cyl_color)+
  scale_fill_manual(values = cyl_color)+
  labs(title = "Scatterplot with line of best fit grouped by number of cylinders")+
  theme_bw()+
  theme(aspect.ratio = 1/1.6,     # height:width = 1:1.6 here, the inverse of seaborn's aspect convention
        axis.line = element_line(linewidth = 0.3), 
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        panel.border = element_blank(),
        plot.title = element_text(hjust = 0.5)) 
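
To see what the fitted line and its band represent, here is a minimal sketch that fits the same kind of model by hand for one group with lm() and predict(). The x grid is an assumption chosen to resemble the axis limits above, and this is meant as an approximation of what geom_smooth(method = "lm") displays rather than a copy of its internals.

```r
library(ggplot2)  # only for the mpg dataset

# One group only (cyl == 4), fitted with ordinary least squares
d4  <- subset(mpg, cyl == 4)
fit <- lm(hwy ~ displ, data = d4)

# Assumed grid spanning the whole x axis, similar in spirit to fullrange = TRUE
grid <- data.frame(displ = seq(0.5, 7.5, by = 0.5))

# 95% confidence interval of the fitted mean at each grid point;
# the interval is measured vertically (in hwy units) at each displ value
band <- cbind(grid, predict(fit, newdata = grid, interval = "confidence", level = 0.95))
head(band)  # columns: displ, fit, lwr, upr
```

The band is narrowest near the mean of displ within the group and widens toward the ends of the grid, which matches the shape of the shaded region in the plot.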

Plot 2: one panel per category

df <- as.data.table(ggplot2::mpg)  # work on a copy rather than converting the package dataset by reference
df_select <- df[cyl %in% c(4,8),] %>% .[,cyl:=as.factor(cyl)]

# Note: this still differs from the seaborn figure in the axes. The seaborn version is effectively two plots placed side by side, which is not hard to do; here facets are used instead
ggplot(df_select, aes(displ, hwy))+        
  geom_point(color = "black", fill = "#1f77b4", shape = 21, size = 2.3, stroke = 0.2)+
  geom_smooth(formula = y ~ x, method = "lm", fill = "#1f77b4", color = "#1f77b4", fullrange = TRUE, alpha = 0.2, show.legend = FALSE)+
  scale_x_continuous(breaks = 1:7, limits = c(0.5, 7.5), expand = c(0,0,0,0.005))+
  scale_y_continuous(breaks = seq(0, 45, by = 5), limits = c(0, 45), expand = c(0,0))+
  facet_wrap(vars(cyl), scales = "free", labeller = labeller(.default = function(x)  paste0("cyl = ", x)))+
  # facet_wrap(vars(cyl), scales = "free", labeller = function(variable, value) paste0(variable, "=", value))+ # deprecated in newer ggplot2 versions, but gives the same result as the line above
  theme_bw()+
  theme(aspect.ratio = 1/1,    
        axis.line = element_line(linewidth = 0.3), 
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        panel.border = element_blank(),
        strip.background = element_rect(fill = NA, colour = NA),
        plot.title = element_text(hjust = 0.5)) 

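04. Jittering with stripplot

Often several data points have exactly the same X and Y values, so they are drawn on top of one another and hidden. To avoid this, jitter the points slightly so that each one can be seen.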
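
A minimal sketch of the idea in base R: count how many rows of mpg share their (cty, hwy) pair with an earlier row, then add small uniform noise to x only. The seed and the 0.25 half-width are assumptions for illustration; the noise used by geom_jitter may be generated somewhat differently.

```r
library(ggplot2)  # only for the mpg dataset

# Rows whose (cty, hwy) pair already appeared earlier, i.e. points hidden by overplotting
n_overlap <- sum(duplicated(mpg[, c("cty", "hwy")]))

set.seed(1)  # arbitrary seed, for reproducibility
x_jit <- mpg$cty + runif(nrow(mpg), min = -0.25, max = 0.25)  # horizontal noise only
y_jit <- mpg$hwy                                              # y values left untouched

# After jittering, exact coincidences should (almost surely) be far fewer
n_overlap_after <- sum(duplicated(data.frame(x_jit, y_jit)))
```

Keeping the vertical noise at zero preserves the true hwy values, so only the horizontal position within each cty column is perturbed.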

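df <- as.data.table(ggplot2::mpg)[, cty := factor(cty)]
df_color <- c( "#ff869a", "#9fb428",  "#3fcc84", "#61b5f2","#b19fef","#fa82af")

# Note: the x axis is a factor here, so the spacing between values is not proportional to the numbers
# Note on the jitter amount: python only jitters horizontally; in R both width and height can be set, and neither defaults to 0, so set them explicitly
# For a discrete variable, scale_fill_manual() needs one color per level; a vector passed positionally to scale_fill_discrete() is not used as the palette, so the 6 colors are interpolated to the number of levels
# For a continuous variable, use scale_*_gradient / scale_*_gradient2 / scale_*_gradientn, which map two colors (low, high), three colors (low, mid, high), or any number of colors onto a gradient; gradient2 needs a user-defined midpoint
ggplot(df, aes(cty, hwy, fill = cty))+        
  geom_jitter(width = 0.25, height = 0, color = "black", shape = 21, size = 2.3, stroke = 0.2)+
  # scale_x_continuous(breaks = seq(9, 35, by = 1), limits = c(9, 35))+
  scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
  scale_fill_manual(values = colorRampPalette(df_color)(nlevels(df$cty)))+ 
  # scale_fill_gradient2(low = "#ff869a", mid = "#3fcc84", high = "#fa82af", midpoint = 32)+ 
  # scale_fill_gradientn(colours = colorn)+ 
  labs(title = "Use jittered plots to avoid overlapping of points")+
  guides(fill = "none")+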
  theme_bw()+
  theme(aspect.ratio = 1/1.5,    # height:width = 1:1.5
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5)) 
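
The cty factor has more levels than the six colors listed in df_color. One way to stretch a short list of colors over many levels, sketched here under the assumption that a smooth blend between the listed colors is acceptable, is grDevices::colorRampPalette():

```r
library(ggplot2)  # only for the mpg dataset

# Anchor colors copied from the example above
df_color <- c("#ff869a", "#9fb428", "#3fcc84", "#61b5f2", "#b19fef", "#fa82af")

# Number of distinct cty values, i.e. the number of factor levels to color
n_levels <- nlevels(factor(mpg$cty))

# colorRampPalette() returns a function; calling it with n gives n interpolated hex colors
pal <- grDevices::colorRampPalette(df_color)(n_levels)

length(pal)
```

A vector like pal can then be supplied as scale_fill_manual(values = pal), which requires at least one color per level.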

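05. Counts Plot

Another way to avoid overlapping points is to scale each point by the number of observations at that location, so a larger point means more observations concentrated there.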

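# Count the observations at each (cty, hwy) location, then make cty a factor
df <- as.data.table(ggplot2::mpg)[, .(size = .N), by = .(cty, hwy)][, cty := factor(cty)]
df_color <- c( "#ff869a", "#9fb428",  "#3fcc84", "#61b5f2","#b19fef","#fa82af")

# Note: the x axis is a factor here, so the spacing between values is not proportional to the numbers
# The 6 colors are interpolated to one color per level; a vector passed positionally to scale_fill_discrete() is not used as the palette
ggplot(df, aes(cty, hwy, size = size, fill = cty))+        
  geom_point(color = "black", shape = 21, stroke = 0.2)+
  scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
  scale_fill_manual(values = colorRampPalette(df_color)(nlevels(df$cty)))+ 
  labs(title = "Counts Plot - Size of circle is bigger as more points overlap")+
  guides(fill = "none", size = "none")+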
  theme_bw()+
  theme(aspect.ratio = 1/2,    # height:width = 1:2
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5)) 
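
The size variable is simply the number of rows at each (cty, hwy) location. A minimal sketch of that aggregation with data.table's .N, cross-checked against base R's table(); the object names are illustrative only:

```r
library(data.table)

dt <- as.data.table(ggplot2::mpg)

# .N is the number of rows in each (cty, hwy) group
counts <- dt[, .(size = .N), by = .(cty, hwy)]

# Cross-check with base R: tabulate both variables and keep the non-empty cells
base_counts <- as.data.frame(table(cty = dt$cty, hwy = dt$hwy))
base_counts <- base_counts[base_counts$Freq > 0, ]
```

Both approaches give one row per occupied location, and the counts add up to the number of rows in mpg.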

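06. Marginal Histogram

A marginal histogram adds histograms of the X and Y variables along the axes of a scatter plot. It shows the relationship between X and Y together with the univariate distribution of each, and is often used in exploratory data analysis.

ggpubr::ggscatterhist can also do this, but it draws the main plot and the marginal plots in one call and exposes fewer adjustable parameters.

Only ggExtra::ggMarginal is used here.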

library(ggExtra)
df <- ggplot2::mpg

# Draw the scatter plot and the marginal distributions; the marginal plots can only be placed at the top and right
p <- ggplot(df, aes(x = displ, y = hwy, size = cty, fill = manufacturer)) +
  geom_point(color = "black", shape = 21, stroke = 0.2)+
  scale_size(range = c(1,4))+
  scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
  guides(size = "none", fill = "none")+
  labs(title = "Scatterplot with Histograms \n displ vs hwy")+
  theme_bw()+
  theme(aspect.ratio = 1/2,    # height:width = 1:2
        axis.ticks = element_blank(), 
        panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5)) 
ggMarginal(p, type = "histogram", 
           fill = "#ff1493", color = "#ff1493", size = 4, 
           xparams = list(bins = 41), yparams = list(bins = 48))
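
The bins values passed through xparams and yparams set how many bars each marginal histogram has. As a rough sketch of what binning into 41 equal-width intervals means for displ, here is a base R version; the bin edges chosen by ggplot2 may be placed differently, so this is an illustration rather than a reproduction of the marginal plot.

```r
library(ggplot2)  # only for the mpg dataset

n_bins <- 41

# n_bins equal-width intervals need n_bins + 1 edges from min to max
edges <- seq(min(mpg$displ), max(mpg$displ), length.out = n_bins + 1)

# Assign every displ value to an interval and count the values per interval
counts <- table(cut(mpg$displ, breaks = edges, include.lowest = TRUE))
```

With this many bins on a variable that takes few distinct values, many intervals are empty, which is why the marginal histogram looks like a set of thin spikes.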