ggplot2版: 50个matplotlib常用可视化图
此前看过一篇教程,总结了python matplotlib制作7类50种图的代码。我不习惯这样的画图方式,于是尝试用R ggplot2重画这些图。本文目的是,尽可能实现和原教程中一模一样的图,以此来熟悉ggplot2中的各种调节功能。
注意:
本文只是在尽可能地复现这些图,而并非认可它们清晰、明确,实际上其中一些图画得非常不好。然而,重点在于,通过使用ggplot2以各种特定形式画图,有助于了解ggplot2适合实现哪些功能或不能实现哪些功能,从而进一步了解ggplot2。
本文是在Rstudio中使用Rmd格式写的,所有图像都是在Rmd中的显示效果,相同代码在Rgui、R窗口中或保存时,画图效果可能会有差别,尤其是文字、线条、图例的大小和位置,这需要再次调整相关参数。
画图数据来自github、R或python内部,其中非R内部数据集保留了网址或说明,可自行寻找下载。
画图主要使用ggplot2包,部分功能依赖于ggplot2的辅助包。有些图使用了不同方法,最推荐的是可调参数多的或者是包依赖少的。
文中所有代码均为原创,转载请注明出处。
原matplotlib教程:
Top 50 matplotlib Visualizations – The Master Plots (with full python code), https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/
定义文件路径和普遍需要的包。
setwd(“D:/#R/learn_ggplot2”)
library(ggplot2)
library(data.table)
library(magrittr)
knitr::opts_chunk$set(message = FALSE, warning = FALSE) #全局不返回message
相关性下的图用于可视化2个或多个变量之间的关系。也就是说,一个变量相对于另一个变量如何变化。
Scatter plot
用于研究两个变量之间的关系。如果数据中有多个组,则可能需要以不同的颜色可视化每个组。
# 数据来自github,与ggplot2::midwest不同
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")
df <- fread("data1_midwest.csv") #连接在上面,这里已下载到本地保存使用
df_color <- c("#1578bc", "#2477b4", "#fb810b", "#299f32", "#d5282c", "#d32b29",
"#9066bf", "#885647", "#ea75c1", "#e376c3", "#7c7e7e", "#c1bd1f",
"#19c0ca", "#18becc")
ggplot(df, aes(x = area, y = poptotal, color = category))+
geom_point(size = 1.3)+
scale_x_continuous(breaks = seq(0, 0.1, by = 0.02), limits = c(0, 0.1), expand = c(0,0,0,0))+
scale_y_continuous(breaks = seq(0, 90000, by = 10000), limits = c(0, 90000), expand = c(0,0))+
scale_color_manual(values = df_color)+
labs(x = "Area", y = "Population", color = NULL, title = "Scatterplot of Midwest Area vs Population")+
theme_bw()+
theme(aspect.ratio = 1/1.7,
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.text = element_text(size = 7),
legend.spacing.x = unit(0, "cm"),
legend.key.height = unit(0.35, "cm"),
legend.key.width = unit(0.5, "cm"),
legend.position = c(0.94, 0.72), #markdown中r图像比例有问题,这项可调,
legend.background = element_blank(),#这项为了在markdown中好看,除去了背景
plot.title = element_text(hjust = 0.5))
Bubble plot with Encircling
可以显示边界内的一组点以强调其重要性。
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")
df <- fread("data1_midwest.csv")
# 将category定义为具有顺序的因子类型
df_class <- unique(df$category) %>% .[order(.)]
df[, category:=factor(category, levels = df_class)]
# 定义散点和图标的大小:通过01标准化和乘数、加数调整相对大小
# 不知道matplotlib内部是什么算法,这里就用最大值了
circsize <- df[,max(dot_size), by=category] %>% setorder(category) %>%
.[, V1 := (V1 - min(.$V1))/(max(.$V1) - min(.$V1))*4+2] %>%
set_colnames(c("category", "size"))
# 画圈圈住state == "IN"的点
stateIN <- df[state == "IN", c("area", "poptotal")] %>%
.[c(chull(.), chull(.)[1]),]
# 颜色
df_color <- c("#1578bc", "#2477b4", "#fb810b", "#299f32", "#d5282c", "#d32b29",
"#9066bf", "#885647", "#ea75c1", "#e376c3", "#7c7e7e", "#c1bd1f",
"#19c0ca", "#18becc")
# 注意,stroke表示有shape=21的fill的color部分,即外圈的大小
# 为了保证和原图效果一样,这里用polygon图层显示底纹,path图层显示线,实际上polygon也有线图层了,只不过在散点图之下
# 点的大小尺度,和python中有所差别,可能与这里的线性变换有关
# ggalt::geom_encircle也可以画包围线,但是包围线是光滑的,并不是端点连线
ggplot()+
geom_polygon(data = stateIN, mapping = aes(x = area, y = poptotal, group = 1), fill = "#fffae5", color = "#ba413f", size = 0.4, alpha = 1)+
geom_point(data = df, mapping = aes(x = area, y = poptotal, fill = category, size = dot_size), color = "black", shape = 21, stroke = 0.3)+
geom_path(data = stateIN, mapping = aes(x = area, y = poptotal, group = 1), color = "#ba413f", size = 0.4)+ #线图层在点图层上方
scale_size(range = c(min(circsize$size), max(circsize$size)))+
scale_fill_manual(values = df_color)+
scale_x_continuous(breaks = seq(0, 0.1, by = 0.02), limits = c(0, 0.1), expand = c(0,0,0,0.005))+
scale_y_continuous(breaks = seq(0, 90000, by = 10000), limits = c(0, 90000), expand = c(0,0))+
guides(fill = guide_legend(override.aes = list(size = circsize$size)), size = FALSE)+
labs(x = "Area", y = "Population", fill = NULL, size = NULL, title = "Bubble plot with Encircling")+
theme_bw()+
theme(aspect.ratio = 1/1.7,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5),
legend.text = element_text(size = 7),
legend.spacing.x = unit(0, "cm"),
legend.key.height = unit(0.35, "cm"),
legend.key.width = unit(0.5, "cm"),
legend.position = c(0.95, 0.71), #markdown中r图像比例有问题,这项可调,
legend.background = element_blank()) #这项为了在markdown中好看,除去了背景
Scatter plot with linear regression line of best fit
如果想了解两个变量是如何相互变化的,那么可选最佳拟合线。
df <- ggplot2::mpg %>% setDT()
df_select <- df[cyl %in% c(4,8),] %>% .[,cyl:=as.factor(cyl)]
# 颜色
cyl_color <- c("#1f77b4", "#ff983e")
# geom_smooth的填充范围,只有数据和全图可选,而且se只会按垂直方向填充,这与python不同
ggplot(df_select, aes(displ, hwy, color = cyl, fill = cyl))+
geom_point(color = "black", shape = 21, size = 2.3, stroke = 0.2)+
geom_smooth(formula = "y~ x", method = "lm", fullrange = T, alpha = 0.2, show.legend = F)+
scale_x_continuous(breaks = 1:7, limits = c(0.5, 7.5), expand = c(0,0,0,0.005))+
scale_y_continuous(breaks = seq(0, 45, by = 5), limits = c(0, 45), expand = c(0,0))+
scale_color_manual(values = cyl_color)+
scale_fill_manual(values = cyl_color)+
labs(title = "Scatterplot with line of best fit grouped by number of cylinders")+
theme_bw()+
theme(aspect.ratio = 1/1.6, #这里是高1宽1.6,和python的seaborn长宽比是反着的
axis.line = element_line(linewidth = 0.3),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
df <- ggplot2::mpg %>% setDT()
df_select <- df[cyl %in% c(4,8),] %>% .[,cyl:=as.factor(cyl)]
# 注意这个图仍和seaborn的有差别,坐标轴上,对方的相当于两个图放一起了,这个没难度,这里就按分面画了一下
ggplot(df_select, aes(displ, hwy))+
geom_point(color = "black", fill = "#1f77b4", shape = 21, size = 2.3, stroke = 0.2)+
geom_smooth(formula = "y~ x", method = "lm", fill = "#1f77b4", color = "#1f77b4", fullrange = T, alpha = 0.2, show.legend = F)+
scale_x_continuous(breaks = 1:7, limits = c(0.5, 7.5), expand = c(0,0,0,0.005))+
scale_y_continuous(breaks = seq(0, 45, by = 5), limits = c(0, 45), expand = c(0,0))+
facet_wrap(vars(cyl), scales = "free", labeller = labeller(.default = function(x) paste0("cyl = ", x)))+
# facet_wrap(vars(cyl), scales = "free", labeller = function(variable, value) paste0(variable, "=", value))+ #这个方法被新的ggplot弃用了,但是和上一行同样效果
theme_bw()+
theme(aspect.ratio = 1/1,
axis.line = element_line(linewidth = 0.3),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
strip.background = element_rect(fill = F, colour = F),
plot.title = element_text(hjust = 0.5))
Jittering with stripplot
通常,多个数据点具有完全相同的X和Y值,将导致多个点相互绘制并隐藏。为避免这种情况,可稍微抖动点,以便直观地看到它们。
df <- ggplot2::mpg %>% setDT() %>% .[,cty:=factor(cty)]
df_color <- c( "#ff869a", "#9fb428", "#3fcc84", "#61b5f2","#b19fef","#fa82af")
# 注意这个图的横坐标是因子型,因此并不等宽
# 注意偏离值,python中只能水平偏移,R语言中水平和数值都可设定,并且默认值都不是0,要自定义
# 离散的话,用scale_fill_discrete(df_color),将自动匹配数量
# 连续的话,用scale_*_gradient/scale_*_gradient2/scale_*_gradientn,分别将中高两种、中高低三种,和任意颜色映射到梯度中,2中的中需要自定义中点值
ggplot(df, aes(cty, hwy, fill = cty))+
geom_jitter(width = 0.25, height = 0, color = "black", shape = 21, size = 2.3, stroke = 0.2)+
# scale_x_continuous(breaks = seq(9, 35, by = 1), limits = c(9, 35))+
scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
scale_fill_discrete(df_color)+
# scale_fill_gradient2(low = "#ff869a", mid = "#3fcc84", high = "#fa82af", midpoint = 32)+
# scale_fill_gradientn(colours = colorn)+
labs(title = "Use jittered plots to avoid overlapping of points")+
guides(fill = F)+
theme_bw()+
theme(aspect.ratio = 1/1.5, #高宽比1:1.5
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Counts Plot
避免点重叠问题的另一种选择是根据该点中有多少个点来增加点的大小。因此,点的大小越大,点周围的集中度就越大。
df <- ggplot2::mpg %>%
setDT() %>%
.[, length(drv), by=c("cty", "hwy")] %>%
set_colnames(c("cty", "hwy", "size")) %>%
.[, cty:=factor(cty)]
df_color <- c( "#ff869a", "#9fb428", "#3fcc84", "#61b5f2","#b19fef","#fa82af")
# 注意这个图的横坐标是因子型,因此并不等宽
ggplot(df, aes(cty, hwy, size = size, fill = cty))+
geom_point(color = "black", shape = 21, stroke = 0.2)+
scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
scale_fill_discrete(df_color)+
labs(title = "Counts Plot - Size of circle is bigger as more points overlap")+
guides(fill = F, size = F)+
theme_bw()+
theme(aspect.ratio = 1/2, #高宽比1:2
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Marginal Histogram
边际直方图具有沿X轴和Y轴变量的直方图。这用于可视化X和Y之间的关系以及X和Y的单变量分布。此图经常用于探索性数据分析。
用ggpubr::ggscatterhist也可以,但是是图和边缘图一起画,可调参数较少
这里仅用ggExtra::ggMarginal作图
library(ggExtra)
df <- ggplot2::mpg
# 绘制散点图和分布密度,分布图位置不能改只能在右和上
p <- ggplot(df, aes(x = displ, y = hwy, size = cty, fill = manufacturer)) +
geom_point(color = "black", shape = 21, stroke = 0.2)+
scale_size(range = c(1,4))+
scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
guides(size = F, fill = F)+
labs(title = "Scatterplot with Histograms \n displ vs hwy")+
theme_bw()+
theme(aspect.ratio = 1/2, #高宽比1:2
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
ggMarginal(p, type = "histogram",
fill = "#ff1493", color = "#ff1493", size = 4,
xparams = list(bins = 41), yparams = list(bins = 48))
Marginal Boxplot
边际箱线图的作用与边际直方图类似。箱线图有助于精确定位X和Y的中位数、第25个和第75个百分位数。
library(ggExtra)
df <- ggplot2::mpg
# 绘制散点图和分布密度,分布图位置不能改只能在右和上
p <- ggplot(df, aes(x = displ, y = hwy, size = cty, fill = manufacturer)) +
geom_point(color = "black", shape = 21, stroke = 0.2)+
scale_size(range = c(1,4))+
scale_y_continuous(breaks = seq(10, 45, by = 5), limits = c(10, 45), expand = c(0,0))+
guides(size = F, fill = F)+
labs(title = "Scatterplot with Histograms \n displ vs hwy")+
theme_bw()+
theme(aspect.ratio = 1/2, #高宽比1:2
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
# 注意,使用boxplot添加最值横线的方法如下,但是在margin图上没法加这个,所以与pythonseaborn的有差别
# ggplot(mpg, aes(class, hwy)) + stat_boxplot(geom = "errorbar", width = 0.1) + geom_boxplot()
ggMarginal(p, type = "boxplot", size = 15, fill = "#3174a2",
xparams = list(aes(group = 1)),
yparams = list(aes(group = 1)))
Correllogram
相关图用于直观地查看给定数据(或 2D 数组)中所有可能的数值变量对之间的相关指标。
df <- names(datasets::mtcars)
df_color <- c("#c01a26", "#fffebc", "#016837")
text.color <- function(x) if(x > 0.53|x < -0.46) return("w") else return("b") #数字颜色根据背景颜色调整,更容易看清数字
mtcars.cor <- cor(datasets::mtcars) %>%
data.table() %>%
.[,v1:=df] %>%
melt(id="v1", variable.name="v2", value.name = "cor") %>%
.[, c("v1", "v2") := list(factor(v1, levels = df), factor(v2, levels = rev(df)))] %>%
.[, cor:=round(cor, 2)] %>%
.[, text.col := text.color(cor), by = c("v1","v2")]
# 这个图颜色通过对Python的取色得来,但是颜色没有原图鲜艳
ggplot(mtcars.cor, aes(v1, v2, fill = cor, label = cor))+
geom_tile()+
geom_text(mapping = aes(color = text.col), show.legend = F, size = 3)+
scale_fill_gradientn(colours = df_color)+
scale_color_manual(values = c("b" = "black", "w" = "white"))+
guides(fill = guide_colorbar(ticks = FALSE, barheight = 19, barwidth = 1.25))+
labs(x = NULL, y = NULL, title = "Correlogram of mtcars", fill = NULL)+
theme_bw()+
theme(aspect.ratio = 1,
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Pairwise Plot
成对图是探索性分析中最受欢迎的一种,用于了解所有可能的数值变量对之间的关系。它是双变量分析的必备工具。
library(GGally)
# 定义散点图
scatter_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point(..., size = 1, alpha = 0.8) +
geom_smooth(formula = "y ~ x", method = "lm", show.legend = F, alpha = 0.2) +
guides(fill = guide_legend(override.aes = list(size = 2)))
}
# 定义密度图
diag_density <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping)+
geom_density(alpha = 0.2)
}
# 坐标轴和python有差异
ggpairs(datasets::iris, aes(color = Species, fill = Species), columns = 1:4, legend = c(1,2), progress = FALSE,
upper = list(continuous = scatter_fn),
lower = list(continuous = scatter_fn),
diag = list(continuous = diag_density)) +
theme_bw()+
theme(axis.ticks = element_blank(),
panel.grid = element_blank(),
strip.background = element_rect(fill = F, colour = F),
strip.text = element_text(size = 10))
library(GGally)
scatter_fn <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point(..., size = 1, alpha = 0.8) + # 这里不能定义color="white",会和ggpairs有冲突
guides(fill = guide_legend(override.aes = list(size = 2)))
}
diag_density <- function(data, mapping, ...){
p <- ggplot(data = data, mapping = mapping)+
geom_density(alpha = 0.2)
}
# 坐标轴和python有差异
ggpairs(datasets::iris, aes(color = Species, fill = Species), columns = 1:4, legend = c(1,2), progress = FALSE,
upper = list(continuous = scatter_fn),
lower = list(continuous = scatter_fn),
diag = list(continuous = diag_density)) +
theme_bw()+
theme(axis.ticks = element_blank(),
panel.grid = element_blank(),
strip.background = element_rect(fill = F, colour = F),
strip.text = element_text(size = 10))
Diverging Bars
如果想查看项目如何基于单个指标而变化,并可视化此差异的顺序和数量,则发散条形图是一个很好的工具。它有助于快速区分数据中组的性能,并且非常直观,可以立即传达要点。
geom_linerange、geom_segment和geom_bar都可以实现这类图,前两种方法基本一样
ggplot中,几种画线或线段的函数 1. geom_linerange(), aes是x, ymin, ymax; y线段三个,或x线段三个 2. geom_segment(), aes是x, xend, y, yend,四个都得有 3. geom_vline(), geom_hline(), aes是xintercept或yintercept 4. geom_abline(), aes是slope和intercept
df <- datasets::mtcars %>% data.table() %>% .[,cars:=rownames(mtcars)]
df[, mpg_z := (mpg-mean(mpg))/sd(mpg)]
df[mpg_z > 0, mpg_c := "g"]
df[mpg_z <= 0, mpg_c := "r"]
# 注意Ferrari Dino和原版数值不一样,可能是数据的问题,用本地数据R和Python画出图结果一样
ggplot(df, aes(y = reorder(cars, mpg_z), color = mpg_c))+
geom_linerange(aes(xmin = 0, xmax = mpg_z), linewidth = 2, alpha = 0.5, show.legend = F)+
scale_color_manual(values = c("r" = "red", "g" = "green"))+
scale_x_continuous(limits = c(-2, 2.5), breaks = seq(-2, 2.5, by = 0.5), expand = c(0,0.02))+
scale_y_discrete(expand = c(0.03, 0.03))+
labs(x = "Mileage", y = "Model", title = "Diverging Bars of Car Mileage")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
axis.title = element_text(face = "italic"),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
df <- datasets::mtcars %>% data.table() %>% .[,cars:=rownames(mtcars)]
df[, mpg_z := (mpg-mean(mpg))/sd(mpg)]
df[mpg_z > 0, mpg_c := "g"]
df[mpg_z <= 0, mpg_c := "r"]
# 注意Ferrari Dino和原版数值不一样,可能是数据的问题,用本地数据R和Python画出图结果一样
ggplot(df)+
geom_segment(aes(x = 0, xend = mpg_z, y = reorder(cars, mpg_z), yend = reorder(cars, mpg_z),color = mpg_c), linewidth = 2, alpha = 0.5, show.legend = F)+
scale_color_manual(values = c("r" = "red", "g" = "green"))+
scale_x_continuous(limits = c(-2, 2.5), breaks = seq(-2, 2.5, by = 0.5), expand = c(0,0.02))+
scale_y_discrete(expand = c(0.03, 0.03))+
labs(x = "Mileage", y = "Model", title = "Diverging Bars of Car Mileage")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
axis.title = element_text(face = "italic"),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
df <- datasets::mtcars %>% data.table() %>% .[,cars:=rownames(mtcars)]
df[, mpg_z := (mpg-mean(mpg))/sd(mpg)]
df[mpg_z > 0, mpg_c := "g"]
df[mpg_z <= 0, mpg_c := "r"]
# 注意Ferrari Dino和原版数值不一样,可能是数据的问题,用本地数据R和Python画出图结果一样
ggplot(df, aes(x = reorder(cars, mpg_z), y = mpg_z, fill = mpg_c))+
geom_bar(stat = "identity", alpha = 0.5, width = 0.5, show.legend = F)+
scale_fill_manual(values = c("r" = "red", "g" = "green"))+
scale_x_discrete(expand = c(0.03, 0.03))+
scale_y_continuous(limits = c(-2, 2.5), breaks = seq(-2, 2.5, by = 0.5), expand = c(0,0.02))+
coord_flip()+
labs(x = "Model", y = "Mileage", title = "Diverging Bars of Car Mileage")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
axis.title = element_text(face = "italic"),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
Diverging Texts
发散文本类似于发散条形,如果您想以美观且美观的方式显示图表中每个项目的值,则首选文本。
df <- datasets::mtcars %>% data.table() %>% .[,cars:=rownames(mtcars)]
df[, mpg_z := (mpg-mean(mpg))/sd(mpg)]
df[mpg_z > 0, mpg_c := "g"]
df[mpg_z <= 0, mpg_c := "r"]
# 注意Ferrari Dino和原版数值不一样,可能是数据的问题,用本地数据R和Python画出图结果一样
ggplot(df, aes(y = reorder(cars, mpg_z)))+
geom_linerange(aes(xmin = 0, xmax = mpg_z), linewidth = 1, show.legend = F)+
geom_text(data = df[mpg_c == "g",], mapping = aes(x = mpg_z, label = sprintf("%.2f",mpg_z)), hjust = 0, color = "green4", size = 3)+
geom_text(data = df[mpg_c == "r",], mapping = aes(x = mpg_z, label = sprintf("%.2f",mpg_z)), hjust = 1, color = "red", size = 3)+
scale_x_continuous(limits = c(-2, 2.5), breaks = seq(-2, 2.5, by = 0.5), expand = c(0,0.02))+
scale_y_discrete(expand = c(0.03, 0.03))+
labs(x = "Mileage", y = "Model", title = "Diverging Text Bars of Car Mileage")+
theme_bw()+
theme(axis.ticks = element_blank(),
axis.title = element_text(face = "italic"),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
Diverging Dot Plot
分流点图也类似于发散条。然而,与发散条相比,没有条形减少了组之间的对比度和差异量。
df <- datasets::mtcars %>% data.table() %>% .[,cars:=rownames(mtcars)]
df[, mpg_z := (mpg-mean(mpg))/sd(mpg)]
df[mpg_z > 0, mpg_c := "g"]
df[mpg_z <= 0, mpg_c := "r"]
# 注意Ferrari Dino和原版数值不一样,可能是数据的问题,用本地数据R和Python画出图结果一样
ggplot(df, aes(y = reorder(cars, mpg_z), x = mpg_z, color = mpg_c, label = sprintf("%.1f",mpg_z)))+
geom_point(size = 5, alpha = 0.6, show.legend = F)+
geom_text(size = 2, hjust = 0.5, color = "white")+
scale_x_continuous(limits = c(-2, 2.5), breaks = seq(-2, 2.5, by = 0.5), expand = c(0,0.02))+
scale_y_discrete(expand = c(0.03, 0.03))+
scale_color_manual(values = c("r" = "red", "g" = "green4"))+
labs(x = "Mileage", y = "Model", title = "Diverging Dotplot of Car Mileage")+
theme_bw()+
theme(axis.ticks = element_blank(),
axis.title = element_text(face = "italic"),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
Diverging Lollipop Chart with Markers
带标记的棒棒糖提供了一种灵活的方法来可视化偏差,方法是强调想要引起注意的任何重要数据点并在图表中适当地进行推理。
df <- datasets::mtcars %>% data.table() %>% .[,cars:=rownames(mtcars)]
df[, mpg_z := (mpg-mean(mpg))/sd(mpg)]
# 方框标注数据
df_rect <- data.table(x_min = c(1.5, -1.85),
x_max = c(2.45, -1.35),
y_min = c(28, 0),
y_max = c(32.8, 3),
rec_group = c("g", "r"))
# 线段括号数据
df_line <- data.table(x_point = c(-0.1, 0, -0.1, 0, 0, 0, 0, 0.5)+0.05, #原图太难看了,这里整体向右平移0.05
y_point = c(10.5, 10.5, 13.5, 13.5, 10.5, 13.5, 12, 12),
line_group = c(1,1,2,2,3,3,4,4))
# 注意Ferrari Dino和原版数值不一样,可能是数据的问题,用本地数据R和Python画出图结果一样
# 注意原版,代码和图中的标题也不一样,说明也是复制前面的代码,而且后期改了代码,但没再运行
ggplot(df, aes(y = reorder(cars, mpg_z), x = mpg_z))+
geom_linerange(aes(xmin = 0, xmax = mpg_z), linewidth = 0.5)+
geom_point(size = 5, alpha = 0.6)+
geom_linerange(data = df[cars == "Fiat X1-9",], aes(y = cars, xmin = 0, xmax = mpg_z), linewidth = 0.5, color = "orange")+
geom_point(data = df[cars == "Fiat X1-9",], aes(y = cars, x = mpg_z), size = 7, alpha = 1, color = "orange")+
geom_rect(data = df_rect, mapping = aes(x= NULL, y = NULL,
xmin = x_min, xmax = x_max,
ymin = y_min, ymax = y_max, fill = rec_group), alpha = 0.4, show.legend = F)+
geom_line(data = df_line, mapping = aes(x = x_point, y = y_point, group = line_group), linewidth = 0.8, color = "blue1")+
geom_label(data = NULL, aes(x = 0.5, y = 12, label = "Mercedes Models"), hjust = 0, fill = "red", color = "white")+
scale_fill_manual(values = c("g" = "green3", "r" = "red"))+
scale_x_continuous(limits = c(-2, 2.5), breaks = seq(-2, 2.5, by = 0.5), expand = c(0,0.1))+
scale_y_discrete(expand = c(0.05, 0.05))+
labs(x = "Mileage", y = "Model", title = "Diverging Lollipop of Car Mileage")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
axis.title = element_text(face = "italic"),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
Area Chart
通过对轴和线之间的区域进行着色,面积图不仅更加强调波峰和波谷,还更加强调高点和低点的持续时间。高点持续时间越长,线下面积越大。
这个图需要注意,如果不填充不同颜色,则没什么问题,一旦要以x轴为分界,上下填充不同颜色,则需要差值以增加数据密度。
#显示英文日期名
Sys.setlocale(category = "LC_TIME", locale = "English")
## [1] "English_United States.1252"
# 准备一个差值函数,用以扩充数据量,
# 目的是,例如对于y正、负、正交替出现的情况,即x1的y>0,之后有部分x2,x3...xn-1的y<0,然后又是xn的y>0,函数将x尽量细分,以使x1和xn的y值接近0,
# 这样对y>0和y<0填充不同颜色时,让x2,x3...xn-1的y>0部分填充足够小以至于看不见(若不对y>0和y<0填充不同颜色,则不会填充这部分)
# 函数处理过后的数据,对y>0和y<0填充不同颜色,与填充相同颜色,填充部分看起来是一样的
dat_interp = function(x, y, line = 0, length.out=500) {
a <- approx(x, x, xout=seq(min(x), max(x), length.out=length.out))$y
b <- approx(x, y, xout=seq(min(x), max(x), length.out=length.out))$y
if("POSIXct" %in% class(x)) a <- as.POSIXct(a, origin='1970-01-01', tz='UTC')
res <- data.table(x = a, y = b)
res[y < line, value:="lower"]
res[y >= line, value:="greater"]
return(res)
}
# 这个数据与ggplot2::economics数值稍有差异,这里没用R内部数据
# df <- fread("https://github.com/selva86/datasets/raw/master/economics.csv")
df <- fread("data5_economics.csv") %>% .[1:100,]
df[, returns:=c(0,diff(df$psavert)/df$psavert[1:99])*100] #注意这里的计算方法,本期增长率=(本期-上期)/上期
# 差值为新数据
df_date <- df[["date"]] %>% as.POSIXct(tz = "UTC") #提取日期
df_inte <- dat_interp(df_date, df$returns, length.out = 5000)
# 箭头数据
df_arro <- data.table(x = as.POSIXct(c("1975-03-15", "1975-04-20")), y = c(27, 22))
# 标签数据
df_labe <- data.table(x = as.POSIXct("1975-03-15"), y = 27, l = "Peak\n1975")
b_date <- as.POSIXct("1967-07-01", tz = "UTC")
e_date <- as.POSIXct("1975-11-01", tz = "UTC")
ggplot()+
geom_area(data = df_inte, aes(x, y, fill = value), show.legend = F)+
geom_line(data = df_arro, aes(x,y), arrow = arrow(length = unit(0.2, "cm"), type = "closed", angle = 25))+
geom_label(data = df_labe, aes(x,y,label = l), hjust = 1, vjust = 0, fill = "red3", color = "white", size = 3)+
scale_fill_manual(values = c("greater" = "green3", "lower" = "red"))+
scale_x_datetime(date_labels = "%b-%Y", breaks = seq(b_date, e_date, by = "6 month"), limits = c(b_date, e_date), expand = c(0,0))+
scale_y_continuous(limits = c(-35,35), expand = c(0,0), breaks = seq(-30,30,10))+
labs(x = NULL, y = "Monthly returns %", title = "Month Economics Return %")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
Ordered Bar Chart
有序条形图有效地传达项目的排名顺序。但是,将图表上方的指标值相加,用户将从图表本身获得准确的信息。这是一种基于计数或任何给定指标可视化项目的经典方法。
df <- ggplot2::mpg %>%
data.table() %>%
.[, lapply(.SD, mean), .SDcols="cty", by = "manufacturer"] %>%
setorder(cty) %>%
.[, manufacturer:=toupper(manufacturer)]
# 简化图,除坐标轴底纹外都相同
ggplot(df, aes(reorder(manufacturer, cty), cty))+
geom_bar(stat = "identity", fill = "#c96464", width = 0.5)+
geom_text(aes(label = sprintf("%.1f", cty)), vjust = -0.5)+
scale_y_continuous(limits = c(0, 30), breaks = seq(0, 30, 5), expand = c(0,0))+
labs(x = NULL, y = "Miles Per Gallon", title = "Bar Chart for Highway Mileage")+
theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.ticks = element_blank(), plot.background = element_rect(fill = NULL),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
df <- ggplot2::mpg %>%
data.table() %>%
.[, lapply(.SD, mean), .SDcols="cty", by = "manufacturer"] %>%
setorder(cty) %>%
.[, manufacturer:=toupper(manufacturer)]
# 如果追求一致,可以换一种思路,自己画x和y轴,并添加标签,但是会相当麻烦
ggplot(df, aes(reorder(manufacturer, cty), cty))+
geom_bar(stat = "identity", fill = "#c96464", width = 0.5)+
geom_text(aes(label = sprintf("%.1f", cty)), vjust = -0.5)+
# 自定义border
geom_hline(yintercept = c(0,30), linewidth = 0.5)+
geom_linerange(data = NULL, aes(x = -0, ymin = 0, ymax = 30))+
geom_linerange(data = NULL, aes(x = 16, ymin = 0, ymax = 30))+
# 自定义x轴标签和填充
geom_text(aes(y = -0.3, label = manufacturer), vjust = 1, hjust = 1, angle = 45, size = 2.5)+
geom_rect(data = NULL, aes(xmin = 0, xmax = 9.25, ymin = -5, ymax = 0), fill = "red1", alpha = 0.008)+
geom_rect(data = NULL, aes(xmin = 9.25, xmax = 16, ymin = -5, ymax = 0), fill = "green3", alpha = 0.008)+
# 将y轴负值部分留出空白,y轴标签可不自定义,但显示范围仅限于正值
scale_x_discrete(expand = c(0,0))+
scale_y_continuous(limits = c(-5, 30), breaks = seq(0, 30, 5), expand = c(0,0))+
labs(x = NULL, y = "Miles Per Gallon", title = "Bar Chart for Highway Mileage")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.text.x = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
axis.title.y = element_text(hjust = 0.65), #微调y标题以居中,注意标题方向上的水平,是图的竖直方向
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Lollipop Chart
棒棒糖图表的作用与有序条形图类似,具有视觉上令人愉悦的方式。
df <- ggplot2::mpg %>%
data.table() %>%
.[, lapply(.SD, mean), .SDcols="cty", by = "manufacturer"] %>%
setorder(cty) %>%
.[, manufacturer:=toupper(manufacturer)]
ggplot(df, aes(reorder(manufacturer, cty), cty))+
geom_linerange(aes(ymin = 0, ymax = cty), color = "red4")+
geom_point(color = "red4")+
geom_text(aes(label = sprintf("%.2f", cty)), size = 3, vjust = -1)+
scale_y_continuous(limits = c(0, 30), breaks = seq(0, 30, 5), expand = c(0,0))+
labs(x = NULL, y = "Miles Per Gallon", title = "Lollipop Chart for Highway Mileage")+
theme_bw()+
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Dot Plot
点图表示项目的排名顺序。由于它沿水平轴对齐,因此您可以更轻松地可视化点之间的距离。
df <- ggplot2::mpg %>%
data.table() %>%
.[, lapply(.SD, mean), .SDcols="cty", by = "manufacturer"] %>%
setorder(cty) %>%
.[, manufacturer:=toupper(manufacturer)]
ggplot(df, aes(cty, reorder(manufacturer, cty)))+
geom_linerange(aes(xmin = 11, xmax = 26), linetype = 2, linewidth = 0.25, alpha = 0.3)+
geom_point(color = "red4", size = 3)+
scale_x_continuous(limits = c(10, 27), breaks = seq(10, 27, 2), expand = c(0,0))+
labs(x = "Miles Per Gallon", y = NULL, title = "Dot Plot for Highway Mileage")+
theme_bw()+
theme(aspect.ratio = 1/1.7,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Slope Chart
斜率图最适合比较给定人员/项目的“之前”和“之后”位置。
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/gdppercap.csv", header = T)
df <- fread("data6_gdppercap.csv", header = T)
df_dat <- df[, trend:=ifelse(df[["1952"]] - df[["1957"]]>0, "down", "up")] %>%
melt(id.vars = c("continent", "trend"), variable.name = "year", value.name = "gdp")
df_lab_bef <- df_dat[year == "1952", ] %>% .[,l:=sprintf("%s, %.1f", continent, gdp)]
df_lab_aft <- df_dat[year == "1957", ] %>% .[,l:=sprintf("%s, %.1f", continent, gdp)]
# 一个线段上,不同点负值不同颜色,则按这条线段第一个点的颜色画
ggplot(df_dat, aes(year, gdp, color = trend))+
geom_line(aes(group = continent), show.legend = F)+
geom_linerange(aes(x = year, ymin = 0, ymax = 13000), color = "gray60", linewidth = 0.2, alpha = 0.5)+
geom_point(show.legend = F)+
geom_text(data = df_lab_bef, aes(0.98, gdp, label = l), color = "black", size = 3, hjust = 1)+
geom_text(data = df_lab_aft, aes(2.03, gdp, label = l), color = "black", size = 3, hjust = 0)+
geom_text(data = NULL, aes(0.98, 13000, label = "BEFORE"), color = "black", size = 5, hjust = 1)+
geom_text(data = NULL, aes(2.03, 13000, label = "AFTER"), color = "black", size = 5, hjust = 0)+
scale_color_manual(values = c("up" = "green4", "down" = "red2"))+
scale_x_discrete(expand = c(0.2,0.2))+
scale_y_continuous(limits = c(0, 15000), breaks = seq(500, 12500, 2000))+
labs(x = NULL, y = "Mean GDP Per Capita", title = "Slopechart: Comparing GDP Per Capita between 1952 vs 1957")+
theme_bw()+
theme(aspect.ratio = 1/1.7,
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Dumbbell Plot
哑铃图传达了各种物品的“之前”和“之后”位置以及物品的等级顺序。如果您想可视化特定项目/计划对不同对象的影响,它非常有用。
# 数据排序,默认结果R和python不同,这里没有采用默认形式,第一排序指标相同,次要指标按pct_2013再排一次
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/health.csv")
df <- fread("data7_health.csv") %>% setorder(pct_2014, pct_2013) %>% .[, index:=(nrow(.)-1):0]
df_line <- data.table(x = c(0.05, 0.10, 0.15, 0.20), ymin = c(0,0,0,0), ymax = c(25,25,25,25))
# 原版里横坐标百分数标错了,这里改过来了
ggplot(df, aes(y = index))+
geom_point(aes(x = pct_2013), color = "skyblue4")+
geom_point(aes(x = pct_2014), color = "skyblue")+
geom_linerange(aes(xmin = pct_2014, xmax = pct_2013), color = "skyblue")+
geom_linerange(data = df_line, aes(x = x, y = NULL, ymin = ymin, ymax = ymax), linetype = 3, linewidth = 0.1)+
scale_y_continuous(limits = c(0, 25), breaks = seq(0, 25, 5)) +
scale_x_continuous(limits = c(0.01, 0.249), breaks = seq(0, 0.25, 0.05), labels = scales::percent, expand = c(0,0))+
labs(x = NULL, y = "Mean GDP Per Capita", title = "Dumbell Chart: Pct Change - 2013 vs 2014")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Histogram for Continuous Variable
直方图显示给定变量的频率分布。下面的表示基于分类变量对频率条进行分组,从而更深入地了解连续变量和分类变量。
fill_color <- c("#9e013d", "#ec5b43", "#ffbe6c","#fdffb6", "#bce69f", "#57adab", "#6153a6")
x_b <- c(1.6 , 2.14, 2.68, 3.22, 3.76, 4.3 , 4.84, 5.38, 5.92, 6.46, 7.0 )
# 注意,这个划分的结果和python的不同,我也不知道,如何完全复刻python的划分方式
ggplot(ggplot2::mpg, aes(displ, fill = class))+
geom_histogram(color = "white", bins = 30, position = position_stack(reverse = T), closed = "right")+
scale_fill_manual(values = fill_color)+
scale_x_continuous(breaks = x_b)+
scale_y_continuous(limits = c(0, 35), breaks = seq(0, 35, 5), expand = c(0,0))+
labs(x = "displ", y = "Frequency", title = "Stacked Histogram of displ by class")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.position = c(0.85, 0.7),
plot.title = element_text(hjust = 0.5))
python这个图我认为画的实在不好,下面我用R另画了一个,清楚的展示了每个bar的宽度和大小,若要对照具体数据查看,使用这个代码setDT(mpg);dcast(mpg[,c("displ","class")], class~displ)
fill_color <- c("#9e013d", "#ec5b43", "#ffbe6c","#fdffb6", "#bce69f", "#57adab", "#6153a6")
x_b <- c(1.6 , 2.14, 2.68, 3.22, 3.76, 4.3 , 4.84, 5.38, 5.92, 6.46, 7.0 )-0.02 #-0.02让截断正好在bar的两端,这个0.02是试出来的
x_bin_w <- seq(x_b[1], x_b[length(x_b)], length.out = 30) #最后一个直方块是在最后一个截断右边,因此长度还是30
# 注意,这个划分的结果和python的不同,我也不知道,如何完全复刻python的划分方式
# 注意,当x轴的limits比数据宽(至少多宽出一个bar的宽度),则会由提示出现移除了缺失值,实际上并没有对影响数据
ggplot(ggplot2::mpg, aes(displ, fill = class))+
geom_histogram(color = "white", bins = 30, position = position_stack(reverse = T), closed = "right")+
scale_fill_manual(values = fill_color)+
scale_x_continuous(breaks = x_bin_w, labels = function(x) sprintf("%.2f", x), expand = c(0.02,0.02))+
scale_y_continuous(limits = c(0, 35), breaks = seq(0, 35, 1), expand = c(0,0))+
labs(x = "displ", y = "Frequency", title = "Stacked Histogram of displ by class")+
theme_bw()+
theme(aspect.ratio = 1/2,
axis.text.x = element_text(size = 6),
axis.text.y = element_text(size = 8),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "gray"),
legend.position = c(0.9, 0.65),
plot.title = element_text(hjust = 0.5))
分类变量的直方图显示该变量的频率分布。通过为条形着色,您可以可视化与表示颜色的另一个分类变量相关的分布。
Histogram for Categorical Variable
不用看了,python这个图有极大的问题: 1.
离散数据不能画直方图,只能画条形图 2.
用这两条代码,可以看出教程中的图(用Python再画出来的还是正确的)和数据根本不对应:df[manufacturer == "lincoln"];df[manufacturer == "mercury"]
fill_color <- c("#9e013d", "#ec5b43", "#ffbe6c","#fdffb6", "#bce69f", "#57adab", "#6153a6")
df <- ggplot2::mpg %>% data.table() %>% .[,manufacturer:=factor(manufacturer, levels = sort(unique(mpg$manufacturer)))]
ggplot(df, aes(manufacturer, fill = class))+
geom_bar(color = "white", position = position_stack(reverse = T))+
scale_fill_manual(values = fill_color)+
scale_y_continuous(limits = c(0, 40), breaks = seq(0, 40, 5), expand = c(0,0))+
labs(x = "manufacturer", y = "Frequency", title = "Stacked Histogram of manufacturer by class")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
panel.grid = element_blank(),
legend.position = c(0.55, 0.66),
plot.title = element_text(hjust = 0.5))
Density Plot
密度图是可视化连续变量分布的常用工具。通过按“响应”变量对它们进行分组,可检查X和Y之间的关系。下图描述里程的分布如何随气缸数量而变化。
d_color <- c("green", "deeppink", "dodgerblue", "orange")
df <- ggplot2::mpg %>% data.table() %>% .[, cyl:=factor(cyl, levels = sort(unique(mpg$cyl)))]
ggplot(df, aes(cty, group = cyl, fill = cyl, color = cyl))+
geom_density(alpha = 0.5)+
scale_fill_manual(values = d_color, labels = paste0("Cyl = ", levels(df$cyl)))+
scale_color_manual(values = d_color, labels = paste0("Cyl = ", levels(df$cyl)))+
scale_x_continuous(limits = c(5, 40), breaks = seq(5, 35, 5), expand = c(0,0))+
scale_y_continuous(limits = c(0, 0.75), breaks = seq(0, 0.7, 0.1), expand = c(0,0))+
guides(fill = guide_legend(title = NULL), color = guide_legend(title = NULL))+
labs(x = "Cty", y = "Density", title = "Density Plot of City Mileage by n_Cylinders")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.text = element_text(size = 10),
legend.position = c(0.9, 0.8),
plot.title = element_text(hjust = 0.5))
Density Curves with Histogram
带有直方图的密度曲线将两个图传达的集体信息汇集在一起。
library(stringr)
c_color <- c("dodgerblue", "orange", "green4")
s_class <- c("compact", "suv", "minivan")
df <- ggplot2::mpg %>% data.table() %>% .[class %in% s_class, ] %>% .[,class:=factor(class, levels = s_class)]
# 这个图与python不同的是,所有的geom_histogram以同样的x轴标度进行划分,因此前后柱子是完全重叠的
ggplot()+
geom_histogram(data = df, aes(cty, y = after_stat(density), fill = class), color = "white", bins = 28, position = position_identity(), alpha = 0.5)+
geom_density(data = df, aes(x = cty, group = class, color = class), linewidth = 1)+
scale_color_manual(values = c_color, labels = str_to_title(s_class))+
scale_fill_manual(values = c_color, labels =str_to_title(s_class))+
scale_x_continuous(limits = c(5, 40), breaks = seq(5, 35, 5), expand = c(0,0))+
scale_y_continuous(limits = c(0, 0.35), breaks = seq(0, 0.35, 0.05), expand = c(0,0))+
guides(fill = guide_legend(title = NULL), color = guide_legend(title = NULL))+
labs(x = "Cty", y = NULL, title = "Density Plot of City Mileage by Vehicle Type")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.text = element_text(size = 13),
legend.position = c(0.8, 0.8),
plot.title = element_text(hjust = 0.5))
Joy Plot(山脊图/嵴线图/峰峦图)
嵴线图允许不同组的密度曲线重叠,这是可视化大量组彼此之间分布的好方法。它看起来赏心悦目,并清楚地传达了正确的信息。
这个图也与python的有所差别,添加了标签,自定义了核密度估计的带宽。python的带宽我有点不明白,上面的带宽貌似很小,下面的又好像很大。
library(ggridges)
library(stringr)
c_color <- c("dodgerblue", "orange")
d_class <- ggplot2::mpg$class %>% unique() %>% sort(decreasing = T)
df <- mpg %>% data.table() %>%
.[,c("hwy", "cty", "class")] %>%
.[,class:=factor(class, levels = d_class)] %>%
melt(id.vars = "class")
# geom_density_ridge函数,没有2是开放图,有2的是密度线和坐标轴都有实线,相当于围起来了
ggplot(df, aes(x = value, y = class, fill = variable))+
geom_density_ridges2(aes(scale = 3), bandwidth = 0.5)+ #aes中scale是纵向图的长度,bandwidt是核密度的带宽
scale_fill_manual(values = c_color, labels =str_to_title(s_class))+
scale_x_continuous(limits = c(5, 50), breaks = seq(10, 50, 10), expand = c(0,0))+
scale_y_discrete(expand = c(0, 0, 0.4, 0))+ #expand四个元素的向量,分别为下,下,上,上
guides(fill = guide_legend(title = NULL))+
labs(x = NULL, y = NULL, title = "Joy Plot of City and Highway Mileage by Class")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
legend.text = element_text(size = 13),
legend.position = c(0.85, 0.88),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Distributed Dot Plot
分布点图显示按组分割的点的单变量分布。点越暗,该区域的数据点集中度越高。通过以不同的方式着色中位数,组的真实位置立即变得明显。
library(stringr)
df <- ggplot2::mpg %>% data.table() %>% .[,c("manufacturer", "cty")]
df_median <- df[, lapply(.SD, median), .SDcols = "cty", by = "manufacturer"] %>% setorder(-cty)
manu_order <- df_median[["manufacturer"]] %>% factor(., levels = .)
df[, manufacturer:=factor(manufacturer, levels = manu_order)]
df_median[, manufacturer:=factor(manufacturer, levels = manu_order)]
ggplot(df, aes(x = cty, y = manufacturer))+
geom_linerange(data = df_median, aes(x = NULL, xmin = 0, xmax = 40), linewidth = 0.1, color = "gray50")+
geom_point(shape = 21, size = 3, fill = "white", stroke = 0.05)+
geom_point(data = df_median, aes(cty, manufacturer), shape = 21, size = 3, fill = "red2", stroke = 0.05)+
geom_point(data = NULL, aes(34.5, 15.5), shape = 21, size = 3, fill = "red2", stroke = 0.05)+
geom_text(data = NULL, aes(39.8, 15.5), label = "Median", size = 4, hjust = 1)+
geom_text(data = NULL, aes(39.8, 14.5), label = "red dots are the median", color = "red2", size = 3, fontface = "italic", hjust = 1, vjust = 0)+
scale_x_continuous(limits = c(0, 40), breaks = seq(5, 40, 5), expand = c(0,0))+
scale_y_discrete(expand = c(0.02, 0.02, 0.05, 0.05), labels = str_to_title(manu_order))+
labs(x = "Miles Per Gallon (City)", y = NULL, title = "Distribution of City Mileage by Make")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
legend.text = element_text(size = 13),
legend.position = c(0.6, 0.7),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Box Plot
箱形图是可视化分布的好方法,牢记中位数、第25个第75个四分位数和异常值。同时,需要小心解释框的大小,这可能会扭曲该组中包含的观测值个数,可在图中标记数字。
原版有问题,
标记的观测值数量和箱型图不对应,
y轴限制小于图像内容,有3个点没画出来
df_class <- ggplot2::mpg[["class"]] %>% unique %>% sort()
df_color <- c("#3274a1", "#e4802d", "#389538", "#bf3d3e", "#9372b2", "#825d53", "#d785bc")
df <- ggplot2::mpg %>% data.table() %>% .[,c("class", "hwy")] %>% .[, class:=factor(class, levels = df_class)]
df_count <- df[, length(hwy), by = "class"] %>%
.[, y:=df[, lapply(.SD, median), .SDcols = 2, by = class]$hwy] %>%
.[, l:=paste0("#obs: ", V1)] %>%
.[, class:=factor(class, levels = df_class)]
ggplot(df, aes(x = class, y = hwy, fill = class))+
stat_boxplot(geom = "errorbar", width = 0.5)+
geom_boxplot(show.legend = F)+
geom_text(data = df_count, aes(class, y, label = l), color = "white", vjust = -0.3, size = 2.5)+
scale_fill_manual(values = df_color)+
scale_y_continuous(limits = c(10, 45), breaks = seq(10,45,5), expand = c(0,0))+
labs(x = "class", y = "hwy", title = "Box Plot of Highway Mileage by Vehicle Class")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Dot + Box Plot
点+箱线图传达的信息与分组分割的箱线图类似。此外,这些点还可以了解每个组中有多少个数据点。
df_class <- ggplot2::mpg[["class"]] %>% unique %>% sort()
df_color <- c("#3274a1", "#e4802d", "#389538", "#bf3d3e")
df_line <- data.table(x = seq(1.5, 6.5, 1), ymin = 10, ymax = 45)
df <- ggplot2::mpg %>% data.table() %>%
.[,c("class", "hwy", "cyl")] %>%
.[, c("class", "cyl"):=list(factor(class, levels = df_class), factor(cyl, levels = c(4,5,6,8)))]
ggplot(df, aes(x = class, y = hwy, fill = cyl))+
stat_boxplot(geom = "errorbar", position = position_dodge(width = 1), width = 0.3, linewidth = 0.3)+
geom_boxplot(width = 0.5, position = position_dodge(width = 1), linewidth = 0.3)+
geom_jitter(aes(color = cyl), height = 0, width = 0.2, size = 1, alpha = 0.5, show.legend = F)+
geom_linerange(data = df_line, aes(x = x, ymin = ymin, ymax = ymax, y = NULL, fill = NULL), color = "gray", linewidth = 0.2)+
scale_fill_manual(values = df_color)+
scale_color_manual(values = df_color)+
scale_y_continuous(limits = c(7, 48), breaks = seq(10,45,5), expand = c(0,0))+
guides(fill = guide_legend(keyheight = unit(0.4, "cm"), keywidth = unit(0.5, "cm"), title = "Cylinders"))+
labs(x = "class", y = "hwy", title = "Box Plot of Highway Mileage by Vehicle Class")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.title = element_text(size = 9),
legend.text = element_text(size = 8),
legend.position = c(0.08, 0.15),
plot.title = element_text(hjust = 0.5))
Violin Plot
小提琴图是箱形图的视觉上令人愉悦的替代方案。小提琴的形状或面积取决于它所持有的观察次数。然而,小提琴图可能更难阅读,并且在专业环境中并不常用。
df_class <- ggplot2::mpg[["class"]] %>% unique %>% sort()
df_color <- c("#3274a1", "#e4802d", "#389538", "#bf3d3e", "#9372b2", "#825d53", "#d785bc")
df <- ggplot2::mpg %>% data.table() %>% .[,c("class", "hwy")] %>% .[, class:=factor(class, levels = df_class)]
# 画图,图形胖瘦有区别、分位线格式有区别(这个不知道怎么改)
ggplot(df, aes(x = class, y = hwy, fill = class))+
geom_violin(trim = F, draw_quantiles = c(0.25, 0.5, 0.75), show.legend = F)+
scale_fill_manual(values = df_color)+
scale_y_continuous(limits = c(5, 55), breaks = seq(10,50,5), expand = c(0,0))+
labs(x = "class", y = "hwy", title = "Violin Plot of Highway Mileage by Vehicle Class")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Population Pyramid
人口金字塔可用于显示按体积排序的组的分布。或者,它也可以用来显示人口的分阶段过滤,如下所述,它用于显示有多少人通过营销漏斗的每个阶段。
library(stringr)
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
df <- fread("data8_email.csv") %>% setorder(Gender, Users) %>% .[c(4:39),]
df[, index:=factor(c((nrow(df)/2):1, 1:(nrow(df)/2)))]
df_l <- df[(nrow(df)/2):1,]$Stage
# 定义一个标签转化函数
lab_tran <- function(x){
y <- sprintf("%.1e", x) %>% str_remove("\\+0")
return(y)
}
ggplot(df, aes(y = index, color = Gender))+
geom_linerange(aes(xmin = Users, xmax = 0), linewidth = 5)+
scale_color_manual(values = c("Male" = "red4", "Female" = "blue4"))+
scale_x_continuous(labels = lab_tran, breaks = seq(-1.5e+7, 1.5e+7, 0.5e+7))+
scale_y_discrete(labels = df_l)+
guides(color = guide_legend(title = NULL))+
labs(x = "User", y = NULL, title = "Population Pyramid of the Marketing Funnel")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(face = "italic"),
panel.grid = element_blank(),
legend.position = c(0.88, 0.85),
plot.title = element_text(hjust = 0.5))
library(stringr)
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/email_campaign_funnel.csv")
df <- fread("data8_email.csv") %>% setorder(Gender, Users) %>% .[c(4:39),]
df[, index:=factor(c((nrow(df)/2):1, 1:(nrow(df)/2)))]
df_l <- df[(nrow(df)/2):1,]$Stage
# 定义一个标签转化函数
lab_tran <- function(x){
y <- sprintf("%.1e", x) %>% str_remove("\\+0")
return(y)
}
ggplot(df, aes(x = index, y = Users, fill = Gender))+
geom_bar(stat = "identity")+
scale_fill_manual(values = c("Male" = "red4", "Female" = "blue4"))+
coord_flip()+ #这里对掉了xy轴,则后续scale中的x和y仍按对调前的来,无论在这行代码的前还是后
scale_y_continuous(labels = lab_tran, breaks = seq(-1.5e+7, 1.5e+7, 0.5e+7))+
scale_x_discrete(labels = df_l)+
guides(fill = guide_legend(title = NULL))+
labs(y = "User", x = NULL, title = "Population Pyramid of the Marketing Funnel")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(face = "italic"),
panel.grid = element_blank(),
legend.position = c(0.88, 0.85),
plot.title = element_text(hjust = 0.5))
Categorical Plots
条形分类图可用于可视化2个或多个分类变量彼此相关的计数分布。
# 数据来源,python中seaborn库的titanic数据集
# sns.load_dataset("titanic")
df <- fread("data9_titanic.csv") %>% .[,c("alive", "deck")] %>% .[deck !="",]
ggplot(df, aes(alive, fill = alive))+
geom_bar(stat = "count", show.legend = F)+
geom_linerange(data = NULL, aes(x = 0, ymin = -Inf, ymax = Inf, fill = NULL), linewidth = 0.2)+
geom_linerange(data = NULL, aes(y = 0, xmin = -Inf, xmax = Inf, fill = NULL), linewidth = 0.2)+
facet_wrap(deck~., nrow = 2, labeller = labeller(.default = function(x) paste0("deck = ", x)))+
scale_fill_manual(values = c("yes" = "#3174a2", "no" = "#b5c7e1"))+
scale_y_continuous(limits = c(0,35), breaks = seq(0,30,10), expand = c(0,0))+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
strip.background = element_rect(fill = F, color = F),
plot.title = element_text(hjust = 0.5))
# 数据来源,python中seaborn库的titanic数据集
# sns.load_dataset("titanic")
df <- fread("data9_titanic.csv") %>%
.[,c("age", "sex", "embark_town", "class")] %>%
na.omit() %>%
.[embark_town != "", ] %>%
.[,embark_town := factor(embark_town, levels = c("Queenstown", "Cherbourg", "Southampton"))]
# 注意,class First和Second,在Queenstown,男女都只有一个数据,python将画一条线,R中则排除了
ggplot(df, aes(age, embark_town, fill = sex))+
geom_violin(trim = T, orientation = "y")+
geom_linerange(data = NULL, aes(x = -5, ymin = -Inf, ymax = Inf, fill = NULL), linewidth = 0.2)+
geom_linerange(data = NULL, aes(y = 0, xmin = -Inf, xmax = Inf, fill = NULL), linewidth = 0.2)+
facet_wrap(class~., labeller = labeller(.default = function(x) paste0("class = ", x)))+
scale_fill_manual(values = c("male" = "#3273a2", "female" = "#e38227"))+
scale_x_continuous(limits = c(-5,80), breaks = seq(0, 80, 20), expand = c(0,0))+
theme_bw()+
theme(aspect.ratio = 1/1,
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
strip.background = element_rect(fill = F, color = F),
panel.spacing = unit(2, "lines"),
plot.title = element_text(hjust = 0.5))
Waffle Chart
用于显示较大总体中组的组成。
第三方包,独立于ggplot,方便,可调参数少
library(waffle)
df_color <- c("#faffa5", "#fbc226", "#f67f15", "#d5483f", "#9e2a61", "#62176b", "#270c52")
df <- ggplot2::mpg %>% data.table() %>% .[,c("class")] %>% table()
df_value <- as.vector(df)
names(df_value) <- names(df)
waffle(df_value, colors = df_color, rows = 7, title = "# Vehicles by Class")
第三方包,基于ggplot,可调参多
# devtools::install_github("liamgilbey/ggwaffle")
library(ggwaffle)
df_color <- c("#faffa5", "#fbc226", "#f67f15", "#d5483f", "#9e2a61", "#62176b", "#270c52")
df <- ggplot2::mpg %>% waffle_iron(aes_d(group =class), rows = 7)
# 注意guide_legend默认单位是line,因此keyheight和keywidth如果仅为数值型,相同数值长宽也不一样
ggplot(data = df, aes(x, y, fill = group))+
geom_waffle()+
coord_equal()+
scale_fill_manual(values = df_color)+
guides(fill = guide_legend(keyheight = unit(0.4, "cm"), keywidth = unit(0.4, "cm"), title = NULL))+
labs(x = NULL, y = NULL, title = "# Vehicles by Class")+
theme_bw()+
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
df_color <- c("#faffa5", "#fbc226", "#f67f15", "#d5483f", "#9e2a61", "#62176b", "#270c52")
df <- ggplot2::mpg
num <- nrow(df)
df.row <- 7
df.col <- ceiling(num/df.row)
df.plot <- data.table(x=rep(1:df.col, each=df.row), y=rep(1:df.row, df.col)) %>%
.[1:num, ] %>%
.[, group:=sort(df$class)]
ggplot(data = df.plot, aes(x, y, fill = group))+
geom_tile(color="white", size=2)+ #size是边框的宽度
scale_fill_manual(values = df_color)+
guides(fill = guide_legend(keyheight = unit(0.4, "cm"), keywidth = unit(0.4, "cm"), title = NULL))+
labs(x = NULL, y = NULL, title = "# Vehicles by Class")+
theme_bw()+
theme(aspect.ratio = df.row/df.col,
axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
library(ggwaffle)
library(patchwork)
# 注意图3,原版python标签没显示全,
# 自定义主题
theme_wa <- function(...){
theme_bw()+
theme(axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
legend.text = element_text(size = 7),
legend.title = element_text(size = 8),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5, size = 9))
}
# 图1
df_color1 <- c("#8dd3c7", "#ffffb4", "#fb8073", "#fdb45f", "#b2df67", "#d9d9d9", "#cbebc5")
df1 <- ggplot2::mpg %>% data.table() %>% waffle_iron(aes_d(group = class), rows = 9)
p1 <- ggplot(data = df1, aes(x, y, fill = group))+
geom_waffle()+
coord_equal()+
scale_fill_manual(values = df_color1)+
guides(fill = guide_legend(keyheight = unit(0.3, "cm"), keywidth = unit(0.3, "cm"), title = "Class"))+
labs(x = NULL, y = NULL, title = "# Vehicles by Class")+
theme_wa()
# 图2
df_color2 <- c("#9e0340", "#f88d52", "#feffbe", "#86cfa6")
df2 <- ggplot2::mpg %>% data.table() %>% .[,cyl:=factor(cyl)] %>% waffle_iron(aes_d(group = cyl), rows = 9)
p2 <- ggplot(data = df2, aes(x, y, fill = group))+
geom_waffle()+
coord_equal()+
scale_fill_manual(values = df_color2)+
guides(fill = guide_legend(keyheight = unit(0.3, "cm"), keywidth = unit(0.3, "cm"), title = "Cyl"))+
labs(x = NULL, y = NULL, title = "# Vehicles by Cyl")+
theme_wa()
# 图3
df_color3 <- c("#3b3875", "#5554a1", "#6c6fc3", "#617439", "#8ba055", "#8a9f58", "#bacf70", "#856e39", "#baa038", "#813f32", "#ae4a46", "#cf6571", "#7c406e", "#a35192", "#ce6dbd")
df3 <- ggplot2::mpg %>% data.table() %>% waffle_iron(aes_d(group = manufacturer), rows = 9)
p3 <- ggplot(data = df3, aes(x, y, fill = group))+
geom_waffle()+
coord_equal()+
scale_fill_manual(values = df_color3)+
guides(fill = guide_legend(keyheight = unit(0.3, "cm"), keywidth = unit(0.3, "cm"), title = "Manufacturer", ncol = 3))+
labs(x = NULL, y = NULL, title = "# Vehicles by Make")+
theme_wa()
# 汇总,三个图在一张图中,因此比较小
p1+p2+p3+ plot_layout(ncol = 1)
Pie Chart
饼图是显示组成的经典方法。但是现在通常不建议使用,因为饼部分的面积有时会产生误导。因此,如果要使用饼图,强烈建议明确写下饼图每个部分的百分比或数字。
df_color <- c("#1f76b6", "#ff7d0e", "#2ca12c", "#d72528", "#9665bc", "#8d564b", "#e376c4")
df_class <- ggplot2::mpg$class %>% unique() %>% sort()
# 生成数据:频数
df <- ggplot2::mpg %>%
.[,c("class")] %>%
table() %>%
data.table() %>%
.[,class:=factor(class, levels = df_class)]
# 生成数据:占比和位置
# 注意lab的位置在每一类的中间,如果直接用占比+position_fill(reverse = T),lab将出现在每一类结尾
df[, prob:=N/sum(df$N)]
df[, lab_y:=cumsum(df$prob) - 0.5*df$prob]
# bar要用position_fill(reverse = T),因为默认的堆叠方向,和text的方向是相反的
# 将bar横坐标定义为1,然后将bar宽度定义为1,则bar的范围为0.5-1.5,此时text横坐标为1.75,就把text标记在pie外侧了
ggplot(df, aes(x = 1, y = prob, fill = class))+
geom_bar(stat = "identity", position = position_fill(reverse = T), width = 1, show.legend = F)+
geom_text(aes(x = 1.75, y = lab_y, fill = NULL, label = class))+ #对齐方式最好用中间,然后调整x值实现,因为极坐标后有点乱
coord_polar(theta = "y", direction = -1, start = pi*1.5)+ #1为原bar从下至上顺时针
scale_fill_manual(values = df_color)+
labs(x = NULL, y = NULL, title = "Pie Chart of Vehicle Class - Bad")+
theme_bw()+
theme(aspect.ratio = 1/1,
axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
ggplot包对pie的可调选项较多,但是没有explode选项
df_color <- c("#1a9e76", "#d95d03", "#7370ae", "#e9278c", "#66a81e", "#e4ac0a", "#a57718")
df_class <- ggplot2::mpg$class %>% unique() %>% sort()
df <- ggplot2::mpg %>%
.[,c("class")] %>%
table() %>%
data.table() %>%
.[,class:=factor(class, levels = df_class)]
# 生成数据:占比和位置
#注意lab的位置在每一类的中间,如果直接用占比+position_fill(reverse = T),lab将出现在每一类结尾
df[, prob:=N/sum(df$N)]
df[, lab_y:=cumsum(df$prob) - 0.5*df$prob]
df[, lab_t:=sprintf("%.1f%%(%d)",prob*100,N)]
# bar要用position_fill(reverse = T),因为默认的堆叠方向,和text的方向是相反的
# 将bar横坐标定义为1,然后将bar宽度定义为1,则bar的范围为0.5-1.5,此时text横坐标为1.65,就把text标记在pie外侧了
ggplot(df, aes(x = 1, y = prob, fill = class))+
geom_bar(stat = "identity", position = position_stack(reverse = T), width = 1)+
geom_text(aes(x = 1, y = lab_y, fill = NULL, label = lab_t), color = "white", size = 3)+
coord_polar(theta = "y", direction = -1, start = pi*0.27)+ #1为原bar从下至上顺时针
scale_fill_manual(values = df_color)+
labs(x = NULL, y = NULL, fill = "Vehicle Class", title = "Class of Vehicles: Pie Chart")+
theme_bw()+
theme(aspect.ratio = 1/1,
axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
ggforce包有explode,一般情况下,使用geom_arc_bar,stat = “pie”,此时不能调节旋转方向和角度
library(ggforce)
df <- ggplot2::mpg %>%
.[,c("class")] %>%
table() %>%
data.table() %>%
.[,class:=factor(class, levels = df_class)]
df[, prob:=N/sum(df$N)]
df[, explo:=c(0,0,0,0,0,0.1,0)] #explode,标记部分突出的大小
df[, lab_t:=sprintf("%.1f%%(%d)",prob*100,N)] #要标记的文本
df[, lab_y:=(cumsum(df$prob) - 0.5*df$prob)*2*pi] #文本中心位置,射线角度
df[, lab_s:=c(0.5,0.5,0.5,0.5,0.5,0.55,0.5)] #文字尺度,即半径的一半
ggplot(data = df) +
geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0, r = 1, amount = N, fill = class, explode = explo), stat = 'pie', color = F) +
geom_text(aes(x = lab_s * sin(lab_y), y = lab_s * cos(lab_y), label = lab_t, hjust = 0.5, vjust = 0.5), color = "white", size = 3)+
scale_fill_manual(values = df_color)+
coord_fixed()+
labs(x = NULL, y = NULL, fill = "Vehicle Class", title = "Class of Vehicles: Pie Chart")+
theme_bw()+
theme(aspect.ratio = 1/1,
axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
定义了一个pie数据生成函数,所有参数可调,可生成与python完全一样的图
library(ggforce)
# 定义一个饼图数据生成函数
piedata <- function(data, r_start = 0, r_end = 1, lab_r = (r_start+r_end)*0.7,
radian = 0, direction = 1,
e_index = NA_integer_, e_scale = 0.1, model = "number"){
# 定义一个将绘图数据转化为pie图数据的函数
# 注意所有参数并无可行性检查,因此输入错误没有自定义stop功能,要自行保证输入正确参数
# 自定义选项包括:
# data:以data.table类型存储的数据,仅有两列,一列为class,另一列为数据
# model:转化方式,为"number"时,data格式为类+数量,或"prob"时,data为类+占比
# r_start:半径起始长度,默认为0
# r_end:半径终止长度,默认为1,起始0终止1即为整个扇形区域
# lab_r:标签所在位置半径长度,默认为0.7倍扇形半径
# radian:扇形起始的附加弧度,默认为0弧度
# direction:排列方向,默认为1顺时针
# e_index:突出显示的类,默认无突出
# e_scale:突出显示尺度,默认为0.1,即突出长度为e_scale*r_end
# 读取和定义数据
setDT(data)
class_name <- data[[1]]
class_numb <- length(class_name)
# 生成class和prob列
if(model == "number"){
names(data) <- c("class", "numb")
da <- data %>%
.[,class:=factor(class, levels = class_name)] %>%
.[,prob:=numb/sum(data$numb)]
# .[,c(1,3)]
} else{
da <- data %>%
set_colnames(c("class", "prop")) %>%
.[,class:=factor(class, levels = class_name)]
}
# 生成圆心、半径绘制范围
da[,c("x_0", "y_0"):=list(0,0)]
da[,c("r_start", "r_end"):=list(r_start,r_end)]
# 生成每一类扇形的起始弧度(包括附加弧度和排列方向)
da_prob <- da$prob
rad_s <- c(0, cumsum(da_prob))[1:class_numb]*2*pi + radian #定义扇形起始弧度
rad_e <- cumsum(da_prob)*2*pi + radian #定义扇形终止弧度
if(direction == -1) {rad_s <- rad_s*-1; rad_e <- rad_e*-1} #定义旋转方向后,弧度的变化
da[,c("rad_start", "rad_end"):=list(rad_s, rad_e)]
# 设置标签的位置、内容
rad_mean <- (rad_s+rad_e)/2 #扇形中心弧度
da[,c("lab_x", "lab_y"):=list(lab_r*sin(rad_mean), lab_r*cos(rad_mean))] #定义标签xy坐标
da[,lab_t:= sprintf("%.1f%%(%d)", prob*100, numb)] #定义标签内容,要改在这该吧
# 对有explode项的x0,y0,lab_x,lab_y进行修改
if(!is.na(e_index)[1]){
e_value <- e_scale*r_end #explode的数值,是e_csale乘以半径r
x_expl <- e_value*sin(rad_mean[e_index])
y_expl <- e_value*cos(rad_mean[e_index])
da[e_index, c("x_0","y_0"):=list(x_expl,y_expl)]
lab_x_e <- (lab_r+e_value)*sin(rad_mean[e_index])
lab_y_e <- (lab_r+e_value)*cos(rad_mean[e_index])
da[e_index, c("lab_x","lab_y"):=list(lab_x_e,lab_y_e)]
}
# 结束,输出数据
return(da)
}
# 生成参数
df_color <- c("#1a9e76", "#d95d03", "#7370ae", "#e9278c", "#66a81e", "#e4ac0a", "#a57718")
df_class <- ggplot2::mpg$class %>% unique() %>% sort()
# 生成数据:频数格式
df <- ggplot2::mpg %>%
.[,c("class")] %>%
table() %>%
data.table() %>%
.[,class:=factor(class, levels = df_class)]
# 生成饼图数据
df_pie <- piedata(df, r_start = 0, r_end = 1, lab_r = 0.7, radian = pi*0.3, direction = -1, e_index = 6, e_scale = 0.1, model = "number")
ggplot(df_pie) +
geom_arc_bar(aes(x0 = x_0, y0 = y_0, r0 = r_start, r = r_end, start = rad_start, end = rad_end, fill = class), stat="arc_bar", color = F) +
geom_text(aes(x = lab_x, y = lab_y, label = lab_t), color = "white", size = 3.5)+
scale_fill_manual(values = df_color)+
scale_x_continuous(limits = c(-1.2, 1.2), expand = c(0,0)) +
scale_y_continuous(limits = c(-1.2, 1.2), expand = c(0,0)) +
labs(x = NULL, y = NULL, fill = "Vehicle Class", title = "Class of Vehicles: Pie Chart")+
coord_fixed()+
theme_bw()+
theme(aspect.ratio = 1/1,
axis.ticks = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Treemap
树形图类似于饼图,它可以在不误导每个组的贡献的情况下更好地工作。
# treemapify基于ggplot包,但是可调参数不多,另有单独包treemap,但是和ggplot没关系了
library(treemapify)
df_color <- c("#b34168", "#e9766f", "#feb87d", "#feedaf", "#f0f9b6", "#b6e1b7", "#6fb4c1")
df_class <- ggplot2::mpg$class %>% unique() %>% sort()
# 生成数据:频数格式
df <- ggplot2::mpg %>%
.[,c("class")] %>%
table() %>%
data.table() %>%
.[,class:=factor(class, levels = df_class)]
df[, lab:=sprintf("%s\n(%d)", class, N)]
# 画图,这个图与python的排列方式不同,用layout可改,但是很难和python的完全一致,
ggplot(df, aes(area = N, fill = class, label = lab, subgroup = class)) +
geom_treemap(start = "topright", layout = "squarified", show.legend = F)+
geom_treemap_text(start = "topright", place = "center", reflow = T, size = 13)+
geom_treemap_subgroup_border(start = "topright", color = "white", size = 1, show.legend = F)+
scale_fill_manual(values = df_color)+
labs(x = NULL, y = NULL, title = "Treemap of Vechile Class")+
theme_bw()+
theme(aspect.ratio = 1/1.3,
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Bar Chart
条形图是基于计数或任何给定指标可视化项目的经典方法。
# 生成参数
df_color <- c("#02008b", "#add8e7", "#cc8540", "#6b8f23", "#eee8ab", "#fff1f6", "#7f007f",
"#76889b", "#a52a2b", "#abdae3", "#000000", "#008080", "#ff6247", "#dcdcdc", "#a92121")
df_manufacturer <- ggplot2::mpg$manufacturer %>% unique() %>% sort()
# 生成数据:频数格式
df <- ggplot2::mpg %>%
.[,c("manufacturer")] %>%
table() %>%
data.table() %>%
.[,manufacturer:=factor(manufacturer, levels = df_manufacturer)]
df[, lab:=sprintf("%.1f", N)]
ggplot(df, aes(manufacturer, N, label = lab, fill = manufacturer))+
geom_bar(stat = "identity", width = 0.5, show.legend = F)+
geom_text(vjust = -0.2, size = 3)+
scale_fill_manual(values = df_color)+
scale_y_continuous(limits = c(0, 46), breaks = seq(0, 45, 5), expand = c(0,0))+
labs(x = NULL, y = "# Vehicles", title = "Number of Vehicles by Manaufacturers")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Time Series Plot
时间序列图用于可视化给定指标如何随时间变化。在这里,您可以看到1949年至1969年间航空客运量的变化。
# 读取数据
# df <- fread('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
df <- fread("data10_passenger.csv") %>% .[,date:=as.POSIXct(date)]
# 起始日期
date_s <- as.POSIXct("1949-01-01", tz = "UTC")
date_e <- as.POSIXct("1960-12-01", tz = "UTC")
# 画图
ggplot(df, aes(date, value, group = 1))+
geom_line(color = "red3")+
scale_x_datetime(date_labels = "%Y",
breaks = seq(date_s, date_e, by = "1 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(limits = c(50, 750), breaks = seq(100, 700, 100), expand = c(0,0))+
labs(x = NULL, y = NULL, title = "Air Passengers Traffic (1949 - 1969)")+
theme_bw()+
theme(aspect.ratio = 1/2,
axis.ticks = element_blank(),
axis.line = element_line(color = "gray60"),
panel.grid.major = element_line(color = "gray90", linewidth = 0.15),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Time Series with Peaks and Troughs Annotated
时间序列绘制了所有的波峰和波谷,并注释了所选特殊事件的发生。
# 设置日期显示为英文格式
Sys.setlocale(category = "LC_TIME", locale = "English") #显示英文日期名
## [1] "English_United States.1252"
# 读取数据
# df <- fread('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
df <- fread("data10_passenger.csv") %>% .[,date:=as.POSIXct(date)]
# 生成最值点
df_v <- df$value
df_p <- df[which(diff(sign(diff(df_v))) == -2)+1, ] %>% .[,type:="Peaks"]
df_t <- df[which(diff(sign(diff(df_v))) == 2)+1,] %>% .[,type:="Troughs"]
df_point <- rbind(df_p, df_t)
# 生成标记文本和要标记的点
p_show <- c(1,4,7,10,13,16,29,34,39,44,49,54)
df_point[, lab:=format(date, "%b %Y")]
df_point[p_show, show:=T]
df_peak_lab <- df_point[show==TRUE&type=="Peaks", ]
df_trou_lab <- df_point[show==TRUE&type=="Troughs", ]
# 生成起始时间
date_s <- as.POSIXct("1949-01-01", tz = "UTC")
date_e <- as.POSIXct("1960-12-01", tz = "UTC")
# 注意,一些看似是最值点但是并没有标出的,是因为最值处有两个以上的相同值,例如1950-07-01的170和1950-08-01的170,1951-07-01的199和1951-08-01的199
ggplot(df, aes(date, value, group = 1))+
geom_line(aes(linetype = "Air Traffic"), color = "#6098c4")+
geom_point(data = df_point, aes(date, value, fill = type, shape = type), color = "transparent")+
geom_text(data = df_peak_lab, aes(date, value, label=lab), color = "green4", vjust = -0.5, size = 2.5)+
geom_text(data = df_trou_lab, aes(date, value, label=lab), color = "red2", vjust = 1.5, size = 3)+
scale_linetype_manual(values = 1)+
scale_shape_manual(values = c("Peaks" = 24, "Troughs" = 25))+
scale_fill_manual(values = c("Peaks" = "green4", "Troughs" = "red2"))+
guides(linetype = guide_legend(title = NULL, order = 1),
shape = guide_legend(title = NULL, order = 2),
fill = guide_legend(title = NULL, order = 2))+
scale_x_datetime(date_labels = "%Y",
breaks = seq(date_s, date_e, by = "1 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(limits = c(50, 750), breaks = seq(100, 700, 100), expand = c(0,0))+
labs(x = NULL, y = NULL, title = "Air Passengers Traffic (1949 - 1969)")+
theme_bw()+
theme(aspect.ratio = 1/2,
legend.spacing.y = unit(0, "cm"),
legend.margin = margin(), #图例margin默认上下左右都是0
legend.position = c(0.1, 0.8),
axis.ticks = element_blank(),
axis.line = element_line(color = "gray60"),
panel.grid.major = element_line(color = "gray90", linewidth = 0.15),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
plot.title = element_text(hjust = 0.5))
Autocorrelation (ACF) and Partial Autocorrelation (PACF) Plot
ACF图显示了时间序列与其自身滞后的相关性。每条垂直线(在自相关图上)表示序列与其从滞后 0 开始的滞后之间的相关性。图中的蓝色阴影区域是显著性水平。那些位于蓝线以上的滞后是显著滞后。另一方面,PACF显示了任何给定滞后(时间序列)与当前序列的自相关,但删除了滞后之间的贡献。
下图中,对于AirPassengers,我们看到多达14个滞后越过了蓝线,因此意义重大。这意味着,14年前看到的航空客运量对今天看到的客流量有影响。
ggplot中,几种画线段函数
geom_linerange(), aes是x, ymin, ymax; y线段三个,或x线段三个
geom_segment(), aes是x, xend, y, yend,四个都得有
geom_vline(), geom_hline(), aes是xintercept或yintercept
geom_abline(), aes是slope和intercept
# 注意这图和原html中的python的,在相关性计算和显著性水平上,都有所不同
library(patchwork)
# df <- fread('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
df <- fread("data10_passenger.csv") %>% .[,date:=as.POSIXct(date)]
# 计算自相关
acf_res <- acf(df$value, plot = FALSE, lag.max = 50)
acf_df <- data.frame(Lag = acf_res$lag, ACF = acf_res$acf)
acf_ci <- qnorm((1 + 0.95)/2) * sqrt((1 + 2 * cumsum(acf_res$acf^2))/nrow(acf_df))
acf_df$cib = -acf_ci
acf_df$cit = acf_ci
p1 <- ggplot(acf_df, aes(Lag, ACF))+
geom_point(color = "steelblue", size=0.8)+
geom_linerange(aes(x=Lag, ymin=0, ymax=ACF))+
geom_hline(aes(yintercept = 0))+
geom_ribbon(aes(x = Lag, y=NULL, ymin = cib, ymax = cit), fill = "steelblue", alpha = 0.5)+
scale_x_continuous(breaks = seq(0, 50, 10))+
scale_y_continuous(limits = c(-1.5, 1.5), breaks = seq(-1.5, 1.5, by=0.5))+
labs(x=NULL, y=NULL, title = "Autocorrelation")+
theme_bw()+
theme(aspect.ratio = 1/1.2,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
# 计算偏自相关
pacf_res <- pacf(df$value, plot = FALSE, lag.max = 20)
pacf_df <- data.frame(Lag = c(0, pacf_res$lag), PACF = c(1, pacf_res$acf)) #python中有01,这里没有,补一个
pacf_th <- qnorm((1 + 0.95)/2)/sqrt(length(df$value)) #显著性水平95%阈值计算方法
# pacf_th <- 1.96*sqrt(1/length(df$value)) #显著性水平95%阈值,另一种方法
p2 <- ggplot(pacf_df, aes(Lag, PACF))+
geom_point(color = "steelblue", size=0.8)+
geom_linerange(aes(x=Lag, ymin=0, ymax=PACF))+
geom_hline(aes(yintercept = 0))+
geom_rect(aes(xmin=1, xmax=20, ymin=-pacf_th, ymax=pacf_th), fill = "steelblue", alpha=0.015)+
scale_x_continuous(breaks = seq(0,20, 2.5))+
scale_y_continuous(limits = c(-0.75, 1), breaks = seq(-0.75, 1, by=0.25))+
labs(x=NULL, y=NULL, title = "Partial Autocorrelation")+
theme_bw()+
theme(aspect.ratio = 1/1.2,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
# 汇总,两个图在一张图中
p1+p2+plot_layout(nrow = 1)
Cross Correlation plot
互相关图显示了两个时间序列之间的滞后。
# 注意这个图和原python差异较大,暂不知如何计算相同的相关性
# df <- fread('https://github.com/selva86/datasets/raw/master/mortality.csv')
df <- fread('data11_mortality.csv')
df.ccf <- ccf(df$mdeaths, df$mdeaths, lag.max = 200, plot = F)
dt <- data.table(x=c(df.ccf$lag), y=c(df.ccf$acf)) %>% .[x>=0,]
df.th <- 1.96*sqrt(1/nrow(df))
ggplot(dt, aes(x, y))+
geom_linerange(aes(x=x, ymin=0, ymax=y), color="steelblue")+
geom_hline(aes(yintercept = df.th), linewidth = 0.2)+
geom_hline(aes(yintercept = -df.th), linewidth = 0.2)+
geom_hline(aes(yintercept = 0))+
scale_x_continuous(breaks = seq(0,70, 10))+
scale_y_continuous(limits = c(-0.75, 1), breaks = seq(-0.75, 1, by=0.25))+
labs(x=NULL, y=NULL, title = "Cross Correlation Plot: mdeaths vs fdeaths")+
theme_bw()+
theme(aspect.ratio = 1/1.2,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Time Series Decomposition Plot
时间序列分解图显示了时间序列分为趋势、季节和残差分量。
library(forecast)
library(patchwork)
# df <- fread('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
df <- fread("data10_passenger.csv")
# 将日期列转换为日期格式
df$date <- as.Date(df$date, format = "%Y-%m")
# 将数据转换为时间序列格式
ts_data <- ts(df$value, start = c(1949, 1), frequency = 12)
# 对时间序列数据进行分解
decomposed_data <- decompose(ts_data, type = "multiplicative")
# 创建数据框以便于绘图
plot_data <- data.table(
date = as.POSIXct(df$date),
observed = as.numeric(decomposed_data$x),
trend = as.numeric(decomposed_data$trend),
seasonal = as.numeric(decomposed_data$seasonal),
residual = as.numeric(decomposed_data$random)
)
# 生成起始时间
date_s <- as.POSIXct("1949-01-01", tz = "UTC")
date_e <- as.POSIXct("1961-01-01", tz = "UTC")
p1 <- ggplot(plot_data, aes(date, observed))+
geom_line(color="steelblue")+
geom_text(aes(date_s, 580), label="Time Series Decomposition of Air Passengers", hjust=0, size=2.5)+
scale_x_datetime(date_labels = "%Y",
breaks = seq(date_s, date_e, by = "2 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(breaks = seq(200, 600, 200))+
labs(x=NULL, y="Observed")+
theme_bw()+
theme(aspect.ratio = 1/4,
axis.ticks = element_blank(),
axis.text.x = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
p2 <- ggplot(plot_data[!is.na(trend), ], aes(date, trend))+
geom_line(color="steelblue")+
scale_x_datetime(date_labels = "%Y",
breaks = seq(date_s, date_e, by = "2 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(breaks = seq(200, 400, 100))+
labs(x=NULL, y="Trend")+
theme_bw()+
theme(aspect.ratio = 1/4,
axis.ticks = element_blank(),
axis.text.x = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
p3 <- ggplot(plot_data, aes(date, seasonal ))+
geom_line(color="steelblue")+
scale_x_datetime(date_labels = "%Y",
breaks = seq(date_s, date_e, by = "2 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(breaks = seq(0.8, 1.2, 0.2))+
labs(x=NULL, y="Seasonal ")+
theme_bw()+
theme(aspect.ratio = 1/4,
axis.ticks = element_blank(),
axis.text.x = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
p4 <- ggplot(plot_data[!is.na(residual),], aes(date, residual))+
geom_line(color="steelblue")+
scale_x_datetime(date_labels = "%Y",
breaks = seq(date_s, date_e, by = "2 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(breaks = seq(0.9, 1.1, 0.1))+
labs(x=NULL, y="Residual")+
theme_bw()+
theme(aspect.ratio = 1/4,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
# 汇总,两个图在一张图中
p1+p2+p3+p4+plot_layout(ncol = 1)
Multiple Time Series
可在同一图表上绘制多个测量相同值的时间序列。
隐藏legend的几种方法
# 设置日期显示为英文格式
Sys.setlocale(category = "LC_TIME", locale = "English")
## [1] "English_United States.1252"
date_s <- as.POSIXct("1974-01-01", tz = "UTC")
date_n <- as.POSIXct("1979-01-01", tz = "UTC")
date_e <- as.POSIXct("1980-12-01", tz = "UTC")
# df1 <- fread('https://github.com/selva86/datasets/raw/master/mortality.csv')
df1 <- fread("data11_mortality.csv")
df2 <- df1 %>%
.[, date:=paste("01", date) %>% as.POSIXct(format="%d %b %Y", tz = "UTC")] %>%
melt(id="date")
gr <- data.table(date.s=as.POSIXct("1974-01-01", tz = "UTC"),
date.e=as.POSIXct("1980-01-01", tz = "UTC"),
y=seq(500, 2900, 400)) #grid线
dt <- data.table(x=as.POSIXct("1980-01-01", tz = "UTC"),
variable=c("mdeaths", "fdeaths"),
value=c(tail(df1, 1)$mdeaths, tail(df1, 1)$fdeaths)) #标签线
ggplot(df2, aes(date, value))+
geom_line(aes(color=variable), show.legend = F)+
geom_linerange(data=gr, aes(x=NULL, xmin=date.s, xmax=date.e, y=y), color="gray", alpha=0.2)+
geom_text(data=dt, aes(x=x,y=value,color=variable, label=variable), hjust=0, show.legend = F)+
scale_x_datetime(date_labels = "%b %Y",
breaks = seq(date_s, date_n, by = "1 year"),
limits = c(date_s, date_e), expand = c(0.05,0.05))+
scale_y_continuous(breaks = seq(100, 2900, 400), limits = c(100, 3000), expand=c(0,0))+
labs(x=NULL, y=NULL, title="Number of Deaths from Lung Diseases in the UK (1974-1979)")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Plotting with different scales using secondary Y axis
如果要显示在同一时间点测量两个不同数量的两个时间序列,则可以在右侧的辅助 Y 轴上绘制第二个序列。
df <- ggplot2::economics
y1 <- df$psavert
y2 <- df$unemploy
target.range <- range(y1)
origin.range <- range(y2)
a <- (target.range[2] - target.range[1]) / (origin.range[2] - origin.range[1])
b <- target.range[1] - a * origin.range[1]
plot.df <- data.table(
date=as.POSIXct(df$date),
psavert=df$psavert,
unemploy=df$unemploy*a+b) %>%
melt(id="date")
b_date <- as.POSIXct("1967-07-01", tz = "UTC")
e_date <- as.POSIXct("2015-04-01", tz = "UTC")
dat.color <- c("#B03834", "#698DBD")
ggplot(plot.df, aes(date, value, color=variable))+
geom_line(show.legend = T)+
scale_color_manual(values = dat.color)+
scale_x_datetime(date_labels = "%Y-%m-%d", breaks = seq(b_date, e_date, by = "5 year"), limits = c(b_date, e_date), expand = c(0.05,0.05))+
scale_y_continuous(
name = "Personal Savings Rate", #y1特征
breaks = seq(2,16,2),
sec.axis = sec_axis(trans=~(.-b)/a, #反向变换到原数据
breaks = seq(4000,14000,2000),
name="# Unemployed (1000's)") #y2特征
)+
labs(x= NULL, title = "Personal Savings Rate vs Unemployed: Plotting in Secondary Y Axis", color=NULL)+
theme_bw()+
theme(aspect.ratio = 1/1.8,
axis.text.x = element_text(size=6.2),
axis.ticks = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(linewidth = 0.2),
legend.position = c(0.5, 0.83),
plot.title = element_text(hjust = 0.5))
df <- ggplot2::economics
y1 <- df$psavert
y2 <- df$unemploy
target.range <- range(y1)
origin.range <- range(y2)
a <- (target.range[2] - target.range[1]) / (origin.range[2] - origin.range[1])
b <- target.range[1] - a * origin.range[1]
plot.df <- data.table(
date=as.POSIXct(df$date),
psavert=df$psavert,
unemploy=df$unemploy*a+b) %>%
melt(id="date")
b_date <- as.POSIXct("1967-07-01", tz = "UTC")
e_date <- as.POSIXct("2015-04-01", tz = "UTC")
dat.color <- c("#B03834", "#698DBD")
# 四周边框
dat.bor <- data.table(x1=c(rep(b_date-lubridate::years(1), 2), b_date-lubridate::years(1), e_date+lubridate::years(1)),
x2=c(rep(e_date+lubridate::years(1), 2), b_date-lubridate::years(1), e_date+lubridate::years(1)),
y1=c(18,0,0,0),
y2=c(18,0,18,18))
# 内部网格线
dat.gri <- data.table(x1=c(seq(b_date, e_date, "5 year"), rep(b_date-lubridate::years(1), 8)),
x2=c(seq(b_date, e_date, "5 year"), rep(e_date+lubridate::years(1), 8)),
y1=c(rep(0,10), seq(2,16,2)),
y2=c(rep(18,10), seq(2,16,2)))
# y轴标签
dat.y1.tex <- data.table(x=rep(b_date-lubridate::years(1), 8), y=seq(2,16,2), lab=seq(2,16,2))
y2.lab <- seq(4000, 14000, 2000)
dat.y2.tex <- data.table(x=rep(e_date+lubridate::years(1), 6), y=y2.lab*a+b, lab=y2.lab)
# y轴标题
dat.y1.tit <- data.table(x=b_date-lubridate::years(6), y=9, lab="Personal Savings Rate")
dat.y2.tit <- data.table(x=e_date+lubridate::years(7), y=9, lab="# Unemployed (1000's)")
ggplot(plot.df, aes(date, value))+
geom_line(aes(color=variable),show.legend = F)+
geom_segment(data=dat.bor, aes(x=x1, xend=x2, y=y1, yend=y2))+ #border
geom_segment(data=dat.gri, aes(x=x1, xend=x2, y=y1, yend=y2), linewidth=0.2, color="gray", alpha=0.4)+ #grid
geom_text(data=dat.y1.tex, aes(x,y,label=lab), color=dat.color[1], size=3, hjust=1.2)+
geom_text(data=dat.y2.tex, aes(x,y,label=lab), color=dat.color[2], size=3, hjust=-0.1)+
geom_text(data=dat.y1.tit, aes(x,y,label=lab), color=dat.color[1], size=5, vjust=1.5, angle=90)+
geom_text(data=dat.y2.tit, aes(x,y,label=lab), color=dat.color[2], size=5, vjust=1.5, angle=-90)+
scale_color_manual(values = dat.color)+
scale_x_datetime(date_labels = "%Y-%m-%d", breaks = seq(b_date, e_date, by = "5 year"),
limits = c(b_date-lubridate::years(6), e_date+lubridate::years(7)), expand = c(0,0))+
scale_y_continuous(breaks = seq(2,16,2), limits = c(0,18), expand = c(0,0))+
labs(x= NULL, y=NULL,title = "Personal Savings Rate vs Unemployed: Plotting in Secondary Y Axis", color=NULL)+
theme_bw()+
theme(aspect.ratio = 1/1.8,
panel.border = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(size=6.2),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
df <- ggplot2::economics
y1 <- df$psavert
y2 <- df$unemploy
target.range <- range(y1)
origin.range <- range(y2)
a <- (target.range[2] - target.range[1]) / (origin.range[2] - origin.range[1])
b <- target.range[1] - a * origin.range[1]
plot.df <- data.table(
date=as.POSIXct(df$date),
psavert=df$psavert,
unemploy=df$unemploy*a+b) %>%
melt(id="date")
b_date <- as.POSIXct("1967-07-01", tz = "UTC")
e_date <- as.POSIXct("2015-04-01", tz = "UTC")
dat.color <- c("#B03834", "#698DBD")
# 四周边框
dat.bor <- data.table(x1=c(rep(b_date-lubridate::years(1), 2), b_date-lubridate::years(1), e_date+lubridate::years(1)),
x2=c(rep(e_date+lubridate::years(1), 2), b_date-lubridate::years(1), e_date+lubridate::years(1)),
y1=c(18,0,0,0),
y2=c(18,0,18,18))
# 内部网格线
dat.gri <- data.table(x1=c(seq(b_date, e_date, "5 year"), rep(b_date-lubridate::years(1), 8)),
x2=c(seq(b_date, e_date, "5 year"), rep(e_date+lubridate::years(1), 8)),
y1=c(rep(0,10), seq(2,16,2)),
y2=c(rep(18,10), seq(2,16,2)))
# y轴标签
dat.y1.tex <- data.table(x=rep(b_date-lubridate::years(1), 8), y=seq(2,16,2), lab=seq(2,16,2))
y2.lab <- ((seq(2,16,2)-b)/a) %>% round()
dat.y2.tex <- data.table(x=rep(e_date+lubridate::years(1), 8), y=seq(2,16,2), lab=y2.lab)
# y轴标题
dat.y1.tit <- data.table(x=b_date-lubridate::years(6), y=9, lab="Personal Savings Rate")
dat.y2.tit <- data.table(x=e_date+lubridate::years(7), y=9, lab="# Unemployed (1000's)")
ggplot(plot.df, aes(date, value))+
geom_line(aes(color=variable),show.legend = F)+
geom_segment(data=dat.bor, aes(x=x1, xend=x2, y=y1, yend=y2))+ #border
geom_segment(data=dat.gri, aes(x=x1, xend=x2, y=y1, yend=y2), linewidth=0.2, color="gray", alpha=0.4)+ #grid
geom_text(data=dat.y1.tex, aes(x,y,label=lab), color=dat.color[1], size=3, hjust=1.2)+
geom_text(data=dat.y2.tex, aes(x,y,label=lab), color=dat.color[2], size=3, hjust=-0.1)+
geom_text(data=dat.y1.tit, aes(x,y,label=lab), color=dat.color[1], size=5, vjust=1.5, angle=90)+
geom_text(data=dat.y2.tit, aes(x,y,label=lab), color=dat.color[2], size=5, vjust=1.5, angle=-90)+
scale_color_manual(values = dat.color)+
scale_x_datetime(date_labels = "%Y-%m-%d", breaks = seq(b_date, e_date, by = "5 year"),
limits = c(b_date-lubridate::years(6), e_date+lubridate::years(7)), expand = c(0,0))+
scale_y_continuous(breaks = seq(2,16,2), limits = c(0,18), expand = c(0,0))+
labs(x= NULL, y=NULL,title = "Personal Savings Rate vs Unemployed: Plotting in Secondary Y Axis", color=NULL)+
theme_bw()+
theme(aspect.ratio = 1/1.8,
panel.border = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(size=6.2),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Time Series with Error Bands
如果有一个时间序列数据集,其中包含每个时间点(日期/时间戳)的多个观测值,则可以构建具有误差带或置信区间的时间序列。
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/user_orders_hourofday.csv")
df <- fread("data12_userorders.csv")
df.plot <- df[, .(mean = mean(quantity), std = plotrix::std.error(quantity)*1.96), by = "order_hour_of_day"]
# plotrix::std.error函数用于计算均值的标准误差,标准误差*1.96是正态分布下的95%置信区间
ggplot(df.plot, aes(order_hour_of_day, mean))+
geom_ribbon(aes(ymin=mean-std, ymax=mean+std), fill="steelblue")+
geom_line(color="white")+
scale_x_continuous(breaks = seq(0,22,2))+
scale_y_continuous(breaks = seq(8,18,2), limits = c(8,20))+
labs(x="Hour of Day", y="# Orders", title="User Orders by Hour of Day (95% confidence)")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
panel.border = element_blank(),
axis.line = element_line(),
axis.ticks = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
plot.title = element_text(hjust = 0.5))
# df <- fread("https://www.kaggle.com/olistbr/brazilian-ecommerce#olist_orders_dataset.csv")
df <- fread("data13_orders45d.csv")
df.plot <- df[, .(mean = mean(quantity), std = plotrix::std.error(quantity)*1.96), by = "purchase_date"]
df.plot[, purchase_date:=as.POSIXct(purchase_date)]
date_s <- as.POSIXct("2017-05-01", tz = "UTC")
date_e <- as.POSIXct("2017-06-15", tz = "UTC")
ggplot(df.plot, aes(purchase_date, mean))+
geom_ribbon(aes(ymin=mean-std, ymax=mean+std), fill="#4B5C7B")+
geom_line(color="white")+
scale_x_datetime(limits = c(date_s, date_e), breaks = seq(date_s, date_e, by="6 days"))+
scale_y_continuous(limits = c(4,10), breaks = seq(4,10,1), expand = c(0,0))+
labs(x=NULL, y="# Daily Orders", title="Daily Order Quantity of Brazilian Retail with Error Bands (95% confidence)")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
panel.border = element_blank(),
axis.text = element_text(size=5),
axis.line = element_line(),
axis.ticks = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
plot.title = element_text(hjust = 0.5))
Stacked Area Chart
堆积面积图直观地表示了多个时间序列的贡献程度,以便于相互比较。
原教程严重错误,’uempmed’和’psavert’的数据和标签不对应,以至于两个指标画反了!(但他的描述还是对的?)
# 设置日期显示为英文格式
Sys.setlocale(category = "LC_TIME", locale = "English")
## [1] "English_United States.1252"
# df <- fread('https://raw.githubusercontent.com/selva86/datasets/master/nightvisitors.csv')
df <- fread("data14_nightvisitors.csv")
df.color <- c("#C05253", "#6B93C3", "#7CB25A", "#E49944", "#97776F", "#999999", "#D193CB", "#C8CA56")
df.varia <- c("Sydney", "NSW", "Melbourne", "VIC", "BrisbaneGC", "QLD", "Capitals", "Other")
df <- df[,.SD, .SDcols = c("yearmon", df.varia)]
df <- df[, yearmon:=as.POSIXct(paste("01", yearmon), format= "%d %b %Y", tz = "UTC")]
df.plot <- df %>% melt("yearmon") %>% .[, variable:=factor(variable, levels = rev(df.varia))]
date_s <- as.POSIXct("1998-01-01", tz = "UTC")
date_e <- as.POSIXct("2011-10-01", tz = "UTC")
ggplot(df.plot, aes(x=yearmon, y=value, fill=variable))+
geom_area(position = position_stack())+
geom_line(aes(group=variable), color="white", position = position_stack(), linewidth = 0.2)+
scale_fill_manual(values = rev(df.color))+
guides(fill = guide_legend(keyheight = unit(0.3, "cm"), keywidth = unit(0.3, "cm"), title = NULL, nrow = 2))+
scale_x_datetime(limits = c(date_s, date_e), breaks = seq(date_s, date_e, "15 month"),date_labels = "%b %Y", expand = c(0,0))+
scale_y_continuous(limits = c(0, 100000), breaks = seq(10000,90000,20000), expand = c(0,0))+
labs(x=NULL, y=NULL, title="Night Visitors in Australian Regions")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
panel.border = element_blank(),
axis.text = element_text(size=5),
axis.line = element_line(linewidth=0.1),
axis.ticks = element_blank(),
panel.grid = element_blank(),
legend.text = element_text(size=5),
legend.background = element_blank(),
legend.position = c(0.8, 0.95),
plot.title = element_text(hjust = 0.5))
Area Chart UnStacked
非堆叠面积图用于可视化两个或多个序列的进度(起伏)。在下面的图表中,您可以清楚地看到随着失业持续时间中位数的增加,个人储蓄率是如何下降的。未堆叠的面积图很好地揭示了这种现象。
原图严重错误,‘uempmed’, ’psavert’顺序反了,以至于图两个指标反了!(但他的描述还是对的?)
df <- ggplot2::economics %>% data.table()
df[, date:=as.POSIXct(date)]
date_s <- as.POSIXct("1967-07-01", tz = "UTC")
date_e <- as.POSIXct("2015-04-01", tz = "UTC")
df.color <- c("steelblue", "red2")
df.plot <- df[, c("date", "psavert", "uempmed")] %>%
melt("date") %>%
.[, variable:=factor(variable, levels = c("uempmed", "psavert"))]
# 注意,geom_area也有color,就是最上面那条线,但是由于和fill在一起不好调节,额外加一个geom_line
ggplot(df.plot, aes(date, value, fill=variable))+
geom_area(alpha=0.4, position = position_identity())+
geom_line(aes(color=variable), alpha=8,position = position_identity(), show.legend = F)+
scale_fill_manual(values = df.color)+
scale_color_manual(values = df.color)+
guides(fill = guide_legend(keyheight = unit(0.3, "cm"), keywidth = unit(0.6, "cm"), title = NULL, nrow = 2))+
scale_x_datetime(limits = c(date_s, date_e), breaks = seq(date_s, date_e, "50 month"),date_labels = "%Y-%m-%d", expand = c(0,0))+
scale_y_continuous(limits = c(0,30), breaks = seq(2.5,27.5,2.5), expand = c(0,0))+
labs(x=NULL, y=NULL, title="Personal Savings Rate vs Median Duration of Unemployment")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
panel.border = element_blank(),
axis.text = element_text(size=5),
axis.line = element_line(linewidth=0.1),
axis.ticks = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
legend.text = element_text(size=6),
legend.background = element_blank(),
legend.position = c(0.95, 0.97),
plot.title = element_text(hjust = 0.5))
Calendar Heat Map
与时间序列相比,日历地图是可视化基于时间的数据的替代且不太受欢迎的选项。虽然在视觉上很吸引人,但数值并不十分明显。然而,它可以很好地描绘极端值和假日效果。
# 设置日期显示为英文格式
Sys.setlocale(category = "LC_TIME", locale = "English")
## [1] "English_United States.1252"
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/yahoo.csv")
df <- fread("data15_yahoo.csv")
df.t1 <- df %>% .[, date:=as.POSIXct(date)] %>% .[year==2014,c("date", "VIX.Close")]
df.t2 <- data.table(date=seq(as.POSIXct("2014-01-01", tz="UTC"), as.POSIXct("2014-12-31", tz="UTC"), by="1 day"))
df.plot <- df.t1[df.t2, on="date"] %>%
set_colnames(c("date", "value")) %>%
.[is.na(value), value:=0] %>%
.[, monthf:=as.character(format(date, "%b"))] %>%
.[, weekf:=as.character(format(date, "%a"))]
mon <- unique(df.plot$monthf)
wee <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun") %>% rev()
df.plot[, monthf:=factor(monthf, levels = mon)]
df.plot[, weekf:=factor(weekf, levels = wee)]
# df.plot[, monwee:=1+lubridate::isoweek(date)-min(lubridate::isoweek(date)), by="monthf"] #生成月内周数
tweek <- data.table::isoweek(df.plot$date) #等于lubridate::isoweek(x)
tweek[363:365] <- 53 #将最后三日赋予本年度的53周
df.plot[, week:=tweek]
df.text.y <- data.table(x=54, y=wee) #标记星期
df.text.x <- df.plot[, mean(week), by="monthf"] #标记月份
# 注意,离散坐标轴不支持sec.axis参数,因此此图右侧的文字只能用geom_text添加
ggplot(df.plot, aes(week, weekf, fill=value))+
geom_tile(width = 1, height = 1, color="white", show.legend = F)+
geom_text(data=df.text.y, aes(x, y, label=y, fill=NULL), hjust=0, size=2.5)+ #右侧标记星期
geom_text(data=NULL, aes(x=0, y="Thu", fill=NULL), label="2014", angle=90, size=5, vjust=-0.1)+ #左侧标记年份
geom_text(data=df.text.x, aes(V1, 0, label=monthf, fill=NULL), size=3)+ #下侧标记月份
scale_fill_gradient(low = "gray80", high = "red")+
scale_x_discrete(expand = c(0.1,0.2))+
scale_y_discrete(breaks=NULL, expand = c(0.3,0))+
labs(x=NULL, y=NULL)+
theme_bw()+
theme(aspect.ratio = 7*1.3/(53*1.3), #想要方格子,只能是调节这个比例
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.border = element_blank())
Seasonal Plot
季节图可用于比较时间序列在上一季同一天的表现(年/月/周等)。
# 设置日期显示为英文格式
Sys.setlocale(category = "LC_TIME", locale = "English")
## [1] "English_United States.1252"
# df <- fread('https://github.com/selva86/datasets/raw/master/AirPassengers.csv')
df <- fread("data10_passenger.csv")
df[, date:=as.POSIXct(date)]
df[, year:=year(date)]
df[, month:=format(date, "%b")]
mont <- df$month %>% unique()
df[, month:=factor(month, levels = mont)]
df[, year:=factor(year)]
ggplot(df, aes(month, value, color=year, group=year))+
geom_line(show.legend = F)+
geom_text(data=df[month=="Dec",], aes(x=month, y=value, label=year), size=2.5, hjust=0, show.legend = F)+
scale_x_discrete(expand = c(0.01,0.5))+
scale_y_continuous(limits = c(50, 710), breaks = seq(100,700,100), expand = c(0,0))+
labs(x=NULL, y="Air Traffic", title="Monthly Seasonal Plot: Air Passengers Traffic (1949 - 1969)")+
theme_bw()+
theme(aspect.ratio = 1/1.6,
panel.border = element_blank(),
axis.line = element_line(linewidth=0.2),
axis.title.y = element_text(face = "italic"),
axis.ticks = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5))
Dendrogram
树状图根据给定的距离度量将相似的点组合在一起,并根据点的相似性将它们组织成树状的链接。
# hclust部分结果含义
# result_hc$merge
# 表示融合过程,每一行是一次融合,负值表示原始观察对象的序号,正值表示已经融合的项目
# 例如:第一行若为 -3 -5,则表示第一次,融合了两个原始观察对象,分别是序号3和5
# 例如:若第三行是1 2,则表示融合了第一次和第二次两个项目,即merge中的第一行和第二行
# 有n个观测对象,则有n-1次融合,即merge为n-1行,负值一共出现n次,因此正值出现2*(n-1)-n=n-2次
# result_hc$height
# 长度为n-1的向量,表示每次的聚类高度
# result_hc$labels
# 观测对象,按原始顺序
# result_hc$order
# 长度为n的向量,表示原观测向量中,每个对象的顺序,即result_hc$labels[result_hc$order]是结果向量
# cutree(result_hc, 3)
# 通过这个函数,查看当给定类别个数时,对象的所属类别
# 定义一个画图函数,画图并标记颜色给指定个数的分类,这个函数修改自网上下载的一段代码,具体来源记不清了
hPlot <- function(data, k=3, method = "ward.D2"){
result_hc <- hclust(dist(data), method = method)
len <- nrow(data) #数据行数
lab <- result_hc$labels[result_hc$order] #标签名,按最终结果从左至右的顺序
# ranks保存x值,包括原始对象的单脚(在1~len)和len-1次聚类的左右脚中心(在len+(len-1))
# 其中最后一次(第len-1次)聚类的两脚中心,没有实际用处
ranks <- c(1:(2*len-1))
ranks[1:len] <- order(result_hc$order) #替换下面注释的for循环,结果一样
# for (i in c(1:len)){
# ranks[i] <- which(result_hc$order == i)
# }
# ranks[(len+1):(len+2)] <- 0 #这行感觉没用,第一次聚类肯定是原始对象,len+1一定被改写了,len+2同理
for (i in 1:(len-1)){
rank1 <- 0
rank2 <- 0
# 第i次聚合,第1个对象,左脚
if(result_hc$merge[i,1] < 0){
rank1 <- ranks[-result_hc$merge[i,1]] #如果是聚类原始对象,直接是位置
} else{
rank1 <- ranks[result_hc$merge[i,1] + len] #如果已经聚类,聚类次数+len查找已生成的两脚中点
}
# 第i次聚合,第2个对象,右脚
if(result_hc$merge[i,2]<0){
rank2 <- ranks[-result_hc$merge[i,2]] #如果是聚类原始对象,直接是位置
} else{
rank2 <- ranks[result_hc$merge[i,2]+len] #如果已经聚类,聚类次数+len查找已生成的两脚中点
}
ranks[i+len] <- (rank1 + rank2)/2 #生成左右脚中点
}
# heigh保存y值,一共有原始对象的单脚len个,高度为0,和len-1个聚类结果,共2*len-1个高度,聚类结果中有
heigh <- c(rep(0, len), result_hc$height)
# heigh <- c(1:(2*len-1))
# heigh[1:len] <- 0
# heigh[(len+1):(2*len-1)] <- result_hc$height
# merge_保存聚类信息,与结果的merge大小相同,但是将原始对象改为正值,将已聚类对象改为+len
# 将矩阵当向量进行索引,即以列向量的顺序索引
merge_ <- result_hc$merge
for (i in 1:(2*(len-1))) {
if(result_hc$merge[i]>0){
merge_[i] <- result_hc$merge[i]+len
} else{
merge_[i] <- -result_hc$merge[i]
}
}
# x,y:一共是len-1类,每类为一个“n”形,四个控制点
x <- c(1:4*(len-1))
y <- c(1:4*(len-1))
for (i in 1:(len-1)){
index <- (i-1)*4 + 1 #每次计算四个x和四个y
x[index] <- ranks[merge_[i,1]] #左脚下点x
y[index] <- heigh[merge_[i,1]] #左脚下点y,可能是原始对象也可能是已聚类对象,所以用merge_
x[index+1] <- ranks[merge_[i,1]] #左脚上点x
y[index+1] <- result_hc$height[i] #左脚上点y,只可能是已聚类对象,所以用result_hc$height
x[index+2] <- ranks[merge_[i,2]] #右脚上点x
y[index+2] <- result_hc$height[i] #右脚上点y,同左脚上点y
x[index+3] <- ranks[merge_[i,2]] #右脚下点x
y[index+3] <- heigh[merge_[i,2]] #右脚下点y
}
clusterData <- data.table(x = x, y = y, group = rep(1:(len-1), each=4))
# 按给定分类k,画聚类图的不同颜色,其中无法分类的放一组
res.clus <- cutree(result_hc, k)
# 数据中,每个聚合四个脚的最小和最大值
data.range <- clusterData %>% .[, range(x), by="group"]
# 聚类中,每个类别单独脚的最小和最大值
clus.range <- data.table(x=1:len, clust =res.clus[rank(result_hc$order)]) %>%
.[, range(x), by="clust"]
# 若数据的最小和最大值,在某一类别的最小和最大范围内,则属于该类别
merge.clus <- data.table(group=1:(len-1), clust=0)
for(i in 1:(len-1)){
tdata.range <- data.range[group==i,]
for(j in 1:k){
tclus.range <- clus.range[clust==j,]
minx <- tdata.range[[1,2]] >= tclus.range[[1,2]] #数据的最小值是否大于分类的最小值
maxx <- tdata.range[[2,2]] <= tclus.range[[2,2]] #数据的最大值是否小于分类的最大值
if(minx&maxx){merge.clus[group==i, clust:=j]} #如果以上都满足,则数据是这一类
}
}
# 将绘图数据与分类数据连接
dt.plot <- merge.clus[clusterData, on="group"] %>% .[, clust:=factor(clust)]
dt.color <- c("0"="blue", "1"="green4", "2"="red", "3"="cyan", "4"="khaki",
"6"="purple", "5"="black", "7"="magenta", "8"="orange", "9"="pink")
pic <- ggplot() +
geom_line(data = dt.plot, mapping = aes(x=x, y=y, group=group, color=clust), linewidth = 0.5, show.legend = F) +
scale_color_manual(values = dt.color)+
labs(x = NULL, y = NULL, title = "USArrests Dendrograms")+
scale_x_continuous(breaks = 1:len, labels = lab, expand = c(0.01,0.01)) +
scale_y_continuous(limits = c(0,750), breaks = seq(0,700,100), expand = c(0,0)) +
theme_bw()+
theme(aspect.ratio = 1/2,
axis.text.x = element_text(angle = 90, hjust = 1.1, vjust = 0.5, size = 7),
axis.ticks = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
return(pic)
}
# 执行函数
hPlot(datasets::USArrests, k=6)
Cluster Plot
聚类图可用于划分属于同一聚类的点。下面是一个代表性示例,根据USArrests数据集将美国各州分为5个组。此聚类图使用“谋杀”和“袭击”列作为X轴和Y轴。或者您可以使用第一个到主分量作为x轴和Y轴。
# 注意原图点的颜色和填充的颜色不对应
df <- datasets::USArrests
# 注意类别要为factor不能是数值型,否则分类图会有问题
dt <- df %>% data.table() %>% .[,1:2] %>% .[,class:=dist(df) %>% hclust(method = "ward.D2") %>% cutree(5) %>% factor()]
# 按类别,生成多边形边界
dt.chu <- data.table()
for(i in unique(dt$class)){
tdt <- dt[class==i,]
tch <- chull(tdt[,1:2])
x <- tdt[c(tch, tch[1]), ]
dt.chu <- rbind(dt.chu, x)
}
df.color <- c("steelblue", "red", "yellow3", "green4", "orange")
ggplot(dt, aes(Murder, Assault))+
geom_polygon(data=dt.chu, aes(fill=class), alpha=.2, show.legend = F)+
geom_point(aes(color=class), show.legend = F)+
scale_fill_manual(values = df.color)+
scale_color_manual(values = df.color)+
scale_x_continuous(limits = c(0, 18), breaks = seq(0, 17.5, 2.5), expand = c(0,0))+
scale_y_continuous(limits = c(0,350), breaks = seq(50, 350, 50), expand = c(0,0))+
labs(x="Murder", y="Assault", title="Agglomerative Clustering of USArrests (5 Groups)")+
theme_bw()+
theme(aspect.ratio = 1/1.5,
axis.ticks = element_blank(),
axis.title.y = element_text(face = "italic"),
panel.grid = element_blank(),
plot.title = element_text(hjust = 0.5))
Andrews Curve
Andrews 曲线有助于可视化是否存在基于给定分组的数值特征的固有分组。如果特征(数据集中的列)不能帮助区分组(下图中的cyl),那么行就不能很好地分离。
df <- datasets::mtcars
dt <- data.table(df)
do.andrews <- function(x, breaks = 30) {
# 定义一个计算Andrews曲线的函数
t <- seq(-pi, pi, pi / breaks)
m <- nrow(x)
n <- ncol(x)
f <- matrix(t, nrow = length(t), ncol = n)
j <- seq(2, n)
f[, j[j %% 2 == 0]] <- sweep(f[, j[j %% 2 == 0], drop = FALSE], 2, j[j %% 2 == 0] / 2, `*`)
f[, j[j %% 2 != 0]] <- sweep(f[, j[j %% 2 != 0], drop = FALSE], 2, j[j %% 2 != 0] %/% 2, `*`)
f[, j[j %% 2 == 0]] <- sin(f[, j[j %% 2 == 0]])
f[, j[j %% 2 != 0]] <- cos(f[, j[j %% 2 != 0]])
f <- x[, j] %*% t(f[, j])
f <- f + matrix(x[,1] / sqrt(2), nrow = m, ncol = length(t), byrow = FALSE)
res <- reshape2::melt(f, varnames = c('RowId', 'AndrewsId'), value.name = 'AndrewsValue')
res[,'AndrewsId'] <- t[res[,'AndrewsId']]
return(res)
}
# df.plot <- df %>% group_by(cyl) %>% do({do.andrews(as.matrix(select(., -cyl)))}) dplyr方法,和下面datatable方法结果一样
dt.plot <- dt[, do.andrews(as.matrix(.SD)), .SDcols = -2, by="cyl"] %>%
.[, cyl:=factor(cyl, levels = c("6", "4", "8"))]
dt.color <- c("#d76f31", "#fcc454", "#b3b4b3")
ggplot(dt.plot, aes(x = AndrewsId, y = AndrewsValue, color = cyl, group = interaction(cyl, RowId))) +
geom_line(linewidth = 0.5)+
scale_color_manual(values = dt.color)+
guides(color=guide_legend(keyheight = unit(0.3, "cm")))+
scale_x_continuous(limits = c(-3,3), breaks = seq(-3,3,1), expand = c(0,0,0,0))+
scale_y_continuous(limits = c(-600, 600), breaks = seq(-400,400,200), expand = c(0,0))+
labs(x=NULL, y=NULL, title = "Andrews Curves of mtcars", color=NULL)+
theme_bw()+
theme(aspect.ratio = 1/1.2,
panel.border = element_blank(),
axis.line = element_line(linewidth=0.3),
axis.ticks = element_blank(),
panel.grid.minor = element_blank(),
legend.position = c(0.92,0.95),
plot.title = element_text(hjust = 0.5))
Parallel Coordinates
平行坐标有助于可视化要素是否有助于有效地隔离组。如果隔离受到影响,该功能在预测该组时可能非常有用。
# df <- fread("https://raw.githubusercontent.com/selva86/datasets/master/diamonds_filter.csv")
df <- fread("data17_diamondsfilter.csv")
df.plot <- df %>%
.[, index:=1:nrow(df)] %>%
data.table::melt(id=c("cut", "index")) %>%
.[, variable:=factor(variable, levels = c("carat", "depth", "table", "clarity"))]
df.color <- c("#1a5946", "#8C86BB", "#c09430", "#646c68")
ggplot(df.plot, aes(variable, value, color=cut, group=index))+
geom_line()+
scale_color_manual(values = df.color)+
guides(color=guide_legend(keyheight = unit(0.2, "cm")))+
geom_vline(xintercept = c("carat", "depth", "table"))+
scale_x_discrete(expand = c(0,0))+
scale_y_continuous(limits = c(-5, 75), breaks = seq(0,70, 10), expand = c(0,0))+
labs(x=NULL, y=NULL, title = "Parallel Coordinated of Diamonds", color=NULL)+
theme_bw()+
theme(aspect.ratio = 1/1.2,
panel.border = element_blank(),
axis.line = element_line(linewidth=0.3),
axis.ticks = element_blank(),
panel.grid.minor = element_blank(),
panel.grid.major = element_line(linewidth=0.2),
legend.position = c(0.92,0.95),
plot.title = element_text(hjust = 0.5))