Working with Factors in R: An Introduction to the forcats Package (3)

Original · 阿越就是我 · 医学和生信笔记 · 2023-02-25

Today we continue with the forcats package; this is the third post in the forcats introduction series.

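Changing the values of factor levels

These functions change the values of the levels while preserving the original order as far as possible.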

2.1 `fct_anon()`

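Replaces the factor levels with arbitrary numeric identifiers. Neither the values nor the order of the levels are preserved, and the mapping is random, so it differs from run to run.

library(forcats)   # fct_*() functions and the gss_cat data set
library(magrittr)  # provides the %>% pipe
gss_cat$relig %>% fct_count()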
## # A tibble: 16 x 2
##    f                           n
##    <fct>                   <int>
##  1 No answer                  93
##  2 Don't know                 15
##  3 Inter-nondenominational   109
##  4 Native american            23
##  5 Christian                 689
##  6 Orthodox-christian         95
##  7 Moslem/islam              104
##  8 Other eastern              32
##  9 Hinduism                   71
## 10 Buddhism                  147
## 11 Other                     224
## 12 None                     3523
## 13 Jewish                    388
## 14 Catholic                 5124
## 15 Protestant              10846
## 16 Not applicable              0

gss_cat$relig %>% fct_anon() %>% fct_count()
## # A tibble: 16 x 2
##    f         n
##    <fct> <int>
##  1 01       32
##  2 02      224
##  3 03       93
##  4 04     3523
##  5 05      689
##  6 06     5124
##  7 07    10846
##  8 08      104
##  9 09      109
## 10 10      147
## 11 11       23
## 12 12       71
## 13 13      388
## 14 14        0
## 15 15       15
## 16 16       95

gss_cat$relig %>% fct_anon("X") %>% fct_count()
## # A tibble: 16 x 2
##    f         n
##    <fct> <int>
##  1 X01     109
##  2 X02    5124
##  3 X03     224
##  4 X04    3523
##  5 X05      95
##  6 X06       0
##  7 X07     689
##  8 X08      93
##  9 X09      32
## 10 X10     147
## 11 X11      15
## 12 X12      71
## 13 X13     388
## 14 X14     104
## 15 X15      23
## 16 X16   10846
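
Because the identifiers are assigned at random, the two outputs above map the same counts to different codes. The following is a minimal sketch, assuming you want a repeatable anonymisation: fix the random seed with base R's `set.seed()` before each call. The seed value 2023 and the prefix "R" are arbitrary choices made for illustration.

```r
library(forcats)

# The seed value is arbitrary; any fixed integer gives a repeatable mapping.
set.seed(2023)
anon1 <- fct_anon(gss_cat$relig, prefix = "R")

set.seed(2023)
anon2 <- fct_anon(gss_cat$relig, prefix = "R")

# Same seed, same random mapping.
same_mapping <- identical(anon1, anon2)

# Anonymisation only renames levels, so the set of counts is unchanged.
same_counts <- identical(
  sort(as.integer(table(anon1))),
  sort(as.integer(table(gss_cat$relig)))
)

same_mapping
same_counts
```

Keeping the seed private matters if the codes are meant to hide the original labels, since anyone with the seed can rebuild the mapping.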

2.2 `fct_collapse()`

Put simply, it collapses factor levels into manually defined groups.

fct_count(gss_cat$partyid)
## # A tibble: 10 x 2
##    f                      n
##    <fct>              <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind,near rep        1791
##  7 Independent         4119
##  8 Ind,near dem        2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490

There are 10 rows, that is, 10 levels. We can now group these 10 levels by defining the new groups by hand:

partyid2 <- fct_collapse(gss_cat$partyid,
                         missing = c("No answer", "Don't know"),
                         rep = c("Strong republican", "Not str republican"),
                         other = "Other party",
                         ind = c("Ind,near rep", "Independent", "Ind,near dem"),
                         dem = c("Not str democrat", "Strong democrat")
                         )
fct_count(partyid2)
## # A tibble: 5 x 2
##   f           n
##   <fct>   <int>
## 1 missing   155
## 2 other     393
## 3 rep      5346
## 4 ind      8409
## 5 dem      7180
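
When only a few groups matter, listing every remaining level by hand gets tedious. The sketch below assumes a reasonably recent forcats release in which `fct_collapse()` accepts an `other_level` argument; the toy factor `f` and the label "rest" are invented for illustration.

```r
library(forcats)

# Toy data invented for illustration.
f <- factor(c("a", "b", "c", "d", "a", "e"))

grouped <- fct_collapse(f,
  ab = c("a", "b"),
  other_level = "rest"  # every level not named above is collected here
)

levels(grouped)
table(grouped)
```

Here "a" and "b" become the single level `ab`, and "c", "d" and "e" are gathered into `rest` without being named one by one.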

2.3 `fct_lump()`

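`fct_lump()` is a family of functions that merge the levels meeting some condition into a single group. If you do a lot of machine learning or statistical modeling, you will often need to turn low-frequency groups into an "Other" group. pandas in Python makes this easy, and of course R can do it too.

fct_lump_min(): lumps levels that appear fewer than `min` times into the other level.
fct_lump_prop(): lumps levels whose share of the observations is no more than `prop` into the other level.
fct_lump_n(): keeps the n most frequent levels and lumps the rest together (if n < 0, the -n least frequent levels are kept instead).
fct_lump_lowfreq(): lumps together the least frequent levels while ensuring that the other level is still the smallest one.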

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
x %>% table()
## .
##  A  B  C  D  E  F  G  H  I 
## 40 10  5 27  1  1  1  1  1

Keep the 3 most frequent levels and lump the rest together

x %>% fct_lump_n(3) %>% table() # ties.method = c("min", "average", "first", "last", "random", "max")
## .
##     A     B     D Other 
##    40    10    27    10

Keep the 3 least frequent levels (E to I are tied at one observation each, so all five are kept)

x %>% fct_lump_n(-3) %>% table()
## .
##     E     F     G     H     I Other 
##     1     1     1     1     1    82

Lump levels whose proportion is below 0.1

x %>% fct_lump_prop(0.1) %>% table()
## .
##     A     B     D Other 
##    40    10    27    10

Lump levels that appear fewer than 2 times, with a custom name for the other level

x %>% fct_lump_min(2, other_level = "Rare") %>% table()
## .
##    A    B    C    D Rare 
##   40   10    5   27    5

Lump the low-frequency levels while making sure the other level is still the smallest one

x %>% fct_lump_lowfreq() %>% table()
## .
##     A     D Other 
##    40    27    20
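
To see why `fct_lump_prop(0.1)` keeps B but not C, it helps to compute the proportions by hand. This is a small sketch using base R's `prop.table()`; the names `props`, `kept_by_hand` and `kept_by_forcats` are introduced here for illustration only.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of each level among the 87 observations.
props <- prop.table(table(x))
round(props, 3)

# Levels whose share is strictly above the 0.1 threshold.
kept_by_hand <- names(props)[props > 0.1]

# Levels that survive fct_lump_prop(), ignoring the added "Other" level.
kept_by_forcats <- setdiff(levels(fct_lump_prop(x, prop = 0.1)), "Other")

kept_by_hand
kept_by_forcats
```

B accounts for 10 of 87 observations (about 0.115) and stays, while C accounts for 5 of 87 (about 0.057) and is lumped.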

2.4 `fct_other()`

Lumps the levels you specify by hand into an other level, similar to fct_lump

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Keep A and B; lump everything else into one level
fct_other(x, keep = c("A", "B"), other_level = "other")
##  [1] A     A     A     A     A     A     A     A     A     A     A     A    
## [13] A     A     A     A     A     A     A     A     A     A     A     A    
## [25] A     A     A     A     A     A     A     A     A     A     A     A    
## [37] A     A     A     A     B     B     B     B     B     B     B     B    
## [49] B     B     other other other other other other other other other other
## [61] other other other other other other other other other other other other
## [73] other other other other other other other other other other other other
## [85] other other other
## Levels: A B other

# Lump A and B into one level; keep everything else
fct_other(x, drop = c("A", "B"), other_level = "hhahah")
##  [1] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [11] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [21] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [31] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [41] hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah hhahah
## [51] C      C      C      C      C      D      D      D      D      D     
## [61] D      D      D      D      D      D      D      D      D      D     
## [71] D      D      D      D      D      D      D      D      D      D     
## [81] D      D      E      F      G      H      I     
## Levels: C D E F G H I hhahah
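
The `keep` and `drop` arguments are two views of the same operation. As a minimal sketch, keeping A and B should give the same factor as dropping every level that is not A or B; the variable names `by_keep` and `by_drop` are illustrative.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Keep A and B explicitly ...
by_keep <- fct_other(x, keep = c("A", "B"))

# ... or drop every level that is not A or B.
by_drop <- fct_other(x, drop = setdiff(levels(x), c("A", "B")))

# Compare the two results value by value and level by level.
all(as.character(by_keep) == as.character(by_drop))
identical(levels(by_keep), levels(by_drop))
table(by_keep)
```

Which argument to use is mostly a matter of which list is shorter to write.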

2.5 `fct_recode()`

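Changes factor levels by hand, using `new = "old"` pairs; mapping a level to NULL removes it, so its values become NA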

x <- factor(c("apple", "bear", "banana", "dear"))
x
## [1] apple  bear   banana dear  
## Levels: apple banana bear dear

fct_recode(x, fruit = "apple", fruit = "banana")
## [1] fruit bear  fruit dear 
## Levels: fruit bear dear

fct_recode(x, NULL = "apple", fruit = "banana")
## [1] <NA>  bear  fruit dear 
## Levels: fruit bear dear

fct_recode(x, "an apple" = "apple", "a bear" = "bear")
## [1] an apple a bear   banana   dear    
## Levels: an apple banana a bear dear

# A named vector of recodings can be spliced into the call with !!!
x <- factor(c("apple", "bear", "banana", "dear"))
levels <- c(fruit = "apple", fruit = "banana")
fct_recode(x, !!!levels)
## [1] fruit bear  fruit dear 
## Levels: fruit bear dear
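
Splicing with `!!!` is most useful when the recoding table is built elsewhere, for example read from a file. The sketch below uses a hypothetical lookup vector `lookup`, whose names are the new labels and whose values are the old levels; the "animal" label is invented for illustration.

```r
library(forcats)

x <- factor(c("apple", "bear", "banana", "dear"))

# Hypothetical lookup table: names are new labels, values are old levels.
lookup <- c(fruit = "apple", fruit = "banana", animal = "bear")

# !!! unpacks the vector into fruit = "apple", fruit = "banana", animal = "bear".
recoded <- fct_recode(x, !!!lookup)

levels(recoded)
as.character(recoded)
```

Levels that do not appear in the lookup, such as "dear" here, are left unchanged.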

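2.6 `fct_relabel()`

Relabels the levels by applying a function to them; levels that end up with the same label are merged.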

gss_cat$partyid %>% fct_count()
## # A tibble: 10 x 2
##    f                      n
##    <fct>              <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind,near rep        1791
##  7 Independent         4119
##  8 Ind,near dem        2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490

gss_cat$partyid %>% fct_relabel(~ gsub(",", ", ", .x)) %>% fct_count()
## # A tibble: 10 x 2
##    f                      n
##    <fct>              <int>
##  1 No answer            154
##  2 Don't know             1
##  3 Other party          393
##  4 Strong republican   2314
##  5 Not str republican  3032
##  6 Ind, near rep       1791
##  7 Independent         4119
##  8 Ind, near dem       2499
##  9 Not str democrat    3690
## 10 Strong democrat     3490
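
The `~ gsub(...)` formula above is one way to pass the relabelling function. The following sketch shows two alternatives, a bare function name and a function with extra arguments, plus a case where two levels collapse into one. The toy factor `f` and the label "Leaner" are invented for illustration.

```r
library(forcats)

# Toy data invented for illustration.
f <- factor(c("Ind,near rep", "Independent", "Ind,near dem", "Independent"))

# A bare function name works as well as a ~ formula.
upper <- fct_relabel(f, toupper)

# Extra arguments after the function are passed on to it.
spaced <- fct_relabel(f, gsub, pattern = ",", replacement = ", ")

# If the function maps several levels to one label, those levels are merged.
merged <- fct_relabel(f, function(l) ifelse(startsWith(l, "Ind,"), "Leaner", l))

levels(upper)
levels(spaced)
levels(merged)
```

The function always receives the character vector of levels, not the data values, so it runs once per level rather than once per observation.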

That's all for today.
