to be continued...

Notes to self on R and psychological statistics

There are three kinds of lies: lies, damned lies, and statistics.

- Benjamin Disraeli (19th-century British Prime Minister)

My {dplyr} 1.0.0 notes: select(), rename(), and relocate()

Lately I have been hooked on 水曜どうでしょう (Suiyou Dou Deshou). The 対決列島 (Taiketsu Rettou) series is a lot of fun.

This is the corner where I belatedly catch up on {dplyr} 1.0.0.

This time the topic is select(), rename(), and relocate(), the functions that deal with columns (variables).

Here are the features added to select(), rename(), and relocate() in {dplyr} 1.0.0 that stood out to me:

  1. No more need for the _if() and _at() variants!!
  2. You can rename variables with a function!!
  3. relocate() arrives: the arrange() of the column (variable) world!!

Note: column = variable. Sorry that my terminology is not consistent.

Setup

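This post uses the dplyr::starwars data set, which holds information about the characters who appear in the Star Wars films (personally, about all I know of Star Wars is that a red lightsaber means a bad guy).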

> library(dplyr, warn.conflicts = FALSE)
> dplyr::starwars
# A tibble: 87 x 14
   name  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

1. No more need for the _if() and _at() variants!!

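These functions (select_if(), rename_at(), and so on) existed as wrappers around select() and rename(), but you no longer need them. They have not been removed and still work; they have been superseded, so they are no longer the recommended approach.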

So what do you do in dplyr 1.0.0? The helper functions where(), any_of(), and all_of() make it easy to specify variables.

select_if() → select(where())

For example, to pull out only the numeric columns of a data frame:

> starwars %>% 
+   select(where(is.numeric))
# A tibble: 87 x 3
   height  mass birth_year
    <int> <dbl>      <dbl>
 1    172    77       19  
 2    167    75      112  
 3     96    32       33  
 4    202   136       41.9
 5    150    49       19  
 6    178   120       52  
 7    165    75       47  
 8     97    32       NA  
 9    183    84       24  
10    182    77       57  
# ... with 77 more rows

Writing select(where(...)) gives the same output as the old select_if(). Whether it is actually easier is another question, but apparently it helps you avoid unexpected errors (see the update notice for details).
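As a quick check of that equivalence, here is a minimal sketch that runs both forms on starwars and compares the results. The object names old_style and new_style are my own and appear nowhere in dplyr.

```r
library(dplyr, warn.conflicts = FALSE)

# Superseded scoped verb: keep the columns for which the predicate is TRUE
old_style <- starwars %>% select_if(is.numeric)

# dplyr 1.0.0 style: the same predicate wrapped in where()
new_style <- starwars %>% select(where(is.numeric))

# Both approaches should pick the same columns in the same order
identical(names(old_style), names(new_style))
names(new_style)
```

The predicate itself is unchanged; only the place where it is written moves from the verb name into the selection.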

Of course, you can also specify variables with other functions and with conditions.

# Extract the columns that are not factors
> starwars %>% select(!where(is.factor))
# A tibble: 87 x 14
   name  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

# Extract the columns that are numeric and whose name starts with "h"
> starwars %>% select(where(is.numeric) & starts_with("h"))
# A tibble: 87 x 1
   height
    <int>
 1    172
 2    167
 3     96
 4    202
 5    150
 6    178
 7    165
 8     97
 9    183
10    182
# ... with 77 more rows
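The same operators work for unions and complements, and where() also accepts a purrr-style lambda. A minimal sketch on a small made-up tibble (toy and its columns are hypothetical, not part of the starwars data) so that the result is easy to verify by eye:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for this illustration
toy <- tibble(
  id     = c("a", "b", "c"),
  height = c(1.5, 1.8, NA),
  hits   = c(3L, 0L, 7L),
  group  = factor(c("x", "y", "x"))
)

# Union: numeric columns OR columns whose name starts with "g"
union_cols <- toy %>% select(where(is.numeric) | starts_with("g")) %>% names()

# Complement: every column that is not numeric
non_numeric <- toy %>% select(!where(is.numeric)) %>% names()

# Lambda predicate: numeric columns that contain at least one NA
numeric_with_na <- toy %>%
  select(where(~ is.numeric(.x) && anyNA(.x))) %>%
  names()

union_cols       # "height" "hits" "group"
non_numeric      # "id" "group"
numeric_with_na  # "height"
```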

select_at() → select(any_of()) or select(all_of())

To specify variables by passing a character vector of variable names:

# Create a character vector of variable names
> vars <- c("name", "mass", "skin_color", "sex")
# any_of() is lenient: it selects whichever names in the vector exist as columns
> starwars %>% select(any_of(vars))
# A tibble: 87 x 4
   name                mass skin_color  sex   
   <chr>              <dbl> <chr>       <chr> 
 1 Luke Skywalker        77 fair        male  
 2 C-3PO                 75 gold        none  
 3 R2-D2                 32 white, blue none  
 4 Darth Vader          136 white       male  
 5 Leia Organa           49 light       female
 6 Owen Lars            120 light       male  
 7 Beru Whitesun lars    75 light       female
 8 R5-D4                 32 white, red  none  
 9 Biggs Darklighter     84 light       male  
10 Obi-Wan Kenobi        77 fair        male  
# ... with 77 more rows

As shown, using any_of() inside select() gives the same output as select_at().

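any_of() is lenient (names missing from the data are ignored), whereas all_of() is strict (every name must exist). So if the character vector contains an element that does not correspond to a column of the data frame: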

# starwars has no column called "umr"
> vars2 <- c("name","mass","skin_color","sex","umr")
> starwars %>% select(all_of(vars2))
Error: Can't subset columns that don't exist.
x Column `umr` doesn't exist.

It throws an error and, in doing so, tells you that there is no column called "umr".

I think any_of() is fine for most cases, but all_of() is probably the one to use when you want strictness or want to check whether the columns actually exist.
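To see the two behaviours side by side, here is a minimal sketch using the same vars2 vector. Wrapping the all_of() call in tryCatch() is my own addition so that the script keeps running; the names lenient and strict are illustrative only.

```r
library(dplyr, warn.conflicts = FALSE)

vars2 <- c("name", "mass", "skin_color", "sex", "umr")

# any_of(): the missing "umr" is silently ignored
lenient <- starwars %>% select(any_of(vars2))
names(lenient)

# all_of(): the missing "umr" raises an error, caught here so the script continues
strict <- tryCatch(
  starwars %>% select(all_of(vars2)),
  error = function(e) "all_of() failed"
)
strict
```

In a script that must not quietly drop an expected column, the error from all_of() is arguably the safer default.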

As usual with select(), if you want every variable except the ones in the character vector, just use "-".

> starwars %>% select(-any_of(vars))
# A tibble: 87 x 10
   height hair_color eye_color birth_year gender homeworld species
    <int> <chr>      <chr>          <dbl> <chr>  <chr>     <chr>  
 1    172 blond      blue            19   mascu~ Tatooine  Human  
 2    167 NA         yellow         112   mascu~ Tatooine  Droid  
 3     96 NA         red             33   mascu~ Naboo     Droid  
 4    202 none       yellow          41.9 mascu~ Tatooine  Human  
 5    150 brown      brown           19   femin~ Alderaan  Human  
 6    178 brown, gr~ blue            52   mascu~ Tatooine  Human  
 7    165 brown      blue            47   femin~ Tatooine  Human  
 8     97 NA         red             NA   mascu~ Tatooine  Droid  
 9    183 black      brown           24   mascu~ Tatooine  Human  
10    182 auburn, w~ blue-gray       57   mascu~ Stewjon   Human  
# ... with 77 more rows, and 3 more variables: films <list>,
#   vehicles <list>, starships <list>

2. You can rename variables with a function!!

rename() is the function for changing variable names, and rename_with() has now been added alongside it.

With rename_with() you can use functions to describe both which variables to change and how to change their names.

Let's convert the variable names to upper case:

> starwars %>% rename_with(toupper)
# A tibble: 87 x 14
   NAME  HEIGHT  MASS HAIR_COLOR SKIN_COLOR EYE_COLOR BIRTH_YEAR SEX  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: GENDER <chr>,
#   HOMEWORLD <chr>, SPECIES <chr>, FILMS <list>, VEHICLES <list>,
#   STARSHIPS <list>

rename_with(toupper) does it in one shot.

Next, let's rename variables using {tidyselect} helpers such as starts_with(), where(), and all_of().

# Upper-case only the variables whose name starts with "n"
> starwars %>% rename_with(toupper, starts_with("n"))
# A tibble: 87 x 14
   NAME  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

# Upper-case only the names of the numeric variables
> starwars %>% rename_with(toupper, where(is.numeric))
# A tibble: 87 x 14
   name  HEIGHT  MASS hair_color skin_color eye_color BIRTH_YEAR sex  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

# Upper-case only the variables listed in the character vector (vars)
> starwars %>% rename_with(toupper, all_of(vars))
# A tibble: 87 x 14
   NAME  height  MASS hair_color SKIN_COLOR eye_color birth_year SEX  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

As shown, you can rename with the pattern rename_with(toupper, <selection of variables to change>).

You can also do things like change variable names with regular expressions via gsub(), but I will skip that here.
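For reference, the gsub() approach might look something like the following. This is only a sketch: the replacement patterns and the names dotted and short are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Replace every underscore in the column names with a dot
dotted <- starwars %>%
  rename_with(~ gsub("_", ".", .x, fixed = TRUE))
names(dotted)[4:7]

# Restrict the renaming to the *_color columns and strip that suffix
short <- starwars %>%
  rename_with(~ gsub("_color$", "", .x), ends_with("_color"))
names(short)[4:6]
```

The first argument is the renaming function (here a purrr-style lambda) and the second is the column selection, in the same order as the toupper examples above.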


3. relocate() arrives: the arrange() of the column (variable) world!!

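Changing the order of variables used to be a pain, didn't it... It is not a critical task, but it is one of those "wouldn't it be nice if I could" operations. You could already reorder variables with select(), but there is no need for that kind of hack any more.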

relocate() moves the specified variables to the left edge of the data frame.

> starwars %>% relocate(skin_color,birth_year)
# A tibble: 87 x 14
   skin_color birth_year name  height  mass hair_color eye_color sex  
   <chr>           <dbl> <chr>  <int> <dbl> <chr>      <chr>     <chr>
 1 fair             19   Luke~    172    77 blond      blue      male 
 2 gold            112   C-3PO    167    75 NA         yellow    none 
 3 white, bl~       33   R2-D2     96    32 NA         red       none 
 4 white            41.9 Dart~    202   136 none       yellow    male 
 5 light            19   Leia~    150    49 brown      brown     fema~
 6 light            52   Owen~    178   120 brown, gr~ blue      male 
 7 light            47   Beru~    165    75 brown      blue      fema~
 8 white, red       NA   R5-D4     97    32 NA         red       none 
 9 light            24   Bigg~    183    84 black      brown     male 
10 fair             57   Obi-~    182    77 auburn, w~ blue-gray male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

The columns originally ran name, height, ..., but after the call skin_color and birth_year sit at the far left.
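For comparison, the older select() idiom and relocate() can be placed next to each other. A minimal sketch, assuming the "hack" meant here is the common select(..., everything()) pattern; via_select and via_relocate are illustrative names.

```r
library(dplyr, warn.conflicts = FALSE)

# Older idiom: name the columns to move, then append the rest with everything()
via_select <- starwars %>% select(skin_color, birth_year, everything())

# dplyr 1.0.0: relocate() moves the named columns to the front by default
via_relocate <- starwars %>% relocate(skin_color, birth_year)

# Both should yield the same column order
identical(names(via_select), names(via_relocate))
```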

Naturally, you can also select the variables with {tidyselect} helpers such as where().

> starwars %>% relocate(starts_with("n"))
# A tibble: 87 x 14
   name  height  mass hair_color skin_color eye_color birth_year sex  
   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke~    172    77 blond      fair       blue            19   male 
 2 C-3PO    167    75 NA         gold       yellow         112   none 
 3 R2-D2     96    32 NA         white, bl~ red             33   none 
 4 Dart~    202   136 none       white      yellow          41.9 male 
 5 Leia~    150    49 brown      light      brown           19   fema~
 6 Owen~    178   120 brown, gr~ light      blue            52   male 
 7 Beru~    165    75 brown      light      blue            47   fema~
 8 R5-D4     97    32 NA         white, red red             NA   none 
 9 Bigg~    183    84 black      light      brown           24   male 
10 Obi-~    182    77 auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>


> starwars %>% relocate(where(is.character))
# A tibble: 87 x 14
   name  hair_color skin_color eye_color sex   gender homeworld
   <chr> <chr>      <chr>      <chr>     <chr> <chr>  <chr>    
 1 Luke~ blond      fair       blue      male  mascu~ Tatooine 
 2 C-3PO NA         gold       yellow    none  mascu~ Tatooine 
 3 R2-D2 NA         white, bl~ red       none  mascu~ Naboo    
 4 Dart~ none       white      yellow    male  mascu~ Tatooine 
 5 Leia~ brown      light      brown     fema~ femin~ Alderaan 
 6 Owen~ brown, gr~ light      blue      male  mascu~ Tatooine 
 7 Beru~ brown      light      blue      fema~ femin~ Tatooine 
 8 R5-D4 NA         white, red red       none  mascu~ Tatooine 
 9 Bigg~ black      light      brown     male  mascu~ Tatooine 
10 Obi-~ auburn, w~ fair       blue-gray male  mascu~ Stewjon  
# ... with 77 more rows, and 7 more variables: species <chr>,
#   height <int>, mass <dbl>, birth_year <dbl>, films <list>,
#   vehicles <list>, starships <list>


> starwars %>% relocate(name,where(is.numeric))
# A tibble: 87 x 14
   name  height  mass birth_year hair_color skin_color eye_color sex  
   <chr>  <int> <dbl>      <dbl> <chr>      <chr>      <chr>     <chr>
 1 Luke~    172    77       19   blond      fair       blue      male 
 2 C-3PO    167    75      112   NA         gold       yellow    none 
 3 R2-D2     96    32       33   NA         white, bl~ red       none 
 4 Dart~    202   136       41.9 none       white      yellow    male 
 5 Leia~    150    49       19   brown      light      brown     fema~
 6 Owen~    178   120       52   brown, gr~ light      blue      male 
 7 Beru~    165    75       47   brown      light      blue      fema~
 8 R5-D4     97    32       NA   NA         white, red red       none 
 9 Bigg~    183    84       24   black      light      brown     male 
10 Obi-~    182    77       57   auburn, w~ fair       blue-gray male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

Positioning with the .before and .after arguments

With the .before and .after arguments you can move any variable to just before or just after any other variable.

Perhaps this is handy when you want to look at a particular variable right next to the ID?

# before
> starwars %>% relocate(name, .before = hair_color)
# A tibble: 87 x 14
   height  mass name  hair_color skin_color eye_color birth_year sex  
    <int> <dbl> <chr> <chr>      <chr>      <chr>          <dbl> <chr>
 1    172    77 Luke~ blond      fair       blue            19   male 
 2    167    75 C-3PO NA         gold       yellow         112   none 
 3     96    32 R2-D2 NA         white, bl~ red             33   none 
 4    202   136 Dart~ none       white      yellow          41.9 male 
 5    150    49 Leia~ brown      light      brown           19   fema~
 6    178   120 Owen~ brown, gr~ light      blue            52   male 
 7    165    75 Beru~ brown      light      blue            47   fema~
 8     97    32 R5-D4 NA         white, red red             NA   none 
 9    183    84 Bigg~ black      light      brown           24   male 
10    182    77 Obi-~ auburn, w~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

# after
> starwars %>% relocate(name, .after = hair_color)
# A tibble: 87 x 14
   height  mass hair_color name  skin_color eye_color birth_year sex  
    <int> <dbl> <chr>      <chr> <chr>      <chr>          <dbl> <chr>
 1    172    77 blond      Luke~ fair       blue            19   male 
 2    167    75 NA         C-3PO gold       yellow         112   none 
 3     96    32 NA         R2-D2 white, bl~ red             33   none 
 4    202   136 none       Dart~ white      yellow          41.9 male 
 5    150    49 brown      Leia~ light      brown           19   fema~
 6    178   120 brown, gr~ Owen~ light      blue            52   male 
 7    165    75 brown      Beru~ light      blue            47   fema~
 8     97    32 NA         R5-D4 white, red red             NA   none 
 9    183    84 black      Bigg~ light      brown           24   male 
10    182    77 auburn, w~ Obi-~ fair       blue-gray       57   male 
# ... with 77 more rows, and 6 more variables: gender <chr>,
#   homeworld <chr>, species <chr>, films <list>, vehicles <list>,
#   starships <list>

# last_col()
> starwars %>%
+   select(1:5) %>%
+   print() %>% 
+   relocate(name, .after = last_col())
# A tibble: 87 x 5
   name               height  mass hair_color    skin_color 
   <chr>               <int> <dbl> <chr>         <chr>      
 1 Luke Skywalker        172    77 blond         fair       
 2 C-3PO                 167    75 NA            gold       
 3 R2-D2                  96    32 NA            white, blue
 4 Darth Vader           202   136 none          white      
 5 Leia Organa           150    49 brown         light      
 6 Owen Lars             178   120 brown, grey   light      
 7 Beru Whitesun lars    165    75 brown         light      
 8 R5-D4                  97    32 NA            white, red 
 9 Biggs Darklighter     183    84 black         light      
10 Obi-Wan Kenobi        182    77 auburn, white fair       
# ... with 77 more rows
# A tibble: 87 x 5
   height  mass hair_color    skin_color  name              
    <int> <dbl> <chr>         <chr>       <chr>             
 1    172    77 blond         fair        Luke Skywalker    
 2    167    75 NA            gold        C-3PO             
 3     96    32 NA            white, blue R2-D2             
 4    202   136 none          white       Darth Vader       
 5    150    49 brown         light       Leia Organa       
 6    178   120 brown, grey   light       Owen Lars         
 7    165    75 brown         light       Beru Whitesun lars
 8     97    32 NA            white, red  R5-D4             
 9    183    84 black         light       Biggs Darklighter 
10    182    77 auburn, white fair        Obi-Wan Kenobi    
# ... with 77 more rows
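.before and .after accept tidyselect helpers too, not only bare column names. A minimal sketch (the names numeric_last and height_after_chr are illustrative); when the destination selection matches several columns, .after uses the last match:

```r
library(dplyr, warn.conflicts = FALSE)

# Move every numeric column to the end of the data frame
numeric_last <- starwars %>%
  relocate(where(is.numeric), .after = last_col())
tail(names(numeric_last), 3)

# Move height to just after the last character column (species)
height_after_chr <- starwars %>%
  relocate(height, .after = where(is.character))
names(height_after_chr)
```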

Environment and references

Environment

> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

References
