Introduction to dplyr

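The dplyr package makes the steps of data manipulation fast and easy.

This document introduces dplyr's basic set of tools and shows how to apply them to data frames. dplyr also supports databases via the dbplyr package.

Data: starwars

To explore the basic data manipulation verbs of dplyr, we'll use the dataset starwars. This dataset contains 87 characters, comes from the Star Wars API, and is documented in ?starwars.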

dim(starwars)
#> [1] 87 14
starwars
#> # A tibble: 87 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
#> 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> # ... with 83 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

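Note that starwars is a tibble, a modern reimagining of the data frame. It is particularly useful for large datasets because it only prints the first few rows. You can learn more about tibbles at https://tibble.tidyverse.org; in particular, you can convert data frames to tibbles with as_tibble().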
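
As a minimal sketch of the conversion mentioned above, the following turns a plain data frame into a tibble. It assumes the built-in mtcars dataset purely for illustration (it is not part of the original text) and relies on dplyr re-exporting as_tibble() from the tibble package.

```r
library(dplyr)  # dplyr re-exports as_tibble() from the tibble package

# mtcars is a plain data frame from the datasets package (illustrative choice)
cars_tbl <- as_tibble(mtcars)

class(mtcars)    # "data.frame"
class(cars_tbl)  # "tbl_df" "tbl" "data.frame"

# Printing the tibble shows only the first rows, plus the column types
cars_tbl
```

Note that as_tibble() drops the row names of mtcars by default.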

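Single table verbs

dplyr aims to provide a function for each basic verb of data manipulation. These verbs can be organised into three categories based on the component of the dataset that they work with:

Rows:
- filter() chooses rows based on column values.
- slice() chooses rows based on location.
- arrange() changes the order of the rows.
Columns:
- select() changes whether or not a column is included.
- rename() changes the name of columns.
- mutate() changes the values of columns and creates new columns.
- relocate() changes the order of the columns.
Groups of rows:
- summarise() collapses a group into a single row.

The pipe

All of the dplyr functions take a data frame (or tibble) as the first argument. Rather than forcing the user to either save intermediate objects or nest functions, dplyr provides the %>% operator from magrittr. x %>% f(y) turns into f(x, y), so the result from one step is then "piped" into the next step. You can use the pipe to rewrite multiple operations so that they read left-to-right, top-to-bottom (reading the pipe operator as "then").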
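
The rewriting rule x %>% f(y) = f(x, y) can be sketched as follows. The particular filter and sort (droids ordered by height) are an illustrative choice, not something prescribed by the text.

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- arrange(filter(starwars, species == "Droid"), height)

# Piped call: read as "take starwars, then filter, then arrange"
piped <- starwars %>%
  filter(species == "Droid") %>%
  arrange(height)

# Both spellings select the same characters in the same order
identical(nested$name, piped$name)
```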

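Filter rows with `filter()`

filter() allows you to select a subset of rows in a data frame. Like all single-table verbs, the first argument is the tibble (or data frame). The second and subsequent arguments refer to variables within that data frame, selecting rows where the expression is TRUE.

For example, we can select all characters with light skin and brown eyes with: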

starwars %>% filter(skin_color == "light", eye_color == "brown")
#> # A tibble: 7 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia~    150    49 brown      light      brown             19 fema~ femin~
#> 2 Bigg~    183    84 black      light      brown             24 male  mascu~
#> 3 Corde    157    NA brown      light      brown             NA fema~ femin~
#> 4 Dorme    165    NA brown      light      brown             NA fema~ femin~
#> # ... with 3 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

This is roughly equivalent to this base R code:

starwars[starwars$skin_color == "light" & starwars$eye_color == "brown", ]
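
One place where the two forms are only "roughly" equivalent is missing values. The sketch below uses a small made-up data frame (the name toy and its contents are hypothetical) to show that filter() drops rows where the condition is NA, while base subsetting returns a row of NAs for them.

```r
library(dplyr)

# Toy data with one missing value (hypothetical, for illustration only)
toy <- data.frame(id = 1:3, x = c(1, NA, 3))

# filter() keeps only rows where the condition is TRUE; NA is dropped
kept_dplyr <- toy %>% filter(x > 1)

# Base subsetting returns a row of NAs wherever the condition is NA
kept_base <- toy[toy$x > 1, ]

nrow(kept_dplyr)  # 1
nrow(kept_base)   # 2
```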

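Arrange rows with `arrange()`

arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column is used to break ties in the values of the preceding columns: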

starwars %>% arrange(height, mass)
#> # A tibble: 87 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Yoda      66    17 white      green      brown            896 male  mascu~
#> 2 Ratt~     79    15 none       grey, blue unknown           NA male  mascu~
#> 3 Wick~     88    20 brown      brown      brown              8 male  mascu~
#> 4 Dud ~     94    45 none       blue, grey yellow            NA male  mascu~
#> # ... with 83 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Use desc() to order a column in descending order:

starwars %>% arrange(desc(height))
#> # A tibble: 87 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Yara~    264    NA none       white      yellow            NA male  mascu~
#> 2 Tarf~    234   136 brown      brown      blue              NA male  mascu~
#> 3 Lama~    229    88 none       grey       black             NA male  mascu~
#> 4 Chew~    228   112 brown      unknown    blue             200 male  mascu~
#> # ... with 83 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
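
A detail worth keeping in mind when sorting: arrange() places missing values at the end in both directions. The following sketch uses a small made-up tibble (toy is a hypothetical name) to show this.

```r
library(dplyr)

# Hypothetical data with one missing height
toy <- tibble(name = c("a", "b", "c"), height = c(150, NA, 180))

ascending  <- toy %>% arrange(height)        # 150, 180, NA
descending <- toy %>% arrange(desc(height))  # 180, 150, NA
```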

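Choose rows using their position with `slice()`

slice() lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows.

We can get the characters from row numbers 5 through 10: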

starwars %>% slice(5:10)
#> # A tibble: 6 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Leia~    150    49 brown      light      brown             19 fema~ femin~
#> 2 Owen~    178   120 brown, gr~ light      blue              52 male  mascu~
#> 3 Beru~    165    75 brown      light      blue              47 fema~ femin~
#> 4 R5-D4     97    32 <NA>       white, red red               NA none  mascu~
#> # ... with 2 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
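
The example above selects rows; removing and duplicating rows might look like the following sketch, which uses a small made-up tibble (toy is a hypothetical name) so the effect is easy to see.

```r
library(dplyr)

toy <- tibble(id = 1:4, label = c("a", "b", "c", "d"))

# Negative positions drop rows: everything except the first row
dropped <- toy %>% slice(-1)

# Repeating a position duplicates that row
duplicated_rows <- toy %>% slice(c(1, 1, 2))
```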

It is accompanied by a number of helpers for common use cases:

slice_head() and slice_tail() select the first or last rows.

starwars %>% slice_head(n = 3)
#> # A tibble: 3 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue              19 male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow           112 none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red               33 none  mascu~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
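
For completeness, here is a sketch of the slice_tail() counterpart, together with the same selection written with plain slice() and n() as a cross-check.

```r
library(dplyr)

# The last three rows of the data
last_three <- starwars %>% slice_tail(n = 3)

# The same rows, selected by explicit position; n() is the number of rows
by_position <- starwars %>% slice((n() - 2):n())

identical(last_three$name, by_position$name)
```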

slice_sample() randomly selects rows. Use the option prop to choose a certain proportion of the cases.

starwars %>% slice_sample(n = 5)
#> # A tibble: 5 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Dud ~     94    45 none       blue, grey yellow            NA male  mascu~
#> 2 Bossk    190   113 none       green      red               53 male  mascu~
#> 3 Shaa~    178    57 none       red, blue~ black             NA fema~ femin~
#> 4 Dorme    165    NA brown      light      brown             NA fema~ femin~
#> # ... with 1 more row, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
starwars %>% slice_sample(prop = 0.1)
#> # A tibble: 8 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Qui-~    193    89 brown      fair       blue              92 male  mascu~
#> 2 Dext~    198   102 none       brown      yellow            NA male  mascu~
#> 3 R4-P~     96    NA none       silver, r~ red, blue         NA none  femin~
#> 4 Lama~    229    88 none       grey       black             NA male  mascu~
#> # ... with 4 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

Use replace = TRUE to perform a bootstrap sample. If needed, you can weight the sample with the weight_by argument.
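
A minimal sketch of both options follows. The seed, the sample size of 5, and the choice of mass as the weight are arbitrary assumptions made for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Bootstrap sample: as many rows as the data, drawn with replacement
boot <- starwars %>% slice_sample(prop = 1, replace = TRUE)

# Weighted sample: heavier characters are more likely to be drawn.
# Rows with a missing mass are removed first so every weight is defined.
weighted <- starwars %>%
  filter(!is.na(mass)) %>%
  slice_sample(n = 5, weight_by = mass)
```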

slice_min() and slice_max() select the rows with the lowest or highest values of a variable. Note that we first must choose only the values which are not NA:

starwars %>%
  filter(!is.na(height)) %>%
  slice_max(height, n = 3)
#> # A tibble: 3 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Yara~    264    NA none       white      yellow            NA male  mascu~
#> 2 Tarf~    234   136 brown      brown      blue              NA male  mascu~
#> 3 Lama~    229    88 none       grey       black             NA male  mascu~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

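Select columns with `select()`

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you to rapidly zoom in on a useful subset using operations that usually only work on numeric variable positions: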

# Select columns by name
starwars %>% select(hair_color, skin_color, eye_color)
#> # A tibble: 87 x 3
#>   hair_color skin_color  eye_color
#>   <chr>      <chr>       <chr>    
#> 1 blond      fair        blue     
#> 2 <NA>       gold        yellow   
#> 3 <NA>       white, blue red      
#> 4 none       white       yellow   
#> # ... with 83 more rows
# Select all columns between hair_color and eye_color (inclusive)
starwars %>% select(hair_color:eye_color)
#> # A tibble: 87 x 3
#>   hair_color skin_color  eye_color
#>   <chr>      <chr>       <chr>    
#> 1 blond      fair        blue     
#> 2 <NA>       gold        yellow   
#> 3 <NA>       white, blue red      
#> 4 none       white       yellow   
#> # ... with 83 more rows
# Select all columns except those from hair_color to eye_color (inclusive)
starwars %>% select(!(hair_color:eye_color))
#> # A tibble: 87 x 11
#>   name  height  mass birth_year sex   gender homeworld species films vehicles
#>   <chr>  <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
#> 1 Luke~    172    77       19   male  mascu~ Tatooine  Human   <chr~ <chr [2~
#> 2 C-3PO    167    75      112   none  mascu~ Tatooine  Droid   <chr~ <chr [0~
#> 3 R2-D2     96    32       33   none  mascu~ Naboo     Droid   <chr~ <chr [0~
#> 4 Dart~    202   136       41.9 male  mascu~ Tatooine  Human   <chr~ <chr [0~
#> # ... with 83 more rows, and 1 more variable: starships <list>
# Select all columns ending with color
starwars %>% select(ends_with("color"))
#> # A tibble: 87 x 3
#>   hair_color skin_color  eye_color
#>   <chr>      <chr>       <chr>    
#> 1 blond      fair        blue     
#> 2 <NA>       gold        yellow   
#> 3 <NA>       white, blue red      
#> 4 none       white       yellow   
#> # ... with 83 more rows

There are a number of helper functions you can use within select(), like starts_with(), ends_with(), matches() and contains(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.
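
The helpers other than ends_with() can be sketched like this; the specific patterns ("h", "_", and "s$") are illustrative choices rather than anything taken from the text.

```r
library(dplyr)

# Columns whose names begin with "h"
h_cols <- names(select(starwars, starts_with("h")))

# Columns whose names contain an underscore
underscore_cols <- names(select(starwars, contains("_")))

# Columns whose names match a regular expression (here: names ending in "s")
s_cols <- names(select(starwars, matches("s$")))

h_cols
underscore_cols
s_cols
```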

You can rename variables with select() by using named arguments:

starwars %>% select(home_world = homeworld)
#> # A tibble: 87 x 1
#>   home_world
#>   <chr>     
#> 1 Tatooine  
#> 2 Tatooine  
#> 3 Naboo     
#> 4 Tatooine  
#> # ... with 83 more rows

But because select() drops all the variables not explicitly mentioned, it is not that useful. Instead, use rename():

starwars %>% rename(home_world = homeworld)
#> # A tibble: 87 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
#> 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> # ... with 83 more rows, and 5 more variables: home_world <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
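
As a small extension of the example above, rename() accepts several new_name = old_name pairs at once. The sketch also shows rename_with(), which is not covered in this document and is included here only as a pointer; it applies a function to the selected column names.

```r
library(dplyr)

# Several renames in one call; all other columns are kept
renamed <- starwars %>%
  rename(home_world = homeworld, year_of_birth = birth_year)

# rename_with() applies a function to names; here only the *_color columns
upper <- starwars %>% rename_with(toupper, ends_with("color"))

names(renamed)
names(upper)
```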

Add new columns with `mutate()`

Besides selecting sets of existing columns, it is often useful to add new columns that are functions of existing columns. This is the job of mutate():

starwars %>% mutate(height_m = height / 100)
#> # A tibble: 87 x 15
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
#> 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> # ... with 83 more rows, and 6 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>, height_m <dbl>

We can't see the height in meters we just calculated, but we can fix that using a select command.

starwars %>%
  mutate(height_m = height / 100) %>%
  select(height_m, height, everything())
#> # A tibble: 87 x 15
#>   height_m height name   mass hair_color skin_color eye_color birth_year sex  
#>      <dbl>  <int> <chr> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1     1.72    172 Luke~    77 blond      fair       blue            19   male 
#> 2     1.67    167 C-3PO    75 <NA>       gold       yellow         112   none 
#> 3     0.96     96 R2-D2    32 <NA>       white, bl~ red             33   none 
#> 4     2.02    202 Dart~   136 none       white      yellow          41.9 male 
#> # ... with 83 more rows, and 6 more variables: gender <chr>, homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

dplyr::mutate() is similar to the base transform(), but allows you to refer to columns that you've just created:

starwars %>%
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) %>%
  select(BMI, everything())
#> # A tibble: 87 x 16
#>     BMI name  height  mass hair_color skin_color eye_color birth_year sex  
#>   <dbl> <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1  26.0 Luke~    172    77 blond      fair       blue            19   male 
#> 2  26.9 C-3PO    167    75 <NA>       gold       yellow         112   none 
#> 3  34.7 R2-D2     96    32 <NA>       white, bl~ red             33   none 
#> 4  33.3 Dart~    202   136 none       white      yellow          41.9 male 
#> # ... with 83 more rows, and 7 more variables: gender <chr>, homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>,
#> #   height_m <dbl>

If you only want to keep the new variables, use transmute():

starwars %>%
  transmute(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  )
#> # A tibble: 87 x 2
#>   height_m   BMI
#>      <dbl> <dbl>
#> 1     1.72  26.0
#> 2     1.67  26.9
#> 3     0.96  34.7
#> 4     2.02  33.3
#> # ... with 83 more rows

Change column order with `relocate()`

Use a similar syntax as select() to move blocks of columns at once:

starwars %>% relocate(sex:homeworld, .before = height)
#> # A tibble: 87 x 14
#>   name  sex   gender homeworld height  mass hair_color skin_color eye_color
#>   <chr> <chr> <chr>  <chr>      <int> <dbl> <chr>      <chr>      <chr>    
#> 1 Luke~ male  mascu~ Tatooine     172    77 blond      fair       blue     
#> 2 C-3PO none  mascu~ Tatooine     167    75 <NA>       gold       yellow   
#> 3 R2-D2 none  mascu~ Naboo         96    32 <NA>       white, bl~ red      
#> 4 Dart~ male  mascu~ Tatooine     202   136 none       white      yellow   
#> # ... with 83 more rows, and 5 more variables: birth_year <dbl>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
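
Two further placements can be sketched as follows: with neither .before nor .after the selected columns move to the front, and .after mirrors .before. The columns chosen here are illustrative.

```r
library(dplyr)

# With no .before/.after, the selected columns move to the front
front <- starwars %>% relocate(species, homeworld)

# .after places the block right after another column
after_name <- starwars %>% relocate(ends_with("color"), .after = name)

names(front)[1:2]
names(after_name)[1:4]
```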

Summarise values with `summarise()`

The last verb is summarise(). It collapses a data frame to a single row.

starwars %>% summarise(height = mean(height, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   height
#>    <dbl>
#> 1   174.

It is not that useful until we learn the group_by() verb below.
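
As a brief preview of why grouping makes summarise() more useful, the sketch below computes one summary row per value of sex instead of one row overall. The output column names n and mean_height are illustrative choices.

```r
library(dplyr)

by_sex <- starwars %>%
  group_by(sex) %>%
  summarise(
    n = n(),                                  # number of characters per group
    mean_height = mean(height, na.rm = TRUE)  # average height per group
  )

by_sex
```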

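Commonalities

You may have noticed that the syntax and function of all these verbs are very similar:

- The first argument is a data frame.
- The subsequent arguments describe what to do with the data frame. You can refer to columns in the data frame directly without using $.
- The result is a new data frame.

Together these properties make it easy to chain together multiple simple steps to achieve a complex result.

These five functions provide the basis of a language of data manipulation. At the most basic level, you can only alter a tidy data frame in five useful ways: you can reorder the rows (arrange()), pick observations and variables of interest (filter() and select()), add new variables that are functions of existing variables (mutate()), or collapse many values to a summary (summarise()).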
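
The shared properties listed above can be sketched in one short chain. The particular question asked here (tall humans, tallest first) is an illustrative assumption; the point is that each verb takes a data frame, refers to columns by bare name, and returns a new data frame.

```r
library(dplyr)

tall_humans <- starwars %>%
  filter(species == "Human", height > 180) %>%  # pick rows
  mutate(height_m = height / 100) %>%           # add a column
  select(name, height_m, homeworld) %>%         # pick columns
  arrange(desc(height_m))                       # reorder rows

# Each verb returned a new data frame; starwars itself is unchanged
"height_m" %in% names(starwars)  # FALSE
```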

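Combining functions with `%>%`

The dplyr API is functional in the sense that function calls don't have side-effects. You must always save their results. This doesn't lead to particularly elegant code, especially if you want to do many operations at once. You either have to do it step-by-step: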

a1 <- group_by(starwars, species, sex)
a2 <- select(a1, height, mass)
a3 <- summarise(a2,
  height = mean(height, na.rm = TRUE),
  mass = mean(mass, na.rm = TRUE)
)

Or, if you don't want to name the intermediate results, you need to wrap the function calls inside each other:

summarise(
  select(
    group_by(starwars, species, sex),
    height, mass
  ),
  height = mean(height, na.rm = TRUE),
  mass = mean(mass, na.rm = TRUE)
)
#> Adding missing grouping variables: `species`, `sex`
#> `summarise()` has grouped output by 'species'. You can override using the `.groups` argument.
#> # A tibble: 41 x 4
#> # Groups:   species [38]
#>   species  sex   height  mass
#>   <chr>    <chr>  <dbl> <dbl>
#> 1 Aleena   male      79    15
#> 2 Besalisk male     198   102
#> 3 Cerean   male     198    82
#> 4 Chagrian male     196   NaN
#> # ... with 37 more rows

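This is difficult to read because the order of the operations is from inside to out. Thus, the arguments are a long way away from the function. To get around this problem, dplyr provides the %>% operator from magrittr. x %>% f(y) turns into f(x, y), so you can use it to rewrite multiple operations so that they read left-to-right, top-to-bottom (reading the pipe operator as "then"):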

starwars %>%
  group_by(species, sex) %>%
  select(height, mass) %>%
  summarise(
    height = mean(height, na.rm = TRUE),
    mass = mean(mass, na.rm = TRUE)
  )
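
The message printed by the nested version mentions the `.groups` argument. A possible variant of the pipeline, sketched under the assumption that an ungrouped result is wanted, sets `.groups = "drop"`. The select() step is left out here because summarise() only uses height and mass anyway.

```r
library(dplyr)

by_species_sex <- starwars %>%
  group_by(species, sex) %>%
  summarise(
    height = mean(height, na.rm = TRUE),
    mass = mean(mass, na.rm = TRUE),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

is_grouped_df(by_species_sex)  # FALSE
```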

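Patterns of operations

The dplyr verbs can be classified by the type of operations they accomplish (we sometimes speak of their semantics, i.e., their meaning). It is helpful to have a good grasp of the difference between select and mutate operations.

Selecting operations

One of the appealing features of dplyr is that you can refer to columns from the tibble as if they were regular variables. However, the syntactic uniformity of referring to bare column names hides semantic differences across the verbs. A column symbol supplied to select() does not have the same meaning as the same symbol supplied to mutate().

Selecting operations expect column names and positions. Hence, when you call select() with bare variable names, they actually represent their own positions in the tibble. The following calls are completely equivalent from dplyr's point of view: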

# `name` represents the integer 1
select(starwars, name)
#> # A tibble: 87 x 1
#>   name          
#>   <chr>         
#> 1 Luke Skywalker
#> 2 C-3PO         
#> 3 R2-D2         
#> 4 Darth Vader   
#> # ... with 83 more rows
select(starwars, 1)
#> # A tibble: 87 x 1
#>   name          
#>   <chr>         
#> 1 Luke Skywalker
#> 2 C-3PO         
#> 3 R2-D2         
#> 4 Darth Vader   
#> # ... with 83 more rows

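By the same token, this means that you cannot refer to variables from the surrounding context if they have the same name as one of the columns. In the following example, height still represents 2, not 5: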

height <- 5
select(starwars, height)
#> # A tibble: 87 x 1
#>   height
#>    <int>
#> 1    172
#> 2    167
#> 3     96
#> 4    202
#> # ... with 83 more rows

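One useful subtlety is that this only applies to bare names and to selecting calls like c(height, mass) or height:mass. In all other cases, the columns of the data frame are not put in scope. This allows you to refer to contextual variables in selection helpers: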

name <- "color"
select(starwars, ends_with(name))
#> # A tibble: 87 x 3
#>   hair_color skin_color  eye_color
#>   <chr>      <chr>       <chr>    
#> 1 blond      fair        blue     
#> 2 <NA>       gold        yellow   
#> 3 <NA>       white, blue red      
#> 4 none       white       yellow   
#> # ... with 83 more rows

These semantics are usually intuitive. But note the subtle difference:

name <- 5
select(starwars, name, identity(name))
#> # A tibble: 87 x 2
#>   name           skin_color 
#>   <chr>          <chr>      
#> 1 Luke Skywalker fair       
#> 2 C-3PO          gold       
#> 3 R2-D2          white, blue
#> 4 Darth Vader    white      
#> # ... with 83 more rows

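In the first argument, name represents its own position 1. In the second argument, name is evaluated in the surrounding context and represents the fifth column.

For a long time, select() used to only understand column positions. Counting from dplyr 0.6, it now understands column names as well. This makes it a bit easier to program with select():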

vars <- c("name", "height")
select(starwars, all_of(vars), "mass")
#> # A tibble: 87 x 3
#>   name           height  mass
#>   <chr>           <int> <dbl>
#> 1 Luke Skywalker    172    77
#> 2 C-3PO             167    75
#> 3 R2-D2              96    32
#> 4 Darth Vader       202   136
#> # ... with 83 more rows
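
When the character vector may contain names that are not in the data, any_of() is the lenient counterpart of all_of(). The sketch below contrasts the two; "not_a_column" is a deliberately missing, hypothetical name.

```r
library(dplyr)

# "not_a_column" is deliberately absent from starwars (illustration only)
vars <- c("name", "height", "not_a_column")

# any_of() silently skips names that are not present
picked <- select(starwars, any_of(vars))

# all_of() is strict: a missing name is an error.
# tryCatch() turns that error into a marker value so the script keeps running.
strict <- tryCatch(
  select(starwars, all_of(vars)),
  error = function(e) "error"
)

names(picked)
```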

Mutating operations

Mutate semantics are quite different from selection semantics. Whereas select() expects column names or positions, mutate() expects column vectors. We will set up a smaller tibble to use for our examples.

df <- starwars %>% select(name, height, mass)

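When we use select(), the bare column names stand for their own positions in the tibble. For mutate(), on the other hand, column symbols represent the actual column vectors stored in the tibble. Consider what happens if we give a string or a number to mutate():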

mutate(df, "height", 2)
#> # A tibble: 87 x 5
#>   name           height  mass `"height"`   `2`
#>   <chr>           <int> <dbl> <chr>      <dbl>
#> 1 Luke Skywalker    172    77 height         2
#> 2 C-3PO             167    75 height         2
#> 3 R2-D2              96    32 height         2
#> 4 Darth Vader       202   136 height         2
#> # ... with 83 more rows

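mutate() gets length-1 vectors that it interprets as new columns in the data frame. These vectors are recycled so they match the number of rows. That is why it doesn't make sense to supply an expression like "height" + 10 to mutate(): this amounts to adding 10 to a string. The correct expression is: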

mutate(df, height + 10)
#> # A tibble: 87 x 4
#>   name           height  mass `height + 10`
#>   <chr>           <int> <dbl>         <dbl>
#> 1 Luke Skywalker    172    77           182
#> 2 C-3PO             167    75           177
#> 3 R2-D2              96    32           106
#> 4 Darth Vader       202   136           212
#> # ... with 83 more rows

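In the same way, you can unquote values from the context if these values represent a valid column. They must be either length 1 (they then get recycled) or have the same length as the number of rows. In the following example we create a new vector that we add to the data frame: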

var <- seq(1, nrow(df))
mutate(df, new = var)
#> # A tibble: 87 x 4
#>   name           height  mass   new
#>   <chr>           <int> <dbl> <int>
#> 1 Luke Skywalker    172    77     1
#> 2 C-3PO             167    75     2
#> 3 R2-D2              96    32     3
#> 4 Darth Vader       202   136     4
#> # ... with 83 more rows
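
The length rule stated above can be sketched as follows. The column names source, row_id, and broken are hypothetical, and the failing call is wrapped in tryCatch() so the script keeps running.

```r
library(dplyr)

df <- starwars %>% select(name, height, mass)

# Length 1: recycled to every row
ok_scalar <- mutate(df, source = "starwars")

# Length nrow(df): used as-is
ok_full <- mutate(df, row_id = seq_len(nrow(df)))

# Any other length is rejected; tryCatch() turns the error into a marker value
bad <- tryCatch(
  mutate(df, broken = 1:2),
  error = function(e) "error"
)
```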

A prominent example is group_by(). While you might think it has select semantics, it actually has mutate semantics. This is quite handy as it allows you to group by a modified column:

group_by(starwars, sex)
#> # A tibble: 87 x 14
#> # Groups:   sex [5]
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
#> 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> # ... with 83 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
group_by(starwars, sex = as.factor(sex))
#> # A tibble: 87 x 14
#> # Groups:   sex [5]
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <fct> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
#> 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> # ... with 83 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
group_by(starwars, height_binned = cut(height, 3))
#> # A tibble: 87 x 15
#> # Groups:   height_binned [4]
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
#> 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
#> 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> # ... with 83 more rows, and 6 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>, height_binned <fct>

This is why you can't supply a column name as a string to group_by(). Doing so amounts to creating a new column containing the string recycled to the number of rows:

group_by(df, "month")
#> # A tibble: 87 x 4
#> # Groups:   "month" [1]
#>   name           height  mass `"month"`
#>   <chr>           <int> <dbl> <chr>    
#> 1 Luke Skywalker    172    77 month    
#> 2 C-3PO             167    75 month    
#> 3 R2-D2              96    32 month    
#> 4 Darth Vader       202   136 month    
#> # ... with 83 more rows
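
If the grouping columns really are held as strings, one possible approach (a sketch, not something the text above prescribes) is to wrap a selection helper in across(), which brings select semantics into group_by(). The variable name group_cols is hypothetical.

```r
library(dplyr)

df <- starwars %>% select(name, height, mass, sex)

group_cols <- c("sex")  # column names held as strings

# across() lets group_by() use a selection such as all_of()
grouped <- df %>% group_by(across(all_of(group_cols)))

group_vars(grouped)  # "sex"
```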
