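This vignette compares dplyr functions to their base R equivalents. This helps those familiar with base R better understand what dplyr does, and shows dplyr users how the same ideas can be expressed in base R code. We start with a rough overview of the major differences, then discuss the one-table verbs in more detail, followed by the two-table verbs.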

Overview

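dplyr verbs take data frames as input and return data frames as output. This contrasts with base R functions, which more frequently work with individual vectors.
dplyr relies heavily on "non-standard evaluation" so that you don't need to use $ to refer to columns in the "current" data frame. This behaviour is inspired by the base functions subset() and transform().
dplyr solutions tend to use a variety of single-purpose verbs, while base R solutions typically use [ in a variety of ways, depending on the task at hand.
Multiple dplyr verbs are often strung together into a pipeline by %>%. In base R, you'll typically save intermediate results to a variable that you either discard or repeatedly overwrite.
All dplyr verbs handle "grouped" data frames, so the code to perform a computation per group looks very similar to code that works on a whole data frame. In base R, per-group operations tend to take varied forms.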
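
As a rough illustration of these differences, the sketch below performs the same two-step task (keep heavy cars, then keep two columns) both ways. The threshold and the object names heavy_dplyr and heavy_base are made up for this example; it uses the plain built-in datasets::mtcars.

```r
library(dplyr)

cars <- datasets::mtcars

# dplyr: single-purpose verbs chained with %>%; columns referenced by bare name
heavy_dplyr <- cars %>%
  filter(wt > 5) %>%
  select(mpg, wt)

# base R: `[` and `$`, with an intermediate variable that is overwritten
heavy_base <- cars[which(cars$wt > 5), , drop = FALSE]
heavy_base <- heavy_base[c("mpg", "wt")]
```

Both objects should hold the same rows and columns; only the way the steps are written differs.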

One table verbs

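The following table shows a condensed translation between dplyr verbs and their base R equivalents. The following sections describe each operation in more detail. You can learn more about the dplyr verbs in their documentation and in vignette("one-table").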

dplyr	base
`arrange(df, x)`	`df[order(x), , drop = FALSE]`
`distinct(df, x)`	`df[!duplicated(x), , drop = FALSE]`, `unique()`
`filter(df, x)`	`df[which(x), , drop = FALSE]`, `subset()`
`mutate(df, z = x + y)`	`df$z <- df$x + df$y`, `transform()`
`pull(df, 1)`	`df[[1]]`
`pull(df, x)`	`df$x`
`rename(df, y = x)`	`names(df)[names(df) == "x"] <- "y"`
`relocate(df, y)`	`df[union("y", names(df))]`
`select(df, x, y)`	`df[c("x", "y")]`, `subset()`
`select(df, starts_with("x"))`	`df[grepl("^x", names(df))]`
`summarise(df, mean(x))`	`mean(df$x)`, `tapply()`, `aggregate()`, `by()`
`slice(df, c(1, 2, 5))`	`df[c(1, 2, 5), , drop = FALSE]`

To begin, we'll load dplyr and convert mtcars and iris to tibbles so that we can easily show only abbreviated output for each operation.

library(dplyr)
mtcars <- as_tibble(mtcars)
iris <- as_tibble(iris)

`arrange()`: Arrange rows by variables

dplyr::arrange() orders the rows of a data frame by the values of one or more columns:

mtcars %>% arrange(cyl, disp)
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 2  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 3  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 4  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> # ... with 28 more rows

The desc() helper allows you to order selected variables in descending order:

mtcars %>% arrange(desc(cyl), desc(disp))
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  10.4     8   472   205  2.93  5.25  18.0     0     0     3     4
#> 2  10.4     8   460   215  3     5.42  17.8     0     0     3     4
#> 3  14.7     8   440   230  3.23  5.34  17.4     0     0     3     4
#> 4  19.2     8   400   175  3.08  3.84  17.0     0     0     3     2
#> # ... with 28 more rows

We can replicate this in base R by using [ with order():

mtcars[order(mtcars$cyl, mtcars$disp), , drop = FALSE]
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#> 2  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#> 3  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 4  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> # ... with 28 more rows

Note the use of drop = FALSE. If you forget it and the input is a data frame with a single column, the output will be a vector, not a data frame. This is a source of subtle bugs.
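
A minimal sketch of that pitfall, using a made-up one-column data frame named one_col. It uses a plain data.frame on purpose: tibbles do not simplify to a vector when subset with [, so the problem only shows up with ordinary data frames.

```r
# A plain one-column data frame (hypothetical example data)
one_col <- data.frame(x = c(3, 1, 2))

# Without drop = FALSE, `[` simplifies the single column to a bare vector
sorted_vec <- one_col[order(one_col$x), ]
is.data.frame(sorted_vec)

# With drop = FALSE, the result is still a data frame
sorted_df <- one_col[order(one_col$x), , drop = FALSE]
is.data.frame(sorted_df)
```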

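Base R does not provide a convenient and general way to sort individual variables in descending order, so you have two options:

For numeric variables, you can use -x.
You can ask order() to sort all variables in descending order.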

mtcars[order(mtcars$cyl, mtcars$disp, decreasing = TRUE), , drop = FALSE]
mtcars[order(-mtcars$cyl, -mtcars$disp), , drop = FALSE]
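
When only some of the variables should be descending, the two options can be adapted as sketched below. This assumes the plain built-in datasets::mtcars; the names by_negation and by_radix are illustrative. The second form relies on order() accepting a per-argument decreasing vector, which is only supported with method = "radix".

```r
cars <- datasets::mtcars

# Ascending cyl, descending disp: negate the numeric column
by_negation <- cars[order(cars$cyl, -cars$disp), , drop = FALSE]

# The same ordering via one `decreasing` flag per sort key
# (a vector for `decreasing` requires method = "radix")
by_radix <- cars[
  order(cars$cyl, cars$disp, decreasing = c(FALSE, TRUE), method = "radix"),
  ,
  drop = FALSE
]
```

The radix form also works for character columns, where negation with - is not available.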

`distinct()`: Select distinct/unique rows

dplyr::distinct() selects unique rows:

df <- tibble(
  x = sample(10, 100, rep = TRUE),
  y = sample(10, 100, rep = TRUE)
)

df %>% distinct(x) # selected columns
#> # A tibble: 10 x 1
#>       x
#>   <int>
#> 1     7
#> 2     5
#> 3     9
#> 4     1
#> # ... with 6 more rows
df %>% distinct(x, .keep_all = TRUE) # whole data frame
#> # A tibble: 10 x 2
#>       x     y
#>   <int> <int>
#> 1     7    10
#> 2     5     3
#> 3     9     9
#> 4     1     4
#> # ... with 6 more rows

There are two equivalents in base R, depending on whether you want the whole data frame or just the selected variables:

unique(df["x"]) # selected columns
#> # A tibble: 10 x 1
#>       x
#>   <int>
#> 1     7
#> 2     5
#> 3     9
#> 4     1
#> # ... with 6 more rows
df[!duplicated(df$x), , drop = FALSE] # whole data frame
#> # A tibble: 10 x 2
#>       x     y
#>   <int> <int>
#> 1     7    10
#> 2     5     3
#> 3     9     9
#> 4     1     4
#> # ... with 6 more rows
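
The same two idioms extend to combinations of several columns, because duplicated() and unique() both work on data frames. The sketch below uses a small made-up data frame, dup, rather than the random df above.

```r
# Hypothetical data with repeated (x, y) combinations
dup <- data.frame(
  x = c(1, 1, 2, 2, 1),
  y = c("a", "a", "b", "c", "a"),
  z = 1:5
)

# Unique combinations of x and y, keeping only those columns
combos <- unique(dup[c("x", "y")])

# First row for each (x, y) combination, keeping every column
first_rows <- dup[!duplicated(dup[c("x", "y")]), , drop = FALSE]
```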

`filter()`: Return rows with matching conditions

dplyr::filter() selects rows where an expression is TRUE:

starwars %>% filter(species == "Human")
#> # A tibble: 35 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> 3 Leia~    150    49 brown      light      brown           19   fema~ femin~
#> 4 Owen~    178   120 brown, gr~ light      blue            52   male  mascu~
#> # ... with 31 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
starwars %>% filter(mass > 1000)
#> # A tibble: 1 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jabb~    175  1358 <NA>       green-tan~ orange           600 herm~ mascu~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
starwars %>% filter(hair_color == "none" & eye_color == "black")
#> # A tibble: 9 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Nien~    160    68 none       grey       black             NA male  mascu~
#> 2 Gasg~    122    NA none       white, bl~ black             NA male  mascu~
#> 3 Kit ~    196    87 none       green      black             NA male  mascu~
#> 4 Plo ~    188    80 none       orange     black             22 male  mascu~
#> # ... with 5 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

The closest base equivalent (and the inspiration for filter()) is subset():

subset(starwars, species == "Human")
#> # A tibble: 35 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> 3 Leia~    150    49 brown      light      brown           19   fema~ femin~
#> 4 Owen~    178   120 brown, gr~ light      blue            52   male  mascu~
#> # ... with 31 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
subset(starwars, mass > 1000)
#> # A tibble: 1 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jabb~    175  1358 <NA>       green-tan~ orange           600 herm~ mascu~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
subset(starwars, hair_color == "none" & eye_color == "black")
#> # A tibble: 9 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Nien~    160    68 none       grey       black             NA male  mascu~
#> 2 Gasg~    122    NA none       white, bl~ black             NA male  mascu~
#> 3 Kit ~    196    87 none       green      black             NA male  mascu~
#> 4 Plo ~    188    80 none       orange     black             22 male  mascu~
#> # ... with 5 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>

You can also use [, but this also requires which() to remove NAs:

starwars[which(starwars$species == "Human"), , drop = FALSE]
#> # A tibble: 35 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
#> 2 Dart~    202   136 none       white      yellow          41.9 male  mascu~
#> 3 Leia~    150    49 brown      light      brown           19   fema~ femin~
#> 4 Owen~    178   120 brown, gr~ light      blue            52   male  mascu~
#> # ... with 31 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
starwars[which(starwars$mass > 1000), , drop = FALSE]
#> # A tibble: 1 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Jabb~    175  1358 <NA>       green-tan~ orange           600 herm~ mascu~
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>
starwars[which(starwars$hair_color == "none" & starwars$eye_color == "black"), , drop = FALSE]
#> # A tibble: 9 x 14
#>   name  height  mass hair_color skin_color eye_color birth_year sex   gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
#> 1 Nien~    160    68 none       grey       black             NA male  mascu~
#> 2 Gasg~    122    NA none       white, bl~ black             NA male  mascu~
#> 3 Kit ~    196    87 none       green      black             NA male  mascu~
#> 4 Plo ~    188    80 none       orange     black             22 male  mascu~
#> # ... with 5 more rows, and 5 more variables: homeworld <chr>, species <chr>,
#> #   films <list>, vehicles <list>, starships <list>
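
To see why which() matters, the sketch below compares the two forms on a small made-up data frame, df_na, that has a missing value in the tested column.

```r
# Hypothetical data with a missing value in the column being tested
df_na <- data.frame(id = 1:4, mass = c(50, NA, 1500, 2000))

# A logical index containing NA produces an all-NA row for each NA
with_na <- df_na[df_na$mass > 1000, , drop = FALSE]
nrow(with_na)

# which() keeps only the positions that are TRUE, so the NA row is dropped,
# matching what filter() and subset() do
no_na <- df_na[which(df_na$mass > 1000), , drop = FALSE]
nrow(no_na)
```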

`mutate()`: Create or transform variables

dplyr::mutate() creates new variables from existing variables:

df %>% mutate(z = x + y, z2 = z ^ 2)
#> # A tibble: 100 x 4
#>       x     y     z    z2
#>   <int> <int> <int> <dbl>
#> 1     7    10    17   289
#> 2     5     3     8    64
#> 3     9     9    18   324
#> 4     1     4     5    25
#> # ... with 96 more rows

The closest base equivalent is transform(), but note that it cannot use freshly created variables:

head(transform(df, z = x + y, z2 = (x + y) ^ 2))
#>   x  y  z  z2
#> 1 7 10 17 289
#> 2 5  3  8  64
#> 3 9  9 18 324
#> 4 1  4  5  25
#> 5 2  5  7  49
#> 6 1  6  7  49

Alternatively, you can use $<-:

mtcars$cyl2 <- mtcars$cyl * 2
mtcars$cyl4 <- mtcars$cyl2 * 2
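
Another base option, not covered in the text above, is within(), which evaluates a block of assignments inside the data frame so that later lines can refer to variables created by earlier ones. The sketch below uses a small made-up data frame named small.

```r
# Hypothetical small data frame standing in for `df`
small <- data.frame(x = c(7, 5, 9), y = c(10, 3, 9))

# Inside within(), z2 can refer to z, which was created one line earlier
res <- within(small, {
  z <- x + y
  z2 <- z ^ 2
})
res
```

Note that the order in which within() appends the new columns may differ from the order of the assignments.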

When applied to a grouped data frame, dplyr::mutate() computes the new variables once per group:

gf <- tibble(g = c(1, 1, 2, 2), x = c(0.5, 1.5, 2.5, 3.5))
gf %>% 
  group_by(g) %>% 
  mutate(x_mean = mean(x), x_rank = rank(x))
#> # A tibble: 4 x 4
#> # Groups:   g [2]
#>       g     x x_mean x_rank
#>   <dbl> <dbl>  <dbl>  <dbl>
#> 1     1   0.5      1      1
#> 2     1   1.5      1      2
#> 3     2   2.5      3      1
#> 4     2   3.5      3      2

To replicate this in base R, you can use ave():

transform(gf, 
  x_mean = ave(x, g, FUN = mean), 
  x_rank = ave(x, g, FUN = rank)
)
#>   g   x x_mean x_rank
#> 1 1 0.5      1      1
#> 2 1 1.5      1      2
#> 3 2 2.5      3      1
#> 4 2 3.5      3      2

`pull()`: Pull out a single variable

dplyr::pull() extracts a variable either by name or by position:

mtcars %>% pull(1)
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
mtcars %>% pull(cyl)
#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

This is equivalent to [[ for positions and $ for names:

mtcars[[1]]
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
mtcars$cyl
#>  [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

`relocate()`: Change column order

dplyr::relocate() makes it easy to move a set of columns to a new position (by default, the front):

# to front
mtcars %>% relocate(gear, carb) 
#> # A tibble: 32 x 13
#>    gear  carb   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4     4  21       6   160   110  3.9   2.62  16.5     0     1    12    24
#> 2     4     4  21       6   160   110  3.9   2.88  17.0     0     1    12    24
#> 3     4     1  22.8     4   108    93  3.85  2.32  18.6     1     1     8    16
#> 4     3     1  21.4     6   258   110  3.08  3.22  19.4     1     0    12    24
#> # ... with 28 more rows

# to back
mtcars %>% relocate(mpg, cyl, .after = last_col()) 
#> # A tibble: 32 x 13
#>    disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4   mpg   cyl
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1   160   110  3.9   2.62  16.5     0     1     4     4    12    24  21       6
#> 2   160   110  3.9   2.88  17.0     0     1     4     4    12    24  21       6
#> 3   108    93  3.85  2.32  18.6     1     1     4     1     8    16  22.8     4
#> 4   258   110  3.08  3.22  19.4     1     0     3     1    12    24  21.4     6
#> # ... with 28 more rows

We can replicate this in base R with a little set manipulation:

mtcars[union(c("gear", "carb"), names(mtcars))]
#> # A tibble: 32 x 13
#>    gear  carb   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4     4  21       6   160   110  3.9   2.62  16.5     0     1    12    24
#> 2     4     4  21       6   160   110  3.9   2.88  17.0     0     1    12    24
#> 3     4     1  22.8     4   108    93  3.85  2.32  18.6     1     1     8    16
#> 4     3     1  21.4     6   258   110  3.08  3.22  19.4     1     0    12    24
#> # ... with 28 more rows

to_back <- c("mpg", "cyl")
mtcars[c(setdiff(names(mtcars), to_back), to_back)]
#> # A tibble: 32 x 13
#>    disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4   mpg   cyl
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1   160   110  3.9   2.62  16.5     0     1     4     4    12    24  21       6
#> 2   160   110  3.9   2.88  17.0     0     1     4     4    12    24  21       6
#> 3   108    93  3.85  2.32  18.6     1     1     4     1     8    16  22.8     4
#> 4   258   110  3.08  3.22  19.4     1     0     3     1    12    24  21.4     6
#> # ... with 28 more rows

Moving columns to somewhere in the middle requires a little more set twiddling.
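
One possible way to do that set twiddling is sketched below: remove the columns to move, then splice them back in with append() at the desired position. It assumes the plain built-in datasets::mtcars; the names to_move, rest, and new_order are illustrative. The result is intended to mirror relocate(gear, carb, .after = cyl).

```r
cars <- datasets::mtcars

# Move `gear` and `carb` so that they sit right after `cyl`
to_move <- c("gear", "carb")
rest <- setdiff(names(cars), to_move)
new_order <- append(rest, to_move, after = match("cyl", rest))

moved <- cars[new_order]
names(moved)
```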

`rename()`: Rename variables by name

dplyr::rename() allows you to rename variables by name or by position:

iris %>% rename(sepal_length = Sepal.Length, sepal_width = 2)
#> # A tibble: 150 x 5
#>   sepal_length sepal_width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> # ... with 146 more rows

Renaming variables by position is straightforward in base R:

iris2 <- iris
names(iris2)[2] <- "sepal_width"

Renaming variables by name requires a bit more work:

names(iris2)[names(iris2) == "Sepal.Length"] <- "sepal_length"

`rename_with()`: Rename variables with a function

dplyr::rename_with() transforms column names with a function:

iris %>% rename_with(toupper)
#> # A tibble: 150 x 5
#>   SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIES
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> # ... with 146 more rows

A similar effect can be achieved with setNames() in base R:

setNames(iris, toupper(names(iris)))
#> # A tibble: 150 x 5
#>   SEPAL.LENGTH SEPAL.WIDTH PETAL.LENGTH PETAL.WIDTH SPECIES
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> # ... with 146 more rows

`select()`: Select variables by name

dplyr::select() subsets columns by position, name, function of name, or other property:

iris %>% select(1:3)
#> # A tibble: 150 x 3
#>   Sepal.Length Sepal.Width Petal.Length
#>          <dbl>       <dbl>        <dbl>
#> 1          5.1         3.5          1.4
#> 2          4.9         3            1.4
#> 3          4.7         3.2          1.3
#> 4          4.6         3.1          1.5
#> # ... with 146 more rows
iris %>% select(Species, Sepal.Length)
#> # A tibble: 150 x 2
#>   Species Sepal.Length
#>   <fct>          <dbl>
#> 1 setosa           5.1
#> 2 setosa           4.9
#> 3 setosa           4.7
#> 4 setosa           4.6
#> # ... with 146 more rows
iris %>% select(starts_with("Petal"))
#> # A tibble: 150 x 2
#>   Petal.Length Petal.Width
#>          <dbl>       <dbl>
#> 1          1.4         0.2
#> 2          1.4         0.2
#> 3          1.3         0.2
#> 4          1.5         0.2
#> # ... with 146 more rows
iris %>% select(where(is.factor))
#> # A tibble: 150 x 1
#>   Species
#>   <fct>  
#> 1 setosa 
#> 2 setosa 
#> 3 setosa 
#> 4 setosa 
#> # ... with 146 more rows

Subsetting variables by position is straightforward in base R:

iris[1:3] # single argument selects columns; never drops
#> # A tibble: 150 x 3
#>   Sepal.Length Sepal.Width Petal.Length
#>          <dbl>       <dbl>        <dbl>
#> 1          5.1         3.5          1.4
#> 2          4.9         3            1.4
#> 3          4.7         3.2          1.3
#> 4          4.6         3.1          1.5
#> # ... with 146 more rows
iris[1:3, , drop = FALSE]
#> # A tibble: 3 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa

You have two options to subset by name:

iris[c("Species", "Sepal.Length")]
#> # A tibble: 150 x 2
#>   Species Sepal.Length
#>   <fct>          <dbl>
#> 1 setosa           5.1
#> 2 setosa           4.9
#> 3 setosa           4.7
#> 4 setosa           4.6
#> # ... with 146 more rows
subset(iris, select = c(Species, Sepal.Length))
#> # A tibble: 150 x 2
#>   Species Sepal.Length
#>   <fct>          <dbl>
#> 1 setosa           5.1
#> 2 setosa           4.9
#> 3 setosa           4.7
#> 4 setosa           4.6
#> # ... with 146 more rows

Subsetting by a function of the name requires a bit of work with grep():

iris[grep("^Petal", names(iris))]
#> # A tibble: 150 x 2
#>   Petal.Length Petal.Width
#>          <dbl>       <dbl>
#> 1          1.4         0.2
#> 2          1.4         0.2
#> 3          1.3         0.2
#> 4          1.5         0.2
#> # ... with 146 more rows

You can also use Filter() to subset by type:

Filter(is.factor, iris)
#> # A tibble: 150 x 1
#>   Species
#>   <fct>  
#> 1 setosa 
#> 2 setosa 
#> 3 setosa 
#> 4 setosa 
#> # ... with 146 more rows

`summarise()`: Reduce multiple values down to a single value

dplyr::summarise() computes one or more summaries for each group:

mtcars %>% 
  group_by(cyl) %>% 
  summarise(mean = mean(disp), n = n())
#> # A tibble: 3 x 3
#>     cyl  mean     n
#>   <dbl> <dbl> <int>
#> 1     4  105.    11
#> 2     6  183.     7
#> 3     8  353.    14

The closest base R equivalent is probably by(). Unfortunately, by() returns a list of data frames, but you can combine them back together again with do.call() and rbind():

mtcars_by <- by(mtcars, mtcars$cyl, function(df) {
  with(df, data.frame(cyl = cyl[[1]], mean = mean(disp), n = nrow(df)))
})
do.call(rbind, mtcars_by)
#>   cyl     mean  n
#> 4   4 105.1364 11
#> 6   6 183.3143  7
#> 8   8 353.1000 14

aggregate() comes very close to providing an elegant answer:

agg <- aggregate(disp ~ cyl, mtcars, function(x) c(mean = mean(x), n = length(x)))
agg
#>   cyl disp.mean   disp.n
#> 1   4  105.1364  11.0000
#> 2   6  183.3143   7.0000
#> 3   8  353.1000  14.0000

Unfortunately, while it looks like there are disp.mean and disp.n columns, it is actually a single matrix column:

str(agg)
#> 'data.frame':    3 obs. of  2 variables:
#>  $ cyl : num  4 6 8
#>  $ disp: num [1:3, 1:2] 105 183 353 11 7 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr  "mean" "n"
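
If separate columns are needed, one possible workaround (a sketch, not something the vignette prescribes) is to pass the columns of the result back through data.frame(), which splits a matrix argument into one column per matrix column. The example uses the plain built-in datasets::mtcars and the illustrative name flat.

```r
cars <- datasets::mtcars

agg <- aggregate(disp ~ cyl, cars, function(x) c(mean = mean(x), n = length(x)))
ncol(agg)  # the matrix counts as a single column

# do.call() hands each column of `agg` to data.frame() as a separate argument;
# data.frame() then expands the matrix into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)
```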

You can see a variety of other options at https://gist.github.com/hadley/c430501804349d382ce90754936ab8ec.

`slice()`: Choose rows by position

slice() selects rows by their position:

slice(mtcars, 25:n())
#> # A tibble: 8 x 13
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2    16    32
#> 2  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1     8    16
#> 3  26       4 120.     91  4.43  2.14  16.7     0     1     5     2     8    16
#> 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2     8    16
#> # ... with 4 more rows

This is straightforward to replicate with [:

mtcars[25:nrow(mtcars), , drop = FALSE]
#> # A tibble: 8 x 13
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  cyl2  cyl4
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  19.2     8 400     175  3.08  3.84  17.0     0     0     3     2    16    32
#> 2  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1     8    16
#> 3  26       4 120.     91  4.43  2.14  16.7     0     1     5     2     8    16
#> 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2     8    16
#> # ... with 4 more rows

Two-table verbs

When we want to merge two data frames (x and y), there are a variety of ways to bring them together. The various base R merge() calls are replaced by dplyr's family of *_join() functions.

dplyr	base
`inner_join(df1, df2)`	`merge(df1, df2)`
`left_join(df1, df2)`	`merge(df1, df2, all.x = TRUE)`
`right_join(df1, df2)`	`merge(df1, df2, all.y = TRUE)`
`full_join(df1, df2)`	`merge(df1, df2, all = TRUE)`
`semi_join(df1, df2)`	`df1[df1$x %in% df2$x, , drop = FALSE]`
`anti_join(df1, df2)`	`df1[!df1$x %in% df2$x, , drop = FALSE]`

For more information about two-table verbs, see vignette("two-table").

Mutating joins

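dplyr's inner_join(), left_join(), right_join(), and full_join() add new columns from y to x, matching rows based on a set of "keys". They are equivalent to calls to merge() with various settings of the all, all.x, and all.y arguments. The main difference is the order of the rows:

dplyr preserves the order of the x data frame.
merge() sorts the rows by the key columns.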
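
The row-order difference can be seen with base R alone. The sketch below uses two made-up tables, x and y, and a hypothetical helper column .row to recover the original order of x after merge().

```r
# Hypothetical example tables; `key` is deliberately unsorted in x
x <- data.frame(key = c("c", "a", "b"), x_val = 1:3)
y <- data.frame(key = c("a", "b", "c"), y_val = c(10, 20, 30))

# merge() returns rows sorted on the key column
merged <- merge(x, y, all.x = TRUE)
merged$key

# One way to get x's order back: record the row position before merging.
# (sort = FALSE is not enough: merge() documents that order as unspecified.)
x$.row <- seq_len(nrow(x))
restored <- merge(x, y, all.x = TRUE)
restored <- restored[order(restored$.row), names(restored) != ".row", drop = FALSE]
restored$key
```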

Filtering joins

dplyr's semi_join() and anti_join() affect only the rows, not the columns:

band_members %>% semi_join(band_instruments)
#> Joining, by = "name"
#> # A tibble: 2 x 2
#>   name  band   
#>   <chr> <chr>  
#> 1 John  Beatles
#> 2 Paul  Beatles
band_members %>% anti_join(band_instruments)
#> Joining, by = "name"
#> # A tibble: 1 x 2
#>   name  band  
#>   <chr> <chr> 
#> 1 Mick  Stones

These can be replicated in base R with [ and %in%:

band_members[band_members$name %in% band_instruments$name, , drop = FALSE]
#> # A tibble: 2 x 2
#>   name  band   
#>   <chr> <chr>  
#> 1 John  Beatles
#> 2 Paul  Beatles
band_members[!band_members$name %in% band_instruments$name, , drop = FALSE]
#> # A tibble: 1 x 2
#>   name  band  
#>   <chr> <chr> 
#> 1 Mick  Stones

Semi and anti joins with multiple key variables are considerably more challenging to implement.
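
One rough approach, shown here only as a sketch under stated assumptions, is to collapse each row's key values into a single string and then reuse the %in% idiom. The tables x and y, the vector keys, and the helpers key_x and key_y are all made up for illustration. It assumes the separator never appears in the key values and that the keys convert to text identically in both tables, which can fail for factors or floating-point numbers.

```r
# Hypothetical tables sharing two key columns, `a` and `b`
x <- data.frame(a = c(1, 1, 2, 2), b = c("u", "v", "u", "v"), val = 1:4)
y <- data.frame(a = c(1, 2), b = c("v", "u"))

keys <- c("a", "b")

# Collapse each row's key values into one string so %in% can compare rows.
# Assumption: "\r" never occurs inside the key values themselves.
key_x <- do.call(paste, c(unname(as.list(x[keys])), sep = "\r"))
key_y <- do.call(paste, c(unname(as.list(y[keys])), sep = "\r"))

semi <- x[key_x %in% key_y, , drop = FALSE]   # like semi_join(x, y, by = keys)
anti <- x[!key_x %in% key_y, , drop = FALSE]  # like anti_join(x, y, by = keys)
```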
