Using spatial resamples for analysis

spatialsample で作成されたリサンプリングオブジェクトは、 rsample で作成されたオブジェクトと同じように、比較からモデルの評価まで多くの方法で使用することが可能である。これらのオブジェクトは、 tidymodels フレームワークの他の部分と一緒に使用することができるが、ここでは、IA州Amesの住宅データの線形モデリングを使用して、より基本的な例を見てみよう。

data("ames", package = "modeldata")

Ames Housing データは、通常の tibbleである。spatialsample の関数の多くは標準的な tibble をサポートしているが、いくつかの関数は空間距離計算を適切に処理するために、データを sf オブジェクトにする必要がある。Ames のデータをsfオブジェクトに変換するには、sf::st_as_sf() 関数を使用する。

ames_sf <- sf::st_as_sf(
  ames,
  # "coords" is in x/y order -- so longitude goes first!
  coords = c("Longitude", "Latitude"),
  # Set our coordinate reference system to EPSG:4326,
  # the standard WGS84 geodetic coordinate reference system
  crs = 4326
)

今回の vignette では、エイムズのデータセットに含まれる住宅の販売価格をモデル化する。これらの住宅の販売価格は、建築年、居住面積 (広さ) 、住宅のタイプ (二世帯住宅かタウンハウスか一戸建てか) に依存し、さらにタイプと住宅サイズの間におそらく相互作用があるとする。

log10(Sale_Price) ~ Year_Built + Gr_Liv_Area +  Bldg_Type

この関係は、Ames 市全体で空間的な自己相関を示す可能性があり、spatialsampleが提供するいくつかの異なる方法のいずれかを使用して、その調査を試みることができる。

例えば、v = 15 空間交差検証フォルドを作成し、spatial_clustering_cv()、 uses k-means clustering in order to divide the data into foldsを作成することができる。そして、spatialsampleの autoplot() 関数を使用して、これらのフォルドを視覚化することができる。

library(spatialsample)

set.seed(123)
cluster_folds <- spatial_clustering_cv(ames_sf, v = 15)

autoplot(cluster_folds)

私たちの cluster_folds オブジェクトは、rset オブジェクトで、splits 列の多くのリサンプルまたは rsplit オブジェクトを含んでいる。結果として得られるパーティションは、必ずしも同数のオブザベーションを含んでいるとは限らない。

cluster_folds
#> #  15-fold spatial cross-validation 
#> # A tibble: 15 × 2
#>    splits             id    
#>    <list>             <chr> 
#>  1 <split [2797/133]> Fold01
#>  2 <split [2724/206]> Fold02
#>  3 <split [2777/153]> Fold03
#>  4 <split [2830/100]> Fold04
#>  5 <split [2836/94]>  Fold05
#>  6 <split [2759/171]> Fold06
#>  7 <split [2560/370]> Fold07
#>  8 <split [2810/120]> Fold08
#>  9 <split [2715/215]> Fold09
#> 10 <split [2776/154]> Fold10
#> 11 <split [2778/152]> Fold11
#> 12 <split [2695/235]> Fold12
#> 13 <split [2750/180]> Fold13
#> 14 <split [2496/434]> Fold14
#> 15 <split [2717/213]> Fold15

しかし、空間クラスタリングは、spatialsample を使った空間交差検証のための 一つの 手法ではあるが、それだけが利用できる手法ではない。他の方法は大まかに似ていて、空間的な配置に基づいてデータをいくつかの (必ずしも均等ではない) フォールドに分割する。

例えば、spatial_block_cv() 関数は、お客様のデータで spatial blocking を実行する。

set.seed(123)
block_folds <- spatial_block_cv(ames_sf, v = 15)

autoplot(block_folds)

データ中のどの場所が密接に関連している可能性が高いかをすでに把握している場合は、spatial_leave_location_out_cv() 関数を使用して leave-location-out cross-validation を実行することも可能である。例えば、この関数を使って、Amesのデータを近隣に基づくフォールドに分割することができる。

set.seed(123)
location_folds <- 
  spatial_leave_location_out_cv(
    ames_sf,
    group = Neighborhood,
    v = 15
  )

autoplot(location_folds)

これで、たくさんの異なるリサンプルを手に入れることができた。それぞれのサンプルに、同じコードを使って、同じようにモデルを当てはめようと思われる。この作業を少し簡単にするために、type という新しい列を追加して、それぞれのフォールドがどのような種類のリサンプルであるかを示すようにし、それらを新しいデータフレームに結合することにしよう。

cluster_folds$type <- "cluster"
block_folds$type <- "block"
location_folds$type <- "location"

resamples <- 
  dplyr::bind_rows(
    cluster_folds,
    block_folds,
    location_folds
  )

では、各レサンプルに対して、次のような関数を書いてみよう。

# `splits` will be the `rsplit` object
compute_preds <- function(splits) {
  # fit the model to the analysis set
  mod <- lm(log10(Sale_Price) ~ Year_Built + Bldg_Type * log10(Gr_Liv_Area), 
            data = analysis(splits))
  # identify the assessment set
  holdout <- assessment(splits)
  # return the assessment set, with true and predicted price
  tibble::tibble(
    geometry = holdout$geometry,
    Sale_Price = log10(holdout$Sale_Price), 
    .pred = predict(mod, holdout)
  )
}

この関数を splits のうちの1つだけに適用することができる。

compute_preds(cluster_folds$splits[[7]]) 
#> # A tibble: 370 × 3
#>                geometry Sale_Price .pred
#>             <POINT [°]>      <dbl> <dbl>
#>  1 (-93.67907 42.03608)       4.83  4.99
#>  2  (-93.6772 42.03654)       5.05  5.20
#>  3 (-93.67367 42.03479)       5.17  5.14
#>  4 (-93.67096 42.03569)       5.14  5.06
#>  5  (-93.6721 42.03486)       5.09  5.07
#>  6 (-93.67207 42.03457)       5.12  5.11
#>  7 (-93.66969 42.03533)       5.10  5.10
#>  8  (-93.66067 42.0346)       5.23  5.25
#>  9 (-93.65921 42.03456)       5.12  5.26
#> 10 (-93.65623 42.03345)       5.39  5.33
#> # … with 360 more rows

あるいは、この関数を purrr::map() を使って、すべての splits に適用することもできる。

library(purrr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

cv_res <- resamples %>%
  mutate(.preds = map(splits, compute_preds))

library(tidyr)
library(yardstick)
#> For binary classification, the first factor level is assumed to be the event.
#> Use the argument `event_level = "second"` to alter this as needed.

cv_rmse <- cv_res %>%
  unnest(.preds) %>%
  group_by(id, type) %>%
  rmse(Sale_Price, .pred)

cv_rmse
#> # A tibble: 45 × 5
#>    id     type    .metric .estimator .estimate
#>    <chr>  <chr>   <chr>   <chr>          <dbl>
#>  1 Fold01 block   rmse    standard      0.0788
#>  2 Fold01 cluster rmse    standard      0.0715
#>  3 Fold02 block   rmse    standard      0.0705
#>  4 Fold02 cluster rmse    standard      0.104 
#>  5 Fold03 block   rmse    standard      0.0757
#>  6 Fold03 cluster rmse    standard      0.107 
#>  7 Fold04 block   rmse    standard      0.0962
#>  8 Fold04 cluster rmse    standard      0.0542
#>  9 Fold05 block   rmse    standard      0.103 
#> 10 Fold05 cluster rmse    standard      0.146 
#> # … with 35 more rows

RMSE は都市によって異なる可能性があるようなので、結果を元にメトリクスを結合してプロットしてむ。

library(ggplot2)

cv_res %>%
  unnest(.preds) %>%
  left_join(cv_rmse, by = c("id", "type")) %>%
  ggplot(aes(color = .estimate)) + 
  geom_sf(aes(geometry = geometry), alpha = 0.5) +
  labs(color = "RMSE") +
  scale_color_viridis_c() + 
  facet_wrap(vars(type), ncol = 1)

ご覧のように、得られる結果は、データの再サンプル方法に大きく依存する。データに適した方法 (空間ブロッキングやバッファード・クロスバリデーションなどの方法では、適切な距離) を使用することが重要である。