画像や配列のデータがRのワーキングメモリ (RAM) に簡単に数回収まるようになったら、ラッキーだと思いよう。この文書はあなたのために書かれたのではない。画像が大きすぎる場合、あるいはその他の理由でファイルよりも小さなデータチャンクを扱いたい場合、このドキュメントを読んでみてみよう。まず、低レベルのインターフェイスについて説明し、次に高レベルのインターフェイスとして、すべての読み込みを遅延させるスタープロキシオブジェクトを使用する方法について説明する。

前文: starsdata パッケージ

この vignette のすべての例を実行するには、stars パッケージに保持するには大きすぎる (1 Gb) データセットが入ったパッケージをインストールする必要がある。これらは drat repo に入っており、インストールは以下の方法で行う。

install.packages("starsdata", repos = "http://gis-bigdata.uni-muenster.de/pebesma", type = "source") 
# possibly after: options(timeout = 100)
# or from an alternative repository:
# install.packages("starsdata", repos = "http://pebesma.staff.ifgi.de", type = "source")

チャンクの読み込み、解像度の変更、バンドの選択

read_stars にはという引数があり、GDAL データセットの読み込み方法を制御する。デフォルトでは、すべてのピクセルとすべてのバンドがメモリに読み込まれる。これは多くの時間を消費し、多くのメモリを必要とする可能性がある。ファイルは圧縮されている可能性があり、ファイル内でバイトで表現されたピクセル値は、R では 8 バイトの倍数に変換されることを忘れないでみよう。
RasterIO

このために RasterIO を使用する理由は、使用するパラメータが、使用する GDAL RasterIO 関数に直接マッピングされていることがある (R の 1 ベースのオフセットインデックスを C++ の 0 ベースのオフセットに適合させた後) 。

特定のチャンクを読む

RasterIO の使用例である。

library(stars)
## Loading required package: abind
## Loading required package: sf
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
tif = system.file("tif/L7_ETMs.tif", package = "stars")
rasterio = list(nXOff = 6, nYOff = 6, nXSize = 100, nYSize = 100, bands = c(1, 3, 4))
(x = read_stars(tif, RasterIO = rasterio))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##              Min. 1st Qu. Median     Mean 3rd Qu. Max.
## L7_ETMs.tif    23      54     63 62.05977   73.25  235
## dimension(s):
##      from  to  offset delta                     refsys point x/y
## x       6 105  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE [x]
## y       6 105 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE [y]
## band    1   3      NA    NA                         NA    NA
dim(x)
##    x    y band 
##  100  100    3

これと比較すると

st_dimensions(read_stars(tif))
##      from  to  offset delta                     refsys point x/y
## x       1 349  288776  28.5 SIRGAS 2000 / UTM zone 25S FALSE [x]
## y       1 352 9120761 -28.5 SIRGAS 2000 / UTM zone 25S FALSE [y]
## band    1   6      NA    NA                         NA    NA

と見ることができる。

delta の値は変わらない。
グリッドのオフセット (原点の x/y 座標) は同じままである。
from と to は新しいエリアを反映し、新しい delta の値と関連している。
dim(x) は新しいサイズを反映している。
3バンドのみ読み込み

別の解像度で読む

nBufXSize と nBufYSize を設定することで、低解像度 (高解像度も可能！) のデータセットを読み込むことができる。

rasterio = list(nXOff = 6, nYOff = 6, nXSize = 100, nYSize = 100,
                nBufXSize = 20, nBufYSize = 20, bands = c(1, 3, 4))
(x = read_stars(tif, RasterIO = rasterio))
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##              Min. 1st Qu. Median     Mean 3rd Qu. Max.
## L7_ETMs.tif    27      53     63 62.09417      74  151
## dimension(s):
##      from to  offset  delta                     refsys point x/y
## x       2 21  288776  142.5 SIRGAS 2000 / UTM zone 25S FALSE [x]
## y       2 21 9120761 -142.5 SIRGAS 2000 / UTM zone 25S FALSE [y]
## band    1  3      NA     NA                         NA    NA

ということを、追加で見ている。

nBufXSize と nBufYSize が nXSize と nYSize よりも 5 倍小さい値に設定されたため、delta (ラスタセルサイズ) の値が 5 倍に増加した。
グリッドのオフセット座標はそのままである。
from と to は新しいエリアを反映しているが、新しい delta のセルサイズ値に関連している。

ここでは、3×3の面積を100×100に拡大したものを読み込んでいる。

rasterio = list(nXOff = 6, nYOff = 6, nXSize = 3, nYSize = 3,
   nBufXSize = 100, nBufYSize = 100, bands = 1)
x = read_stars(tif, RasterIO = rasterio)
dim(x)
##   x   y 
## 100 100
plot(x)

理由は「 ” only three grid cells is that the default sampling method is” 近隣を見る」ことである。によって修正することができる。

rasterio = list(nXOff = 6, nYOff = 6, nXSize = 3, nYSize = 3,
   nBufXSize = 100, nBufYSize = 100, bands = 1, resample = "cubic_spline")
x = read_stars(tif, RasterIO = rasterio)
dim(x)
##   x   y 
## 100 100
plot(x)

パラメータ resample では、以下の方法が使用可能である。

これらのメソッドはすべてGDALで実装されている。これらのメソッドが具体的に何を行っているかは、GDALのドキュメントまたはソースコードを参照。

star 形プロキシオブジェクト

star 形プロキシオブジェクトは別のアプローチをとる。作成時にはデータをまったく持たず、データを読み込める場所へのポインタだけを持つ。データは必要なときに必要な分だけ読み込まれる。プロキシオブジェクトをプロットすると、データはネイティブ解像度ではなく、画面上のピクセル解像度で読み込まれるので、例えば10000 x 10000のSentinel 2 (レベル1C) 画像があれば

granule = system.file("sentinel/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.zip", package = "starsdata")
s2 = paste0("SENTINEL2_L1C:/vsizip/", granule, "/S2A_MSIL1C_20180220T105051_N0206_R051_T32ULE_20180221T134037.SAFE/MTD_MSIL1C.xml:10m:EPSG_32632")
(p = read_stars(s2, proxy = TRUE))

となり、データが読み込まれないため、_instantly_に起こる。このオブジェクトをプロットすると

system.time(plot(p))

プロット上に見える画素だけを読み込むので、1秒程度しかかからない。のように、まず画像全体をメモリに読み込むとすると、1秒程度しかかからない。

p = read_stars(s2, proxy = FALSE)

となると、読み込みだけだと1分以上かかり、5Gbのメモリが必要になる。

スタープロキシオブジェクト用メソッド

methods(class = "stars_proxy")
##  [1] [               [[<-            [<-             adrop          
##  [5] aggregate       aperm           as.data.frame   c              
##  [9] coerce          dim             droplevels      filter         
## [13] hist            initialize      is.na           Math           
## [17] merge           Ops             plot            predict        
## [21] print           show            slotsFromS3     split          
## [25] st_apply        st_as_sf        st_as_stars     st_crop        
## [29] st_dimensions<- st_downsample   st_mosaic       st_redimension 
## [33] st_sample       st_set_bbox     write_stars    
## see '?methods' for accessing help and source code

Select attributes

通常の stars オブジェクトと同様に、[ の第一引数で属性を選択することができる。

x = c("avhrr-only-v2.19810901.nc",
"avhrr-only-v2.19810902.nc",
"avhrr-only-v2.19810903.nc",
"avhrr-only-v2.19810904.nc",
"avhrr-only-v2.19810905.nc",
"avhrr-only-v2.19810906.nc",
"avhrr-only-v2.19810907.nc",
"avhrr-only-v2.19810908.nc",
"avhrr-only-v2.19810909.nc")
file_list = system.file(paste0("netcdf/", x), package = "starsdata")
y = read_stars(file_list, quiet = TRUE, proxy = TRUE)
names(y)
y["sst"]

この選択により、9つのNetCDFファイルすべてから4つのサブデータセットから1つのサブデータセットに読み込みが制限されることに注意。

地域を選択する

もう1つの可能性は、空間オブジェクトに基づいた長方形の領域を切り取る、または選択することである。これは、bbox オブジェクト、または sf、sfc、stars オブジェクトを渡すことによって行うことができ、そこからバウンディングボックスが取得されることになる。例を挙げる。

bb = st_bbox(c(xmin = 10.125, ymin = 0.125, xmax = 70.125, ymax = 70.125))
ysub = y[bb]
st_dimensions(ysub)
class(ysub) # still no data here!!
plot(ysub, reset = FALSE) # plot reads the data, at resolution that is relevant
plot(st_as_sfc(bb), add = TRUE, lwd = .5, border = 'red')

遅延評価、評価順序の変更

他のいくつかのアクションは stars_proxy オブジェクト上で実行できるが、その効果はデータが実際に必要になるまで遅延する ( plot , write_stars )。例えば、上に示した以外の次元での範囲選択は、まずデータが必要であり、その時初めて実行できる。このような関数は、call_list という属性で、オブジェクトに追加される。

yy = adrop(y)
yyy = yy[,1:10,1:10,]
class(yyy) # still no data
st_dimensions(yyy) # and dimensions not adjusted
attr(yyy, "call_list") # the name of object in the call (y) is replaced with x:

こうすることで、操作の順番を最適化することができる。例えば、st_apply の場合、関数が適用される次元で順次読み込みを行うことができる。

各バンドに関数を適用する場合 (例: バンドの分位数を計算する) 、バンドを順次読み込み、分位数が計算された後に破棄することができる。
時系列関数をピクセルの時系列に適用し、その結果を地図上にプロットした場合、時系列関数は実際にプロットされたピクセルに対してのみ評価される。これは、例えば

plot(st_apply(x, c("x", "y"), range))

plot はどのピクセルが表示されるかを知っており、このサブセットに対して st_apply が実行される前に、x がどのようにダウンサンプリングされるかを制御する。

データの取得

データの取得には、配列全体を読み込んでから、call_list を順次評価することになる。

(x = st_as_stars(yyy)) # read, adrop, subset

評価順序を変更した場合のプロット

Sentinel 2のデータでは、バンド4が近赤外、バンド1が赤を表すので、NDVIは次のように計算される。

# S2 10m: band 4: near infrared, band 1: red.
#ndvi = function(x) (x[4] - x[1])/(x[4] + x[1])
ndvi = function(x1, x2, x3, x4) (x4 - x1)/(x4 + x1)
rm(x)
(s2.ndvi = st_apply(p, c("x", "y"), ndvi))
system.time(plot(s2.ndvi)) # read - compute ndvi - plot

マルチレゾリューション・プロキシ・オブジェクト

このセクションでは、異なるマップが異なる解像度を持つ場合に、stars_proxy オブジェクトがどのように対処するかについて、いくつかの例を示す。ここでの前提は以下の通りである。

すべての地図は、同じ原点座標 (通常は左上隅) とCRSを持つ必要がある。
最初のマップは「作業用」解像度を決定し、例えばネイティブ解像度やダウンサンプル解像度を参照する。

セルサイズ1、2、3の4つのマップを作成することにする。

s1 = st_as_stars(matrix(1:16, 4))
s2 = st_as_stars(matrix(1:16, 4))
s3 = st_as_stars(matrix(1:16, 4))
attr(s1, "dimensions")$X1$offset = 0
attr(s1, "dimensions")$X2$offset = 4
attr(s2, "dimensions")$X1$offset = 0
attr(s2, "dimensions")$X2$offset = 4
attr(s3, "dimensions")$X1$offset = 0
attr(s3, "dimensions")$X2$offset = 4
attr(s1, "dimensions")$X1$delta =  1
attr(s1, "dimensions")$X2$delta = -1
attr(s2, "dimensions")$X1$delta =  2
attr(s2, "dimensions")$X2$delta = -2
attr(s3, "dimensions")$X1$delta =  3
attr(s3, "dimensions")$X2$delta = -3
plot(s1, axes = TRUE, text_values = TRUE, text_color = 'orange')

plot(s2, axes = TRUE, text_values = TRUE, text_color = 'orange')

plot(s3, axes = TRUE, text_values = TRUE, text_color = 'orange')

セル値と寸法は同じだが、セルサイズ、つまりエクステントが異なる3つのラスタを作成した。これらを単一のプロキシオブジェクトにバインドして

fn1 = paste0(tempdir(), .Platform$file.sep, "img1.tif")
fn2 = paste0(tempdir(), .Platform$file.sep, "img2.tif")
fn3 = paste0(tempdir(), .Platform$file.sep, "img3.tif")
write_stars(s1, fn1)
write_stars(s2, fn2)
write_stars(s3, fn3) 
(r1 = read_stars(c(fn1, fn2, fn3), proxy = TRUE))
## multi-resolution stars_proxy object with 3 attributes in 3 file(s):
## $`1`
## [1] "[...]/img1.tif"
## 
## $`2`
## [1] "[...]/img2.tif"
## 
## $`3`
## [1] "[...]/img3.tif"
## 
## dimension(s):
##   from to offset delta x/y
## x    1  4      0     1 [x]
## y    1  4      4    -1 [y]

表示された要約にmulti-resolutionと記載されているのがわかる。これを stars オブジェクトに変換すると、セカンダリラスタは最初のラスタのセルサイズ + 範囲に再サンプルされる。

st_as_stars(r1) %>%
  merge() %>%
  plot(breaks = "equal", text_values = TRUE, text_color = 'orange', axes = TRUE)

これをオブジェクトの解像度で定義されたサブレンジに対して行うと、次のようになる。

st_as_stars(r1[,2:4,2:4]) %>%
  merge() %>%
  plot(breaks = "equal", text_values = TRUE, text_color = 'orange', axes = TRUE)

同じ領域 ( [0,4] x [0,4] ) に、異なる解像度 (セルサイズ1、1/2、1/3) で4つのマップを作成することにする。

s4 = st_as_stars(matrix(1: 16, 4))
s5 = st_as_stars(matrix(1: 64, 8))
s6 = st_as_stars(matrix(1:144,12))
attr(s4, "dimensions")$X1$offset = 0
attr(s4, "dimensions")$X2$offset = 4
attr(s5, "dimensions")$X1$offset = 0
attr(s5, "dimensions")$X2$offset = 4
attr(s6, "dimensions")$X1$offset = 0
attr(s6, "dimensions")$X2$offset = 4
attr(s4, "dimensions")$X1$delta =  1
attr(s4, "dimensions")$X2$delta = -1
attr(s5, "dimensions")$X1$delta =  1/2
attr(s5, "dimensions")$X2$delta = -1/2
attr(s6, "dimensions")$X1$delta =  1/3
attr(s6, "dimensions")$X2$delta = -1/3
plot(s4, axes = TRUE, text_values = TRUE, text_color = 'orange')

plot(s5, axes = TRUE, text_values = TRUE, text_color = 'orange')

plot(s6, axes = TRUE, text_values = TRUE, text_color = 'orange')

fn4 = paste0(tempdir(), .Platform$file.sep, "img4.tif")
fn5 = paste0(tempdir(), .Platform$file.sep, "img5.tif")
fn6 = paste0(tempdir(), .Platform$file.sep, "img6.tif")
write_stars(s4, fn4)
write_stars(s5, fn5)
write_stars(s6, fn6) 
(r2 = read_stars(c(fn4, fn5, fn6), proxy = TRUE))
## multi-resolution stars_proxy object with 3 attributes in 3 file(s):
## $`4`
## [1] "[...]/img4.tif"
## 
## $`5`
## [1] "[...]/img5.tif"
## 
## $`6`
## [1] "[...]/img6.tif"
## 
## dimension(s):
##   from to offset delta x/y
## x    1  4      0     1 [x]
## y    1  4      4    -1 [y]

st_as_stars(r2) %>%
  merge() %>%
  plot(breaks = "equal", text_values = TRUE, text_color = 'orange', axes = TRUE)

st_as_stars(r2[,2:4,2:4]) %>%
  merge() %>%
  plot(breaks = "equal", text_values = TRUE, text_color = 'orange', axes = TRUE)

最後に、1枚目のラスタが高解像度である場合の例である。

(r3 = read_stars(c(fn6, fn5, fn4), proxy = TRUE))
## multi-resolution stars_proxy object with 3 attributes in 3 file(s):
## $`6`
## [1] "[...]/img6.tif"
## 
## $`5`
## [1] "[...]/img5.tif"
## 
## $`4`
## [1] "[...]/img4.tif"
## 
## dimension(s):
##   from to offset     delta x/y
## x    1 12      0  0.333333 [x]
## y    1 12      4 -0.333333 [y]

st_as_stars(r3) %>%
  merge() %>%
  plot(breaks = "equal", text_values = TRUE, text_color = 'orange', axes = TRUE)

st_as_stars(r3[,2:6,3:6]) %>%
  merge() %>%
  plot(breaks = "equal", text_values = TRUE, text_color = 'orange', axes = TRUE)

2. stars proxy objects

Edzer Pebesma