Estimating Phylogenetic Multilevel Models with brms

イントロダクション

本 vignette では、brmsを用いた系統樹マルチレベルモデルの指定方法について説明する。これらのモデルは、進化生物学において、多くの種のデータを同時に分析する場合に関連する。通常のアプローチでは、マルチレベルモデルにおいて種をグループ化因子としてモデル化し、種ごとに変化する切片（場合によっては傾きも変化する）を推定することになる。しかし、種は同じ系統樹に由来するため独立ではなく、したがって、この依存性を組み込むためにモデルを調整する必要がある。ここで説明する例は、書籍「Modern Phylogenetic Comparative Methods and the application in Evolutionary Biology (de Villemeruil & Nakagawa, 2014)」の第11章から引用している。必要なデータは対応するウェブサイト (https://www.mpcm-evolution.com/) からダウンロードできる。これらのモデルの中には、適合に数分かかるものがある。

簡単な系統樹モデル

表現型、phen （例えば体の大きさ）、cofactor 変数（例えば環境の温度）の測定値があるとする。次のコードでデータを準備する。

phylo <- ape::read.nexus("https://paul-buerkner.github.io/data/phylo.nex")
data_simple <- read.table(
  "https://paul-buerkner.github.io/data/data_simple.txt",
  header = TRUE
)
head(data_simple)

       phen  cofactor phylo
1 107.06595 10.309588  sp_1
2  79.61086  9.690507  sp_2
3 116.38186 15.007825  sp_3
4 143.28705 19.087673  sp_4
5 139.60993 15.658404  sp_5
6  68.50657  6.005236  sp_6

phylo オブジェクトは、種間の関係に関する情報を含んでいる。この情報を用いて、種の共分散行列を構築することができる(Hadfield & Nakagawa, 2010)。

A <- ape::vcv.phylo(phylo)

これで、最初の系統的マルチレベル・モデルの適合の準備が整いた。

model_simple <- brm(
  phen ~ cofactor + (1|gr(phylo, cov = A)),
  data = data_simple,
  family = gaussian(),
  data2 = list(A = A),
  prior = c(
    prior(normal(0, 10), "b"),
    prior(normal(0, 50), "Intercept"),
    prior(student_t(3, 0, 20), "sd"),
    prior(student_t(3, 0, 20), "sigma")
  )
)

(1|phylo) の代わりに (1|gr(phylo, cov = A)) を使うという例外を除いて、これは種にわたって切片が変化する基本的なマルチレベル・モデルである（ phylo はこのデータセットにおける種のインジケータ）。しかし、gr 関数で cov = A を使用することにより、共分散行列 A で指定されるように、種の相関があることを確認する。A 自体は data2 の引数で渡す。これは data の引数の規則的な構造に当てはまらないあらゆる種類のデータに対して使用できる。このモデルで良い収束を得るためには、事前分布を設定する必要はないが、サンプリング速度を少し向上させる。フィットの後、結果を詳しく調べることができる。

summary(model_simple)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: phen ~ cofactor + (1 | gr(phylo, cov = A)) 
   Data: data_simple (Number of observations: 200) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~phylo (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    14.57      2.15    10.50    18.99 1.00      915     1817

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    37.96      7.09    23.79    52.08 1.00     1995     2097
cofactor      5.17      0.14     4.90     5.45 1.00     5523     2796

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     9.22      0.73     7.86    10.68 1.00     1205     2019

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

plot(model_simple, N = 2, ask = FALSE)

plot(conditional_effects(model_simple), points = TRUE)

いわゆる系統信号（しばしば $\lambda$ で記号化される）は hypothesis の方法で計算することができ、この例ではおおよそ $\lambda = 0.7$ となる。

hyp <- "sd_phylo__Intercept^2 / (sd_phylo__Intercept^2 + sigma^2) = 0"
(hyp <- hypothesis(model_simple, hyp, class = NULL))

Hypothesis Tests for class :
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (sd_phylo__Interc... = 0      0.7      0.08     0.51     0.84         NA        NA    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

plot(hyp)

なお、系統的シグナルは、文脈上の系統解析で用いられるクラス内相関（ICC）の同義語に過ぎない。

反復測定による系統樹モデル

多くの場合、1つの種につき複数の観測値があり、これによってより複雑な系統モデルを当てはめることができる。

data_repeat <- read.table(
  "https://paul-buerkner.github.io/data/data_repeat.txt",
  header = TRUE
)
data_repeat$spec_mean_cf <-
  with(data_repeat, sapply(split(cofactor, phylo), mean)[phylo])
head(data_repeat)

       phen  cofactor species phylo spec_mean_cf
1 107.41919 11.223724    sp_1  sp_1    10.309588
2 109.16403  9.805934    sp_1  sp_1    10.309588
3  91.88672 10.308423    sp_1  sp_1    10.309588
4 121.54341  8.355349    sp_1  sp_1    10.309588
5 105.31638 11.854510    sp_1  sp_1    10.309588
6  64.99859  4.314015    sp_2  sp_2     3.673914

変数 spec_mean_cf は、各生物種の補酵素の平均を含むだけである。反復測定系統樹モデルのコードは以下のようになる。

model_repeat1 <- brm(
  phen ~ spec_mean_cf + (1|gr(phylo, cov = A)) + (1|species),
  data = data_repeat,
  family = gaussian(),
  data2 = list(A = A),
  prior = c(
    prior(normal(0,10), "b"),
    prior(normal(0,50), "Intercept"),
    prior(student_t(3,0,20), "sd"),
    prior(student_t(3,0,20), "sigma")
  ),
  sample_prior = TRUE, chains = 2, cores = 2,
  iter = 4000, warmup = 1000
)

変数 phylo と species は、どちらも種の識別子であるため同一である。しかし、私たちは系統共分散を phylo についてのみモデル化しているので、species 変数は、種間の系統的関係から独立した任意の特定の効果（たとえば、環境またはニッチ効果）を説明するものである。ここでも、系統的信号の推定値だけでなく、モデルの要約を得ることができる。

summary(model_repeat1)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: phen ~ spec_mean_cf + (1 | gr(phylo, cov = A)) + (1 | species) 
   Data: data_repeat (Number of observations: 1000) 
  Draws: 2 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 6000

Group-Level Effects: 
~phylo (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    16.50      1.88    13.01    20.33 1.00     1519     2991

~species (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     4.97      0.84     3.30     6.58 1.00      979     1838

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept       36.20      7.94    20.89    51.59 1.00     3029     3513
spec_mean_cf     5.10      0.10     4.89     5.30 1.00     6681     4473

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     8.11      0.20     7.71     8.51 1.00     5637     4417

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

hyp <- paste(
  "sd_phylo__Intercept^2 /",
  "(sd_phylo__Intercept^2 + sd_species__Intercept^2 + sigma^2) = 0"
)
(hyp <- hypothesis(model_repeat1, hyp, class = NULL))

Hypothesis Tests for class :
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (sd_phylo__Interc... = 0     0.74      0.06     0.62     0.84          0         0    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

plot(hyp)

これまでは、種内における補酵素の変動は完全に無視してきた。これをモデルに取り入れるために、次のように定義する。

data_repeat$within_spec_cf <- data_repeat$cofactor - data_repeat$spec_mean_cf

とし、within_spec_cf を追加予測因子として再度フィットさせた。

model_repeat2 <- update(
  model_repeat1, formula = ~ . + within_spec_cf,
  newdata = data_repeat, chains = 2, cores = 2,
  iter = 4000, warmup = 1000
)

結果はほとんど変わらず、表現型と種内分散の間には一見何の関係もないように見える cofactor。

summary(model_repeat2)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: phen ~ spec_mean_cf + (1 | gr(phylo, cov = A)) + (1 | species) + within_spec_cf 
   Data: data_repeat (Number of observations: 1000) 
  Draws: 2 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 6000

Group-Level Effects: 
~phylo (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)    16.43      1.91    12.92    20.30 1.00     1100     1354

~species (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     4.97      0.86     3.17     6.61 1.00      905      855

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept         36.55      7.79    21.06    51.69 1.00     2625     3105
spec_mean_cf       5.10      0.10     4.89     5.30 1.00     5363     4459
within_spec_cf    -0.06      0.19    -0.42     0.31 1.00    10310     3896

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     8.11      0.20     7.72     8.53 1.00     5943     4289

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

また、系統的な信号もほぼ変わらない。

hyp <- paste(
  "sd_phylo__Intercept^2 /",
  "(sd_phylo__Intercept^2 + sd_species__Intercept^2 + sigma^2) = 0"
)
(hyp <- hypothesis(model_repeat2, hyp, class = NULL))

Hypothesis Tests for class :
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob Star
1 (sd_phylo__Interc... = 0     0.74      0.06     0.62     0.84          0         0    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

系統的メタ分析

例えば、フィッシャーのz変換相関係数 $Zr$ が種ごとに、対応するサンプルサイズとともにあるとする（例えば、オスの色彩と生殖成功の相関）。

data_fisher <- read.table(
  "https://paul-buerkner.github.io/data/data_effect.txt",
  header = TRUE
)
data_fisher$obs <- 1:nrow(data_fisher)
head(data_fisher)

          Zr  N phylo obs
1 0.28917549 13  sp_1   1
2 0.02415579 40  sp_2   2
3 0.19513651 39  sp_3   3
4 0.09831239 40  sp_4   4
5 0.13780152 66  sp_5   5
6 0.13710587 41  sp_6   6

サンプリング分散は既知とし、フィッシャー値については $V(Zr) = \frac{1}{N - 3}$ とする。 $N$ は、種ごとのサンプルサイズである。既知のサンプリング分散をモデルに取り入れるのは簡単である。しかし、brmsは、分散そのものではなく、サンプリング標準偏差（分散の平方根）を入力として必要とすることに注意しなければならない。obs のグループレベルの効果は、メタ分析モデルで明示的にモデル化しなければならない残留分散を表している。

model_fisher <- brm(
  Zr | se(sqrt(1 / (N - 3))) ~ 1 + (1|gr(phylo, cov = A)) + (1|obs),
  data = data_fisher, family = gaussian(),
  data2 = list(A = A),
  prior = c(
    prior(normal(0, 10), "Intercept"),
    prior(student_t(3, 0, 10), "sd")
  ),
  control = list(adapt_delta = 0.95),
  chains = 2, cores = 2, iter = 4000, warmup = 1000
)

適合したモデルの要約は、以下の方法で得られる。

summary(model_fisher)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Zr | se(sqrt(1/(N - 3))) ~ 1 + (1 | gr(phylo, cov = A)) + (1 | obs) 
   Data: data_fisher (Number of observations: 200) 
  Draws: 2 chains, each with iter = 4000; warmup = 1000; thin = 1;
         total post-warmup draws = 6000

Group-Level Effects: 
~obs (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.05      0.03     0.00     0.10 1.00      836     1430

~phylo (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.07      0.04     0.00     0.15 1.00      748     1622

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.16      0.04     0.08     0.24 1.00     2911     2718

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.00      0.00     0.00     0.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

plot(model_fisher)

メタ分析平均（すなわち、モデル切片）は、 $0.16$、信頼区間は次のとおりである。 $ [0.08, 0.25] $ .したがって、種間の平均的な相関は、モデルによると正である。

系統的カウントデータモデル

連続変数ではなく、カウントからなる表現型を分析するとしよう。このような場合、正規性の仮定はおそらく正当化されないので、カウントデータに明示的に適した分布、たとえば、ポアソン分布を使用することが推奨される。次のデータセット（再びmpcm-evolution.orgから取得）は、その例を示している。

data_pois <- read.table(
  "https://paul-buerkner.github.io/data/data_pois.txt",
  header = TRUE
)
data_pois$obs <- 1:nrow(data_pois)
head(data_pois)

  phen_pois   cofactor phylo obs
1         1  7.8702830  sp_1   1
2         0  3.4690529  sp_2   2
3         1  2.5478774  sp_3   3
4        14 18.2286628  sp_4   4
5         1  2.5302806  sp_5   5
6         1  0.5145559  sp_6   6

ポアソン分布は自然な過分散パラメータを持たないので、obs のグループレベルの効果によって残差分散をモデル化する（例えば、Lawless, 1987を参照）。

model_pois <- brm(
  phen_pois ~ cofactor + (1|gr(phylo, cov = A)) + (1|obs),
  data = data_pois, family = poisson("log"),
  data2 = list(A = A),
  chains = 2, cores = 2, iter = 4000,
  control = list(adapt_delta = 0.95)
)

ここでも、適合したモデルの要約を以下のようにして得ることができる。

summary(model_pois)

 Family: poisson 
  Links: mu = log 
Formula: phen_pois ~ cofactor + (1 | gr(phylo, cov = A)) + (1 | obs) 
   Data: data_pois (Number of observations: 200) 
  Draws: 2 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~obs (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.18      0.08     0.02     0.34 1.00      632      893

~phylo (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.18      0.10     0.03     0.42 1.00      810     1475

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -2.08      0.21    -2.49    -1.68 1.00     2891     2631
cofactor      0.25      0.01     0.23     0.27 1.00     4404     3248

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

plot(conditional_effects(model_pois), points = TRUE)

ここで、表現型がカウントデータであることを無視し、代わりに線形正規モデルを当てはめるとする。

model_normal <- brm(
  phen_pois ~ cofactor + (1|gr(phylo, cov = A)),
  data = data_pois, family = gaussian(),
  data2 = list(A = A),
  chains = 2, cores = 2, iter = 4000,
  control = list(adapt_delta = 0.95)
)

summary(model_normal)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: phen_pois ~ cofactor + (1 | gr(phylo, cov = A)) 
   Data: data_pois (Number of observations: 200) 
  Draws: 2 chains, each with iter = 4000; warmup = 2000; thin = 1;
         total post-warmup draws = 4000

Group-Level Effects: 
~phylo (Number of levels: 200) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.69      0.50     0.03     1.90 1.00      907     1341

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -3.06      0.65    -4.34    -1.82 1.00     3099     1960
cofactor      0.68      0.04     0.59     0.76 1.00     7302     2894

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     3.44      0.18     3.09     3.81 1.00     4628     3034

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

どちらのモデルでも、cofactor は表現型と正の関係を持つことがわかる。しかし、この例では正準対数リンク関数を適用しているので、ポアソンモデルの推定値は対数スケールであることに注意しなければならない。したがって、同じデータに適用しても線形正規モデルとは推定値が比較にならない。しかし、比較できるのはモデルの適合度であり、例えば事後予測チェックによってグラフィカルに比較することができる。

pp_check(model_pois)

pp_check(model_normal)

どうやら、ポアソンモデルによって予測された表現型の分布は、元の表現型の分布にかなり近く、一方、正規モデルはそうならないようである。また、モデルの適合度を直接数値で比較するために、リーブワンアウトクロスバリデーションを適用することができる。

loo(model_pois, model_normal)

Output of model 'model_pois':

Computed from 4000 by 200 log-likelihood matrix

         Estimate   SE
elpd_loo   -348.6 17.0
p_loo        30.4  3.5
looic       697.2 34.0
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     169   84.5%   692       
 (0.5, 0.7]   (ok)        28   14.0%   164       
   (0.7, 1]   (bad)        3    1.5%   113       
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Output of model 'model_normal':

Computed from 4000 by 200 log-likelihood matrix

         Estimate   SE
elpd_loo   -536.0 15.9
p_loo        10.3  2.4
looic      1072.1 31.8
------
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     198   99.0%   407       
 (0.5, 0.7]   (ok)         2    1.0%   4422      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Model comparisons:
             elpd_diff se_diff
model_pois      0.0       0.0 
model_normal -187.4      18.0

looの値が小さいほど適合度が高いので、やはりポアソンモデルが正規モデルよりデータに適合していることがわかる。もちろん、ポアソンモデルだけが唯一の合理的な選択肢ではない。たとえば、負の2項モデル（ファミリー negative_binomial ）を使うことができる。これはすでに過分散パラメータを含んでいるので、obs の変化する切片をモデル化することは無意味になる。

グループレベル効果が複数ある系統樹モデル

上記の例では、系統分類のグループ化要因に対して、単一のグループレベル効果（すなわち、変化する切片）のみを使用している。brms**では、これらのグループ化要因に対して、複数のグループレベル効果（例えば、変化する切片と変化する傾き）を推定することも可能である。しかし、その場合、共分散行列のクロネッカー積を繰り返し計算しながらモデルを当てはめていく必要がある。特にグループ化因子が多水準で行列が大きい場合、非常に時間がかかることになる。

参考文献

de Villemeruil P. & Nakagawa, S. (2014) General quantitative genetic methods for comparative biology. In: Modern phylogenetic comparative methods and their application in evolutionary biology: concepts and practice (ed. Garamszegi L.) Springer, New York. pp. 287-303.

Hadfield, J. D. & Nakagawa, S. (2010) General quantitative genetic methods for comparative biology: phylogenies, taxonomies, and multi-trait models for continuous and categorical characters. Journal of Evolutionary Biology. 23. 494-508.

Lawless, J. F. (1987). Negative binomial and mixed Poisson regression. Canadian Journal of Statistics, 15(3), 209-225.