library(mandala)
library(dplyr)
library(ggplot2)
library(Matrix)Two-Stage Genomic Analysis
Integrating GBLUP with two-stage MET workflows
Overview
This tutorial extends the two-stage MET workflow to include genomic information. By incorporating a genomic relationship matrix (GRM) at Stage 2, we can:
- Obtain genomic estimated breeding values (GEBVs)
- Predict performance of untested genotypes
- Leverage marker information for selection decisions
The key innovation is using GM(geno, GRM) alongside vcov(RMAT) to combine genomic relationships with properly propagated BLUE uncertainty.
- Prepare and align a GRM for Stage 2 analysis
- Fit GBLUP models using
GM()withvcov(RMAT) - Use the
mandala_gp()wrapper for streamlined genomic prediction - Extract and compare GEBVs from different approaches
- Understand when genomic prediction adds value
This tutorial assumes completion of Tutorial 06: Two-Stage MET. You should understand:
- The two-stage workflow (Stage 1 BLUEs, Stage 2 analysis)
- How
vcov(RMAT)propagates BLUE uncertainty - Basic heritability and variance component concepts
Data Setup
We’ll use the same phenotypic data structure as Tutorial 06, plus a simulated GRM.
Simulate Phenotypic Data
set.seed(42)
n_env <- 4
n_geno <- 50
n_rep <- 2
n_row <- 10
n_col <- 10
# Create base data structure
df <- expand.grid(
env = paste0("E", 1:n_env),
geno = paste0("G", sprintf("%03d", 1:n_geno)),
rep = 1:n_rep
)
# Assign spatial positions
df <- df %>%
group_by(env, rep) %>%
mutate(
row = rep(1:n_row, length.out = n()),
col = rep(1:n_col, each = n_row)[1:n()]
) %>%
ungroup()
# Simulate effects
env_effects <- rnorm(n_env, mean = 0, sd = 5)
names(env_effects) <- paste0("E", 1:n_env)
geno_effects <- rnorm(n_geno, mean = 0, sd = 3)
names(geno_effects) <- paste0("G", sprintf("%03d", 1:n_geno))
# GxE interaction
gxe_effects <- matrix(rnorm(n_env * n_geno, mean = 0, sd = 2),
nrow = n_geno, ncol = n_env)
rownames(gxe_effects) <- names(geno_effects)
colnames(gxe_effects) <- names(env_effects)
# Build response
df <- df %>%
mutate(
yld = 100 +
env_effects[as.character(env)] +
geno_effects[as.character(geno)] +
sapply(1:n(), function(i) gxe_effects[as.character(geno[i]), as.character(env[i])]) +
rnorm(n(), mean = 0, sd = 2)
)
# Convert to factors
df$env <- as.factor(df$env)
df$geno <- as.factor(df$geno)
df$rep <- as.factor(df$rep)
df$row <- as.factor(df$row)
df$col <- as.factor(df$col)
df$block <- df$rep
head(df)# A tibble: 6 × 7
env geno rep row col yld block
<fct> <fct> <fct> <fct> <fct> <dbl> <fct>
1 E1 G001 1 1 1 111. 1
2 E2 G001 1 1 1 95.0 1
3 E3 G001 1 1 1 101. 1
4 E4 G001 1 1 1 102. 1
5 E1 G002 1 2 1 105. 1
6 E2 G002 1 2 1 96.3 1
Simulate Genomic Relationship Matrix
In practice, you would compute the GRM from marker data. Here we simulate one that reflects the true genetic effects.
# Simulate a GRM based on true genetic effects
# In practice, use markers and compute with A.mat() or similar
geno_names <- paste0("G", sprintf("%03d", 1:n_geno))
# Create a correlation structure based on genetic similarity
set.seed(123)
n_markers <- 1000
markers <- matrix(sample(0:2, n_geno * n_markers, replace = TRUE),
nrow = n_geno, ncol = n_markers)
rownames(markers) <- geno_names
# Compute realized relationship matrix (VanRaden method 1)
p <- colMeans(markers) / 2
P <- 2 * (p - 0.5)
Z <- scale(markers, center = TRUE, scale = FALSE)
GRM <- tcrossprod(Z) / (2 * sum(p * (1 - p)))
# Ensure positive definiteness
GRM <- GRM + diag(1e-6, nrow(GRM))
rownames(GRM) <- colnames(GRM) <- geno_names
# Preview
GRM[1:5, 1:5] G001 G002 G003 G004 G005
G001 1.310757677 0.007381282 -0.039952294 0.002639819 -0.07516880
G002 0.007381282 1.246970820 -0.022202203 -0.022161677 -0.03918231
G003 -0.039952294 -0.022202203 1.302247359 0.005476592 -0.03991177
G004 0.002639819 -0.022161677 0.005476592 1.320564804 -0.05810764
G005 -0.075168798 -0.039182312 -0.039911768 -0.058107638 1.33312765
Stage 1: Per-Environment Analysis
This follows the same workflow as Tutorial 06.
# Stage 1 preparation
s1_prep <- stage1_prep(
df = df,
env_col = "env",
s1_fixed = yld ~ geno,
s1_random = ~ rep + row + col,
s1_classify_term = "geno",
response_var = "yld"
)
# Audit
stage1_audit(df, s1_prep) env rows nonmiss_response nonmiss_id n_id_levels
1 E1 100 100 100 50
2 E2 100 100 100 50
3 E3 100 100 100 50
4 E4 100 100 100 50
# Run Stage 1
MET_stage1_run <- mandala_stage1(
df = df,
prep = s1_prep
)Explicit param_df Check:
name type vc_idx init lower upper origin
1 rep var 1 5.955770 1e-06 Inf G
2 row var 2 5.955770 1e-06 Inf G
3 col var 3 5.955770 1e-06 Inf G
4 R.sigma2 R_var 0 7.444713 1e-06 Inf R
--- Starting EM Warmup ---
--- EM Warmup Complete. Starting AI-REML ---
Iter LogLik Sigma2 DF wall Restrained
Main AI-REML Loop finished. Combining results.
solveMME_cpp: adding nugget 3.79815e-07 to MME diagonal (scaled from median diag)
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: (none)
- The ignored set: rep, row, col
Explicit param_df Check:
name type vc_idx init lower upper origin
1 rep var 1 7.935386 1e-06 Inf G
2 row var 2 7.935386 1e-06 Inf G
3 col var 3 7.935386 1e-06 Inf G
4 R.sigma2 R_var 0 9.919232 1e-06 Inf R
--- Starting EM Warmup ---
--- EM Warmup Complete. Starting AI-REML ---
Iter LogLik Sigma2 DF wall Restrained
Main AI-REML Loop finished. Combining results.
solveMME_cpp: adding nugget 6.51258e-07 to MME diagonal (scaled from median diag)
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: (none)
- The ignored set: rep, row, col
Explicit param_df Check:
name type vc_idx init lower upper origin
1 rep var 1 8.419766 1e-06 Inf G
2 row var 2 8.419766 1e-06 Inf G
3 col var 3 8.419766 1e-06 Inf G
4 R.sigma2 R_var 0 10.524707 1e-06 Inf R
--- Starting EM Warmup ---
--- EM Warmup Complete. Starting AI-REML ---
Iter LogLik Sigma2 DF wall Restrained
Main AI-REML Loop finished. Combining results.
solveMME_cpp: adding nugget 5.46104e-07 to MME diagonal (scaled from median diag)
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: (none)
- The ignored set: rep, row, col
Explicit param_df Check:
name type vc_idx init lower upper origin
1 rep var 1 5.782310 1e-06 Inf G
2 row var 2 5.782310 1e-06 Inf G
3 col var 3 5.782310 1e-06 Inf G
4 R.sigma2 R_var 0 7.227887 1e-06 Inf R
--- Starting EM Warmup ---
--- EM Warmup Complete. Starting AI-REML ---
Iter LogLik Sigma2 DF wall Restrained
Main AI-REML Loop finished. Combining results.
solveMME_cpp: adding nugget 6.62442e-07 to MME diagonal (scaled from median diag)
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: (none)
- The ignored set: rep, row, col
# Bundle outputs
bundle <- stage1_bundle(MET_stage1_run)
cat("Stage 1 complete. BLUEs extracted for", nrow(bundle$blues), "observations\n")Stage 1 complete. BLUEs extracted for 200 observations
Stage 2: Prepare Data and Matrices
# Create Stage 2 data and RMAT
s2_inputs <- stage2_prep(
bundle = bundle,
df_name = "stage2_data",
vcov_name = "RMAT",
out_env = "env",
out_id = "geno",
out_resp = "yld"
)
s2_data <- s2_inputs$stage2_data
RMAT <- s2_inputs$RMAT
# Align row IDs
rownames(s2_data) <- paste0(s2_data$env, "|", s2_data$geno)
s2_data <- s2_data[rownames(RMAT), , drop = FALSE]
# Ensure Matrix format
if (!inherits(RMAT, "Matrix")) {
RMAT <- Matrix::forceSymmetric(Matrix::Matrix(RMAT, sparse = TRUE))
}
head(s2_data) env geno yld
E1|G001 E1 G001 109.6486
E1|G002 E1 G002 109.0374
E1|G003 E1 G003 113.4323
E1|G004 E1 G004 108.8724
E1|G005 E1 G005 107.4882
E1|G006 E1 G006 106.9827
Prepare GRM for Stage 2
Use mandala_grm_prep() to align and scale the GRM for the Stage 2 genotype set.
# For this tutorial, we'll prepare the GRM manually
# In practice with a file: mandala_grm_prep(GRM = "path/to/grm.csv", ...)
# Subset GRM to genotypes in s2_data
genos_in_s2 <- unique(s2_data$geno)
GRM_subset <- GRM[as.character(genos_in_s2), as.character(genos_in_s2)]
# Add small ridge for numerical stability
lambda <- 1e-8
GRM_subset <- GRM_subset + diag(lambda, nrow(GRM_subset))
# Convert to Matrix class
GRM_subset <- Matrix::Matrix(GRM_subset, sparse = FALSE)
cat("GRM dimensions:", dim(GRM_subset), "\n")GRM dimensions: 50 50
cat("Genotypes in GRM:", nrow(GRM_subset), "\n")Genotypes in GRM: 50
Stage 2 with GRM (GBLUP)
Now we fit the Stage 2 model with genomic relationships using GM(geno, GRM).
stage2_gblup <- mandala(
fixed = yld ~ env,
random = ~ GM(geno, GRM),
R_formula = ~ vcov(RMAT),
data = s2_data,
matrix_list = list(RMAT = RMAT, GRM = GRM_subset)
)Initial data rows: 200
Final data rows after NA handling: 200
Explicit param_df Check:
name type vc_idx init lower upper origin
1 GM(geno,GRM) var 1 11.99967 1e-06 Inf G
2 R.sigma2 R_var 0 14.99959 1e-06 Inf R
--- EM Warmup Skipped. Starting AI-REML ---
Starting AI-REML logLik = -558.3649
Iter LogLik Sigma2 DF wall Restrained
1 -547.9894 11.400 196 09:41:56 ( 0 restrained)
2 -540.6867 8.664 196 09:41:56 ( 0 restrained)
3 -533.1870 5.290 196 09:41:56 ( 0 restrained)
4 -532.8520 3.156 196 09:41:56 ( 0 restrained)
5 -532.8520 3.156 196 09:41:56 ( 0 restrained)
Converged at iter 5 by tolerance(s): delta_theta=0.00e+00, delta_loglik=0.00e+00
Main AI-REML Loop finished. Combining results.
solveMME_cpp: adding nugget 0.00247173 to MME diagonal (scaled from median diag)
summary(stage2_gblup)Model statement:
mandala(fixed = yld ~ env, random = ~GM(geno, GRM), data = s2_data,
matrix_list = list(RMAT = RMAT, GRM = GRM_subset), R_formula = ~vcov(RMAT))
Variance Components:
component estimate std.error z.ratio bound %ch
GM(geno,GRM) 8.015792 2.0501810 3.909797 P NA
R.sigma2 3.155855 0.9579506 3.294381 P NA
Fixed Effects (BLUEs) [first 5]:
effect estimate std.error z.ratio
(Intercept) 107.171738 0.2511531 426.71868
envE2 -10.622834 0.3551976 -29.90683
envE3 -5.485344 0.3551976 -15.44308
envE4 -4.448287 0.3551976 -12.52342
Converged: TRUE | Iterations: 5
Random Effects (BLUPs) [first 5]:
random level estimate std.error z.ratio
GM(geno,GRM) G001 0.5389986 0.8469212 0.6364212
GM(geno,GRM) G002 0.2114432 0.8447910 0.2502905
GM(geno,GRM) G003 3.6846055 0.8464394 4.3530646
GM(geno,GRM) G004 -1.6141332 0.8468802 -1.9059759
GM(geno,GRM) G005 3.4298982 0.8471452 4.0487727
logLik: -532.852 AIC: 1069.704 BIC: 1076.260 logLik_Trunc: -352.740
Extract GEBVs
gebvs <- mandala_predict(stage2_gblup, classify_term = "geno")
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: env, [num mean]
head(gebvs) geno predicted_value std_error
1 G001 102.5716 0.8561858
2 G002 102.2441 0.8540787
3 G003 105.7172 0.8557092
4 G004 100.4185 0.8561452
5 G005 105.4625 0.8564073
6 G006 102.4956 0.8567694
Using mandala_gp() Wrapper
For convenience, mandala_gp() wraps the GRM preparation and model fitting into a single call.
# Example using mandala_gp() with a GRM file
# In practice, you would provide a path to your GRM file
gp_fit <- mandala_gp(
GRM = GRM_subset, # Can also be a file path: "path/to/grm.csv"
GRM_label = "GRM",
data = s2_data,
gen_col = "geno",
fixed = yld ~ env,
random = ~ GM(geno, GRM),
R_formula = ~ vcov(RMAT),
R_matrix = RMAT,
lambda = 1e-8,
scale = FALSE,
method = "sparse"
)
summary(gp_fit$fit)For this tutorial, we’ll demonstrate with the manual approach since we have the GRM in memory:
# The stage2_gblup model we already fit is equivalent
# Extract predictions
gebv_predictions <- mandala_predict(stage2_gblup, classify_term = "geno")
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: env, [num mean]
head(gebv_predictions) geno predicted_value std_error
1 G001 102.5716 0.8561858
2 G002 102.2441 0.8540787
3 G003 105.7172 0.8557092
4 G004 100.4185 0.8561452
5 G005 105.4625 0.8564073
6 G006 102.4956 0.8567694
Comparing Approaches
Let’s compare predictions from: 1. Phenotypic Stage 2 (no GRM) — Tutorial 06 approach 2. Genomic Stage 2 (with GRM) — GBLUP
Fit Non-Genomic Stage 2
stage2_pheno <- mandala(
fixed = yld ~ env,
random = ~ geno,
R_formula = ~ vcov(RMAT),
data = s2_data,
matrix_list = list(RMAT = RMAT)
)Initial data rows: 200
Final data rows after NA handling: 200
Explicit param_df Check:
name type vc_idx init lower upper origin
1 geno var 1 11.99967 1e-06 Inf G
2 R.sigma2 R_var 0 14.99959 1e-06 Inf R
--- EM Warmup Skipped. Starting AI-REML ---
Starting AI-REML logLik = -558.3649
Iter LogLik Sigma2 DF wall Restrained
1 -548.2637 11.400 196 09:41:57 ( 0 restrained)
2 -540.7545 8.664 196 09:41:57 ( 0 restrained)
3 -532.8004 5.241 196 09:41:57 ( 0 restrained)
4 -532.4003 3.071 196 09:41:57 ( 0 restrained)
5 -532.4003 3.071 196 09:41:57 ( 0 restrained)
Converged at iter 5 by tolerance(s): delta_theta=0.00e+00, delta_loglik=0.00e+00
Main AI-REML Loop finished. Combining results.
solveMME_cpp: adding nugget 1.40104e-06 to MME diagonal (scaled from median diag)
pheno_preds <- mandala_predict(stage2_pheno, classify_term = "geno")
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- The simple averaging set: env, [num mean]
Merge and Compare
# Phenotypic predictions
p_pheno <- pheno_preds[, c("geno", "predicted_value")]
names(p_pheno) <- c("geno", "pred_pheno")
# Genomic predictions
p_geno <- gebv_predictions[, c("geno", "predicted_value")]
names(p_geno) <- c("geno", "pred_genomic")
# Merge
comparison <- merge(p_pheno, p_geno, by = "geno")
# Correlation
cor_pg <- cor(comparison$pred_pheno, comparison$pred_genomic, use = "complete.obs")
cat("Correlation (phenotypic vs genomic):", round(cor_pg, 3), "\n")Correlation (phenotypic vs genomic): 1
ggplot(comparison, aes(x = pred_pheno, y = pred_genomic)) +
geom_point(alpha = 0.6, size = 2.5) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
labs(
title = "Phenotypic vs Genomic Predictions",
subtitle = paste("r =", round(cor_pg, 3)),
x = "Predicted Value (Phenotypic)",
y = "Predicted Value (Genomic/GBLUP)"
) +
theme_minimal()
Variance Components Comparison
cat("=== Phenotypic Model Variance Components ===\n")=== Phenotypic Model Variance Components ===
print(summary(stage2_pheno)$varcomp)Model statement:
mandala(fixed = yld ~ env, random = ~geno, data = s2_data, matrix_list = list(RMAT = RMAT),
R_formula = ~vcov(RMAT))
Variance Components:
component estimate std.error z.ratio bound %ch
geno 10.130038 2.6049988 3.888692 P NA
R.sigma2 3.071444 0.9516386 3.227532 P NA
Fixed Effects (BLUEs) [first 5]:
effect estimate std.error z.ratio
(Intercept) 107.191623 0.5138352 208.61088
envE2 -10.644406 0.3505106 -30.36828
envE3 -5.506115 0.3505106 -15.70884
envE4 -4.468897 0.3505106 -12.74967
Converged: TRUE | Iterations: 5
Random Effects (BLUPs) [first 5]:
random level estimate std.error z.ratio
geno G001 0.5943664 0.9497788 0.6257945
geno G002 0.2551813 0.9497788 0.2686745
geno G003 3.6729523 0.9497788 3.8671660
geno G004 -1.5904980 0.9497788 -1.6745983
geno G005 3.5063399 0.9497788 3.6917436
logLik: -532.400 AIC: 1068.801 BIC: 1075.357 logLik_Trunc: -352.288
component estimate std.error z.ratio bound %ch
1 geno 10.130038 2.6049988 3.888692 P NA
2 R.sigma2 3.071444 0.9516386 3.227532 P NA
cat("\n=== Genomic Model Variance Components ===\n")
=== Genomic Model Variance Components ===
print(summary(stage2_gblup)$varcomp)Model statement:
mandala(fixed = yld ~ env, random = ~GM(geno, GRM), data = s2_data,
matrix_list = list(RMAT = RMAT, GRM = GRM_subset), R_formula = ~vcov(RMAT))
Variance Components:
component estimate std.error z.ratio bound %ch
GM(geno,GRM) 8.015792 2.0501810 3.909797 P NA
R.sigma2 3.155855 0.9579506 3.294381 P NA
Fixed Effects (BLUEs) [first 5]:
effect estimate std.error z.ratio
(Intercept) 107.171738 0.2511531 426.71868
envE2 -10.622834 0.3551976 -29.90683
envE3 -5.485344 0.3551976 -15.44308
envE4 -4.448287 0.3551976 -12.52342
Converged: TRUE | Iterations: 5
Random Effects (BLUPs) [first 5]:
random level estimate std.error z.ratio
GM(geno,GRM) G001 0.5389986 0.8469212 0.6364212
GM(geno,GRM) G002 0.2114432 0.8447910 0.2502905
GM(geno,GRM) G003 3.6846055 0.8464394 4.3530646
GM(geno,GRM) G004 -1.6141332 0.8468802 -1.9059759
GM(geno,GRM) G005 3.4298982 0.8471452 4.0487727
logLik: -532.852 AIC: 1069.704 BIC: 1076.260 logLik_Trunc: -352.740
component estimate std.error z.ratio bound %ch
1 GM(geno,GRM) 8.015792 2.0501810 3.909797 P NA
2 R.sigma2 3.155855 0.9579506 3.294381 P NA
When Does Genomic Prediction Help?
Genomic prediction provides the most benefit when:
| Scenario | Benefit |
|---|---|
| Predicting untested genotypes | GRM allows prediction for genotypes without phenotypic data |
| Low heritability traits | Genomic info adds signal when phenotypes are noisy |
| Small family sizes | GRM captures relationships beyond pedigree |
| Selection accuracy | GEBVs often outperform phenotypic BLUPs |
To assess prediction accuracy, use mandala_gp_cv() for cross-validation:
cv_results <- mandala_gp_cv(
GRM = GRM,
data = s2_data,
gen_col = "geno",
fixed = yld ~ env,
random = ~ GM(geno, GRM),
R_formula = ~ vcov(RMAT),
R_matrix = RMAT,
nfold = 5,
nrep = 3
)Summary
- GRM integration at Stage 2 enables genomic prediction on BLUEs
- Use
GM(geno, GRM)to specify genomic relationship structure - Combine with
vcov(RMAT)to properly weight BLUE uncertainty mandala_grm_prep()handles GRM alignment and scalingmandala_gp()provides a convenient wrapper for the full workflow- Genomic predictions are most valuable for untested genotypes and low-heritability traits
Next Steps
- Explore cross-validation with
mandala_gp_cv()to assess prediction accuracy - Consider factor-analytic models (
FA()) for complex G×E patterns - Review the Quick Reference for function details
Session Information
sessionInfo()R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] Matrix_1.7-4 ggplot2_4.0.2 dplyr_1.2.0 mandala_1.0.1
loaded via a namespace (and not attached):
[1] vctrs_0.7.1 cli_3.6.5 knitr_1.51 rlang_1.1.7
[5] xfun_0.56 generics_0.1.4 S7_0.2.1 jsonlite_2.0.0
[9] labeling_0.4.3 glue_1.8.0 htmltools_0.5.9 scales_1.4.0
[13] rmarkdown_2.30 grid_4.5.2 evaluate_1.0.5 tibble_3.3.1
[17] fastmap_1.2.0 yaml_2.3.12 lifecycle_1.0.5 compiler_4.5.2
[21] RColorBrewer_1.1-3 Rcpp_1.1.1 pkgconfig_2.0.3 farver_2.1.2
[25] lattice_0.22-7 digest_0.6.39 R6_2.6.1 utf8_1.2.6
[29] tidyselect_1.2.1 pillar_1.11.1 magrittr_2.0.4 withr_3.0.2
[33] gtable_0.3.6 tools_4.5.2