15 Ecology

The solutions assume the following packages are attached (other packages will be attached when needed):

E1. Run a NMDS using the percentage data of the community matrix. Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data. What might explain the observed difference?

data("comm", package = "spDataLarge")
pa = vegan::decostand(comm, "pa")
pa = pa[rowSums(pa) != 0, ]
comm = comm[rowSums(comm) != 0, ]
nmds_pa = vegan::metaMDS(comm = pa, k = 4, try = 500)
nmds_per = vegan::metaMDS(comm = comm, k = 4, try = 500)

The NMDS using the presence-absence values yields a better result (nmds_pa$stress) than the one using percentage data (nmds_per$stress). This might seem surprising at first sight. On the other hand, the percentage matrix contains both more information and more noise. Another aspect is how the data was collected. Imagine a botanist in the field. It might seem feasible to differentiate between a plant which has a cover of 5% and another species that covers 10%. However, what about a herbal species that was only detected three times and consequently has a very tiny cover, e.g., 0.0001%. Maybe another herbal species was detected 6 times, is its cover then 0.0002%? The point here is that percentage data as specified during a field campaign might reflect a precision that the data does not have. This again introduces noise which in turn will worsen the ordination result. Still, it is a valuable information if one species had a higher frequency or coverage in one plot than another compared to just presence-absence data. One compromise would be to use a categorical scale such as the Londo scale.

E2. Compute all the predictor rasters we have used in the chapter (catchment slope, catchment area), and put them into a SpatRaster-object. Add dem and ndvi to it. Next, compute profile and tangential curvature and add them as additional predictor rasters (hint: grass7:r.slope.aspect). Finally, construct a response-predictor matrix. The scores of the first NMDS axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to random_points (use an inner join). To complete the response-predictor matrix, extract the values of the environmental predictor raster object to random_points.

# first compute the terrain attributes we have also used in the chapter
data("comm", "random_points", package = "spDataLarge")
dem = terra::rast(system.file("raster/dem.tif", package = "spDataLarge"))
ndvi = terra::rast(system.file("raster/ndvi.tif", package = "spDataLarge"))

# use presence-absence matrix and get rid of empty
pa = vegan::decostand(comm, "pa")
pa = pa[rowSums(pa) != 0, ]

# enable plugins if not already done so
qgisprocess::qgis_enable_plugins(c("grassprovider", "processing_saga_nextgen"))

# compute environmental predictors (ep) catchment slope and catchment area
ep = qgisprocess::qgis_run_algorithm(
  alg = "sagang:sagawetnessindex",
  DEM = dem,
  SLOPE = tempfile(fileext = ".sdat"),
  AREA = tempfile(fileext = ".sdat"),
  .quiet = TRUE)
# read in catchment area and catchment slope
ep = ep[c("AREA", "SLOPE")] |>
  unlist() |>
# assign proper names 
names(ep) = c("carea", "cslope")
# make sure all rasters share the same origin
origin(ep) = origin(dem)
# add dem and ndvi to the multilayer SpatRaster object
ep = c(dem, ndvi, ep) 
ep$carea = log10(ep$carea)

# compute the curvatures
curvs = qgis_run_algorithm(
  elevation = dem,
  .quiet = TRUE)
# adding curvatures to ep
curv_nms = c("pcurvature", "tcurvature")
curvs = curvs[curv_nms] |>
  unlist() |>
curvs = terra::app(curvs, as.numeric)
names(curvs) = curv_nms
ep = c(ep, curvs)
random_points[, names(ep)] = 
  # terra::extract adds an ID column, we don't need
  terra::extract(ep, random_points) |>
elev = dplyr::filter(random_points, id %in% rownames(pa)) %>% 
# rotating NMDS in accordance with altitude (proxy for humidity)
rotnmds = MDSrotate(nmds_pa, elev)
# extracting the first two axes
sc = vegan::scores(rotnmds, choices = 1:2, display = "sites")
rp = data.frame(id = as.numeric(rownames(sc)),
                sc = sc[, 1])
# join the predictors (dem, ndvi and terrain attributes)
rp = dplyr::inner_join(random_points, rp, by = "id")
# saveRDS(rp, "extdata/15-rp_exercises.rds")

E3. Retrieve the bias-reduced RMSE of a random forest and a linear model using spatial cross-validation. The random forest modeling should include the estimation of optimal hyperparameter combinations (random search with 50 iterations) in an inner tuning loop. Parallelize the tuning level. Report the mean RMSE and use a boxplot to visualize all retrieved RMSEs. Please not that this exercise is best solved using the mlr3 functions benchmark_grid() and benchmark() (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information).


# in case you have not run the previous exercises, run:
# rp = readRDS("extdata/15-rp_exercises.rds")

# define the task
task = mlr3spatiotempcv::as_task_regr_st(
  select(rp, -id, -spri),
  target = "sc", 
  id = "mongon")

# define the learners
# linear model
lrn_lm = mlr3::lrn("regr.lm", predict_type = "response")
# random forest
lrn_rf = mlr3::lrn("regr.ranger", predict_type = "response")
# now define the AutoTuner of the random forest
search_space = paradox::ps(
  mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1),
  sample.fraction = paradox::p_dbl(lower = 0.2, upper = 0.9),
  min.node.size = paradox::p_int(lower = 1, upper = 10)
at_rf = mlr3tuning::AutoTuner$new(
  learner = lrn_rf,
  # spatial partitioning
  resampling = mlr3::rsmp("spcv_coords", folds = 5),
  # performance measure
  measure = mlr3::msr("regr.rmse"),
  search_space = search_space,
  # random search with 50 iterations
  terminator = mlr3tuning::trm("evals", n_evals = 50),
  tuner = mlr3tuning::tnr("random_search")
# define the resampling strategy
rsmp_sp = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100)

# create the benchmark design
design_grid = mlr3::benchmark_grid(
  tasks = task,
  learners = list(lrn_lm, at_rf),
  resamplings = rsmp_sp)
# execute the outer loop sequentially and parallelize the inner loop
future::plan(list("sequential", "multisession"), 
             workers = floor(future::availableCores() / 2))
# reduce verbosity
# BE CAREFUL: Running the benchmark might take quite some time
# therefore, we have stored the result of the benchmarking in
# extdata/15-bmr-exercises.rds (see below)
progressr::with_progress(expr = {
  bmr = mlr3::benchmark(
    design = design_grid,
    # New argument `encapsulate` for `resample()` and `benchmark()` to
    # conveniently enable encapsulation and also set the fallback learner to the
    # respective featureless learner. This is simply for convenience, configuring
    # each learner individually is still possible and allows a more fine-grained
    # control
    encapsulate = "evaluate",
    store_backends = FALSE,
    store_models = FALSE)

# stop parallelization
# we have saved the result as follows
# saveRDS(bmr, file = "extdata/15-bmr_exercises.rds")
# READ IT IN in in case you don't want to run the spatial CV yourself as it is computationally
# demanding
# bmr = readRDS("extdata/15-bmr_exercises.rds")

# mean RMSE
bmr$aggregate(measures = msr("regr.rmse"))
# or computed manually
agg = bmr$aggregate(measures = msr("regr.rmse"))

# lm performs slightly better when considering the mean rmse
purrr::map(agg$resample_result, ~ mean(.$score(msr("regr.rmse"))$regr.rmse))
# however, looking at the median, the random forest performs slightly better
purrr::map(agg$resample_result, ~ median(.$score(msr("regr.rmse"))$regr.rmse))

# make a boxplot (when using autoplot, mlr3viz needs to be attached!)
autoplot(bmr, measure = msr("regr.rmse"))

# or doing it "manually"
# extract the AUROC values and put them into one data.table
d = purrr::map_dfr(agg$resample_result, ~ .$score(msr("regr.rmse")))
# create the boxplots
ggplot(data = d, mapping = aes(x = learner_id, y = regr.rmse)) +
  geom_boxplot(fill = c("lightblue2", "mistyrose2")) +
  theme_bw() +
  labs(y = "RMSE", x = "model")

In fact, lm performs at least as good the random forest model, and thus should be preferred since it is much easier to understand and computationally much less demanding (no need for fitting hyperparameters). But keep in mind that the used dataset is small in terms of observations and predictors and that the response-predictor relationships are also relatively linear.