Frequently Asked Questions (FAQ)

This document contains a series of frequently asked questions when using the ibis.iSDM package and is a work in progress.

Data preparation

What are the necessary data preparation steps for ibis.iSDM ?

The ibis.iSDM R-package can handle most standard spatial formats in R (such as vector or raster formats) and works predominantly with the [sf], [SpatRaster] and [stars] packages to do much of the formatting and processing work. When adding [biodiversity] and [predictor] variables to a distribution() object a number of default validity checks and alignments are commonly conducted, for instance ensuring that provided points align in geographic projection.

To ease up the modelling and to avoid any unfortunate errors or crashes, ideally ensure the following steps are taken:

A 'background' layer describing the modelling extent is provided directly as [sf] 'POLYGON' or 'MULTIPOLYGON' object and covers all biodiversity and predictor data.
All provided data are in the same geographic projection.
Biodiversity data is provided in [sf] format and covers the 'background' bounding box. Furthermore each biodiversity dataset has a set "field_occurrence" field with numeric values.
is appropriately formatted (see also below).

Important: For environmental predictors it becomes important to ensure that nodata values are appropriately handled. Unfortunately many of the implemented [engines] can not handle nodata values well, thus it is necessary during the pre-processing to remove any rows of covariate extraction where at least one variable has missing data. For instance by assinging a constant to NA values:

predictors[is.na(predictors)] <- 0

I have presence-only data (GBIF and co). How can I model the probability of occurrence?

Technically, it is impossible to estimate a probability of occurrence just with presence-only data (such as commonly available from databases like GBIF). What people normally do is to add so called pseudo-absence (often in excessive numbers) over the entire background to their data, approximating the probability of occurrence by assuming that detection probability is uniform over the landscape (see Merow et al. 2013).

The ibis.iSDM package follows the design principle that all data types (e.g. presence-only or presence-absence records) should be modelled with the least amount of assumptions possible. For presence-only records the default way of estimating any kind of responses or habitat suitability is to estimate such data following a Poisson-Process modelling approach. However, it is possible to add these pseudo-absence points to a presence-only dataset as follows:

virtual_points <- sf::st_read(system.file('extdata/input_data.gpkg', package='ibis.iSDM'), 'points',quiet = TRUE)
# This takes the default pseudo-absence options created when loading the Ibis package.
virtual_points <- add_pseudoabsence(virtual_points, field_occurrence = "Observed")
# Check that absence points are present
unique(virtual_points$Observed)

Also see the add_pseudoabsence() and pseudoabs_settings() help pages for more settings and also the first article on this website. For example it is possible to define pseudo-absence sampling in specific spatial formats, such as sampling within or outside a minimum convex polygon (MCP) of all presence points or within a certain buffer.

Model setup

Is it possible to transform input predictors before model fitting?

Yes, in many instances it is desirable to transform input predictors to a new range or to remove extreme values from the predictors. For example the popular [maxnet] R-package for Maximum entropy estimation of species distribution modelling often transforms the variables via quadratic, hinge or threshold transformations. Although this can also (and preferably should) be done before setting up a model, there are options available to transform the predictors when they are added to the model. See the help page of [add_predictors()] for examples and descriptions.

Is there a way to limit my predictions to only certain regions in the extent?

Yes, this can be a desirable outcome during the modelling. For instance one can make - in the absence of better information on dispersal constrain (see [add_constrain_dispersal]) - the assumption that certain species can only disperse within a given ecoregion and not beyond. See for instance the method section of Wessels, Merow and Trisos (2021).

To do this directly with the ibis.iSDM R-package, one has to specify the limit of projection to the [distribution()] object. In this context the zones extent over the whole background (and have the same extent and spatial scale). The prediction will however be limited only to those zones where supplied biodiversity observations fall in.

# Where zone is a provided raster
mod <- distribution(background, limits = zone) |>  
  add_biodiversity_poipo(species_data) |>  
  engine_gdb() |>  
  train()
plot(mod)

# Alternatively one can also create such limits based on a minimum convex polygon
# of the provided data. Here we create a non-buffered MCP across all points used
# in species_data to constrain the prediction.
mod <- distribution(background, limits_method = "mcp", mcp_buffer = 0) |>  
  add_biodiversity_poipo(species_data) |>  
  engine_gdb() |>  
  train()
plot(mod)

There is further a dedicated function ([create_zonaloccurrence_mask()]) to help set up such zones, either by taking existing categorical raster datasets or by constructing buffers around existing presence records (which could for example reflect assumed maximum dispersal distances).

How exactly do I add a prior to a model ?

The ibis.iSDM R-package allows to users to add prior information on parameters to any model to be estimated. These priors have to be added as engine-specific priors and their format depends on the engine in question (see specific help pages for more information). A prior can generally be define via a combination of ENGINENAME + Prior and has to be wrapped in a PriorList to be used in the estimation.

Example:

# We have prior information that 'Forest' is important for a species
# In this case and for the INLA engine we define normal prior on the mean and precision
p <- INLAPrior(variable = "Forest",type = "normal",hyper = c(2, 10))
# This is then wrapped in a PriorList
pp <- priors(p)
print( pp )

# We can specify multiple priors of course
p <- list(
  INLAPrior(variable = "Forest",type = "normal",hyper = c(2, 10)),
  INLAPrior(variable = "Cropland",type = "normal",hyper = c(0, 1))
)
pp <- priors(pp)

# And can now added to the model
mod <- distribution(background, limits = zone) |>  
  add_biodiversity_poipo(species_data) |>  
  add_predictors(covariates) |>  
  add_priors(priors = pp)
  engine_inlabru()

Multiple priors for an Engine can be defined in a PriorList. Whenever a prior for the same variable is set again, it overwrites the previous value.

How can I control pseudo-absence sampling, for instance with targeted sampling?

A great number of SDM literature suggests that altering how the background / pseudo-absence points are created, can greatly affect model outcomes (see add_pseudoabsence() for references). In the ibis.iSDM R-package there are options available to modify how pseudo-absence points are created. By default the package creates at least 10 000 points or at least 25% of the presence-points (what ever is larger). To change the default pseudo-absence sampling settings, there are two options. Either change the global default settings for pseudo-absence sampling or by adding these settings to the add_biodiversity function.

To overwrite the global settings, do the following:

# Define new settings with greater number of background points
ss <- pseudoabs_settings(background = NULL, nrpoints =  1e6)
# Overwrite the default settings
options("ibis.pseudoabsence" = ss)

Alternatively one could think of specifying specific pseudo-absence sampling information to one biodiversity dataset specifically:

# Define absence layer with biased background to sample from
ss <- pseudoabs_settings(background = NULL, bias = bias_layer)

# Assuming background and point data exists
x <- distribution(background) |>  
  add_biodiversity_poipo(points, pseudoabsence_settings = ss)

What are the options for parallelization in ibis.iSDM?

Most code in the ibis.iSDM R-package is by default already parallelized and many computationally-intensive operations should be making use of all cores (if you can find an example where this is not the case, please raise an issue). The number of cores is generally being decided by the option "ibis.nthread" in [ibis_options()].

In most cases, parallelized code will be run via the [parallel] and [doParallel] packages, although there is some code in its infancy to support the [future] parallelization approaches as well, offering greater flexibility. See the function [ibis_future] for more information also on the use.

The typical use case is thus to run separate models (via train()) in a loop or scheduler of a High-Performance-Computer. Users should be careful in the case of shared resources, e.g. don’t parallelize such operations on the same machine. If there is a need to parallelize multiple models on the same instance, it is suggested to disable the 'ibis.runparallel' option.

# Check ibis options if set
ibis_options()
options('ibis.runparallel' = FALSE) # Set to FALSE

Can I add multiple offsets to a distribution() object ?

Yes. The add_offset() and add_offset_range() functions allows to specify a spatial explicit offset term which is then added to the regression model in question. An offset is generally just a coefficient set to a specific value. To get more than one offset, one just needs to combine the different provided offsets in a way that is consistent to get that fixed value (see here for reference. This can be done either by summing them as transformed value (discouraged as it can be errorprone) or by simply multiplying them.

offset1 <- runif(10)
offset2 <- runif(10)
# Identical
log(offset1) + log(offset2)
log(offset1*offset2)

Internally all provided offsets to model object are combined via simple addition together. This thus requires that users transform them aprior (for instance log transform) before adding them to the estimation.

Fitting and Scenarios

There are too many messages. How can I run ibis.iSDM more verbosely ?

There are two options that can be enabled to reduce the number of messages:

By setting the parameter verbose in train() to FALSE all messages created by the respective engine are suppressed.
Setting the parameter ibis.setupmessages to FALSE suppresses all other package related message. This can be done via

options("ibis.setupmessages" = FALSE)

How can I make use of cross-validation in ibis.iSDM ?

Cross-validation has been deliberatly not integrated into package. Users who would like to make use of cross-validation techniques thus need to set up their own external modelling routines. Reason is that with the multiple types of integration in this package, the construction of (independent) testing datasets is not as trivial (considering for example offsets, priors or multiple datasets).

Why are there 2 INLA engines and which one should I use ?

In the ibis.iSDM R-package there are two engines that makes use of the INLA framework, namely [engine_inla] and [engine_inlabru]. When the package author started developing this package, the [engine_inlabru] did not yet support multiple likelihoods and thus this was implemented directly. Predictions from [engine_inla] and [engine_inlabru] should be identical, although the latter does not infer the predictions directly, instead simulating from the posterior. This simulation is particularly helpful for creating (future) projections as otherwise a new model would need to be fitted for every newdata object.

Users are advised to use [engine_inlabru] by default in most cases and only use [engine_inla] if predictions look uncertain.

I am concerned about extrapolation during prediction. What options do I have?

When creating predictive models such as SDMs it is often a concern to not predict to a variable range outside the environmental conditions for which a model was trained. The ibis.iSDM package supports variable ‘clamping’ of its predictions similar as the popular Maxent model, however for each [engine]. Clamping can be enabled by setting the parameter clamp in [train] to TRUE. This restricts any spatial (or spatial-temporal) projections to the combined range of predictor variables observed for each of the training localities.

Similar functionalities are also available separately during scenario projections by setting adaptability constraints (see [add_constraint_adaptability] or [add_constraint_boundary]).

I am using (too) many predictors in my model. What options do I have ?

Having too many predictors in a SDM can be a cause of substantial over-parametrization and subsequently overfitting (e.g. the model is reproducing the data it was trained with rather than projecting into areas unknown).

It is recommended to (a) either use an engine with very strong regularization, such as for example [engine_glmnet] or [engine_gdb], (b) train a model with caution and have a minimum number of observations (arbitrary rule of thumb, have at least 10 observations for each additional predictor included), (c) make use of pre-estimation removal of predictor, such for example through variable importance criteria or colinearity. See code below for an example.

# Prior to model fitting, remove highly collinear predictors through a pearson correlation assessment
mod <- distribution(background) |>  
  add_biodiversity_poipo(species_data) |>  
  engine_glmnet() |>  
  train(filter_predictors = "pearson")

# Alternatively use a RandomForest estimator to remove the least important variables
mod <- distribution(background) |>  
  add_biodiversity_poipo(species_data) |>  
  engine_glmnet() |>  
  train(filter_predictors = "RF")

Where can I find predictions created by train ?

After a distribution model has been trained and if the inference_only parameter in train() has been set to FALSE (Default), the outputs of the prediction be found in the created object as a SpatRaster. By default all engines will produce a SpatRaster object with at least one band called “mean” which is the average prediction by the engine. This is also the result returned when the created model object is plotted.

In addition, for Bayesian Engines other bands quantifying the posterior predictive uncertainty might be available and can be plotted as well. The raster can also be saved a spatial GeoTiff to a given filename using the write_output() function.

Example:

mod <- distribution(background) |>  
  add_biodiversity_poipo(species_data) |>  
  engine_inlabru() |>  
  train()
# To plot 
plot(mod, "mean")
plot(mod, "sd")

# To get the layer
mod$get_data("prediction")

# To save the output layer as floating point geoTiff
write_output(mod, "myoutput.tif", type = "gtif", dt = "FLT4S")

Predictions from engine_xgboost look pixelated?

This is usually due to either the number of rounds for estimation being too low or the learning_rate being to high. Try out different options in the parameters of the engine. A good way to check the performance is also to plot the evaluation log and logloss.

# Requires a fitted model
plot(fit$get_data("fit_best")$evaluation_log)

Model troubleshooting

How exactly does the model integration work and how does it differ among engines?

There are various forms of integration from a simple approach of adding [ensembles], [priors] or [offsets] to fully integrated multiple likelihood models (see Fletcher et al. 2019). Thus, users have a range of possibilities to combine different sources of evidence in their modelling.

With regards to how different [engines] treat multiple biodiversity datasets. Unfortunately only the [engine_inla()], [engine_inlabru()] and [engine_stan()] support fully integrated multiple likelihood models. A full overview can be found in the Engine comparison table. All other [engines] combine multiple datasets by running separate models in sequence where the order is determined by the sequence the datasets were added to the model. Within the train() function, users have the option of specifying how previous predictions should be handled through the [method_integration] parameter. For example predictions from one model could be added as predictors or offset to the next. Or the coefficients from one model can be used to create starting priors for the next model.

What can I do when my model outputs with presence-only data seem odd?

By default, presence only biodiversity data is modelled as point-process model (PPM, see Renner et al. 2015). Similar as Maximum Entropy models these models can be quite sensitive to biased input, which is common for most non-structured biodiversity observations where presence points tend to be clustered in urban or easily accessible areas. To avoid predictions being biased towards those covariates, there are a number of things that can potentially be done here.

Modify the targeted background sampling for better control of background points. This can be for instance done via the add_pseudoabsence() and pseudoabs_settings() methods. See the respective help files.
Make use of spatial thinning approaches. See for instance Aiello-Lammers et al. 2015 and Steen et al. 2021. Note however that spatial thinning does remove data points, affecting for instance any poisson distributed models (PPMs) in the process. Theibis.iSDM package has some functionality for spatial thinning implemented in the thin_observations() function.
Partial out a biased variable during the prediction. The add_control_bias() function can be used to specify a value that needs to be partialed out in the model. There a bias_value has to be specified and this can be set for instance to 0 or an amount assumed to be equivalent to minimal bias.
Consider setting the [clamp] parameter in train() to TRUE.
Add a spatial offset to account for the bias as introduced in Merow et al. 2016. This can be done via the add_offset_bias() function and requires preparation of a bias layer in advance.
Apply more rigorous filtering and bias control to the input data. In the end no correction can replace good data preparation and cleaning. Remember the GIGO principle.

My model keeps crashing, what can I do?

Sorry to hear and unless the model or input data are grossely misspecified, this should not happen. Consider changing the input covariates. Often scaling the covariates (see add_predictors()) can help with model convergence. Similarly if the use of discrete variables (e.g. `factor predictors) results in errors, consider setting the parameter “explode_factors” to TRUE in add_predictors(). This will create additional covariates for every factor level. Please report an issue with the concrete error message, your R session information (call sessionInfo()) and preferably some example data and code.

Any other questions and issues

Can I transform the prediction output to a suitability index?

Often it is easier to communicate an index of suitability (scale [0-1]) to stakeholders and policy, which can in principle be derived from ibis.iSDM output. Especially when using Poisson Process models to infer the suitability of a given area, the units can be hard to interpret for non-scientists. An easy way to achieve this has been added as a function to each Biodiversity distribution object. See below for an example.

# Train a model
fit <- distribution(background) |> 
  # Presence-absence data
  add_biodiversity_poipo(my_gbifpoints) |> 
  add_predictors(predictors) |> 
  engine_glmnet() |> 
  train()

# Make a transformed prediction of the suitability layer
# The output is a normalized prediction surface 
# created via (x - min) / (max - min) or x/sum(x) respectively
pred <- fit$calc_suitabilityindex()

The names of my variables are slightly changing when added to a model?

This is a feature, not a bug ;) Many covariates often come with unusual characters and symbols that can not be readily used in equations or in queries of tabular data. The sanitize_names() function cleans those variable names and removes / resets non conform symbols.

# It can be disabled by setting the following option to false at the start of the script.
options('ibis.cleannames' = FALSE)

Frequently Asked Questions (FAQ)

Martin Jung

2023-06-29