Frequently Asked Questions (FAQ)
This document contains a series of frequently asked questions when
ibis.iSDM package and is a work in progress.
What are the necessary data preparation steps for
ibis.iSDM R-package can handle most standard spatial
formats in R (such as vector or raster formats) and works predominantly
with the [
stars] packages to do much of the formatting and
processing work. When adding [
predictor] variables to a
object a number of default validity checks and alignments are commonly
conducted, for instance ensuring that provided points align in
To ease up the modelling and to avoid any unfortunate errors or crashes, ideally ensure the following steps are taken:
'background'layer describing the modelling extent is provided directly as [
'MULTIPOLYGON'object and covers all biodiversity and predictor data.
- All provided data are in the same geographic projection.
- Biodiversity data is provided in [
sf] format and covers the
'background'bounding box. Furthermore each biodiversity dataset has a set
"field_occurrence"field with numeric values.
- is appropriately formatted (see also below).
Important: For environmental predictors it becomes
important to ensure that nodata values are appropriately handled.
Unfortunately many of the implemented [
engines] can not
handle nodata values well, thus it is necessary during the
pre-processing to remove any rows of covariate extraction where at least
one variable has missing data. For instance by assinging a constant to
predictors[is.na(predictors)] <- 0
I have presence-only data (GBIF and co). How can I model the probability of occurrence?
Technically, it is impossible to estimate a probability of occurrence just with presence-only data (such as commonly available from databases like GBIF). What people normally do is to add so called pseudo-absence (often in excessive numbers) over the entire background to their data, approximating the probability of occurrence by assuming that detection probability is uniform over the landscape (see Merow et al. 2013).
ibis.iSDM package follows the design principle that
all data types (e.g. presence-only or presence-absence records) should
be modelled with the least amount of assumptions possible. For
presence-only records the default way of estimating any kind of
responses or habitat suitability is to estimate such data following a
Poisson-Process modelling approach. However, it is possible to add these
pseudo-absence points to a presence-only dataset as follows:
virtual_points <- sf::st_read(system.file('extdata/input_data.gpkg', package='ibis.iSDM'), 'points',quiet = TRUE) # This takes the default pseudo-absence options created when loading the Ibis package. virtual_points <- add_pseudoabsence(virtual_points, field_occurrence = "Observed") # Check that absence points are present unique(virtual_points$Observed)
Also see the
pseudoabs_settings() help pages for more settings and also
the first article on this website. For example it is possible to define
pseudo-absence sampling in specific spatial formats, such as sampling
within or outside a minimum convex polygon (MCP) of all presence points
or within a certain buffer.
The package seem to have many suggested dependencies. Is there a way to install them all?
Yes, the ibis.iSDM package uses a range of different functionalities from other, existing packages and if these functions are required for a specific purpose, the packages in question should be installed.
An easy convenience functions to install all packages is
ibis_dependencies() which installs all the packages listed
Is it possible to transform input predictors before model fitting?Yes, in many instances it is desirable to transform input predictors to a new range or to remove extreme values from the predictors. For example the popular [
maxnet] R-package for Maximum entropy estimation of species distribution modelling often transforms the variables via quadratic, hinge or threshold transformations. Although this can also (and preferably should) be done before setting up a model, there are options available to transform the predictors when they are added to the model. See the help page of [
add_predictors()] for examples and descriptions.
Is there a way to limit my predictions to only certain regions in the extent?
Yes, this can be a desirable outcome during the modelling. For
instance one can make - in the absence of better information on
dispersal constrain (see [
add_constrain_dispersal]) - the
assumption that certain species can only disperse within a given
ecoregion and not beyond. See for instance the method section of Wessels,
Merow and Trisos (2021).
To do this directly with the
ibis.iSDM R-package, one
has to specify the limit of projection to the
distribution()] object. In this context the zones extent
over the whole background (and have the same extent and spatial scale).
The prediction will however be limited only to those zones where
supplied biodiversity observations fall in.
# Where zone is a provided raster mod <- distribution(background, limits = zone) |> add_biodiversity_poipo(species_data) |> engine_gdb() |> train() plot(mod) # Alternatively one can also create such limits based on a minimum convex polygon # of the provided data. Here we create a non-buffered MCP across all points used # in species_data to constrain the prediction. mod <- distribution(background, limits_method = "mcp", mcp_buffer = 0) |> add_biodiversity_poipo(species_data) |> engine_gdb() |> train() plot(mod)
There is further a dedicated function
create_zonaloccurrence_mask()]) to help set up such
zones, either by taking existing categorical raster datasets or by
constructing buffers around existing presence records (which could for
example reflect assumed maximum dispersal distances).
How exactly do I add a prior to a model ?
ibis.iSDM R-package allows to users to add prior
information on parameters to any model to be estimated. These priors
have to be added as engine-specific priors and their format depends on
the engine in question (see specific help pages for more information). A
prior can generally be define via a combination of
Prior and has to be wrapped in a
PriorList to be used in the estimation.
# We have prior information that 'Forest' is important for a species # In this case and for the INLA engine we define normal prior on the mean and precision p <- INLAPrior(variable = "Forest",type = "normal",hyper = c(2, 10)) # This is then wrapped in a PriorList pp <- priors(p) print( pp ) # We can specify multiple priors of course p <- list( INLAPrior(variable = "Forest",type = "normal",hyper = c(2, 10)), INLAPrior(variable = "Cropland",type = "normal",hyper = c(0, 1)) ) pp <- priors(pp) # And can now added to the model mod <- distribution(background, limits = zone) |> add_biodiversity_poipo(species_data) |> add_predictors(covariates) |> add_priors(priors = pp) engine_inlabru()
Multiple priors for an Engine can be defined in a
PriorList. Whenever a prior for the same variable is set
again, it overwrites the previous value.
How can I control pseudo-absence sampling, for instance with targeted sampling?
A great number of SDM literature suggests that altering how the
background / pseudo-absence points are created, can greatly affect model
add_pseudoabsence() for references). In the
ibis.iSDM R-package there are options available to modify
how pseudo-absence points are created. By default the package creates at
least 10 000 points or at least 25% of the presence-points (what ever is
larger). To change the default pseudo-absence sampling settings, there
are two options. Either change the global default settings for
pseudo-absence sampling or by adding these settings to the
To overwrite the global settings, do the following:
# Define new settings with greater number of background points ss <- pseudoabs_settings(background = NULL, nrpoints = 1e6) # Overwrite the default settings options("ibis.pseudoabsence" = ss)
Alternatively one could think of specifying specific pseudo-absence sampling information to one biodiversity dataset specifically:
What are the options for parallelization in
Most code in the
ibis.iSDM R-package is by default
already parallelized and many computationally-intensive operations
should be making use of all cores (if you can find an example where this
is not the case, please raise an issue). The number of cores is
generally being decided by the option
In most cases, parallelized code will be run via the
parallel] and [
doParallel] packages, although
there is some code in its infancy to support the [
parallelization approaches as well, offering greater flexibility. See
the function [
ibis_future] for more information also on the
The typical use case is thus to run separate models (via
train()) in a loop or scheduler of a
High-Performance-Computer. Users should be careful in the case of shared
resources, e.g. don’t parallelize such operations on the same machine.
If there is a need to parallelize multiple models on the same instance,
it is suggested to disable the
Can I add multiple offsets to a
add_offset_range() functions allows to specify a spatial
explicit offset term which is then added to the regression model in
question. An offset is generally just a coefficient set to a specific
value. To get more than one offset, one just needs to combine the
different provided offsets in a way that is consistent to get that fixed
value (see here
for reference. This can be done either by summing them as transformed
value (discouraged as it can be errorprone) or by simply multiplying
Internally all provided offsets to model object are combined via simple addition together. This thus requires that users transform them aprior (for instance log transform) before adding them to the estimation.
There are too many messages. How can I run
more verbosely ?
There are two options that can be enabled to reduce the number of messages:
- By setting the parameter
FALSEall messages created by the respective engine are suppressed.
- Setting the parameter
FALSEsuppresses all other package related message. This can be done via
options("ibis.setupmessages" = FALSE)
How can I make use of cross-validation in
Cross-validation has been deliberatly not integrated into package. Users who would like to make use of cross-validation techniques thus need to set up their own external modelling routines. Reason is that with the multiple types of integration in this package, the construction of (independent) testing datasets is not as trivial (considering for example offsets, priors or multiple datasets).
Why are there 2 INLA engines and which one should I use ?
ibis.iSDM R-package there are two engines that
makes use of the INLA framework, namely [
engine_inlabru]. When the package author started
developing this package, the [
engine_inlabru] did not yet
support multiple likelihoods and thus this was implemented directly.
Predictions from [
engine_inlabru] should be identical, although the latter
does not infer the predictions directly, instead simulating from the
posterior. This simulation is particularly helpful for creating (future)
projections as otherwise a new model would need to be fitted for every
engine_inlabru] by default in most cases and only use [
engine_inla] if predictions look uncertain.
I am concerned about extrapolation during prediction. What options do I have?
When creating predictive models such as SDMs it is often a concern to
not predict to a variable range outside the environmental conditions for
which a model was trained. The
ibis.iSDM package supports
variable ‘clamping’ of its predictions similar as the popular Maxent
model, however for each [
engine]. Clamping can be enabled
by setting the parameter
clamp in [
TRUE. This restricts any spatial (or spatial-temporal)
projections to the combined
range of predictor variables
observed for each of the training localities.
Similar functionalities are also available separately during scenario
projections by setting adaptability constraints (see
I am using (too) many predictors in my model. What options do I have ?
Having too many predictors in a SDM can be a cause of substantial over-parametrization and subsequently overfitting (e.g. the model is reproducing the data it was trained with rather than projecting into areas unknown).
It is recommended to (a) either use an engine with very strong
regularization, such as for example [
engine_gdb], (b) train a model with caution and have a
minimum number of observations (arbitrary rule of thumb, have at least
10 observations for each additional predictor included), (c) make use of
pre-estimation removal of predictor, such for example through variable
importance criteria or colinearity. See code below for an example.
# Prior to model fitting, remove highly collinear predictors through a pearson correlation assessment mod <- distribution(background) |> add_biodiversity_poipo(species_data) |> engine_glmnet() |> train(filter_predictors = "pearson") # Alternatively use a RandomForest estimator to remove the least important variables mod <- distribution(background) |> add_biodiversity_poipo(species_data) |> engine_glmnet() |> train(filter_predictors = "RF")
Where can I find predictions created by
distribution model has been trained and if the
inference_only parameter in
train() has been
FALSE (Default), the outputs of the prediction be
found in the created object as a
SpatRaster. By default all
engines will produce a
SpatRaster object with at least one
band called “mean” which is the average prediction by the engine. This
is also the result returned when the created model object is
In addition, for Bayesian Engines other bands quantifying the
posterior predictive uncertainty might be available and can be plotted
as well. The raster can also be saved a spatial GeoTiff to a given
filename using the
mod <- distribution(background) |> add_biodiversity_poipo(species_data) |> engine_inlabru() |> train() # To plot plot(mod, "mean") plot(mod, "sd") # To get the layer mod$get_data("prediction") # To save the output layer as floating point geoTiff write_output(mod, "myoutput.tif", type = "gtif", dt = "FLT4S")
engine_xgboost look pixelated?
This is usually due to either the number of rounds for estimation being too low or the learning_rate being to high. Try out different options in the parameters of the engine. A good way to check the performance is also to plot the evaluation log and logloss.
# Requires a fitted model plot(fit$get_data("fit_best")$evaluation_log)
How exactly does the model integration work and how does it differ among engines?
There are various forms of integration from a simple approach of
offsets] to fully integrated multiple likelihood models
(see Fletcher et
al. 2019). Thus, users have a range of possibilities to combine
different sources of evidence in their modelling.
With regards to how different [
engines] treat multiple
biodiversity datasets. Unfortunately only the
engine_stan()] support fully integrated multiple
likelihood models. A full overview can be found in the Engine comparison table. All other
engines] combine multiple datasets by running separate
models in sequence where the order is determined by the sequence the
datasets were added to the model. Within the
function, users have the option of specifying how previous predictions
should be handled through the [
parameter. For example predictions from one model could be added as
predictors or offset to the next. Or the coefficients from one model can
be used to create starting priors for the next model.
What can I do when my model outputs with presence-only data seem odd?
By default, presence only biodiversity data is modelled as point-process model (PPM, see Renner et al. 2015). Similar as Maximum Entropy models these models can be quite sensitive to biased input, which is common for most non-structured biodiversity observations where presence points tend to be clustered in urban or easily accessible areas. To avoid predictions being biased towards those covariates, there are a number of things that can potentially be done here.
- Modify the targeted background sampling for better control of
background points. This can be for instance done via the
pseudoabs_settings()methods. See the respective help files.
- Make use of spatial thinning approaches. See for instance Aiello-Lammers
et al. 2015 and Steen
et al. 2021. Note however that spatial thinning does remove data
points, affecting for instance any
poissondistributed models (PPMs) in the process. The
ibis.iSDMpackage has some functionality for spatial thinning implemented in the
- Partial out a biased variable during the prediction. The
add_control_bias()function can be used to specify a value that needs to be partialed out in the model. There a
bias_valuehas to be specified and this can be set for instance to
0or an amount assumed to be equivalent to minimal bias.
- Consider setting the [
clamp] parameter in
- Add a spatial offset to account for the bias as introduced in Merow et al. 2016.
This can be done via the
add_offset_bias()function and requires preparation of a bias layer in advance.
- Apply more rigorous filtering and bias control to the input data. In the end no correction can replace good data preparation and cleaning. Remember the GIGO principle.
My model keeps crashing, what can I do?Sorry to hear and unless the model or input data are grossely misspecified, this should not happen. Consider changing the input covariates. Often scaling the covariates (see
add_predictors()) can help with model convergence. Similarly if the use of discrete variables (e.g. `
factorpredictors) results in errors, consider setting the parameter “explode_factors” to TRUE in
add_predictors(). This will create additional covariates for every factor level. Please report an issue with the concrete error message, your R session information (call
sessionInfo()) and preferably some example data and code.
Can I transform the prediction output to a suitability index?
Often it is easier to communicate an index of suitability (scale [0-1]) to stakeholders and policy, which can in principle be derived from ibis.iSDM output. Especially when using Poisson Process models to infer the suitability of a given area, the units can be hard to interpret for non-scientists. An easy way to achieve this has been added as a function to each Biodiversity distribution object. See below for an example.
# Train a model fit <- distribution(background) |> # Presence-absence data add_biodiversity_poipo(my_gbifpoints) |> add_predictors(predictors) |> engine_glmnet() |> train() # Make a transformed prediction of the suitability layer # The output is a normalized prediction surface # created via (x - min) / (max - min) or x/sum(x) respectively pred <- fit$calc_suitabilityindex()
The names of my variables are slightly changing when added to a model?
This is a feature, not a bug ;) Many covariates often come with
unusual characters and symbols that can not be readily used in equations
or in queries of tabular data. The
function cleans those variable names and removes / resets non conform
# It can be disabled by setting the following option to false at the start of the script. options('ibis.cleannames' = FALSE)