Process subnational statistics • mapspam2globiom

Subnational statistics on harvested area are a key input for SPAM-C. Before they can be used the data need to be put in the right format, made consistent and aggregated (or split) to the crops and farming systems that are used by SPAM-C. Unfortunately, the level of detail (e.g. coverage of subnational level 1 and 2), coverage (full crop and administrative unit coverage or many missing values) and quality (do lower levels of administrative unit data add up to higher levels?) of these statistics may differ enormously between countries. Cleaning, harmonizing and preparing the subnational statistics is probably the most time consuming and difficult part of running SPAM-C. In many cases the researcher has to make a value judgment on which statistics to keep and which to discard in order to make them consistent. Which data is more reliable if they do not add up? Administrative unit level 1 or 2? If, the harvested area for say vegetables seems exceptionally and perhaps unrealistically large in a certain administrative region, is this a data entry error or is it really that high? These are decisions that often need to be taken when preparing the data. There is no simple formula or function that makes it possible to (semi) automatically prepare the raw subnational statistics so that they can be used by SPAM-C. Unfortunately it requires a substantial amount of user input and ad hoc coding to get the data ready.

We prefer to break down the processing and preparation of the subnational statistics into two steps:

Processing of raw subnational statistics. In this step, the raw statistics, meaning the various data files that were downloaded from the national statistical office or taken from any other source, are reshaped, aggregated and cleaned so they are (almost) ready to be used by SPAM-C.
Checking and calibrating processed subnational statistics.: In this step, the processed statistics are checked for consistency, adjusted where needed and calibrated to the FAOSTAT national statistics.

The scripts to prepare the data for Malawi offer some guidance on how to organize these two steps and can be adapted where needed. They are located in the 02_subnational_statistics folder. They also illustrate the use of several functions and tools that we developed to support the process. We prefer and highly recommend to code all the steps that are needed to process statistics in R so everything is nicely documented and fully reproducible. The tidyr and dplyr packages offer a variety of functions that are specially designed to clean up and process ‘messy’ data. It is also possible to use Excel or any other software to process the raw national statistics as long as the statistics are saved in the correct format so they can be used for further processing. Obviously a major disadvantage of using Excel is that it will be very difficult to track changes or quickly make adjustments when new data becomes available and needs to be added.

Support scripts

Apart from the main scripts that clean and check the statistics, which will be discussed below, the template contains three other scripts that support these processes.

01_process_faostat_crops.r
04_process_aquastat.r
05_calculate_irrigated_system_share.r

The first script prepares the national FAOSTAT statistics. These are used for two purposes. The first is for calibration of the subnational statistics so they are in line with FAOSTAT. As FAOSTAT is the leading source of agricultural data, it is useful to ensure that the subnational data adds up to the FAOSTAT national totals. For many countries the totals will already be the same or at least in the same range as FAOSTAT as subnational agricultural statistics are often the basis for the construction of the national statistics, which in turn are used as input by FAOSTAT. Nonetheless, one might also encounter large differences between the two sources of data. The second purpose is to determine the list of crops that is needed to prepare the subnational agricultural statistics. We will discuss this further below. Make sure to run 01_process_faostat_crops.r before starting to prepare the subnational statistics. Do not forget to set the faostat_version variable in the script, which corresponds with the date you added when you downloaded the data (20200303 in the Malawi example).

The second and third script select relevant country data from the AQUASTAT database and use this to calculate the share of irrigated area per crop, respectively. This information can be used as input to inform the irrigated area shares that are needed for the farming system shares data table (See below). We will not discuss these scripts here. Note that 05_calculate_irrigated_system_share.r needs the processed harvest area statistics as input so it can only by run at the end, denoted by the number 05 in the name of the script. Please have a look and try by yourself.

Processing of raw subnational statistics

The 02_process_raw_national_statistics.r file shows all the steps we took to process the raw statistics for Malawi (taken from global SPAM for 2010), put them into the right format and make them consistent with the shapefile with the locations of the administrative units. Most of the data was already in the right format so we did not use them.We highlight several key steps you probably also need to implement to process the data for your country.

You need to create three files with subnational statistical information that will be combined later:

A file with Harvest area statistics (ha) in hectares for all subnational administrative unit at each level.
A file with the Cropping intensity (ci) for each subnational administrative unit at the national and subnational 1. If the model is solved at the subnational level 1 (by setting the model level to 1), also cropping intensity data is required at subnational level 1.
A file with Farming system area shares (fs) (not percentages!) for each subnational administrative unit at the national level (same format as the cropping intensity information).

It is the best to start with processing the harvest area statistics and then work on the cropping intensity and farming system shares files. Data for these three files is most likely coming from different sources and can be processed independently. Also cropping intensity and to a lesser extent farming system shares data sometimes needs to be adjusted after running the model as it happens that there is not enough cropland to allocate all the harvested area from the statistics, which suggests that the cropping intensity is too low for some regions.

Putting the subnational statistics in the right format

All crop statistics need to be stored in the wide format. With this we mean that the first four columns of the data table list the administrative unit code (adm_code), name (adm_name), level (adm_level) and in case of farming system shares or cropping intensity data, the farming system (system), followed by named columns, one for each crop, with harvested area, farming system shares or cropping intensity per subnational unit. adm_name and adm_code must be provided up to the most detailed administrative unit for which data is available. Hence, in case level 2 data is available, data should be supplied for level 0, 1 and 2. In case only level 1 data is available, data should be supplied for level 0 and level 1.

create_statistics_template() will create templates for the ha, fs and ci data files - also see the Malawi ha, fs and ci files for examples. The number and depth of the administrative units will be determined by the model parameters stored in the spam_par object as well as the adm_list file that is created with create_adm_list(). By default, the template will have columns for all 40 SPAM-C crops.

You can save the template as csv file and use it as a basis to prepare the statistics. Note that is is essential that you follow the template closely! Adding or removing administrative units will result in errors as the data no longer matches with the map that contains the location of the administrative units. See below for more information how to deal with missing crop information. In case of Malawi, most of the data was already in the right format so we did not use the templates.

# Create template for ha. Replace "ha" by "fs" or "ci" for other templates.
create_statistics_template("ha", param)
#> # A tibble: 1 x 43
#>   adm_name adm_code adm_level whea  rice  maiz  barl  mill  sorg  ocer  pota 
#>   <chr>    <chr>        <dbl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 Malawi   MWI              0 NA    NA    NA    NA    NA    NA    NA    NA   
#> # ... with 32 more variables: swpo <lgl>, yams <lgl>, cass <lgl>, orts <lgl>,
#> #   bean <lgl>, chic <lgl>, cowp <lgl>, pige <lgl>, lent <lgl>, opul <lgl>,
#> #   soyb <lgl>, grou <lgl>, cnut <lgl>, oilp <lgl>, sunf <lgl>, rape <lgl>,
#> #   sesa <lgl>, ooil <lgl>, sugc <lgl>, sugb <lgl>, cott <lgl>, ofib <lgl>,
#> #   coff <lgl>, coco <lgl>, teas <lgl>, toba <lgl>, bana <lgl>, plnt <lgl>,
#> #   trof <lgl>, temf <lgl>, vege <lgl>, rest <lgl>

For processing it is easier to transform the data into the long format, which is similar to the tidy data format as implemented by the tidyverse packages. In our case this means that the 40 crop columns are squeezed into two columns, one with the crop code and another with the data. The gather() and spread() in the tidyr package were designed to help you switch between wide and long tables.¹ See below and the Malawi scripts for examples.

Aggregating to SPAM-C crop classes

The raw statistics need to be mapped to the 40 crop and crop groups used by SPAM-C. The list of crops can be found in crop.csv, which is automatically copied to the mappings folder after running create_spam_folders(). In practice, almost none of the countries produces all 40 crops. To determine the set of relevant crops, you can use the list of crops for which FAOSTAT presents data. Running 01_process_faostat_crops.r will save a faostat_crop_list.csv file in the processed_data/lists folder. Note that it is no problem to keep all the 40 crops in your database as long as you set their values to NA or -999 (see below) so they are automatically removed in the next processing step.

The easiest way to aggregate the raw subnational statistics to the FAOSTAT crop list is to create an orig2crop mapping table, which links the two crop lists. It might happen that the subnational statistics contain crops that are not present in FAOSTAT. In these cases you have to either exclude them or add them to one of the crops for which FAOSTAT presents data.

The script below illustrates how to aggregate the raw statistics to the SPAM-C crops. Note that it important to use sum with na.rm = FALSE as otherwise NA+NA will result in 0 not NA. This is crucial as missing data in the statistics need to be set to NA for SPAM-C in order to process them correctly. If data for administrative units need to be aggregated a similar solution using a mapping table can be applied.

# NB: use sum with na.rm = F as we want NA+NA = NA, not NA+NA = 0!
stat <- stat %>%
  left_join(orig2crop) %>%
  group_by(crop, adm_code, adm_name, adm_level) %>%
  summarize(value_ha = sum(value_ha, na.rm = F)) %>%
  ungroup()

Treatment of missing data

It is unlikely that you will find subnational statistics for all relevant crops in a country. As a minimum requirement, SPAM-C needs full national level statistics for each crop. This is generally no problem as harvested area statistics can be taken from FAOSTAT. Farming system shares and cropping intensity data might be more difficult to obtain and often require making strong assumptions. At the subnational level, it is generally no problem to have missing values for harvested area as data will be allocated to the regions when the model is solved. Full information (harvest area statistics, farming system shares and cropping intensity) is only needed at administrative level 1 if the model is run at level 1 (solve level = 1) and never at administrative level 2.

In case data on harvested area is missing for a certain administrative unit, you can indicate this by using NA or -999. We highly recommend using the latter, particularly if you are using Excel to process the data as it avoids potential errors that may result between an NA and an empty string "" value. These are both displayed as empty cells in Excel but will be treated differently by R after loading the file. Replacing NA values by -999 can be done simultaneously when putting the data in the wide format.

# Put in preferred mapspam format, adding -999 for missing values
stat_mapspam <- stat %>%
  spread(crop, value_ha, fill = -999) %>%
  arrange(adm_code, adm_code, adm_level)

As pointed out above, it is safe to set all values for a crop to NA or -999 when the crop is not relevant for the country. These will be filtered out in the script that is described in the next section.

Data consistency

Data should always be consistent and add up, meaning that the sum of harvested area for all administrative units and a given crop is the same or smaller (in case of missing information) as the corresponding higher level unit. The script to check the statistics includes a function to reaggregate the data from the bottom up and check for consistency. It is however, essential that you check for consistency yourself when the data is processed as otherwise a large bias might be introduced resulting in completely erroneous data and model output. For example, suppose there is error in the raw harvested area statistics for maize for a certain administrative level 2 region. As a result the sum of harvested maize area of all level 2 units is much higher than the maize area of the corresponding level 1 unit, which is a clear inconsistency. The function that reaggregates the statistics will simply take the level 2 total for maize and substitute the level 1 maize data. This error is subsequently carried forward when the maize area for all level 1 unit are aggregated to the national total. The final data table will include maize data that is strongly biased upwards.

Check and calibrate subnational statistics

After most hard work is done, the ha, fs and cs files need to be checked for consistency and calibrated to the FAOSTAT national statistics. This is done in 03_check_and_calibrate_statistics.r. The first step is to check if the data is consistent, which is done by check_statistics(). If out = TRUE is set a report is returned which shows where the inconsistencies occur.

# Check the consistency of the ha statistics
check_statistics(ha_df, param, out = TRUE)
#> adm0 and adm1 are not equal!. Did you run rebalance_stat?
#> # A tibble: 29 x 4
#> # Groups:   crop [29]
#>    crop     adm0    adm1 difference
#>    <chr>   <dbl>   <dbl>      <dbl>
#>  1 bana    18342   18208        134
#>  2 bean   281065  281058          7
#>  3 cass   193993  193932         61
#>  4 chic   128936  128933          3
#>  5 coff     3001    3001          0
#>  6 cott    63552   63552          0
#>  7 cowp    64297   64297          0
#>  8 grou   290092  290093         -1
#>  9 lent     1904    1905         -1
#> 10 maiz  1660214 1660044        170
#> # ... with 19 more rows

To reaggreate the statistics from the bottom up, use reaggregate_statistics().

# Reaggreate from the bottom up.
ha_df <- reaggregate_statistics(ha_df, param)

# Check again.
check_statistics(ha_df, param, out = T)

Next, the statistics are calibrated to the national FAOSTAT statistics. The code below shows how this is implemented. First, the code checks if there are still crops included which are not in the FAOSTAT data and removes them if needed. Next, crops are identified that are not present in the statisitics and the national total is added from the FAOSTAT database. Finally all the subnational statistics are proportionally scaled so the total is equal to the FAOSTAT figures.

# Identify crops that are present in ha_df but not in fao and remove them from ha_df.
crop_rem <- setdiff(unique(ha_df$crop), unique(fao$crop))
ha_df <- ha_df %>%
  filter(!crop %in% crop_rem)

# Identify crops that are present in fao but not in ha_df.
# We will add then to ha_df
crop_add <- setdiff(unique(fao$crop), unique(ha_df$crop))
ha_df <- ha_df %>%
  bind_rows(
    fao %>%
      filter(crop %in% crop_add) %>%
      mutate(
        fips = unique(ha_df$adm_code[ha_df$adm_level==0]),
        adm_level = 0,
        adm = unique(ha_df$adm_name[ha_df$adm_level==0])))

# Calculate scaling factor
fao_stat_sf <-bind_rows(
  fao %>%
    mutate(source = "fao"),
  ha_df %>%
    filter(adm_level == 0) %>%
    mutate(source = "ha_df")) %>%
  dplyr::select(crop, source, ha) %>%
  spread(source, ha) %>%
  mutate(sf = fao/ha_df) %>%
  dplyr::select(crop, sf)

# rescale ha_df
ha_df <- ha_df %>%
  left_join(fao_stat_sf) %>%
  mutate(ha = ha * sf) %>%
  dplyr::select(-sf)

Finally the three files (ha, fs and ci), need to be saved in the `processed_data/agricultural_statistics folders. It is important to use the correct names otherwise other functions cannot find the data.

write_csv(ha_df, file.path(param$spam_path, glue("processed_data/agricultural_statistics/ha_adm_{param$year}_{param$iso3c}.csv")))
write_csv(fs_df, file.path(param$spam_path, glue("processed_data/agricultural_statistics/fs_adm_{param$year}_{param$iso3c}.csv")))
write_csv(ci_df, file.path(param$spam_path, glue("processed_data/agricultural_statistics/ci_adm_{param$year}_{param$iso3c}.csv")))

The next step is to process the spatial data.

Note that these functions have recently been replaced by pivot_long() and pivot_wide(). In a next update we will use these as well.↩︎