Spatial filtering

Callum Waite, Shandiya Balasubramaniam

2023-10-13

Biodiversity queries to the ALA usually require some spatial filtering. Let’s see how spatial data are stored in the ALA and a few different methods to spatially filter data with galah and other packages.

Most records in the ALA contain location data in the form of two key fields: decimalLatitude and decimalLongitude. As expected, these fields are the decimal coordinates of each occurrence, with south and west values denoted by negatives. While there may be some uncertainty in these values (see field coordinateUncertaintyInMeters), they are generally very accurate.

While these are very important and useful fields, very rarely will we encounter situations that require queries directly calling decimalLatitude and decimalLongitude. Instead, the ALA and galah have a number of features that make spatial queries simpler.

Contextual and spatial layers

Often we want to filter results down to some commonly defined spatial regions, such as states, LGAs or IBRA/IMCRA regions. The ALA contains a large range (>100) of contextual and spatial layers, in-built as searchable and queriable fields. They are denoted by names beginning with "cl", followed by an identifying number that may be up to 6 digits long. These fields are each based on shapefiles, and contain the names of the regions in these layers that each record lies in.

We strongly recommend using search_fields() to check whether a contextual layer already exists in the ALA that matches what you require before proceeding with other methods of spatial filtering. These fields are all able to be queried with galah_filter()and so they are generally easier to use.

Suppose we are interested in querying records of the Red-Necked Avocet (Recurvirostra novaehollandiae) in a particular protected wetlands, the Coorong wetlands in South Australia. We can search the ALA fields for wetlands.

library(galah)
library(tidyverse)
library(gt)
library(sf)
galah_config(email = "your_email_here", verbose = FALSE)
search_fields("wetlands")
## # A tibble: 1 × 3
##   id    description                     type  
##   <chr> <chr>                           <chr> 
## 1 cl901 Directory of Important Wetlands fields

Our search identifies that layer cl901 seems to match what we are looking for. We can then either view all possible values in the field with show_values(), or search again for our particular field.

search_fields("cl901") |> search_values("coorong")
## • Showing values for 'cl901'.
## # A tibble: 1 × 1
##   cl901                                      
##   <chr>                                      
## 1 The Coorong, Lake Alexandrina & Lake Albert

We can filter all occurrences for exact matches with this value, "Lake Eyre". Our galah query can be built as follows:

galah_call() |>
  galah_identify("Recurvirostra novaehollandiae") |>
  galah_filter(cl901 == "The Coorong, Lake Alexandrina & Lake Albert") |>
  atlas_occurrences() |>
  head(5) |>
  gt::gt()
## Retrying in 1 seconds.
## Retrying in 2 seconds.
## Retrying in 4 seconds.
recordID scientificName taxonConceptID decimalLatitude decimalLongitude eventDate occurrenceStatus dataResourceName
00630f58-ee78-4f69-ba39-718b0a1c3356 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.59293 139.0218 2008-11-18 00:00:00 PRESENT SA Fauna
00705077-abc2-4f28-aedd-11a479ec2d82 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.52833 138.8089 2016-03-06 00:00:00 PRESENT BirdLife Australia, Birdata
0071444e-ddf2-45dc-9a10-c6399686e306 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.57719 138.9923 2008-01-01 00:00:00 PRESENT SA Fauna
008a2e6c-26b4-4f4e-a428-b0330db53466 Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -35.52831 138.8280 2017-11-19 17:15:00 PRESENT eBird Australia
00b11f13-0100-4471-b18b-7e9736ddeeaf Recurvirostra novaehollandiae https://biodiversity.org.au/afd/taxa/c69e7308-527a-429d-a80d-143bd20b5100 -36.18001 139.6589 2017-08-05 13:25:00 PRESENT eBird Australia

galah_geolocate()

While server-side spatial information is useful, there are likely to be cases where the shapefile or region you wish to query will not be pre-loaded as a contextual layer in the ALA. In this case, shapefiles can be introduced to the filtering process using the {sf} package and the galah_geolocate() function. Shapefiles can be provided as an sf object, whether that is by importing them with sf::st_read() or taking a POLYGON or MULTIPOLYGON character string and transforming them with sf::st_as_sfc().

For instance, we might interested in species occurrences in King George Square, Brisbane. We can take the MULTIPOLYGON object for the square (as sourced from the Brisbane City Council) and transform it into sfc and then sf objects.

king_george_sq <- "MULTIPOLYGON(((153.0243 -27.46886, 153.0242 -27.46896, 153.0236 -27.46837, 153.0239 -27.46814, 153.0239 -27.46813, 153.0242 -27.46789, 153.0244 -27.46805, 153.0245 -27.46821, 153.0246 -27.46828, 153.0247 -27.46835, 153.0248 -27.46848, 153.0246 -27.4686, 153.0246 -27.46862, 153.0245 -27.46871, 153.0243 -27.46886)))" |>
  sf::st_as_sfc() |> 
  sf::st_as_sf()

We can provide this MULTIPOLYGON in our filter as the argument of galah_geolocate() to assess which species have been recorded in King George Square.

galah_call() |>
  galah_filter() |>
  galah_geolocate(king_george_sq) |>
  galah_select(decimalLatitude, decimalLongitude, eventDate, scientificName, vernacularName) |>
  atlas_occurrences() |> 
  head(10) |>
  gt::gt()
decimalLatitude decimalLongitude eventDate scientificName vernacularName
-27.46851 153.0238 2023-04-08 07:49:00 Entomyzon cyanotis Blue-faced Honeyeater
-27.46842 153.0241 2022-08-30 01:29:00 Threskiornis moluccus Black-necked Ibis
-27.46833 153.0241 2022-07-24 14:43:00 Threskiornis moluccus Black-necked Ibis

There is a second argument of galah_geolocate() called type, which defaults to value "polygon". By setting the type argument to "bbox", the provided POLYGON or MULTIPOLYGON will be converted into the smallest bounding box (rectangle) that contains the POLYGON. In this case, records will be included that may not exactly lie inside the provided shape.

galah_call() |>
  galah_filter() |>
  galah_geolocate(king_george_sq, type = "bbox") |>
  galah_select(decimalLatitude, decimalLongitude, eventDate, scientificName, vernacularName) |>
  atlas_occurrences() |>
  head(10) |>
  gt::gt()
## Data returned for bounding box:
## xmin = 153.0236 xmax = 153.0248 ymin = -27.46896 ymax = -27.46789
## Retrying in 1 seconds.
decimalLatitude decimalLongitude eventDate scientificName vernacularName
-27.46882 153.0236 2019-04-21 04:40:00 Threskiornis moluccus Black-necked Ibis
-27.46882 153.0236 2005-08-28 12:18:00 Threskiornis moluccus Black-necked Ibis
-27.46851 153.0238 2023-04-08 07:49:00 Entomyzon cyanotis Blue-faced Honeyeater
-27.46842 153.0241 2022-08-30 01:29:00 Threskiornis moluccus Black-necked Ibis
-27.46833 153.0241 2022-07-24 14:43:00 Threskiornis moluccus Black-necked Ibis
-27.46806 153.0239 NA Syntrichia laevipila NA

Large shapefiles

The type argument with option "bbox" is provided because sf objects with >500 vertices will not be accepted by the ALA. In the event you have a large shapefile, using type = "bbox" will at least enable an initial reduction of the data that is downloaded, before finer filtering to the actual shapefile will obtain the desired set of occurrences. Alternatively, one can also perform the "bbox" reduction before passing the shape to galah_geolocate() by using sf::st_bbox().

A common situation for this to occur is when a shapefile with multiple shapes is provided, where we are interested in grouping our results by each shape. Here is a mock workflow using a subset of a shapefile of all 2,184 Brisbane parks.

Let’s say we are interested in knowing which parks in the Brisbane postcode 4075 have the most occurrences of the Scaly-Breasted Lorikeet, Trichoglossus chlorolepidotus, since 2020. We can download the entire shapefile from the above link, and perform our filtering and summarising as follows:

brisbane_parks <- sf::st_read("path/to/Park___Locations.shp") |>
  sf::st_make_valid() |>
  filter(POST_CODE == 4075)
# Convert shapefile to a bounding box
brisbane_parks_bbox <- brisbane_parks |> sf::st_bbox()

# Find all occurrences of Trichoglossus chlorolepidotus in the bounding box in 2022
lorikeet_brisbane <- galah_call() |>
  galah_filter(scientificName == "Trichoglossus chlorolepidotus", 
               year >= 2020) |>
  galah_geolocate(brisbane_parks_bbox, type = "bbox") |>
  atlas_occurrences()
## Data returned for bounding box:
## xmin = 152.96331 xmax = 152.99668 ymin = -27.57737 ymax = -27.51606
## Retrying in 1 seconds.
# Filter records down to only those in the shapefile polygons
lorikeet_brisbane |>
  # Create a point geometry based on the occurrence coordinates
  sf::st_as_sf(coords = c("decimalLongitude", "decimalLatitude"), crs = sf::st_crs(brisbane_parks), remove = FALSE) |>
  # identify which park each occurrence sits in with st_intersects()
  mutate(intersection = sf::st_intersects(geometry, brisbane_parks) |> as.integer(),
         park = ifelse(is.na(intersection), NA, brisbane_parks$PARK_NAME[intersection])) |>
  # Filter out occurrences that did not occur in a park
  filter(!is.na(park)) |>
  # Drop the geometry column
  sf::st_drop_geometry() |>
  # Summarise the top 10 parks for lorikeet sightings in 2022
  group_by(park) |>
  summarise(counts = n()) |>
  arrange(desc(counts)) |>
  head(10)
## # A tibble: 10 × 2
##    park                           counts
##    <chr>                           <int>
##  1 SHERWOOD ARBORETUM                276
##  2 NYUNDARE-BA PARK                  271
##  3 FAULKNER PARK                      51
##  4 STRICKLAND TERRACE PARK            33
##  5 FORT ROAD BUSHLAND                 31
##  6 GRACEVILLE RIVERSIDE PARKLANDS     31
##  7 BENARRAWA RESERVE                  19
##  8 NOSWORTHY PARK                     16
##  9 HORACE WINDOW RESERVE               9
## 10 GRACEVILLE AVENUE PARK              6

Some shapefiles cover large geographic areas with the caveat that even the bounding box doesn’t restrict the number of records to a value that can be downloaded easily. In this case, we recommend more nuances and detailed methods that can be performed using looping techniques. One of our ALA Labs blog posts, Hex maps for species occurrence data, has been written detailing how to approach larger problems such as this.