Object-Oriented Programming in galah

Martin Westgate, Dax Kellie

2024-02-02

From version 2.0 onwards, galah is built around object-oriented programming principles. In practice, this won’t change the experience for most users. However, it does enable new ways of programming that were not previously available.

Masked functions

The default method for building queries in galah is to first use galah_call() to create a query object called a data_request. This object class is specific to galah.

galah_call() |>
  galah_filter(genus == "Crinia") |>
  class()
## [1] "data_request"

When an object is of class data_request, galah can trigger functions to use specific methods for this object class, even if a function name is used by another package. For example, users can use filter() and group_by() functions from dplyr instead of galah_filter() and galah_group_by() to construct a query. Consequently, the following queries are synonymous:

galah_call() |>
  galah_filter(genus == "Crinia", year == 2020) |>
  galah_group_by(species) |>
  atlas_counts()
galah_call() |> 
  filter(genus == "Crinia", year == 2020) |>
  group_by(species) |>
  atlas_counts()
## # A tibble: 16 × 2
##    species                 count
##    <chr>                   <int>
##  1 Crinia signifera        58579
##  2 Crinia parinsignifera   12782
##  3 Crinia glauerti          3090
##  4 Crinia georgiana         1490
##  5 Crinia remota             784
##  6 Crinia sloanei            644
##  7 Crinia insignifera        536
##  8 Crinia tinnula            417
##  9 Crinia deserticola        278
## 10 Crinia pseudinsignifera   236
## 11 Crinia tasmaniensis       189
## 12 Crinia bilingua            75
## 13 Crinia subinsignifera      45
## 14 Crinia riparia             10
## 15 Crinia flindersensis        3
## 16 Crinia nimba                1

Thanks to object-oriented programming, galah “masks” filter() and group_by() functions to use methods defined for data_request objects instead. The full list of masked functions is:

Note that these functions are all evaluated lazily; they amend the underlying object, but do not amend the nature of the data until the call is evaluated. To actually build and run the query, we’ll need to use one or more of a different set of dplyr verbs: collapse(), compute() and collect().

Advanced query building

The usual way to begin a query to request data in galah is using galah_call(). From version 2.0, to make galah more flexible and modular, the underlying architecture of galah_call() has been divided into several types of request_ functions. You can begin your pipe with one of these dedicated request_ functions (rather than galah_call()) depending on the type of data you want to collect.

For example, if you want to download occurrences, use request_data():

x <- request_data("occurrences") |> # note that "occurrences" is the default `type`
  filter(species == "Crinia tinnula", 
         year == 2010) |>
  collect()

You’ll notice that this query differs slightly from the typical query structure users of galah are familiar with. The desired data type, "occurrences", is specified at the beginning of the query within request_data() rather than at the end using atlas_occurrences(). Specifying the data type at the start allows users to make use of advanced query building using three newly implemented stages of query building: collapse(), compute() and collect(). These stages mirror existing functions in dplyr for querying databases, and act in the following way:

We can use these in sequence, or just leap ahead to the stage we want:

x <- request_data() |>
  filter(genus == "Crinia", year == 2020) |>
  group_by(species) |>
  arrange(species) |>
  count()

collapse(x)
## Object of class query with type data/occurrences-count-groupby 
## url: https://biocache-ws.ala.org.au/ws/occurrences/facets?fq=%28genus%3A%22... 
## arrange: species (ascending)
compute(x)
## Object of class computed_query with type data/occurrences-count-groupby 
## url: https://biocache-ws.ala.org.au/ws/occurrences/facets?fq=%28genus%3A%22... 
## arrange: species (ascending)
collect(x) |> head()
## # A tibble: 6 × 2
##   species              count
##   <chr>                <int>
## 1 Crinia bilingua         75
## 2 Crinia deserticola     278
## 3 Crinia flindersensis     3
## 4 Crinia georgiana      1490
## 5 Crinia glauerti       3090
## 6 Crinia insignifera     536

The benefit of using collapse(), compute() and collect() is that queries are more modular. This is particularly useful for large data requests in galah. Users can send their query using compute(), and download data once the query has finished — downloading with collect() later — rather than waiting for the request to finish within R.

# Create and send query to be calculated server-side
request <- request_data() |>
  identify("perameles") |>
  filter(year > 1900) |>
  compute()
  
# Download data
request |>
  collect()

Additionally, functions that are more modular are generally easier to interrogate and debug. Previously some functions did several different things, making it difficult to know which APIs were being called, when, and for what purpose. Partitioning queries into three distinct stages is much more transparent, and allows users to check their query construction prior to sending a request. For example, the query above is constructed with the following information, returned by collapse().

request_data() |>
  identify("perameles") |>
  filter(year > 1900) |>
  collapse()
## Object of class query with type data/occurrences 
## url: https://biocache-ws.ala.org.au/ws/occurrences/offline/download?fq=%28l...

The collapse() stage includes an additional argument (.expand) that, when set to TRUE, shows all the APIs called to construct the user-requested query. This is especially useful for debugging.

New object classes

Under the hood, the different query-building verbs amend our object to a new class:

These can be called directly, or via the method and type arguments of galah_call(), which specify which dedicated request_ function and data type to return. To demonstrate what we mean, take the following calls, which despite using different syntax, all return the number of records available for the year 2020:

# new syntax
request_data() |>
  filter(year == 2020) |>
  count() |>
  collect()

# similar, but using `galah_call()`
galah_call(method = "data",
           type = "occurrences-count") |>
  filter(year == 2020) |>
  collect()

# original syntax
galah_call() |>
  galah_filter(year == 2020) |>
  atlas_counts()

Another example is to list available fields in the selected atlas:

request_metadata(type = "fields") |>
  collect()

galah_call(method = "metadata", 
           type = "fields") |>
  collect()

show_all(fields)

Or to show values for states and territories:

request_metadata() |>
  filter(field == "cl22") |>
  unnest() |>
  collect()

galah_call(method = "metadata", 
           type = "fields-unnest") |>
  galah_filter(id == "cl22") |>
  collect()

search_all(fields, "cl22") |>
  show_values()

Although there is little reason to use request_metadata() rather than show_all() most of the time, in some cases larger databases like GBIF return huge data.frames of metadata. galah allows users to use collapse(), compute() and collect() on all types of requests, meaning users are able to download large metadata queries using the process detailed above in “Advanced query building” to get around this issue.

Do I need to use advanced query building?

Despite these benefits, we have no plans to require this new syntax; functions prefixed with galah_ or atlas_ are not going away.

Indeed, while there is perfect redundancy between old and new syntax in some cases, in others they serve different purposes. In atlas_media() for example, several calls are made and joined in a way that reduces the number of steps required by the user.

Under the hood, however, all atlas_ functions are now entirely built using the above syntax.