Piping in galah

Martin Westgate & Dax Kellie

2023-10-13

galah has been designed to support a piped workflow that mimics workflows made popular by tidyverse packages such as dplyr. Although piping in galah is optional, it can make things much easier to understand, and so we use it in (nearly) all our examples.

To see what we mean, let’s look at an example of how dplyr::filter() works. Notice that dplyr::filter() and galah_filter() both take logical expressions as arguments, written here with the == operator:

library(dplyr)

mtcars |> 
  filter(mpg == 21)
##               mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
## Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4
galah_call() |>
  galah_filter(year == 2021) |>
  atlas_counts()
## # A tibble: 1 × 1
##     count
##     <int>
## 1 8204330

As another example, notice how galah_group_by() + atlas_counts() works very similarly to dplyr::group_by() + dplyr::count():

mtcars |>
  group_by(vs) |> 
  count()
## # A tibble: 2 × 2
## # Groups:   vs [2]
##      vs     n
##   <dbl> <int>
## 1     0    18
## 2     1    14
galah_call() |>
  galah_group_by(biome) |>
  atlas_counts()
## # A tibble: 2 × 2
##   biome           count
##   <chr>           <int>
## 1 TERRESTRIAL 120083089
## 2 MARINE        5189286

We moved towards tidy evaluation so that queries to the Atlas of Living Australia can be built with pipes. In practice, this means that data queries can be filtered in much the same way as you would filter a data.frame with the tidyverse suite of functions.

Piping with galah_call()

You may have noticed in the above examples that dplyr pipes begin with some data, while galah pipes all begin with galah_call() (be sure to add the parentheses!). This function tells galah that you will be using pipes to construct your query. Follow this with your preferred pipe (|> from base or %>% from magrittr). You can then narrow your query line-by-line using galah_ functions. Finally, end with an atlas_ function to identify what type of data you want from your query.
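As a minimal sketch of why this works: the native pipe passes its left-hand side as the first argument of the function on its right, so a piped query is a more readable way of writing nested calls. The snippet below checks this with dplyr on mtcars (assuming R 4.1 or later, where |> is available). The galah lines are left as comments because they need the galah package and network access; the object names piped and nested are illustrative only.

```r
library(dplyr)

# `x |> f(y)` is parsed as `f(x, y)`, so these two calls are the same
piped  <- mtcars |> filter(mpg == 21)
nested <- filter(mtcars, mpg == 21)
identical(piped, nested)

# The same rewrite applies to a galah pipe (not run here):
# galah_call() |> galah_filter(year == 2021) |> atlas_counts()
# atlas_counts(galah_filter(galah_call(), year == 2021))
```

The nested form becomes harder to read as more galah_ functions are added, which is the main reason the piped form is used in the examples here.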

Here is an example using counts of bandicoot records:

galah_call() |>
  galah_identify("perameles") |>
  galah_filter(year >= 2020) |>
  galah_group_by(species, year) |>
  atlas_counts()
## # A tibble: 15 × 3
##    species                year  count
##    <chr>                  <chr> <int>
##  1 Perameles nasuta       2021   3337
##  2 Perameles nasuta       2020   1573
##  3 Perameles nasuta       2022   1515
##  4 Perameles nasuta       2023    625
##  5 Perameles gunnii       2023     95
##  6 Perameles gunnii       2021     72
##  7 Perameles gunnii       2022     64
##  8 Perameles gunnii       2020     49
##  9 Perameles bougainville 2021     84
## 10 Perameles bougainville 2022     72
## 11 Perameles bougainville 2020      1
## 12 Perameles pallescens   2022     30
## 13 Perameles pallescens   2021     25
## 14 Perameles pallescens   2023     20
## 15 Perameles pallescens   2020     11

A second example downloads occurrence records of bandicoots from 2021 and also requests information on which records had zero coordinates:

galah_call() |>
  galah_identify("perameles") |>
  galah_filter(year == 2021) |>
  galah_select(group = "basic", ZERO_COORDINATE) |>
  atlas_occurrences() |>
  head()
## Retrying in 1 seconds.
## # A tibble: 6 × 8
##   recordID              decimalLatitude decimalLongitude eventDate           scientificName taxonConceptID dataResourceName occurrenceStatus
##   <chr>                           <dbl>            <dbl> <dttm>              <chr>          <chr>          <chr>            <chr>           
## 1 00108221-afc6-4246-a…           -28.8             153. 2021-09-29 00:00:00 Perameles nas… https://biodi… NSW BioNet Atlas PRESENT         
## 2 001e914d-0281-41cb-b…           -33.8             151. 2021-04-19 00:00:00 Perameles nas… https://biodi… NSW BioNet Atlas PRESENT         
## 3 00233c1e-66df-4d9c-8…           -33.8             151. 2021-02-27 00:00:00 Perameles nas… https://biodi… NSW BioNet Atlas PRESENT         
## 4 003064b3-490a-49b5-a…           -27.5             152. 2021-11-05 12:06:00 Perameles nas… https://biodi… iNaturalist Aus… PRESENT         
## 5 004fd28b-a899-4a97-8…           -33.8             151. 2021-07-24 00:00:00 Perameles nas… https://biodi… NSW BioNet Atlas PRESENT         
## 6 0068547b-b091-4a86-8…           -33.8             151. 2021-01-28 00:00:00 Perameles nas… https://biodi… NSW BioNet Atlas PRESENT

Note that the order in which galah_ functions are added doesn’t matter, as long as galah_call() goes first, and an atlas_ function comes last.
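One way to check this for yourself is to run the same query twice with the galah_ functions in a different order and compare the results. The sketch below reuses the bandicoot counts query from above; counts_a, counts_b and sort_rows() are hypothetical names introduced for illustration, and both calls need network access to the Atlas of Living Australia.

```r
library(galah)

# Original order: identify, then filter, then group
counts_a <- galah_call() |>
  galah_identify("perameles") |>
  galah_filter(year >= 2020) |>
  galah_group_by(species, year) |>
  atlas_counts()

# Same steps rearranged; galah_call() is still first and atlas_counts() last
counts_b <- galah_call() |>
  galah_group_by(species, year) |>
  galah_filter(year >= 2020) |>
  galah_identify("perameles") |>
  atlas_counts()

# Helper (illustrative): put rows in a fixed order so the comparison
# does not depend on how the rows were returned
sort_rows <- function(x) as.data.frame(x[order(x$species, x$year), ])

all.equal(sort_rows(counts_a), sort_rows(counts_b), check.attributes = FALSE)
```

Because the counts come from a live database, the two results could differ slightly if records are added between the two calls.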

Using dplyr functions in galah

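As of version 1.5.1, it is possible to call dplyr functions natively within galah to amend how queries are processed. In the example below, note that the dplyr-style pipe as written returns a `data_request` object describing the query rather than a table of counts: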

# galah syntax
galah_call() |>
  galah_filter(year >= 2020) |>
  galah_group_by(year) |>
  atlas_counts()
## # A tibble: 4 × 2
##   year    count
##   <chr>   <int>
## 1 2022  8353115
## 2 2021  8204330
## 3 2020  7116059
## 4 2023  2997676
# dplyr syntax
galah_call() |>
  filter(year >= 2020) |>
  group_by(year) |>
  count()
## Object of type `data_request` containing:
## • type occurrences-count
## • filter year >= 2020
## • group_by year

The full list of masked functions is: