The IPUMS API provides two asset types, both of which are supported by ipumsr:
IPUMS extract endpoints can be used to submit extract requests for processing and download completed extract files.
IPUMS metadata endpoints can be used to discover and explore available IPUMS data as well as retrieve codes, names, and other extract parameters necessary to form extract requests.
Use of the IPUMS API enables the adoption of a programmatic workflow that can help users to:
The basic workflow for interacting with the IPUMS API is as follows:
Before getting started, we’ll load the necessary packages for the examples in this vignette:
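A minimal sketch of that setup, assuming the examples rely on ipumsr for the API functions and on purrr for the list helpers used later in this vignette (the original may have loaded other packages as well):

```r
# Assumes both packages are already installed, e.g. with
# install.packages(c("ipumsr", "purrr"))
library(ipumsr)
library(purrr)
```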
IPUMS extract support is currently available via API for the following collections:
Note that this support only includes data available via a collection’s extract engine. Many collections provide additional data via direct download, but these products are not supported by the IPUMS API.
IPUMS metadata support is currently available via API for the following collections:
API support will continue to be added for more collections in the
future. You can check general API availability for all IPUMS collections
with ipums_data_collections():
    ipums_data_collections()
    #> # A tibble: 14 × 4
    #>    collection_name     collection_type code_for_api api_support
    #>    <chr>               <chr>           <chr>        <lgl>
    #>  1 IPUMS USA           microdata       usa          TRUE
    #>  2 IPUMS CPS           microdata       cps          TRUE
    #>  3 IPUMS International microdata       ipumsi       TRUE
    #>  4 IPUMS NHGIS         aggregate data  nhgis        TRUE
    #>  5 IPUMS IHGIS         aggregate data  ihgis        FALSE
    #>  6 IPUMS ATUS          microdata       atus         FALSE
    #>  7 IPUMS AHTUS         microdata       ahtus        FALSE
    #>  8 IPUMS MTUS          microdata       mtus         FALSE
    #>  9 IPUMS DHS           microdata       dhs          FALSE
    #> 10 IPUMS PMA           microdata       pma          FALSE
    #> 11 IPUMS MICS          microdata       mics         FALSE
    #> 12 IPUMS NHIS          microdata       nhis         FALSE
    #> 13 IPUMS MEPS          microdata       meps         FALSE
    #> 14 IPUMS Higher Ed     microdata       highered     FALSE
The tools in ipumsr may not necessarily support all the functionality currently supported by the IPUMS API. See the API documentation for more information about its latest features. Furthermore, the API tools in ipumsr are in active development, and some functionality may not yet be stable.
To interact with the IPUMS API, you’ll need to register for access with the IPUMS project you’ll be using. If you have not yet registered, you can find links to register for each of the API-supported IPUMS collections below:
Once you’re registered, you’ll be able to create an API key.
By default, ipumsr API functions assume that your key is stored in
the IPUMS_API_KEY environment variable. You can also
provide your key directly to these functions, but storing it in an
environment variable saves you some typing and helps prevent you from
inadvertently sharing your key with others.
You can save your API key to the
IPUMS_API_KEY environment variable with
set_ipums_api_key(). To save your
key for use in future sessions, set
save = TRUE. This will
add your API key to the
.Renviron file in your user home directory.
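As a rough sketch of this step, the following stores a key for the current session only. The key string is a hypothetical placeholder, not a real key; replace it with the one from your IPUMS account.

```r
library(ipumsr)

# Hypothetical placeholder: replace with the key from your IPUMS account
my_key <- "paste-your-api-key-here"

# Store the key in IPUMS_API_KEY for the current session only
set_ipums_api_key(my_key)

# Confirm that the environment variable is now populated
nzchar(Sys.getenv("IPUMS_API_KEY"))

# To persist the key across sessions, write it to .Renviron instead:
# set_ipums_api_key(my_key, save = TRUE)
```

The persistent variant is left commented out because it modifies a file in your home directory.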
The rest of this vignette assumes you have obtained an API key and
stored it in the
IPUMS_API_KEY environment variable.
Each IPUMS collection has its own extract definition function that is
used to specify the parameters of a new extract request from scratch.
These functions take the form define_extract_*().
When you define an extract request, you can specify the data to be included in the extract and indicate the desired format and layout.
For instance, the following defines a simple IPUMS USA extract
request for the
AGE, SEX, RACE, STATEFIP, and MARST variables from the 2018
and 2019 American Community Survey (ACS):
    usa_ext_def <- define_extract_usa(
      description = "USA extract for API vignette",
      samples = c("us2018a", "us2019a"),
      variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
    )

    usa_ext_def
    #> Unsubmitted IPUMS USA extract
    #> Description: USA extract for API vignette
    #>
    #> Samples: (2 total) us2018a, us2019a
    #> Variables: (5 total) AGE, SEX, RACE, STATEFIP, MARST
The exact extract definition options vary across collections, but all collections can be used with the same general workflow. For more details on the available extract definition options, see the associated microdata and NHGIS vignettes.
For the purposes of demonstrating the overall workflow, we will continue to work with the sample IPUMS USA extract definition created above.
define_extract_*() functions always produce an
ipums_extract object, which can be handled by other API
functions (see ?ipums_extract). Furthermore, these objects
will have a subclass for the particular collection with which they are
associated.
Many of the specifications for a given extract request object can be accessed by indexing the object:
ipums_extract objects also contain information about the
extract request’s processing status and its assigned extract number,
which serves as an identifier for the extract request. Since this
extract request is still unsubmitted, it has no request number:
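A short sketch of both kinds of access, reusing the definition from above. It assumes that samples and variables are stored as named lists, which is consistent with the names(.x$variables) call used later in this vignette. No API key is needed, since nothing is submitted.

```r
library(ipumsr)

usa_ext_def <- define_extract_usa(
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
)

# Specifications of the request
usa_ext_def$description
names(usa_ext_def$samples)
names(usa_ext_def$variables)

# Processing status and extract number of the unsubmitted definition
usa_ext_def$status
usa_ext_def$number
```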
To obtain the data requested in the extract definition, we must first submit it to the IPUMS API for processing.
To submit an extract definition, use submit_extract().
If no errors are detected in the extract definition, a submitted extract request will be returned with its assigned number and status. Storing the returned object can be useful for checking the extract request’s status later.
The extract number will be stored in the returned object:
Note that some fields of a submitted extract may be automatically updated by the API upon submission. For instance, for microdata extracts, additional preselected variables may be added to the extract even if they weren’t specified explicitly in the extract definition.
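The submission step might look like the following sketch. The guard on IPUMS_API_KEY is an addition for illustration, so that the code does nothing when no key is configured; the name usa_ext_submitted matches the object used in the rest of this vignette.

```r
library(ipumsr)

usa_ext_def <- define_extract_usa(
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
)

# Only contact the API when a key is available in the environment
if (nzchar(Sys.getenv("IPUMS_API_KEY"))) {
  usa_ext_submitted <- submit_extract(usa_ext_def)

  # Number assigned by the API, used to identify the request later
  print(usa_ext_submitted$number)

  # Variables the API added on top of the five requested ones
  print(setdiff(
    names(usa_ext_submitted$variables),
    names(usa_ext_def$variables)
  ))
}
```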
If you forget to store the updated extract object returned by
submit_extract(), you can use the
get_last_extract_info() helper to request the information
for your most recent extract request for a given collection:
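For example, a sketch that recovers the latest IPUMS USA request (again guarded so that it only runs when an API key is configured):

```r
library(ipumsr)

if (nzchar(Sys.getenv("IPUMS_API_KEY"))) {
  # Most recent extract request for the "usa" collection
  usa_last <- get_last_extract_info("usa")
  print(usa_last$number)
  print(usa_last$status)
}
```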
It may take some time for the IPUMS servers to process your extract
request. You can ensure that an extract has finished processing before
you attempt to download its files by using
wait_for_extract(). This polls the API regularly until
processing has completed (by default, each interval increases by 10
seconds). It then returns an ipums_extract object
containing the completed extract definition.
    usa_ext_complete <- wait_for_extract(usa_ext_submitted)
    #> Checking extract status...
    #> Waiting 10 seconds...
    #> Checking extract status...
    #> IPUMS USA extract 348 is ready to download.

    usa_ext_complete$status
    #> [1] "completed"

    # `download_links` should be populated if the extract is ready for download
    names(usa_ext_complete$download_links)
    #> [1] "r_command_file"     "basic_codebook"     "data"
    #> [4] "stata_command_file" "sas_command_file"   "spss_command_file"
    #> [7] "ddi_codebook"
wait_for_extract() will tie up your R session
until your extract is ready to download. While this is fine in a
strictly programmatic workflow, it may be frustrating when working
interactively, especially for large extracts or when the IPUMS servers
are busy.
In these cases, you can manually check whether an extract is ready
for download with
is_extract_ready(). As long as this returns
TRUE, you should be able to download your extract’s
files.
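One way to use this check without blocking the session is sketched below. download_if_ready() is a hypothetical helper written for this example, not part of ipumsr.

```r
library(ipumsr)

# Hypothetical helper: download only if the extract is ready, without blocking
download_if_ready <- function(extract, download_dir = tempdir()) {
  if (is_extract_ready(extract)) {
    download_extract(extract, download_dir = download_dir)
  } else {
    message("Extract is not ready yet; check again later.")
    invisible(NULL)
  }
}

# Usage (requires a valid API key and a previously submitted extract):
# ddi_path <- download_if_ready(usa_ext_submitted)
```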
For a more detailed status check, provide the extract’s collection
and number to
get_extract_info(). This returns an
ipums_extract object reflecting the requested extract
definition with the most current status. The
status field of a submitted extract reports its current processing
state (for example, "completed" once processing has finished).
Note that extracts are removed from the IPUMS servers after a set
period of time (72 hours for microdata collections, 2 weeks for IPUMS
NHGIS). Therefore, an extract that has a "completed" status
may still be unavailable for download.
is_extract_ready() will alert you if the extract has
expired and needs to be resubmitted. Simply use
submit_extract() to resubmit an extract request. Note that
this will produce a new extract (with a new extract number),
even if the extract definition is identical.
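The expiry case could be handled along these lines. resubmit_if_expired() is a hypothetical helper, and the sketch assumes that an expired extract reports a "completed" status while is_extract_ready() returns FALSE for it.

```r
library(ipumsr)

# Hypothetical helper: resubmit an extract whose files have expired
resubmit_if_expired <- function(extract) {
  # Refresh the definition so that the status is current
  info <- get_extract_info(extract)

  # Assumption: expired extracts are "completed" but no longer ready
  expired <- identical(info$status, "completed") && !is_extract_ready(info)

  if (expired) {
    # Resubmitting creates a new request with a new extract number
    return(submit_extract(info))
  }

  info
}

# Usage (requires a valid API key):
# usa_ext_current <- resubmit_if_expired(usa_ext_submitted)
```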
Once your extract has finished processing, use
download_extract() to download the extract’s data files to
your local machine. This will return the path to the downloaded file(s)
required to load the data into R.
For microdata collections, this will be the path to the DDI codebook (.xml) file, which can be used to read the associated data (contained in a .dat.gz file).
For NHGIS, this will be a path to the .zip archive containing the requested data files and/or shapefiles.
The files produced by
download_extract() can be passed
directly into the reader functions provided by ipumsr. For instance, for
microdata extracts, pass the DDI codebook path to read_ipums_micro().
If instead you’re working with an NHGIS extract, use read_nhgis() for tabular data or read_ipums_sf() for spatial data.
See the associated vignette for more information about loading IPUMS data into R.
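For a microdata extract, downloading and reading can be sketched as follows. load_micro_extract() is a hypothetical wrapper, and writing to tempdir() is an arbitrary choice made for the example.

```r
library(ipumsr)

# Hypothetical helper: download a completed microdata extract and load it
load_micro_extract <- function(extract, download_dir = tempdir()) {
  # For microdata, download_extract() returns the path to the DDI codebook
  ddi_path <- download_extract(extract, download_dir = download_dir)

  # Read the codebook metadata, then the data file it describes
  ddi <- read_ipums_ddi(ddi_path)
  read_ipums_micro(ddi)
}

# Usage (requires a valid API key and a completed extract):
# usa_data <- load_micro_extract(usa_ext_complete)
```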
To retrieve the definition corresponding to a particular extract,
provide its collection and number to get_extract_info().
These can be provided either as a single string of the form
"collection:number" or as a length-2 vector:
c(collection, number). Several other API functions support
this syntax as well.
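Both forms are sketched below. Extract number 348 is the one shown earlier in this vignette; substitute a number from your own account.

```r
library(ipumsr)

if (nzchar(Sys.getenv("IPUMS_API_KEY"))) {
  # Single string of the form "collection:number"
  ext_from_string <- get_extract_info("usa:348")

  # Length-2 vector of the form c(collection, number)
  ext_from_vector <- get_extract_info(c("usa", 348))

  # Both calls should describe the same extract request
  print(identical(ext_from_string$number, ext_from_vector$number))
}
```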
If you know you made a specific extract definition in the past, but
you can’t remember the exact number, you can use
get_extract_history() to peruse your recent extract
requests for a particular collection.
By default, this returns your 10 most recent extract requests as a
list of ipums_extract objects. You can adjust how many
requests to retrieve with the how_many argument:
    usa_extracts <- get_extract_history("usa", how_many = 3)

    usa_extracts
    #> [[1]]
    #> Submitted IPUMS USA extract number 348
    #> Description: USA extract for API vignette
    #>
    #> Samples: (2 total) us2018a, us2019a
    #> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER, STATEFIP...
    #>
    #> [[2]]
    #> Submitted IPUMS USA extract number 347
    #> Description: Data from long ago
    #>
    #> Samples: (1 total) us1880a
    #> Variables: (12 total) YEAR, SAMPLE, SERIAL, HHWT, CLUSTER, STRATA, GQ, PERNUM...
    #>
    #> [[3]]
    #> Submitted IPUMS USA extract number 346
    #> Description: Data from 2017 PRCS
    #>
    #> Samples: (1 total) us2017b
    #> Variables: (9 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNUM, PERWT,...
Because this is a list of
ipums_extract objects, you can
operate on them with the API functions that have been introduced
above.
You can also iterate through your extract history to find extracts
with particular characteristics. For instance, we can use
purrr::keep() to find all extracts that contain a certain
variable or are ready for download:
    purrr::keep(usa_extracts, ~ "MARST" %in% names(.x$variables))
    #> [[1]]
    #> Submitted IPUMS USA extract number 348
    #> Description: USA extract for API vignette
    #>
    #> Samples: (2 total) us2018a, us2019a
    #> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER, STATEFIP...

    purrr::keep(usa_extracts, is_extract_ready)
    #> [[1]]
    #> Submitted IPUMS USA extract number 348
    #> Description: USA extract for API vignette
    #>
    #> Samples: (2 total) us2018a, us2019a
    #> Variables: (15 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, CLUSTER, STATEFIP...
    #>
    #> [[2]]
    #> Submitted IPUMS USA extract number 347
    #> Description: Data from long ago
    #>
    #> Samples: (1 total) us1880a
    #> Variables: (12 total) YEAR, SAMPLE, SERIAL, HHWT, CLUSTER, STRATA, GQ, PERNUM...
    #>
    #> [[3]]
    #> Submitted IPUMS USA extract number 346
    #> Description: Data from 2017 PRCS
    #>
    #> Samples: (1 total) us2017b
    #> Variables: (9 total) YEAR, SAMPLE, SERIAL, CBSERIAL, HHWT, GQ, PERNUM, PERWT,...
Or we can use the
purrr::map() family to browse certain
values across the extracts in the list.
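The sketch below illustrates this kind of browsing on a small list of locally defined extracts, which stands in for an extract history so that no API call is needed. The two definitions are made up for the example.

```r
library(ipumsr)
library(purrr)

# Stand-in for an extract history: two unsubmitted definitions
extracts <- list(
  define_extract_usa(
    description = "Marital status, 2019 ACS",
    samples = "us2019a",
    variables = "MARST"
  ),
  define_extract_usa(
    description = "Age and sex, 2018 ACS",
    samples = "us2018a",
    variables = c("AGE", "SEX")
  )
)

# Pull one value out of each extract
descriptions <- map_chr(extracts, ~ .x$description)
n_variables <- map_int(extracts, ~ length(.x$variables))

descriptions
n_variables
```

The same calls should work on the list returned by get_extract_history(), since it contains the same kind of object.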
If you regularly use only a single IPUMS collection, you can save
yourself some typing by setting that collection as your default.
set_ipums_default_collection() will save a specified
collection to the value of the IPUMS_DEFAULT_COLLECTION
environment variable. If you have a default collection set, API
functions will use that collection in all requests, assuming no other
collection is specified.
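A minimal sketch of setting the default for the current session; the persistent and unset variants are left commented out because they change state beyond the running session:

```r
library(ipumsr)

# Set IPUMS USA as the default collection for the current session
set_ipums_default_collection("usa")
Sys.getenv("IPUMS_DEFAULT_COLLECTION")

# To keep the default across sessions, save it to .Renviron:
# set_ipums_default_collection("usa", save = TRUE)

# To remove the default again:
# set_ipums_default_collection(unset = TRUE)
```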
    # Check the default collection:
    Sys.getenv("IPUMS_DEFAULT_COLLECTION")
    #> [1] "usa"

    # Most recent USA extract:
    usa_last <- get_last_extract_info()

    # Request info on extract request "usa:10"
    usa_ext_10 <- get_extract_info(10)

    # You can still request other collections as usual:
    cps_ext_10 <- get_extract_info("cps:10")
Occasionally, you may want to modify an existing extract definition
(e.g. to update an analysis with new data). The easiest way to do so is
to add the new specifications to the
code that produced the original extract definition. This is why we
highly recommend that you save this code somewhere where it can be
accessed and updated in the future.
However, there are cases where the original extract definition code
does not exist (e.g. if the extract was created using the online IPUMS
extract system). In this case, the best approach is to view the extract
definition with
get_extract_info() and create a new extract
definition (using a
define_extract_*() function) that
reproduces that definition along with the desired modifications. While
this may be a bit tedious for complex extract definitions, it is a
one-time investment that will make any future updates to the extract
definition much easier.
Previously, we encouraged users to use the helpers
when modifying extracts. We now encourage you to rewrite extract
definitions instead, because doing so improves reproducibility: extract
definition code will always be clearer and more stable if it is written
explicitly, rather than based only on an old extract number. These two
functions may be retired in the future.
The core API functions in ipumsr are compatible with one another such that they can be combined into a single pipeline that requests, downloads, and reads your extract data into an R data frame:
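Such a pipeline might be sketched as follows. get_usa_data() is a hypothetical wrapper, the native pipe assumes R 4.1 or later, and tempdir() is an arbitrary download location.

```r
library(ipumsr)

# Hypothetical wrapper around the full define-to-read pipeline
get_usa_data <- function(download_dir = tempdir()) {
  define_extract_usa(
    description = "USA extract for API vignette",
    samples = c("us2018a", "us2019a"),
    variables = c("AGE", "SEX", "RACE", "STATEFIP", "MARST")
  ) |>
    submit_extract() |>                              # send the request
    wait_for_extract() |>                            # block until processed
    download_extract(download_dir = download_dir) |> # path to the DDI codebook
    read_ipums_micro()                               # load the data
}

# Usage (requires a valid API key; processing may take several minutes):
# usa_data <- get_usa_data()
```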
Note that for NHGIS extracts that contain both data and shapefiles, a
single file will need to be selected before reading, as
download_extract() will return the path to each file. For
instance, for a hypothetical
nhgis_extract that contains
both tabular and spatial data:
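A sketch of that selection step is shown below. It assumes that the vector returned by download_extract() has elements named "data" and "shape"; check names() on the returned paths if your version behaves differently. read_nhgis_extract() is a hypothetical helper, and reading the spatial file additionally requires the sf package.

```r
library(ipumsr)

# Hypothetical helper; assumes the returned paths are named "data" and "shape"
read_nhgis_extract <- function(nhgis_extract, download_dir = tempdir()) {
  paths <- download_extract(nhgis_extract, download_dir = download_dir)

  list(
    # Tabular data
    data = read_nhgis(paths[["data"]]),
    # Spatial data (requires the sf package)
    shape = read_ipums_sf(paths[["shape"]])
  )
}

# Usage (requires a valid API key and a completed NHGIS extract):
# nhgis <- read_nhgis_extract(nhgis_extract)
```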
Not only does this API workflow allow you to obtain IPUMS data without ever leaving your R environment, but it also allows you to retain a reproducible record of your process. This makes it much easier to document your workflow, collaborate with other researchers, and update your analysis in the future.