{xmpdf} provides functions for getting and setting Extensible Metadata Platform (XMP) metadata in a variety of media file formats as well as getting and setting PDF documentation info entries and bookmarks (aka outline aka table of contents).
remotes::install_github("trevorld/r-xmpdf")
Depending on what you’d like to do you’ll need to install some additional R packages and/or command-line tools:
{qpdf} can be used to concatenate pdf files together as well as get the number of pages in a pdf. Note: currently a dependency of {pdftools}.
install.packages("qpdf")

{pdftools} can be used to get documentation info entries in pdf files. Note: currently depends on {qpdf}.
install.packages("pdftools") (will probably install {qpdf} as well)

exiftool can be used to get/set XMP metadata in a variety of media files as well as documentation info entries in pdf files. It can also be used to get the number of pages in a pdf. Note: can be installed by {exiftoolr}.
install.packages("exiftoolr"); exiftoolr::install_exiftool() (Cross-Platform)
sudo apt-get install libimage-exiftool-perl (Debian/Ubuntu)
brew install exiftool (Homebrew)
choco install exiftool (Chocolatey)

ghostscript can be used to set bookmarks and documentation info entries in pdf files. It can also be used to concatenate pdf files together as well as get the number of pages in a pdf.
sudo apt-get install ghostscript (Debian/Ubuntu)
brew install ghostscript (Homebrew)
choco install ghostscript (Chocolatey)

pdftk-java or perhaps pdftk can be used to get/set bookmarks and documentation info entries in pdf files. It can also be used to concatenate pdf files together as well as get the number of pages in a pdf.
sudo apt-get install pdftk-java (Debian/Ubuntu)
brew install pdftk-java (Homebrew)
choco install pdftk-java (Chocolatey)

A simple example where we create a two page pdf using pdf() and then add XMP metadata, PDF documentation info metadata, and PDF bookmarks to it:
library("xmpdf")
# Create a two page pdf using `pdf()`
f <- tempfile(fileext = ".pdf")
pdf(f, onefile = TRUE)
grid::grid.text("Page 1")
grid::grid.newpage()
grid::grid.text("Page 2")
invisible(dev.off())
# See what default metadata `pdf()` created
get_docinfo(f)[[1]] |> print()
## Author: NULL
## CreationDate: 2023-02-08T14:42:02
## Creator: R
## Producer: R 4.2.2
## Title: R Graphics Output
## Subject: NULL
## Keywords: NULL
## ModDate: 2023-02-08T14:42:02
get_xmp(f)[[1]] |> print()
## No XMP metadata found
get_bookmarks(f)[[1]] |> print()
## [1] title page level count open color fontface
## <0 rows> (or 0-length row.names)
# Edit PDF documentation info
d <- get_docinfo(f)[[1]] |>
    update(author = "John Doe",
           subject = "A minimal document to demonstrate {xmpdf} features on",
           title = "Two Boring Pages",
           keywords = c("R", "xmpdf"))
set_docinfo(d, f)
get_docinfo(f)[[1]] |> print()
## Author: John Doe
## CreationDate: 2023-02-08T14:42:02
## Creator: R
## Producer: GPL Ghostscript 9.55.0
## Title: Two Boring Pages
## Subject: A minimal document to demonstrate {xmpdf} features on
## Keywords: R, xmpdf
## ModDate: 2023-02-08T14:42:02
# Edit XMP metadata
x <- as_xmp(d) |>
    update(attribution_url = "https://example.com/attribution",
           date_created = Sys.Date(),
           spdx_id = "CC-BY-4.0")
set_xmp(x, f)
get_xmp(f)[[1]] |> print()
## cc:attributionName := John Doe
## cc:attributionURL := https://example.com/attribution
## cc:license := https://creativecommons.org/licenses/by/4.0/
## dc:creator := John Doe
## dc:description := A minimal document to demonstrate {xmpdf} features on
## dc:format := application/pdf
## dc:rights := © 2023 John Doe. Some rights reserved.
## dc:subject := R, xmpdf
## dc:title := Two Boring Pages
## pdf:Keywords := R, xmpdf
## pdf:Producer := R 4.2.2
## photoshop:Credit := John Doe
## photoshop:DateCreated := 2023-02-08
## x:XMPToolkit := Image::ExifTool 12.40
## xmp:CreateDate := 2023-02-08T14:42:02
## xmp:CreatorTool := R
## xmp:ModifyDate := 2023-02-08T14:42:02
## xmpMM:DocumentID := uuid:5b406831-e01e-11f8-0000-2567e21c8552
## xmpRights:Marked := TRUE
## xmpRights:UsageTerms := This work is licensed to the public under the Creative Commons
## Attribution 4.0 International license
## https://creativecommons.org/licenses/by/4.0/
## xmpRights:WebStatement := https://creativecommons.org/licenses/by/4.0/
# Edit PDF bookmarks
bm <- data.frame(title = c("Page 1", "Page 2"), page = c(1, 2))
set_bookmarks(bm, f)
get_bookmarks(f)[[1]] |> print()
## title page level count open color fontface
## 1 Page 1 1 1 NA NA <NA> <NA>
## 2 Page 2 2 1 NA NA <NA> <NA>
Besides pdf files, with exiftool we can also edit the XMP metadata for a large number of image formats including “gif”, “png”, “jpeg”, “tiff”, and “webp”. In particular we may be interested in setting the subset of IPTC Photo XMP metadata displayed by Google Images as well as embedding Creative Commons license XMP metadata.
library("xmpdf")
f <- tempfile(fileext = ".png")
png(f)
grid::grid.text("This is an image!")
dev.off() |> invisible()
get_xmp(f)[[1]] |> print()
## No XMP metadata found
x <- xmp(attribution_url = "https://example.com/attribution",
         creator = "John Doe",
         description = "An image caption",
         date_created = Sys.Date(),
         spdx_id = "CC-BY-4.0")
print(x, mode = "google_images", xmp_only = TRUE)
## dc:creator := John Doe
## => dc:rights = © 2023 John Doe. Some rights reserved.
## => photoshop:Credit = John Doe
## X plus:Licensor (not currently supported by {xmpdf})
## => xmpRights:WebStatement = https://creativecommons.org/licenses/by/4.0/
print(x, mode = "creative_commons", xmp_only = TRUE)
## => cc:attributionName = John Doe
## cc:attributionURL := https://example.com/attribution
## => cc:license = https://creativecommons.org/licenses/by/4.0/
## cc:morePermissions := NULL
## => dc:rights = © 2023 John Doe. Some rights reserved.
## => xmpRights:Marked = TRUE
## => xmpRights:UsageTerms = This work is licensed to the public under the Creative Commons
## Attribution 4.0 International license
## https://creativecommons.org/licenses/by/4.0/
## => xmpRights:WebStatement = https://creativecommons.org/licenses/by/4.0/
set_xmp(x, f)
get_xmp(f)[[1]] |> print()
## cc:attributionName := John Doe
## cc:attributionURL := https://example.com/attribution
## cc:license := https://creativecommons.org/licenses/by/4.0/
## dc:creator := John Doe
## dc:description := An image caption
## dc:rights := © 2023 John Doe. Some rights reserved.
## photoshop:Credit := John Doe
## photoshop:DateCreated := 2023-02-08
## x:XMPToolkit := Image::ExifTool 12.40
## xmpRights:Marked := TRUE
## xmpRights:UsageTerms := This work is licensed to the public under the Creative Commons
## Attribution 4.0 International license
## https://creativecommons.org/licenses/by/4.0/
## xmpRights:WebStatement := https://creativecommons.org/licenses/by/4.0/
# Create two multi-page pdfs and add bookmarks to them
f_a <- tempfile(fileext = ".pdf")
pdf(f_a, title = "Document A", onefile = TRUE)
grid::grid.text("Document A: First Page")
grid::grid.newpage()
grid::grid.text("Document A: Second Page")
dev.off() |> invisible()
f_b <- tempfile(fileext = ".pdf")
pdf(f_b, title = "Document B", onefile = TRUE)
grid::grid.text("Document B: First Page")
grid::grid.newpage()
grid::grid.text("Document B: Second Page")
dev.off() |> invisible()
bm <- data.frame(title = c("First Page", "Second Page"), page = c(1, 2))
set_bookmarks(bm, f_a)
set_bookmarks(bm, f_b)
# Concatenate pdfs to a single pdf and add their concatenated bookmarks to it
files <- c(f_a, f_b)
f_cat <- tempfile(fileext = ".pdf")
cat_pages(files, f_cat)
cat_bookmarks(get_bookmarks(files), method = "title") |>
    set_bookmarks(f_cat)
print(get_bookmarks(f_cat)[[1]])
## title page level count open color fontface
## 1 Document A 1 1 NA NA <NA> <NA>
## 2 First Page 1 2 NA NA <NA> <NA>
## 3 Second Page 2 2 NA NA <NA> <NA>
## 4 Document B 3 1 NA NA <NA> <NA>
## 5 First Page 3 2 NA NA <NA> <NA>
## 6 Second Page 4 2 NA NA <NA> <NA>
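To make the page and level bookkeeping in the output above easier to follow, here is a minimal base R sketch of how per-document bookmarks could be merged for a concatenated pdf. `combine_bookmarks()` and its arguments are hypothetical names invented for this illustration; this is not how {xmpdf} implements `cat_bookmarks()`, only an approximation of the result shown above under the assumption that each document contributes a title bookmark at level 1 and its own bookmarks one level deeper.

```r
# Hypothetical helper (not part of {xmpdf}): merge per-document bookmark
# tables into one table for the concatenated pdf.
combine_bookmarks <- function(bookmarks, titles, n_pages) {
  # Page offset of each document = number of pages in the documents before it
  offsets <- c(0, cumsum(n_pages))[seq_along(n_pages)]
  pieces <- Map(function(bm, title, offset) {
    # One top-level bookmark per document, pointing at its first page
    top <- data.frame(title = title, page = offset + 1, level = 1)
    # Shift the document's own bookmarks into the combined page numbering
    bm$page <- bm$page + offset
    # Nest them one level below the document-title bookmark
    bm$level <- if ("level" %in% names(bm)) bm$level + 1 else 2
    rbind(top, bm[, c("title", "page", "level")])
  }, bookmarks, titles, offsets)
  combined <- do.call(rbind, pieces)
  rownames(combined) <- NULL
  combined
}

bm <- data.frame(title = c("First Page", "Second Page"), page = c(1, 2))
combined <- combine_bookmarks(list(bm, bm),
                              titles = c("Document A", "Document B"),
                              n_pages = c(2, 2))
print(combined)
```

The page numbers of the second document are shifted by the page count of the first, which is why "Document B" starts on page 3.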
{xmpdf} feature | exiftool | pdftk | ghostscript |
---|---|---|---|
Get XMP metadata | Yes | No | No |
Set XMP metadata | Yes | No | Poor: when documentation info metadata is set then as a side effect it seems the documentation info metadata will also be set as XMP metadata |
Get PDF bookmarks | No | Okay: can only get Title, Page number, and Level | No |
Set PDF bookmarks | No | Okay: can only set Title, Page number, and Level | Good: supports most bookmarks features including color and font face but only action supported is to view a particular page |
Get PDF documentation info | Good: may “widen” datetimes which are less than “second” precision | Yes | No |
Set PDF documentation info | Yes | Good: may not handle entries with newlines in them | Yes: as a side effect when documentation info metadata is set then it seems will also be set as XMP metadata |
Concatenate PDF files | No | Yes | Yes |
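Since the features in the table above depend on which tools are installed, it may help to check what is available before choosing a method. The sketch below uses only base R and {tools}; the helper names `has_pkg()` and `has_cmd()` are made up for this illustration. It only looks for command-line tools on the `PATH`, so it is an assumption that this matches how a given installation is found (for example, an exiftool installed via {exiftoolr} may live elsewhere).

```r
# Hypothetical helpers for this illustration (not part of {xmpdf})
has_pkg <- function(pkg) requireNamespace(pkg, quietly = TRUE)
has_cmd <- function(cmd) nzchar(unname(Sys.which(cmd)))

# TRUE/FALSE for each optional R package or command-line tool
available <- c(
  qpdf        = has_pkg("qpdf"),
  pdftools    = has_pkg("pdftools"),
  exiftool    = has_cmd("exiftool"),
  pdftk       = has_cmd("pdftk"),
  # find_gs_cmd() returns an empty string when no executable is found
  ghostscript = nzchar(tools::find_gs_cmd())
)
print(available)
```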
Known limitations:

get_bookmarks_pdftk() doesn’t report information about bookmarks color, font face, and whether the bookmarks should start open or closed.
With get_docinfo_exiftool() an hour-only UTC offset will be “widened” to minute precision.
get_docinfo_pdftools()’s datetimes may not accurately reflect the embedded datetimes.
set_bookmarks_gs() supports most bookmarks features including color and font face but the only action supported is to view a particular page.
set_bookmarks_pdftk() only supports setting the title, page number, and level of bookmarks.
set_docinfo_pdftk() may not handle entries with newlines in them.
The set_docinfo() methods currently do not support arbitrary info dictionary entries.
set_docinfo_gs() seems to also update any matching XMP metadata while set_docinfo_exiftool() and set_docinfo_pdftk() don’t update any previously set matching XMP metadata. Some pdf viewers will preferentially use the previously set document title from XMP metadata if it exists instead of using the title set in the documentation info dictionary entry. Consider also manually setting this XMP metadata using set_xmp().
Consider running qpdf::pdf_compress(input, linearize = TRUE) at the end if linearization matters.

Note most of the R packages listed below are focused on getting metadata rather than setting metadata and/or only provide low-level wrappers around the relevant command-line tools. Please feel free to open a pull request to add any missing relevant R packages.
exiftool command-line tool. Can download exiftool.
exiftool command-line tool. Can download exiftool.
{tools} has find_gs_cmd() to find a GhostScript executable in a cross-platform way.
pdftk(), a low-level wrapper around the pdftk command-line tool.
pdfinfo tool