If you have not already done so, I would read through the
*pibble* vignette before this one.

fido can be used for non-linear modeling. Here we are going to go
through one such model which is based on multivariate Gaussian
processes. As Gaussian processes are a type of lazy learner, I have
named this model after a lazy dog: a Basset Hound. Hence this model is
called *basset*. The *basset* model can be written as
\[
\begin{align}
Y_j & \sim \text{Multinomial}\left(\pi_j \right) \\
\pi_j & = \phi^{-1}(\eta_j) \\
\eta &\sim N(\Lambda[X], \Sigma, I_N) \\
\Lambda[X] &\sim \textsf{GP}(\Theta[X], \Sigma, \Gamma[X]) \\
\Sigma &\sim W^{-1}(\Xi, \upsilon)
\end{align}
\] where we now label quantities interpreted as function with
square brackets. In particular, we use the notation that \(\Lambda[X]\) is a function that maps am
\(Q \times N\) matrix of covariates
\(X\) to a matrix \(\eta\). \(\Lambda\) can be interpreted as a smooth
function that relates covariates \(X\)
to composition \(\eta\) in a
potentially non-linear manner. \(\Theta[X]\) is the mean function evaluated
at \(X\), \(\Gamma[X]\) is a covariance matrix (or gram
matrix) formed by evaluating a positive semi-definite kernel function
\(K\) at each set of points \(i\) and \(j\), *i.e.*, \(\Gamma_{ij} = K(X_i, X_j)\). Many different
possible kernel functions makes this a very flexible and powerful
model.

To demonstrate *basset* I am going to reanalyze the mallard^{1} dataset
from Silverman et al. (2018) which is
provided as a phyloseq object in *fido*. This dataset features 4
artificial gut vessels sampled both daily and hourly for 1 month with
many technical replicates.

To accord with prior analyses (and to make it easy to visualize
results) I am going to just analyze those bacterial families that are
frequently observed. Note: the full dataset (at the sequence variant
level without preprocessing) is available if you run
`data(mallard)`

. For simplicity we will just look at samples
from vessel 1^{2}.

```
library(fido)
library(dplyr)
library(tidyr)
library(ggplot2)
data(mallard_family)
# Just take vessel 1
<- mallard_family$sample_data[mallard_family$sample_data$Vessel == 1,]
sample.ids # Just take hourly samples
<- sample.ids[(sample.ids$time > "2015-11-20 15:00:00 UTC") & (sample.ids$time < "2015-11-25 16:00:00 UTC"),]
sample.ids
# Subsetting the sample data and OTU data
<- mallard_family$sample_data[mallard_family$sample_data$X.SampleID %in% sample.ids$X.SampleID,]
subset.sample_data
<- mallard_family$otu_table[rownames(mallard_family$otu_table) %in% sample.ids$X.SampleID,]
subset.otu_table
# Order samples - to make plotting easy later
<- order(subset.sample_data$time)
o <- subset.otu_table[o,]
subset.otu_table <- subset.sample_data[o,]
subset.sample_data
# Extract Data / dimensions from Phyloseq object
<- t(as(subset.otu_table, "matrix"))
Y <- nrow(Y)
D <- nrow(subset.sample_data)
N
# X in hours
<- as.numeric(subset.sample_data$time)
X <- t((X-min(X)) / 3600) X
```

The function *basset* is a wrapper around some *fido*
internals that make fitting the above model easy. *basset* is
patterned after the function *pibble* but requires that
`Theta`

and `Gamma`

be given as matrix functions
rather than as matrices. That is `Theta`

must be a function
that given `X`

outputs a \(D-1
\times N\) matrix (just like you would give to *pibble*)
and `Gamma`

must be a function that given `X`

outputs a \(N \times N\) covariance
matrix (e.g., covariance over the samples). There are a few kernel
functions that meet the requirements for `Gamma`

in
*fido* already. Here we will use a Squared Exponential (SE)
kernel for `Gamma`

and will set `Theta`

to be the
zero function. The SE kernel will give us non-linear smoothing of the
observed time-series.

*Important Note:* Currently, `Theta`

must output a
matrix represented in the default coordinate system (\(ALR_D\)). This may be generalized in the
future.

Here we will just specify the Kernel parameters manually, more generally these parameters can be chosen by cross-validation.

```
# Specify Priors
<- function(X) SE(X, sigma=5, rho=10) # Create partial function
Gamma <- function(X) matrix(0, D-1, ncol(X))
Theta <- D-1+3
upsilon <- matrix(.4, D-1, D-1)
Xi diag(Xi) <- 1
# Now fit the model
<- fido::basset(Y, X, upsilon, Theta, Gamma, Xi) fit
```

It turns out that *basset* is really just creating a
*pibblefit* object that is a little special because it represents
a posterior over non-linear functions. The benefit of this is that many
of the methods available for *pibblefit* objects work for
*bassetfit* objects. For example, we can use the same
transformation methods:

```
<- to_clr(fit)
fit.clr
# Plot Sigma in CLR
plot(fit.clr, par="Sigma", focus.coord=c("clr_seq_6", "clr_seq_5", "clr_seq_2"))
```

Really *basset* shows its power when you use it to smooth.

```
# predict not just missing days but also forecast into future
<- t(1:(max(X)))
X_predict <- predict(fit.clr, X_predict, jitter=1) predicted
```

Now I am going to create a visual that shows the observed data in CLR coordinates (to do that I will need to add a pseudo-count) along with the smoothed estimates.

```
<- as(mallard_family$tax_table$Family, "vector")
family_names <- clr_array(Y+0.65, parts = 1) %>%
Y_clr_tidy gather_array(mean, coord, sample) %>%
mutate(time = X[1,sample],
coord = paste0("CLR(", family_names[coord],")"))
<- gather_array(predicted, val, coord, sample, iter) %>%
predicted_tidy mutate(time = X_predict[1,sample]) %>%
filter(!is.na(val)) %>%
group_by(time, coord) %>%
summarise_posterior(val, na.rm=TRUE) %>%
ungroup() %>%
mutate(coord = paste0("CLR(", family_names[coord],")"))
ggplot(predicted_tidy, aes(x = time, y=mean)) +
geom_ribbon(aes(ymin=p2.5, ymax=p97.5), fill="darkgrey", alpha=0.5) +
geom_ribbon(aes(ymin=p25, ymax=p75), fill="darkgrey", alpha=0.9)+
geom_line(color="blue") +
geom_point(data = Y_clr_tidy, alpha=0.5) +
facet_wrap(~coord, scales="free_y") +
theme_minimal()+
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_text(angle=45))
```

Silverman, Justin D, Heather Durand, Rachael J Bloom, Sayan Mukherjee,
and Lawrence A David. 2018. “Dynamic Linear Models Guide Design
and Analysis of Microbiota Studies Within Artificial Human Guts.”
*bioRxiv*. https://doi.org/10.1101/306597.

I have a long history of coming up with ridiculous names for statistical models and software.↩︎

In the future I may make a vignette discussing how we can model all the vessels jointly using

*basset*. Note: it’s just a different kernel function that also relates the covariance between samples as a function of which vessel they were taken from.↩︎