This vignette shows how to use the latent projection predictive
feature selection from Catalina, Bürkner, and
Vehtari (2021) in **projpred**. We
recommend to read the main vignette first, as the latent-projection
vignette presented here will skip some of the details explained in the
main vignette.

The response families used in GLMs (McCullagh
and Nelder 1989, chap. 2) (and in GLMMs, GAMs, and GAMMs) may
be termed *exponential dispersion (ED)* families (Jørgensen 1987)^{1}. For a response family
that is not an ED family, the Kullback-Leibler (KL) divergence
minimization problem (see Piironen,
Paasiniemi, and Vehtari 2020) is often not easy to solve
analytically^{2}. In order to bypass this issue, the latent
projection (Catalina, Bürkner, and Vehtari 2021)
solves the KL minimization problem in the predictive space of the latent
predictors^{3} instead of in the predictive space of the
original response values.

To this end, the latent predictor is assumed to have a Gaussian distribution, since it (i) constitutes a combination of predictor data and regression parameters which is often linear (in the parameters, but—less—often also in the predictor data) or at least additive (across the predictor terms) and (ii) has infinite support. Furthermore, the Gaussian distribution has the highest differential entropy over all distributions with two finite moments and infinite support (see, e.g., Cover and Thomas 1991). In some cases, e.g., for the probit link, the Gaussian distribution is even part of the original statistical model. In case of the logit link, the Gaussian distribution with a standard deviation of 1.6 approximates the logistic distribution (with a scale parameter of 1).

The assumption of a Gaussian distribution for the latent predictors
makes things a lot easier because it allows us to make use of
**projpred**’s traditional projection.

As illustrated by the Poisson example below, the latent projection
can not only be used for families not supported by
**projpred**’s traditional projection, but it can also be
beneficial for families supported by it.

To use the latent projection in **projpred**, the new
argument `latent`

of `extend_family()`

needs to be
set to `TRUE`

. Since `extend_family()`

is called
by `init_refmodel()`

which in turn is called by
`get_refmodel()`

(more precisely, by the
`get_refmodel()`

methods) which in turn is called at the
beginning of the top-level functions `project()`

,
`varsel()`

, and `cv_varsel()`

, it is possible to
pass `latent = TRUE`

from such a top-level function down to
`extend_family()`

via the ellipsis (`...`

).
However, for the latent projection, we recommend to define the reference
model object of class `refmodel`

explicitly (as illustrated
in the examples below) to avoid repetitions^{4}.

After performing the projection (either as a stand-alone feature via
`project()`

or embedded in a variable selection via
`varsel()`

or `cv_varsel()`

), the post-processing
(e.g., the calculation of the performance statistics in
`summary.vsel()`

) can be performed on the original response
scale. For this, `extend_family()`

has gained several new
arguments accepting R functions responsible for the inverse-link
transformation from latent scale to response scale
(`latent_ilink`

), for the calculation of log-likelihood
values on response scale (`latent_ll_oscale`

), and for
drawing from the predictive distribution on response scale
(`latent_ppd_oscale`

). For some families, these arguments
have internal defaults implemented natively in
**projpred**. These families are listed in the main
vignette (section “Supported
types of models”). For all other families, **projpred**
either tries to infer a reasonable function internally (in case of
`latent_ilink`

) or uses a dummy function returning only
`NA`

s (in case of `latent_ll_oscale`

and
`latent_ppd_oscale`

), unless the user supplies custom
functions to these arguments. When creating a reference model object for
a family of the latter category (i.e., lacking full response-scale
support by default), **projpred** will throw messages
stating whether (and which) features will be unavailable unless at least
some of these arguments are provided by the user. Again, the ellipsis
(`...`

) can be used to pass these arguments from a top-level
function such as `cv_varsel()`

down to
`extend_family()`

. In the post-processing functions,
response-scale analyses can usually be deactivated by setting the new
argument `resp_oscale`

to `FALSE`

, with the
exception of `predict.refmodel()`

and
`proj_linpred()`

where the existing arguments
`type`

and `transform`

serve this purpose (see the
documentation).

Apart from the arguments mentioned above,
`extend_family()`

has also gained a new argument
`latent_y_unqs`

whose purpose is described in the
documentation.

While the latent projection is an approximate solution to the KL
divergence minimization problem in the original response space^{5}, the
augmented-data projection (Weber and Vehtari
2023) gives the exact^{6} solution for some non-ED families, namely
those where the response distribution has finite support. However, the
augmented-data projection comes with a higher runtime than the latent
projection. The families supported by **projpred**’s
augmented-data projection are also listed in the main vignette (again
section “Supported
types of models”).

In this example, we will illustrate that in case of a family
supported by **projpred**’s traditional projection (here
the Poisson distribution), the latent projection can improve runtime and
results of the variable selection compared to
**projpred**’s traditional projection, at least if the L1
search is used (see argument `method`

of
`varsel()`

and `cv_varsel()`

).

First, we generate a training and a test dataset with a Poisson-distributed response:

```
# Number of observations in the training dataset (= number of observations in
# the test dataset):
<- 71
N <- function(nobs = 2 * N, ncon = 10, ngrpPL = 4, nnoise = 39) {
sim_poiss # Regression coefficients for continuous predictors:
<- rnorm(ncon)
coefs_con # Continuous predictors:
<- matrix(rnorm(nobs * ncon), ncol = ncon)
dat_sim # Start linear predictor:
<- 2.1 + dat_sim %*% coefs_con
linpred
# Population-level (PL) categorical predictor:
<- data.frame(
dat_sim x = dat_sim,
grpPL = gl(n = ngrpPL, k = nobs %/% ngrpPL, length = nobs,
labels = paste0("grpPL", seq_len(ngrpPL)))
)# Regression coefficients for the PL categorical predictor:
<- rnorm(ngrpPL)
coefs_catPL # Continue linear predictor:
<- linpred + coefs_catPL[dat_sim$grpPL]
linpred
# Noise predictors:
<- data.frame(
dat_sim
dat_sim,xn = matrix(rnorm(nobs * nnoise), ncol = nnoise)
)
# Poisson response, using the log link (i.e., exp() as inverse link):
$y <- rpois(nobs, lambda = exp(linpred))
dat_sim# Shuffle order of observations:
<- dat_sim[sample.int(nobs), , drop = FALSE]
dat_sim # Drop the shuffled original row names:
rownames(dat_sim) <- NULL
return(dat_sim)
}set.seed(300417)
<- sim_poiss()
dat_poiss <- dat_poiss[1:N, , drop = FALSE]
dat_poiss_train <- dat_poiss[(N + 1):nrow(dat_poiss), , drop = FALSE] dat_poiss_test
```

Next, we fit the reference model that we consider as the best model in terms of predictive performance that we can construct (here, we assume that we don’t know about the true data-generating process even though the dataset was simulated):

```
library(rstanarm)
# Number of regression coefficients:
<- sum(grepl("^x|^grpPL", names(dat_poiss_train))) ) ( D
```

`[1] 50`

```
# Prior guess for the number of relevant (i.e., non-zero) regression
# coefficients:
<- 10
p0 # Prior guess for the overall magnitude of the response values, see Table 1 of
# Piironen and Vehtari (2017, DOI: 10.1214/17-EJS1337SI):
<- 100
mu_prior # Hyperprior scale for tau, the global shrinkage parameter:
<- p0 / (D - p0) / sqrt(mu_prior) / sqrt(N)
tau0 # Set this manually if desired:
<- parallel::detectCores(logical = FALSE)
ncores ### Only for technical reasons in this vignette (you can omit this when running
### the code yourself):
<- min(ncores, 2L)
ncores ###
options(mc.cores = ncores)
<- as.formula(paste("y", "~", paste(
refm_fml grep("^x|^grpPL", names(dat_poiss_train), value = TRUE),
collapse = " + "
)))<- stan_glm(
refm_fit_poiss formula = refm_fml,
family = poisson(),
data = dat_poiss_train,
prior = hs(global_scale = tau0, slab_df = 100, slab_scale = 1),
### Only for the sake of speed (not recommended in general):
chains = 2, iter = 500,
###
seed = 7286013, QR = TRUE, refresh = 0
)
```

We now run a variable selection with `latent = TRUE`

in
the `get_refmodel()`

call (see section “Implementation” for why `get_refmodel()`

is
called explicitly here) to request the latent projection. Since we have
a hold-out test dataset available, we can use `varsel()`

with
argument `d_test`

instead of `cv_varsel()`

.
Furthermore, we measure the runtime to compare it to the traditional
projection’s later:

`library(projpred)`

```
<- list(
d_test_lat_poiss data = dat_poiss_test,
offset = rep(0, nrow(dat_poiss_test)),
weights = rep(1, nrow(dat_poiss_test)),
### Here, we are not interested in latent-scale post-processing, so we can set
### element `y` to a vector of `NA`s:
y = rep(NA, nrow(dat_poiss_test)),
###
y_oscale = dat_poiss_test$y
)<- get_refmodel(refm_fit_poiss, latent = TRUE) refm_poiss
```

`Since `<refmodel>$dis` will consist of only `NA`s, downstream analyses based on this reference model won't be able to use log predictive density (LPD) values on latent scale. Furthermore, proj_predict() won't be able to draw from the latent Gaussian distribution.`

```
<- system.time(vs_lat <- varsel(
time_lat
refm_poiss,d_test = d_test_lat_poiss,
### Only for the sake of speed (not recommended in general):
nclusters_pred = 20,
###
nterms_max = 14,
seed = 95930
))
```

`print(time_lat)`

```
user system elapsed
1.787 0.004 1.791
```

The message telling that `<refmodel>$dis`

consists
of only `NA`

s will not concern us here because we will only
be focusing on response-scale post-processing.

In order to decide for a submodel size, we first inspect the
`plot()`

results:

`<- plot(vs_lat, stats = "mlpd", deltas = TRUE) ) ( gg_lat `

Although the submodels’ MLPDs seem to be very close to the reference
model’s MLPD from a submodel size of 6 on, a zoomed plot reveals that
there is still some discrepancy at sizes 6 to 10 and that size 11 would
be a better choice (further down below in the `summary()`

output, we will also see that on absolute scale, the discrepancy at
sizes 6 to 10 is not negligible):

`+ ggplot2::coord_cartesian(ylim = c(-10, 0.05)) gg_lat `

Thus, we decide for a submodel size of 11:

`<- 11 modsize_decided_lat `

This is also the size that `suggest_size()`

would
suggest:

`suggest_size(vs_lat, stat = "mlpd")`

`[1] 11`

Only now, after we have made a decision for the submodel size, we inspect further results from the variable selection and, in particular, the solution path:

```
<- summary(
smmry_lat
vs_lat,stats = "mlpd",
type = c("mean", "se", "lower", "upper", "diff", "diff.se")
)print(smmry_lat, digits = 3)
```

```
------
Response-scale family:
Family: poisson
Link function: log
------
Latent-scale family:
Family: gaussian
Link function: identity
------
Formula: .y ~ x.1 + x.2 + x.3 + x.4 + x.5 + x.6 + x.7 + x.8 + x.9 + x.10 +
grpPL + xn.1 + xn.2 + xn.3 + xn.4 + xn.5 + xn.6 + xn.7 +
xn.8 + xn.9 + xn.10 + xn.11 + xn.12 + xn.13 + xn.14 + xn.15 +
xn.16 + xn.17 + xn.18 + xn.19 + xn.20 + xn.21 + xn.22 + xn.23 +
xn.24 + xn.25 + xn.26 + xn.27 + xn.28 + xn.29 + xn.30 + xn.31 +
xn.32 + xn.33 + xn.34 + xn.35 + xn.36 + xn.37 + xn.38 + xn.39
Observations (training set): 71
Observations (test set): 71
Projection method: latent
Search method: L1
Maximum submodel size for the search: 14
Number of projected draws in the search: 1 (from clustered projection)
Number of projected draws in the performance evaluation: 20 (from clustered projection)
Performance evaluation summary (response scale) with `deltas = FALSE`:
size solution_terms mlpd se lower upper diff diff.se
0 <NA> -6281.54 2793.417 -9074.95 -3488.12 -6278.324 2.79e+03
1 x.4 -4190.86 1866.360 -6057.22 -2324.50 -4187.650 1.87e+03
2 x.6 -1431.70 553.630 -1985.33 -878.07 -1428.491 5.53e+02
3 x.10 -471.56 203.944 -675.50 -267.61 -468.345 2.04e+02
4 x.3 -447.33 250.031 -697.36 -197.30 -444.118 2.50e+02
5 x.1 -166.19 84.658 -250.85 -81.53 -162.976 8.45e+01
6 x.2 -65.27 31.185 -96.45 -34.08 -62.058 3.10e+01
7 x.8 -112.69 58.348 -171.04 -54.35 -109.483 5.82e+01
8 x.7 -114.85 62.044 -176.89 -52.81 -111.639 6.19e+01
9 x.5 -125.54 75.647 -201.19 -49.89 -122.330 7.55e+01
10 grpPL -13.04 6.509 -19.55 -6.54 -9.834 6.37e+00
11 x.9 -2.89 0.278 -3.17 -2.61 0.318 6.41e-02
12 xn.6 -2.89 0.279 -3.17 -2.61 0.319 6.66e-02
13 xn.29 -2.90 0.283 -3.18 -2.62 0.310 6.70e-02
14 xn.14 -2.98 0.304 -3.28 -2.67 0.236 7.80e-02
```

On absolute scale (column `mlpd`

), we see that submodel
size 10 leads to an MLPD of `-13.045`

, i.e., a geometric mean
predictive density (GMPD; due to the discrete response family, the
“density” values are probabilities here, so we will report the GMPD as
percentage) of `exp(-13.045)`

which is ca. 2e-04 % whereas
size 11 leads to a GMPD of ca. 5.54 %. This is a considerable
improvement from size 10 to size 11, so another justification for size
11. (Size 12 would have resulted in a GMPD of ca. 5.55 % which is only a
small improvement compared to size 11.)

In the solution path up to the selected size of 11, we can see that
in this case, **projpred** has correctly selected the truly
relevant predictors first and only then the noise predictors. We can see
this more clearly using the following code:

```
<- solution_terms(vs_lat)
soltrms_lat <- head(soltrms_lat, modsize_decided_lat) ) ( soltrms_lat_final
```

```
[1] "x.4" "x.6" "x.10" "x.3" "x.1" "x.2" "x.8" "x.7" "x.5"
[10] "grpPL" "x.9"
```

We will skip post-selection inference here (see the main vignette for
a demonstration of post-selection inference), but note that
`proj_predict()`

has gained a new argument
`resp_oscale`

and that analogous response-scale functionality
is available in `proj_linpred()`

(argument
`transform`

) and `predict.refmodel()`

(argument
`type`

).

We will now look at what **projpred**’s traditional
projection would have given:

```
<- d_test_lat_poiss
d_test_trad_poiss $y <- d_test_trad_poiss$y_oscale
d_test_trad_poiss$y_oscale <- NULL
d_test_trad_poiss<- system.time(vs_trad <- varsel(
time_trad
refm_fit_poiss,d_test = d_test_trad_poiss,
### Only for the sake of speed (not recommended in general):
nclusters_pred = 20,
###
nterms_max = 14,
seed = 95930
))
```

`print(time_trad)`

```
user system elapsed
2.701 0.004 2.705
```

`<- plot(vs_trad, stats = "mlpd", deltas = TRUE) ) ( gg_trad `

```
<- summary(
smmry_trad
vs_trad,stats = "mlpd",
type = c("mean", "se", "lower", "upper", "diff", "diff.se")
)print(smmry_trad, digits = 3)
```

```
Family: poisson
Link function: log
Formula: y ~ x.1 + x.2 + x.3 + x.4 + x.5 + x.6 + x.7 + x.8 + x.9 + x.10 +
grpPL + xn.1 + xn.2 + xn.3 + xn.4 + xn.5 + xn.6 + xn.7 +
xn.8 + xn.9 + xn.10 + xn.11 + xn.12 + xn.13 + xn.14 + xn.15 +
xn.16 + xn.17 + xn.18 + xn.19 + xn.20 + xn.21 + xn.22 + xn.23 +
xn.24 + xn.25 + xn.26 + xn.27 + xn.28 + xn.29 + xn.30 + xn.31 +
xn.32 + xn.33 + xn.34 + xn.35 + xn.36 + xn.37 + xn.38 + xn.39
Observations (training set): 71
Observations (test set): 71
Projection method: traditional
Search method: L1
Maximum submodel size for the search: 14
Number of projected draws in the search: 1 (from clustered projection)
Number of projected draws in the performance evaluation: 20 (from clustered projection)
Performance evaluation summary with `deltas = FALSE`:
size solution_terms mlpd se lower upper diff diff.se
0 <NA> -3469.3 1593.1 -5062 -1876.2 -3466.1 1592.9
1 x.6 -2545.4 1068.0 -3613 -1477.4 -2542.2 1067.8
2 xn.12 -3560.3 1666.6 -5227 -1893.7 -3557.1 1666.4
3 grpPL -3774.1 1778.0 -5552 -1996.1 -3770.8 1777.8
4 x.4 -2438.1 1131.2 -3569 -1306.8 -2434.9 1131.1
5 x.10 -959.8 488.4 -1448 -471.4 -956.6 488.3
6 x.3 -821.7 518.8 -1340 -302.8 -818.4 518.7
7 xn.11 -686.5 439.2 -1126 -247.3 -683.3 439.0
8 xn.2 -1330.4 779.8 -2110 -550.6 -1327.2 779.7
9 xn.22 -1220.0 722.4 -1942 -497.6 -1216.8 722.3
10 x.2 -744.2 395.4 -1140 -348.8 -741.0 395.3
11 x.1 -107.8 37.0 -145 -70.8 -104.6 36.8
12 xn.23 -83.1 33.0 -116 -50.1 -79.9 32.8
13 xn.16 -120.7 67.8 -189 -52.9 -117.5 67.7
14 x.8 -93.0 45.2 -138 -47.8 -89.7 45.1
```

As these results show, the traditional projection takes longer than the latent projection, although the difference is rather small on absolute scale (which is due to the fact that the L1 search is already quite fast). More importantly however, the submodel size selection plot is much more unstable and the solution path reveals that several noise terms enter the solution path before truly relevant ones.

In conclusion, this example showed that the latent projection can be
advantageous also for families supported by **projpred**’s
traditional projection by improving the runtime and the stability of the
variable selection, eventually also leading to better variable selection
results.

One important point is that we have used the L1 search here. In case of the latent projection, a forward search would have given only slightly different results (in particular, a slightly smoother submodel size selection plot). However, in case of the traditional projection, a forward search would have given markedly better results (in particular, the submodel size selection plot would have been much smoother and all of the noise terms would have been selected after the truly relevant ones). Thus, the conclusions made for the L1 search here cannot be transmitted easily to the forward search.

In this example, we will illustrate the latent projection in case of
the negative binomial family (more precisely, we are using
`rstanarm::neg_binomial_2()`

here) which is a family that is
not supported by **projpred**’s traditional projection^{7}.

We will re-use the data generated above in the Poisson example.

We now fit a reference model with the negative binomial distribution
as response family^{8}:

```
<- stan_glm(
refm_fit_nebin formula = refm_fml,
family = neg_binomial_2(),
data = dat_poiss_train,
prior = hs(global_scale = tau0, slab_df = 100, slab_scale = 1),
### Only for the sake of speed (not recommended in general):
chains = 2, iter = 500,
###
seed = 7304, QR = TRUE, refresh = 0
)
```

To run a variable selection with `latent = TRUE`

, we now
also need to specify more arguments which will be passed to
`extend_family()`

^{9}:

```
<- as.matrix(refm_fit_nebin)[, "reciprocal_dispersion", drop = FALSE]
refm_prec <- function(ilpreds, y_oscale,
latent_ll_oscale_nebin wobs = rep(1, length(y_oscale)), cl_ref,
wdraws_ref = rep(1, length(cl_ref))) {
<- matrix(y_oscale, nrow = nrow(ilpreds), ncol = ncol(ilpreds),
y_oscale_mat byrow = TRUE)
<- matrix(wobs, nrow = nrow(ilpreds), ncol = ncol(ilpreds),
wobs_mat byrow = TRUE)
<- cl_agg(refm_prec, cl = cl_ref, wdraws = wdraws_ref)
refm_prec_agg <- dnbinom(y_oscale_mat, size = refm_prec_agg, mu = ilpreds, log = TRUE)
ll_unw return(wobs_mat * ll_unw)
}<- function(ilpreds_resamp, wobs, cl_ref,
latent_ppd_oscale_nebin wdraws_ref = rep(1, length(cl_ref)),
idxs_prjdraws) {<- cl_agg(refm_prec, cl = cl_ref, wdraws = wdraws_ref)
refm_prec_agg <- refm_prec_agg[idxs_prjdraws, , drop = FALSE]
refm_prec_agg_resamp <- rnbinom(prod(dim(ilpreds_resamp)), size = refm_prec_agg_resamp,
ppd mu = ilpreds_resamp)
<- matrix(ppd, nrow = nrow(ilpreds_resamp), ncol = ncol(ilpreds_resamp))
ppd return(ppd)
}<- get_refmodel(refm_fit_nebin, latent = TRUE,
refm_nebin latent_ll_oscale = latent_ll_oscale_nebin,
latent_ppd_oscale = latent_ppd_oscale_nebin)
```

`Defining `latent_ilink` as a function which calls `family$linkinv`, but there is no guarantee that this will work for all families. If relying on `family$linkinv` is not appropriate or if this raises an error in downstream functions, supply a custom `latent_ilink` function (which is also allowed to return only `NA`s if response-scale post-processing is not needed).`

`Since `<refmodel>$dis` will consist of only `NA`s, downstream analyses based on this reference model won't be able to use log predictive density (LPD) values on latent scale. Furthermore, proj_predict() won't be able to draw from the latent Gaussian distribution.`

```
<- varsel(
vs_lat_nebin
refm_nebin,d_test = d_test_lat_poiss,
### Only for the sake of speed (not recommended in general):
nclusters_pred = 20,
###
nterms_max = 14,
seed = 95930
)
```

Again, the message telling that `<refmodel>$dis`

consists of only `NA`

s will not concern us here because we
will only be focusing on response-scale post-processing. The message
concerning `latent_ilink`

can be safely ignored here (the
internal default based on `family$linkinv`

works correctly in
this case).

Again, we first inspect the `plot()`

results to decide for
a submodel size:

`<- plot(vs_lat_nebin, stats = "mlpd", deltas = TRUE) ) ( gg_lat_nebin `

Again, a zoomed plot is more helpful:

`+ ggplot2::coord_cartesian(ylim = c(-2.5, 0.25)) gg_lat_nebin `

Although the submodels’ MLPDs approach the reference model’s already from a submodel size of 8 on (taking into account the uncertainty indicated by the error bars), the curve levels off only from a submodel size of 11 on. For a more informed decision, the GMPD could be taken into account again, but for the sake of brevity, we will simply decide for a submodel size of 11 here:

`<- 11 modsize_decided_lat_nebin `

This is not the size that `suggest_size()`

would suggest,
but as mentioned in the main vignette and in the documentation,
`suggest_size()`

provides only a quite heuristic decision (so
we stick with our manual decision here):

`suggest_size(vs_lat_nebin, stat = "mlpd")`

`[1] 8`

Only now, after we have made a decision for the submodel size, we inspect further results from the variable selection and, in particular, the solution path:

```
<- summary(
smmry_lat_nebin
vs_lat_nebin,stats = "mlpd",
type = c("mean", "se", "lower", "upper", "diff", "diff.se")
)print(smmry_lat_nebin, digits = 3)
```

```
------
Response-scale family:
Family: neg_binomial_2
Link function: log
------
Latent-scale family:
Family: gaussian
Link function: identity
------
Formula: .y ~ x.1 + x.2 + x.3 + x.4 + x.5 + x.6 + x.7 + x.8 + x.9 + x.10 +
grpPL + xn.1 + xn.2 + xn.3 + xn.4 + xn.5 + xn.6 + xn.7 +
xn.8 + xn.9 + xn.10 + xn.11 + xn.12 + xn.13 + xn.14 + xn.15 +
xn.16 + xn.17 + xn.18 + xn.19 + xn.20 + xn.21 + xn.22 + xn.23 +
xn.24 + xn.25 + xn.26 + xn.27 + xn.28 + xn.29 + xn.30 + xn.31 +
xn.32 + xn.33 + xn.34 + xn.35 + xn.36 + xn.37 + xn.38 + xn.39
Observations (training set): 71
Observations (test set): 71
Projection method: latent
Search method: L1
Maximum submodel size for the search: 14
Number of projected draws in the search: 1 (from clustered projection)
Number of projected draws in the performance evaluation: 20 (from clustered projection)
Performance evaluation summary (response scale) with `deltas = FALSE`:
size solution_terms mlpd se lower upper diff diff.se
0 <NA> -386.10 157.030 -543.13 -229.07 -382.421 156.8118
1 x.4 -98.66 30.140 -128.80 -68.52 -94.985 29.8979
2 x.6 -20.54 6.219 -26.76 -14.33 -16.867 6.0514
3 x.10 -8.52 1.215 -9.73 -7.30 -4.839 1.0220
4 x.3 -5.68 0.720 -6.40 -4.96 -2.003 0.4969
5 x.1 -4.59 0.568 -5.15 -4.02 -0.908 0.3726
6 x.2 -4.07 0.470 -4.54 -3.61 -0.397 0.2686
7 x.8 -3.87 0.410 -4.28 -3.46 -0.194 0.1492
8 x.5 -3.79 0.403 -4.19 -3.38 -0.108 0.1325
9 xn.29 -3.78 0.403 -4.18 -3.38 -0.101 0.1332
10 x.7 -3.66 0.384 -4.04 -3.28 0.018 0.0967
11 grpPL -3.43 0.343 -3.77 -3.09 0.247 0.0435
12 xn.31 -3.43 0.342 -3.77 -3.08 0.252 0.0409
13 x.9 -3.42 0.340 -3.76 -3.08 0.263 0.0374
14 xn.2 -3.42 0.341 -3.77 -3.08 0.254 0.0388
```

As we can see from the solution path, our selected 11 predictor terms
lack one truly relevant predictor (`x.9`

) and include one
noise term (`xn.29`

). More explicitly, our selected predictor
terms are:

```
<- solution_terms(vs_lat_nebin)
soltrms_lat_nebin <- head(soltrms_lat_nebin,
( soltrms_lat_nebin_final modsize_decided_lat_nebin) )
```

```
[1] "x.4" "x.6" "x.10" "x.3" "x.1" "x.2" "x.8" "x.5" "xn.29"
[10] "x.7" "grpPL"
```

Again, we will skip post-selection inference here (see the main vignette for a demonstration of post-selection inference).

This example demonstrated how the latent projection can be used for
those families which are neither supported by
**projpred**’s traditional nor by
**projpred**’s augmented-data projection. (In the future,
this vignette will be extended to demonstrate how both, latent and
augmented-data projection, can be applied to those non-ED families which
are supported by both.)

Catalina, Alejandro, Paul Bürkner, and Aki Vehtari. 2021. “Latent
Space Projection Predictive Inference.” arXiv. https://doi.org/10.48550/arXiv.2109.04702.

Cover, Thomas M., and Joy A. Thomas. 1991. *Elements of Information
Theory*. New York, NY, USA: John Wiley &
Sons, Ltd. https://doi.org/10.1002/0471200611.

Jørgensen, Bent. 1987. “Exponential Dispersion Models.”
*Journal of the Royal Statistical Society. Series B
(Methodological)* 49 (2): 127–62.

McCullagh, P., and J. A. Nelder. 1989. *Generalized Linear
Models*. 2nd ed. London: Chapman &
Hall.

Piironen, Juho, Markus Paasiniemi, and Aki Vehtari. 2020.
“Projective Inference in High-Dimensional Problems:
Prediction and Feature Selection.” *Electronic
Journal of Statistics* 14 (1): 2155–97. https://doi.org/10.1214/20-EJS1711.

Piironen, Juho, and Aki Vehtari. 2017. “Sparsity Information and
Regularization in the Horseshoe and Other Shrinkage Priors.”
*Electronic Journal of Statistics* 11 (2): 5018–51. https://doi.org/10.1214/17-EJS1337SI.

Weber, Frank, and Aki Vehtari. 2023. “Projection Predictive
Variable Selection for Discrete Response Families with Finite
Support.” arXiv. https://doi.org/10.48550/arXiv.2301.01660.

Jørgensen (1987) himself only uses the term “exponential dispersion model”, but the discussion for that article mentions the term “ED [i.e., exponential dispersion] family”. Jørgensen (1987) also introduces the class of

*discrete exponential dispersion*families (here abbreviated by “DED families”), see section “Example: Negative binomial distribution”.↩︎The set of families supported by

**projpred**’s traditional—i.e., neither augmented-data nor latent—projection is only a subset of the set of all ED families (**projpred**’s traditional projection supports the`gaussian()`

family, the`binomial()`

family, the`brms::bernoulli()`

family via`brms::get_refmodel.brmsfit()`

, and the`poisson()`

family). Thus, we cannot use the term “ED families” equivalently to the term “families supported by**projpred**’s traditional projection”, but the concept of ED families is important here nonetheless.↩︎The latent predictors are also known as the linear predictors, but “latent” is a more general term than “linear”.↩︎

If the

`refmodel`

-class object is not defined explicitly but implicitly by a call to a top-level function such as`project()`

,`varsel()`

, or`cv_varsel()`

, then`latent = TRUE`

and all other arguments related to the latent projection need to be set in*each*call to a top-level function.↩︎More precisely, the latent projection

*replaces*the KL divergence minimization problem in the original response space by a KL divergence minimization problem in the latent space and solves the latter.↩︎Here, “exact” means apart from approximations and simplifications which are also undertaken for the traditional projection.↩︎

The negative binomial distribution belongs to the class of

*discrete exponential dispersion*families (Jørgensen 1987) (here abbreviated by “DED families”). DED families are closely related to ED families (Jørgensen 1987), but strictly speaking, the class of DED families is not a subset of the class of ED families. GitHub issue #361 explains why the “traditional” projection onto a DED-family submodel is currently not implemented in**projpred**.↩︎For the sake of simplicity, we won’t adjust

`tau0`

to this new family, but in a real-world example, such an adjustment would be necessary. However, since Table 1 of Piironen and Vehtari (2017) doesn’t list the negative binomial distribution, this would first require a manual derivation of the pseudo-variance \(\tilde{\sigma}^2\).↩︎The suffix

`_prec`

in`refm_prec`

stands for “precision” because here, we follow the Stan convention (see the Stan documentation for the`neg_binomial_2`

distribution, the`brms::negbinomial()`

documentation, and the**brms**vignette “Parameterization of Response Distributions in brms”) and prefer the term*precision*parameter for what is denoted by \(\phi\) there (confusingly, argument`size`

in`?stats::NegBinomial`

—which is the same as \(\phi\) from the Stan notation—is called the*dispersion*parameter there, although the variance is increased by its reciprocal).↩︎