The aim of this vignette is to introduce the basic steps involved in
fitting GLM-PCA model to single-cell RNA-seq data using
**fastglmpca**. (See Townes *et al*
2019 or Collins
*et al* 2001 for a detailed description of the GLM-PCA
model.)

To begin, load the packages that are needed.

```
library(Matrix)
library(fastglmpca)
library(ggplot2)
library(cowplot)
```

Set the seed so that the results can be reproduced.

`set.seed(1)`

We will illustrate fastglmpca using a single-cell RNA-seq data set
from Zheng *et al*
(2017). These data are reference transcriptome profiles from 10
bead-enriched subpopulations of peripheral blood mononuclear cells
(PBMCs). The original data set is much larger; for this introduction, we
have taken a subset of roughly 3,700 cells.

The data we will analyze are unique molecular identifier (UMI) counts. These data are stored as an \(n \times m\) sparse matrix, where \(n\) is the number of genes and \(m\) is the number of cells:

```
data(pbmc_facs)
dim(pbmc_facs$counts)
# [1] 16791 3774
```

The UMI counts are “sparse”—that is, most of the counts are zero. Indeed, over 95% of the UMI counts are zero:

```
mean(pbmc_facs$counts > 0)
# [1] 0.04265257
```

For the purposes of this vignette only, we randomly subset the data further to reduce the running time:

```
counts <- pbmc_facs$counts
n <- nrow(counts)
rows <- sample(n,3000)
counts <- counts[rows,]
```

Now we have a 3,000 x 3,774 counts matrix:

```
dim(counts)
# [1] 3000 3774
```

Since no preprocessing of UMI counts is needed (e.g., a
log-transformation), the first step is to initialize the model fit using
`init_glmpca_pois()`

. This function has many input arguments
and options, but here we will keep all settings at the defaults, and we
set K, the rank of the matrix factorization, to 2:

`fit0 <- init_glmpca_pois(counts,K = 2)`

By default, `init_glmpca_pois()`

adds gene- (or row-)
specific intercept, and a fixed cell- (or column-) specific size-factor.
This is intended to mimic the defaults in glmpca.
`init_glmpca_pois()`

has many other options which we do not
demonstrate here.

Once we have initialized the model, we are ready to run the
optimization algorithm to fit the model (i.e., estimate the model
parameters). This is accomplished by a call to
`fit_glmpca_pois()`

:

`fit <- fit_glmpca_pois(counts,fit0 = fit0)`

If you prefer not to wait for the model optimization (it may take
several minutes to run), you are welcome to load the previously fitted
model (which is the output from the `fit_glmpca_pois`

call
above):

`fit <- pbmc_facs$fit`

The return value of `fit_glmpca_pois()`

resembles the
output of `svd()`

and similar functions, with a few other
outputs giving additional information about the model:

```
names(fit)
# [1] "U" "V" "fixed_b_cols" "fixed_w_cols" "loglik"
# [6] "progress" "X" "B" "Z" "W"
# [11] "d"
```

In particular, the outputs that are capital letters the low-rank reconstruction of the counts matrix:

```
fitted_counts <- with(fit,
exp(tcrossprod(U %*% diag(d),V) +
tcrossprod(X,B) +
tcrossprod(W,Z)))
```

Let’s compare (a random subset of) the reconstructed (“fitted”) counts versus the observed counts:

```
i <- sample(prod(dim(counts)),2e4)
pdat <- data.frame(obs = as.matrix(counts)[i],
fitted = fitted_counts[i])
ggplot(pdat,aes(x = obs,y = fitted)) +
geom_point() +
geom_abline(intercept = 0,slope = 1,color = "magenta",linetype = "dotted") +
labs(x = "observed count",y = "fitted count") +
theme_cowplot(font_size = 12)
```

The U and V outputs in particular are interesting because they give low-dimensional (in this case, 2-d) embeddings of the genes and cells, respectively. Let’s compare this 2-d embedding of the cells the provided cell-type labels:

```
celltype_colors <- c("forestgreen","dodgerblue","darkmagenta",
"gray","hotpink","red")
celltype <- as.character(pbmc_facs$samples$celltype)
celltype[celltype == "CD4+/CD25 T Reg" |
celltype == "CD4+ T Helper2" |
celltype == "CD8+/CD45RA+ Naive Cytotoxic" |
celltype == "CD4+/CD45RA+/CD25- Naive T" |
celltype == "CD4+/CD45RO+ Memory"] <- "T cell"
celltype <- factor(celltype)
pdat <- data.frame(celltype = celltype,
pc1 = fit$V[,1],
pc2 = fit$V[,2])
ggplot(pdat,aes(x = pc1,y = pc2,color = celltype)) +
geom_point() +
scale_color_manual(values = celltype_colors) +
theme_cowplot(font_size = 10)
```