**MEDseq** is an R package which fits a range of models
introduced by Murphy
et al. (2021). These are finite mixtures of exponential-distance
models for clustering longitudinal/categorical life-course sequence
data. A family of parsimonious precision parameter constraints are
accommodated. So too are sampling weights. Gating covariates can be
supplied via formula interfaces. Visualisation of the results of such
models is also facilitated.

The most important function in the **MEDseq** package
is: `MEDseq_fit`

, for fitting the models via the EM or CEM
algorithms. `MEDseq_control`

allows supplying additional
arguments which govern, among other things, controls on the
initialisation of the allocations for the EM/CEM algorithm and the
various model selection options. `MEDseq_compare`

is provided
for conducting model selection between different results from using
different covariate combinations &/or initialisation strategies,
etc. `MEDseq_stderr`

is provided for computing the standard
errors of the coefficients for the covariates in the gating network.

A dedicated plotting function exists for visualising various aspects
of the results, using new methods as well as some existing methods
adapted from the **TraMineR** package. Finally, the package
also contains two data sets: `biofam`

and
`mvad`

.

If you find bugs or want to suggest new features please visit the
**MEDseq** GitHub issues
page.

This vignette mostly aims to demonstrate the **MEDseq**
models by reproducing the analysis of the MVAD data in the Murphy et
al. (2021) paper. However, an additional example data set is also
analysed.

**MEDseq** will run in Windows, Mac OS X, or Linux. To
install it you first need to install R. Installing RStudio as a nice desktop environment for
using R is also recommended.

Once in R you can type at the R command prompt:

```
install.packages('devtools')
::install_github('Keefe-Murphy/MEDseq') devtools
```

to install the latest development version of the package from the
**MEDseq** GitHub page. To
instead install the latest stable official release of the package from
CRAN go to R and type:

`install.packages('MEDseq')`

In either case, if you then type:

`library(MEDseq)`

it will load in all the **MEDseq** functions.

The GitHub version contains a few more features but some of these may not yet be fully tested, and occasionally this version might be liable to break when it is in the process of being updated.

Load the MVAD data. The data comes from a study by McVicar and
Anyadike-Danes (2002) on transition from school to work. The data
consist of static background characteristics (covariates and sampling
weights) and a time series sequence of 72 monthly labour market
activities for each of 712 individuals in a cohort survey. The
individuals were followed up from July 1993 to June 1999. Type
`?mvad`

for more information. We will drop the first two
sequence positions (i.e. months) as they (along with the covariates
`Grammar`

and `Location`

) were used to define the
sampling weights.

Note that the data set must have equal sequence lengths, intervals
must be evenly spaced, and missingness is not allowed. The MVAD data
meet these criteria. The **TraMineR** function
`seqdef`

is used to convert the data to the appropriate
format.

```
data(mvad, package="MEDseq")
$Location <- factor(apply(mvad[,5L:9L], 1L, function(x) which(x == "yes")),
mvadlabels = colnames(mvad[,5L:9L]))
<- list(covariates = mvad[c(3L:4L,10L:14L,87L)],
mvad sequences = mvad[,15L:86L],
weights = mvad[,2L])
<- mvad$covariates
mvad.cov <- seqdef(mvad$sequences[-c(1L,2L)],
mvad.seq states = c("EM", "FE", "HE", "JL", "SC", "TR"),
labels = c("Employment", "Further Education", "Higher Education",
"Joblessness", "School", "Training"))
```

The names of the model types are CC, UC, CU, UU, CCN, UCN, CUN, and UUN. The first letter denotes whether the precision parameters are constrained (C) or unconstrained (U) across clusters, the second denotes the same across time periods, and the third letter (N) indicates the precision of a noise component, i.e. one in which the distribution of the sequences is uniform.

The `MEDseq_control`

function allows the algorithm used
for model fitting, the method used to initialise the allocations
(e.g. k-medoids or Ward’s hierarchical clustering), the criterion used
to identify the optimal model (e.g. BIC, density-based silhouette (DBS),
average silhouette width (ASW), etc.), and more to be specified. By
default, the EM algorithm is employed, k-medoids is used to obtain
starting values, and the BIC is used to choose the optimal model (if a
range of models are fitted). In this vignette, we will mostly accept
these defaults, with the exception of the `noise.gate`

argument.

The optimal model identified in the Murphy et al. (2021) paper has
`G=11`

components, the `UUN`

model type, sampling
weights, and one covariate (an indicator of GCSE exam performance) in
the gating network, identified using a stepwise variable selection
procedure. The UUN model has a specific precision parameter for each
time point in each cluster. The `gating`

covariates can be
specified via a formula interface. The argument `covars`

tells the formula where to look for the specified covariates. Here, the
probability of belonging to this noise component is set to be
independent of the included gating covariate. Thus, to fit such a
model:

```
<- MEDseq_fit(mvad.seq, G=11, modtype="UUN", weights=mvad$weights, gating= ~ gcse5eq,
mod1 covars=mvad.cov, control=MEDseq_control(noise.gate=FALSE))
```

Typically, a range of `G`

values and `modtype`

settings are supplied to a single call to `MEDseq_fit`

and
chosen from using some criterion. Thus, for a given set of covariates,
the model which is optimal in terms of the number of components and the
precision parameter configuration can be identified.

To compare different runs using different sets of covariates, the
function `MEDseq_compare`

can be used. First, let’s fit
models with all covariates included (except `Grammar`

and
`Location`

) and no covariates included. Let’s try different
numbers of components and different model types. Note that only the
first model here includes a noise component.

```
# 10-component CUN model with no covariates.
# CUN models have a precision parameter for each sequence position (i.e. time point),
# though each time point's precision is common across clusters.
<- MEDseq_fit(mvad.seq, G=10, modtype="CUN", weights=mvad$weights)
mod2
# 12-component CC model with all covariates.
# CC models have a single precision parameter across all clusters and time points.
<- MEDseq_fit(mvad.seq, G=12, modtype="CC", weights=mvad$weights,
mod3 gating= ~ . - Grammar - Location, covars=mvad.cov)
```

Confirm that the first model is indeed optimal according to the BIC. Examine the optimal model in greater detail.

```
<- MEDseq_compare(mod1, mod2, mod3, criterion="bic"))
(comp #> ---------------------------------------------------------------------
#> Comparison of Mixtures of Exponential-Distance Models with Covariates
#> Data: mvad.seq
#> Ranking Criterion: BIC
#> Optimal Only: FALSE
#> ---------------------------------------------------------------------
#>
#> rank MEDNames modelNames G df iters bic icl aic dbs
#> 1 mod1 UUN 11 3669 9 -92953.853 -92958.276 -76193.575 0.455
#> 2 mod2 CUN 10 2734 9 -98132.771 -98143.386 -85643.646 0.385
#> 3 mod3 CC 12 3618 6 -98566.426 -98587.89 -82039.12 0.223
#> asw loglik gating algo
#> 0.386 -34427.788 ~gcse5eq EM
#> 0.366 -40087.823 None EM
#> 0.349 -37401.56 ~male + catholic + funemp + gcse5eq + fmpr + livboth EM
#> weights noise noise.gate equalPro
#> TRUE TRUE FALSE
#> TRUE TRUE FALSE
#> TRUE FALSE
```

```
<- comp$optimal
opt summary(opt, classification = TRUE, parameters = FALSE, network = FALSE)
#> ------------------------------------------------------
#> Mixture of Exponential-Distance Models with Covariates
#> Data: mvad.seq
#> ------------------------------------------------------
#>
#> MEDseq (UUN), with 11 components
#> Gating Network Covariates: ~gcse5eq
#> Noise Component: TRUE
#> Noise Component Gating: FALSE
#>
#> log.likelihood N P V df iters BIC DBS ASW Algo
#> -34427.79 712 70 6 3669 9 -92953.85 0.455 0.386 EM
#>
#> Clustering table :
#> 0 1 2 3 4 5 6 7 8 9 10
#> 18 87 59 18 32 60 67 165 95 56 55
```

Examine the estimated gating network coefficients with the dedicated
`print`

method. Note that standard errors can be computed by
calling `MEDseq_stderr`

on the `opt`

object, to
better inform the interpretations of the covariate effects.

```
print(opt$gating)
#>
#> Coefficients:
#> (Intercept) gcse5eqyes
#> Cluster2 -0.9529057 -0.4670714
#> Cluster3 -0.4618282 -1.2292897
#> Cluster4 0.5804638 -2.1756487
#> Cluster5 1.0273366 -3.4335736
#> Cluster6 1.1878210 -3.7253458
#> Cluster7 1.6985296 -4.0901375
#> Cluster8 0.6040331 -2.2020626
#> Cluster9 0.9480638 -3.1958694
#> Cluster10 0.9044878 -3.7306802
#>
#> Formula: ~gcse5eq
#> Noise: TRUE
#> Noise Component Gating: FALSE
#> EqualPro: FALSE
```

Visualise the clusters uncovered by the optimal model. Note that seriation is applied (by default) to the observations within clusters for visual clarity. The legend indicates which colours correspond to which state categories.

`plot(opt, type="clusters")`

Many other plotting options exist, some of which are adapted from the
**TraMineR** package. Use the following code to examine the
central sequence parameters. Note that the central sequence of the noise
component is not shown as it doesn’t contribute to the likelihood.

`plot(opt, type="central")`

Use the following code to see the observation-specific (weighted) density-based silhouette values (coloured by cluster). The (weighted) mean within each cluster is also shown. Note that the low average for the noise component is as expected; we do not expect this cluster to be internally coherent, rather it acts as a filter that allows the other clusters to be captured more clearly.

`plot(opt, type="dbsvals")`

Finally, we can quantify the type of observation characterising each
cluster by computing the mean time spent in each state within each
cluster. By specifying `MAP=TRUE`

here, we are computing the
mean time based on the hard MAP partition rather than weighted by the
soft cluster membership probabilities (albeit still weighted by the
sampling weights used during model fitting, by default). By specifying
`norm=TRUE`

(which is the default), the mean times are
normalised to sum to the sequence length within each cluster. The size
of each cluster in terms of the number of observations assigned to it is
also reported.

```
MEDseq_meantime(opt, MAP=TRUE, norm=TRUE)
#> Size EM FE HE JL SC TR
#> Cluster1 87 3.77 0.286 38.447 0.886 26.076 0.535
#> Cluster2 59 4.65 26.513 37.629 0.447 0.762 0.000
#> Cluster3 18 3.38 30.557 8.563 4.065 21.812 1.558
#> Cluster4 32 35.60 1.683 3.623 2.850 25.598 0.634
#> Cluster5 60 28.36 1.230 0.000 3.378 1.353 35.757
#> Cluster6 67 45.33 1.463 0.000 3.122 1.511 17.912
#> Cluster7 165 57.68 2.085 0.000 5.177 1.625 3.671
#> Cluster8 95 41.47 22.692 0.990 3.039 1.304 0.725
#> Cluster9 56 21.84 35.209 0.268 3.992 6.165 2.576
#> Cluster10 55 8.41 3.388 0.217 42.903 4.199 10.925
#> Noise 18 21.52 11.422 1.187 14.478 2.200 19.226
```

As a second example, let’s consider data on \(N=2000\) 16 year-long family life state
sequences built from the retrospective biographical survey carried out
by the Swiss Household Panel (SHP) in 2002. Each of the \(v=8\) states are defined from a combination
of five basic states; namely, living with parents (Parent), left home
(Left), married (Marr), having children (Child), and Divorced. The data
is available in the **MEDseq** package as
`biofam`

. Type `?biofam`

for more information.

```
data(biofam, package="MEDseq")
<- list(covs = cbind(biofam[2L:9L], age = 2002 - biofam$birthyr),
biofam sequences = biofam[10L:25L] + 1L)
<- biofam$covs[,colSums(is.na(biofam$covs)) == 0]
bio.cov <- seqdef(biofam$sequences,
bio.seq states = c("P", "L", "M", "L+M",
"C", "L+C", "L+M+C", "D"),
labels = c("Parent", "Left", "Married",
"Left+Marr", "Child", "Left+Child",
"Left+Marr+Child", "Divorced"))
```

While the data set contains weights, they are not appropriate for
use; `biofam`

is merely a subsample of the original data so
the weights are not properly adapted. Thus, we will fit a model without
supplying the `weights`

argument. Secondly, the data set also
contain some baseline covariates. As many of them contain missing
values, let’s only consider the `age`

variable, derived from
the `birthyr`

reported in 2002. Again, we override the
`noise.gate=TRUE`

default so that the covariate(s) do not
influence the mixing proportions of the noise component. We do so here
directly, without invoking `MEDseq_control`

explicitly. Thus,
the noise component’s mixing proportion will be constant (though
estimated) across all observed sequences and covariate patterns.

```
# The UUN model includes a noise component.
# Otherwise, there is a precision parameter for each time point in each cluster.
<- MEDseq_fit(bio.seq, G=10, modtype="UUN", gating= ~ age,
bio covars=bio.cov, noise.gate=FALSE)
```

Such a model was identified as optimal following the same steps as the analysis of the MVAD data in the Murphy et al. (2021) paper, albeit according to the weighted DBS criterion rather than the BIC. Print the details of the model to the screen:

```
print(bio)
#> Call: MEDseq_fit(seqs = bio.seq, G = 10, modtype = "UUN", gating = ~age,
#> covars = bio.cov, noise.gate = FALSE)
#>
#> Best Model: UUN, with 10 components and no weights (incl. gating covariates)
#> BIC = -53285.922 | ICL = -53414.91 | AIC = -47897.854
#> DBS = 0.537 | ASW = 0.388
#> Gating: ~age
```

As before, let’s look at the clusters uncovered by the model. This
time, `seriated="both"`

means to order the clusters
themselves as well as the observations within clusters for visual
clarity.

`plot(bio, type="clusters", seriated="both")`

Use the following code to examine the precision parameters. Note how
we have a full \(16\times G\) matrix of
precision parameters, one for each time point in each cluster. In this
case there are many quite large precision parameter values which skew
the colour scale. Indeed, some are even infinite! Infinite precision
under UU or UUN models implies that all values for that time point are
identical to the corresponding central sequence position estimate of the
given cluster. By specifying `quant.scale=TRUE`

, quantiles
are used to construct non-linear break-points, thereby ensuring each
colour represents an equal proportion of the data. We invoke
`seriated`

again for comparability.

`plot(bio, type="precision", quant.scale=TRUE, seriated="clusters")`

This time, rather than showing the weighted mean DBS values, let’s look at observation-specific (unweighted) average silhouette-width values. Note that this relies instead on the MAP partition rather than the soft cluster assignment probabilities.

`plot(bio, type="aswvals")`

As stated above, some plot types are adapted from
**TraMineR**, specifically from its `seqplot`

function. Let’s first look at a plot of the transversal Shannon
entropies for the whole data set.

`seqplot(bio.seq, type="Ht")`

Now let’s use the plot function from **MEDseq** to
examine the transversal entropies within each cluster (by default
weighted by the cluster membership probabilities and sampling weights,
if any). Here we can see for instance that subjects assigned to Cluster
9, corresponding to those individuals who left the parental home to
marry relatively early and had a child on average just one year later,
do indeed exhibit greater variability. Conversely, a postponement of the
transition to adulthood is evident for subjects in Cluster 5. Note that
the y-axis labels are suppressed here, for clarity.

`plot(bio, type="Ht", ylab=NA)`

Other plot types adapted from **TraMineR** can be
produced using `type="d"`

(state distribution plots),
`type="dH"`

(same as `type="Ht"`

overlaid on
`type="d"`

), `type="f"`

(state frequency plots),
`type="i"`

(selected sequence index plots),
`type="I"`

(whole set index plots), `type="ms"`

(modal state plots; an alternative to `type="central"`

above), and `type="mt"`

(mean times plots). Each of these
plots are shown on a per-cluster basis; by default, all account for the
sampling weights (if any), via the `weighted`

argument, and
such plots are typically additionally weighted according to the soft
cluster membership probabilities, though it is also possible to use the
hard MAP partition instead, via modifying the `soft`

argument. All plot types adapted from **TraMineR**
additionally, by default, invoke the `SPS`

argument, which
causes each constituent panel to be labelled by the central sequence
characterising the given cluster, in state-permanence-sequence (SPS)
format. Finally, clustering uncertainties, the gating network, the
similarity matrix, and model selection criteria can also be
visualised.

Murphy, K., Murphy, T. B., Piccarreta, R., and Gormley, I. C. (2021).
Clustering longitudinal life-course sequences using mixtures of
exponential-distance models. *Journal of the Royal Statistical
Society: Series A (Statistics in Society)*, 184(4): 1414–1451.
<doi:10.1111/rssa.12712>.

Gabadinho, A., Ritschard, G., Müller, N. S., and Studer, M. (2011).
Analyzing and visualizing state sequences in R with TraMineR.
*Journal of Statistical Software*, 40(4): 1–37.

McVicar, D. and Anyadike-Danes, M. (2002). Predicting successful and
unsuccessful transitions from school to work by using sequence methods.
*Journal of the Royal Statistical Society: Series A (Statistics in
Society)*, 165(2): 317–334.

Müller, N. S., Studer, M., and Ritschard, G. (2007). Classification
de parcours de vie à l’aide de l’optimal matching. In *XIVe Rencontre
de la Société francophone de classification (SFC 2007), Paris, 5-7
septembre 2007*, pp. 157–160.

Studer, M. (2018). Divisive property-based and fuzzy clustering for
sequence analysis. In G. Ritschard and M. Studer (Eds.), *Sequence
Analysis and Related Approaches: Innovative Methods and
Applications*, pp. 223–239. Cham, Switzerland: Springer
International Publishing.