An implementation of Approximate Bayesian Computation (ABC) methods
applied to pooled sequencing (Pool-seq) data is available in R language
in the package `poolABC`

. The purpose of this vignette is to
provide an in-depth overview of the capabilities of the package,
highlighting the usage of its main functions.

The initial sections of this vignette detail how to import pooled
sequencing data from files in the `_rc`

format. We also show
how to randomly select multiple subsets of the observed data, compute
summary statistics for those subsets and use those summary statistics as
target for parameter estimation and model selection with ABC.

This vignette also teaches users how to simulate Pool-seq data under
pre-defined models. Note that the simulation of Pool-seq data requires
functions included in the `poolHelper`

package. We then
exemplify how the imported data and the simulated data can be used to
perform parameter inference. The `poolABC`

package allows the
use of genome-wide multilocus data for ABC by using multiple subsets of
simulated and observed loci.

Briefly, we obtain one set of summary statistics for each set of
simulated loci and for each random subset of observed data. Each set of
summary statistics computed from a unique subset of observed data is
used as an independent target for parameter estimation. Thus, with the
`poolABC`

package, users obtain one posterior distribution,
for each parameter of interest and for each subset of observed loci.
Then, our package allows users to combine those multiple posteriors to
obtain a single estimate per parameter. This merging is performed with
the Epanechnikov kernel and weighting according to the distance between
the mean summary statistics of a subset of loci and the mean across the
genome, giving more weight to subsets of loci with a mean closer to the
overall mean. Finally, the `poolABC`

package also includes
functions to compute point estimates and produce plots of those merged
posterior distributions.

This package uses pooled sequencing data stored on `_rc`

format files. These `_rc`

files are created by running the
`SNP-frequency-diff.pl`

function of
`popoolation2`

. Briefly, this is an example of a typical
`_rc`

file with only two populations:

```
data.frame(chr = c("NC297", "NC297"), pos = c(3530, 5450), rc = c("A", "T"), allele_count = 2,
allele_states = c("A/G", "T/A"), deletion_sum = 0, snp_type = "pop", major_alleles = c("AA",
"TT"), minor_alleles = c("GG", "AN"), maa_1 = c("54/55", "51/54"), maa_2 = c("76/78",
"96/96"), mia_1 = c("1/55", "3/54"), mia_2 = c("2/78", "0/96"))
#> chr pos rc allele_count allele_states deletion_sum snp_type major_alleles
#> 1 NC297 3530 A 2 A/G 0 pop AA
#> 2 NC297 5450 T 2 T/A 0 pop TT
#> minor_alleles maa_1 maa_2 mia_1 mia_2
#> 1 GG 54/55 76/78 1/55 2/78
#> 2 AN 51/54 96/96 3/54 0/96
```

More information about these files can be found at: https://sourceforge.net/p/popoolation2/wiki/Main/

If you have your data in `_rc`

files in a folder of your
computer, you can simply use the `importContigs`

function.

```
# load multiple files and organize information by contig
files <- importContigs(path = "/home/myawesomedata", pops = c(1, 4, 7, 10))
```

The `path`

input of this function indicates the path to
the folder where the `_rc`

files are located. By default, the
`importContigs`

function will import all files present in the
folder that include the `_rc`

pattern in their name. The
index of populations to import is defined by the `pops`

input
argument. For instance, the input `pops = c(1, 4, 7, 10)`

will import the major and minor allele for the first, fourth, seventh
and tenth population in the `_rc`

files.

The `importContigs`

function also includes several
optional input arguments. The `files`

input argument allows
you to specify the index of the files to import. For instance, by
setting `files = 1:5`

, only the first five files listed in
the output of `list.files(path = path)`

will be imported.
Additionally, specific contigs can be removed from the data by adding
their names to the `remove`

input argument.

The `min.minor`

input argument allows you to filter the
data by the number of minor allele reads. For instance, if
`min.minor = 2`

all sites where the total number of minor
allele reads across all populations of the `pops`

input
argument is below 2, will be removed from the data. Alternatively, by
setting `filter = TRUE`

, you can filter the data by the
frequency of the minor allele. When `filter = TRUE`

, the user
can define a `threshold`

for the minimum allowed frequency of
the minor allele. If no `threshold`

is defined, the
`importContigs`

function will assume that at least one minor
allele read per site should exist. Finally, it is possible to include an
header when importing the data. This header can be created with the
`createHeader`

function.

Random windows of a given size (in base pairs) can be selected from
the imported data with the `pickWindows`

function. The data
imported with the `importContigs`

function is a list with all
the elements required for the `pickWindows`

function. You can
assign each of those list elements to an individual object or use them
directly as input argument of the `pickWindows`

function.

```
# randomly select blocks of a given size from several contigs
blocks <- pickWindows(freqs, positions, range, rMajor, rMinor, coverage, window = 1000,
nLoci = 100)
```

With this function, users can randomly select a subset of the
complete pooled sequencing data at their disposal. More specifically,
the `pickWindows`

function allows users to randomly select
`nLoci`

blocks of `window`

size (in base pairs)
from the data imported in the previous section. In other words, this
function will randomly select select `nLoci`

contigs and then
select one random block with a user defined size (defined by the
`window`

input) per contig.

The next step is the computation of a set of summary statistics from
the observed data. To compute summary statistics from the observed data,
we can use the `statsContig`

function. This function will
compute the same set of summary statistics used in the simulations from
the multiple random subset of loci obtained in the previous step.

```
# compute a set of observed summary statitstics
obs <- statsContig(randomWindows = blocks, nPops = 4, stat.names = stat.names)
```

Note that we are using the `blocks`

object created with
the `pickWindows`

function as the `randomWindows`

input argument. The `statsContig`

function will compute
summary statistics from those randomly selected blocks of observed data.
Also, the use of names for the summary statistics is strongly
recommended. To ensure that the set of observed summary statistics is
named, we should obtain the name of the simulated summary statistics and
include those in the `stat.names`

input argument of the
`statsContig`

function.

With this package, you also have the ability to simulate pooled
sequencing data under three different models by using the
`poolSim`

function. The `model`

input argument
allows the user to define which model to simulate. At the moment, this
package includes three different models: an isolation with migration
model with two populations, a model representing a single origin of two
divergent ecotypes and a third model representing a parallel origin of
those ecotypes.

To simulate data using the two populations model, you have to define the mean depth of coverage and the variance of the coverage for those two populations. You also need to create a list with the number of individuals per pool and per population. In the next chunk, you can see how to simulate data using this model:

```
# set the mean and variance of the coverage
sMean <- c(84.34, 66.76)
sVars <- c(1437.22, 912.43)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(10, 10)), 2)
# run simulation for a two-populations model
sims <- poolSim(model = "2pops", nDip = 200, nPops = 2, nLoci = 100, nSites = 2000,
mutrate = 2e-08, size = size, mean = sMean, variance = sVars, minimum = 15, maximum = 180,
min.minor = 1, Nref = c(25000, 25000), ratio = c(0.1, 3), pool = c(5, 180), seq = c(1e-04,
0.001), split = c(0, 3), CW = c(1e-13, 0.001), WC = c(1e-13, 0.001), bT = c(0,
0.5))
```

The `poolSim`

function requires several input arguments,
that are explained in detail in the help page of the function. However,
note that most of those input arguments define the minimum and maximum
values for a variety of relevant parameters. To simulate data using a
four populations model:

```
# set the mean and variance of the coverage
sMean <- c(84.34, 66.76, 65.69, 68.83)
sVars <- c(1437.22, 912.43, 848.02, 1028.23)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(5, 20)), 4)
# run simulation for a four-populations model
sims <- poolSim(model = "Single", nDip = 400, nPops = 4, nLoci = 100, nSites = 2000,
mutrate = 2e-08, size = size, mean = sMean, variance = sVars, minimum = 15, maximum = 180,
min.minor = 2, Nref = c(25000, 25000), ratio = c(0.1, 3), pool = c(5, 180), seq = c(1e-04,
0.001), split = c(0, 3), CW = c(1e-13, 0.001), WC = c(1e-13, 0.001), CC = c(1e-13,
0.001), WW = c(1e-13, 0.001), ANC = c(1e-13, 0.001), bT = c(0, 0.5), bCW = c(0,
0.5), bWC = c(0, 0.5))
```

The `poolSim`

function can be used to perform a single
simulation. However, most of the times, you will want to perform
thousands of simulations. One way to accomplish this is to use
`replicate`

function together with our `poolSim`

function. We recommend that you do the following:

```
# set the mean and variance of the coverage
sMean <- c(84.34, 66.76)
sVars <- c(1437.22, 912.43)
# create a list containing the information of the pool sizes by population
size <- rep(list(rep(5, 20)), 2)
# define how many simulations to run
nSims <- 10
# run one batch of simulations
sims <- t(replicate(n = nSims, unlist(poolSim(model = "2pops", nDip = 200, nPops = 2,
nLoci = 100, nSites = 1000, mutrate = 2e-08, size = size, mean = sMean, variance = sVars,
minimum = 20, maximum = 185, min.minor = 2, Nref = c(25000, 25000), ratio = c(0.1,
3), pool = c(5, 180), seq = c(1e-04, 0.001), split = c(0, 3), CW = c(1e-13,
0.001), WC = c(1e-13, 0.001), bT = c(0, 0.5)))))
```

By using the `replicate`

function, you can perform
multiple simulations. By unlisting and then transposing the output of
those simulations, you obtain a matrix where each row corresponds to a
different simulation and each column is a different parameter or summary
statistic.

The observed summary statistics computed in the previous sections and the simulations performed in the previous one can then be used to perform parameter estimation with Approximate Bayesian Computation (ABC).

We included with this package a small dataset simulated under the two
populations model. This includes one matrix (`sumstats`

) with
the summary statistics computed from the simulated data, one matrix
(`params`

) with the simulated parameter values and a final
matrix (`limits`

) with the minimum and maximum value of the
prior distribution for each parameter.

The `poolABC`

package aims at streamlining the process of
parameter inference with Pool-seq data. One of the key components of
that design is the `ABC`

function.

By using this function, users can simultaneously perform parameter
estimation with ABC for multiple targets. The `ABC`

function
requires the data, imported with the `importContigs`

function
and then uses both the `pickWindows`

and
`statsContig`

functions to select multiple random subset of
loci from the observed data and compute a set of observed summary
statistics for each of those subsets. Thus, for each subset of loci we
obtain a vector of summary statistics and each vector acts as an
independent target for parameter estimation. The `ABC`

function can be used by doing:

```
# parameter estimation with ÃBC function
myabc <- ABC(nPops = 2, ntrials = 10, freqs = freqs, positions = positions, range = range,
rMajor = rMajor, rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 100,
limits = limits, params = params, sumstats = sumstats, tol = 0.01, method = "regression")
```

The `ntrials`

input argument defines the number of
independent targets for parameter estimation. In this example, we are
performing parameter inference for 10 different targets. Each of those
targets was obtained by computing summary statistics from windows of
1000 base pairs (`window = 1000`

) from 100
(`nLoci = 100`

) randomly selected contigs of the observed
data.

Note that you should also define the `method`

and
tolerance rate, `tol`

, to use. The `tol`

is
defined as the percentage of accepted simulation. You should strive to
keep a low tolerance rate, to avoid accepting simulations that are too
distant from the observed data, but it is also important to avoid very
stringent tolerance rates that may lead to few accepted values. A
typical value of `tol = 0.01`

or `tol = 0.05`

is
recommended but you should test different `tol`

values in the
cross-validation analysis (see more about this in subsequent
sections).

This package implements two ABC algorithms for constructing the
posterior distribution from the accepted simulations: a rejection method
and a regression-based correction using a local linear regression. When
`method`

is “rejection”, simulations are accepted if the
Euclidean distance between the set of summary statistics computed from
the simulated data and the target is sufficiently small and these
accepted simulations are considered a sample from the posterior
distribution. When `method`

is “regression”, an additional
step is used to correct for the imperfect match between the summary
statistics computed from the simulated data and the summary statistics
computed from the observed data. For this reason, we recommend that you
select the regression method because it will, most often than not, lead
to more precise parameter estimates.

After using the `ABC`

function to perform parameter
estimation with Approximate Bayesian Computation for several targets, we
need to merge the multiple posteriors obtained (one for each target)
into a single posterior per parameter.

This can be performed with the `mergepost`

function. One
of the required input arguments of this function is the
`global`

input. This input should be a vector with the
observed summary statistics computed from the entire dataset. We
recommend that you use the `pickWindows`

function to select a
large number of loci and then use that random selection as the input
argument of the `statsContig`

function.

```
# load multiple files and organize information by contig
blocks <- pickWindows(freqs = freqs, positions = positions, range = range, rMajor = rMajor,
rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 800)
# compute a set of summary statistics from the observed data
global <- statsContig(randomWindows = blocks, nPops = 2, stat.names = stat.names)
```

The global vector can then be used in the `mergepost`

function. The remaining required input arguments are the matrix with the
`target`

for the parameter inference and the list containing
the posteriors (`post`

) for each target and parameter. It is
also possible to include the regression weights in the
`wtreg`

option.

```
# merge posterior distributions
myabc <- mergepost(target = myabc$target, global = global, post = myabc$adjusted,
wtreg = myabc$weights)
```

Briefly, this function will merge the different posteriors into a
single one, using different weighting methods for the merging. Details
about the various elements of the `mergepost`

output can be
found in the help page of the function. Note that the
`merge`

, `weighted`

, `merge_reg`

and
`weighted_reg`

entries contain, for each parameter, a
`locfit`

object, obtained after merging the multiple
posteriors using the corresponding method. The `merged_stat`

,
`weighted_stat`

, `merge_reg_stat`

and
`weighted_reg_stat`

are the posterior point estimates for the
corresponding merging method.

Users can then plot the resulting merged posterior distribution with
the `plot_weighted`

function. You should include your matrix
with the simulated parameter values in the `prior`

input
argument of this function to also plot the prior distribution of the
chosen parameter. This allows for a comparison, in the same plot, of the
prior and posterior shape.

You should include the output of the `mergepost`

function
as the `merged_posterior`

input argument and a matrix with
the `limits`

of the prior distribution for each parameter.
Then, you just need to define which parameter to plot with the
`index`

input argument.

```
# plot the density estimation of a parameter
plot_weighted(prior = params, merged_posterior = myabc, limits = limits, index = 2)
```

The `plot_weighted`

function plots the posterior density
of the chosen parameter, together with the prior distribution of the
same plot.

The `poolABC`

package also allows users to perform model
selection by estimating the posterior model probabilities, comparing two
scenarios of ecotype formation: the single and the parallel origin
scenario. The `modelSelect`

function can be used to perform
model selection with ABC.

One of the required input arguments of the `modelSelect`

function is the `index`

. This is a vector of model indices
that should have the same `length`

as
`nrow(sumstats)`

to indicate to which model a particular row
of `sumstats`

belongs. The remaining input arguments are
explained in the help page of the function. As before, you should also
define the tolerance rate (`tol`

) and the `method`

to use. A tolerance of 0.01 and the “regression” method are
recommended.

```
# create a vecto of model indices
index <- c(rep("model1", nrow(sumstats)/2), rep("model2", nrow(sumstats)/2))
# select a random simulation to act as target
target <- sumstats[10, ]
# perform model selection with ABC
mysel <- modelSelect(target = target, index = index, sumstats = sumstats, tol = 0.1,
method = "regression")
# display the structure of the mysel object
str(mysel)
#> List of 6
#> $ method : chr "regression"
#> $ indices: Factor w/ 2 levels "model1","model2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ pred : 'table' num [1:2(1d)] 0.541 0.459
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "model1" "model2"
#> $ ss : num [1:1000, 1:14] 0 0.0106 0.00118 0.01243 0 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:14] "Sf" "Sx1" "Sx2" "SS" ...
#> $ weights: num [1:1000] 0.2723 1 0.0146 0.0171 0.3713 ...
#> $ nmodels: Named int [1:2] 5000 5000
#> ..- attr(*, "names")= chr [1:2] "model1" "model2"
```

The output of the `modelSelect`

function is a list with
six entries. To quickly view the results of the model selection, you can
use the `summary_modelSelect`

function. This function will
provide an easy to read display of the posterior model probabilities and
Bayes factors. The only required input argument of the
`summary_modelSelect`

function is the object created with the
the `modelSelect`

function.

```
# check results of model selection
msel <- summary_modelSelect(object = mysel)
#> Data:
#> results based on 1000 posterior samples
#>
#> Models a priori:
#> model1, model2
#>
#> Models a posteriori:
#> model1, model2
#>
#> Proportion of accepted simulations (rejection):
#> model1 model2
#> 0.51 0.49
#>
#> Posterior model probabilities (mnlogistic):
#> model1 model2
#> 0.541 0.459
```

As you can see, by running the `summary.modelSelect`

function we get an output with the proportion of accepted simulation for
each model under a rejection method and posterior model probabilities
under the regression method. If we print the object itself

```
# print results of model selection
msel
#> $rejection
#> $rejection$Prob
#> model1 model2
#> 0.51 0.49
#>
#> $rejection$BayesF
#> model1 model2
#> model1 1.0000000 1.040816
#> model2 0.9607843 1.000000
#>
#>
#> $mnlogistic
#> $mnlogistic$Prob
#> model1 model2
#> 0.5406397 0.4593603
#>
#> $mnlogistic$BayesF
#> model1 model2
#> model1 1.0000000 1.176941
#> model2 0.8496605 1.000000
```

we can also see the Bayes factors between pairs of models for both the rejection and the regression methods.

A fundamental part of any ABC analysis is the validation of the
results obtained in the parameter estimation and model selection steps.
The `poolABC`

package includes tools to perform cross
validation for both analysis, computing prediction errors for both
parameter inference and model selection.

One important component of this validation process is the calculation of prediction errors for each parameter. This allows us to evaluate the confidence of the estimates and the effect of various point estimates and/or tolerance rates.

To perform a leave-one-out cross validation for ABC, you can use the
`simulationABC`

function. This functions requires the
simulated parameter values, `params`

, the simulated summary
statistics, `sumstats`

and a matrix with the
`limits`

of the prior distributions. You should also define
the size of the cross-validation sample, `nval`

, the
tolerance rate, `tol`

, and the type of ABC algorithm to be
applied in the `method`

input.

```
# perform an Approximate Bayesian Computation simulation study
mycv <- simulationABC(params = params, sumstats = sumstats, limits = limits, nval = 100,
tol = 0.01, method = "regression")
# display the structure of the mycv object
str(mycv, max.level = 2)
#> List of 3
#> $ true: num [1:100, 1:8] 1.09 0.364 0.579 1.332 0.306 ...
#> ..- attr(*, "dimnames")=List of 2
#> $ rej :List of 4
#> ..$ mode : num [1:100, 1:8] 1.888 0.276 0.27 0.709 0.099 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ median: num [1:100, 1:8] 1.87 0.396 0.313 0.934 0.262 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ mean : num [1:100, 1:8] 1.777 0.459 0.348 1.033 0.295 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ error : num [1:3, 1:8] 0.411 0.176 0.176 0.331 0.207 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> $ reg :List of 4
#> ..$ mode : num [1:100, 1:8] 0.549 0.291 0.497 1.458 0.18 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ median: num [1:100, 1:8] 0.686 0.311 0.51 1.415 0.198 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ mean : num [1:100, 1:8] 0.77 0.335 0.524 1.395 0.221 ...
#> .. ..- attr(*, "dimnames")=List of 2
#> ..$ error : num [1:3, 1:8] 0.155 0.142 0.135 0.198 0.195 ...
#> .. ..- attr(*, "dimnames")=List of 2
```

The output of the `simulationABC`

function is a list with
three elements. Details about each list element are available in the
help page of the function. A quick way to visualize the results of the
leave-one-out cross validation is to plot the the cross-validation
results.

The `poolABC`

package includes the
`plot_errorABC`

function to allow this visual evaluation of
the quality of the estimation. This function requires as input the
output of the `simulationABC`

function. Additionally, you
need to define the ABC algorithm (either “reg” for regression or “rej”
for rejection) and which point estimate (“mode”, “median” or “mean”) to
plot. You should also define which parameter to plot, by selecting the
corresponding `index`

.

```
# plot the prediction errors
plot_errorABC(x = mycv, method = "reg", statistic = "median", index = 1)
```

As you can see, this produces a plot with the true parameter value in x-axis and the estimate value of the parameter in the y-axis. Thus, the closer the points are to the diagonal line, the higher is the quality of the estimation.

It is also possible to evaluate how much confidence we should place
in the model selection results by performing a leave-one-out cross
validation for model selection with ABC via subsequent calls to the
function `modelSelect`

. Briefly, several simulations from
each model are selected to act as validation simulations, while the
remaining simulations are used as training simulations. For each
validation simulation, the function `modelSelect`

is called
to estimate the posterior model probabilities.

```
# perform a leave-one-out cross validation for model selection
modelSim <- sim_modelSel(index = index, sumstats = sumstats, nval = 100, tol = 0.1)
# display the structure of the modelSim object
str(modelSim)
#> List of 5
#> $ cvsamples : Named int [1:200] 3353 2818 3561 1671 1513 3309 2166 1897 3942 3356 ...
#> ..- attr(*, "names")= chr [1:200] "model11" "model12" "model13" "model14" ...
#> $ true : chr [1:200] "model1" "model1" "model1" "model1" ...
#> $ estimated : chr [1:200] "model2" "model1" "model1" "model2" ...
#> $ model.probs: num [1:200, 1:2] 0.494 0.532 0.501 0.486 0.456 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:200] "model1" "model1" "model1" "model1" ...
#> .. ..$ : chr [1:2] "model1" "model2"
#> $ models : chr [1:2] "model1" "model2"
```

The output of this leave-one-out cross validation for model selection
is a list with 5 different elements that can be used in the
`error_modelSel`

function to compute the confusion matrix and
the mean misclassification probabilities of models.

Users can also define a `threshold`

for the posterior
model probabilities. This `threshold`

corresponds to the
minimum posterior probability of assignment. Thus, a simulation where
the posterior probability of any model is below the threshold will not
be assigned to a model and will instead be classified as
`unclear`

.

```
# compute the mean misclassification probabilities
mselError <- error_modelSel(object = modelSim)
#> Confusion matrix based on 100 samples for each model
#>
#> model1 model2
#> model1 41 59
#> model2 51 49
#>
#> Mean model posterior probabilities (mnlogistic)
#>
#> model1 model2
#> model1 0.484 0.516
#> model2 0.506 0.494
#>
#> Posterior probabilities of correctly assigned model1 models
#>
#> model1 incorrect
#> 0.53 0.47
#>
#> Posterior probabilities of correctly assigned model2 models
#>
#> incorrect model2
#> 0.464 0.536
#>
#> Posterior probabilities when model1 is estimated as model2
#>
#> model1 model2
#> 0.453 0.547
#>
#> Posterior probabilities when model2 is estimated as model1
#>
#> model1 model2
#> 0.546 0.454
```

The `error_modelSel`

function outputs the confusion matrix
and the mean model posterior probabilities obtained with the regression
method. It will also output other useful information such as the mean
posterior probability of correctly assigned models and the mean
posterior probability when each model is not correctly assigned.

For a more visual interpretation of these results, it is also
possible to display a barplot of the model misclassification. By using
the `plot_msel`

function we can plot the confusion matrix,
either in colour (if `color = TRUE`

) or in grey (if
`color = FALSE`

).

Using the output of the `error_modelSel`

function as the
input of the `plot_msel`

function, we can produce this
barplot showing the proportion of simulations classified to any of the
models.