The **philentropy** package has several mechanisms to calculate distances between probability density functions. The main one is the `distance()` function, which can compute 46 different distances/similarities between probability density functions (see `?philentropy::distance` and a companion vignette for details). Alternatively, it is possible to call each distance/dissimilarity function directly. For example, the `euclidean()` function computes the Euclidean distance, while `jaccard()` computes the Jaccard distance. The complete list of available distance measures can be obtained with the `philentropy::getDistMethods()` function.

Both of the above approaches have their pros and cons. The `distance()` function is more flexible, as it allows users to use any distance measure and can return either a `matrix` or a `dist` object. It also implements several defensive programming checks, and thus it is more appropriate for regular users. Single distance functions, such as `euclidean()` or `jaccard()`, can, on the other hand, be slightly faster, as they directly call the underlying C++ code.

Now, we introduce three new low-level functions that are intermediaries between `distance()` and the single distance functions. They are fairly flexible, allowing the use of any implemented distance measure, but are also usually faster than calling the `distance()` function (especially when it needs to be called many times). These functions are:

- `dist_one_one()`: expects two vectors (probability density functions), returns a single value
- `dist_one_many()`: expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
- `dist_many_many()`: expects two matrices (two sets of probability density functions), returns a matrix of values

Let’s start testing them by attaching the **philentropy** package.

`library(philentropy)`

`dist_one_one()`

`dist_one_one()` is a lower-level equivalent to `distance()`. However, instead of accepting a numeric `data.frame` or `matrix`, it expects two vectors representing probability density functions. In this example, we create two vectors, `P` and `Q`.

```
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)
```

To calculate the Euclidean distance between them, we can use several approaches: (a) the built-in R `dist()` function, (b) `philentropy::distance()`, (c) `philentropy::euclidean()`, or (d) the new `dist_one_one()`.

```
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr
## dist(rbind(P, Q), method = "euclidean")
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE)
## euclidean(P, Q, FALSE)
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 21.024 22.0665 26.83100 23.4125 23.901 336.156 100
## 32.786 33.7415 58.98310 34.5680 35.239 2315.590 100
## 2.586 2.8385 3.17071 3.0570 3.464 4.778 100
## 3.871 4.4115 5.46040 4.9085 5.213 56.764 100
```

All of them return the same, single value. However, as you can see in the benchmark above, some are more flexible, and others are faster.
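As a quick sanity check on the claim that the approaches agree, the sketch below compares base R's `dist()` with the Euclidean formula written out by hand. It deliberately uses only base R, so it runs without **philentropy**; `euclid_manual()` is a hypothetical helper introduced here for illustration and is not part of the package.

```r
# Same example vectors as above
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Hypothetical helper: Euclidean distance written out by hand
euclid_manual <- function(p, q) sqrt(sum((p - q)^2))

# dist() returns a "dist" object; as.numeric() extracts the single value
d_base   <- as.numeric(dist(rbind(P, Q), method = "euclidean"))
d_manual <- euclid_manual(P, Q)

# TRUE if both values agree up to floating-point tolerance
isTRUE(all.equal(d_base, d_manual))
```

With **philentropy** attached, the result of `dist_one_one(P, Q, method = "euclidean", testNA = FALSE)` can be compared against `d_manual` in the same way.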

`dist_one_many()`

The role of `dist_one_many()` is to calculate distances between one probability density function (in the form of a `vector`) and a set of probability density functions (as rows in a `matrix`).

Firstly, let’s create our example data.

```
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))
```

`P` is our input vector and `M` is our input matrix.

Distances between the `P` vector and the probability density functions in `M` can be calculated using several approaches. For example, we could write a `for` loop (which means adding new code) or just use the existing `distance()` function and extract only one row (or column) from the results. `dist_one_many()` allows for this calculation directly, as it goes through each row in `M` and calculates a given distance measure between `P` and the values in that row.
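For reference, the `for` loop alternative mentioned above might look roughly like the following. This is a minimal base R sketch that hard-codes the Euclidean distance; `one_many_loop()` is a hypothetical helper name, not a **philentropy** function.

```r
# Same example data as above
set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

# Hypothetical helper: one-to-many Euclidean distances with an explicit loop
one_many_loop <- function(p, m) {
  out <- numeric(nrow(m))           # preallocate one slot per row of m
  for (i in seq_len(nrow(m))) {
    out[i] <- sqrt(sum((p - m[i, ])^2))
  }
  out
}

d_loop <- one_many_loop(P, M)
length(d_loop)  # one distance per row of M
```

A loop like this has to be rewritten for every distance measure, which is the extra code that `dist_one_many()` is meant to save.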

```
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1]
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 316.244 397.361 494.36541 491.927 568.745 849.122 100
## 26182.286 28366.181 32239.31384 30350.948 35339.433 50017.425 100
## 27.124 31.942 39.40121 37.929 43.306 127.129 100
```

`dist_one_many()` returns a vector of values. In this case, it is much faster than `distance()` and visibly faster than `dist()`, while allowing more distance measures to be used.

`dist_many_many()`

`dist_many_many()` calculates distances between two sets of probability density functions (as rows in two `matrix` objects).

Let’s create two new example `matrix` objects.

```
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
```

`M1` is our first input matrix and `M2` is our second input matrix. I am not aware of any built-in R function that calculates distances between the rows of two matrices, and thus, to solve this problem, we can create our own, `many_dists()` …

```
many_dists = function(m1, m2){
  r = matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))){
    for (j in seq_len(nrow(m2))){
      x = rbind(m1[i, ], m2[j, ])
      r[i, j] = distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}
```

… and compare it to `dist_many_many()`.

```
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  many_dists(M1, M2),
  dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
```

```
## Unit: microseconds
## expr min
## many_dists(M1, M2) 2850.561
## dist_many_many(M1, M2, method = "euclidean", testNA = FALSE) 40.507
## lq mean median uq max neval
## 3218.5515 3782.73070 3417.8890 3681.904 16620.19 100
## 46.2875 53.14176 50.5715 54.483 172.86 100
```

Both `many_dists()` and `dist_many_many()` return a matrix. In the benchmark above, `dist_many_many()` is roughly 70 times faster than our custom `many_dists()` approach (comparing median times of about 3418 and 51 microseconds).
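To make the layout of the returned matrix concrete, here is a small base R sketch, assuming the Euclidean distance only. It builds the same kind of cross-distance matrix by taking the off-diagonal block of `dist()` applied to the stacked matrices. `cross_dist_base()` is a hypothetical helper used purely for illustration; it also computes the within-set distances and throws them away, so it is not meant as an efficient alternative.

```r
# Same example data as above
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))

# Hypothetical helper: cross-distances from the off-diagonal block of dist()
cross_dist_base <- function(m1, m2) {
  full <- as.matrix(dist(rbind(m1, m2), method = "euclidean"))
  idx1 <- seq_len(nrow(m1))               # rows belonging to m1
  idx2 <- nrow(m1) + seq_len(nrow(m2))    # rows belonging to m2
  unname(full[idx1, idx2, drop = FALSE])
}

D <- cross_dist_base(M1, M2)
dim(D)  # rows index M1, columns index M2

# Entry [i, j] is the distance between row i of M1 and row j of M2
isTRUE(all.equal(D[2, 3], sqrt(sum((M1[2, ] - M2[3, ])^2))))
```

The result of `many_dists(M1, M2)` defined above can be compared against `D` with `all.equal()` in the same way.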