---
title: "How to use and build on mevolCVP2"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to use and build on mevolCVP2}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
## Foreword

### History
This tiny package (finally) makes available the idea coined by Allowen Evin and colleagues in the Journal of Archaeological Science in 2013. This approach was previously available as a single script (circa 2013), then wrapped into a package (circa 2018) without any change to the code, and was not released on CRAN. Here, we rethought and rewrote that code to have something easier to build upon.
### The idea

The Evin et al. (2013) paper, among others, used "predictive" discriminant analyses in morphometric analyses, particularly in the context of landmark-based morphometrics, which uses principal component analysis prior to LDA. The idea was two-fold: i) parameterize the optimal number of principal components to retain for further linear discriminant analyses, and ii) evaluate the effect of unbalanced sample sizes between groups on further LDA.

Although this approach deeply changed the way we see, parameterize and run LDAs using morphometric data, it actually boils down, statistically, to the two core ideas below and to the very few, thin functions included in this package. We present the rationale behind this package, then how it works in practice, and finally how to extend it to your own needs.
### The rationale
The rationale of permutational and parameterized discriminant analyses boils down to this: both the number of variables and the balance between group sizes matter for linear discriminant analyses.

- The number of variables in discriminant analyses matters and can be optimized. This is typically the case for landmark approaches, where a "preliminary" PCA is mandatory and its components are later used for LDA. But even for outline-based approaches, where quantitative variables are not orthogonal and independent by construction, their number also matters. In other words, the common view that the accuracy of a linear discriminant analysis increases monotonically with the number of variables you feed it with is wrong. Of course, we leave aside here many other fundamental aspects, such as measurement error: it is neither reasonable nor meaningful (nor even statistically allowed) to use 200 principal components or 60 harmonics to train models that will ultimately be used on degraded archaeological material (or even fresh material).
- The group sample sizes matter. Despite the prior used by default in LDA (e.g. in `MASS::lda`, the class proportions), this is not enough to unbias class predictions and, consequently, global accuracy. In other words, if you have a dataset with 90% of class A and 10% of class B, you will "mechanically" have a better cross-validation accuracy for class A, and likely an optimistic global accuracy too. This is definitely not what we want when we do LDA, particularly when we use trained models to then make predictions, for example in an archaeological context.
We realized both i) and ii) empirically; we are confident that they can probably be explained mathematically, although we never found out how and why. Over the last ~10 years (2014-2024), we thus systematically investigated the number of components (or harmonics) to retain, and used permutations to obtain balanced datasets.

As a side effect, permutational approaches typically allow one to obtain the distribution of accuracy expected under pure randomness, in other words to compare model performances against pure random. For morphometric studies, this allowed us to detect subtle yet significant morphological signals even with relatively low accuracies. In other words, even a model with 60% accuracy (which is far from impressive) may be better than random (i.e. a binomial, or a coin) and thus reveal a true morphological signal, as the quick check below illustrates.
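For illustration only, with made-up numbers and plain base R (nothing from this package), an exact binomial test asks whether an observed accuracy beats a coin flip:

```r
# Made-up numbers: 60 correct cross-validated predictions out of 100
# individuals, in a balanced binary case where chance expects p = 0.5
binom.test(x = 60, n = 100, p = 0.5, alternative = "greater")
# p-value ~0.03: a modest 60% accuracy still beats pure random here
```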
## mevolCVP2 in practice
### Dependencies
First things first, you can install the package with:
```r
# install.packages("pak")
pak::pak("vbonhomme/mevolCVP2")
```
Once it is released on CRAN, the following will install the latest release:

```r
install.packages("mevolCVP2")
```
We first load the package with `library()`. We also specify a random seed, which is a good idea to get the best of both worlds: randomness and replicability across your sessions. We will also use a couple of packages from the tidyverse, so we load them now.
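The setup chunk is thus something along these lines (the seed value below is arbitrary, pick your own):

```r
library(mevolCVP2)
# dplyr, purrr, tidyr and ggplot2 are used down the road
library(tidyverse)
# any fixed value does the trick for replicability
set.seed(2329)
```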
### Data presentation
Here, all analyses will start with a `data.frame` with:

- the first column being the grouping structure, i.e. anything that can be turned into a factor (e.g. a factor or a character);
- at least one other column (or many more) being a quantitative variable.
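For instance, a minimal, made-up `data.frame` of the expected shape would be:

```r
# first column = grouping, remaining columns = quantitative variables
toy <- data.frame(
  group = c("A", "A", "B", "B"),
  x1    = rnorm(4),
  x2    = rnorm(4)
)
```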
Such a dataset is available within the package, named `pig` (see `?pig`) and borrowed from the JAS paper. Let's have a look at the first 3 and last 4 rows, and the first 3 variable columns:
```r
minipig <- pig[c(1:3, 168:171), 1:4]
minipig
#> # A tibble: 7 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 WB -0.0247 0.00722 -0.000179
#> 2 WB -0.00864 0.0240 -0.000265
#> 3 WB 0.0254 -0.000168 0.0101
#> 4 DP 0.0000239 0.0106 0.00964
#> 5 DP -0.0420 0.00126 -0.00753
#> 6 DP -0.0121 0.00378 -0.00159
#> 7 DP -0.0504 -0.000302 0.00555
```
### Helpers

Below we present helpers and recipes that can easily be reused in your own analyses.
#### Randomizing and selecting columns
Three core helpers will ease the common operations:

- `randomize` shuffles the first column, i.e. randomizes the grouping labels;
- `balance` resamples rows using the grouping labels, so that there are `n_ind` individuals in each group. `n_ind` is, by default, the smallest group size. Also, sampling is done with `replace=TRUE` by default;
- `retain` will select only `n_col` of the variable columns (all by default).
```r
randomize(minipig) # look at the first column
#> # A tibble: 7 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 DP -0.0247 0.00722 -0.000179
#> 2 WB -0.00864 0.0240 -0.000265
#> 3 DP 0.0254 -0.000168 0.0101
#> 4 WB 0.0000239 0.0106 0.00964
#> 5 DP -0.0420 0.00126 -0.00753
#> 6 WB -0.0121 0.00378 -0.00159
#> 7 DP -0.0504 -0.000302 0.00555
```
```r
balance(minipig, n_ind=2) # count the rows in each group
#> # A tibble: 4 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 DP 0.0000239 0.0106 0.00964
#> 2 DP -0.0420 0.00126 -0.00753
#> 3 WB -0.00864 0.0240 -0.000265
#> 4 WB 0.0254 -0.000168 0.0101
```
```r
retain(minipig, n_col=2) # count the quantitative variable columns
#> # A tibble: 7 × 3
#> sp PC1 PC2
#> <fct> <dbl> <dbl>
#> 1 WB -0.0247 0.00722
#> 2 WB -0.00864 0.0240
#> 3 WB 0.0254 -0.000168
#> 4 DP 0.0000239 0.0106
#> 5 DP -0.0420 0.00126
#> 6 DP -0.0121 0.00378
#> 7 DP -0.0504 -0.000302
```
To ease permutations, these helpers are wrapped in `partition()`, which returns a list of four different datasets with self-explanatory names:
```r
pigs1 <- partition(minipig, n_ind=3)
pigs1
#> $original
#> # A tibble: 7 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 WB -0.0247 0.00722 -0.000179
#> 2 WB -0.00864 0.0240 -0.000265
#> 3 WB 0.0254 -0.000168 0.0101
#> 4 DP 0.0000239 0.0106 0.00964
#> 5 DP -0.0420 0.00126 -0.00753
#> 6 DP -0.0121 0.00378 -0.00159
#> 7 DP -0.0504 -0.000302 0.00555
#>
#> $random
#> # A tibble: 7 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 DP -0.0247 0.00722 -0.000179
#> 2 DP -0.00864 0.0240 -0.000265
#> 3 WB 0.0254 -0.000168 0.0101
#> 4 DP 0.0000239 0.0106 0.00964
#> 5 WB -0.0420 0.00126 -0.00753
#> 6 DP -0.0121 0.00378 -0.00159
#> 7 WB -0.0504 -0.000302 0.00555
#>
#> $balanced
#> # A tibble: 6 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 DP -0.0504 -0.000302 0.00555
#> 2 DP 0.0000239 0.0106 0.00964
#> 3 DP -0.0420 0.00126 -0.00753
#> 4 WB -0.0247 0.00722 -0.000179
#> 5 WB -0.0247 0.00722 -0.000179
#> 6 WB -0.0247 0.00722 -0.000179
#>
#> $balanced_random
#> # A tibble: 6 × 4
#> sp PC1 PC2 PC3
#> <fct> <dbl> <dbl> <dbl>
#> 1 DP 0.0000239 0.0106 0.00964
#> 2 WB 0.0000239 0.0106 0.00964
#> 3 DP -0.0504 -0.000302 0.00555
#> 4 WB -0.00864 0.0240 -0.000265
#> 5 DP -0.0247 0.00722 -0.000179
#> 6 WB 0.0254 -0.000168 0.0101
```
#### LDA calculation and digestion
The workhorse of this package, `lda1()`, is a thin wrapper on top of `MASS::lda` that directly returns the confusion matrix. `MASS::lda` is called with `CV=TRUE`, which returns leave-one-out cross-validated predictions; these are then digested into a confusion matrix.
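Under the hood, the mechanism is roughly the following (a sketch, not `mevolCVP2`'s actual source code):

```r
# a sketch of the mechanism only, not the package's implementation
lda1_sketch <- function(df) {
  # leave-one-out cross-validated LDA on all but the first column
  mod <- MASS::lda(df[-1], grouping = df[[1]], CV = TRUE)
  # digest the CV predictions into a confusion matrix
  table(actual = df[[1]], predicted = mod$class)
}
```

Let's try it on the full `pig` dataset: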
```r
cm <- lda1(pig)
```
We can calculate accuracy indices on this confusion matrix. We will later see how to obtain additional metrics from it.
```r
acc(cm)
#> acc
#> 0.8947368
acc_classes(cm)
#> DP WB
#> 0.7142857 0.9534884
acc_all(cm)
#> acc DP WB
#> 0.8947368 0.7142857 0.9534884
```
`pig` is "highly" unbalanced in terms of sample sizes between the two groups. We expect a naive LDA on this dataset to be biased, and an LDA on the balanced dataset to be more realistic in terms of class accuracies.
```r
pig %>% lda1() %>% acc_classes()
#> DP WB
#> 0.7380952 0.9534884
pig %>% balance() %>% lda1() %>% acc_classes()
#> DP WB
#> 0.8809524 0.8333333
```
Above, we observe that the class accuracy of the `WB` group (more than 3 times larger than `DP`) is drastically reduced, and it actually performs less well than `DP`.

Whether these values are better than pure random has to be properly tested. We expect, in such a binary case, something around 50/50 for pure random (i.e. a binomial), but we can be far from this ideal case.
We already have pretty much everything we need!
### In action: balancing and permuting

#### A single round
Now, we are going to take advantage of the functional programming tools provided by purrr. If you are not familiar with this package, stop everything and go have a look at it.
```r
pig %>% partition() %>% map(\(.df1) .df1 %>% lda1() %>% acc_all())
#> $original
#> acc DP WB
#> 0.8947368 0.7142857 0.9534884
#>
#> $random
#> acc DP WB
#> 0.6257310 0.1666667 0.7751938
#>
#> $balanced
#> acc DP WB
#> 0.9642857 0.9523810 0.9761905
#>
#> $balanced_random
#> acc DP WB
#> 0.4642857 0.5476190 0.3809524
```
The results above should reflect the expectations and the results we observed before, when doing it "by hand", line after line.
Let's wrap this into a function that accepts any dataset. We also arrange a bit what comes out of the `map`, into a nice, named tibble.
```r
perm1 <- function(x) {
  x %>%
    partition() %>%
    map(\(.x) .x %>% lda1() %>% acc_all()) %>%
    do.call("rbind", .) %>%
    as_tibble(rownames = "partition")
}
```

```r
pig %>% perm1()
#> # A tibble: 4 × 4
#> partition acc DP WB
#> <chr> <dbl> <dbl> <dbl>
#> 1 original 0.895 0.714 0.953
#> 2 random 0.673 0.167 0.837
#> 3 balanced 0.810 0.857 0.762
#> 4 balanced_random 0.321 0.333 0.310
```
Can you see where we're going here? So let the fun begin, and let's replicate these permutations.

#### More than a single round

Here we go for 10 rounds, for the sake of speed while writing and compiling the package, but feel free to increase this number on your side.
We will come back to the full `pig` and use a `map` variant, the `_dfr` version, that directly rbinds the intermediate list into a tibble. Feel free to try the raw `map` below.

Everything that follows will adhere to this idea: combine the few helpers, do permutations using `map` and friends, then summarize and graph.
```r
K <- 10
pig10 <- map_dfr(seq_len(K), \(iter) pig %>% perm1())
head(pig10)
#> # A tibble: 6 × 4
#> partition acc DP WB
#> <chr> <dbl> <dbl> <dbl>
#> 1 original 0.901 0.738 0.953
#> 2 random 0.637 0.143 0.798
#> 3 balanced 0.821 0.929 0.714
#> 4 balanced_random 0.571 0.595 0.548
#> 5 original 0.895 0.714 0.953
#> 6 random 0.637 0.190 0.783
```
Let's summarise this a bit with `dplyr`, so that we can compute the median and the SD:
```r
pig10 %>%
  group_by(partition) %>%
  summarize(across(everything(), c("median"=median, "sd"=sd)))
#> # A tibble: 4 × 7
#> partition acc_median acc_sd DP_median DP_sd WB_median WB_sd
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 balanced 0.833 0.0413 0.881 0.0408 0.786 0.0659
#> 2 balanced_random 0.470 0.0764 0.488 0.0856 0.464 0.0985
#> 3 original 0.901 0.00302 0.738 0.0123 0.953 0
#> 4 random 0.637 0.0260 0.190 0.0778 0.791 0.0203
```
#### Summary and graph
Alternatively, to get a more compact table:
```r
mean_sd <- function(x) paste0(round(mean(x), 3), "±", round(sd(x), 3))

pig10 %>%
  group_by(partition) %>%
  summarize(across(everything(), mean_sd))
#> # A tibble: 4 × 4
#> partition acc DP WB
#> <chr> <chr> <chr> <chr>
#> 1 balanced 0.83±0.041 0.871±0.041 0.788±0.066
#> 2 balanced_random 0.487±0.076 0.49±0.086 0.483±0.099
#> 3 original 0.898±0.003 0.729±0.012 0.953±0
#> 4 random 0.649±0.026 0.19±0.078 0.798±0.02
```
We can also have a nice graph, after a little pivoting step to please `ggplot2`:
```r
pig10 %>%
  # order partition levels for graphical purposes
  mutate(partition=factor(partition,
                          levels=c("original", "random", "balanced_random", "balanced"),
                          ordered=TRUE)) %>%
  # pivot the tibble to a longer form
  pivot_longer(acc:last_col(), names_to = "what") %>%
  ggplot() +
  aes(x=partition, y=value, fill=what) +
  geom_boxplot() +
  theme_minimal()
```
The `original` result is the one you would obtain without touching your original dataset: global accuracy is overestimated, as well as that of the `WB` class. When you look at `balanced`, this is corrected. This model clearly performs better than `balanced_random`, which is pure random: sounds like there is a strong morphological signal in there! A quick way to put a number on that claim is sketched below.
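Here is a sketch of an empirical p-value, reusing the `pig10` rounds computed above (more rounds would give a finer-grained value):

```r
# how often does a purely random, balanced round reach the median
# accuracy of the balanced (non-random) rounds?
obs  <- median(pig10$acc[pig10$partition == "balanced"])
null <- pig10$acc[pig10$partition == "balanced_random"]
mean(null >= obs)  # ~0 here, given the boxplots above
```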
### In action: adding variable selection
Let's refine the `perm1` function a bit, to allow selecting more or fewer variables to feed the LDA with. We simply add an `n_col` argument that we pass to `retain()`, called before `partition()`:
```r
perm2 <- function(x, n_col) {
  x %>%
    retain(n_col) %>%
    partition() %>%
    map(\(.x) .x %>% lda1() %>% acc_all()) %>%
    do.call("rbind", .) %>%
    as_tibble(rownames = "partition")
}
```
We expect 20 components to perform better than a single one. Let's check this on a single round, and indeed:
```r
pig %>% perm2(1)
#> # A tibble: 4 × 4
#> partition acc DP WB
#> <chr> <dbl> <dbl> <dbl>
#> 1 original 0.813 0.405 0.946
#> 2 random 0.754 0 1
#> 3 balanced 0.774 0.762 0.786
#> 4 balanced_random 0.548 0.5 0.595

pig %>% perm2(20)
#> # A tibble: 4 × 4
#> partition acc DP WB
#> <chr> <dbl> <dbl> <dbl>
#> 1 original 0.912 0.786 0.953
#> 2 random 0.684 0.0952 0.876
#> 3 balanced 0.940 0.952 0.929
#> 4 balanced_random 0.524 0.5 0.548
```
We can easily write this into a `map`. We also record the number of variables, and make it directly return a single, rbinded data.frame with `map_dfr`:
```r
z <- map_dfr(1:20, \(n_col) pig %>% perm2(n_col) %>% mutate(retain=n_col, .before=1))
head(z)
#> # A tibble: 6 × 5
#> retain partition acc DP WB
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 original 0.813 0.405 0.946
#> 2 1 random 0.754 0 1
#> 3 1 balanced 0.798 0.810 0.786
#> 4 1 balanced_random 0.560 0.619 0.5
#> 5 2 original 0.807 0.452 0.922
#> 6 2 random 0.754 0 1
```
To replicate this calculation K times, we simply wrap it into an additional layer of `map`:
```r
K <- 10
z <- map_dfr(seq_len(K),
             \(iter) map_dfr(1:20, \(n_col) pig %>% perm2(n_col) %>% mutate(retain=n_col, .before=1)) %>%
               mutate(iter=iter, .before=1))
head(z)
#> # A tibble: 6 × 6
#> iter retain partition acc DP WB
#> <int> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 1 original 0.813 0.405 0.946
#> 2 1 1 random 0.754 0 1
#> 3 1 1 balanced 0.821 0.714 0.929
#> 4 1 1 balanced_random 0.536 0.452 0.619
#> 5 1 2 original 0.807 0.452 0.922
#> 6 1 2 random 0.754 0 1
```
Let’s plot this beauty:
```r
z %>%
  pivot_longer(acc:last_col(), names_to="metric") %>%
  ggplot() +
  aes(x=retain, y=value, col=partition) +
  geom_point(size=0.1) +
  geom_smooth() +
  facet_grid(~metric) +
  xlab("number of PC retained") +
  ylab("CV accuracy") +
  theme_minimal()
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
```
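From there, extracting an "optimal" number of components is a one-liner. For instance, a sketch reusing the `z` tibble from above, keeping the `n_col` that maximizes the median cross-validated accuracy on the balanced partition:

```r
# which number of retained PCs maximizes the median CV accuracy
# of the balanced partition?
z %>%
  filter(partition == "balanced") %>%
  group_by(retain) %>%
  summarise(acc_median = median(acc)) %>%
  slice_max(acc_median, n = 1)
```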