Systematic co-variation of monophthongs across speakers of New Zealand English

Supplementary materials: Analysis script

James Brand¹, Jen Hay^1,2, Lynn Clark², Kevin Watson² & Márton Sóskuthy³

¹New Zealand Institute for Language, Brain and Behaviour, Univeristy of Canterbury, NZ
²Department of Linguistics, Univeristy of Canterbury, NZ
³Department of Linguistics, The University of British Columbia, CA

Corresponding author: James Brand
Email: james.brand@canterbury.ac.nz
Website: https://jamesbrandscience.github.io

12 May, 2020

Document outline

This document provides the code used in the analyses of the Brand, Hay, Clark, Watson and Sóskuthy (2020) manuscript, submitted to the Journal of Phonetics. It contains the various analysis steps reported in the paper, as well as additional analyses that the authors completed but were not considered central to the manuscript’s core research questions, they are included here in case readers are interested.

Whilst every attempt has been made to make the code transparent, clear and comprehensible to all readers, regardless of your proficiency with using R or the statistical procedures applied in the analyses, if there are questions/queries/issues that do arise, please do contact the corresponding author (contact details at the top of the document).

Note that in the project repository, large and computationally expensive processes, such as the GAMM modelling, have been pre-run and important data stored in the Data folder. This has been done to ensure the compilation of this file is achieved relatively quickly and can be hosted online (i.e. via GitHub and OSF), in addition to allowing others to have quick access to all the required data. These steps are included in this file and can be run on your own computer to reproduce all the original files. When pre-run steps have been carried out, they are identifiable in the .Rmd file by the chunks having an eval=FALSE argument. If you are running these chunks, please ensure you have sufficient memory avialable (I require 13.18GB to store the Analysis folder, when all models are saved).

cat(paste0("Start time:\n", format(Sys.time(), "%d %B %Y, %r")))

## Start time:
## 12 May 2020, 05:10:13 pm

start_time <- Sys.time()

Analysis steps

The document covers a number of steps that we completed, all of which can be reproduced by using the code and data in the project repository (LINK). In order to orientate the reader, we provide a brief written outline of what the steps are.

Load in the data and provide summaries of the how it is structured.
Apply a new normalisation procedure (Lobanov 2.0) to the formant measurements.
Run a series of GAMMs that model the normalised values (per formant and per vowel), with fixed effects of speaker year of birth, gender and speech rate. These models will then be used to extract the by-speaker random intercepts, which we use as estimates of how innovative a speaker’s realisations of each vowel are in terms of F1/F2, whilst keeping the fixed effects constant.
Run a principal components analysis (PCA) on the speaker intercepts data. Then inspect the eigen values of each of the principal components (PCs), this will allow us to determine which PCs account for sufficient variance to be meaningfully interpreted.
Extract the PCA scores from the PCs, which give each individual speaker a value for each PC, the more extreme (i.e. high or low) this value, the more the speaker contributes to the PC’s formation. This will allow us to identify which speakers are representative of the PCs.
Assess if any of the PCs can be explained by the fixed-effects from the GAMM fitting procedure, i.e. is there a relationship between the PCA scores and factors such as year of birth or gender. We will provide examples of speaker vowel spaces to assist in the interpreation of the PCs in terms of F1/F2 space (the Shiny app allows for exploration of all speakers, so we recommend that as the optimal tool for understanding speaker variation LINK).
Following on from the previous inspection of the variables, our interpretation for how they co-vary together within a more theoretical framework (as explained in the paper), was driven by our understanding of the directions of change in F1/F2. To demonstrate this we will run additional GAMMs predicting F1/F2 by the PCA scores. Then visualise how these changes map onto change in New Zealand English.

Pre-requisites

Purpose: Install libraries and load data

In order for the code in this document to work, the following packages are required to be installed and loaded into your R session. If you do not have any of the packages installed, you can run install.packages("PACKAGE NAME") which should resolve any warning messages you might get (change “PACKAGE NAME” to the required package name, e.g. install.packages("tidyverse")).

A large portion of the code in this document is written in a tidy way, this means that it (tries to) always use the tidyverse functions when possible, if you are new to using R or are more familiar with R’s base packages, see http://tidyverse.tidyverse.org/ for a full reference guide.

Similarly, if there are any functions that you are not familiar with/would like more information on (or the arguments to those functions), simply press F1 whilst your cursor is clicked anywhere on the name of the function, this will bring up the help page in RStudio (note this will only work if you are using the .rmd version of this file and not the .html).

For more general information on R Markdown documents and how they work see https://rmarkdown.rstudio.com/index.html

Libraries

The following libraries are required for the document to be run.

library(lme4) #mixed-effects models
library(rms) #fitting restricted cubic splines
library(cowplot) #plotting functions
library(tidyverse) #lots of things
library(kableExtra) #outputting nice tables
library(factoextra) #pca related things
library(ggrepel) #more plotting things
library(gganimate) #animation plotting
library(lmerTest) #p values from lmer models
library(DT) #interactive data tables
library(mgcv) #gamms
library(itsadug) #additional gamm things
library(scales) #rescale functions
library(gifski) #needed to generate gif
library(circlize) #chord diagram
library(PerformanceAnalytics) #correlation figure

#this is important for reproduction of any stochastic computations
set.seed(123)

#check information about R session, this will give details of the R setup on the authors computer at the time of running. If any of the outputs are not reproduced as in the html file produced from this markdown document (see repository), there may be differences in the package versions or setup on your computer. You can update packages by running utils::update.packages()
sessionInfo()

## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] PerformanceAnalytics_2.0.4 xts_0.12-0                
##  [3] zoo_1.8-7                  circlize_0.4.8            
##  [5] gifski_0.8.6               scales_1.1.0              
##  [7] itsadug_2.3                plotfunctions_1.3         
##  [9] mgcv_1.8-31                nlme_3.1-147              
## [11] DT_0.13                    lmerTest_3.1-2            
## [13] gganimate_1.0.5            ggrepel_0.8.2             
## [15] factoextra_1.0.7           kableExtra_1.1.0          
## [17] forcats_0.5.0              stringr_1.4.0             
## [19] dplyr_0.8.5                purrr_0.3.4               
## [21] readr_1.3.1                tidyr_1.0.2               
## [23] tibble_3.0.1               tidyverse_1.3.0           
## [25] cowplot_1.0.0              rms_5.1-4                 
## [27] SparseM_1.78               Hmisc_4.4-0               
## [29] ggplot2_3.3.0              Formula_1.2-3             
## [31] survival_3.1-12            lattice_0.20-41           
## [33] lme4_1.1-23                Matrix_1.2-18             
## 
## loaded via a namespace (and not attached):
##  [1] TH.data_1.0-10      minqa_1.2.4         colorspace_1.4-1   
##  [4] ellipsis_0.3.0      htmlTable_1.13.3    GlobalOptions_0.1.1
##  [7] base64enc_0.1-3     fs_1.4.1            rstudioapi_0.11    
## [10] farver_2.0.3        MatrixModels_0.4-1  fansi_0.4.1        
## [13] mvtnorm_1.1-0       lubridate_1.7.8     xml2_1.3.2         
## [16] codetools_0.2-16    splines_3.6.3       knitr_1.28         
## [19] jsonlite_1.6.1      nloptr_1.2.2.1      broom_0.5.6        
## [22] cluster_2.1.0       dbplyr_1.4.3        png_0.1-7          
## [25] compiler_3.6.3      httr_1.4.1          backports_1.1.6    
## [28] assertthat_0.2.1    cli_2.0.2           tweenr_1.0.1       
## [31] acepack_1.4.1       htmltools_0.4.0     quantreg_5.55      
## [34] prettyunits_1.1.1   tools_3.6.3         gtable_0.3.0       
## [37] glue_1.4.0          Rcpp_1.0.4          cellranger_1.1.0   
## [40] vctrs_0.2.4         xfun_0.13           rvest_0.3.5        
## [43] lifecycle_0.2.0     statmod_1.4.34      polspline_1.1.17   
## [46] MASS_7.3-51.6       hms_0.5.3           sandwich_2.5-1     
## [49] RColorBrewer_1.1-2  yaml_2.2.1          gridExtra_2.3      
## [52] rpart_4.1-15        latticeExtra_0.6-29 stringi_1.4.6      
## [55] checkmate_2.0.0     boot_1.3-25         shape_1.4.4        
## [58] rlang_0.4.5         pkgconfig_2.0.3     evaluate_0.14      
## [61] htmlwidgets_1.5.1   tidyselect_1.0.0    magrittr_1.5       
## [64] R6_2.4.1            generics_0.0.2      multcomp_1.4-13    
## [67] DBI_1.1.0           pillar_1.4.3        haven_2.2.0        
## [70] foreign_0.8-75      withr_2.2.0         nnet_7.3-14        
## [73] modelr_0.1.6        crayon_1.3.4        rmarkdown_2.1      
## [76] jpeg_0.1-8.1        progress_1.2.2      grid_3.6.3         
## [79] readxl_1.3.1        data.table_1.12.8   reprex_0.3.0       
## [82] digest_0.6.25       webshot_0.5.2       numDeriv_2016.8-1.1
## [85] munsell_0.5.0       viridisLite_0.3.0   quadprog_1.5-8

Data

Purpose: Understand the structure of the dataset

All data has been made available to reproduce the results, the data file should be located in a folder called data within the folder/directory this file is saved in. We will store the data as an R data frame called vowels_all. Note the data is saved as a .rds file, this is essentially the same as a normal .csv file, but is more efficient when working in R. If you wish to reuse the data in a different format, it is recommended that you load in the data and then export it to your preferred format, e.g. using the write.csv() function for .csv files.

#load in the data
vowels_all <- readRDS("Data/ONZE_vowels_filtered_anon.rds")

We can inspect the data in different ways, to ensure that the correct file has been loaded and for general understanding of how the data is structured.

Variables

Let’s inspect the variables…

We should have 10 variables.

Definitions of each variable are given below (factors are represented as fct with the number of unique levels also provided e.g. fct, 2 represents a factor with 2 unique values, numeric variables are represented as num, with the smallest and largest values provided, e.g. num, (1857-1988)):

Variable	Description	Class
Speaker	The speaker ID (format: corpus_gender_distinctnumber, e.g. IA_f_001	fct (481)
Transcript	The transcript number of the speaker, e.g. IA_f_001-01.trs	fct (2523)
Corpus	The sub-corpus the data comes from, i.e. either MU, IA, Darfield or CC	fct (4)
Gender	The gender of the speaker, i.e. either F for female or M for male	fct (2)
participant_year_of_birth	The year the participant was born in e.g. 1864	num (1864-1982)
Word	The word form of the token, this is anonymised (format: word_distinctnumber, e.g. word_00002	fct (14632)
Vowel	The vowel of the token, using Well’s notation, e.g. FLEECE	fct (10)
F1_50	The raw F1 of the vowel in Hz, taken at the mid-point, e.g. 500	num (58-999)
F2_50	The raw F2 of the vowel in Hz, taken at the mid-point, e.g. 1500	num (320-3453)
Speech_rate	The speech rate in syllbales per second for the transcript, e.g. 1.7929	num (0.1365-15.2451)

Next, we can generate some summary information about the dataset, the following code will reproduce the information provided in Tables XX.

Token counts

There are 10 different vowels in the data, a summary of the number of tokens per vowel is given below.

Originally, we extracted 12 vowels, comprising the 10 summarised below, but also SCHWA and FOOT, these were removed during the data cleaning stage, SCHWA was removed as we are only analysing stressed tokens and the number of speakers with low N tokens for FOOT would have led to large loss in the number of speakers in the data.

Vowel	N Tokens	%
NURSE	16891	4.1
START	21217	5.1
GOOSE	26432	6.4
THOUGHT	28201	6.8
TRAP	32284	7.8
LOT	35228	8.5
FLEECE	49757	12.0
STRUT	50907	12.3
DRESS	69925	16.9
KIT	83837	20.2
Total	414679	100.0

Sub-corpora

The ONZE dataset comprises four different sub-corpora:

MU - Mobile Unit IA - Intermediate Archive Darfield - Darfield Corpus CC - Canterbury Corpus

Below is a summary of the demographic information for each of the sub-corpora.

Corpus	N Tokens	%	N Speakers	Female	Male	Year of Birth Range
MU	59059	14.2	54	13	41	1864 - 1904
IA	97620	23.5	54	29	25	1891 - 1963
Darfield	57527	13.9	25	14	11	1918 - 1980
CC	200473	48.3	348	169	179	1926 - 1982

Speakers

The distribution of speakers by gender is given in the histogram below.

#histogram of the speakers by year of birth and gender
vowels_all %>%
  select(Speaker, Gender, participant_year_of_birth) %>%
  distinct() %>%
  ggplot(aes(x = participant_year_of_birth, fill = Gender, colour = Gender)) +
  geom_histogram(aes(position="identity"),
                 binwidth=1,
                 alpha = 0.8, colour = NA) +
  geom_rug(alpha = 0.2) +
  scale_x_continuous(breaks = seq(1860, 1990, 15)) +
  scale_fill_manual(values = c("black", "green")) +
  scale_color_manual(values = c("black", "green")) +
  geom_label(data = vowels_all %>% filter(participant_year_of_birth > 1863 & participant_year_of_birth < 1983) %>% select(Speaker, Gender, participant_year_of_birth) %>% distinct() %>% group_by(Gender) %>% summarise(n = n()), aes(x = 1864, y = 20, label = paste0("N female = ", n[1], "\nN male = ", n[2], "\nN total = ", sum(n), "\nyob range: ", min(vowels_all$participant_year_of_birth), " - ", max(vowels_all$participant_year_of_birth))), hjust=0, inherit.aes = FALSE) +
  theme_bw() +
  theme(legend.position = "top")

Below we provide summary information about each of the speakers token counts per vowel. This table comprises all speakers in the dataset and can be ordered and searched like a spreadsheet.

Normalisation

We will now normalise the raw F1 and F2 values.

Here, we introduce an adapted version of the Lobanov (1971) normalisation method, which we refer to as Lobanov 2.0. Explanations of the formula for each of the methods (Lobanov and Lobanov 2.0) are given below. Please refer to the paper for reasons why this adapted version was preferred to Lobanov’s original normalisation method.

Lobanov formula:

\[ \begin{equation} F_{lobanov_i} = \frac{(F_{raw_i}-\mu_{raw_i})}{\sigma_{raw_i}} \end{equation} \]

$i$ = either F1 or F2
$F_{lobanov_i}$ = the normalised value in $i$
$F_{raw_i}$ = the raw formant measurement value in $i$
$\mu_{raw_i}$ = the mean formant value calculated across all raw values in $i$
$\sigma_{raw_i}$ = the standard deviation calculated across all raw values in $i$

In plain English, the formula subtracts the mean formant value of a speaker from the raw individual formant value, then divides that by the standard deviation of the formant values.

e.g. if a speaker has a raw F1 of 400hz, a mean F1 of 500hz and a standard deviation of 70hz, this would give a Lobanov normalised value of (400-500)/70 = -1.43.

Lobanov 2.0 formula:

\[ \begin{equation} F_{lobanov2.0_i} = \frac{(F_{raw_i}-\mu_{(\mu_{vowel_1},\cdots,\mu_{vowel_n})})}{\sigma_{(\mu_{vowel_1},\cdots,\mu_{vowel_n})}} \end{equation} \]

$i$ = either F1 or F2
$F_{lobanov2.0_i}$ = the normalised value in $i$
$F_{raw_i}$ = the raw formant measurement value in $i$
$\mu_{(\mu_{vowel_1},\cdots,\mu_{vowel_n})}$ = the mean taken from the mean formant value calculated per vowel in $i$
$\sigma_{(\mu_{vowel_1},\cdots,\mu_{vowel_n})}$ = the standard deviation taken from the mean formant value calculated per vowel in $i$

In plain English, the formula subtracts the mean of means formant value of a speaker (calculated as the mean of means, where a mean for each vowel is calculated, then the mean taken of those means) from the raw individual formant value, then divides that by the standard deviation of the mean of mean values.

e.g. if a speaker has a raw F1 of 400hz, a mean of means F1 of 550hz and a standard deviation (for the mean of means) of 70hz, this would give a Lobanov 2.0 normalised value of (400-550)/70 = -2.14.

Implementation

The primary difference between the formula for this adapted version and Lobanov’s original formula, is that each of the vowels has a mean formant value calculated, then a mean of those means is taken as the mean in the formula. The motivation for doing this is that the data we are normalising contains speakers with varying numbers of tokens across the different vowels. Lobanov’s method is suited (and designed) based on balanced data, where an equal number of tokens per vowel are normalised.

When normalising with unbalanced numbers of tokens per vowel, the calculation of $\mu_{raw_i}$ (the mean of all the raw formant values), can be skewed by tokens that have a much larger count in a certain vowel.

Therefore, we first calculate means for each of the individual vowels (per speaker, per formant), then calculate the mean based on those means. This approach allows for tokens in vowel categories to be weighted equally regardless of how many tokens there are, making the normalisation more reliable for this type of dataset.

For visualisation purposes, we plot the normalised values for F1 and F2 against each other in the plots below (Lobanov 2.0 is on the x axis and Lobanov on the y axis, with coloured lines representing each speaker, the black line represents where the values would be if they were equal, i.e. if Lobanov 2.0 = Lobanov)

#standard Lobanov normalisation - calculate means across all vowels per speaker
summary_vowels_all_lobanov <- vowels_all %>%
  group_by(Speaker) %>%
  dplyr::summarise(mean_F1_lobanov = mean(F1_50),
            mean_F2_lobanov = mean(F2_50),
            sd_F1_lobanov = sd(F1_50),
            sd_F2_lobanov = sd(F2_50),
            token_count = n())

#Lobanov 2.0 - calculate means per vowel and per speaker
summary_vowels_all <- vowels_all %>%
  group_by(Speaker, Vowel) %>%
  dplyr::summarise(mean_F1 = mean(F1_50),
            mean_F2 = mean(F2_50),
            sd_F1 = sd(F1_50),
            sd_F2 = sd(F2_50),
            token_count_vowel = n())

#get the mean_of_means and sd_of_means from the the speaker_summaries, this will give each speaker a mean caculated from the means across all vowels, as well as the standard deviation of the means
summary_mean_of_means <- summary_vowels_all %>%
  group_by(Speaker) %>%
  dplyr::summarise(mean_of_means_F1 = mean(mean_F1),
            mean_of_means_F2 = mean(mean_F2),
            sd_of_means_F1 = sd(mean_F1),
            sd_of_means_F2 = sd(mean_F2)
            )

#combine these values with the full raw dataset, then use these values to normalise the data with both the Lobanov and the Lobanov 2.0 method
vowels_all <- vowels_all %>%
  #add in the data
  left_join(., summary_mean_of_means) %>%
  left_join(., summary_vowels_all[, c("Speaker", "Vowel", "token_count_vowel")]) %>%
  left_join(., summary_vowels_all_lobanov) %>%
  #normalise the raw F1 and F2 values with Lobanov
  mutate(F1_lobanov = (F1_50 - mean_F1_lobanov)/sd_F1_lobanov,
         F2_lobanov = (F2_50 - mean_F2_lobanov)/sd_F2_lobanov,
  #normalise with Lobanov 2.0
         F1_lobanov_2.0 = (F1_50 - mean_of_means_F1)/sd_of_means_F1,
         F2_lobanov_2.0 = (F2_50 - mean_of_means_F2)/sd_of_means_F2) %>%
  #remove the variables that are not required
  dplyr::select(-(mean_of_means_F1:sd_of_means_F2), -(mean_F1_lobanov:sd_F2_lobanov))

#remove the previous summary data frames
rm(summary_vowels_all_lobanov, summary_vowels_all, summary_mean_of_means)

#inspect the relationship between the two normalised values
vowels_all %>%
  ggplot(aes(x = F1_lobanov_2.0, y = F1_lobanov, colour = Speaker)) +
  geom_smooth(method = "lm", size = 0.1, show.legend = FALSE) +
  geom_abline(slope=1, intercept=0) +
  theme_bw()

vowels_all %>%
  ggplot(aes(x = F2_lobanov_2.0, y = F2_lobanov, colour = Speaker)) +
  geom_smooth(method = "lm", size = 0.1, show.legend = FALSE) +
  geom_abline(slope=1, intercept=0) +
  theme_bw()

GAMM modelling

In order to analyse co-variation in the dataset, we first must extract a measure of how the speakers vocalic variables differ from one another. To achieve this, we first run a series of Generalised Additive Mixed Models (GAMMs), from which we can extract the by-speaker random intercepts. This is done using the mgcv and itsadug packages, if you are unfamiliar with this form of analysis, see (Winter and Wieling, (2016))[https://academic.oup.com/jole/article/1/1/7/2281883] or (Sóskuthy, (2017))[https://arxiv.org/abs/1703.05339], for further information about why we chose the by-speaker intercepts, please refer to the manuscript or see Drager and Hay (2012)

In total there will be 20 separate models (10 vowels x 2 formants) that will be fitted, each of which we will extract the random intercepts from the random effect of Speaker, as well as the model summary.

Fitting procedure

Each of the models will use the data from one of the 10 vowels (in the Vowel variable) and will have either the F1_lobanov_2.0 or the F2_lobanov_2.0 variable as the dependent/predicted measure.

All models will be fit uniformly, i.e. with the same fixed and random effects structures.

The fixed effects are:

An interaction between participant_year_of_birth and Gender
participant_year_of_birth
Gender
Speech_rate

The random effects are:

Speaker
Word

The participant_year_of_birth variable is modeled with a smooth term with 10 knots, this is to account for the non-linear ‘wiggliness’ of the effect.

To run the models in an efficient way and store the by-speaker intercepts, we use a for loop to iterate through each of the vowels, extracting the intercepts from each model and adding them to a data frame.

A for loop works by iterating over each value in a series, here we will loop through each value in our Vowels variable and extract the relevant information.

e.g the for loop will start with DRESS, run the GAMM for F1, extract the by-speaker intercepts from that model, it will then run the GAMM for F2, extract the speaker intercepts from this model, then add the 2 sets of intercepts to a data frame (gam_intercepts.tmp). The loop will then move on to the next vowel, FLEECE and do exactly the same process. The loop will finish once all vowels have been ‘looped’ through.

This will result in a data frame comprising:

494 rows (one row per speaker)
1 column identifying the speaker name
20 additional columns identifying the variable being modeled (e.g. F1_DRESS), the numeric values here represent the by-speaker intercepts from that variable’s model

Note, this process takes several hours (six and half hours on my machine) to complete. The output has been stored in files in the GAMM_output folder, for quick reference. Please see those files or load them in to your R session for the rest of the analysis if you do not run the following code chunk.

The intercepts are saved as gamm_intercepts.csv, the model summaries can be found in the model_summaries sub-folder, where each model summary is stored as a .rds file, e.g. gam_summary_F1_DRESS.rds contains the model summary for F1_DRESS.

#update the Gender variable to allow for conrast coding
vowels_all$Gender <- as.ordered(vowels_all$Gender)
contrasts(vowels_all$Gender) <- "contr.treatment"

vowels_all <- vowels_all %>%
  arrange(as.character(Speaker))

#create a data frame to store the intercepts from the models, this will initially contain just the speaker names
gam_intercepts.tmp <- vowels_all %>%
  dplyr::select(Speaker) %>%
  distinct()

#loop through the vowels
cat(paste0("Start time:\n", format(Sys.time(), "%d %B %Y, %r\n")))

for (i in levels(factor(vowels_all$Vowel))) {
  
  #F1 modelling
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F1 for FLEECE
  gam.F1 <- bam(F1_lobanov_2.0 ~
                  s(participant_year_of_birth, k=10, bs="ad", by=Gender) +
                  s(participant_year_of_birth, k=10, bs="ad") +
                  Gender +
                  s(Speech_rate) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #extract the speaker intercepts from the model and store them in a temporary data frame
  gam.F1.intercepts.tmp <- as.data.frame(get_random(gam.F1)$`s(Speaker)`)
  
  #assign the model to an object
  assign(paste0("gam_F1_", i), gam.F1)
  
  #save the model summary
  saveRDS(gam.F1, file = paste0("/Users/james/Documents/GitHub/model_summaries/gam_F1_", i, ".rds"))
  
  cat(paste0("F1_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n")) #print the vowel the loop is up to for F1, as well as the start time for the model
  
  #F2 modelling
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F2 for FLEECE
  gam.F2 <- bam(F2_lobanov_2.0 ~
                  s(participant_year_of_birth, k=10, bs="ad", by=Gender) +
                  s(participant_year_of_birth, k=10, bs="ad") +
                  Gender +
                  s(Speech_rate) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #extract the speaker intercepts again, storing them in a separate data frame
  gam.F2.intercepts.tmp <- as.data.frame(get_random(gam.F2)$`s(Speaker)`)
  
  #assign the model to an object
  assign(paste0("gam_F2_", i), gam.F2)
  
  #save the model summary
  saveRDS(gam.F2, file = paste0("/Users/james/Documents/GitHub/model_summaries/gam_F2_", i, ".rds"))

  #rename the variables so it clear which one has F1/F2, i.e. this will give F1_FLEECE, F2_FLEECE
  names(gam.F1.intercepts.tmp) <- paste0("F1_", i)
  names(gam.F2.intercepts.tmp) <- paste0("F2_", i)
  
  #combine the intercepts for F1 and F2 and store them in the intercepts.tmp_stress data frame
  gam_intercepts.tmp <- cbind(gam_intercepts.tmp, gam.F1.intercepts.tmp, gam.F2.intercepts.tmp)
  
  cat(paste0("F2_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n")) #print the vowel the loop is up to for F2 , as well as the start time for the model
}

#save the intercepts as a .csv file
write.csv(gam_intercepts.tmp, "Data/gam_intercepts_tmp_new.csv", row.names = FALSE)

Read in pre-run model results

In order to save time running through the GAMM modelling chunk above, the results have been stored in the repository for quick loading. The below code will load the files in to your R session.

#load in model intercepts
gam_intercepts.tmp <- read.csv("Data/gam_intercepts_tmp_new.csv")

#make vector containing all .rds filenames from model_summaries folder
model_summary_files = list.files(pattern="*.rds", path = "/Users/james/Documents/GitHub/model_summaries")

#load each of the files with for loop
for (i in model_summary_files) {
  cat(paste0(i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n")) 
  assign(gsub(".rds", "", i), readRDS(paste0("/Users/james/Documents/GitHub/model_summaries/", i)))
}

Understanding the intercepts

In order to better understand these intercepts and how they represent each speaker’s position in relation to the population, it can be helpful to visualise the speaker intercepts in relation to the vowel space.

We interpret the speaker intercepts in terms of how advanced the vowel productions are in relation to other speakers with similar fixed-effects - in other words, if a speaker has a large positive intercept from a model (with a fixed-effects structure as that used in our above modelling procedure), this would indicate that the speaker is producing formant values that are typically larger than other speakers with similar fixed-effects values, e.g. year of birth, gender etc. Likewise, if the intercept is negative, this would indicate that their F values are smaller, the closer the intercept is to 0, the more typical the speaker is (taking into account the fixed-effects).

To demonstrate this, we will visualise the change in F1 for three vowels, known to have undergone rapid change in New Zealand English (TRAP, DRESS and KIT). We will plot the participant year of birth on the x axis and normalised F1 on the y axis (reverse scaled to match normal vowel plot conventions). To highlight the change in the three vowels over time, we will also fit a smooth and plot the mean F1 values of each speaker for each of the vowels. Finally, we will highlight 4 speakers who all have intercepts that indicate that they are advanced in these vowel changes, i.e. have negative intercepts (smaller F1) for TRAP and DRESS, but positive intercepts (larger F1) for KIT.

The code blocks below will reproduce Figure XX in the manuscript. The first chunk calculates mean F1 values / speaker and merges this data with the speaker intercepts from the GAMMs.

#calculate the speaker mean F1 for each of the vowels
speakers1 <- vowels_all %>%
  select(Speaker, participant_year_of_birth, Vowel, F1_lobanov_2.0) %>% #choose the variables of interest
  filter(Vowel %in% c("KIT", "DRESS", "TRAP")) %>% #choose the vowels of interest
  group_by(Speaker, participant_year_of_birth, Vowel) %>% #group by these variable
  dplyr::summarise(mean = mean(F1_lobanov_2.0)) #calculate mean F1 values

#merge the speaker intercepts and tidy up for plotting
speakers <- gam_intercepts.tmp %>%
  dplyr::select(Speaker, F1_KIT, F1_DRESS, F1_TRAP) %>% #choose the variables of interest
  pivot_longer(F1_KIT:F1_TRAP, names_to = "Vowel", values_to = "intercept") %>% #transform the data into long format
  mutate(Vowel = gsub("F1_", "", Vowel)) %>% #remove the F1_ prefix from the vowel names
  dplyr::inner_join(speakers1, ., by = c("Speaker", "Vowel")) %>% #merge the data with the previous data frame
  mutate(mean = round(mean, 3), #round the mean
         intercept = round(intercept, 3), #round the intercepts
         direction = ifelse(intercept < 0, "negative", "positive"), #create a direction variable based on intercept sign
         Vowel = factor(Vowel, levels = c("TRAP", "DRESS", "KIT")),
         Speaker1 = paste0(Speaker, " ", intercept)) #order the vowels for nicer plotting

speakers1 <- speakers %>%
  ungroup() %>%
  arrange(participant_year_of_birth) %>%
  filter(Speaker %in% c("MU_m_348", "IA_f_341", "CC_m_167", "CC_f_297")) %>%
  mutate(letters = c(rep("A", 3), rep("B", 3), rep("C", 3), rep("D", 3)), speakers1 = paste0(letters, " (", intercept, ")"))

The second chunk extracts predictions from the GAMMs for plotting smooths over year of birth. (This code block produces plots that are not included in the RMarkdown output).

vowels_to_plot <- c("KIT", "DRESS", "TRAP")

pred_table <- function (Vowel) {
  mod_name <- paste0("gam_F1_", Vowel)
  return(cbind(Vowel, 
               plot_smooth(get(mod_name), view="participant_year_of_birth", 
                           rm.ranef=T, rug=F, n.grid=100)$fv)
  )
}

gamm_preds_to_plot <- lapply(vowels_to_plot, pred_table) %>%
  do.call(rbind, .)

saveRDS(gamm_preds_to_plot, "Data/Models/gamm_preds_to_plot.rds")

gamm_preds_to_plot <- readRDS("Data/Models/gamm_preds_to_plot.rds")

We now plot the data.

#plot the data
F1_speaker_intercepts <- vowels_all %>%
  filter(Vowel %in% c("KIT", "DRESS", "TRAP")) %>% #choose the vowels
  mutate(Vowel = factor(Vowel, levels = c("TRAP", "DRESS", "KIT"))) %>% #order the vowels
  ggplot(aes(participant_year_of_birth, F1_lobanov_2.0, colour = Vowel, fill = Vowel)) + #plot year of birth by normalised F1
  geom_line(data=gamm_preds_to_plot, aes(y=fit)) +
  geom_ribbon(data=gamm_preds_to_plot, aes(ymin=ll, ymax=ul, y=NULL), col=NA, alpha=0.2) +
  geom_point(data = speakers, #add speaker means
             aes(x = participant_year_of_birth, y = mean-0.1),
             size = 0.5,
             alpha = 0.2,
            show.legend = FALSE) +
  geom_point(data = speakers1, #add speaker means
             aes(x = participant_year_of_birth, y = mean, colour = Vowel),
             size = 2,
             alpha = 0.9,
            show.legend = FALSE, inherit.aes = FALSE) +
  geom_label(data = speakers1, #add speaker means
             aes(x = participant_year_of_birth - 6, y = mean, label = speakers1, colour = Vowel),
             size = 2.5,
             alpha = 0.9,
            show.legend = FALSE, inherit.aes = FALSE) +
  scale_y_reverse() +
  scale_colour_manual(values=c("#E69F00", "#56B4E9", "#009E73")) +
  scale_fill_manual(values=c("#E69F00", "#56B4E9", "#009E73")) +
  xlab("Participant year of birth") + #x axis title
  ylab("Normalised F1 (Lobanov 2.0)") + #y axis title
  theme_bw() + #general aesthetics
  theme(legend.position = "top", #legend position
        axis.text = element_text(size = 14), #text size
        axis.title = element_text(face = "bold", size = 14), #axis text aesthetics
        legend.title = element_blank(), #legend title text size
        legend.text = element_text(size = 14)) #legend text size

#view the plot
F1_speaker_intercepts

ggsave(plot = F1_speaker_intercepts, filename = "Figures/F1_speaker_intercepts.png", dpi = 300, width = 7, height = 5)

#clean up
rm(speakers, speakers1, F1_speaker_intercepts)

Visualisation of intercept correlations

Understanding how the intercepts may be correlated to each other is an important first step to how they relate to each other. However, simply using correlations to gain a full understanding of how the covariation is operating, may not be sufficient. We present below an initial outline of the correlations between the intercepts.

The correlations themselves

intercepts_cor <- cor(gam_intercepts.tmp[-1] %>%
             select(starts_with("F1"), starts_with("F2")))

datatable(intercepts_cor) %>%
  formatRound(columns = 1:20)

We next present scatter plots and the correlation co-efficients together

chart.Correlation(gam_intercepts.tmp[-1], histogram = FALSE, pch=19)

A chord plot that visualises the correlations. This plot will give a more simplified representation of the above plot, it connects each of the variables to all the other variables via chords. The black chords indicate a positive correlation, whereas a red chord represents a negative correlation. The size and transparency of the chord indicates the strength of the correlation, with darker wider chords being used for the strongest correlations

The code below adapts the code used in Zuguang Gu’s tutorial (see http://jokergoo.github.io/blog/html/large_matrix_circular.html)

mat <- cor(gam_intercepts.tmp[-1] %>%
             select(starts_with("F2"), starts_with("F1")))

diag(mat) = 0
mat[lower.tri(mat)] = 0
n = nrow(mat)
rn = rownames(mat)

group_size = c(rep(1, 20))
gl = lapply(1:20, function(i) {
    rownames(mat)[sum(group_size[seq_len(i-1)]) + 1:group_size[i]]
})
names(gl) = names(mat)

group_color = structure(circlize::rand_color(20), names = names(gl))
n_group = length(gl)

col_fun = colorRamp2(c(-1, 0, 1), c("red", "white", "black"), transparency = 0.2)

png("Figures/chord_plot.png", width = 2000, height = 2000, units = "px", res = 300)

chordDiagram(mat, col = col_fun(mat), grid.col = NA, grid.border = "black", 
    annotationTrack = "grid", link.largest.ontop = TRUE,
    preAllocateTracks = list(
        list(track.height = 0.02)
    )
)

circos.trackPlotRegion(track.index = 2, panel.fun = function(x, y) {
    xlim = get.cell.meta.data("xlim")
    ylim = get.cell.meta.data("ylim")
    sector.index = get.cell.meta.data("sector.index")
    circos.text(mean(xlim), mean(ylim), sector.index, col = "black", cex = 0.6, 
        facing = "inside", niceFacing = TRUE)
}, bg.border = NA)

dev.off()

## quartz_off_screen 
##                 2

chordDiagram(mat, col = col_fun(mat), grid.col = NA, grid.border = "black", 
    annotationTrack = "grid", link.largest.ontop = TRUE,
    preAllocateTracks = list(
        list(track.height = 0.02)
    )
)

circos.trackPlotRegion(track.index = 2, panel.fun = function(x, y) {
    xlim = get.cell.meta.data("xlim")
    ylim = get.cell.meta.data("ylim")
    sector.index = get.cell.meta.data("sector.index")
    circos.text(mean(xlim), mean(ylim), sector.index, col = "black", cex = 0.6, 
        facing = "inside", niceFacing = TRUE)
}, bg.border = NA)

Principal Components Analysis (PCA)

In the following section we will run a PCA on the gam_intercepts.tmp data frame, i.e. the by-speaker intercepts from each of the 20 GAMMs.

PCA offers a neat analysis solution to assessing whether there is underlying structure in the dataset, by highlighting which variables co-vary together - these are called principal components (PCs). Moreover, it also allows us to explore the data at the by-speaker level, providing us with insights into who is on the margins of these PCs.

Running the PCA

#run the PCA on the intercepts data frame, note the intercepts are in columns 2:21 as column 1 is the speaker name
intercepts.pca <- princomp(gam_intercepts.tmp[, 2:ncol(gam_intercepts.tmp)],cor=TRUE)

#print a summary of the PCA, this will give the variance explained by each PC
summary(intercepts.pca)

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     1.8568236 1.7761654 1.4218609 1.27801065 1.11351321
## Proportion of Variance 0.1723897 0.1577382 0.1010844 0.08166556 0.06199558
## Cumulative Proportion  0.1723897 0.3301279 0.4312123 0.51287786 0.57487344
##                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     1.0424970 1.00352469 0.95546926 0.90696297 0.87916274
## Proportion of Variance 0.0543400 0.05035309 0.04564608 0.04112909 0.03864636
## Cumulative Proportion  0.6292134 0.67956653 0.72521261 0.76634170 0.80498805
##                           Comp.11    Comp.12    Comp.13    Comp.14    Comp.15
## Standard deviation     0.80953378 0.77331213 0.73748412 0.72788478 0.67193061
## Proportion of Variance 0.03276725 0.02990058 0.02719414 0.02649081 0.02257454
## Cumulative Proportion  0.83775530 0.86765588 0.89485002 0.92134084 0.94391538
##                           Comp.16    Comp.17    Comp.18     Comp.19    Comp.20
## Standard deviation     0.61321385 0.55845231 0.47851298 0.372962769 0.25635209
## Proportion of Variance 0.01880156 0.01559345 0.01144873 0.006955061 0.00328582
## Cumulative Proportion  0.96271694 0.97831039 0.98975912 0.996714180 1.00000000

#view eigen values, this will visualise the variance explained by each PC
fviz_eig(intercepts.pca, addlabels = TRUE)

ggsave(plot = fviz_eig(intercepts.pca, addlabels = TRUE) + ggtitle(""), "Figures/PCA_scree.png", width = 7, height = 5, dpi = 300)

#create objects containing the loadings for each of the 3 main PCs
PC1_loadings <- intercepts.pca$loadings[,1]
PC2_loadings <- intercepts.pca$loadings[,2]
PC3_loadings <- intercepts.pca$loadings[,3]

#store these in a data frame for ease of plotting
PC1_loadings <- PC1_loadings %>%
  as.data.frame() %>%#convert to a data frame
  dplyr::rename(Loading = 1) %>% #rename the variable
  mutate(variable = row.names(.), #add variable to identify which intercepts are representing each row
         highlight = ifelse(Loading > 0.225 | Loading < -0.225 , "black", "gray"), #for plotting reasons, we want to highlight the variables that contribute the most to the PC, so if the loading is > |0.2| it will plotted in black, if not it will be gray)
         PC1_loadings_abs = abs(Loading), #make the loadings absolute so they are comparable when plotting
         direction = ifelse(Loading < 0, "red", "black"), #define which direction the loading sign was, i.e. red if negative, black if positive
         PC = "PC1",
         Vowel = substr(variable, 4, 10)) %>% #add variable to identify which PC the data is from
  arrange(PC1_loadings_abs) #order the variables based on absolute loading value

#repeat this process for PC2
PC2_loadings <- PC2_loadings %>%
  as.data.frame() %>%
  dplyr::rename(Loading = 1) %>%
  mutate(variable = row.names(.),
         highlight = ifelse(Loading > 0.225 | Loading < -0.225 , "black", "gray"),
         PC2_loadings_abs = abs(Loading),
         direction = ifelse(Loading < 0, "red", "black"),
         PC = "PC2",
         Vowel = substr(variable, 4, 10)) %>%
  arrange(PC2_loadings_abs)

#repeat this process for PC3
PC3_loadings <- PC3_loadings %>%
  as.data.frame() %>%
  dplyr::rename(Loading = 1) %>%
  mutate(variable = row.names(.),
         highlight = ifelse(Loading > 0.225 | Loading < -0.225 , "black", "gray"),
         PC3_loadings_abs = abs(Loading),
         direction = ifelse(Loading < 0, "red", "black"),
         PC = "PC3",
         Vowel = substr(variable, 4, 10)) %>%
  arrange(PC3_loadings_abs)

PCs by Contribution

To visualise how the different variables contribute to the formation of the PC, we can look at their contribution values. These are simply the variable loading squared, giving a value that is interpretable by means of a percent.

The red dashed line in these plots represents the contribution value if all variables contributed equally to the PC. It is calculated by dividing 100 by the number of variables, 20 - so 100/20 = 5.Thus, our (least arbitrary) way to define a cut-off point for ‘importance’ of the variables is to focus on those contributing > 5%.

The colour of the dots in these plots is defined by the direction of the loadings in the PCA, i.e. negative = red and positive = black.

First we will wrangle the data from the PCA.

PC_loadings_contrib <- intercepts.pca$loadings[,1:3] %>% #get the loadings for PC1, PC2 and PC3
  as.data.frame() %>%
  cbind(get_pca(intercepts.pca, "var")$contrib[,1:3]) %>% #get the contributions
  dplyr::rename(Loading.PC1 = Comp.1,
         Loading.PC2 = Comp.2,
         Loading.PC3 = Comp.3,
         Contribution.PC1 = Dim.1,
         Contribution.PC2 = Dim.2,
         Contribution.PC3 = Dim.3) %>%
  mutate(Variable = row.names(.),
         Vowel = substr(Variable, 4, nchar(Variable))) %>%
  dplyr::select(Variable, Vowel, Loading.PC1:Contribution.PC3) %>%
  pivot_longer(Loading.PC1:Loading.PC3, names_to = "PC", values_to = "Loading") %>%
  arrange(Vowel, Variable, PC) %>%
  mutate(direction = ifelse(Loading < 0, "red", "black")) #define direction of the loadings

PC1

PC1_contrib <- PC_loadings_contrib %>%
  select(-Contribution.PC2, -Contribution.PC3) %>%
  filter(PC == "Loading.PC1") %>%
  mutate(highlight = ifelse(Contribution.PC1 > 5, "black", "gray"),
         Loading_abs = abs(Loading)) %>%
  arrange(Contribution.PC1)
  
PC1_contrib_plot <- ggplot(PC1_contrib, aes(x=reorder(Variable, Contribution.PC1), y=Contribution.PC1)) + #have the variable name on the x axis and loading value on the y
  geom_text(aes(alpha = highlight, label=ifelse(PC1_contrib$direction=="red","–", "+")), colour = PC1_contrib$direction, size = 6, fontface="bold", show.legend = FALSE) + 
  xlab("") + #remove the x axis title
  ylab("PC1 contribution (%)") +
  geom_hline(yintercept = 5, color = "red", linetype = "dashed") + #add a red dashed line to identify the 0.2 cut off
  scale_alpha_manual(values=c(1, 0.3)) +
  theme_bw() + #set the aesthetics of the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC1_contrib$highlight, size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC1_contrib_plot

ggsave(plot = PC1_contrib_plot, filename = "Figures/PC1_contrib_plot.png", width = 8, height = 5, dpi = 300)

PC2

PC2_contrib <- PC_loadings_contrib %>%
  select(-Contribution.PC1, -Contribution.PC3) %>%
  filter(PC == "Loading.PC2") %>%
  mutate(highlight = ifelse(Contribution.PC2 > 5, "black", "gray"),
         Loading_abs = abs(Loading)) %>%
  arrange(Contribution.PC2)
  
PC2_contrib_plot <- ggplot(PC2_contrib, aes(x=reorder(Variable, Contribution.PC2), y=Contribution.PC2)) + #have the variable name on the x axis and loading value on the y
  geom_text(aes(alpha = highlight, label=ifelse(PC2_contrib$direction=="red","–", "+")), colour = PC2_contrib$direction, size = 6, fontface="bold", show.legend = FALSE) + #add the dots to the plot, with the colour
  xlab("") + #remove the x axis title
  ylab("PC2 contribution (%)") +
  geom_hline(yintercept = 5, color = "red", linetype = "dashed") + #add a red dashed line to identify the 0.2 cut off
  scale_alpha_manual(values=c(1, 0.3)) +
  theme_bw() + #set the aesthetics of the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC2_contrib$highlight, size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC2_contrib_plot

ggsave(plot = PC2_contrib_plot, filename = "Figures/PC2_contrib_plot.png", width = 8, height = 5, dpi = 300)

PC3

PC3_contrib <- PC_loadings_contrib %>%
  select(-Contribution.PC1, -Contribution.PC2) %>%
  filter(PC == "Loading.PC3") %>%
  mutate(highlight = ifelse(Contribution.PC3 > 5, "black", "gray"),
         Loading_abs = abs(Loading)) %>%
  arrange(Contribution.PC3)
  
PC3_contrib_plot <- ggplot(PC3_contrib, aes(x=reorder(Variable, Contribution.PC3), y=Contribution.PC3)) + #have the variable name on the x axis and loading value on the y
  geom_text(aes(alpha = highlight, label=ifelse(PC3_contrib$direction=="red","–", "+")), colour = PC3_contrib$direction, size = 6, fontface="bold", show.legend = FALSE) + 
  xlab("") + #remove the x axis title
  ylab("PC3 contribution (%)") +
  geom_hline(yintercept = 5, color = "red", linetype = "dashed") + #add a red dashed line to identify the 0.2 cut off
  scale_alpha_manual(values=c(1, 0.3)) +
  theme_bw() + #set the aesthetics of the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC3_contrib$highlight, size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC3_contrib_plot

ggsave(plot = PC3_contrib_plot, filename = "Figures/PC3_contrib_plot.png", width = 8, height = 5, dpi = 300)

PCs by cumulative contribution

Another, more clear way to visualise how the variables contribute to the PCs is to show the cumulative proportions as the y axis variable. This is done to highlight that only a certain number of variables actually contribute a large amount of the explained variance. We will take a 50% cumulative contribution baseline to show which variables actually contribute a lot to each of the PCs.

PC1_loadings_contrib1 <- PC_loadings_contrib %>%
  filter(PC == "Loading.PC1") %>%
  arrange(Contribution.PC1) %>%
  # group_by(Variable) %>%
  mutate(cumsum_PC1 = round(cumsum(Contribution.PC1), 4),
         highlight = ifelse(cumsum_PC1 < 50, "grey", "black"),
         highlight1 = ifelse(cumsum_PC1 < 50, 0.5, 1),
         direction1 = direction,
         direction_lab = ifelse(direction1=="red","–", "+"))

PC1_loadings_contrib1_plot <- PC1_loadings_contrib1 %>%
  ggplot(aes(x = fct_reorder(Variable, cumsum_PC1), y = cumsum_PC1, label=direction_lab, colour = direction1)) +
  # geom_point() +
  geom_text(alpha = PC1_loadings_contrib1$highlight1, size = 6, fontface="bold", show.legend = FALSE) + 
  scale_color_manual(values = c("black", "red")) +
  geom_abline(slope = 0, intercept = 50, colour = "red", linetype = 2) +
  xlab("") +
  ylab("Cumulative sum contribution %\n(PC1)") +
  # facet_grid(PC~.) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC1_loadings_contrib1$highlight, size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC2_loadings_contrib1 <- PC_loadings_contrib %>%
  filter(PC == "Loading.PC2") %>%
  arrange(Contribution.PC2) %>%
  # group_by(Variable) %>%
  mutate(cumsum_PC2 = round(cumsum(Contribution.PC2), 4),
         highlight = ifelse(cumsum_PC2 < 50, "grey", "black"),
         highlight1 = ifelse(cumsum_PC2 < 50, 0.5, 1),
         direction1 = direction,
         direction_lab = ifelse(direction1=="red","–", "+"))

PC2_loadings_contrib1_plot <- PC2_loadings_contrib1 %>%
  ggplot(aes(x = fct_reorder(Variable, cumsum_PC2), y = cumsum_PC2, label=direction_lab, colour = direction1)) +
  # geom_point() +
  geom_text(alpha = PC2_loadings_contrib1$highlight1, size = 6, fontface="bold", show.legend = FALSE) + 
  scale_color_manual(values = c("black", "red")) +
  geom_abline(slope = 0, intercept = 50, colour = "red", linetype = 2) +
  xlab("") +
  ylab("Cumulative sum contribution %\n(PC2)") +
  # facet_grid(PC~.) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC2_loadings_contrib1$highlight, size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC3_loadings_contrib1 <- PC_loadings_contrib %>%
  filter(PC == "Loading.PC3") %>%
  arrange(Contribution.PC3) %>%
  # group_by(Variable) %>%
  mutate(cumsum_PC3 = round(cumsum(Contribution.PC3), 4),
         highlight = ifelse(cumsum_PC3 < 50, "grey", "black"),
         highlight1 = ifelse(cumsum_PC3 < 50, 0.5, 1),
         direction1 = direction,
         direction_lab = ifelse(direction1=="red","–", "+"))

PC3_loadings_contrib1_plot <- PC3_loadings_contrib1 %>%
  ggplot(aes(x = fct_reorder(Variable, cumsum_PC3), y = cumsum_PC3, label=direction_lab, colour = direction1)) +
  # geom_point() +
  geom_text(alpha = PC3_loadings_contrib1$highlight1, size = 6, fontface="bold", show.legend = FALSE) + 
  scale_color_manual(values = c("black", "red")) +
  geom_abline(slope = 0, intercept = 50, colour = "red", linetype = 2) +
  xlab("") +
  ylab("Cumulative sum contribution %\n(PC3)") +
  # facet_grid(PC~.) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC3_loadings_contrib1$highlight, size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC_loadings_contrib1 <- plot_grid(NULL, PC1_loadings_contrib1_plot, NULL, PC2_loadings_contrib1_plot, NULL, PC3_loadings_contrib1_plot, labels = c("PC1", "", "PC2", "", "PC3", ""), rel_heights = c(0.2, 1, 0.2, 1, 0.2, 1), label_size = 14, nrow = 6)

PC_loadings_contrib1

ggsave(plot = PC_loadings_contrib1, filename = "Figures/PC_loadings_contrib1.png", width = 6, height = 14, dpi = 300)

PCA scores analysis

Next, our analyses will inspect whether the PCA scores within each of the 3 PCs are independent of social information, i.e. year of birth and gender.

Each speaker has a PCA score within each of the PCs (which can be found in intercepts.pca$scores), thus we need to extract this information and combine it with the speaker’s social information (which can be found in vowels_all).

We will begin by wrangling the data, then inspecting the relationships visually, then running GAMs to see if any of the variables can predict the PCA scores significantly.

We hypothesise that our initial GAMMs had these variables included in the fixed-effects structure, so, none of the subsequent models should show significant results. If there are no significant effects, then we have confidence in our modelling procedure and that the speaker intercepts are representative of variation independent of the known predictors of change in F1/F2.

#extract the PCA scores for the first 3 PCs from the PCA
PC_speaker_loadings <- as.data.frame(intercepts.pca$scores[,1:3]) %>% #the loadings are stored in the scores part of the intercepts.pca object
  mutate(Speaker = gam_intercepts.tmp$Speaker) #create a Speaker variable with the speaker names

#get speaker's social information and combine it with the speaker intercepts and then the PC loadings
PC_speaker_loadings <-  vowels_all %>%
  dplyr::select(Speaker, Corpus, Gender, participant_year_of_birth) %>% #choose the variables of interest from vowels_all
  distinct() %>% #make one row for each speaker
  left_join(., gam_intercepts.tmp, by = "Speaker") %>% #combine the social information with the intercepts
  left_join(., PC_speaker_loadings, by = "Speaker") #combine this with the PC loadings

#add the PCA scores from PC1, PC2 and PC3
vowels_all <- vowels_all %>%
  left_join(PC_speaker_loadings[, c("Speaker", "Comp.1", "Comp.2", "Comp.3")])

In order to statistically assess whether the PCA scores are reliably predicted by any of the social variables, we will run a series of GAMs, predicting the PCA scores in each of the 3 PCs by each of the social variables. We will also provide visualisations of the results (reproducing Figure XX).

PC1

#run a GAM to predict the PCA score on PC1 by the social information
# (entering Gender as a factor here (not ordered!), which generates
#  completely separate smooths for each Gender)
gam_PC1 = bam(-Comp.1 ~
             s(participant_year_of_birth, k=10, by=Gender) +
             Gender +
             Corpus,
           data=PC_speaker_loadings)

summary(gam_PC1)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## -Comp.1 ~ s(participant_year_of_birth, k = 10, by = Gender) + 
##     Gender + Corpus
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     0.04203    0.13690   0.307  0.75899   
## GenderM        -0.04773    0.17182  -0.278  0.78129   
## CorpusDarfield -1.09406    0.38502  -2.842  0.00468 **
## CorpusIA        0.03169    0.29025   0.109  0.91311   
## CorpusMU        0.38015    0.41250   0.922  0.35722   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                      edf Ref.df     F p-value
## s(participant_year_of_birth):GenderM   1      1 0.482   0.488
## 
## R-sq.(adj) =  0.00986   Deviance explained = 2.02%
## fREML = 979.42  Scale est. = 3.4209    n = 481

# extract predictions
gam_PC1_preds <- plot_smooth(gam_PC1, view="participant_year_of_birth", plot_all="Gender")$fv

## Summary:
##  * Gender : factor; set to the value(s): F, M. 
##  * Corpus : factor; set to the value(s): CC. 
##  * participant_year_of_birth : numeric predictor; with 30 values ranging from 1864.000000 to 1982.000000.

PC1_speaker_loadings <- ggplot(PC_speaker_loadings, aes(x = participant_year_of_birth, y = -Comp.1, #plot year of birth on the x axis and PC1 loading on the y axis
                                label = Gender, color = Gender)) + #make the colours and data points represent Gender
  geom_point(size = 0, stroke = 0, show.legend = FALSE) + #add an empty data point for plotting
  geom_text(show.legend = FALSE) + #add text label i.e. F/M
  geom_ribbon(data=gam_PC1_preds, aes(ymin=-ll, ymax=-ul, y=NULL, group = Gender), col="grey", alpha = 0.2) +
  geom_line(data=gam_PC1_preds, aes(y=-fit), size = 1.25) +
  scale_y_continuous(breaks = seq(-6,6,2), limits = c(-7,7)) +
  scale_color_manual(breaks = c("F", "M"), labels = c("Female", "Male"), values = c("black", "green")) + #update the legend and colour scheme where F = black letters, M = green letters
  scale_fill_manual(breaks = c("F", "M"), values = c("black", "green"), guide=F) + #update the legend and colour scheme where F = black letters, M = green letters
  xlab("Year of birth") + #label x axis
  ylab("PC1 score") + #label y axis
  theme_bw() + #general aesthetics
  theme(axis.text = element_text(face = "bold", size = 14), axis.title = element_text(face = "bold", size = 14), legend.position = "none") #make the text bold and larger on the axes

PC1_speaker_loadings

ggsave(plot = PC1_speaker_loadings, filename = "Figures/PC1_speaker_loadings.png", width = 7, height = 3, dpi = 300)

PC2

#run a GAM to predict the PCA score on PC1 by the social information
# (entering Gender as a factor here (not ordered!), which generates
#  completely separate smooths for each Gender)
gam_PC2 = bam(-Comp.2 ~
             s(participant_year_of_birth, k=10, by=Gender) +
             Gender +
             Corpus,
           data=PC_speaker_loadings)

summary(gam_PC2)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## -Comp.2 ~ s(participant_year_of_birth, k = 10, by = Gender) + 
##     Gender + Corpus
## 
## Parametric coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.06671    0.13198   0.505    0.613
## GenderM         0.01742    0.16565   0.105    0.916
## CorpusDarfield -0.45667    0.37118  -1.230    0.219
## CorpusIA       -0.11459    0.27981  -0.410    0.682
## CorpusMU       -0.40713    0.39768  -1.024    0.306
## 
## Approximate significance of smooth terms:
##                                      edf Ref.df     F p-value
## s(participant_year_of_birth):GenderM   1      1 0.579   0.447
## 
## R-sq.(adj) =  -0.00572   Deviance explained = 0.476%
## fREML = 962.04  Scale est. = 3.1794    n = 481

# extract predictions
gam_PC2_preds <- plot_smooth(gam_PC2, view="participant_year_of_birth", plot_all="Gender")$fv

## Summary:
##  * Gender : factor; set to the value(s): F, M. 
##  * Corpus : factor; set to the value(s): CC. 
##  * participant_year_of_birth : numeric predictor; with 30 values ranging from 1864.000000 to 1982.000000.

#this code repeats the same process as in PC1, but for PC2
PC2_speaker_loadings <- ggplot(PC_speaker_loadings, aes(x = participant_year_of_birth, y = -Comp.2, #plot year of birth on the x axis and PC1 loading on the y axis
                                label = Gender, color = Gender)) + #make the colours and data points represent Gender
  geom_point(size = 0, stroke = 0, show.legend = FALSE) + #add an empty data point for plotting
  geom_text(show.legend = FALSE) + #add text label i.e. F/M
  geom_ribbon(data=gam_PC2_preds, aes(ymin=-ll, ymax=-ul, y=NULL, group = Gender), col="grey", alpha = 0.2) +
  geom_line(data=gam_PC2_preds, aes(y=-fit), size = 1.25) +
  scale_y_continuous(breaks = seq(-6,6,2), limits = c(-7,7)) +
  scale_color_manual(breaks = c("F", "M"), labels = c("Female", "Male"), values = c("black", "green")) + #update the legend and colour scheme where F = orange letters, M = green letters
  scale_fill_manual(breaks = c("F", "M"), values = c("black", "green"), guide=F) + #update the legend and colour scheme where F = orange letters, M = green letters
  xlab("Year of birth") + #label x axis
  ylab("PC2 score") + #label y axis
  theme_bw() + #general aesthetics
  theme(axis.text = element_text(face = "bold", size = 14), axis.title = element_text(face = "bold", size = 14), legend.position = "none") #make the text bold and larger on the axes

PC2_speaker_loadings

ggsave(plot = PC2_speaker_loadings, filename = "Figures/PC2_speaker_loadings.png", width = 7, height = 3, dpi = 300)

PC3

#run a GAM to predict the PCA score on PC1 by the social information
# (entering Gender as a factor here (not ordered!), which generates
#  completely separate smooths for each Gender)
gam_PC3 = bam(-Comp.3 ~
             s(participant_year_of_birth, k=10, by=Gender) +
             Gender +
             Corpus,
           data=PC_speaker_loadings)

summary(gam_PC3)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## -Comp.3 ~ s(participant_year_of_birth, k = 10, by = Gender) + 
##     Gender + Corpus
## 
## Parametric coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -0.153924   0.099162  -1.552    0.121    
## GenderM         0.026154   0.124460   0.210    0.834    
## CorpusDarfield  2.261643   0.278891   8.109 4.37e-15 ***
## CorpusIA        0.002837   0.210242   0.013    0.989    
## CorpusMU        0.222118   0.298801   0.743    0.458    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                      edf Ref.df     F p-value
## s(participant_year_of_birth):GenderM   1      1 0.201   0.654
## 
## R-sq.(adj) =  0.114   Deviance explained = 12.3%
## fREML = 826.25  Scale est. = 1.7949    n = 481

# extract predictions
gam_PC3_preds <- plot_smooth(gam_PC3, view="participant_year_of_birth", plot_all="Gender")$fv

## Summary:
##  * Gender : factor; set to the value(s): F, M. 
##  * Corpus : factor; set to the value(s): CC. 
##  * participant_year_of_birth : numeric predictor; with 30 values ranging from 1864.000000 to 1982.000000.

#this code repeats the same process as in PC1, but for PC2
PC3_speaker_loadings <- ggplot(PC_speaker_loadings, aes(x = participant_year_of_birth, y = -Comp.3, #plot year of birth on the x axis and PC1 loading on the y axis
                                label = Gender, color = Gender)) + #make the colours and data points represent Gender
  geom_point(size = 0, stroke = 0, show.legend = FALSE) + #add an empty data point for plotting
  geom_text(show.legend = FALSE) + #add text label i.e. F/M
  geom_ribbon(data=gam_PC3_preds, aes(ymin=-ll, ymax=-ul, y=NULL, group = Gender), col="grey", alpha = 0.2) +
  geom_line(data=gam_PC3_preds, aes(y=-fit), size = 1.25) +
  scale_y_continuous(breaks = seq(-6,6,2), limits = c(-7,7)) +
  scale_color_manual(breaks = c("F", "M"), labels = c("Female", "Male"), values = c("black", "green")) + #update the legend and colour scheme where F = orange letters, M = green letters
  scale_fill_manual(breaks = c("F", "M"), values = c("black", "green"), guide=F) + #update the legend and colour scheme where F = orange letters, M = green letters
  xlab("Year of birth") + #label x axis
  ylab("PC3 score") + #label y axis
  theme_bw() + #general aesthetics
  theme(axis.text = element_text(face = "bold", size = 14), axis.title = element_text(face = "bold", size = 14), legend.position = "none") #make the text bold and larger on the axes

PC3_speaker_loadings

ggsave(plot = PC3_speaker_loadings, filename = "Figures/PC3_speaker_loadings.png", width = 7, height = 3, dpi = 300)

Relationship between the PC scores

It may be of interest to inspect the how the PC scores correlate with each other, i.e. checking that if you have a high score on one PC does that predict a high score on the other PCs? The correlations below show that there is no strong trends to support this, demonstrating that the PC scores and the PCs are independent.

Note however, there does seem to be a slight trend for speakers with high PC1 to have high PC3.

chart.Correlation(PC_speaker_loadings %>% select(Comp.1:Comp.3), histogram = FALSE, pch=19)

Visualisation in vowel space

In order to understand how these variables are co-varying together, we can re-interpret the dot plots in terms of F1/F2 vowel space. This will allow us to explore our theoretically motivated interpretation that these PCs represent changes in F1/F2 over time, thus demonstrating that the PCs may be representing ‘leaders’ and ‘laggers’ of sound change.

We will plot how the vowel space in New Zealand English has changed over the course of the dataset (defined by year of birth), based on the GAMM modelling performed in earlier models.
We then visualise how the vowel spaces of the speakers in PC1, PC2 and PC3 also change (defined by the PCA scores), this interpretation will be driven by the weighting of the variable loadings on each PC (as shown in the dot plots). This will done in 3 different ways:

GAM smooths
Vowel plots
Animations

In order to generate these plots, we have to run more GAMMs predicting either F1 or F2 (normalised), by the PCA scores, this will provide us with predicted model values, that show the trajectories based on the PCA scores (i.e. going from negative to positive values).

Note, the direction of the PCA scores (for PC1, PC2 and PC3) have been reversed in the figures, i.e. negative scores are switched to positive and vice versa. This is purely for visualisation purposes as it helps the interpretation in terms of change over time. This does not affect any underlying importance of the actual scores, the direction (+/-) is arbitrary in PCA and remains the same as long as all values are switched, which they have been.

Sound change

Visualisation by GAM smooth

Plotting change in normalised (Lobanov 2.0) F1 (red lines) and F2 (blue lines)

x axis = year of birth
y axis = normalised F1/F2
smoothed lines = GAM fit

year of birth

mod_pred_yob_values <-  readRDS("Data/Models/mod_pred_yob_values.rds")

#get means for each speaker per vowel and formant
sound_change_summary <- vowels_all %>%
  group_by(Speaker, participant_year_of_birth, Vowel) %>%
  dplyr::summarise(n = n(),
                   F1 = mean(F1_lobanov_2.0),
                   F2 = mean(F2_lobanov_2.0),
                   sd_F1 = sd(F1_lobanov_2.0),
                   sd_F2 = sd(F2_lobanov_2.0)) %>%
  ungroup() %>%
  pivot_longer(F1:F2, names_to = "Formant", values_to = "fit")

#plot the gam smooths predicting F1/F2 per vowel and add the per speaker mean values
sound_change_plot_smooth <- mod_pred_yob_values %>%
  ggplot(aes(x = participant_year_of_birth, y = fit, colour = Formant, fill = Formant)) +
  geom_point(data = sound_change_summary, size = 0.25, alpha = 0.1) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.2, colour = NA) +
  scale_x_continuous(breaks = c(1900, 1950)) +
  xlab("Year of birth") +
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  theme_bw() +
  theme(legend.position = "top")

sound_change_plot_smooth

ggsave(plot = sound_change_plot_smooth, filename = "Figures/sound_change_gam.png", width = 10, height = 5, dpi = 300)

year of birth and gender

mod_pred_yobgender_values <- readRDS("Data/Models/mod_pred_yobgender_values.rds")

#get means for each speaker per vowel and formant
sound_change_gender_summary <- vowels_all %>%
  group_by(Speaker, Gender, participant_year_of_birth, Vowel) %>%
  dplyr::summarise(n = n(),
                   F1 = mean(F1_lobanov_2.0),
                   F2 = mean(F2_lobanov_2.0),
                   sd_F1 = sd(F1_lobanov_2.0),
                   sd_F2 = sd(F2_lobanov_2.0)) %>%
  ungroup() %>%
  pivot_longer(F1:F2, names_to = "Formant", values_to = "fit")

#plot the gam smooths predicting F1/F2 per vowel and add the per speaker mean values with F or M as the text
sound_change_plot_smooth_gender <- mod_pred_yobgender_values %>%
  ggplot(aes(x = participant_year_of_birth, y = fit, colour = Formant, linetype = Gender, label = Gender, fill = Formant)) +
  geom_text(data = sound_change_gender_summary, size = 1, alpha = 0.5) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.2, colour = NA) +
  scale_x_continuous(breaks = c(1900, 1950)) +
  xlab("Year of birth") +
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

sound_change_plot_smooth_gender

ggsave(plot = sound_change_plot_smooth_gender, filename = "Figures/sound_change_gam_gender.png", width = 10, height = 5, dpi = 300)

Visualisation in F1/F2 space

We can convert the data from the above plot in (Lobanov 2.0 normalised) F1/F2 space, to see the changes in a more conventional vowel space plot

x axis = F2
y axis = F1
colours = lexical set of vowel
lines = trajectory of the F1/F2 in terms of year of birth (1857 - 1988)
vowel labels = start point of the vowel trajectory (oldest year of birth: 1857)
arrows = end point of the trajectory (youngest year of birth: 1988)

#transform data so there are separate columns for F1 and F2
sound_change_plot_data <- mod_pred_yob_values %>%
  select(participant_year_of_birth, fit, Vowel, Formant) %>%
  pivot_wider(names_from = Formant, values_from = fit) #transform the data to wide format so there are separate F1 and F2 variables

#make data frame to plot starting point, this will give the vowel labels based on the smallest year of birth coordinates
sound_change_labels1 <- sound_change_plot_data %>%
  group_by(Vowel) %>%
  filter(participant_year_of_birth == min(participant_year_of_birth))

#make data frame to plot end point, this will give the arrow at the end of the paths based on the largest year of birth coordinates
sound_change_labels2 <- sound_change_plot_data %>%
  group_by(Vowel) %>%
  top_n(wt = participant_year_of_birth, n = 2)

#plot
sound_change_plot <- sound_change_plot_data %>%
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, alpha = participant_year_of_birth, group = Vowel)) +
  #add year of birth change trajectories
  geom_path(size = 0.5, show.legend = FALSE) +
  #add end points (this gives the arrows)
  geom_path(data = sound_change_labels2, aes(x = F2, y = F1, colour = Vowel, group = Vowel),
            arrow = arrow(ends = "last", type = "closed", length = unit(0.2, "cm")),
            inherit.aes = FALSE, show.legend = FALSE) +
  #plot the vowel labels
  geom_text(data = sound_change_labels1, aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel), inherit.aes = FALSE, show.legend = FALSE) +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #scale the size so the path is not too wide
  scale_size_continuous(range = c(0.2,1)) +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #add a title
  # labs(title = "A) Change over time\n     ") +
  #set the theme
  theme_bw() +
  #make title bold
  theme(plot.title = element_text(face="bold"))

sound_change_plot

ggsave(plot = sound_change_plot, filename = "Figures/sound_change_static.png", width = 5.5, height = 5.5, dpi = 300)

Visualisation by animation

We can also recreate the above plot in 3 dimensional space, where the trajectories are animated.

x axis = F2
y axis = F1
colours = lexical set of vowel
movement = trajectory of the F1/F2 in terms of year of birth (1857 - 1988)
trails = the previous positions of the vowels

sound_change_plot_animation <- sound_change_plot_data %>%
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel)) +
  geom_text(aes(fontface = 2), size = 5, show.legend = FALSE) +
  # geom_point() +
  geom_path() +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #add a title
  labs(caption = 'Year of birth: {round(frame_along, 0)}') +
  #set the theme
  theme_bw() +
  #make text more visible
  theme(axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold", angle = 270),
        axis.ticks = element_blank(),
        plot.caption = element_text(size = 30, hjust = 0),
        legend.position = "none") +
  #set the variable for the animation transition i.e. the time dimension
  transition_reveal(participant_year_of_birth) +
  #add in a trail to see the path
  # shadow_trail(max_frames = 100, alpha = 0.1) +
  ease_aes('linear')

sound_change_plot_animation <- animate(sound_change_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20)

# animate(sound_change_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20, height = 800, width =800)

anim_save(sound_change_plot_animation, filename = "Figures/sound_change_animation.gif")

sound_change_plot_animation

PC1

GAMM modelling

We will predict the F1 and F2 of each of the vowels again using GAMMs, with Comp.1 (the PCA scores for PC1) included as a smooth term. There are also random effects for Speaker and word.

These models are saved in the model_summaries folder for efficiency.

#loop through the vowels
for (i in levels(factor(vowels_all$Vowel))) {
  
  #F1 modelling
  cat(paste0("F1_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n"))  #print the vowel the loop is up to for F1, as well as the start time for the model
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F1 for FLEECE
  gam.F1 <- bam(F1_lobanov_2.0 ~
                  s(Comp.1, k = 10) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #store the model
  assign(paste0("gam_F1_", i), gam.F1)
  
  #save the model summary
  saveRDS(gam.F1, file = paste0("/Users/james/Documents/GitHub/model_summaries/PC1/gam_F1_", i, ".rds"))
  
  #F2 modelling
  cat(paste0("F2_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n")) #print the vowel the loop is up to for F2 , as well as the start time for the model
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F2 for FLEECE
  gam.F2 <- bam(F2_lobanov_2.0 ~
                  s(Comp.1, k = 10) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #store the model
  assign(paste0("gam_F2_", i), gam.F2)
  
  #save the model summary
  saveRDS(gam.F2, file = paste0("/Users/james/Documents/GitHub/model_summaries/PC1/gam_F2_", i, ".rds"))
  
}

Visualisation by GAM smooth

x axis = PCA score
y axis = normalised F1/F2
smoothed lines = GAM fit

mod_pred_PC1_values <- readRDS("Data/Models/mod_pred_PC1_values.rds")

#get means for each speaker per vowel and formant
PC1_change_summary <- vowels_all %>%
  group_by(Speaker, Comp.1, Vowel) %>%
  dplyr::summarise(n = n(),
                   F1 = mean(F1_lobanov_2.0),
                   F2 = mean(F2_lobanov_2.0),
                   sd_F1 = sd(F1_lobanov_2.0),
                   sd_F2 = sd(F2_lobanov_2.0)) %>%
  ungroup() %>%
  pivot_longer(F1:F2, names_to = "Formant", values_to = "fit")

#plot the gam smooths predicting F1/F2 per vowel and add the per speaker mean values
PC1_plot_smooth <- mod_pred_PC1_values %>%
  ggplot(aes(x = -Comp.1, y = fit, colour = Formant, fill = Formant)) +
  geom_point(data = PC1_change_summary, size = 0.25, alpha = 0.1) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.2, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4)) +
  xlab("PC1 score") +
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  theme_bw() +
  theme(legend.position = "top")

PC1_plot_smooth

ggsave(plot = PC1_plot_smooth, filename = "Figures/PC1_plot_gam.png", width = 10, height = 5, dpi = 300)

Visualisation in F1/F2 space

x axis = F2
y axis = F1
colours = lexical set of vowel
lines = trajectory of the F1/F2 in terms of PCA score (-5.5 to 6.5)
vowel labels = start point of the vowel trajectory (smallest PCA score: -5.5)
vowel label size = magnitude of variable loading, this is calculated from the dot plots from the original PCA, as there are 2 possible loadings (one for F1 and one for F2), the largest value is taken to represent the size
arrows = end point of the trajectory (largest PCA score: 6.5)

#transform data so there are separate columns for F1 and F2
PC1_plot_data <- mod_pred_PC1_values %>%
  select(Comp.1, fit, Vowel, Formant) %>%
  mutate(Comp.1 = -Comp.1) %>%
  pivot_wider(names_from = Formant, values_from = fit) %>% #transform the data to wide format so there are separate F1 and F2 variables
  left_join(., PC1_loadings %>%
  group_by(Vowel) %>%
  filter(PC1_loadings_abs == max(PC1_loadings_abs)))

#make data frame to plot starting point, this will give the vowel labels based on the smallest PCA scores
PC1_change_labels1 <- PC1_plot_data %>%
  group_by(Vowel) %>%
  filter(Comp.1 == min(Comp.1))

#make data frame to plot end point, this will give the arrow at the end of the paths based on the largest year of birth coordinates
PC1_change_labels2 <- PC1_plot_data %>%
  group_by(Vowel) %>%
  top_n(wt = Comp.1, n = 2)

#plot
PC1_change_plot <- PC1_plot_data %>%
  mutate(PC1_loadings_abs = PC1_loadings_abs) %>% 
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, alpha = Comp.1, group = Vowel)) +
  #plot the vowel labels
  geom_text(data = PC1_change_labels1, aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel, size = PC1_loadings_abs), inherit.aes = FALSE, show.legend = FALSE) +
  #add year of birth change trajectories
  geom_path(size = 0.5, show.legend = FALSE) +
  #add end points (this gives the arrows)
  geom_path(data = PC1_change_labels2, aes(x = F2, y = F1, colour = Vowel, group = Vowel),
            arrow = arrow(ends = "first", type = "closed", length = unit(0.2, "cm")),
            inherit.aes = FALSE, show.legend = FALSE) +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #scale the size so the path is not too wide
  scale_size_continuous(range = c(1,5)) +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2.5), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #add a title
  # labs(title = "B) Change in PC1\n     variance explained = 17.1%") +
  #set the theme
  theme_bw() +
  #make title bold
  theme(plot.title = element_text(face="bold"))

PC1_change_plot

ggsave(plot = PC1_change_plot, filename = "Figures/PC1_plot_static.png", width = 5.5, height = 5.5, dpi = 300)

Visualisation by animation

x axis = F2
y axis = F1
colours = lexical set of vowel
movement = trajectory of the F1/F2 in terms PCA score
trails = the previous positions of the vowels

PC1_plot_animation <- PC1_plot_data %>%
  arrange(Comp.1) %>%
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel)) +
  geom_text(aes(size = PC1_loadings_abs, fontface = 2), show.legend = FALSE) +
  # geom_point() +
  geom_path() +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2.8), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #set the label size
  scale_size_continuous(range = c(1,5)) +
  #add a title
  labs(caption = 'PC1 score: {round(frame_along, 2)}') +
  #set the theme
  theme_bw() +
  #make text more visible
  theme(axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold", angle = 270),
        axis.ticks = element_blank(),
        plot.caption = element_text(size = 30, hjust = 0),
        legend.position = "none") +
  #set the variable for the animation transition i.e. the time dimension
  transition_reveal(Comp.1)

animate(PC1_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20)

anim_save("Figures/PC1_animation.gif")

PC1_plot_animation

PC2

GAMM modelling

#loop through the vowels
for (i in levels(factor(vowels_all$Vowel))) {
  
  #F1 modelling
  cat(paste0("F1_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n"))  #print the vowel the loop is up to for F1, as well as the start time for the model
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F1 for FLEECE
  gam.F1 <- bam(F1_lobanov_2.0 ~
                  s(Comp.2, k = 10) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #store the model
  assign(paste0("gam_F1_", i), gam.F1)
  
  #save the model summary
  saveRDS(gam.F1, file = paste0("/Users/james/Documents/GitHub/model_summaries/PC2/gam_F1_", i, ".rds"))
  
  #F2 modelling
  cat(paste0("F2_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n"))  #print the vowel the loop is up to for F2 , as well as the start time for the model
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F2 for FLEECE
  gam.F2 <- bam(F2_lobanov_2.0 ~
                  s(Comp.2, k = 10) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #store the model
  assign(paste0("gam_F2_", i), gam.F2)
  
  #save the model summary
  saveRDS(gam.F2, file = paste0("/Users/james/Documents/GitHub/model_summaries/PC2/gam_F2_", i, ".rds"))
  
}

Visualisation by GAM smooth

mod_pred_PC2_values <- readRDS("Data/Models/mod_pred_PC2_values.rds")

#get means for each speaker per vowel and formant
PC2_change_summary <- vowels_all %>%
  group_by(Speaker, Comp.2, Vowel) %>%
  dplyr::summarise(n = n(),
                   F1 = mean(F1_lobanov_2.0),
                   F2 = mean(F2_lobanov_2.0),
                   sd_F1 = sd(F1_lobanov_2.0),
                   sd_F2 = sd(F2_lobanov_2.0)) %>%
  ungroup() %>%
  pivot_longer(F1:F2, names_to = "Formant", values_to = "fit")

#plot the gam smooths predicting F1/F2 per vowel and add the per speaker mean values
PC2_plot_smooth <- mod_pred_PC2_values %>%
  ggplot(aes(x = -Comp.2, y = fit, colour = Formant, fill = Formant)) +
  geom_point(data = PC2_change_summary, size = 0.25, alpha = 0.1) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.2, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4)) +
  xlab("PC2 score") +
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  theme_bw() +
  theme(legend.position = "top")

PC2_plot_smooth

ggsave(plot = PC2_plot_smooth, filename = "Figures/PC2_plot_gam.png", width = 10, height = 5, dpi = 300)

Visualisation in F1/F2 space

#transform data so there are separate columns for F1 and F2
PC2_plot_data <- mod_pred_PC2_values %>%
  select(Comp.2, fit, Vowel, Formant) %>%
  mutate(Comp.2 = -Comp.2) %>%
  pivot_wider(names_from = Formant, values_from = fit) %>% #transform the data to wide format so there are separate F1 and F2 variables
  left_join(., PC2_loadings %>%
  group_by(Vowel) %>%
  filter(PC2_loadings_abs == max(PC2_loadings_abs)))

#make data frame to plot starting point, this will give the vowel labels based on the smallest PCA scores
PC2_change_labels1 <- PC2_plot_data %>%
  group_by(Vowel) %>%
  filter(Comp.2 == min(Comp.2))

#make data frame to plot end point, this will give the arrow at the end of the paths based on the largest year of birth coordinates
PC2_change_labels2 <- PC2_plot_data %>%
  group_by(Vowel) %>%
  top_n(wt = Comp.2, n = 2)

#plot
PC2_change_plot <- PC2_plot_data %>%
  mutate(PC2_loadings_abs = PC2_loadings_abs) %>% 
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, alpha = Comp.2, group = Vowel)) +
  #plot the vowel labels
  geom_text(data = PC2_change_labels1, aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel, size = PC2_loadings_abs), inherit.aes = FALSE, show.legend = FALSE) +
  #add year of birth change trajectories
  geom_path(size = 0.5, show.legend = FALSE) +
  #add end points (this gives the arrows)
  geom_path(data = PC2_change_labels2, aes(x = F2, y = F1, colour = Vowel, group = Vowel),
            arrow = arrow(ends = "first", type = "closed", length = unit(0.2, "cm")),
            inherit.aes = FALSE, show.legend = FALSE) +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #scale the size so the path is not too wide
  scale_size_continuous(range = c(1,5)) +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2.5), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #add a title
  # labs(title = "B) Change in PC2\n     variance explained = 17.1%") +
  #set the theme
  theme_bw() +
  #make title bold
  theme(plot.title = element_text(face="bold"))

PC2_change_plot

ggsave(plot = PC2_change_plot, filename = "Figures/PC2_plot_static.png", width = 5.5, height = 5.5, dpi = 300)

Visualisation by animation

PC2_plot_animation <- PC2_plot_data %>%
  arrange(Comp.2) %>%
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel)) +
  geom_text(aes(fontface = 2, size = PC2_loadings_abs), show.legend = FALSE) +
  geom_step() +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #set the label size
  scale_size_continuous(range = c(1,5)) +
  #add a title
  labs(caption = 'PC2 score: {round(frame_along, 2)}') +
  #set the theme
  theme_bw() +
  #make text more visible
  theme(axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold", angle = 270),
        axis.ticks = element_blank(),
        plot.caption = element_text(size = 30, hjust = 0),
        legend.position = "none") +
  #set the variable for the animation transition i.e. the time dimension
  transition_reveal(Comp.2) +
  #add in a trail to see the path
  # shadow_trail(max_frames = 100, alpha = 0.1) +
  ease_aes('linear')

animate(PC2_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20)

anim_save("Figures/PC2_animation.gif")

PC3

GAMM modelling

for (i in levels(factor(vowels_all$Vowel))) {
  
  #F1 modelling
  cat(paste0("F1_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n"))  #print the vowel the loop is up to for F1, as well as the start time for the model
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F1 for FLEECE
  gam.F1 <- bam(F1_lobanov_2.0 ~
                  s(Comp.3, k = 10) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #store the model
  assign(paste0("gam_F1_", i), gam.F1)
  
  #save the model summary
  saveRDS(gam.F1, file = paste0("/Users/james/Documents/GitHub/model_summaries/PC3/gam_F1_", i, ".rds"))
  
  #F2 modelling
  cat(paste0("F2_", i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n"))  #print the vowel the loop is up to for F2 , as well as the start time for the model
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F2 for FLEECE
  gam.F2 <- bam(F2_lobanov_2.0 ~
                  s(Comp.3, k = 10) +
                  s(Speaker, bs="re") +
                  s(Word, bs="re"),
                data=vowels_all %>% filter(Vowel == i),
                discrete=T, nthreads=2)
  
  #store the model
  assign(paste0("gam_F2_", i), gam.F2)
  
  #save the model summary
  saveRDS(gam.F2, file = paste0("/Users/james/Documents/GitHub/model_summaries/PC3/gam_F2_", i, ".rds"))
  
}

Visualisation by GAM smooth

mod_pred_PC3_values <-  readRDS("Data/Models/mod_pred_PC3_values.rds")

#get means for each speaker per vowel and formant
PC3_change_summary <- vowels_all %>%
  group_by(Speaker, Comp.3, Vowel) %>%
  dplyr::summarise(n = n(),
                   F1 = mean(F1_lobanov_2.0),
                   F2 = mean(F2_lobanov_2.0),
                   sd_F1 = sd(F1_lobanov_2.0),
                   sd_F2 = sd(F2_lobanov_2.0)) %>%
  ungroup() %>%
  pivot_longer(F1:F2, names_to = "Formant", values_to = "fit")

#plot the gam smooths predicting F1/F2 per vowel and add the per speaker mean values
PC3_plot_smooth <- mod_pred_PC3_values %>%
  ggplot(aes(x = -Comp.3, y = fit, colour = Formant, fill = Formant)) +
  geom_point(data = PC3_change_summary, size = 0.25, alpha = 0.1) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.2, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4)) +
  xlab("PC3 score") +
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  theme_bw() +
  theme(legend.position = "top")

PC3_plot_smooth

ggsave(plot = PC3_plot_smooth, filename = "Figures/PC3_plot_gam.png", width = 10, height = 5, dpi = 300)

Visualisation in F1/F2 space

#extract the smoothed values from the plot and store them
#transform data so there are separate columns for F1 and F2
PC3_plot_data <- mod_pred_PC3_values %>%
  select(Comp.3, fit, Vowel, Formant) %>%
  mutate(Comp.3 = -Comp.3) %>%
  pivot_wider(names_from = Formant, values_from = fit) %>% #transform the data to wide format so there are separate F1 and F2 variables
  left_join(., PC3_loadings %>%
  group_by(Vowel) %>%
  filter(PC3_loadings_abs == max(PC3_loadings_abs)))

#make data frame to plot starting point, this will give the vowel labels based on the smallest PCA scores
PC3_change_labels1 <- PC3_plot_data %>%
  group_by(Vowel) %>%
  filter(Comp.3 == min(Comp.3))

#make data frame to plot end point, this will give the arrow at the end of the paths based on the largest year of birth coordinates
PC3_change_labels2 <- PC3_plot_data %>%
  group_by(Vowel) %>%
  top_n(wt = Comp.3, n = 2)

#plot
PC3_change_plot <- PC3_plot_data %>%
  mutate(PC3_loadings_abs = PC3_loadings_abs) %>% 
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, alpha = Comp.3, group = Vowel)) +
  #plot the vowel labels
  geom_text(data = PC3_change_labels1, aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel, size = PC3_loadings_abs), inherit.aes = FALSE, show.legend = FALSE) +
  #add year of birth change trajectories
  geom_path(size = 0.5, show.legend = FALSE) +
  #add end points (this gives the arrows)
  geom_path(data = PC3_change_labels2, aes(x = F2, y = F1, colour = Vowel, group = Vowel),
            arrow = arrow(ends = "first", type = "closed", length = unit(0.2, "cm")),
            inherit.aes = FALSE, show.legend = FALSE) +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #scale the size so the path is not too wide
  scale_size_continuous(range = c(1,5)) +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2.5), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #add a title
  # labs(title = "B) Change in PC3\n     variance explained = 17.1%") +
  #set the theme
  theme_bw() +
  #make title bold
  theme(plot.title = element_text(face="bold"))

PC3_change_plot

ggsave(plot = PC3_change_plot, filename = "Figures/PC3_plot_static.png", width = 5.5, height = 5.5, dpi = 300)

Visualisation by animation

PC3_plot_animation <- PC3_plot_data %>%
  arrange(Comp.3) %>%
  #set general aesthetics
  ggplot(aes(x = F2, y = F1, colour = Vowel, group = Vowel, label = Vowel)) +
  geom_text(aes(fontface = 2, size = PC3_loadings_abs), show.legend = FALSE) +
  geom_path() +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2), position = "top") +
  scale_y_reverse(limits = c(2.3,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  #set the label size
  scale_size_continuous(range = c(1,5)) +
  #add a title
  labs(caption = 'PC3 score: {round(frame_along, 2)}') +
  #set the theme
  theme_bw() +
  #make text more visible
  theme(axis.title = element_text(size = 14, face = "bold"),
        axis.text.x = element_text(size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold", angle = 270),
        axis.ticks = element_blank(),
        plot.caption = element_text(size = 30, hjust = 0),
        legend.position = "none") +
  #set the variable for the animation transition i.e. the time dimension
  transition_reveal(Comp.3) +
  #add in a trail to see the path
  # shadow_trail(max_frames = 100, alpha = 0.1) +
  ease_aes('linear')

animate(PC3_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20)

anim_save("Figures/PC3_animation.gif")

Comparison with sound change

We can also compare the smooths for the key variables (i.e. contributing > 5% variation) for each PC to the sound change smooths. These plots will use the GAMMs predicting F1/F2 by the PCA scores again,

To do this, we first select the key variables for each PC, filtering the data so only those key variables are visualised.

We then need to take the smooths from the initial GAMMs (i.e. those used to extract the speaker intercepts). Note, it is important to scale the year_of_birth variable from these models so a comparison can be made with the PCA scores. This is done by scaling to a range based on the PCA scores, so the earliest year of birth represents the smallest loading, wherea the most recent year of birth represents the highest loading. This preseves the data and the smooths are identical, but it simply makes visualisation easier.

PC1

#select the key vairables from the PCA
mod_pred_PC1_values1 <- mod_pred_PC1_values %>%
  select(Comp.1, fit, ul, ll, Vowel, Formant) %>%
  mutate(Comp.1 = -Comp.1,
         Variable = paste0(Formant, "_", Vowel),
         var = "PC1",
         participant_year_of_birth = rescale(Comp.1, to = range(mod_pred_PC1_values$participant_year_of_birth))) #rescale the year of birth variable

PC1_loadings_contrib1 <- PC1_loadings_contrib1 %>%
  mutate(Formant = substr(Variable, 1, 2),
         x = ifelse(Formant == "F1", -3.2, 2.8),
         y = -2.75,
         x1 = ifelse(Vowel != "THOUGHT", (min(mod_pred_PC1_values1$Comp.1)+max(mod_pred_PC1_values1$Comp.1))/2, x)) %>%
  group_by(Vowel) %>%
  mutate(Contribution.PC1_max = max(Contribution.PC1)) %>%
  ungroup() %>%
  mutate(Vowel = factor(.$Vowel, levels=unique(.$Vowel[order(.$Contribution.PC1_max)]), ordered=TRUE)) %>%
  group_by(Vowel)

#match the data to the year of birth smooths
mod_pred_yob_values1 <- mod_pred_yob_values %>%
  select(participant_year_of_birth, fit, ul, ll, Vowel, Formant) %>%
  mutate(Variable = paste0(Formant, "_", Vowel),
         var = "yob",
         #rescale the year of birth variable
         Comp.1 = rescale(participant_year_of_birth, to = range(mod_pred_PC1_values1$Comp.1))) %>%
  filter(Variable %in% mod_pred_PC1_values1$Variable) %>%
  rbind(mod_pred_PC1_values1) %>% #combine with the PCA scores data
  left_join(., PC1_contrib %>% select(Variable, Contribution.PC1)) %>% #get contribution values
  mutate(Contribution.PC1a = paste0(round(Contribution.PC1, 1), "%")) %>%
  #reorder the Vowel factor so it is ordered by contribution
  group_by(Vowel) %>%
  mutate(Contribution.PC1_max = max(Contribution.PC1)) %>%
  ungroup() %>%
  mutate(Vowel = factor(.$Vowel, levels=unique(.$Vowel[order(.$Contribution.PC1_max)]), ordered=TRUE)) %>%
  left_join(PC1_loadings_contrib1 %>% select(Variable, cumsum_PC1, highlight))

#plot all model smooths facetted by vowel
PC1_plot_smooth1 <- mod_pred_yob_values1 %>%
  ggplot(aes(x = Comp.1, y = fit, colour = Formant, fill = Formant, linetype = var)) +
  geom_line() + #smooth line
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.1, colour = NA) + #confidence intervals
  scale_x_continuous(breaks = c(-4, 0, 4),
                     sec.axis = dup_axis(breaks = c(-2.80243, 2.071311), labels = c(1900, 1950), name = "Participant year of birth"),  #add secondary axis for year of birth, this is done by rescaling the loading values to scale of year of birth
                     name = "PC1 score"
                     ) +
  scale_y_reverse(breaks = seq(-2, 2, 1), limits = c(2.5, -2.9)) + #revese the axis so it follows convention
  scale_linetype(name = NULL) + #remove the name (var) from the legend
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  geom_label(data = PC1_loadings_contrib1, aes(x = x, y = y, label = paste0(round(Contribution.PC1, 1), "%"), colour = Formant), size = 3, inherit.aes = FALSE, show.legend = FALSE) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

PC1_plot_smooth1

ggsave(plot = PC1_plot_smooth1, filename = "Figures/PC1_yob.png", width = 11, height = 5, dpi = 300)

#plot all model smooths facetted by vowel
PC1_plot_smooth_important <- mod_pred_yob_values1 %>%
  filter(highlight == "black") %>%
  ggplot(aes(x = Comp.1, y = fit, colour = Formant, fill = Formant, linetype = var)) +
  geom_line() + #smooth line
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.1, colour = NA) + #confidence intervals
  scale_x_continuous(breaks = c(-4, 0, 4),
                     sec.axis = dup_axis(breaks = c(-2.80243, 2.071311), labels = c(1900, 1950), name = "Participant year of birth"),  #add secondary axis for year of birth, this is done by rescaling the loading values to scale of year of birth
                     name = "PC1 score"
                     ) +
  scale_y_reverse(breaks = seq(-2, 2, 1), limits = c(2.5, -2.9)) + #revese the axis so it follows convention
  scale_linetype(name = NULL) + #remove the name (var) from the legend
  ylab("Predicted model fit (Lobanov 2.0)") +
  facet_grid(~Vowel) +
  geom_label(data = PC1_loadings_contrib1 %>% filter(highlight == "black"), aes(x = x1, y = y, label = paste0(round(Contribution.PC1, 1), "%"), colour = Formant), size = 3, inherit.aes = FALSE, show.legend = FALSE) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

PC1_plot_smooth_important

ggsave(plot = PC1_plot_smooth_important, filename = "Figures/PC1_yob_important.png", width = 5, height = 5, dpi = 300)

PC2

mod_pred_PC2_values2 <- mod_pred_PC2_values %>%
  select(Comp.2, fit, ul, ll, Vowel, Formant) %>%
  mutate(Comp.2 = -Comp.2,
         Variable = paste0(Formant, "_", Vowel),
         var = "PC2",
         participant_year_of_birth = rescale(Comp.2, to = range(mod_pred_yob_values$participant_year_of_birth)))

PC2_loadings_contrib1 <- PC2_loadings_contrib1 %>%
  mutate(Formant = substr(Variable, 1, 2),
         x = ifelse(Formant == "F1", -3, 3),
         y = -2.75) %>%
  group_by(Vowel) %>%
  mutate(Contribution.PC2_max = max(Contribution.PC2)) %>%
  ungroup() %>%
  mutate(Vowel = factor(.$Vowel, levels=unique(.$Vowel[order(.$Contribution.PC2_max)]), ordered=TRUE)) %>%
  group_by(Vowel)

mod_pred_yob_values2 <- mod_pred_yob_values %>%
  select(participant_year_of_birth, fit, ul, ll, Vowel, Formant) %>%
  mutate(Variable = paste0(Formant, "_", Vowel),
         var = "yob",
         Comp.2 = rescale(participant_year_of_birth, to = range(mod_pred_PC2_values2$Comp.2))) %>%
  filter(Variable %in% mod_pred_PC2_values2$Variable) %>%
  rbind(mod_pred_PC2_values2) %>%
  left_join(., PC2_contrib %>% select(Variable, Contribution.PC2)) %>%
  mutate(Contribution.PC2a = paste0(round(Contribution.PC2, 1), "%")) %>%
  group_by(Vowel) %>%
  mutate(Contribution.PC2_max = max(Contribution.PC2)) %>%
  ungroup() %>%
  mutate(Vowel = factor(.$Vowel, levels=unique(.$Vowel[order(.$Contribution.PC2_max)]), ordered=TRUE)) %>%
  left_join(PC2_loadings_contrib1 %>% select(Variable, cumsum_PC2, highlight))

#plot all model smooths facetted by vowel
PC2_plot_smooth1 <- mod_pred_yob_values2 %>%
  ggplot(aes(x = Comp.2, y = fit, colour = Formant, fill = Formant, linetype = var)) +
  geom_line() +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.1, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4),
                     sec.axis = dup_axis(breaks = c(-2.103943, 3.018724), labels = c(1900, 1950), name = "Participant year of birth"),  #add secondary axis for year of birth, this is done by rescaling the loading values to scale of year of birth
                     name = "PC2 score"
                     ) +
  scale_y_reverse(breaks = seq(-2, 2, 1), limits = c(2.5, -2.9)) +
  ylab("Predicted model fit (Lobanov 2.0)") +
  scale_linetype(name = NULL) +
  facet_grid(~Vowel) +
  geom_label(data = PC2_loadings_contrib1, aes(x = x, y = y, label = paste0(round(Contribution.PC2, 1), "%"), colour = Formant), size = 3, inherit.aes = FALSE, show.legend = FALSE) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

PC2_plot_smooth1

ggsave(plot = PC2_plot_smooth1, filename = "Figures/PC2_yob.png", width = 11, height = 5)

#plot all model smooths facetted by vowel
PC2_plot_smooth1_important <- mod_pred_yob_values2 %>%
  filter(highlight == "black") %>%
  ggplot(aes(x = Comp.2, y = fit, colour = Formant, fill = Formant, linetype = var)) +
  geom_line() +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.1, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4),
                     sec.axis = dup_axis(breaks = c(-2.103943, 3.018724), labels = c(1900, 1950), name = "Participant year of birth"),  #add secondary axis for year of birth, this is done by rescaling the loading values to scale of year of birth
                     name = "PC2 score"
                     ) +
  scale_y_reverse(breaks = seq(-2, 2, 1), limits = c(2.5, -2.9)) +
  ylab("Predicted model fit (Lobanov 2.0)") +
  scale_linetype(name = NULL) +
  facet_grid(~Vowel) +
  geom_label(data = PC2_loadings_contrib1 %>% filter(highlight == "black"), aes(x = 0, y = y, label = paste0(round(Contribution.PC2, 1), "%"), colour = Formant), size = 3, inherit.aes = FALSE, show.legend = FALSE) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

PC2_plot_smooth1_important

ggsave(plot = PC2_plot_smooth1_important, filename = "Figures/PC2_yob_important.png", width = 7, height = 5)

PC3

mod_pred_PC3_values3 <- mod_pred_PC3_values %>%
  select(Comp.3, fit, ul, ll, Vowel, Formant) %>%
  mutate(Comp.3 = -Comp.3,
         Variable = paste0(Formant, "_", Vowel),
         var = "PC3",
         participant_year_of_birth = rescale(Comp.3, to = range(mod_pred_yob_values$participant_year_of_birth)))

PC3_loadings_contrib1 <- PC3_loadings_contrib1 %>%
  mutate(Formant = substr(Variable, 1, 2),
         x = ifelse(Formant == "F1", -2.7, 2.7),
         y = -2.75) %>%
  group_by(Vowel) %>%
  mutate(Contribution.PC3_max = max(Contribution.PC3)) %>%
  ungroup() %>%
  mutate(Vowel = factor(.$Vowel, levels=unique(.$Vowel[order(.$Contribution.PC3_max)]), ordered=TRUE)) %>%
  group_by(Vowel)

mod_pred_yob_values3 <- mod_pred_yob_values %>%
  select(participant_year_of_birth, fit, ul, ll, Vowel, Formant) %>%
  mutate(Variable = paste0(Formant, "_", Vowel),
         var = "yob",
         Comp.3 = rescale(participant_year_of_birth, to = range(mod_pred_PC3_values3$Comp.3))) %>%
  filter(Variable %in% mod_pred_PC3_values3$Variable) %>%
  rbind(mod_pred_PC3_values3) %>%
  left_join(., PC3_contrib %>% select(Variable, Contribution.PC3)) %>%
  mutate(Contribution.PC3a = paste0(round(Contribution.PC3, 1), "%"),
         colour1 = ifelse(grepl("F1", Variable), "#F8766D", "#00BFC4")) %>%
  group_by(Vowel) %>%
  mutate(Contribution.PC3_max = max(Contribution.PC3)) %>%
  ungroup() %>%
  mutate(Vowel = factor(.$Vowel, levels=unique(.$Vowel[order(.$Contribution.PC3_max)]), ordered=TRUE)) %>%
  left_join(PC3_loadings_contrib1 %>% select(Variable, cumsum_PC3, highlight))

#plot all model smooths facetted by vowel
PC3_plot_smooth1 <- mod_pred_yob_values3 %>%
  ggplot(aes(x = Comp.3, y = fit, colour = Formant, fill = Formant, linetype = var)) +
  geom_line() +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.1, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4),
                     limits = c(-5, 5),
                     sec.axis = dup_axis(breaks = c(-1.71819, 2.261738), labels = c(1900, 1950), name = "Participant year of birth"),  #add secondary axis for year of birth, this is done by rescaling the loading values to scale of year of birth
                     name = "PC3 score"
                     ) +
  scale_y_reverse(breaks = seq(-2, 2, 1), limits = c(2.5, -2.9)) + #revese the axis so it follows convention
  scale_linetype(name = NULL) +
  xlab("PC3 score") +
  facet_grid(~Vowel) +
  geom_label(data = PC3_loadings_contrib1, aes(x = x, y = y, label = paste0(round(Contribution.PC3, 1), "%"), colour = Formant), size = 3, inherit.aes = FALSE, show.legend = FALSE) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

PC3_plot_smooth1

ggsave(plot = PC3_plot_smooth1, filename = "Figures/PC3_yob.png", width = 11, height = 5)

#plot all model smooths facetted by vowel
PC3_plot_smooth1_important <- mod_pred_yob_values3 %>%
  filter(highlight == "black") %>%
  ggplot(aes(x = Comp.3, y = fit, colour = Formant, fill = Formant, linetype = var)) +
  geom_line() +
  geom_ribbon(aes(ymin = ll, ymax = ul), alpha = 0.1, colour = NA) +
  scale_x_continuous(breaks = c(-4, 0, 4),
                     limits = c(-5, 5),
                     sec.axis = dup_axis(breaks = c(-1.71819, 2.261738), labels = c(1900, 1950), name = "Participant year of birth"),  #add secondary axis for year of birth, this is done by rescaling the loading values to scale of year of birth
                     name = "PC3 score"
                     ) +
  scale_y_reverse(breaks = seq(-2, 2, 1), limits = c(2.5, -2.9)) + #revese the axis so it follows convention
  scale_linetype(name = NULL) +
  xlab("PC3 score") +
  facet_grid(~Vowel) +
  geom_label(data = PC3_loadings_contrib1 %>% filter(highlight == "black"), aes(x = 0, y = y, label = paste0(round(Contribution.PC3, 1), "%"), colour = Formant), size = 3, inherit.aes = FALSE, show.legend = FALSE) +
  theme_bw() +
  theme(legend.position = "top") +
  guides(linetype=guide_legend(override.aes=list(fill=NA)))

PC3_plot_smooth1_important

ggsave(plot = PC3_plot_smooth1_important, filename = "Figures/PC3_yob_important.png", width = 6, height = 5)

Example speakers

To get an idea of how the vowel spaces of the speakers on the margins differ, i.e. those with the largest absolute PC socres, we can inspect the 12 speakers who have either the largest/smallest PC scores.

First, we will filter the data to focus on these 24 speakers.

PC_example_test <- vowels_all %>%
  group_by(Speaker, participant_year_of_birth, Gender, Vowel) %>%
  dplyr::summarise(F1_mean = mean(F1_lobanov_2.0),
            F2_mean = mean(F2_lobanov_2.0)) %>%
  left_join(PC_speaker_loadings %>% select(Speaker, Comp.1:Comp.3)) %>%
  left_join(PC1_change_labels1 %>% select(Vowel, PC1_loadings_abs)) %>%
  left_join(PC2_change_labels1 %>% select(Vowel, PC2_loadings_abs)) %>%
  left_join(PC3_change_labels1 %>% select(Vowel, PC3_loadings_abs)) %>%
  ungroup() %>%
  mutate(example_speaker = paste0(Speaker, " (", participant_year_of_birth, ")"),
         example1_PC1 = ifelse(dense_rank(Comp.1) <= 12, "high",
                               ifelse(dense_rank(-Comp.1) <= 12, "low", "normal")),
         example1_PC2 = ifelse(dense_rank(Comp.2) <= 12, "high",
                               ifelse(dense_rank(-Comp.2) <= 12, "low", "normal")),
         example1_PC3 = ifelse(dense_rank(Comp.3) <= 12, "high",
                               ifelse(dense_rank(-Comp.3) <= 12, "low", "normal")))

PC1

PC1_example_test_low <- PC_example_test %>%
  filter(example1_PC1 == "low") %>%
  mutate(example_speaker = factor(.$example_speaker, levels=unique(.$example_speaker[order(.$participant_year_of_birth)]), ordered=TRUE)) %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_size_continuous(range = c(2,5)) +
  geom_text(aes(size = PC1_loadings_abs), show.legend = FALSE) +
  scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(2.4, -2.4)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(2.4, -2.4)) +
  theme(plot.title = element_text(size = 20, face = "italic")) +
  facet_wrap(~factor(example_speaker)) +
  theme_bw()

PC1_example_test_high <- PC_example_test %>%
  filter(example1_PC1 == "high") %>%
  mutate(example_speaker = factor(.$example_speaker, levels=unique(.$example_speaker[order(.$participant_year_of_birth)]), ordered=TRUE)) %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_size_continuous(range = c(2,5)) +
  geom_text(aes(size = PC1_loadings_abs), show.legend = FALSE) +
  scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(2.4, -2.4)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(2.4, -2.4)) +
  theme(plot.title = element_text(size = 20, face = "italic")) +
  facet_wrap(~factor(example_speaker)) +
  theme_bw()

ggsave(plot = PC1_example_test_low, filename = "Figures/PC1_examples_low.png", width = 15, height = 15, dpi = 300)

ggsave(plot = PC1_example_test_high, filename = "Figures/PC1_examples_high.png", width = 15, height = 15, dpi = 300)

PC1_example_test_high

PC1_example_test_low

PC2

PC2_example_test_low <- PC_example_test %>%
  filter(example1_PC2 == "low") %>%
  mutate(example_speaker = factor(.$example_speaker, levels=unique(.$example_speaker[order(.$participant_year_of_birth)]), ordered=TRUE)) %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_size_continuous(range = c(2,5)) +
  geom_text(aes(size = PC2_loadings_abs), show.legend = FALSE) +
  scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(2.4, -2.4)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(2.4, -2.4)) +
  theme(plot.title = element_text(size = 20, face = "italic")) +
  facet_wrap(~factor(example_speaker)) +
  theme_bw()

PC2_example_test_high <- PC_example_test %>%
  filter(example1_PC2 == "high") %>%
  mutate(example_speaker = factor(.$example_speaker, levels=unique(.$example_speaker[order(.$participant_year_of_birth)]), ordered=TRUE)) %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_size_continuous(range = c(2,5)) +
  geom_text(aes(size = PC2_loadings_abs), show.legend = FALSE) +
  scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(2.4, -2.4)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(2.4, -2.4)) +
  theme(plot.title = element_text(size = 20, face = "italic")) +
  facet_wrap(~factor(example_speaker)) +
  theme_bw()

ggsave(plot = PC2_example_test_low, filename = "Figures/PC2_examples_low.png", width = 15, height = 15, dpi = 300)

ggsave(plot = PC2_example_test_high, filename = "Figures/PC2_examples_high.png", width = 15, height = 15, dpi = 300)

PC2_example_test_high

PC2_example_test_low

PC3

PC3_example_test_low <- PC_example_test %>%
  filter(example1_PC3 == "low") %>%
  mutate(example_speaker = factor(.$example_speaker, levels=unique(.$example_speaker[order(.$participant_year_of_birth)]), ordered=TRUE)) %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_size_continuous(range = c(2,5)) +
  geom_text(aes(size = PC3_loadings_abs), show.legend = FALSE) +
  scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(2.4, -2.4)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(2.4, -2.4)) +
  theme(plot.title = element_text(size = 20, face = "italic")) +
  facet_wrap(~factor(example_speaker)) +
  theme_bw()

PC3_example_test_high <- PC_example_test %>%
  filter(example1_PC3 == "high") %>%
  mutate(example_speaker = factor(.$example_speaker, levels=unique(.$example_speaker[order(.$participant_year_of_birth)]), ordered=TRUE)) %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_size_continuous(range = c(2,5)) +
  geom_text(aes(size = PC3_loadings_abs), show.legend = FALSE) +
  scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(2.4, -2.4)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(2.6, -2.4)) +
  theme(plot.title = element_text(size = 20, face = "italic")) +
  facet_wrap(~factor(example_speaker)) +
  theme_bw()

ggsave(plot = PC3_example_test_low, filename = "Figures/PC3_examples_low.png", width = 15, height = 15, dpi = 300)

ggsave(plot = PC3_example_test_high, filename = "Figures/PC3_examples_high.png", width = 15, height = 15, dpi = 300)

PC3_example_test_high

PC3_example_test_low

Speaker comparisons

To give an impression of how speaker’s with high/low loadings compare in terms of their vowel spaces, we can take 2 examples from each of the PCs. These examples were chosen based on the Shiny app, importantly they are chosen on the basis that they have similar year of birth and the same gender. This shows that irrespective of their demographic information, we can find contrasting speakers in our dataset who exhibit co-variation of the vocalic variables in each of the PCs.

#make a data frame of the example speakers mean F1 and F2 for each vowel, add in additional information such as PC loading and PCA score
PC_example_data <- vowels_all %>%
  group_by(Speaker, participant_year_of_birth, Gender, Vowel) %>%
  dplyr::summarise(F1_mean = mean(F1_lobanov_2.0),
            F2_mean = mean(F2_lobanov_2.0)) %>%
  left_join(PC_speaker_loadings %>% select(Speaker, Comp.1:Comp.3)) %>%
  left_join(PC1_change_labels1 %>% select(Vowel, PC1_loadings_abs)) %>%
  left_join(PC2_change_labels1 %>% select(Vowel, PC2_loadings_abs)) %>%
  left_join(PC3_change_labels1 %>% select(Vowel, PC3_loadings_abs)) %>%
  ungroup()

PC_example_plot <- PC_example_data %>%
  ggplot(aes(x = F2_mean, y = F1_mean, label = Vowel, colour = Vowel)) +
        scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                      "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
        scale_size_continuous(range = c(2,5)) +
        scale_x_reverse(position = "top", name = "F2 (normalised)", limits = c(max(PC_example_data$F2_mean) + 0.2, min(PC_example_data$F2_mean) - 0.3)) +
        scale_y_reverse(position = "right", name = "F1 (normalised)", limits = c(max(PC_example_data$F1_mean), min(PC_example_data$F1_mean))) +
        xlab("F2 (normalised") +
        ylab("F1 (normalised") +
        theme_bw() +
        theme(strip.text = element_text(size = 14))

PC1_example_old <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "IA_f_099"), aes(size = PC1_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "A. Older", subtitle = "yob: 1922, PC1 score: -0.43") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC1_example_old, filename = "Figures/PC1_example_old.png", dpi = 300, width = 5, height = 5)

PC1_example_small <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_f_391"), aes(size = PC1_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "A. Low", subtitle = "yob: 1942, PC1 score: -5.26") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC1_example_small, filename = "Figures/PC1_example_small.png", dpi = 300, width = 5, height = 5)

PC1_example_large <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_f_052"), aes(size = PC1_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "B. High", subtitle = "yob: 1942, PC1 score: 3.55") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC1_example_large, filename = "Figures/PC1_example_large.png", dpi = 300, width = 5, height = 5)

PC1_example_young <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "Darfield_f_612"), aes(size = PC1_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "D. Younger", subtitle = "yob: 1961, PC1 score: -0.53") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC1_example_young, filename = "Figures/PC1_example_young.png", dpi = 300, width = 5, height = 5)

PC1_example_speakers_all <- plot_grid(PC1_example_old, PC1_example_small, PC1_example_large, PC1_example_young, nrow = 1)

PC1_example_speakers_all

ggsave(plot = PC1_example_speakers_all, filename = "Figures/PC1_example_speakers_all.png", dpi = 300, width = 18, height = 6)

PC2_example_old <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "IA_m_077"), aes(size = PC2_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "A. Older", subtitle = "yob: 1914, PC2 score: 0.58") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC2_example_old, filename = "Figures/PC2_example_old.png", dpi = 300, width = 5, height = 5)

PC2_example_small <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_m_139"), aes(size = PC2_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "B. Lagger", subtitle = "yob: 1933, PC2 score: -3.16") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC2_example_small, filename = "Figures/PC2_example_small.png", dpi = 300, width = 5, height = 5)

PC2_example_large <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_m_406"), aes(size = PC2_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "C. Leader", subtitle = "yob: 1934, PC2 score: 3.79") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC2_example_large, filename = "Figures/PC2_example_large.png", dpi = 300, width = 5, height = 5)

PC2_example_young <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_m_461"), aes(size = PC2_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "D. Younger", subtitle = "yob: 1953, PC2 score: 0.54") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC2_example_young, filename = "Figures/PC2_example_young.png", dpi = 300, width = 5, height = 5)

PC2_example_speakers_all <- plot_grid(PC2_example_old, PC2_example_small, PC2_example_large, PC2_example_young, nrow = 1)

PC2_example_speakers_all

ggsave(plot = PC2_example_speakers_all, filename = "Figures/PC2_example_speakers_all.png", dpi = 300, width = 18, height = 6)

PC3_example_old <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "IA_f_333"), aes(size = PC3_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "A. Older", subtitle = "yob: 1931, PC3 score: 1.23") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC3_example_old, filename = "Figures/PC3_example_old.png", dpi = 300, width = 5, height = 5)

PC3_example_small <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_f_215"), aes(size = PC3_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "A. Low", subtitle = "yob: 1950, PC3 score: -3.41") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC3_example_small, filename = "Figures/PC3_example_small.png", dpi = 300, width = 5, height = 5)

PC3_example_large <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_f_330"), aes(size = PC3_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "B. High", subtitle = "yob: 1952, PC3 score: 3.26") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC3_example_large, filename = "Figures/PC3_example_large.png", dpi = 300, width = 5, height = 5)

PC3_example_young <- PC_example_plot +
  geom_text(data = PC_example_data %>%
  filter(Speaker == "CC_f_343"), aes(size = PC3_loadings_abs), show.legend = FALSE) +
  ggtitle(label = "D. Younger", subtitle = "yob: 1971, PC3 score: -0.11") +
  theme(plot.title = element_text(size = 20, face = "bold"), plot.subtitle = element_text(size = 15))

ggsave(plot = PC3_example_young, filename = "Figures/PC3_example_young.png", dpi = 300, width = 5, height = 5)

PC3_example_speakers_all <- plot_grid(PC3_example_old, PC3_example_small, PC3_example_large, PC3_example_young, nrow = 1)

PC3_example_speakers_all

ggsave(plot = PC3_example_speakers_all, filename = "Figures/PC3_example_speakers_all.png", dpi = 300, width = 18, height = 6)

Combined

To inspect the variables easily in one plot, we will now put the sound change plots and the PC plots together. This will reproduce Figure XX.

vowel_plots_combined <- plot_grid(sound_change_plot, PC1_change_plot, PC2_change_plot, PC3_change_plot, nrow = 1)

vowel_plots_combined1 <- plot_grid(NULL, NULL, NULL, NULL, nrow = 1, labels=c("A) Sound change", "B) PC1: var = 17.2%", "C) PC2: var = 15.8%", "D) PC3: var = 10.1%"), hjust = 0, label_size = 20)

vowel_plots_combined <- plot_grid(vowel_plots_combined1, vowel_plots_combined, nrow = 2, rel_heights = c(0.07, 1))

vowel_plots_combined

ggsave(plot = vowel_plots_combined, filename = "Figures/vowel_plots_combined.png", width = 20, height = 6, dpi = 300)

Another way to combine the dot plots and the vowel plots would be:

top_row <- plot_grid(NULL)

#PC1
PC1_variables_combined_plot1 <- plot_grid(NULL, PC1_loadings_contrib1_plot, NULL, PC1_change_plot, rel_widths = c(0.03, 0.6, 0.03, 0.4), rel_heights = c(0.03, 0.6, 0.03, 0.4), labels = c("A", "", "B", ""), label_size = 25, align = "h", nrow = 1, ncol = 5)

PC1_variables_combined_plot <- plot_grid(top_row, PC1_variables_combined_plot1, rel_heights = c(0.1, 1), rel_widths = c(1, 1), labels = c("PC1", ""), label_size = 25, label_x = -0.015, align = "hv", nrow = 2)

PC1_variables_combined_plot

ggsave(plot = PC1_variables_combined_plot, filename = "Figures/PC1_plot_variables_combined.png", width = 15, height = 7, dpi = 300)

#PC2
PC2_variables_combined_plot1 <- plot_grid(NULL, PC2_loadings_contrib1_plot, NULL, PC2_change_plot, rel_widths = c(0.03, 0.6, 0.03, 0.4), rel_heights = c(0.03, 0.6, 0.03, 0.4), labels = c("A", "", "B", ""), label_size = 25, align = "h", nrow = 1, ncol = 5)

PC2_variables_combined_plot <- plot_grid(top_row, PC2_variables_combined_plot1, rel_heights = c(0.1, 1), rel_widths = c(1, 1), labels = c("PC2", ""), label_size = 25, label_x = -0.015, align = "hv", nrow = 2)

PC2_variables_combined_plot

ggsave(plot = PC2_variables_combined_plot, filename = "Figures/PC2_plot_variables_combined.png", width = 15, height = 7, dpi = 300)

#PC3
PC3_variables_combined_plot1 <- plot_grid(NULL, PC3_loadings_contrib1_plot, NULL, PC3_change_plot, rel_widths = c(0.03, 0.6, 0.03, 0.4), rel_heights = c(0.03, 0.6, 0.03, 0.4), labels = c("A", "", "B", ""), label_size = 25, align = "h", nrow = 1, ncol = 5)

PC3_variables_combined_plot <- plot_grid(top_row, PC3_variables_combined_plot1, rel_heights = c(0.1, 1), rel_widths = c(1, 1), labels = c("PC3", ""), label_size = 25, label_x = -0.015, align = "hv", nrow = 2)

PC3_variables_combined_plot

ggsave(plot = PC3_variables_combined_plot, filename = "Figures/PC3_plot_variables_combined.png", width = 15, height = 7, dpi = 300)

PCA score plot combined with the example speakers

#PC1
PC1_speaker_loadings_example_plot1 <- ggdraw(PC1_speaker_loadings +
    # geom_point(aes(x = 1922, y = -0.43), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1942, y = -5.26), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1942, y = 3.55), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE)
    # +
    # geom_point(aes(x = 1961, y = -0.53), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE)
    )

PC1_speaker_loadings_example_plot <- plot_grid(PC1_speaker_loadings_example_plot1, plot_grid(
  # PC1_example_old,
  PC1_example_small,
  PC1_example_large,
  # PC1_example_young,
          nrow = 1),
          nrow = 2,
  rel_heights = c(1.5, 2))

PC1_speaker_loadings_example_plot

ggsave(plot = PC1_speaker_loadings_example_plot, filename = "Figures/PC1_speaker_loadings_example.png", dpi = 300, width = 12, height = 9)

#PC2
PC2_speaker_loadings_example_plot1 <- ggdraw(PC2_speaker_loadings +
    geom_point(aes(x = 1914, y = 0.58), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1933, y = -3.2), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1934, y = 3.8), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1953, y = 0.54), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE))

PC2_speaker_loadings_example_plot <- plot_grid(PC2_speaker_loadings_example_plot1, plot_grid(PC2_example_old, PC2_example_small, PC2_example_large, PC2_example_young,
          nrow = 1),
          nrow = 2)

PC2_speaker_loadings_example_plot

ggsave(plot = PC2_speaker_loadings_example_plot, filename = "Figures/PC2_speaker_loadings_example.png", dpi = 300, width = 18, height = 10)

#PC3
PC3_speaker_loadings_example_plot1 <- ggdraw(PC3_speaker_loadings +
    # geom_point(aes(x = 1931, y = 1.23), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1950, y = -3.41), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE) +
    geom_point(aes(x = 1952, y = 3.26), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE)
    # +
    # geom_point(aes(x = 1971, y = -0.11), colour = "red", size = 7, shape = 1, stroke = 2, show.legend = FALSE)
    )

PC3_speaker_loadings_example_plot <- plot_grid(PC3_speaker_loadings_example_plot1, plot_grid(
  # PC3_example_old,
  PC3_example_small,
  PC3_example_large,
  # PC3_example_young,
          nrow = 1),
          nrow = 2,
  rel_heights = c(1.5, 2))

PC3_speaker_loadings_example_plot

ggsave(plot = PC3_speaker_loadings_example_plot, filename = "Figures/PC3_speaker_loadings_example.png", dpi = 300, width = 12, height = 9)

Additional analyses

Mixed-effects modelling

An alternative approach to obtaining the speaker intercepts would be to use linear mixed-effects modelling. This is done using the lme4 package, if you are unfamiliar with this form of regression analysis, an introductory tutorial Winter, 2013 can be found here for part 1 and here for part 2, for further information about why we chose the by-speaker intercepts, please refer to the manuscript or see Drager and Hay (2012)

We will generate the intercepts and visually see if there is a strong linear relationship between the intercepts generated from the GAMMs.

Note, this was our original approach taken to the analysis, we decided however that the GAMMs were the most suitable approach given our dataset. We include the following code below for those interested in using linear mixed-effects models.

Fitting procedure

The fitting procedure is identical to that used in the GAMMs analysis.

All models will be fit uniformly, i.e. with the same fixed and random effects structures (note, we model the year of birth variable non-linearly, this is done with a restricted cubic spline with 4 knots, see here for more information about what this means).

Note this process takes ~4 minutes to complete.

#create a data frame to store the intercepts from the models, this will initially contain just the speaker names
lmer_intercepts.tmp <- vowels_all %>%
  select(Speaker) %>%
  arrange(Speaker) %>%
  distinct()

#loop through the vowels

for (i in levels(factor(vowels_all$Vowel))) {
  
  cat(paste0(i, ": ", format(Sys.time(), "%d %B %Y, %r"), " ✅\n"))  #print the vowel the loop is up to
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F1 for FLEECE
  lmer.F1 <- lmer(F1_lobanov_2.0 ~
                  scale(rcs(participant_year_of_birth, 4))*Gender +
                  Speech_rate +
                  (1|Speaker) +
                  (1|Word),
                data = vowels_all,
                subset = Vowel == i)
  
  #extract the speaker intercepts from the model and store them in a temporary data frame
  lmer.F1.intercepts.tmp <- ranef(lmer.F1)[["Speaker"]]
  
  #run the mixed-effects model on the vowel, i.e. if i = FLEECE this will model F2 for FLEECE
  lmer.F2 <- lmer(F2_lobanov_2.0 ~
                  scale(rcs(participant_year_of_birth, 4))*Gender +
                  Speech_rate +
                  (1|Speaker) +
                  (1|Word),
                data = vowels_all,
                subset = Vowel == i)
  
  #extract the speaker intercepts again, storing them in a separate data frame
  lmer.F2.intercepts.tmp <- ranef(lmer.F2)[["Speaker"]]
  
  #rename the variables so it clear which one has F1/F2, i.e. this will give F1_FLEECE, F2_FLEECE
  names(lmer.F1.intercepts.tmp) <- paste0("F1_", i)
  names(lmer.F2.intercepts.tmp) <- paste0("F2_", i)
  
  #combine the intercepts for F1 and F2 and store them in the intercepts.tmp_stress data frame
  lmer_intercepts.tmp <- cbind(lmer_intercepts.tmp, lmer.F1.intercepts.tmp, lmer.F2.intercepts.tmp)
}

write.csv(lmer_intercepts.tmp, "Data/lmer_intercepts.csv", row.names = FALSE)

Load in the lmer intercepts

lmer_intercepts.tmp <- read.csv("Data/lmer_intercepts.csv")

Comparison with GAMMs

Visualise the GAMM intercepts with the lmer intercepts

lmer_intercepts.tmp_F1 <- lmer_intercepts.tmp %>%
  select(Speaker, contains("F1")) %>%
  pivot_longer(-Speaker, names_to = "F_variable", values_to = "F1_lmer_intercept") %>%
  mutate(Vowel = gsub("F1_", "", F_variable))

gam_intercepts.tmp_F1 <- gam_intercepts.tmp %>%
  select(Speaker, contains("F1")) %>%
  pivot_longer(-Speaker, names_to = "F_variable", values_to = "F1_GAMM_intercept") %>%
  mutate(Vowel = gsub("F1_", "", F_variable))

gam_intercepts.tmp_F1 %>%
  left_join(., lmer_intercepts.tmp_F1) %>%
  ggplot(aes(x = F1_GAMM_intercept, y = F1_lmer_intercept, colour = Vowel)) +
  geom_point(size = 0.5, show.legend = FALSE) +
  facet_wrap(~Vowel) +
  theme_bw()

lmer_intercepts.tmp_F2 <- lmer_intercepts.tmp %>%
  select(Speaker, contains("F2")) %>%
  pivot_longer(-Speaker, names_to = "F_variable", values_to = "F2_lmer_intercept") %>%
  mutate(Vowel = gsub("F2_", "", F_variable))

gam_intercepts.tmp_F2 <- gam_intercepts.tmp %>%
  select(Speaker, contains("F2")) %>%
  pivot_longer(-Speaker, names_to = "F_variable", values_to = "F2_GAMM_intercept") %>%
  mutate(Vowel = gsub("F2_", "", F_variable))

gam_intercepts.tmp_F2 %>%
  left_join(., lmer_intercepts.tmp_F2) %>%
  ggplot(aes(x = F2_GAMM_intercept, y = F2_lmer_intercept, colour = Vowel)) +
  geom_point(size = 0.5, show.legend = FALSE) +
  facet_wrap(~Vowel) +
  theme_bw()

Shiny app data

#create data containing means for each speaker and vowel as well as the speaker and variable loadings
Speaker_vowel_means <- vowels_all %>%
  group_by(Speaker, Corpus, Gender, participant_year_of_birth, Vowel) %>%
  summarise(F1_mean = mean(F1_lobanov_2.0),
            F2_mean = mean(F2_lobanov_2.0),
            F1_mean_raw = mean(F1_50),
            F2_mean_raw = mean(F2_50)) %>%
  left_join(PC_speaker_loadings %>% select(Speaker, Comp.1:Comp.3)) %>%
  left_join(PC1_change_labels1 %>% select(Vowel, PC1_loadings_abs)) %>%
  left_join(PC2_change_labels1 %>% select(Vowel, PC2_loadings_abs)) %>%
  left_join(PC3_change_labels1 %>% select(Vowel, PC3_loadings_abs)) %>%
  ungroup()

#save the file in the shiny app folder
saveRDS(Speaker_vowel_means, "Covariation_shiny/ONZE_summary.rds")

#model prediction values for sound change, PC1, PC2 and PC3
mod_pred_data <- sound_change_plot_data %>%
  dplyr::rename(F1_yob = F1,
                F2_yob = F2) %>%
  cbind(PC1_plot_data %>% select(Comp.1, F1:F2) %>%
          dplyr::rename(F1_PC1 = F1,
                F2_PC1 = F2)) %>%
          # mutate(Comp.1 = ifelse(Comp.1 >= 0, as.numeric(as.character(substr(Comp.1, 1, 5))),
          #                        as.numeric(as.character(substr(Comp.1, 1, 6)))))) %>%
  cbind(PC2_plot_data %>% select(Comp.2, F1:F2) %>%
          dplyr::rename(F1_PC2 = F1,
                F2_PC2 = F2)) %>%
          # mutate(Comp.2 = ifelse(Comp.2 >= 0, as.numeric(as.character(substr(Comp.2, 1, 5))),
          #                        as.numeric(as.character(substr(Comp.2, 1, 6)))))) %>%
  cbind(PC3_plot_data %>% select(Comp.3, F1:F2) %>%
          dplyr::rename(F1_PC3 = F1,
                F2_PC3 = F2)) %>%
          # mutate(Comp.3 = ifelse(Comp.3 >= 0, as.numeric(as.character(substr(Comp.3, 1, 5))),
          #                        as.numeric(as.character(substr(Comp.3, 1, 6)))))) %>%
  left_join(PC1_change_labels1 %>% select(Vowel, PC1_loadings_abs)) %>%
  left_join(PC2_change_labels1 %>% select(Vowel, PC2_loadings_abs)) %>%
  left_join(PC3_change_labels1 %>% select(Vowel, PC3_loadings_abs))

saveRDS(mod_pred_data, "Covariation_shiny/mod_pred_data.rds")

#save the animations
PC1_plot_animation1 <- animate(PC1_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20, width= 400, height=450)

anim_save(animation = PC1_plot_animation1, "Covariation_shiny/www/PC1_animation.gif")

PC2_plot_animation1 <- animate(PC2_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20, width= 400, height=450)

anim_save(animation = PC2_plot_animation1, "Covariation_shiny/www/PC2_animation.gif")

PC3_plot_animation1 <- animate(PC3_plot_animation, nframes = 200, fps = 5, rewind = FALSE, start_pause = 10, end_pause = 10, duration = 20, width= 400, height=450)

anim_save(animation = PC3_plot_animation1, "Covariation_shiny/www/PC3_animation.gif")

Ploting PCA by loading

An alternative to understanding how the variables contribute to each of the PCs is to look at the loadings. These are similiar to the % contribution values we present in the paper, they are widely used in PCA and the larger the absolute value (i.e. either positive or negative) then the more that variable contributes to the PC.

In the plots below, we present the absolute loading on the y axis, the red dashed line (which is located through y = square root of 0.05, approximately 0.224) gives the baseline where all loadings would be if they contributed equally to the PC.

PC1_loadings_plot <- ggplot(PC1_contrib, aes(x=reorder(Variable, Loading_abs), y=Loading_abs)) + #have the variable name on the x axis and loading value on the y
  geom_text(aes(alpha = highlight, label=ifelse(PC1_contrib$direction=="red","–", "+")), colour = PC1_contrib$direction, size = 6, fontface="bold", show.legend = FALSE) + 
  xlab("") + #remove the x axis title
  ylab("PC1 loading (absolute)") +
  geom_hline(yintercept = sqrt(0.05), color = "red", linetype = "dashed") + #add a red dashed line to identify the 0.2 cut off
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), limits = c(0, 0.4)) +
  scale_alpha_manual(values=c(1, 0.3)) +
  theme_bw() + #set the aesthetics of the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC1_contrib$highlight, size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

PC2_loadings_plot <- ggplot(PC2_contrib, aes(x=reorder(Variable, Loading_abs), y=Loading_abs)) + #have the variable name on the x axis and loading value on the y
  geom_text(aes(alpha = highlight, label=ifelse(PC2_contrib$direction=="red","–", "+")), colour = PC2_contrib$direction, size = 6, fontface="bold", show.legend = FALSE) + 
  xlab("") + #remove the x axis title
  ylab("PC2 loading (absolute)") +
  geom_hline(yintercept = sqrt(0.05), color = "red", linetype = "dashed") + #add a red dashed line to identify the 0.2 cut off
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), limits = c(0, 0.4)) +
  scale_alpha_manual(values=c(1, 0.3)) +
  theme_bw() + #set the aesthetics of the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC2_contrib$highlight, size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted
  
PC3_loadings_plot <- ggplot(PC3_contrib, aes(x=reorder(Variable, Loading_abs), y=Loading_abs)) + #have the variable name on the x axis and loading value on the y
  geom_text(aes(alpha = highlight, label=ifelse(PC3_contrib$direction=="red","–", "+")), colour = PC3_contrib$direction, size = 6, fontface="bold", show.legend = FALSE) + 
  xlab("") + #remove the x axis title
  ylab("PC3 loading (absolute)") +
  geom_hline(yintercept = sqrt(0.05), color = "red", linetype = "dashed") + #add a red dashed line to identify the 0.2 cut off
  scale_y_continuous(breaks = seq(0, 0.4, 0.1), limits = c(0, 0.4)) +
  scale_alpha_manual(values=c(1, 0.3)) +
  theme_bw() + #set the aesthetics of the plot
  theme(axis.text.x = element_text(angle = 90, hjust = 1, colour = PC3_contrib$highlight, size = 14, face = "bold"),
        axis.text.y = element_text(size = 14, face = "bold"),
        axis.title = element_text(size = 14, face = "bold")) #modify the x axis variable names so they are rotated and highlighted

#combine the 3 separate PC plots in to one
PC_loadings <- plot_grid(NULL, PC1_loadings_plot, PC2_loadings_plot, PC3_loadings_plot, labels = c("", "PC1 (17.2% variance)", "PC2 (15.8% variance)", "PC3 (10.1% variance)"), label_y = 1.04, rel_heights = c(0.1, 1, 1, 1), nrow = 4)

#view the plot
PC_loadings

ggsave(plot = plot_grid(NULL, PC1_loadings_plot, PC2_loadings_plot, PC3_loadings_plot, labels = c("", "PC1 (17.2% variance)", "PC2 (15.8% variance)", "PC3 (10.1% variance)"), label_x = -0.17, label_y = 1.05, rel_heights = c(0.1, 1, 1, 1), nrow = 4), filename = "Figures/PC_loadings.png", dpi = 600, height = 12, width = 5)

#transform data so there are separate columns for F1 and F2
sound_change_plot_data1 <- mod_pred_yob_values %>%
  select(participant_year_of_birth, fit, Vowel, Formant) %>%
  pivot_wider(names_from = Formant, values_from = fit) %>% #transform the data to wide format so there are separate F1 and F2 variables
  mutate(PC1 = ifelse(Vowel %in% c("START", "STRUT", "THOUGHT"), TRUE, FALSE),
         PC2 = ifelse(Vowel %in% c("DRESS", "LOT", "NURSE", "FLEECE", "KIT", "TRAP"), TRUE, FALSE),
         PC3 = ifelse(Vowel %in% c("DRESS", "GOOSE", "LOT", "START"), TRUE, FALSE))

#make data frame to plot starting point, this will give the vowel labels based on the smallest year of birth coordinates
sound_change_labels1 <- sound_change_plot_data1 %>%
  group_by(Vowel) %>%
  filter(participant_year_of_birth == min(participant_year_of_birth)) %>%
  mutate(F2 = ifelse(Vowel == "DRESS", 1.25,
                     ifelse(Vowel == "LOT", -1.5,
                            ifelse(Vowel == "KIT", F2 - 0.2, F2))),
         F1 = ifelse(Vowel %in% c("DRESS", "NURSE", "THOUGHT", "LOT", "STRUT"), F1 - 0.4,
                     ifelse(Vowel == "TRAP", F1 - 0.6,
                            ifelse(Vowel == "KIT", F1 + 0.01,
                                   ifelse(Vowel == "START", F1 + 0.02, F1))))) %>%
  ungroup()

hull_PC1 <- sound_change_labels1 %>%
  filter(PC1 == TRUE) %>%
  # select(Vowel, F2, F1) %>%
  distinct() %>%
  slice(chull(F1, F2)) %>%
  mutate(PC = "PC1") %>%
  as.data.frame()

hull_PC2 <- sound_change_labels1 %>%
  filter(PC2 == TRUE) %>%
  distinct() %>%
  slice(chull(F1, F2)) %>%
  mutate(PC = "PC2") %>%
  as.data.frame()

hull_PC3 <- sound_change_labels1 %>%
  filter(PC3 == TRUE) %>%
  # select(Vowel, F2, F1) %>%
  distinct() %>%
  slice(chull(F1, F2)) %>%
  mutate(PC = "PC3") %>%
  as.data.frame()

hulls <- rbind(hull_PC1, head(hull_PC1, 1), hull_PC2, head(hull_PC2, 1), hull_PC3, head(hull_PC3, 1)) %>%
  left_join(sound_change_labels1 %>% rename(F1_label = F1, F2_label = F2))

#plot
sound_change_plot <- hulls %>%
  ggplot() +
  geom_path(aes(x = F2, y = F1, linetype = PC), colour = "black", alpha = 1, fill=NA, show.legend = TRUE) +
  geom_label(aes(x = F2_label, y = F1_label, label = Vowel), colour = "white", fill = "white", lwd = NA, show.legend = FALSE) +
  geom_text(aes(x = F2_label, y = F1_label, colour = Vowel, label = Vowel), show.legend = FALSE) +
  #label the axes
  xlab("F2 (normalised)") +
  ylab("F1 (normalised)") +
  #scale the size so the path is not too wide
  scale_size_continuous(range = c(0.2,1)) +
  #reverse the axes to follow conventional vowel plotting
  scale_x_reverse(limits = c(2,-2), position = "top") +
  scale_y_reverse(limits = c(2,-2), position = "right") +
  #set the colours
  scale_color_manual(values = c("#9590FF", "#D89000", "#A3A500", "#39B600", "#00BF7D",
                                 "#00BFC4", "#00B0F6", "#F8766D", "#E76BF3", "#FF62BC")) +
  scale_linetype_manual(values = c(1, 5, 3)) +
  #set the theme
  theme_bw() +
  #make title bold
  theme(plot.title = element_text(face="bold"),
        legend.position = c(0.87, 0.9),
        # legend.background = element_rect(colour = 'black', fill = 'white', linetype='solid'),
        legend.title = element_blank()) +
  guides(linetype=guide_legend(keywidth = 2.5, keyheight = 1, label.position = "left"),
         colour=guide_colorbar(NULL))

sound_change_plot

ggsave(plot = sound_change_plot, filename = "Figures/summary_PC.png", width = 5.5, height = 5.5, dpi = 300)

Sound change GAMM summaries

For reference, the output below gives the GAMM model summaries for the initial GAMM modelling (i.e. predicting the normalised F1/F2 values by year of birth, gender and speech). These models are resource heavy and take up a lot of memory (including the PC models, all 80 GAMM files run in this script take up just under 13GB on my computer), but to obtain a smaller and more efficient summary output, the covariation matricies are dropped, which reduces the file size considerably, but retains the basic summary information.

#make vector containing all .rds filenames from model_summaries folder
model_summary_files = list.files(pattern="*.rds", path = "/Users/james/Documents/GitHub/model_summaries/")

#load each of the files with for loop
for (i in model_summary_files) {
  print(i)
  assign(gsub(".rds", "", i), readRDS(paste0("/Users/james/Documents/GitHub/model_summaries/", i)))
  
  yob_model_summary <- summary.gam(get(gsub(".rds", "", i)), re.test = FALSE)
  
  yob_model_summary$cov.unscaled <- c()
  yob_model_summary$cov.scaled <- c()
  
  saveRDS(object = yob_model_summary, file = paste0("Data/Models/Summaries/trimmed_", i))
}

#make vector containing all .rds filenames from model_summaries folder
model_summary_files = list.files(pattern="*.rds", path = "Data/Models/Summaries/")

for (i in model_summary_files) {
  cat("\n------------------------")
  cat(paste0("\n", i, "\n------------------------"))
  model_summary <- readRDS(paste0("Data/Models/Summaries/", i))
  print(model_summary)
  cat("\n\n")
}

## 
## ------------------------
## trimmed_gam_F1_DRESS.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_FLEECE.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_GOOSE.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_KIT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_LOT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_NURSE.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_START.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_STRUT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_THOUGHT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F1_TRAP.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_DRESS.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_FLEECE.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_GOOSE.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_KIT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_LOT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_NURSE.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_START.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_STRUT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_THOUGHT.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284
## 
## 
## 
## ------------------------
## trimmed_gam_F2_TRAP.rds
## ------------------------
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F2_lobanov_2.0 ~ s(participant_year_of_birth, k = 10, bs = "ad", 
##     by = Gender) + s(participant_year_of_birth, k = 10, bs = "ad") + 
##     Gender + s(Speech_rate) + s(Speaker, bs = "re") + s(Word, 
##     bs = "re")
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.78515    0.02040  38.490   <2e-16 ***
## GenderM     -0.01534    0.02344  -0.655    0.513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                        edf Ref.df      F  p-value    
## s(participant_year_of_birth):GenderM 1.767  1.846  6.196  0.00271 ** 
## s(participant_year_of_birth)         3.407  3.532 16.545 6.63e-12 ***
## s(Speech_rate)                       1.001  1.002  3.737  0.05314 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.178   Deviance explained = 20.7%
## fREML =  35260  Scale est. = 0.48792   n = 32284

cat(paste0("End time:\n", format(Sys.time(), "%d %B %Y, %r")))

## End time:
## 12 May 2020, 05:21:35 pm

end_time <- Sys.time()

cat(paste0("\n---------------\nTotal time to compile:\n", as.numeric(end_time - start_time), " minutes\n---------------\n"))

## 
## ---------------
## Total time to compile:
## 11.3688600977262 minutes
## ---------------