ERCEL R minikurz

Data processing using the Tidyverse

Worksheet translations available

Disclaimer: may not be very accurate...

Outline

This workshop session will focus on using the tidyverse package in R to complete common data processing and visualisation steps.

We will focus on the following key learning outcomes:

How to work with a data set that is quite messy
How to clean up the data by using various Tidyverse functions
How to understand different types of data and the ways to structure them

Pre-requisites

You do not need to be an expert in R to complete this worksheet
If you know R but are unfamiliar with the Tidyverse, I hope you will find it helpful and see the differences with base R coding
If you are familiar with the Tidyverse, I hope you might find something new in here

Download the worksheet materials

You can download a .zip folder which contains all the required materials by clicking on the link below:

ERCEL_R_tutorial.zip

Once you have downloaded the folder, double click it to make sure it has been unzipped.

Files

You should see the following folders/files (you can see a brief description by hovering over each item):

Data



Plots

Open the `ercel_analysis.R` file

The main file we will be using in this worksheet is ercel_analysis.R.

You should open the file in RStudio.

It should look like the screenshot below:

You will notice that there is not much code in the script, we will steadily add code to it throughout the worksheet.

Remember, that any lines of code that have a # at the start are comments. Comments are only there to help you and others understand the actual lines of code, which do not start with a #. Comments in code are good, please use them!

IMPORTANT

In order for the script to work properly YOU MUST RUN THE LINE THAT SAYS:

setwd(dirname(rstudioapi::getActiveDocumentContext()$path))

This tells R that you want to work in the folder called ERCEL_R_tutorial. Without getting too complicated, it just makes things easier for the code to work!

Learning about directory structures is important, but we probably don't have time today. See this post by Michael Frank.

Tip

You can run a line of code by clicking anywhere on the line and pressing:

cmd + ENTER (Mac)

ctrl + ENTER(Windows)

1. Load in the tidyverse

The tidyverse is a collection of packages that makes using R much easier (in my opinion). The tidy philosophy incorporates design features that should hopefully make learning, using and understanding your and others code more achievable. See the website for more details https://www.tidyverse.org.

The first thing we need to do is load in the 'tidyverse` package to our R session library, which contain specific functions that will make completing the learning outcomes easier. I assume you have installed the package, if not see the next part.

If you have already installed the tidyverse

To load the package, copy and paste, and then run the code below in your ercel_analysis.R file.

#loads the package
library(tidyverse)

If it worked you should see (in the output section/bottom):

If you have not installed the tidyverse

If you got an error saying:

Error in library(tidyverse) : there is no package called ‘tidyverse’

You will need to install the tidyverse package and then add it to the library. This can be done by running the following code:

#installs the package
install.packages("tidyverse")

#loads the package
library(tidyverse)

Bonus

If you close RStudio and open it again, do you need to run the library(tidyverse) code again for the script to work?

Once you have installed a package, do you need to reinstall it every time your restart RStudio?

2. Install other packages

In order to do some other cool things in this worksheet, we will need to install and load some other packages, skimr, ggtext and also readxl (which does not need to be installed, just loaded).

Make sure the below code is run before continuing.

#install the packages
install.packages("skimr")
install.packages("ggtext")

#load the package
library(readxl)
library(skimr)
library(ggtext)

3. Introduction to functions (for beginners)

This section is intended to give those not too familiar with coding a bit of extra information on functions. You can skip to the next section if you have coded before.

In order for R to understand what we want to do, we can make use of the many functions that it understands. Functions are similar to verbs, in that they can do lots of different things, but in order to use them you need to learn how they work. For example, the English verb to read works well in a sentence like 'Jan read the book', but maybe does not work as well in the sentence 'read speak'. This is because the first sentence has a subject ('Jan') and an object ('the book'), whereas the second sentence only has another verb ('speak'), which does not make for a good sentence.

In R there are also syntactic-style rules for how functions work (please note that natural languages and programming languages are very different, I am just using the terms verbs, nouns, syntax etc for an analogy).

The function print() is used to print something. If you try to run the code as it is (i.e. just print()) you will get an error saying:

print()

Error in print.default(): argument "x" is missing, with no default

This is because it needs something to print, just as the verb read needs the noun book, this is indicated by the error message saying argument "x" is missing. So we need something inside the brackets of the function for the function to work (this is called the argument).

print("Ahoj, svět")

[1] "Ahoj, svět"

In the above example, our function is print and our argument is "Ahoj, svět", the output to this code is therefore going to print "Ahoj, svět".

There are lots of different functions that R uses to make your code work. You can use ones from packages (such as the Tidyverse) or write your own. We already used the library function at the beginning, which loaded in the Tidyverse.

Some functions have, or even need, multiple arguments. RStudio makes understanding what the arguments to a function are, as well as what functions are available.

Step 1.

In your R script, if you start to type the word plo you will see a dropdown menu appear (you can also press tab/⇥ key on your keyboard if you don't see it automatically).

This menu lists all the functions that begin with "plo".

Choose the function plot (either by clicking on it or by pressing the down key/↓ and press tab).

This will automatically generate the code plot().

The yellow information box gives you details about what the function does.

Step 2.

Make sure your cursor is inside the brackets of the plot() function. Press tab again.

A new dropdown list appears, giving you the main arguments for the plot function.

You will see they are x = and y =. This is because to plot something in this function you normally need data for the x and y axes.

Fill in the function with the following arguments:

x = 1, y = 2

Your final code should look like this:

plot(x = 1, y = 2)

Bonus

In the line of code below:

plot(x = time, y = fun)

What is the function?

What is the argument?

Do you think the line of code sessionInfo() will work?

4. The data

The data we will be using is from the questionnaire asking for your availability for future R workshop sessions (the link is here if you want to see the design of the questionnaire).

The next thing we need to do is load in the data, so we can practice some data processing procedures.

`read_csv()`

We can load in the data using the read_csv() function.

Inside the brackets of the function we need to specify where the file is located and what it is called, i.e. Data > ercel_questionnaire_data.csv (think of this as doing the click version of File > Open > click > click > click... in Excel)

We also need to store the data and assign it a name, we will call it ercel_data_raw (we assign a name to an object by using this formatname <- object)

#load in the raw .csv file
ercel_data_raw <- read_csv("Data/ercel_questionnaire_data.csv")

You should now see that there is a new object in the environment (top left box of RStudio).

Tip

Note that I refer to this as a raw file. This is because it has not been modified, processed or directly edited.¹

It is the starting point of our data story. Everything we do after will be changes that we (or anybody else) can reproduce.

Please, please, please never change the raw file! Going in to Excel/spreadsheet software and changing it in any way will compromise the integrity of your data.

Any changes you do make should have a trail (a record of what you have done), which is your R code.

¹Disclaimer: the data has actually been modified to anonymise names and remove the feedback comments

`read_excel()`

As researchers, we may not always acquire our data in a .csv format. Naturally, there may be files that are in the .xlsx format. Using the read_csv function will not work for these files. We can use the read_excel for this, which is in the readxl package.

We will not be using the excel version of the data for this worksheet, but will focus on the .csv version. This section is to show you how to read in excel files.

#load in the excel version of the data
ercel_data_raw_xlsx <- read_excel("Data/ercel_questionnaire_data.xlsx")

Tip

I prefer to use .csv files when working in R because:

They are normally smaller in size (for large datasets)
You do not need Excel to open them (when sharing the data with others who do not have Excel)
The file does not contain any text formatting (Excel files may change numbers/text to dates, e.g. "10-12" becomes "10-Dec")
They do not use multiple sheets

See this post by Luis Darcy Verde Arregoitia about why excel can be problematic.

Understanding the data

`View()`

We can get an idea of what the data looks like in a number of ways.

We can use the View() function to open a new tab which allows for a spreadsheet style representation (note, this is not like excel where you can directly edit the data).

You can scroll up/down to see the other rows or left/right to see the other columns.

#open new tab showing the raw data
View(ercel_data_raw)

You should see this (make sure that the View has a a capital V):

Missing values

You might have noticed that there are some cells in our data that say NA (and look slightly grey when we use View).

These are called missing values and occur when the cell is empty or has NA as the value

You can specify in the read_csv() or read_excel() functions if you want to make any particular value in your data a NA value.

E.g. If you have the value NULL in your data, you can automatically code NULL as an NA using this code:

#only code values that are NULL as NAs
#csv file
missing_data_example <- read_csv("Data/missing_data_example.csv", na = "NULL")

#xlsx file
missing_data_example <- read_excel("Data/missing_data_example.xlsx", na = "NULL")

Note that this will now only code "NULL" values as missing (nothing else), but we can make a vector of values if we want to specify multiple values as NAs.

The code below will read in NULL, NA, empty (""), 10 and ahoj values as missing. They will all appear as NA in our missing_data_example2 object:

#code values that are NULL, NA, 10 or ahoj empty as NAs
#csv file
missing_data_example2 <- read_csv("Data/missing_data_example.csv", na = c("NULL", "NA", "", 10, "ahoj"))

#xlsx file
missing_data_example2 <- read_excel("Data/missing_data_example.xlsx", na = c("NULL", "NA", "", 10, "ahoj"))

Column types

Viewing the data visually can be useful for quickly inspecting if the data looks how you expected (i.e. has the file been read properly). However, it does not tell you much about how the data is structured.

Data is normally a bit more complicated when you load it into R. There are (for most data sets) normally multiple columns and multiple rows.

Each of your variables/columns will be assigned a column specification/type:

Specification	Example	Notes	More information
`numeric`/`integer`/`double`	3.14, 10, 10000	as long as there are only numbers in the column (you don't need to worry which name it is, they are all numbers)	link
`character`	"one", "1", "hello"	contains some form of text (which is anything inside speech marks/" ")	link
`factor`	"one", "1", "hello"	different from character/string as there are levels, which makes organising easier	link
`date`/`time`	"05-03-2021", "05-Mar-2021 12:34:56", "2021-03-05 16:00:00 UTC"	special type that evaluates dates/times, see the link, they can be complicated	link
`logical`	TRUE, FALSE	special type that states if the value if either TRUE or FALSE, note not a character/string/factor	link

Data summary with `skim()`

We can see a really nice summary of the data using theskim()function from the skimr package.

This gives you and others a quick and easy way to understand the structure of your data.

#create a summary object
ercel_data_raw_summary <- skim(ercel_data_raw)

#print the summary
ercel_data_raw_summary

Data summary
Name	ercel_data_raw
Number of rows	18
Number of columns	20
_______________________
Column type frequency:
character	13
logical	5
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
Start_Date	0	1.00	12	14	18
End_Date	0	1.00	12	14	18
Response_Type	0	1.00	10	14	2
Recorded_Date	0	1.00	12	14	18
Response_ID	0	1.00	17	17	18
Distribution_Channel	0	1.00	7	9	2
User_Language	0	1.00	2	2	1
vaše_jméno	0	1.00	12	12	18
day_pondělí	3	0.83	24	65	9
day_úterý	4	0.78	12	53	8
day_středa	4	0.78	15	65	9
day_čtvrtek	3	0.83	12	65	9
day_pátek	1	0.94	15	65	10

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
Finished	0	1	0.94	TRU: 17, FAL: 1
Recipient_Last_Name	18	0	NaN	:
Recipient_First_Name	18	0	NaN	:
Recipient_Email	18	0	NaN	:
External_Data_Reference	18	0	NaN	:

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
Progress	0	1	97.22	11.79	50	100.0	100	100	100	▁▁▁▁▇
Duration_in_seconds	0	1	997.83	1679.32	3	73.5	132	762	5695	▇▁▁▁▁

Bonus

Use the ercel_data_raw_summary output to find out the following:

How many rows of data are there?

How many columns of data are there?

How many columns are character variables?

How many missing values are there in the character variable day_pátek?

What is the mean for the numeric variable Progress?

5. Data processing functions

Now we have looked at our data, we can see that it looks really messy. So we might want to process it to make it easier to use.

Some main issues, solutions and the functions we will use:

Issue	Solution	Function
There are columns we do not need	remove them	`select()`
The column names are a bit messy	rename them	`rename()`
There are rows we do not need	remove them	`filter()`
The data needs to be in long/wide format	reshape it	`pivot_longer()`
Some columns contain multiple values in a single cell	separate them	`separate_rows()`
Some columns need to be created/modified	create/modify them	`mutate()`
Need to know summary statistics from the data	summarise it	`summarise()`

Cheatsheet

The pipe
`%>%`

Before we start using any of the above functions, we will first learn how to use the pipe (%>%).

What is it?

The pipe is a special shorthand/operator that allows you to combine multiple lines of related code in a neat and tidy sequence. Think of it as a way to tell R **AND THEN DO THIS**.

Pipes are really helpful and we will be using them throughout this worksheet.

keyboard shortcut

Instead of having to type the pipe each time you use it, you can use the R shortcut:

Ctrl + Shift + M (Windows)

Cmd + Shift + M (Mac)

Other shortcuts can be found in the picture below:

A non-code analogy

Imagine a situation where the following sequence occurs...

You wake up
You get out of bed
You walk to the kitchen
You make breakfast

If this was R code it might look like this:

you1 <- wake_up(your_name)
you2 <- get_out(you1, "bed")
you3 <- walk(you2, "kitchen")
you4 <- make(you3, "breakfast")

This is not the most efficient code as there is a lot of duplication of the word you and <-. You also make 4 new objects (you1, you2, you3, you4), which might get a bit confusing if you have a lot of steps.

This is where the pipe can be really useful.

If you are combining multiple steps to the same initial data set, the pipe can link these steps together, removing the need for redundant features (such as lots of you <-)

This is how we would write the sequence using the pipe.

you <- wake_up(your_name) %>%
  get_out("bed") %>%
  walk("kitchen") %>%
  make("breakfast")

This approach knows the data set you are working with (your_name) and updates that data each time the pipe is used.

an actual example

Let's try to use the pipe with our ercel_data_raw data. We will apply the names() function to the data so we can see the column names. Then the str_to_upper function to make the names uppercase.

ercel_data_raw %>% #use this data
  names() %>% #get the column names
  str_to_upper() #make them all upper case

 [1] "START_DATE"              "END_DATE"               
 [3] "RESPONSE_TYPE"           "PROGRESS"               
 [5] "DURATION_IN_SECONDS"     "FINISHED"               
 [7] "RECORDED_DATE"           "RESPONSE_ID"            
 [9] "RECIPIENT_LAST_NAME"     "RECIPIENT_FIRST_NAME"   
[11] "RECIPIENT_EMAIL"         "EXTERNAL_DATA_REFERENCE"
[13] "DISTRIBUTION_CHANNEL"    "USER_LANGUAGE"          
[15] "VAŠE_JMÉNO"              "DAY_PONDĚLÍ"            
[17] "DAY_ÚTERÝ"               "DAY_STŘEDA"             
[19] "DAY_ČTVRTEK"             "DAY_PÁTEK"

removing/keeping variables
`select()`

If we have another look at our data set (ercel_data_raw) we can see there are a few columns that are empty - they only contain NA values.

Therefore, to make the data a bit more clean, we may want to remove these variables.

We will remove the variables:

Recipient_Last_Name
Recipient_First_Name
Recipient_Email

We can do this with the select() function.

Inside the function, we need to specify which variables we want to remove.

Below are some examples of the ways you can do this...

specify using the variable name

If you know the variable name(s), you can just write them inside the brackets of select (separated by a comma).

i.e. the function should look like this

#KEEP VARIABLES
#keeps the two variables and removes everything else
data %>%
  select(variable_to_keep1,
   variable_to_keep2)

#REMOVE VARIABLES
#removes the two variables and keeps everything else
data %>%
  select(-variable_to_remove1,
         -variable_to_remove2)

Tip

Note that if you want to remove a variable in this way make sure you use a - before the variable.

You also do not use quotation marks when specifying the names.

We will use names() at the end to print out the variable names that are kept, this is just so we can see the names, normally we would not do this part!

ercel_data_raw %>%
  select(-Recipient_Last_Name,
         -Recipient_First_Name,
         -Recipient_Email) %>%
  names()

 [1] "Start_Date"              "End_Date"               
 [3] "Response_Type"           "Progress"               
 [5] "Duration_in_seconds"     "Finished"               
 [7] "Recorded_Date"           "Response_ID"            
 [9] "External_Data_Reference" "Distribution_Channel"   
[11] "User_Language"           "vaše_jméno"             
[13] "day_pondělí"             "day_úterý"              
[15] "day_středa"              "day_čtvrtek"            
[17] "day_pátek"

If we want, we can also just specify the variables we want to keep (this is best when you want to keep just a few variables though!).

ercel_data_raw %>%
  select(Start_Date,
         End_Date,
         Response_Type,
         Progress,
         Duration_in_seconds,
         Finished,
         Recorded_Date,Response_ID,
         External_Data_Reference,
         Distribution_Channel,
         User_Language,
         vaše_jméno,
         day_pondělí,
         day_úterý,
         day_středa,
         day_čtvrtek,
         day_pátek) %>%
  names()

 [1] "Start_Date"              "End_Date"               
 [3] "Response_Type"           "Progress"               
 [5] "Duration_in_seconds"     "Finished"               
 [7] "Recorded_Date"           "Response_ID"            
 [9] "External_Data_Reference" "Distribution_Channel"   
[11] "User_Language"           "vaše_jméno"             
[13] "day_pondělí"             "day_úterý"              
[15] "day_středa"              "day_čtvrtek"            
[17] "day_pátek"

specify using a selection function

We can also use some functions that makes choosing variables easier.

Other functions you can use within the select() function:

Function	Description	Example
`starts_with()`	Starts with a specific string	`select(starts_with("ahoj"))`
`ends_with()`	Ends with a suffix	`select(ends_with("ahoj"))`
`contains()`	Contains a string anywhere in the name	`select(contains("ahoj"))`
`matches()`	Matches a regular expression (see https://regexr.com)	`select(matches("^ahoj"))`

For example, we can remove all variables that start with "Recipient" using the code below.

Tip

Note that if you want to remove variables put a ! before the selection function e.g. !starts_with()

When using these selection functions you are normally working with a string (a character or sequence of character, such as "ahoj", "ahoj123"), it is important use quotation marks when using these functions.

They also ignore the capitalisation of letters (i.e. "Ahoj" is the same as "ahoj"), make sure you know if it your string needs capitalisation. If you do need capitalised letters you can add the argument ignore.case = FALSE, so the code for selecting only variables that start with "Ahoj" (not "ahoj") would be data %>% select(starts_with("Ahoj", ignore.case = FALSE))

ercel_data_raw %>%
  select(!starts_with("Recipient")) %>% #remove only variables that start with "Recipient"
  names()

 [1] "Start_Date"              "End_Date"               
 [3] "Response_Type"           "Progress"               
 [5] "Duration_in_seconds"     "Finished"               
 [7] "Recorded_Date"           "Response_ID"            
 [9] "External_Data_Reference" "Distribution_Channel"   
[11] "User_Language"           "vaše_jméno"             
[13] "day_pondělí"             "day_úterý"              
[15] "day_středa"              "day_čtvrtek"            
[17] "day_pátek"

You can combine multiple selects using the pipe.

ercel_data_raw %>%
  select(!ends_with("Name")) %>% #remove variables ending with "Name"
  select(!contains("Email")) %>% #AND THEN remove variables that contain "Email"
  names()

 [1] "Start_Date"              "End_Date"               
 [3] "Response_Type"           "Progress"               
 [5] "Duration_in_seconds"     "Finished"               
 [7] "Recorded_Date"           "Response_ID"            
 [9] "External_Data_Reference" "Distribution_Channel"   
[11] "User_Language"           "vaše_jméno"             
[13] "day_pondělí"             "day_úterý"              
[15] "day_středa"              "day_čtvrtek"            
[17] "day_pátek"

Bonus

Can you use one of the selection functions to keep only the variables that start with "day"?

Renaming variables
`rename()`

Now that we know how to remove and keep variables, the next step we will focus on is renaming the variables.

To do this we use the rename() function.

Inside the brackets we specify the new name for the variable = the old name.

i.e. the function should look like this rename(new_name = old_name)

We can rename multiple variables within the function (by separating the arguments with a comma).

Let us rename the variables day_pondělí, day_úterý, day_středa, day_čtvrtek, day_pátek by renaming them individually and removing the day_ part.

ercel_data_raw %>%
  rename(pondělí = day_pondělí,
        úterý = day_úterý,
        středa = day_středa,
        čtvrtek = day_čtvrtek,
        pátek = day_pátek) %>%
  names()

 [1] "Start_Date"              "End_Date"               
 [3] "Response_Type"           "Progress"               
 [5] "Duration_in_seconds"     "Finished"               
 [7] "Recorded_Date"           "Response_ID"            
 [9] "Recipient_Last_Name"     "Recipient_First_Name"   
[11] "Recipient_Email"         "External_Data_Reference"
[13] "Distribution_Channel"    "User_Language"          
[15] "vaše_jméno"              "pondělí"                
[17] "úterý"                   "středa"                 
[19] "čtvrtek"                 "pátek"

removing/keeping rows
`filter()`

Now that we know how to remove/keep and rename variables, the next step will be to filter out data we do not want.

In linguistics research it is common to work with data that needs processing in order to have a clean data set to work with.

For example, in corpus linguistics you might want to filter a corpus to remove function words, in psycholinguistics you might want to remove trials with an extremely long reaction time, in bilingualism research you might want to remove speakers with a specific native language.

It is important that any data processing steps that directly modify or change your raw data in any way are transparent, justified and replicable.

We will use the filter() function to remove certain rows of our data.

When filtering data we normally want to do so based on a criteria within a variable, e.g. if you have a variable called word_type that contains the values function and content. If we want to remove the function words from the data we can say:

In our data, if word_type is equal to "function", then keep only those rows.

In code this will look like

data %>%
  filter(word_type == "function")

When writing code in R, there are specific ways to 'say' things like is equal to, the table below outlines some of these special ways. These are called logical and relational operators.

Logical and relational operators in R

How a human might say it	How to say it in R	Example (formula)	Example (realistic)
is equal to	`==`	variable == "string"	ahoj == "hello"
is not equal to	`!=`	variable != "string"	ahoj != "dobrou chuť"
is found within a set/vector	`%in%`	variable %in% c("string1", "string2")	ahoj %in% c("hello", "guten tag")
is not found within a set/vector	`%in%`	!variable %in% c("string1", "string2")	!linguistics %in% c("dobrou chuť", "gut essen")
is smaller than	`<`	variable < number	jedna < 5
is smaller than or equal to	`<=`	variable <= number	pět < 5
is larger than	`>`	variable > number	pět > 1
is larger than or equal to	`>=`	variable >= number	jedna >= 1
is something AND is something else	`&`	variable1 == "string" & variable2 < 5	ahoj == "hello" & jedna < 5
is something OR is something else	`\|`	variable1 == "string" \| variable2 < 5	ahoj == "hello" \| jedna < 5

The criteria for our filtering is quite simple:

Remove any participants who did not complete the questionnaire.

We can look at the variable Finished to check this, where we can see that there is 1 participant with a FALSE value and 17 with a TRUE value (you can use View(ercel_data_raw) again or skim(ercel_data_raw) and look for the Finished variable).

Based on this, we can state that anybody who has a TRUE value for the Finished variable can be kept in our data.

To do this, we can use the following code:

ercel_data_raw %>%
  filter(Finished == TRUE)

ercel_data_raw %>%
  filter(vaše_jméno != "anonymised18")

We can see that we have successfully removed the row that contained the data where Finished == FALSE and we are left with only the data where Finished == TRUE.

Bonus

If you look again at the ercel_data_raw data, you can see that there are a few other variables that we can use to filter out this row.

Can you write the code that filters out the row based on:

The variable Progress

The variable Duration_in_seconds

The variable Response_Type

The variable vaše_jméno

Reshaping data
`pivot_longer()`

https://thumbs.gfycat.com/GlassHastyBighornedsheep-small.gif

One key difference between R and spreadsheet style software is that it is optimised to work with long data, instead of wide data. See Wickham, 2014 for reasons why long is preferred. Put simply, thinking about your data when it is in a long format is much easier, efficient and enjoyable (imagine getting a sofa up several floors using an elevator). Thinking about your data in a wide format is often confusing, complicated and counter-intuitive (think about getting a sofa up several floors, using stairs).

Differences between wide and long data

Shape	Description
wide	1. variables are spread across columns 2. each participant has a row 3. multiple values in a cell
long	1. each variable has a column 2. each observation has a row 3. each value has a cell

In our ercel_data_raw the shape is wide, there is one row for each participant, with different observations spread across different columns, even having multiple values in individual cell.

Let's focus on the columns day_pondělí, day_úterý, day_středa, day_čtvrtek, day_pátek, which have data that contains the participants time availability in the cells.

These 5 columns and the data they contain, can be converted into just 2 columns, when they are shaped to be long:

day (day_pondělí, day_úterý, day_středa, day_čtvrtek, day_pátek) - this is called the variable names as it contains the column names
time (9.00-10.30am, 10.30am-12.00pm, etc.) - this is called the variable values as it contains the column values

We can change from wide to long using the pivot_longer function.

Inside the brackets, we need to specify:

which columns we want to change (cols)
the new name for our names column (names_to)
the new name for our values column (values_to)

This will give us a basic code format of:

data %>%
  pivot_longer(cols = column_names_you_want_to_change,
               names_to = "new_names_column_name",
               values_to = "new_values_column_name")

Let's first use some of the data processing steps to clean up ercel_data_raw

We will create a new data object called ercel_data_clean
This will incorporate some of the data processing steps we have covered so far (filter and select)
The filter will remove the participant who did not finish
The select will keep only the columns vaše_jméno and the ones that start with "day"
This data is still in wide format

#data cleaning steps
ercel_data_clean <- ercel_data_raw %>%
  filter(Finished == TRUE) %>% #remove the participant who did not finish
  select(vaše_jméno, starts_with("day")) #only keep these variables

Now we will change the ercel_data_clean using pivot_longer() to go from wide to long.

We will call this new long data ercel_data_long

#make the ercel_data_clean data long
ercel_data_long <- ercel_data_clean %>%
  pivot_longer(cols = starts_with("day"),
               names_to = "day",
               values_to = "time")

Let's look at our data...

#view the long data
View(ercel_data_long)

Bonus

How many participants did we have in the ercel_data_clean data?

How many columns were there that started with "day"?

What is 5 multipled by 17?

How many rows of data are there in ercel_data_long?

Tip

pivot_wider()

There may also be situations where your data is in a long format and you want to make it wide.

If you want to pivot the data in this way we can use the pivot_wider() function.

This works very similarly to pivot_longer(), but this time our arguments are:

names_from - the column that contains the names of the variables (these will be the column names)

values_from - the column that contains the values of the variables (these will be the cell values)

Here is an example of how to convert our ercel_data_long back to a wide format

#pivot from long to wide
ercel_data_long %>%
  pivot_wider(names_from = day,
              values_from = time)

Separate values into rows
`separate_rows()`

You might have noticed that our ercel_data_long is not technically fully long yet. The column time contains data with multiple values in each cell.

e.g. let's look at row 1 (participant anonymised04, day day_pondělí)

vaše_jméno	day	time
anonymised04	day_pondělí	9.00-10.30am,10.30am-12.00pm,12.00-1.30pm,1.30-3.00pm,3.00-4.30pm

This is problematic because we can't really do anything with the time data in this format.

We want a separate cell that contains each of the individual times.

Giving us something that looks like...

vaše_jméno	day	time
anonymised04	day_pondělí	9.00-10.30am
anonymised04	day_pondělí	10.30am-12.00pm
anonymised04	day_pondělí	12.00-1.30pm
anonymised04	day_pondělí	1.30-3.00pm
anonymised04	day_pondělí	3.00-4.30pm

We can use the separate_rows function to get our data like this.

It will...

take a column with multiple values in each cell
look for a speficic character that separates the values (e.g. a comma)
create a new row for each value (copying the values from the other columns in the data)

The code will work like this...

data %>%
  separate_rows(column_name_you_want_to_separate,
                sep = "specific_character")

This is how we will separate the values from our ercel_data_long, taking our time column, and using a "," separator (we will save the resulting data as an object called ercel_data_long_rows:

#make the time variable have separate rows for each values
ercel_data_long_rows <- ercel_data_long %>%
  separate_rows(time,
                sep = ",")

Creating and modifying variables
`mutate()`

Next, we will focus on making new variables and modifying existing ones. This is done with the mutate() function.

It is common that the data set we are working with may be used to create new variables (for example, log transformed frequencies from raw count based frequencies, taking age values and grouping them into categories, recoding words from a rating scale to numeric values).

It is also a common situation that we may want to keep the data the same, but modify it's structure (for example, changing a character variable to a factor, specifying the order of the levels of a factor).

In both of these situations, we will likely want to use some sort of function to create or modify the data.

We can create a new variable based on a variable already in the data (this is normally the best option):

#CREATE a new variable
data %>%
  mutate(new_variable = some_function(variable_you_want_to_change))

Or we can modify and overwrite the variable completely (this is normally risky as you are modifying the original):

#OVERWRITE an existing variable
data %>%
  mutate(variable_you_want_to_change = some_function(variable_you_want_to_change))

We will use mutate on our ercel_data_long_rows to do the following:

1. EDIT - remove the "day_" prefix from values in our day column

When we used pivot_longer our variables had the names day_pondělí, day_úterý, day_středa, day_čtvrtek, day_pátek, there is a prefix "day_" that is in each of these.

If we want to remove this from the values in our day column we can use mutate() and a function called str_remove().

To use str_remove() we specify a column that contains strings (called the string) and a specific string to remove (called a pattern). Therefore we can specify 2 basic arguments:

The string = argument is the column name that contains the list of strings, i.e. day.

The pattern = argument is the string we want to remove from the strings in the column (this should be in quotation marks), i.e. "day_".

Our str_remove function will look like this:

str_remove(string = day,
           pattern = "day_")

Using str_remove() inside the brackets of mutate() allows us to edit the variable day in a single line of code.

#remove the "day_" prefix from the the day column"
ercel_data_long_rows <- ercel_data_long_rows %>%
  mutate(day = str_remove(string = day,
                          pattern = "day_"))

Tip

There are other functions that can manipulate strings and can be used within a mutate

Have a look at the functions listed on the stringr package page (which is part of the tidyverse).

https://stringr.tidyverse.org/reference/index.html

As this is only intended as an introduction worksheet, we do not have time to describe them all.

2. RECODE - change any missing values from NA to none

Another issue in our data is that there are NA values in the time variable. These occur when a participant did not specify any times that they are available for that specific day.

We may want to recode these values so they are not NA (as that is a special value that means the data is missing).

Similar to our previous mutate example, we can use a function within the brackets of mutate().

This function will be fct_explicit_na(). It takes any NA values and replaces them with a specific string.

To use fct_explicit_na() we specify a column that contains strings (called f) and a specific string we want to replace the NA values with (called na_level).

Therefore we can specify 2 basic arguments:

The f = argument is the column name that contains the NA values, i.e. time.

The na_level = argument is the string we want to replace the NA values with (this should be in quotation marks), i.e. "none".

Our fct_explicit_na function will look like this:

fct_explicit_na(f = time,
                na_level = "none")

Using fct_explicit_na() inside the brackets of mutate() allows us to recode the variable time in a single line of code.

#remove the "day_" prefix from the the day column"
ercel_data_long_rows <- ercel_data_long_rows %>%
  mutate(time = fct_explicit_na(f = time,
                                na_level = "none"))

3. MODIFY - change the internal order of the values in our day and time columns

Another issue in our data is that the variables day and time should have a temporal order to their values.

This is an issue as when we come to plot or analyse such variables, R automatically orders them as alphabetical.

For our day variable this means čtvrtek, pátek, pondělí, středa, úterý

For our time variable this means 1.30-3.00pm, 10.30am-12.00pm, 12.00-1.30pm, 3.00-4.30pm, 9.00-10.30am, none (this one might be a bit confusing, but R reads the first number as the lowest so 1.30 is smaller than 10.30, which is smaller than 12.00 etc.)

We need to tell R if there is a special order to our variables.

We can do this using fct_relevel()

To use fct_relevel() we specify a column that contains strings (called f) and then list the order we want the strings to have in the variable

Our fct_relevel function will look like this for our day variable:

fct_relevel(f = day,
            "pondělí",
            "úterý",
            "středa",
            "čtvrtek",
            "pátek")

Using fct_relevel() inside the brackets of mutate() allows us to modify the variable day and time in a single line of code.

#relevel the day and time variables so they are in a specificed order (not alphabetical)
ercel_data_long_rows <- ercel_data_long_rows %>%
  mutate(day = fct_relevel(day,
                           "pondělí",
                            "úterý",
                            "středa",
                            "čtvrtek",
                            "pátek"),
         time = fct_relevel(time,
                            "9.00-10.30am",
                            "10.30am-12.00pm",
                            "12.00-1.30pm",
                            "1.30-3.00pm",
                            "3.00-4.30pm",
                            "none"))

Tip

Both fct_explicit_na and fct_relevel are from the forcats package, have a look at the other functions listed on the package page (which is part of the tidyverse).

https://forcats.tidyverse.org/reference/index.html

As this is only intended as an introduction worksheet, we do not have time to describe them all.

4. CREATE - create a new variable called time_category that codes times as either "morning" or "afternoon"

We may want to create a new variable from our existing data. There are lots of different situations where we may want to do this.

This can be achieved by using the ifelse() statement, which works within a mutate() by:

evaluating a logical/relational statement (such as x < 5, ahoj == "hello").
if that statement is TRUE, then say what the new value will be (such as "x_is_small", "greeting")
else if that statement is FALSE, then say what the other values will be (such as "x_is_large", "not_greeting")

The structure of an ifelse() in R is:

ifelse(logical_statement,
       value_if_TRUE,
       value_if_FALSE)

Returning to our data, we can use the values in our time variable to create a new variable that specifies if a time is in the morning or in the afternoon.

We can use the logical statement: `time %in% c("9.00-10.30am", "10.30am-12.00pm")
if TRUE: "morning"
else if FALSE: "afternoon"

The code for this would be:

ifelse(time %in% c("9.00-10.30am", "10.30am-12.00pm"),
       "morning",
       "afternoon")

When we combine the ifelse() within a mutate(), we can call our new varaible time_category.

#create a new variable called time_category where any values that are "9.00-10.30am", "10.30am-12.00pm" are "morning" and every other value is "afternoon"
ercel_data_long_rows <- ercel_data_long_rows %>%
  mutate(time_category = ifelse(time %in% c("9.00-10.30am", "10.30am-12.00pm"),
                                "morning",
                                "afternoon"))

6. Combining everything together

We can now combine all of our data processing steps in a complete set of code, using the pipe.

This will:

read in the raw data
filter the data
select which columns to keep
pivot the data to a long format
separate values to individual rows
mutate certain variables

#create the final data using different processing steps
ercel_data <- read_csv(file = "Data/ercel_questionnaire_data.csv") %>%
  filter(Finished == TRUE) %>%
  select(vaše_jméno, starts_with("day")) %>%
  pivot_longer(starts_with("day"), names_to = "day", values_to = "time") %>%
  separate_rows(time, sep = ",") %>%
  mutate(day = str_remove(day, "day_"),
         day = fct_relevel(day, "pondělí",
                             "úterý",
                             "středa",
                             "čtvrtek",
                             "pátek"),
         time = fct_explicit_na(time, na_level = "none"),
         time = fct_relevel(time,
                            "9.00-10.30am",
                            "10.30am-12.00pm",
                            "12.00-1.30pm",
                            "1.30-3.00pm",
                            "3.00-4.30pm",
                            "none"),
         time_category = ifelse(time %in% c("9.00-10.30am", "10.30am-12.00pm"),
                                "morning",
                                "afternoon"))

Preview of the next session (summarising and ggplot)

The below code uses some extra functions that summarise the data and then plots it.

This code is intended as a preview for the next session, where we will cover summarising and plotting.

If you find this section difficult to understand, that is quite natural!

#summarise the data to count availability based on time and day
ercel_data_count <- ercel_data %>%
  count(day, time, .drop = FALSE) %>%
  rename(count = n)

#filter the count data to find the most and least popular day and time
#add an emoji so we can see visualise these better
ercel_annotate <- ercel_data_count %>%
  filter(count %in% c(max(count), min(count))) %>%
  mutate(annotate1 = c("<img src='https://e.unicode-table.com/orig/4f/14f3dee66166bc0bf7967ba32fa111.png' width='20'/>",
                       "<img src='https://www.emoji.com/wp-content/uploads/filebase/thumbnails/icons/emoji-icon-glossy-07-15-objects-household-wooden-spoon-72dpi-forPersonalUseOnly.png' width='20'/>"))

#make the plot
ercel_data_plot <- ercel_data_count %>%
  ggplot(aes(x = time, y = count, colour = (count + 1)^2, label = count)) +
  geom_segment(aes(x = time, y = 0, xend = time, yend = count)) +
  geom_point() +
  geom_label(nudge_y = 0.1, vjust = 0) +
  geom_richtext(data = ercel_annotate, aes(x = time, y = count + 3, label = annotate1),
                fill = NA, label.color = NA, label.padding = grid::unit(rep(0, 4), "pt")) +
  xlab(NULL) +
  scale_y_continuous(breaks = c(0, 5, 10, 15), limits = c(0, 19)) +
  scale_colour_gradient(low = "#fd8d3c", high = "#31a354") +
  facet_grid(~day) +
  theme_bw() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, size = 14, hjust = 1),
        strip.text = element_text(size = 20))

#view the plot
ercel_data_plot

#save the plot
ggsave(plot = ercel_data_plot, filename = "Plots/ercel_data_plot.png", width = 9, height = 5)

ERCEL R minikurz

Data processing using the Tidyverse

ERCEL R minikurz

Worksheet translations available

Outline

Pre-requisites

Download the worksheet materials

Files

Open the ercel_analysis.R file

IMPORTANT

1. Load in the tidyverse

If you have already installed the tidyverse

If you have not installed the tidyverse

2. Install other packages

3. Introduction to functions (for beginners)

4. The data

read_csv()

read_excel()

Understanding the data

View()

Missing values

Column types

Data summary with skim()

5. Data processing functions

Cheatsheet

The pipe%>%

removing/keeping variablesselect()

Renaming variablesrename()

removing/keeping rowsfilter()

Reshaping datapivot_longer()

Separate values into rowsseparate_rows()

Creating and modifying variablesmutate()

6. Combining everything together

Preview of the next session (summarising and ggplot)

Open the `ercel_analysis.R` file

`read_csv()`

`read_excel()`

`View()`

Data summary with `skim()`

The pipe
`%>%`

removing/keeping variables
`select()`

Renaming variables
`rename()`

removing/keeping rows
`filter()`

Reshaping data
`pivot_longer()`

Separate values into rows
`separate_rows()`

Creating and modifying variables
`mutate()`