Data Wrangling in R: Get all your data and get it tidy
Dan Isbell
Data Wrangling
Broad term, essentially referring to the process of getting your “raw” data into a format you can work with.Chances are, it might take longer to wrangle your data than it does to actually run your analyses. This motivates us to find ways to wrangle efficiently. We also want to make sure we wrangle our data without errors, as we know for analysis:
GarbageIn -> GarbageOut
We’ll cover a bit of getting all your data together and getting it tidy (Wickham, 2014).
Part I: Get all your data
For Numerous Files, Avoid Copy and Paste
Often, we run into one type of data spread over multiple files. This is common with…- Eye-tracking [1 file per participant]
- SuperLab-based experiments (e.g., self-paced reading, GJTs) [1 file per participant]
- Longitudinal test score data [1 file per test administration]
- Longitudinal survey data [1 file per year/semester/etc.]
If all of your columns in these files are the same, we have a relatively easy, automated solution in R.
Solution, in 3 Steps
From Wickham (2014):- Read in a list of filenames
- Add a column to each file with its name (optional; we’ll skip this today)
- Combine all files in a single dataset
Prep
First, make sure you have the files I sent in a folder on your desktop. In RStudio, click the “Files” tab in the bottom-right window. Click the “…” button and navigate to your desktop folder. Then, click the blue gear icon and select “Set as Working Directory.”Now we’re ready to start. Click the icon of white square with a plus sign to and choose “R Script.”
Step 1
In the script editor (top-left window), enter the following code.#first, we load some libraries that we need to use
library(dplyr)
library(tidyr)
#now let's get to the data
#first, we'll make a list of files in our directory
temp <- list.files(pattern = "*.csv")
print(temp)
[1] "wave_1.csv" "wave_2.csv" "wave_3.csv" "wave_4.csv"
Step 3 (remember, we’re skipping step 2, which was optional)
Now that we have a list of files we want to read in, we can start bringing in the data.#we'll use the lapply (list-apply) function to read in everything on our list
myfiles <- lapply(temp, read.csv, stringsAsFactors = F)
#now we have a list that holds 4 dataframes. We'll put this all into one table.
all <- bind_rows(myfiles)
#and now we have a single table with 4364 observations from 4 different files
head(all)
There are several other ways of doing this (e.g., using a for-loop), but this one is fairly simple. The really nice part of this is that it scales beautifully- whether you have 4 or 400 files, you can get them all in the same file for your analysis without having to open up Excel and cut-and-paste.
Let’s clean things up
While we’re super happy to have successfully gotten all of this data in, we realize that we don’t actually need all of it. Let’s select only the bits we’re interested in:- Student.Identification
- Wave
- Language
- Speaking.Score
d <- select(all, 1:3, Speaking.Score)
head(d)
In the select() function, the first argument is the data you’re selecting from, and after that you can list which columns you are grabbing. You can refer to column numbers explicitly call column names (make sure you match case!).
As you can see, we now have 4 variables- just the stuff I’m interested in for my ongoing project that’s looking at oral proficiency development. But we need to do a bit more wrangling, because I’m only interested in identifiable people with at least 2 valid test scores.
Clean out invalid data
If we look at the dataset, we see that some rows are missing IDs, and some are missing scores. Some scores say things like “UR” or “NR”, which indicated that the person took the wrong level of the test and a valid score could not be assigned.We’ll clear these observations out with dplyr function filter()
#we're going to make a quick list of our valid ACTFL scores
ACTFL <- c("NL","NM","NH","IL","IM","IH","AL","AM","AH","S")
#now the filtering
d <- filter(d, !is.na(Student.Identification), !is.na(Speaking.Score),
Speaking.Score %in% ACTFL)
tail(d)
You can see here that we’re down to 3804 observations. We did this by telling filter() to only keep rows with Student.Identification and Speaking.Score that were not (!) NA (missing). We also told it to only keep rows with valid ACTFL scores.
What if you have some duplicate rows?
This can happen with database downloads. In my case, students may have been registered for multiple language classes at the same time, so they had multiple entries in our database. R can easily find and delete duplicate rows with the distinct() functiond <- distinct(d)
tail(d)
…and now we’re down to 3519 observations.
Get only individual with repeated measures
First, let’s arrange the data in a sensible way with the arrange() function:d <- arrange(d, language, Student.Identification, Wave)
head(d)
What we just did was basically a sorting action like in Excel. In the arrange() function, first we say which data we’re arranging, then we tell the sorting criteria. By default, it uses ascending order, but you can do something like desc(Wave) if you want.
We see that we do indeed have people with 2 or more scores. Hurray! We’re kicking everyone else out by using the group_by() and filter() functions in dplyr. We’ll also use a special operator (%>%) that makes our code a little more concise:
d.rep <- d %>% group_by(Student.Identification) %>% filter(n() > 1)
tail(d.rep)
And here we’re down to 1982 observations. We can see that there are at least 2 observations for each student. This is the data I wanted for my analysis, and now I have a record of how I pulled it. If I got more data later, I could run the same code again.
Part II: Make it tidy
In this part of the workshop, we’ll focus on having a logical structure for datasets, and look at ways to make that happen in R, primarily using the tidyr package.Datasets are composed of values (or data, sing. datum), and each value is associated with a variable and an observation.
A variable is all the data that measure the same attribute. This could be person IDs (i.e., nominal variables that distinguish one person from another) or things like test scores (e.g., ACTFL OPIc scores that measure speaking ability).
An observation is a collection of all variables belonging to the same object (e.g., person) at the same time/in the same snapshot.
Wickham (2014) argues that data are tidy if:
- Each variable has its own column.
- Each observation has its own row.
Warm-Up: A Simple Dataset Design
Scenario (based on Loewen et al., in prep): We are interested in the statistical literacy of SLA researchers. We had 89 researchers take a test on statistics, and we have a total score for each person. We also collected some background data, including research orientation (7 point scale, 1 = qual, 7 = quant), number of stats courses taken, and years in the field. We are interested in the relationships between statistical literacy and statistical background.How would you design this dataset? In groups of 3-4, sketch out a table with values for 1 subject.
Warm-Up: Solution
ID | Literacy Score | Orientation | StatsCourses | YearsinField |
---|---|---|---|---|
1 | 21 | 6 | 2 | 7 |
Now for a challenge…
Things get a bit more complicated when you are measuring people at different times and with multiple measures of the same attribute- but this is very common in our field. Let’s think about the scenario below, and how we could design a tidy dataset.Scenario (based on Isbell, Park, & Lee, under review): We were interested in the impact of pronunciation instruction on the comprehensibility of L2 Korean learners. We had 36 students divided into 2 groups: treatment and control. Students differed in their L1 (English, Chinese) and Proficiency (beginner, intermediate). We had them complete 2 tasks (picture description, read-aloud) at 2 times (pretest, posttest). Each task was scored for comprehensibility by several raters, we’ll look at the average score.
How would you design this dataset? In groups of 3-4, sketch out a table with values for 1 subject.
Solution
Student | L1 | Prof | Group | Time | Task | Score |
---|---|---|---|---|---|---|
01 | E | Int | T | 1 | Pic | 4.2 |
01 | E | Int | T | 1 | Read | 5.1 |
01 | E | Int | T | 2 | Pic | 4.7 |
01 | E | Int | T | 2 | Read | 6.3 |
.. | . | … | . | . | …. | . |
Going from wide to long
In R, we can use the gather() function to go from conventional wide datasets to long, tidy datasets.First, let’s read in our data. You may want to start a new script, but you don’t have to.
library(tidyr)
library(dplyr)
#read in data
df <- read.delim("comprehensibility.txt")
head(df)
So we see here that the comprehensibility scores, which all measure the same attribute, are in 4 different columns. At the same time, we have 4 different measurement instances on the same row; each comprehensibility score is associated with a unique speech sample (task) at a unique time (pre, post).
gather() our comprehensibilty in a single variable column
We need to get all the comprehensibility scores in the same column:df.long <- gather(df, "Task_Time", "Score", 5:8)
#just to see things easier...
df.long <- arrange(df.long, Subj_ID)
head(df.long)
I decided to make a new dataframe (df.long) so that I could go back to my original wide dataset if needed. In the gather() function, I told it to take the data from “df” (our original dataframe), what the “key” column should be called, what the “value” column should be called, and which columns to pull together (in this case columns 5 to 8).
The pre-gather column names are what goes in the “key” column. Since my column names had information on Task and Time, I called the new key “Task_Time.” The values these are connected to are comprehensibilty “Scores.”
As you can see, we’re not quite done yet- I need to split up that Task_Time column so I can run my analyses.
There’s a function for that, too- separate()
Luckily, tidyr has a function for splitting up confounded columns- separate(). Let’s try it out.df.long <- separate(df.long, Task_Time, c("Task", "Time"), sep = "_")
#and a sensible ordering...
df.long <- arrange(df.long, Subj_ID, Time, Task)
head(df.long)
And there we have it! From wide to long, ready for all kinds of snazzy graphics and sophisticated repeated-measures analyses.
There’s more that could/should be done with the dataset, like converting variables to factors, creating some recoded variables (from string to number codes), but at the very least our data is in the right shape.
Q&A
Any questions about the functions and examples introduced today?Any current data problems you’d like to share/brainstorm with the group?
References and Resources
- Cheatsheets: In RStudio, click the Help drop-down menu at the top of the window. Go to Cheatsheets > Data manipulation with dplyr, tidyr
- Help Files: If you need some additional reference material for a particular function, type ?
in the Console (bottom-left panel). For example, typing ?separate will bring up a help file in the bottom-right panel with some example code. - Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59,(10), 1-23. https://www.jstatsoft.org/article/view/v059i10
- Google. You can use normal question syntax (e.g. “How do you…”) and then add a phrase like “in R” or “with dplyr r”
---
title: 'Data Wrangling in R: Get all your data and get it tidy'
author: "Dan Isbell"
output:
  pdf_document: default
  html_notebook: default
  html_document: default
---

#Data Wrangling
Broad term, essentially referring to the process of getting your "raw" data into a format you can work with.

Chances are, it might take longer to wrangle your data than it does to actually run your analyses. This motivates
us to find ways to wrangle efficiently. We also want to make sure we wrangle our data without errors, as we know
for analysis:

*GarbageIn* -> *GarbageOut*

We'll cover a bit of getting all your data together and getting it *tidy* (Wickham, 2014).

#Part I: Get all your data

#For Numerous Files, Avoid Copy and Paste

Often, we run into one type of data spread over multiple files. This is common with...

- Eye-tracking [1 file per participant]

- SuperLab-based experiments (e.g., self-paced reading, GJTs) [1 file per participant]

- Longitudinal test score data [1 file per test administration]

- Longitudinal survey data [1 file per year/semester/etc.]

Handling this manually is not difficult, but it is tedious- imagine copying and pasting 60 spreadsheets into one spreadsheet! The real danger, however, is selectively copying bits from 60 spreadsheets. For example, you might only need a few columns from each SuperLab output file. But this increases the number of manual actions greatly, and thus increases chance of error (such as hitting ctrl-v when you meant to ctrl-c...).

If all of your columns in these files are the same, we have a relatively easy, automated solution in R.

#Solution, in 3 Steps
From Wickham (2014):

  1. Read in a list of filenames
  2. Add a column to each file with its name (optional; we'll skip this today)
  3. Combine all files in a single dataset

  Let's try! We'll walk through some wrangling from my current project looking at longitudinal oral proficiency development measured by the ACTFL OPIc.

#Prep
  First, make sure you have the files I sent in a folder on your desktop.
  In RStudio, click the "Files" tab in the bottom-right window. Click the "..." button and navigate to your
  desktop folder. Then, click the blue gear icon and select "Set as Working Directory."
  
  Now we're ready to start. Click the icon of white square with a plus sign to and choose "R Script."
  
#Step 1
  
  In the script editor (top-left window), enter the following code.

```{r, eval=FALSE}

#first, we load some libraries that we need to use
library(dplyr)
library(tidyr)
```
```{r}
#now let's get to the data
#first, we'll make a list of files in our directory
temp <- list.files(pattern = "*.csv")

```
```{r}
print(temp)
```


#Step 3 (remember, we're skipping step 2, which was optional)
Now that we have a list of files we want to read in, we can start bringing in the data.

```{r}

#we'll use the lapply (list-apply) function to read in everything on our list
myfiles <- lapply(temp, read.csv, stringsAsFactors = F)

#now we have a list that holds 4 dataframes. We'll put this all into one table.
all <- bind_rows(myfiles)

```

```{r}
#and now we have a single table with 4364 observations from 4 different files
head(all)
```


There are several other ways of doing this (e.g., using a for-loop), but this one is fairly simple. The really nice
part of this is that it scales beautifully- whether you have 4 or 400 files, you can get them all in the same file
for your analysis without having to open up Excel and cut-and-paste.

#Let's clean things up
While we're super happy to have successfully gotten all of this data in, we realize that we don't actually need 
all of it. Let's select only the bits we're interested in:

- Student.Identification
- Wave
- Language
- Speaking.Score

We'll use the select() function from the dplyr package to grab these columns.

```{r}

d <- select(all, 1:3, Speaking.Score)
```


```{r}

head(d)

```
In the select() function, the first argument is the data you're selecting from, and after that you can list which columns
you are grabbing. You can refer to column numbers explicitly call column names (make sure you match case!).

As you can see, we now have 4 variables- just the stuff I'm interested in for my ongoing project that's looking at oral 
proficiency development. But we need to do a bit more wrangling, because I'm only interested in **identifiable** people with at least 2 **valid** test scores.

#Clean out invalid data
If we look at the dataset, we see that some rows are missing IDs, and some are missing scores. Some scores say things like "UR" or "NR", which indicated that the person took the wrong level of the test and a valid score could not be assigned.

We'll clear these observations out with dplyr function filter()

```{r}

#we're going to make a quick list of our valid ACTFL scores
ACTFL <- c("NL","NM","NH","IL","IM","IH","AL","AM","AH","S")

#now the filtering
d <- filter(d, !is.na(Student.Identification), !is.na(Speaking.Score), 
            Speaking.Score %in% ACTFL)
```


```{r}
tail(d)
```
You can see here that we're down to 3804 observations. We did this by telling filter() to only keep rows with Student.Identification and Speaking.Score that were **not** (!) NA (missing). We also told it to only keep rows with valid ACTFL scores.

#What if you have some duplicate rows?
This can happen with database downloads. In my case, students may have been registered for multiple language classes at the same time, so they had multiple entries in our database. R can easily find and delete duplicate rows with the distinct() function

```{r}
d <- distinct(d)
```


```{r}
tail(d)
```
...and now we're down to 3519 observations.

#Get only individual with repeated measures
First, let's arrange the data in a sensible way with the arrange() function:

```{r}
d <- arrange(d, language, Student.Identification, Wave)
```


```{r}
head(d)
```
What we just did was basically a sorting action like in Excel. In the arrange() function, first we say which data we're arranging, then we tell the sorting criteria. By default, it uses ascending order, but you can do something like desc(Wave) if you want.

We see that we do indeed have people with 2 or more scores. Hurray! We're kicking everyone else out by using the group_by() and filter() functions in dplyr. We'll also use a special operator (%>%) that makes our code a little more concise:

```{r}
d.rep <- d %>% group_by(Student.Identification) %>% filter(n() > 1)
```


```{r}
tail(d.rep)
```
And here we're down to 1982 observations. We can see that there are at least 2 observations for each student. This is the data I wanted for my analysis, and now I have a record of how I pulled it. If I got more data later, I could run the same code again.

#Part II: Make it *tidy*

In this part of the workshop, we'll focus on having a logical structure for datasets, and look at ways to make that happen in R, primarily using the *tidyr* package.

Datasets are composed of *values* (or *data*, sing. *datum*), and each *value* is associated with a *variable* and an *observation*.

A *variable* is all the data that measure the same attribute. This could be person IDs (i.e., nominal variables that distinguish one person from another) or things like test scores (e.g., ACTFL OPIc scores that measure speaking ability).

An *observation* is a collection of all variables belonging to the same object (e.g., person) *at the same time/in the same snapshot*.

Wickham (2014) argues that data are *tidy* if:

1. Each variable has its own column.
2. Each observation has its own row.

This seems simple, but can be surprisingly tricky in more complicated research (or measurement) designs.

#Warm-Up: A Simple Dataset Design

Scenario (based on Loewen et al., in prep): We are interested in the statistical literacy of SLA researchers. We had 89 researchers take a test on statistics, and we have a total score for each person. We also collected some background data, including research orientation (7 point scale, 1 = qual, 7 = quant), number of stats courses taken, and years in the field. We are interested in the relationships between statistical literacy and statistical background.

How would you design this dataset? In groups of 3-4, sketch out a table with values for 1 subject.

#Warm-Up: Solution

ID | Literacy Score | Orientation | StatsCourses | YearsinField
---|----------------|-------------|--------------|-------------
1  |21              |6            |2             |7

#Now for a challenge...
Things get a bit more complicated when you are measuring people at different times and with multiple measures of the same attribute- but this is *very* common in our field. Let's think about the scenario below, and how we could design a *tidy* dataset.

Scenario (based on Isbell, Park, & Lee, under review): We were interested in the impact of pronunciation instruction on the comprehensibility of L2 Korean learners. We had 36 **students** divided into 2 **group**s: treatment and control. Students differed in their **L1** (English, Chinese) and **Proficiency** (beginner, intermediate). We had them complete 2 **task**s (picture description, read-aloud) at 2 **time**s (pretest, posttest). Each task was scored for comprehensibility by several raters, we'll look at the average **score**.

How would you design this dataset? In groups of 3-4, sketch out a table with values for 1 subject.

#Solution

Student|L1|Prof|Group|Time|Task|Score
-------|--|----|-----|----|----|-----
01|E|Int|T|1|Pic|4.2
01|E|Int|T|1|Read|5.1
01|E|Int|T|2|Pic|4.7
01|E|Int|T|2|Read|6.3
..|.|...|.|.|....|.

Although we might intuitively think 1 subject = 1 row, *tidy* dataset design can turn 1 subject in to dozens of rows. Aside from providing a common reference point for dataset design, it just so happens that lots of stats programs play very nice with tidy data.

#Going from wide to long

In R, we can use the gather() function to go from conventional wide datasets to long, tidy datasets.

First, let's read in our data. You may want to start a new script, but you don't have to.

```{r}
library(tidyr)
library(dplyr)

#read in data
df <- read.delim("comprehensibility.txt")
```

```{r}
head(df)
```
So we see here that the comprehensibility scores, which all measure the same attribute, are in 4 different columns. At the same time, we have 4 different measurement instances on the same row; each comprehensibility score is associated with a unique speech sample (task) at a unique time (pre, post). 

#gather() our comprehensibilty in a single variable column
We need to get all the comprehensibility scores in the same column:

```{r}
df.long <- gather(df, "Task_Time", "Score", 5:8)

#just to see things easier...
df.long <- arrange(df.long, Subj_ID)
```

```{r}
head(df.long)
```

I decided to make a new dataframe (df.long) so that I could go back to my original wide dataset if needed. In the gather() function, I told it to take the data from "df" (our original dataframe), what the "key" column should be called, what the "value" column should be called, and which columns to pull together (in this case columns 5 to 8). 

The pre-gather column names are what goes in the "key" column. Since my column names had information on Task and Time, I called the new key "Task_Time." The values these are connected to are comprehensibilty "Scores."

As you can see, we're not quite done yet- I need to split up that Task_Time column so I can run my analyses.

#There's a function for that, too- separate()
Luckily, *tidyr* has a function for splitting up confounded columns- separate(). Let's try it out.

```{r}
df.long <- separate(df.long, Task_Time, c("Task", "Time"), sep = "_")

#and a sensible ordering...
df.long <- arrange(df.long, Subj_ID, Time, Task)
```

```{r}
head(df.long)
```

And there we have it! From wide to long, ready for all kinds of snazzy graphics and sophisticated repeated-measures analyses.

There's more that could/should be done with the dataset, like converting variables to factors, creating some recoded variables (from string to number codes), but at the very least our data is in the right shape.

#Q&A

Any questions about the functions and examples introduced today?

Any current data problems you'd like to share/brainstorm with the group?

#References and Resources

- Cheatsheets: In RStudio, click the Help drop-down menu at the top of the window. Go to Cheatsheets > Data manipulation with dplyr, tidyr

- Help Files: If you need some additional reference material for a particular function, type ?<function> in the Console (bottom-left panel). For example, typing ?separate will bring up a help file in the bottom-right panel with some example code.

- Wickham, H. (2014). Tidy data. *Journal of Statistical Software, 59,*(10), 1-23.  [https://www.jstatsoft.org/article/view/v059i10](https://www.jstatsoft.org/article/view/v059i10)

- Google. You can use normal question syntax (e.g. "How do you...") and then add a phrase like "in R" or "with dplyr r"

Nice and good article. It is very useful for me to learn and understand easily. Thanks for sharing your valuable information and time. Please keep updating MicroStrategy Online Training Hyderabad
ReplyDelete