# Walkthrough 1: The education data science pipeline with online science class data {#c07}
## Topics Emphasized
For this and the remaining walkthroughs, we refer to the topics emphasized in terms of distinct but related
steps involved in the process of data science. In this book, we use the six steps - described in detail
in [Chapter 3](#c03) - from @grolemund2018's depiction of the process.
As mentioned in [Chapter 5](#c05), then, the topics emphasized are those that are the *particular* focus of each chapter; most of the walkthroughs contain some element of all six steps, but each walkthrough has specific emphases.
For this chapter on the education data science pipeline, those emphases are:
- Tidying data
- Transforming data
## Functions Introduced
- `data.frame()`
- `dplyr::summarize()`
- `tidyr::pivot_longer()` and `tidyr::pivot_wider()`
- `dplyr::left_join()`, `dplyr::right_join()`, `dplyr::semi_join()`, and `dplyr::anti_join()`
- `lm()`
- `ggplot2::ggplot()`
- `apaTables::apa.cor.table()`
- `sjPlot::tab_model()`
## Vocabulary
In this section (here and in the other walkthrough sections), we include key terms that are introduced and used in the chapter.
- data frame
- item
- joins
- keys
- log-trace data
- passed arguments
- reverse scale
- regression
- survey
- tibble
- vectorize
## Introduction to the Walkthroughs
This chapter is the first of eight walkthroughs included in the book. In it, we
present *one approach* to analyzing a specific dataset. In this chapter, the
approach is what we refer to as the *education data science pipeline*, or some of the steps taken in many data science projects, including cleaning and tidying data as well as exploring and visualizing or modeling it.
Here, we will be using data from a number of online science classes and will show the process of
working with an education dataset from start to finish. While the walkthroughs
are very different, the structure and section headings will be consistent
throughout the walkthroughs. For example, every walkthrough will begin with a
vocabulary section, followed by an introduction to the dataset and an
introduction to the question or problem explored in the walkthrough.
We note that this chapter assumes familiarity with the four core concepts that
comprise the foundational skills framework: projects, functions, packages, and data. If you would
like a refresher about (or an introduction to) any of those, then reading the foundational
skills chapter, [Chapter 6](#c06), and writing and running some of the code there may be helpful to you.
## Chapter Overview
In this walkthrough, we explore some of the key steps that are a part of many
data science in education projects. In particular, we explore how to process and
prepare data. These steps are sometimes referred to as "data wrangling". To do
so, we rely heavily on a set of tools that we use throughout *all* of the
walkthroughs, those associated with the {tidyverse}.
The {tidyverse} is a set of packages for data manipulation, exploration, and
visualization that uses the design philosophy of 'tidy' data [@wickham2014].
Tidy data has a specific structure: each variable is a column, each observation
is a row, and each type of observational unit is a table. We'll discuss both the
{tidyverse} and tidy data much more throughout the book. For more information,
see the Foundational Skills chapters or <https://www.tidyverse.org/>.
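To make the idea of tidy data concrete, here is a brief sketch with made-up data (we load the {tidyverse} formally later in this chapter, so this chunk loads it here only for the sake of being self-contained). The first layout stores one test per column; pivoting it produces a tidy layout in which each row is a single observation:
```{r, eval = FALSE}
library(tidyverse)

# Untidy: each row holds several observations (one score per test column)
scores_wide <- tibble(
  student = c("a", "b"),
  test_1 = c(80, 90),
  test_2 = c(85, 95)
)

# Tidy: each row is one observation - one student's score on one test
scores_wide %>%
  pivot_longer(cols = test_1:test_2,
               names_to = "test",
               values_to = "score")
```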
### Background
The online science classes we explore in this chapter were designed and taught
by instructors through a statewide online course provider designed to supplement
- but not replace - students' enrollment in their local school. For example,
students may have chosen to enroll in an online physics class because one was
not offered at their school. The data were originally collected for a research
study, which utilized a number of different data sources to understand students'
course-related motivation. These datasets included:
1. A self-report survey assessing three aspects of students' motivation
2. Log-trace data, such as data output from the learning management system
3. Discussion board data
4. Academic achievement data
Our high-level *purpose* for this walkthrough is to conduct an analysis that
helps explain students' performance in these online courses. The *problem* we
are facing is a very common one when it comes to data science in education: the
data are complex and in need of further processing before we can get to the step
of running analyses. We will use this same dataset in the final walkthrough [Walkthrough 8/Chapter 14](#c14), and we will provide more details there about the dataset and the context.
To understand students' performance, we will focus on a learning management
system (LMS) variable that indicates the amount of time students spent within
the course LMS. We will also explore the effects of the type of science course
and the section in which students are enrolled on student performance.
First, these different data sources will be described in terms of how they were
provided by the school.
### Data Sources
#### Data Source \#1: Self-Report Survey about Students' Motivation
The first data source is a self-report survey. This was data collected before
the start of the course. The survey included ten items, each corresponding to
one of three motivation *measures*: interest, utility value, and perceived
competence. A *measure* is a concept that we try to make concrete and assess
using survey questions. The three motivation measures we explore here come from
Expectancy-Value Theory, which states that students are motivated to learn when
they both believe that they can achieve something (expectancy, also known as
perceived competence) and believe that the concept they are trying to learn is
important (value) [@wigfield2000]. There are multiple types of value, but we
explore two of them here: interest and utility value. Utility value is the
degree to which a person is able to connect the concept being learned with
something they will utilize in their future life. This survey included the
following ten items:
1. I think this course is an interesting subject. (Interest)
2. What I am learning in this class is relevant to my life. (Utility value)
3. I consider this topic to be one of my best subjects. (Perceived competence)
4. I am not interested in this course. (Interest - reverse coded)
5. I think I will like learning about this topic. (Interest)
6. I think what we are studying in this course is useful for me to know.
(Utility value)
7. I don’t feel comfortable when it comes to answering questions in this area.
(Perceived competence - reverse coded)
8. I think this subject is interesting. (Interest)
9. I find the content of this course to be personally meaningful. (Utility
value)
10. I’ve always wanted to learn more about this subject. (Interest)
#### Data Source \#2: Log-Trace Data
*Log-trace data* is data generated from our interactions with digital
technologies, such as archived data from social media postings (see
[Chapter 11](#c11) and [Chapter 12](#c12)). In education, an increasingly common
source of log-trace data is that generated from interactions with learning
management systems and other digital tools [@siemens2012]. The data for this
walkthrough is a summary type of log-trace data: the number of minutes students
spent on the course. While this data type is fairly straightforward, there are
even more complex sources of log-trace data out there (e.g. time stamps
associated with when students started and stopped accessing the course!).
#### Data Source \#3: Academic Achievement and Gradebook Data
This is a common source of data in the education realm: the graded assignments
students completed. In this walkthrough, we examine only students' final grade
for the course.
#### Data Source \#4: Discussion Board Data
Discussion board data is both rich and unstructured because it is made up of
large chunks of text that the students wrote. While discussion board data was
collected for this research project, we do not examine it in this
walkthrough. More information about analyzing text data can be found in
[Chapter 11](#c11).
### Methods
In this walkthrough, we will concentrate on merging different datasets together
by using the different "joins" available in the {dplyr} package. We will also
start exploring how to run linear models in R.
## Load Packages
This analysis uses R packages, which are collections of R code that help users
code more efficiently, as you will recall from [Chapter 1](#c01). We load these
packages with the function `library()`. The specific packages we'll use here
will help us organize the structure of the data using the {tidyverse}
[@R-tidyverse], create formatted tables using {apaTables} [@R-apaTables] and
{sjPlot} [@R-sjPlot], and import datasets using {readxl} [@R-readxl].
***Install packages (if necessary)***
If you have not installed any of these packages before, you will need to do so
before loading them (if you run the code below *prior* to installing the packages, you
should see a message indicating that the package is not available). If you have installed
these before, then you can skip this step.
You can install a single
package, such as the {tidyverse} package, as follows:
```{r, eval = F}
install.packages("tidyverse")
```
If you must install two or more packages, you can do so in a single call to the `install.packages()`
function; the names of the packages must be provided to the function as follows:
```{r, eval = F}
install.packages(c("tidyverse", "apaTables"))
```
When you're installing a package for the first time (which may be needed for the other walkthrough chapters, as well), you will need to take these same steps first.
The good news is that you only need to install a package *once*, after which you can simply load it using `library()` (as below).
More on the installation of packages is included in the [Packages section](#c06p) of [Chapter 6](#c06).
``` {r, loading-packages, message = F, warning = F}
library(tidyverse)
library(apaTables)
library(sjPlot)
library(readxl)
library(dataedu)
```
## Import Data
This code chunk loads the log-trace data, gradebook data, and self-report survey data from the
{dataedu} package. Note that we assign a dataset to an object three different
times, once for each of the three different datasets. We assign each of the
datasets a name using `<-`.
``` {r}
# Pre-survey for the F15 and S16 semesters
pre_survey <- dataedu::pre_survey
# Gradebook and log-trace data for F15 and S16 semesters
course_data <- dataedu::course_data
# Log-trace data for F15 and S16 semesters - this is for time spent
course_minutes <- dataedu::course_minutes
```
## View Data
Now that we've successfully loaded all three datasets, we can visually
inspect the data by typing the names that we assigned to each dataset. Running
each line, one at a time, will show the first several rows of each dataset.
``` {r}
pre_survey
course_data
course_minutes
```
## Process Data
Often, survey data needs to be processed in order to be most useful. We are
going to start with one of the three datasets we just saved as objects into our
environment: the dataset with the pre-survey items. Here, we process the
self-report items into three scales for 1) interest, 2) perceived competence,
and 3) utility value. We do this by:
- Renaming the question variables to something more manageable
- Reversing the response scales on questions 4 and 7
- Categorizing each question into a measure
- Computing the mean of each measure
Let's take these steps in order:
1. Take the pre-survey data and save it as a new object with the same name
"pre\_survey." Rename the question columns to something much simpler. To do
this, we will use the `rename()` function and will input first the new
variable name, then the "=" sign, and then the old variable name. You'll
notice another function at the bottom called `mutate_at()`. We'll explain
that below this code chunk.
``` {r}
pre_survey <-
pre_survey %>%
  # Rename the questions something easier to work with because R is case-sensitive
  # and working with variable names in mixed case is prone to error
rename(
q1 = Q1MaincellgroupRow1,
q2 = Q1MaincellgroupRow2,
q3 = Q1MaincellgroupRow3,
q4 = Q1MaincellgroupRow4,
q5 = Q1MaincellgroupRow5,
q6 = Q1MaincellgroupRow6,
q7 = Q1MaincellgroupRow7,
q8 = Q1MaincellgroupRow8,
q9 = Q1MaincellgroupRow9,
q10 = Q1MaincellgroupRow10
) %>%
# Convert all question responses to numeric
mutate_at(vars(q1:q10), list( ~ as.numeric(.)))
```
Let's take a moment to discuss the {dplyr} function `mutate_at()`. `mutate_at()`
is a version of `mutate()`, which changes the values in an existing column or
creates new columns. It's useful in education datasets because you'll often need
to transform your data before analyzing it. In the code chunk we just ran, we
used `mutate_at()` to convert the data in all ten variables into a numeric
format.
To learn a little more about `mutate()`, try the example below, where we create
a new data frame called "df". A data frame is a two-dimensional structure that stores tabular data; the table has a header and data rows, and each cell stores a value.
We fill this data frame with two columns: "male" and
"female." Each column has only one value, and that value is 5. In the second
part of the code, we add a `total_students` column by adding the number of
`male` students and `female` students.
Note that we create the dataset with `tibble()`, which is from the {tibble} package included in the tidyverse. A tibble is a special type of data frame that makes working with tidy data a little easier. More information is available in *R for Data Science* [@grolemund2018].
```{r}
# Dataset of students
df <- tibble(
male = 5,
female = 5
)
# Use mutate to create a new column called "total_students"
# populate that column with the sum of the "male" and "female" variables
df %>% mutate(total_students = male + female)
```
To return to our original data, we used `mutate_at()`, a special version of
`mutate()` which conveniently changes the values of multiple columns. In our
dataset `pre_survey`, we let `mutate_at()` know we want to change the variables
`q1` through `q10`. We do this with the argument `vars(q1:q10)`.
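As an aside: in more recent versions of {dplyr} (1.0.0 and later), `mutate_at()` is superseded by `mutate()` combined with `across()`. You may encounter this newer style in other people's code; here is a sketch of the equivalent call:
```{r, eval = FALSE}
# Equivalent to the mutate_at() call above, using the newer across() syntax
pre_survey %>%
  mutate(across(q1:q10, as.numeric))
```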
2. Next we'll reverse the scale of the survey responses on questions 4 and 7 so
the responses for all questions can be interpreted in the same way. As you
can see from the survey questions we listed earlier in the chapter, the
phrasing of questions 4 and 7 is opposite the phrasing of the other
questions. Rather than write a lot of code once to reverse the scales for
question 4 then writing it again to reverse the scales on question 7, we'll
build a function that does that job for us. Then we'll use the same function
that we wrote to recode both question 4 and question 7. This will result in
much less code, and it will make it easier for us to change in the future.
We'll use the function `case_when()` within our function to reverse the scale of
the item responses. `case_when()` is useful when you need to replace the values
in a column with other values based on some criteria. Education datasets use a
lot of codes to describe demographics, like numerical codes for disability
categories, race groups, or proficiency in a test. When you work with codes like
this, you'll often want to change the codes to values that are easier to
understand. For example, a consultant analyzing how students did on state
testing might use `case_when()` to replace proficiency codes like 1, 2, or 3 with
more descriptive words like "below proficiency", "proficient", or "advanced."
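To make that example concrete, here is a small sketch using made-up data (the dataset and its columns are hypothetical):
```{r}
# Made-up proficiency codes recoded into descriptive labels
state_scores <- tibble(
  student = c("a", "b", "c"),
  prof_code = c(1, 3, 2)
)

state_scores %>%
  mutate(
    prof_label = case_when(
      prof_code == 1 ~ "below proficiency",
      prof_code == 2 ~ "proficient",
      prof_code == 3 ~ "advanced",
      TRUE ~ NA_character_
    )
  )
```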
`case_when()` lets you vectorize the rules you want to use to change values in a
column. When a sequence of criteria is vectorized, R will evaluate a value in a
column against each criteria in your `case_when()` sequence. `case_when()` is
helpful because it does this without requiring you to write complicated code
loops. Instead, you can systematically evaluate your criteria using code that is
compact and readable (once you understand how all the arguments work).
The left hand side of each `case_when()` argument will be a formula that returns
either a `TRUE` or a `FALSE`. In the function below, we'll use logical operators
in the left hand side of the formula like this: `question == 1 ~ 5`. Here are
some logical operators you can use in the future:
- `==`: equal to
- `>`: greater than
- `<`: less than
- `>=`: greater than or equal to
- `<=`: less than or equal to
- `!=`: not equal to
- `!`: not
- `&`: and
- `|`: or
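These operators are vectorized: comparing a whole column to a value returns a `TRUE` or `FALSE` for each element, which is exactly what the left-hand side of each `case_when()` formula relies on. A quick demonstration with a made-up vector:
```{r}
# Comparisons are vectorized: one TRUE or FALSE is returned per element
responses <- c(1, 3, 5)
responses == 1
responses >= 3
```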
Let's make this all concrete and use it here in our function that reverses the
scale of the survey responses. In the first part of the code chunk below, we
will write our function. Note that by running the first bit of code here, we
won't be changing anything in our data. Instead, we are creating a reusable,
general-purpose code chunk that we will then apply to the specific survey
questions we want to recode.
```{r}
# This part of the code is where we write the function:
# Function for reversing scales
reverse_scale <- function(question) {
# Reverses the response scales for consistency
# Arguments:
# question - survey question
# Returns:
# a numeric converted response
# Note: even though 3 is not transformed, case_when expects a match for all
# possible conditions, so it's best practice to label each possible input
# and use TRUE ~ as the final statement returning NA for unexpected inputs
x <- case_when(
question == 1 ~ 5,
question == 2 ~ 4,
question == 3 ~ 3,
question == 4 ~ 2,
question == 5 ~ 1,
TRUE ~ NA_real_
)
x
}
# And here's where we use that function to reverse the scales
# We use the pipe operator %>% here
# Reverse scale for questions 4 and 7
pre_survey <-
pre_survey %>%
mutate(q4 = reverse_scale(q4),
q7 = reverse_scale(q7))
```
You'll notice that we call the `reverse_scale()` function we wrote, but we call
it within the `mutate()` function. By doing things this way, we are overwriting
the previous data in the columns for questions 4 and 7 with our newly recoded
values for questions 4 and 7.
3. Next, we'll use a function called `pivot_longer()` in order to transform our
`pre_survey` dataset from wide format to long format. That means instead of
having 1,102 observations of 12 variables, we will now have 11,020
observations of 4 variables. By using `pivot_longer()`, we make it so that
each question & response pair has its own line in the data. Therefore, since
we have 10 question variables (columns) in the `pre_survey` dataset, after
we use `pivot_longer()`, we will end up with 10 times as many observations
(rows) as before. Additionally, we no longer need a separate column for each
individual question, since each question-response pair is now on its own
line. What was previously one row of data now takes up ten rows of data, so
`pivot_longer()` automatically removes those now-empty question columns after condensing
all the data. We'll save this new dataset as an object called `measure_mean`
in order to represent new variables we will create in future data processing
steps.
``` {r}
# Pivot the dataset from wide to long format
measure_mean <-
pre_survey %>%
# Gather questions and responses
pivot_longer(cols = q1:q10,
names_to = "question",
values_to = "response")
```
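As an aside, `pivot_wider()` (listed among the functions introduced at the start of this chapter) is the inverse of `pivot_longer()`: it spreads a long dataset back out to wide format. Here is a sketch of how we could undo the step above, were we to need to:
```{r, eval = FALSE}
# Spread the long data back out so each question is its own column again
measure_mean %>%
  pivot_wider(names_from = "question",
              values_from = "response")
```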
4. Next, we'll take our new `measure_mean` dataset and create a column called
`measure`. We'll fill that column with one of three question categories:
- `int`: interest
- `uv`: utility value
- `pc`: perceived competence
We will use the `case_when()` function we learned earlier in order to do this.
When we pivoted from wide format to long format in the prior step, we ended up
with one variable containing all possible question numbers (q1, q2, etc.). Now,
we want to tell R which question numbers correspond to which categories. To do
this, we will supply `case_when()` with a list of all the question numbers that
correspond to each category: interest, utility value, and perceived competence.
We'll introduce a new operator in order to do this: `%in%`. Practically, this
operator means that R should look within a list for something. In the code
below, we tell R to use questions 1, 4, 5, 8, and 10 to create the category
`int`. We then tell R to use questions 2, 6, and 9 to calculate `uv` and to use
questions 3 and 7 to calculate `pc`. We indicate the start of each list with the
letter "c" and an open parenthesis, and we indicate the end of each list with a
close parenthesis.
```{r}
# Add measure variable
measure_mean <- measure_mean %>%
# Here's where we make the column of question categories called "measure"
mutate(
measure = case_when(
question %in% c("q1", "q4", "q5", "q8", "q10") ~ "int",
question %in% c("q2", "q6", "q9") ~ "uv",
question %in% c("q3", "q7") ~ "pc",
TRUE ~ NA_character_)
)
```
5. Last, we'll take that same `measure_mean` dataset and create a new variable
called `mean_response.` Since we are calculating the mean by category, we
will need to first group the responses together using a function called
`group_by()`. This function helps us get set up to calculate new columns
with grouped data. Next, we'll use the function `summarize()` to create two
new variables: `mean_response` and `percent_NA.` We'll find the mean
response of each category using the `mean()` function.
``` {r}
# Add measure variable
measure_mean <- measure_mean %>%
# First, we group by the new variable "measure"
group_by(measure) %>%
# Here's where we compute the mean of the responses
summarize(
# Creating a new variable to indicate the mean response for each measure
mean_response = mean(response, na.rm = TRUE),
# Creating a new variable to indicate the percent of each measure that
# had NAs in the response field
percent_NA = mean(is.na(response))
)
measure_mean
```
With that last step, we have finished processing the `pre_survey` dataset. We
have renamed the relevant variables and computed some means that we can use
later.
### Processing the Course Data
When we loaded the data earlier in the chapter, we loaded three datasets:
`pre_survey`, `course_data`, and `course_minutes.` Next, we can process the
course data that we already loaded into the environment in order to create new
variables which we can use in analyses.
Information about the course subject, semester, and section are stored in a
single column, `CourseSectionOrigID`. This format of data storage is not ideal.
If we instead give each piece of information its own column, we'll have more
opportunities for later analysis. We'll use a function called `separate()` to do
this. Below, we will load `course_data` and run the `separate()` function to
split up the subject, semester, and section so we can use them later on.
``` {r}
# split course section into components
course_data <-
course_data %>%
# Give course subject, semester, and section their own columns
separate(
col = CourseSectionOrigID,
into = c("subject", "semester", "section"),
sep = "-",
remove = FALSE
)
```
After running the code chunk above, take a look at the dataset `course_data` to
make sure it looks the way you'd expect it to look. In this case, we are
expecting that we will add 3 new variables, taking the total number of variables
in this dataset from 8 to 11. We will still see the original variable
`CourseSectionOrigID` in the data as well.
### Joining the Data
In this chapter, we are looking at two datasets that are derived from the same
courses. In order for these datasets to be most useful to us, we'd like all that
data to be in one place.
To join the course data and pre-survey data, we need to create similar *keys*.
Our goal here is to have one variable that matches across both datasets. Once we
have that common variable in both datasets, we can merge the datasets on the
basis of that variable.
When we look at the `course_data` and `pre_survey` datasets in our environment,
we see that both have variables for the course and the student. However, this
information is captured in different variable names in each dataset. Our first
goal will be to rename two variables in each of our datasets so that they will
match. One variable will correspond to the course, and the other will correspond
to the student. We are not changing anything in the data itself at this step -
instead, we are just cleaning the column headers up so that we can look at the
data all in one place.
Let's start with the pre-survey data. We will rename `opdata_username` and
`opdata_CourseID` to be `student_id` and `course_id`, respectively. Here, we are
going to use the same `rename()` function we learned earlier in this chapter.
```{r}
pre_survey <-
pre_survey %>%
rename(student_id = opdata_username,
course_id = opdata_CourseID)
pre_survey
```
Those variable names look better now!
When we look at the data more closely, though, we will notice that the
`student_id` variable has another issue - the variable has some additional
characters before and after *the actual ID* that we will need to be able to join
this data with the other data sources we have. Why does this variable have these
additional characters? Why is there a "1" at the end of every 5-digit ID number?
We are not sure! Sometimes, educational data from different systems (used for
different purposes) may have additional "meta"-data added on. In any event, here
is what the variables look like before processing:
```{r}
head(pre_survey$student_id)
```
What we need is the five characters in between the underscore symbols - these:
`_`.
One way to do this is to use the `str_sub()` function from the {stringr}
package. This function lets us subset *string* variables: variables that store
text data. You can specify the indices of the variables you want the string to
*start* and *end* with.
Here, for example, is how we can select only the content starting with the
second character, skipping the first underscore in the process. This next chunk
of code will not change our data, but will show you how the `str_sub()` function
works by supplying a number-and-underscore combination to the function that is
in the same format as our data.
```{r}
str_sub("_99888_1", start = 2)
```
We can apply the same thinking to delete characters from the end of a string. We
will use a `-` to indicate that we want to start from the right side of the
string of characters. Interestingly, when we specify the argument `end` below,
we will tell it the placement of the first character we want to INCLUDE. When we
type `end = -3`, we end up deleting only the last 2 characters. Our new
rightmost character will be the final 8.
```{r}
str_sub("_99888_1", end = -3)
```
Putting the pieces together, the following should return what we want. Try
running the code below to see if it yields the 5-digit ID number we are shooting
to extract.
```{r}
str_sub("_99888_1", start = 2, end = -3)
```
_Note: you may receive a warning telling you that `NA` values were introduced by coercion. This happens when we change data types, and we will overlook this warning message for the purposes of this walkthrough._
We can apply this process to our data using `mutate()`. We convert the string
into a number using `as.numeric()` in the next portion of the code. This step is
important so the data can be joined to the other, numeric `student_id` variables
(in the other datasets):
```{r}
# Re-create the variable "student_id" so that it excludes the extraneous characters
pre_survey <- pre_survey %>%
mutate(student_id = str_sub(student_id, start = 2, end = -3))
# Save the new variable as numeric so that R no longer thinks it is text
pre_survey <- pre_survey %>%
mutate(student_id = as.numeric(student_id))
```
Now that the `student_id` and `course_id` variables are ready to go in the
`pre_survey` dataset, let's proceed to the course data. Our goal is to rename
two variables that correspond to the course and the student so that we can match
with the other variables we just created for the pre-survey data. In the code
chunk below, we will rename both those variables.
``` {r}
course_data <-
course_data %>%
rename(student_id = Bb_UserPK,
course_id = CourseSectionOrigID)
```
Now that we have two variables that are consistent across both datasets - we
have called them `course_id` and `student_id` - we can join the two datasets
using the {dplyr} function, `left_join()`.
`left_join()` is named based on the 'direction' that the data is being joined.
Note the order of the data frames passed to our "left" join. Left joins retain
all of the rows in the data frame on the "left" and join every matching row in
the right data frame to it. We will use two variables as keys for joining the
datasets, and we will specify those after the word `by`.
Let's save our joined data as a new object called `dat`.
```{r}
dat <-
left_join(course_data, pre_survey,
by = c("student_id", "course_id"))
dat
```
Let's home in on how this code is structured. After `left_join()`, we see
`course_data` and then `pre_survey`. In this case, `course_data` is the "left"
data frame (passed as the *first* argument), while `pre_survey` is the "right"
data frame (passed as the *second* argument). So, in the above code, what
happens? You can run the code yourself to check.
Our aim with that code is that all of the rows in `course_data` are retained in
our new data frame, `dat`, with matching rows of `pre_survey` joined to it. An
important note is that there are not multiple matching rows of `pre_survey`;
otherwise, you would end up with more rows in `dat` than expected. There is a
lot packed into this one function. Joins are extremely powerful - and common -
in many data analysis processing pipelines, both in education and in any field.
Think of all of the times you have data in more than one data frame, but you
want everything to be in a single data frame! As a result, we think that joins
are well worth investing the time to be able to use.
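If the mechanics of `left_join()` still feel abstract, here is a toy example with two made-up data frames (the names and values are hypothetical). All of the rows of the left data frame are kept, and `NA` fills in where the right data frame has no match:
```{r}
# Toy example: student 3 appears only in the left data frame,
# so its interest value is NA after the join
grades_toy <- tibble(student_id = c(1, 2, 3),
                     grade = c(90, 85, 70))
survey_toy <- tibble(student_id = c(1, 2),
                     interest = c(4, 5))

left_join(grades_toy, survey_toy, by = "student_id")
```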
With most types of data, `left_join()` is helpful for carrying out most tasks
related to joining datasets. However, there are functions for other types of
joins that we want to make sure you know how to use. They may be less frequently
used than `left_join()`, but they are still worth mentioning. Note that for all
of these, the "left" data frame is always the first argument, and the "right"
data frame is always the second. When running the code chunks below, it can be
helpful to pay attention to the number of observations and variables in the
datasets before and after the joining. Eventually, the obscure names of these
types of joins will start to become more intuitive as you use them more often.
#### `semi_join()`
`semi_join()`: joins and retains all of the *matching* rows in the "left" and
"right" data frame. This is useful when you are only interested in keeping the
rows (or cases/observations) that are able to be joined. `semi_join()` will not
create duplicate rows of the left data frame, even when it finds multiple
matches on the right data frame. It will also keep only the columns from the
left data frame.
For example, the following returns only the rows that are present in both
`course_data` and `pre_survey`:
```{r}
dat_semi <-
semi_join(course_data,
pre_survey,
by = c("student_id", "course_id"))
dat_semi
```
#### `anti_join()`
`anti_join()`: *removes* all of the rows in the "left" data frame that can be
joined with those in the "right" data frame.
```{r}
dat_anti <-
anti_join(course_data,
pre_survey,
by = c("student_id", "course_id"))
dat_anti
```
#### `right_join()`
`right_join()`: perhaps the least helpful of the three, `right_join()` works the
same as `left_join()`, but by retaining all of the rows in the "right" data
frame, and joining matching rows in the "left" data frame (so, the opposite of
`left_join()`).
```{r}
dat_right <-
right_join(course_data,
pre_survey,
by = c("student_id", "course_id"))
dat_right
```
If we wanted this to return exactly the same output as `left_join()` (and so to
create a data frame that is identical to the `dat` data frame we created above), we could
simply switch the order of the two data frames to be the opposite of those used
for the `left_join()` above:
```{r}
dat_right <-
right_join(pre_survey,
course_data,
by = c("student_id", "course_id"))
dat_right
```
Now that we've gone through the different types of joins available, we will
return to our main focus: joining our course datasets together. While we didn't
do any data processing steps on it, we still have the `course_minutes` dataset
in our environment from when we loaded it there earlier in the chapter. In the
code chunk below, we will rename the necessary variables in that dataset so that
it is ready to merge. Then, we will merge the `course_minutes` dataset, with its
newly renamed variables `student_id` and `course_id`, with our `dat` dataset.
```{r}
course_minutes <-
course_minutes %>%
rename(student_id = Bb_UserPK,
course_id = CourseSectionOrigID)
course_minutes <-
course_minutes %>%
# Change the data type for student_id in course_minutes so we can match to
# student_id in dat
mutate(student_id = as.integer(student_id))
dat <-
dat %>%
left_join(course_minutes,
by = c("student_id", "course_id"))
```
Note that they're now combined, even though the course data had many more rows.
The pre-survey data has been joined for each student-course combination. We
have a pretty large data frame! Let's take a quick look.
``` {r}
dat
```
It looks like we have `r nrow(dat)` observations of 22 variables.
### Finding Distinct Cases at the Student-Level
If a student was enrolled in two courses, she will have a different final grade for each of
those two courses. However, our data in
its current form has many rows for *each* student-course combination - one row for each
gradebook item. An easy way to visually inspect whether every row
is the same for the same student, by course, is to use the `glimpse()` function.
Try it below.
```{r}
glimpse(dat)
```
You can also use `View(dat)` in order to view the data in RStudio's viewer.
Visually inspecting the first several rows of data, we see that they all
correspond to the same student for the same course. As we expected, the
`FinalGradeCEMs` variable (representing students' final grade) is also consistent across these rows.
Since we are not carrying out a finer-grained analysis using the `Gradebook_Item`, these
duplicate rows are not necessary. We only want variables at the student level,
and not at the level of different gradebook items. We can extract only the
unique student-level data using the `distinct()` function. This function takes
as arguments the name of the data frame and the name of the variables used to
determine what counts as a unique case.
Imagine having a bucket of Halloween candy that has 100 pieces of candy. You
know that these 100 pieces are really just a bunch of duplicate pieces from a
relatively short list of candy brands. `distinct()` takes that bucket of 100
pieces and returns a bucket containing only one of each distinct piece.
Another thing to note about `distinct()` is that it will only return the
variable(s) (it is possible to pass more than one variable to `distinct()`) you
used to determine uniqueness, *unless* you include the argument `.keep_all =
TRUE`. For the sake of making it simple to view the output, we will omit this
argument for now.
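Here is that behavior in miniature, using the candy analogy and a made-up dataset:
```{r}
# distinct() returns only the variable used to determine uniqueness,
# unless .keep_all = TRUE is supplied (which keeps the first matching row)
candy <- tibble(brand = c("a", "a", "b"),
                piece = c(1, 2, 3))

distinct(candy, brand)
distinct(candy, brand, .keep_all = TRUE)
```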
Were we to run `distinct(dat, Gradebook_Item)`, what do you think would be
returned? Running the following code returns a one-column data frame that lists the names of
every distinct gradebook item.
``` {r}
distinct(dat, Gradebook_Item)
```
You might be wondering whether some gradebook
items have the same names across courses. We can return the unique *combination*
of courses and gradebook items by simply adding another variable to
`distinct()`:
``` {r}
distinct(dat, course_id, Gradebook_Item)
```
The data frame we get when we run the code chunk above yields a much longer (more
observations) dataset. Thus, it looks like *a lot* of gradebook items were
repeated across courses - likely across the different sections of the same
course. If you'd like, you can continue to investigate this: we would be curious
to hear what you find if you do!
Next, let's use a similar process to find the unique values at the student
level. Thus, instead of exploring unique gradebook items, we will explore unique
students (still accounting for the course, as students could enroll in more than
one course.) This time, we will add the `.keep_all = TRUE` argument.
```{r}
dat <-
distinct(dat, course_id, student_id, .keep_all = TRUE)
```
This is a much smaller data frame - with one row for each student in the course.
Whereas our prior version of the `dat` dataset had over 40,000 rows, that prior
version is only helpful if we wanted to do an analysis at the level of specific
students' grades for specific gradebook items. Our new dataset keeps only the
unique combinations of student and course, leaving us with a more manageable
number of observations: `r nrow(dat)`. Now that our data are ready to go, we can
start to ask some questions of the data.
Let's take one last step. Since we will be using the final grade variable in
many of the figures and analyses that follow, let's rename it using the {dplyr}
`rename()` function to something that is a bit easier to type and remember than
`FinalGradeCEMS`:
```{r}
dat <- rename(dat, final_grade = FinalGradeCEMS)
```
## Analysis
In this section, we focus on some initial analyses in the form of visualizations
and some models. We expand on these in [Chapter 13](#c13). Before we start
visualizing relationships between variables in our survey dataset, let's
introduce {ggplot2}, a visualization package we'll be using in our walkthroughs.
### About {ggplot2}
{ggplot2} is a package we’ll be using a lot for graphing our education datasets.
{ggplot2} is designed to build graphs layer by layer, where each layer is a
building block for your graph. Making graphs in layers is useful because we can
think of building up our graphs in separate parts: the data comes first, then
the x-axis and y-axis, and finally other components like text labels and graph
shapes. When something goes wrong and your ggplot2 code returns an error, you
can learn about what’s happening by removing one layer at a time and running it
again until the code works properly. Once you know which line is causing the
problem, you can focus on fixing it.
The first two lines of {ggplot2} code look similar for most graphs. The first
line tells R which dataset to graph and which columns the x-axis and y-axis will
represent. The second line tells R which shape to use when drawing the graph.
You can tell R which shape to use in your graphs with a family of {ggplot2}
functions that start with `geom_`. {ggplot2} has many graph shapes you can use,
including points, bars, lines, and boxplots. Here’s a {ggplot2} example using a
dataset of school mean test scores to graph a bar chart:
```{r fig7-1, warning = FALSE, message = FALSE, results = 'hide', fig.showtext = TRUE, fig.cap = "Example Plot"}
# make dataset
students <-
tibble(
school_id = c("a", "b", "c"),
mean_score = c(10, 20, 30)
)
# tell R which dataset to plot and which columns the x-axis and y-axis will represent
students %>%
ggplot(aes(x = school_id, y = mean_score)) +
# draw the plot
geom_bar(stat = "identity",
fill = dataedu_colors("darkblue")) +
theme_dataedu()
```
The `data` argument in the first line tells R we’ll be using the dataset called
`students`. The `aes` argument tells R we’ll be using values from the
`school_id` column for the x-axis and values from the `mean_score` column for
the y-axis. In the second line, the `geom_bar` function tells R we’ll be drawing
the graph using the bar chart format. Each line of {ggplot2} code is connected by a
`+` at the end to tell R the next line of code is an additional {ggplot2} layer to
add.
Writing code is like writing essays. There's a range of acceptable styles and certainly you can practice unusual ways of writing, but other people will find it harder to understand what you want to say. In this book, you'll see variations in {ggplot2} style, but all within what we believe is the range of acceptable conventions. Here are some examples:
- Piping data to `ggplot()` using `%>%` vs including it as an argument in `ggplot()`
- Using `ggtitle()` for labels vs using `labs()`
- Order of `ggplot()` levels
It's ok if those terms are new to you. The main point is there are multiple ways to make the plot you want. You'll see that in this book and in other peoples' code. As you learn, we encourage you to practice empathy and think about how well your code conveys your ideas to other people, including yourself when you look at it many weeks from when you wrote it.
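To make the first of those variations concrete, here is the same plot specification written both ways, using the made-up `students` dataset from above:
```{r, eval = FALSE}
# Style 1: pipe the data into ggplot()
students %>%
  ggplot(aes(x = school_id, y = mean_score)) +
  geom_point()

# Style 2: pass the data as the first argument to ggplot()
ggplot(students, aes(x = school_id, y = mean_score)) +
  geom_point()
```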
### The Relationship between Time Spent on Course and Final Grade
One thing we might be wondering is how time spent on course is related to
students' final grade. Let's make a plot to depict that relationship. Below,
we'll use `geom_point` instead of `geom_bar`.
``` {r fig7-2, message = FALSE, warning = FALSE, results = 'hide', fig.cap = "Percentage Earned vs. Time Spent", fig.showtext = TRUE}
dat %>%
# aes() tells ggplot2 what variables to map to what feature of a plot
# Here we map variables to the x- and y-axis
ggplot(aes(x = TimeSpent, y = final_grade)) +
# Creates a point with x- and y-axis coordinates specified above
geom_point(color = dataedu_colors("green")) +
theme_dataedu() +
labs(x = "Time Spent",
y = "Final Grade")
```
_Note: you may receive a warning that reads `Warning message: Removed 5 rows containing missing values (geom_point).` This is due to the `NA` values that were introduced through coercion earlier in this walkthrough, and are not a cause for alarm!_
There appears to be *some* relationship. What if we added a line of best fit - a
linear model? The code below is the same plot we just made, but it includes the
addition of another layer called `geom_smooth`.
``` {r fig7-3, message = FALSE, warning = FALSE, results = 'hide', fig.cap = "Adding a Line of Best Fit", fig.showtext = TRUE}
dat %>%
ggplot(aes(x = TimeSpent, y = final_grade)) +
geom_point(color = dataedu_colors("green")) + # same as above
# this adds a line of best fit
# method = "lm" tells ggplot2 to fit the line using linear regression
geom_smooth(method = "lm") +
theme_dataedu() +
labs(x = "Time Spent",
y = "Final Grade")
```
Looking at this plot, it appears that the more time students spent on the
course, the higher their final grade was.
What is the line doing in the upper right part of the graph? Based upon the trend
that is observable in the data, the line of best fit predicts that students who spend
a particular amount of time on the course *earn greater than 100* for their final grade!
Of course, this is not possible, and this highlights the importance of understanding your
data and carefully interpreting lines of best fit (and other, more sophisticated analyses),
keeping that understanding and knowledge in mind as you present and make sense
of the results.
### Linear Model (Regression)
We can find out exactly what the relationship between these two variables is
using a linear model. We discuss linear models in more detail in
[Chapter 10](#c10).
Let's use this technique to model the relationship between the time spent on the
course and students' final grade. Here, we predict
`final_grade`. Students' final, reported grade is the dependent, or *y*-variable, and so
we enter it first, after the `lm()` command and before the tilde (`~`) symbol.
To the right of the tilde is one independent variable, `TimeSpent`, or the time
that students spent on the course. We also pass, or provide, the data frame,
`dat`. At this point, we're ready to run the model. Let's run this line of code
and save the results to an object - we chose `m_linear`, but any name will work.
We will then run the `summary()` function on the output.
``` {r}