-
Notifications
You must be signed in to change notification settings - Fork 0
/
merging.Rmd
113 lines (91 loc) · 2.77 KB
/
merging.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
---
title: "Biased highschool teacher"
author: "Julian Barg"
date: "October 19, 2018"
output: rmarkdown::github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(reticulate)
```
## Read data
First, we read a file with student names and attributes.
```{r read, message=FALSE}
library(readr)
library(tidyverse)
students <- read_csv('students.csv')
glimpse(students)
```
Next, we read the file with the teachers preferences into memory.
```{r read2, message=FALSE}
preferences <- read_csv('preferences.csv')
glimpse(preferences)
```
## Transform dataframe
We tranform the students dataframe to the long format.
```{r melt}
library(reshape2)
students_long <- melt(students,
id.vars=c('First Name', 'Last Name'),
variable.name='Characteristic')
head(students_long)
```
Next, we remove rows where the value is missing or false, because the preferences dataframe only makes statements about TRUE values.
```{r drop}
students_long <- subset(students_long, value==TRUE, drop=TRUE)
head(students_long)
```
We can even drop the value column then.
```{r drop_value}
students_long <- students_long[ ,1:3]
head(students_long, 2)
```
## Merging options
Now, we can use every merging option to join the two dataframe.
### Inner:
```{r inner_join, message=FALSE}
inner <- inner_join(students_long, preferences)
str(inner)
```
### Full (outer):
```{r full_join, message=FALSE}
full <- full_join(students_long, preferences)
str(full)
```
### Left:
```{r left_join, message=FALSE}
left <- left_join(students_long, preferences)
str(left)
```
### Right:
```{r right_join, message=FALSE}
right <- right_join(students_long, preferences)
str(right)
```
## 'Who' column
Notice the unwanted Who column though. It seems the teacher wrote down the name of a student that had the characteristic as an example here. To add those rows, we rename that column and carry out an additional outer join on all rows, which effectively adds all rows from the second dataframe. Then we could clean up the dataframe and obtain our final dataframe.
```{r , message=FALSE}
join_2 <- union_all(full, rename(preferences, 'First Name' = Who))
join_2 <- join_2[ ,1:5]
print(join_2)
```
## Final result
```{r final}
join_2 %>%
filter(is.na(`First Name`)==FALSE | is.na(`Last Name`)==FALSE) %>%
group_by(`First Name`, `Last Name`) %>%
filter(Priority == min(Priority)) %>%
select(-Characteristic) %>%
arrange(`First Name`, `Last Name`)
```
## Join in python
```{python test}
import pandas as pd
preferences_py = r.preferences
students_long_py = r.students_long
pandas_join = pd.merge(preferences_py, students_long_py,
left_on='Characteristic',
right_on='Characteristic',
how='outer')
print(pandas_join)
```