This project analyzes the relationship between the rating and the proportion of five-star ratings of movies on Douban, a Chinese online rating platform. By fitting the data to the linear model, this project reveals the prediction confidence interval of five-star proportions at each level of rating, showing that some movies display abnormal distribution of ratings.
This project uses MySQL to store data. Execute MovieRating.sql to prepare the data tables required.
This project takes two steps to collect data from Douban. Firstly, this project uses Selenium to carry out the stratified sampling of movies in different eras. Selenium scrapes around 3800 movie URLs in total, with each URL directing to the individual page of the movie with the rating, distribution of ratings, published region and date, and movie tags. Secondly, this project uses requests and regular expression to scrape data from the individual movie pages.
Movies have different numbers of tags recorded in columns from tag1 to tag5. Melting all tags into one column generates a tag variable for further analysis.
db = get_database_connection("localhost", 3306, "root", "123456")
sql = """
SELECT *
FROM movie_ratings
JOIN movie_tags,
rating_distributions
WHERE tag_id=movie_id AND tag_id=distribution_id
"""
# The SQL code above joins three different tables
df = pd.read_sql(sql, db)
# Melt columns of different tags into a single column
df_tag_melt = pd.melt(df, id_vars=['name', 'rating', 'region', 'date', 'one', 'two', 'three', 'four', 'five'],
value_vars=['tag1', 'tag2', 'tag3', 'tag4', 'tag5'], var_name="tag", value_name="tag_name")
In the raw data, movies have their produced regions joined by "/". To consider the effect of the produced region, the region column must be firstly split into different columns by "/", and then melted into a single column.
df_region_split = pd.DataFrame((x.split('/') for x in df_tag_melt['region']), columns=['region1', 'region2', 'region3', 'region4', 'region5', 6, 7, 8, 9, 10])
df = pd.concat([df_tag_melt, df_region_split], axis=1)
# Melt the columns of different regions into a single column
df = pd.melt(df, id_vars=['name', 'rating', 'date', 'one', 'two', 'three', 'four', 'five', 'tag', 'tag_name'],
value_vars=['region1', 'region2', 'region3', 'region4', 'region5'], var_name='region',
value_name='region_name')
The operations of splitting and melting generate many rows with missing values. Dropping the missing values keeps the valid data.
df = df.dropna()
df = df.reset_index(drop=True)
To take the date released into consideration, assign the era of the movies as a variable according to the date released.
# Use the era dict to assign "eras" for corresponding dates
era_dict = {r"202\d": "2020s", r"201\d": "2010s", r"200\d": "2000s", r"199\d": "1990s", r"198\d": "1980s", r"197\d": "1970s", r"196\d": "1960s"}
for row in df.itertuples():
for era in era_dict:
df.loc[row[0], 'era'] = 'Earlier'
if len(re.findall(era, str(row))) > 0:
df.loc[row[0], 'era'] = era_dict[era]
break
To consider the relationship between the rating and whether the movie is produced in China, assign 1 for movies that are made in China, and 0 for those that are not.
To better fit the linear model, take log10 of the response variable "five-star proportion."
model <- lm(log10_proportion~rating, data = data)
summary(model)
pred.int <- predict(model, interval = "prediction")
mydata <- cbind(data, pred.int)
p <- ggplot(mydata, aes(rating, log10_proportion)) +
geom_point(colour="#4D4D4D") +labs(x="Rating", y="Five-Stars Proportion", title='Relationship between Rating and Five-Stars Proportion')+
stat_smooth(method = lm, colour="black")
p + geom_line(aes(y = lwr), color = "red", linetype = "dashed")+
geom_line(aes(y = upr), color = "red", linetype = "dashed")
ggplot(data, aes(log10_proportion, rating, color=factor(is_made_in_china)))+geom_point()+
labs(x="Five-stars Proportion", y="Rating", title="Relationship Between Rating and Five-stars Proportion")+
theme_bw()+scale_y_log10()+geom_smooth(method="lm")+guides(color=guide_legend(title="Whether made in China \n (0:No, 1:Yes)"))
ggplot(data, aes(log10_proportion, rating, color=factor(is_made_in_china)))+geom_point()+
labs(x="Five-stars Proportion", y="Rating", title="Relationship Between Rating and Five-stars Proportion")+
theme_bw()+scale_y_log10()+geom_smooth(method="lm")+guides(color=guide_legend(title="Whether made in China \n (0:No, 1:Yes)"))+facet_wrap("era")