Shannon-Watts/netflix_analysis

📺 Netflix analysis 📺

This project was completed by Chisimnulia Okoye, Sofia Kauser, and Shannon Watts.

Research proposal

Our project focuses on two datasets sourced from Kaggle. We chose these datasets as we love to watch Netflix ... but we had run out of TV shows and films to watch. We wanted to see how many of IMDB's top 1000 films and TV series were on Netflix. Surely the most rated and top grossing films are the most interesting? So we set out to identify these films and TV shows and find out how many of them were not on Netflix.

Research Questions:

How many of IMDB’s top 1000 films are currently on Netflix?

What is the corresponding IMDB score for these films, and has Netflix missed any major top rated films?

Which release years are most common in IMDB’s top 1000? Possible suggestions for films to be added next month?

The Datasets:

We chose these datasets because we thought they would best illustrate what we wanted to find.

IMDB Movies Dataset https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

image

Netflix Movies and TV Shows https://www.kaggle.com/datasets/shivamb/netflix-shows

image

Analysis 🔍

Our ETL on the Netflix and IMDB data allowed us to draw some very interesting conclusions...

How many of IMDB’s top 1000 films are currently on Netflix?

We found that there were 8807 titles currently on Netflix that WERE NOT on the IMDB top 1000...

We found that the average rating for films and TV shows on Netflix that were also on the IMDB top 1000 was 7.9, with an average gross of approx 68,000,000 - not too bad, but we have provided some examples below of how Netflix can increase its viewer ratings by adding higher IMDB rated films.

What is the corresponding IMDB score for these films, and has Netflix missed any major top rated films?

We looked into the top five highest rated IMDB films that were not on Netflix. They had an IMDB rating of at least 9. These included The Shawshank Redemption, The Godfather Part I and II, The Dark Knight, and 12 Angry Men. We would argue that Netflix is missing out by not showing these films... no wonder people have decided to spend their weekends outside again...
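A minimal sketch of how such a "missing from Netflix" list could be produced with pandas. The rows below are invented toy data, and the column names `title` and `IMDB_Rating` are assumptions about how the cleaned DataFrames were named; matching on title alone is also a simplification.

```python
import pandas as pd

# Toy stand-ins for the cleaned IMDB and Netflix frames (invented rows).
imdb = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "IMDB_Rating": [9.2, 8.5, 9.0],
})
netflix = pd.DataFrame({"title": ["Film B", "Show D"]})

# Boolean mask: True where an IMDB title also appears in the Netflix catalogue.
on_netflix = imdb["title"].isin(netflix["title"])

# Keep only the IMDB titles that are missing from Netflix,
# then take the five highest rated among them.
missing = imdb.loc[~on_netflix]
top_missing = missing.nlargest(5, "IMDB_Rating")

print(top_missing)
```

Exact title matching will miss films whose titles are spelled differently in the two datasets, so the counts it gives are best read as approximate.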

Which release years are most common in IMDB’s top 1000? Possible suggestions for films to be added next month?

We found that 2018 was the release year with the highest count of films and TV shows both on Netflix and in the IMDB top 1000. We then decided to look at the top rated films on IMDB from that year that are not currently showing on Netflix. We found that Capharnaum, Spider-Man: Into the Spider-Verse, Avengers: Infinity War, Tumbbad, and Andhadhun were missing from Netflix... We definitely agree that these should be added next month!
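The release-year count could be sketched roughly as follows. Again the rows are invented, and the column names `title` and `release_year` are assumptions about the cleaned data rather than a record of the exact query we ran.

```python
import pandas as pd

# Toy stand-ins (invented rows) for the cleaned frames.
imdb = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "release_year": [2018, 2018, 1999, 2010],
})
netflix = pd.DataFrame({"title": ["Film A", "Film B", "Film C"]})

# Titles that appear in both the IMDB list and the Netflix catalogue.
overlap = imdb[imdb["title"].isin(netflix["title"])]

# Count titles per release year and pick the most frequent year.
year_counts = overlap["release_year"].value_counts()
most_common_year = year_counts.idxmax()

print(year_counts)
print(most_common_year)
```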

Extract, Transform & Load: how we came to our conclusions

Extract 📂

We decided to extract the two CSV files and examine both separately to see what we were working with.

image

image

Transform 🧹

Many of the columns were not needed, so we dropped them and then renamed the column headers in both DataFrames so that we could concatenate the two DataFrames:

image
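The screenshot above shows our actual code. As a rough text sketch of the same idea, assuming the Kaggle column names `Series_Title`, `Released_Year` and `Overview` for IMDB and `title`, `release_year` and `description` for Netflix (the rows themselves are invented):

```python
import pandas as pd

# Toy stand-ins for the two CSV files (invented rows).
imdb = pd.DataFrame({
    "Series_Title": ["Film A", "Film B"],
    "Released_Year": [1999, 2018],
    "IMDB_Rating": [8.8, 8.4],
    "Overview": ["...", "..."],
})
netflix = pd.DataFrame({
    "title": ["Film B", "Show C"],
    "release_year": [2018, 2020],
    "description": ["...", "..."],
})

# Drop columns that are not needed for the analysis.
imdb = imdb.drop(columns=["Overview"])
netflix = netflix.drop(columns=["description"])

# Rename the IMDB headers so both frames use the same column names.
imdb = imdb.rename(columns={"Series_Title": "title",
                            "Released_Year": "release_year"})

# Stack the two frames; columns that exist in only one frame
# (here IMDB_Rating) are filled with NaN for the other frame's rows.
combined = pd.concat([netflix, imdb], ignore_index=True)

print(combined)
```

Because `pd.concat` stacks rows rather than matching them, every Netflix row ends up with NaN in the IMDB-only columns, which is where the null values discussed next come from.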

Lastly, we wanted to change the null values in the columns 'IMDB_Rating', 'Meta_score', 'No_of_Votes' and 'Gross'. We had many null values because there are clearly not many shows and films on Netflix that are also in the top 1000 IMDB rated list. We changed these nulls to 'not currently in IMDB top 1000' to make it clear.

image
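One way this replacement might look in pandas, shown on invented rows. Casting to `object` first is our assumption for keeping the example free of dtype warnings; it also makes visible that the columns stop being numeric once the label is written into them.

```python
import numpy as np
import pandas as pd

# Toy combined frame (invented rows); the second row has no IMDB data.
combined = pd.DataFrame({
    "title": ["Film A", "Show C"],
    "IMDB_Rating": [8.8, np.nan],
    "Meta_score": [80.0, np.nan],
    "No_of_Votes": [1000.0, np.nan],
    "Gross": [1234567.0, np.nan],
})

imdb_cols = ["IMDB_Rating", "Meta_score", "No_of_Votes", "Gross"]
label = "not currently in IMDB top 1000"

# Work on a copy so the numeric original stays available.
labelled = combined.copy()

# Cast to object so numbers and the text label can share a column,
# then replace every null with the label.
labelled[imdb_cols] = labelled[imdb_cols].astype(object).fillna(label)

print(labelled)
```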

Some errors encountered whilst Transforming Data

We ran into one main issue when we tried to load the data to PostgreSQL, so we had to return to the transform stage and figure out what the issue was. We kept receiving an operational error related to 'PG'. The error prompted us to look at the row with the title 'Apollo 13'. Upon further examination we found that there was an error in the original CSV file: the 'certificate' value, 'PG', had been listed in the 'release_year' column. We rectified this by using .loc to find the exact row with the error.

image

We changed the value to 0 - we recognise that this is an anomaly.

image
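A small sketch of this kind of fix, using a two-row toy frame. The 'Film A' row is invented, and `title` and `release_year` are assumed column names after renaming.

```python
import pandas as pd

# Toy frame reproducing the problem: a certificate in the year column.
df = pd.DataFrame({
    "title": ["Apollo 13", "Film A"],
    "release_year": ["PG", "1999"],
})

# Flag rows whose release_year cannot be parsed as a number.
bad_year = pd.to_numeric(df["release_year"], errors="coerce").isna()
print(df.loc[bad_year])

# Use .loc to overwrite the faulty cell with 0 (a known anomaly).
df.loc[bad_year & (df["title"] == "Apollo 13"), "release_year"] = 0

# With the stray text gone, the column converts cleanly to integers.
df["release_year"] = pd.to_numeric(df["release_year"]).astype(int)

print(df)
```

Checking for unparseable years before loading would surface this kind of row without waiting for the database to reject it.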

Another major error we encountered whilst loading was commas in the 'Gross' column. To eliminate these we:

image
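Our exact code is in the screenshot. A common way to strip thousands separators in pandas, shown here on invented values and not necessarily identical to what we ran, is:

```python
import pandas as pd

# Toy 'Gross' column (invented values) stored as text with commas.
df = pd.DataFrame({"Gross": ["1,234,567", None, "89,000"]})

# Remove the commas as plain text (regex=False), then convert to numbers.
# errors="coerce" turns anything unparseable, including missing values, into NaN.
df["Gross"] = pd.to_numeric(
    df["Gross"].str.replace(",", "", regex=False),
    errors="coerce",
)

print(df)
```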

Lastly, our third major error occurred due to the change of null values in the columns 'IMDB_Rating', 'Meta_score', 'No_of_Votes' and 'Gross'. We needed to perform analysis using these columns, and the text label prevented numeric calculations, so we reverted to the NaN values, which allowed us to perform the analysis we required.

image
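To illustrate why reverting helps, here is a sketch on invented values. It uses `pd.to_numeric` with `errors="coerce"` as one possible way to turn the label back into NaN; this is an assumption, not necessarily the method in the screenshot.

```python
import pandas as pd

# Toy column mixing ratings with the text label (invented values).
labelled = pd.DataFrame({
    "IMDB_Rating": [8.0, "not currently in IMDB top 1000", 7.0],
})

# Coerce non-numeric entries (the label) back to NaN.
ratings = pd.to_numeric(labelled["IMDB_Rating"], errors="coerce")

# Numeric aggregations now work and skip NaN by default.
mean_rating = ratings.mean()

print(mean_rating)
```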

Load 📠

We chose to load our DataFrames into PostgreSQL. We chose a relational database rather than a non-relational database (e.g. MongoDB) because we wanted to load our data into a fixed data template and to visualise and manage the table easily. We also used a relatively small dataset (around 10,000 rows), which meant that PostgreSQL could comfortably handle our data and queries. We also wished to run queries on the data and view the results in tabular form.

image

image
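For readers who want to try the load step without a PostgreSQL server, the sketch below uses an in-memory SQLite database as a stand-in; it is not the PostgreSQL load we performed. The table name `netflix_imdb` is hypothetical and the rows are invented. With PostgreSQL, the same `to_sql` call would typically be given a SQLAlchemy engine instead of the SQLite connection.

```python
import sqlite3

import pandas as pd

# Toy cleaned frame (invented rows).
df = pd.DataFrame({
    "title": ["Film A", "Show C"],
    "release_year": [1999, 2020],
    "IMDB_Rating": [8.8, None],
})

# Stand-in database: in-memory SQLite instead of PostgreSQL.
# For PostgreSQL one would usually pass a SQLAlchemy engine here
# (connection details depend on the local setup).
conn = sqlite3.connect(":memory:")

# Write the frame to a table, replacing it if it already exists.
df.to_sql("netflix_imdb", conn, if_exists="replace", index=False)

# Query the table back and view the result in tabular form.
result = pd.read_sql_query(
    "SELECT title, release_year FROM netflix_imdb ORDER BY release_year",
    conn,
)
row_count = len(result)
conn.close()

print(result)
```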

Our data was now ready for analysis... A snapshot of which is below:

image

image

image

image

image

image

About

An ETL project: Netflix and IMDB score analysis.
