Data Analyst Nanodegree

Data Wrangling

Project: Wrangle and Analyze Data

Introduction

Real-world data rarely comes clean. Using Python and its libraries, I will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling.

The dataset that I will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent". WeRateDogs has over 4 million followers and has received international media coverage.

WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for us to use in this project. This archive contains basic tweet data (tweet ID, timestamp, text, etc.) for all 5000+ of their tweets as they stood on August 1, 2017. More on this soon.

What Software Do I Need?

The following packages (i.e. libraries) need to be installed.

pandas
numpy
requests
tweepy
json (ships with Python's standard library, so no separate install is needed)

The Data

Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. It does contain each tweet's text, though, which I used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, or puppo) to make this Twitter archive "enhanced."
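As an illustration of what extracting a rating and a dog stage from tweet text can look like, here is a minimal sketch using regular expressions. The pattern, the function names, and the sample tweet are assumptions made for this example; they are not necessarily how the enhanced archive was actually built.

```python
import re

# Assumed pattern: a numerator (optionally decimal), a slash, an integer denominator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")
STAGES = ("doggo", "floofer", "pupper", "puppo")


def extract_rating(text):
    """Return (numerator, denominator) for the first rating-like match, or None."""
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


def extract_stages(text):
    """Return every dog stage mentioned in the text, in the order of STAGES."""
    lowered = text.lower()
    return [stage for stage in STAGES if re.search(rf"\b{stage}\b", lowered)]


# Made-up tweet text, for illustration only.
tweet = "This is Bella. She's a very happy pupper. 13/10 would pet"
print(extract_rating(tweet))   # (13.0, 10)
print(extract_stages(tweet))   # ['pupper']
```

Taking the first match is a simplification: text that contains another fraction before the rating (a date, for example) would be misread, which is the kind of quality issue the assessment step is meant to catch.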

Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" can reach only the 3000 most recent tweets that way. But because we have the WeRateDogs Twitter archive, and specifically the tweet IDs within it, we can gather this data for all 5000+ tweets. So we're going to query Twitter's API with each tweet ID to gather this valuable data.
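The gathering step can be sketched roughly as follows: look up each tweet ID, store the returned JSON one object per line, then read back only the two missing counts. To keep the sketch runnable without API credentials, the lookup is passed in as a callable; `fetch_status`, `save_tweet_json`, `load_counts`, and the fake data are all hypothetical names introduced here, and the file name is only illustrative.

```python
import json
import os
import tempfile


def save_tweet_json(tweet_ids, fetch_status, path):
    """Write one JSON object per line for each tweet that can be fetched.

    fetch_status is a hypothetical callable: tweet ID -> dict of tweet data.
    Returns the IDs that could not be fetched (e.g. deleted tweets).
    """
    failed = []
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id in tweet_ids:
            try:
                status = fetch_status(tweet_id)
            except Exception:  # a real client raises its own error types
                failed.append(tweet_id)
                continue
            out.write(json.dumps(status) + "\n")
    return failed


def load_counts(path):
    """Read the file back, keeping only the ID and the two missing counts."""
    rows = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            status = json.loads(line)
            rows.append({
                "tweet_id": status["id"],
                "retweet_count": status["retweet_count"],
                "favorite_count": status["favorite_count"],
            })
    return rows


# Stand-in for a real API call, so the sketch runs without credentials.
FAKE_API = {1: {"id": 1, "retweet_count": 5, "favorite_count": 20}}


def fake_fetch(tweet_id):
    return FAKE_API[tweet_id]  # raises KeyError for unknown IDs


path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
missing = save_tweet_json([1, 2], fake_fetch, path)
print(missing)            # [2]
print(load_counts(path))  # [{'tweet_id': 1, 'retweet_count': 5, 'favorite_count': 20}]
```

With tweepy, `fetch_status` could plausibly wrap something like `api.get_status(tweet_id, tweet_mode="extended")._json`, assuming an authenticated `API` object and a tweepy version that exposes `get_status`. Recording the failed IDs matters because some archived tweets may no longer exist.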

Image Predictions File

One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
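To show how such a table might be used, the sketch below picks, for each tweet, the highest-ranked prediction that is actually a dog breed. The column names (`p1`, `p1_conf`, `p1_dog`, and so on) and the sample rows are assumptions made for this example, not values taken from the real file.

```python
import csv
import io

# Tiny made-up sample; the column names are assumptions about the file layout.
SAMPLE_TSV = (
    "tweet_id\timg_num\tp1\tp1_conf\tp1_dog\tp2\tp2_conf\tp2_dog\tp3\tp3_conf\tp3_dog\n"
    "101\t1\tgolden_retriever\t0.81\tTrue\tLabrador_retriever\t0.10\tTrue\ttennis_ball\t0.02\tFalse\n"
    "102\t2\tpaper_towel\t0.55\tFalse\tpug\t0.30\tTrue\tbath_towel\t0.05\tFalse\n"
    "103\t1\tseat_belt\t0.40\tFalse\tminibus\t0.20\tFalse\tcar_mirror\t0.10\tFalse\n"
)


def best_dog_prediction(row):
    """Return (breed, confidence) for the highest-ranked dog prediction, or None."""
    for rank in (1, 2, 3):
        if row[f"p{rank}_dog"] == "True":
            return row[f"p{rank}"], float(row[f"p{rank}_conf"])
    return None


reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
breeds = {row["tweet_id"]: best_dog_prediction(row) for row in reader}
print(breeds)
# {'101': ('golden_retriever', 0.81), '102': ('pug', 0.3), '103': None}
```

A `None` result marks a tweet whose top three predictions contain no dog at all, which is one hint that the tweet may not be a dog rating.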

Key Points

Key points to keep in mind when data wrangling for this project:

We only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
Fully assessing and cleaning the entire dataset requires exceptional effort, so only a subset of its issues (at minimum eight (8) quality issues and two (2) tidiness issues) needs to be assessed and cleaned.
Cleaning includes merging individual pieces of data according to the rules of tidy data.
The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
We do not need to gather the tweets beyond August 1st, 2017. We can, but note that we won't be able to gather the image predictions for these tweets since we don't have access to the algorithm used.
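The first and third points above (keep only original ratings with images, then merge the pieces) can be sketched with plain dictionaries. The `retweeted_status_id` field, the function names, and the sample rows are assumptions made for illustration.

```python
def keep_original_ratings(archive_rows, image_ids):
    """Drop retweets and tweets without an image prediction."""
    return [
        row for row in archive_rows
        if not row.get("retweeted_status_id")  # empty/None means not a retweet
        and row["tweet_id"] in image_ids
    ]


def merge_by_tweet_id(archive_rows, count_rows):
    """Inner-join two lists of dicts on tweet_id."""
    counts = {row["tweet_id"]: row for row in count_rows}
    return [
        {**row, **counts[row["tweet_id"]]}
        for row in archive_rows
        if row["tweet_id"] in counts
    ]


# Made-up rows: tweet 2 is a retweet, tweet 3 has no image prediction.
archive = [
    {"tweet_id": 1, "text": "13/10 good dog", "retweeted_status_id": None},
    {"tweet_id": 2, "text": "RT @dog_rates: 12/10", "retweeted_status_id": 1},
    {"tweet_id": 3, "text": "11/10 no picture", "retweeted_status_id": None},
]
counts = [{"tweet_id": 1, "retweet_count": 5, "favorite_count": 20}]
image_ids = {1, 2}

originals = keep_original_ratings(archive, image_ids)
master = merge_by_tweet_id(originals, counts)
print([row["tweet_id"] for row in master])  # [1]
```

In pandas, the same idea would typically be expressed with a boolean mask built from `isna()` and a `merge(..., on="tweet_id", how="inner")`; the dictionary version is used here only to keep the example free of third-party dependencies.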

Project Details

We will perform the following tasks in this project:

Data wrangling, which consists of:

Gathering data
Assessing data
Cleaning data

Storing, analyzing, and visualizing our wrangled data

Reporting on 1) data wrangling efforts and 2) data analyses and visualizations
