Hold-Krykke/PythonExam

4th Semester Python Exam

Made by:

  • Cph-mh748 - Malte Hviid-Magnussen
  • Cph-rn118 - Rúni Vedel Niclasen
  • Cph-ab363 - Asger Bjarup
  • Cph-cs340 - Camilla Staunstrup

Purpose of the program

We would like to delve deeper into text analysis and web scraping.

We scrape data from Twitter, based on hashtag searches, and use different techniques to clean, analyze and present the data.

Example hashtags whose tweets could be used for sentiment analysis:

  • #Trump, #Trump2020
  • #Biden, #Biden2020
  • #Election2020

Technologies

  • Web scraping of Twitter, based on hashtags
    • Technologies:
      • Web scraping with BeautifulSoup4
      • Cleaning data with the emoji package.
      • File handling with os, Path modules.
  • Preprocessing of Twitter data (clean-up, removing stop words)
    • Technologies:
      • Regex
      • Natural Language Toolkit (NLTK)
  • Sentiment analysis
    • Technologies:
      • Natural Language Toolkit (NLTK)
  • Presentation (graphs/plots)
    • Technologies:
      • matplotlib
      • pandas
      • File handling with the Path module.
  • Availability (To the user)
    • Technologies:
      • Flask
      • Argparse for the CLI

Things that we didn't implement but would have liked to:

  • Other types of text/topic analysis
  • More technologies, such as sklearn
  • Utilize Twitter's advanced search functions, such as sorting by popularity, with/without pictures, etc.

Using the program

  1. Clone the repo and follow the instructions in setup.ipynb

Note: Not all plots work with all data. A few cases might result in bad output.

Using the program with Flask

Starting the server

  • Open a terminal in the root directory
  • cd into the modules folder
  • Run flask_service.py with Python
  • Wait until the terminal says Running on http://127.0.0.1:5000/ (Press CTRL+C to quit). This takes a while (~40 seconds), since the machine learning model is trained every time the server is started

Using the endpoint

The server exposes a single endpoint, /api/sentiment, which receives all requests. We have not deployed the server, so test it locally at http://localhost:5000/api/sentiment. There is no UI, so every request has to be made with Postman or a similar tool (the examples here are from Postman).

  • All requests must use the HTTP method POST
  • A request without any search options results in a 400 response whose body contains an example of what to provide in the body of your request
  • In Postman, click Preview and copy everything after Example: . Paste it into the body of your request
  • All search options must be provided in JSON format in the request body
  • You can click Beautify in Postman to format the JSON
  • An example of JSON in the request body: { "hashtags": [ "trump", "biden" ], "start_date": "2020-5-17", "end_date": "2020-5-22", "plot_type": "line", "remove_sentiment": "Uncertain", "tweet_count": 300, "fresh_search": true }
  • The JSON above results in a line plot in which the y-axis shows the number of tweets and the x-axis shows the date
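
If you would rather script the request than use Postman, the sketch below builds the same POST request with Python's standard library. The helper names `build_request` and `send` are hypothetical, and the format of the response body is not documented here, so `send` simply returns the raw bytes.

```python
import json
import urllib.request

# Local address of the endpoint described above.
API_URL = "http://localhost:5000/api/sentiment"


def build_request(options, url=API_URL):
    """Build a POST request that carries the search options as a JSON body."""
    body = json.dumps(options).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request; the Flask server must already be running locally."""
    # urlopen raises urllib.error.HTTPError for error responses such as 400.
    with urllib.request.urlopen(request) as response:
        return response.status, response.read()


options = {
    "hashtags": ["trump", "biden"],
    "start_date": "2020-5-17",
    "end_date": "2020-5-22",
    "plot_type": "line",
    "remove_sentiment": "Uncertain",
    "tweet_count": 300,
    "fresh_search": True,
}
request = build_request(options)
# status, raw_body = send(request)  # uncomment once the server is running
```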

Explanation of search options

Data gathering

  • Hashtags - the hashtags you want to search for on Twitter
    • Example: "hashtags": [ "trump", "biden" ]
    • Must be an array of strings under the key hashtags

Data filtering

  • Start date - the start date for the period of time you want tweets from
    • Example: "start_date": "2020-5-17"
    • Choosing a start date that is 5 - 10 days before the end date will give the prettiest plot
    • Choosing a start date at a later point in time than the end date will result in no data which means the plot can't be created
  • End date - the end date of the period of time you want tweets from
    • Example: "end_date": "2020-5-22"
    • We recommend choosing the current date as end date so you can get the latest tweets
    • Choosing an end date that is in the future won't give any future predictions or results
  • Plot type - the type of plot you want
    • Example: "plot_type": "line"
    • There are 3 types of plots: bar, line and pie
    • We recommend using the line plot (the other types may not work)
  • Removing sentiment - remove either Positive tweets or Negative tweets or the ones with a mixed sentiment (Uncertain)
    • Example: "remove_sentiment": "Uncertain"
    • All three values must be spelled with the first letter in upper case
  • Tweet amount - the amount of tweets you want to scrape from Twitter
    • Example: "tweet_amount": 300
    • The higher the tweet count is, the further back in time you can go, since the web scraper scrapes tweets in the same order as they are shown on Twitter (which is roughly chronological)
    • Default is 300
  • Fresh search - whether or not you want to get new tweets or tweets from previous searches (if available)
    • Example: "fresh_search": true
    • Default is false
    • A fresh search of 300 tweets takes ~10 seconds
  • Search for mentions or hashtags
    • Example: "search_for": { "mentions": "@JoeBiden" }
    • Example: "search_for": { "hashtags": "#trump" }
    • Requires an object with a single attribute whose key is either mentions or hashtags. The value must match the key: if the key is mentions, the value must begin with @
    • We recommend not using this filter (especially the mentions option), since in most cases it filters away all the data, resulting in an empty plot or no plot at all
  • Get statistics - Use this if you want some statistics about the data instead of a plot with an analysis
    • Example: "get_stats": "hashtags"
    • There are two options: "hashtags" and "mentions"
    • You can use this option to look through the list of hashtags or mentions in the gathered tweets. If you find out that, for example, @realDonaldTrump has been mentioned ten times, you can find the sentiment of those tweets with a new search using these options: { "hashtags": [ "trump", "biden" ], "start_date": "2020-5-17", "end_date": "2020-5-22", "plot_type": "line", "search_for": { "mentions": "@realDonaldTrump" }, "tweet_count": 300, "fresh_search": false }
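
The rules above can be checked on the client before a request is sent. The following is a minimal sketch under the assumption that the rules are exactly as listed; `check_options` and `parse_date` are hypothetical helpers and do not mirror the server's own validation.

```python
from datetime import date

PLOT_TYPES = ("bar", "line", "pie")
SENTIMENTS = ("Positive", "Negative", "Uncertain")
STATS_OPTIONS = ("hashtags", "mentions")
SEARCH_PREFIXES = {"mentions": "@", "hashtags": "#"}


def parse_date(text):
    """Parse 'YYYY-M-D' (zero padding optional), e.g. '2020-5-17'."""
    year, month, day = (int(part) for part in text.split("-"))
    return date(year, month, day)


def check_options(options):
    """Return a list of problems found in the search options (empty if none)."""
    problems = []

    hashtags = options.get("hashtags")
    if not isinstance(hashtags, list) or not all(isinstance(h, str) for h in hashtags):
        problems.append("hashtags must be an array of strings")

    if "plot_type" in options and options["plot_type"] not in PLOT_TYPES:
        problems.append("plot_type must be bar, line or pie")

    if "remove_sentiment" in options and options["remove_sentiment"] not in SENTIMENTS:
        problems.append("remove_sentiment must be Positive, Negative or Uncertain")

    if "get_stats" in options and options["get_stats"] not in STATS_OPTIONS:
        problems.append("get_stats must be hashtags or mentions")

    if "start_date" in options and "end_date" in options:
        try:
            if parse_date(options["start_date"]) > parse_date(options["end_date"]):
                problems.append("start_date is later than end_date, so no tweets will match")
        except (ValueError, AttributeError):
            problems.append("dates must look like 2020-5-17")

    search_for = options.get("search_for")
    if search_for is not None:
        if not isinstance(search_for, dict) or len(search_for) != 1:
            problems.append("search_for must be an object with a single attribute")
        else:
            (key, value), = search_for.items()
            prefix = SEARCH_PREFIXES.get(key)
            if prefix is None:
                problems.append("search_for key must be mentions or hashtags")
            elif not str(value).startswith(prefix):
                problems.append(f"search_for value for {key} must begin with {prefix}")

    return problems
```

An empty list means the options follow the rules described in this section; it does not guarantee that the search returns any tweets.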

Overall Recommendation

  • Choose an end date equal to the current date
  • Choose a start date ~ten days before end date
  • Search for "trump" and "biden"
  • Remove the "Uncertain" sentiment
  • Choose "line" as plot type
  • Choose a tweet amount of 300

JSON: { "hashtags": [ "trump", "biden" ], "start_date": "2020-5-12", "remove_sentiment": "Uncertain", "end_date": "2020-5-22", "plot_type": "line", "tweet_amount": 300 }
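
Since the recommendation depends on the current date, the options can be generated instead of typed by hand. This is a small sketch; `recommended_options` and `format_date` are hypothetical helper names, and the dates are written without zero padding to match the examples in this README.

```python
from datetime import date, timedelta
from typing import Optional


def format_date(day):
    """Format a date like the examples in this README, e.g. 2020-5-12."""
    return f"{day.year}-{day.month}-{day.day}"


def recommended_options(today: Optional[date] = None) -> dict:
    """Build the recommended search options, ending today and starting ten days earlier."""
    end = today or date.today()
    start = end - timedelta(days=10)
    return {
        "hashtags": ["trump", "biden"],
        "start_date": format_date(start),
        "remove_sentiment": "Uncertain",
        "end_date": format_date(end),
        "plot_type": "line",
        "tweet_amount": 300,
    }


# Passing a fixed date reproduces the JSON shown above.
print(recommended_options(date(2020, 5, 22)))
```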


Using the program with CLI

  1. In the root folder, run python app.py -h to print the help output.

Default values

All the optional arguments have default values.
The program can run using all default values by simply passing the hashtags you want to gather info from.

Examples

Utilizing default values to search for the hashtags #trump and #biden:
python app.py trump biden
This would run the program using the following values:

{'certainty_high': 0.75,
 'certainty_low': 0.25,
 'date': [datetime.date(2020, 5, 22),
          datetime.date(2020, 5, 27)],
 'fresh_search': False,
 'hashtags': ['trump', 'biden'],
 'plot_type': 'pie',
 'remove_sentiment': None,
 'save_plot': False,
 'search_hashtags': None,
 'search_mentions': None,
 'search_urls': None,
 'tweet_count': 300}

By default, the date range runs from the current day to 5 days later, as in the values above.
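
The README does not spell out how certainty_low and certainty_high are used. One plausible reading, given the Positive, Negative and Uncertain labels, is that they are cut-offs on the classifier's probability that a tweet is positive. The sketch below only illustrates that assumption; `label_sentiment` is a hypothetical function, not part of the project.

```python
def label_sentiment(positive_probability, certainty_low=0.25, certainty_high=0.75):
    """Map a probability of 'positive' to a label using two cut-offs (assumed behaviour)."""
    if positive_probability >= certainty_high:
        return "Positive"
    if positive_probability <= certainty_low:
        return "Negative"
    # Everything between the two cut-offs counts as mixed sentiment.
    return "Uncertain"


for probability in (0.9, 0.5, 0.1):
    print(probability, label_sentiment(probability))
```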

Changing plot type and filtering on dates (hashtags omitted for brevity)
python app.py -p bar -d 2020-06-01 2020-06-02 or
python app.py --plot bar --date 2020-06-01 2020-06-02

Search for a specific amount of tweets (1000) and save the generated plots locally (hashtags omitted for brevity)
python app.py -s -c 1000 or
python app.py --save --count 1000
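
For readers curious how such a CLI can be wired up, here is a minimal argparse sketch that accepts the flags shown in the examples above. It is a reconstruction, not the contents of app.py: only the flags that appear in this README are included, the destination names are taken from the printed default values, and the help texts are assumptions.

```python
import argparse
from datetime import date


def build_parser():
    """Build a parser that accepts the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="app.py")
    parser.add_argument("hashtags", nargs="+",
                        help="hashtags to gather tweets for, without the # sign")
    parser.add_argument("-p", "--plot", dest="plot_type",
                        choices=["bar", "line", "pie"], default="pie",
                        help="type of plot to create")
    parser.add_argument("-d", "--date", dest="date", nargs=2,
                        type=date.fromisoformat, metavar=("START", "END"),
                        help="start and end date, e.g. 2020-06-01 2020-06-02")
    parser.add_argument("-s", "--save", dest="save_plot", action="store_true",
                        help="save the generated plots locally")
    parser.add_argument("-c", "--count", dest="tweet_count", type=int, default=300,
                        help="number of tweets to scrape")
    return parser


args = build_parser().parse_args(["trump", "biden", "-p", "bar", "-c", "1000"])
print(vars(args))
```

In this sketch the date default is left as None; computing a default range from the current day would be done after parsing.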
