harshkavdikar1/Tweet-Analysis-With-Kafka-and-Spark

Tweet Analysis using Kafka and Spark Streaming

A real-time analytics dashboard that visualizes the trending hashtags and @mentions at a given location, using Twitter's real-time streaming API as the data source.

Installation Guide

Download and install Kafka, Spark, Python, and npm.

  1. You can refer to the following guide to install Kafka:
     https://towardsdatascience.com/running-zookeeper-kafka-on-windows-10-14fc70dcc771

  2. Spark can be downloaded from the following link:
     https://spark.apache.org/downloads.html


How to run the code.

  • Create a Kafka topic.
  • Update the conf file with your secret key and access tokens.
  • Install Python dependencies.

    pip install -r requirements.txt

  • Install Node.js dependencies.

    npm install

  • Start ZooKeeper. Open cmd and execute:

    zkserver

  • Start Kafka. Go to the Kafka installation directory, ..\kafka_2.11-2.3.1\bin\windows, open cmd there, and execute the following command:

    kafka-server-start.bat C:\ProgramData\Java\kafka_2.11-2.3.1\config\server.properties

  • Run the Python file to fetch tweets.

    python fetch_tweets.py

  • Run the Python file to analyze tweets.

    python analyze_tweets.py

  • Start the npm server.

    npm start
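
Because the Python scripts depend on ZooKeeper and Kafka already being up, a quick reachability check can save some debugging. The following is a minimal sketch, not part of this repository: it assumes both services run on localhost with their default ports (2181 for ZooKeeper, 9092 for Kafka), and the names `port_open`, `check_services`, and `SERVICES` are hypothetical helpers introduced only for illustration.

```python
import socket

# Assumption: default ports on localhost; adjust if your setup differs.
SERVICES = {
    "ZooKeeper": ("localhost", 2181),
    "Kafka": ("localhost", 9092),
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, (host, port) in services.items()}


for name, reachable in check_services().items():
    print(f"{name}: {'reachable' if reachable else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm that the topic exists or that the broker is healthy.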
    

Technology stack

[Image: technology stack]


| Area | Technology |
|---|---|
| Front-End | HTML5, Bootstrap, CSS3, Socket.IO, highcharts.js |
| Back-End | Express, Node.js |
| Cluster Computing Framework | Apache Spark (Python) |
| Message Broker | Apache Kafka |

Architecture


[Image: architecture]


How it works

  1. Extract data from Twitter's streaming API and put it into a Kafka topic.
  2. Spark listens to this topic, reads the data from it, analyzes it using Spark Streaming, and puts the top 10 trending hashtags and @mentions into another Kafka topic.
  3. Spark Streaming creates a DStream whenever it reads data from Kafka and analyzes it by performing operations such as map, filter, updateStateByKey, countByValue, and foreachRDD on the RDDs; the top 10 hashtags and mentions are then obtained from the RDD using Spark SQL.
  4. Node.js picks up this data from the Kafka topic on the server side and emits it to the socket.
  5. The socket pushes the data to the user's dashboard, which is rendered using highcharts.js in real time.
  6. The dashboard is refreshed every 60 seconds.
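
The counting logic in steps 2 and 3 can be illustrated without a cluster. The following is a minimal pure-Python sketch, not the code in analyze_tweets.py: it mimics what the per-batch count (the role of countByValue) and the running totals (the role of updateStateByKey) compute, and then selects the top 10. The regular expression and the function names `extract_tags`, `count_batch`, `update_state`, and `top_n` are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a tag is '#' or '@' followed by word characters.
# This is deliberately simple and would also match parts of email addresses.
TAG_PATTERN = re.compile(r"[#@]\w+")


def extract_tags(text):
    """Return the lowercase hashtags and @mentions found in one tweet's text."""
    return [tag.lower() for tag in TAG_PATTERN.findall(text)]


def count_batch(tweets):
    """Count tags within one micro-batch of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(extract_tags(text))
    return counts


def update_state(running, batch_counts):
    """Merge one batch's counts into the running totals, returning a new Counter."""
    merged = Counter(running)
    merged.update(batch_counts)
    return merged


def top_n(counts, prefix, n=10):
    """Return up to n (tag, count) pairs for tags starting with prefix."""
    filtered = Counter({t: c for t, c in counts.items() if t.startswith(prefix)})
    return filtered.most_common(n)


# Two made-up micro-batches standing in for data read from the Kafka topic.
batches = [
    ["Loving #Spark and #Kafka with @alice", "#spark streaming rocks"],
    ["@alice ships #kafka", "hello @bob"],
]

state = Counter()
for batch in batches:
    state = update_state(state, count_batch(batch))

print("Top hashtags:", top_n(state, "#"))
print("Top mentions:", top_n(state, "@"))
```

In the real pipeline these steps would run as DStream transformations, with the resulting top 10 lists published to the second Kafka topic for Node.js to consume.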


[Image: hashtags]

[Image: mentions]
