Use the Tweepy Python client to stream tweets to Kafka.
To install Tweepy:
pip install tweepy
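As a minimal sketch of the streaming side, assuming Tweepy 3.x's StreamListener API (the credential names and the forward_to_kafka helper are placeholders; the actual producer code is sketched further below):

import tweepy

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Hand each incoming tweet to the Kafka producer (sketched below).
        forward_to_kafka(status)

    def on_error(self, status_code):
        # Returning False on rate limiting (HTTP 420) disconnects the stream.
        return status_code != 420

# Placeholder Twitter API credentials.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=['#example'], languages=['en'])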
We use the Avro data format for communication between Kafka and Spark Streaming.
The Avro schema:
{
  "name": "TwitterEvent",
  "type": "record",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "source", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "lang", "type": "string"}
  ]
}
To produce tweets to Kafka, use the Confluent Python client:
pip install confluent-kafka
pip install avro # for encoding data to Avro
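As a sketch with those two libraries, a producer that Avro-encodes each tweet against the schema above and publishes it to the 'twitter' topic (the schema file name and broker address are assumptions):

import io
import avro.schema
from avro.io import DatumWriter, BinaryEncoder
from confluent_kafka import Producer

# Load the TwitterEvent schema shown above (file name is an assumption).
schema = avro.schema.parse(open('twitter_event.avsc').read())
writer = DatumWriter(schema)
producer = Producer({'bootstrap.servers': 'localhost:9092'})

def send_tweet(tweet):
    # Serialize one tweet dict to Avro bytes and publish it.
    buf = io.BytesIO()
    writer.write({'id': tweet['id'], 'source': tweet['source'],
                  'text': tweet['text'], 'lang': tweet['lang']},
                 BinaryEncoder(buf))
    producer.produce('twitter', value=buf.getvalue())
    producer.flush()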
Install the pyspark library:
pip install pyspark
Code snippet to start Spark Streaming:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

batch_duration = 10  # seconds
topic = 'twitter'  # Kafka topic
sc = SparkContext("local[2]", "SentimentAnalysisWithSpark")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, batch_duration)
# Connect via ZooKeeper and consume one partition of the topic;
# decode_tweet deserializes the Avro payload (see the note below).
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {topic: 1}, valueDecoder=decode_tweet)
...
ssc.start()
ssc.awaitTermination()
Note: for KafkaUtils to understand Avro, we need to pass a custom valueDecoder to the createStream method.
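A minimal decoder sketch, reading the raw message bytes back with the same schema file as the producer (some avro package versions spell the parse function avro.schema.Parse):

import io
import avro.schema
from avro.io import DatumReader, BinaryDecoder

schema = avro.schema.parse(open('twitter_event.avsc').read())
reader = DatumReader(schema)

def decode_tweet(raw_bytes):
    # Kafka hands the message value over as raw bytes; read them back
    # into a dict that matches the TwitterEvent schema.
    return reader.read(BinaryDecoder(io.BytesIO(raw_bytes)))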
To run the code:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 python_file_path
This command is for Kafka v0.8 and Spark Streaming v2.3.0 on Scala 2.11. If you use different versions, change the package version accordingly.
HBase is used to store tweets.
We use three tables in the project (created from the HBase shell):
Table 'tweets' for storing tweets:
create 'tweets', 'tweet'
Table 'neg_counter' for storing negative word frequencies:
create 'neg_counter', 'info'
Table 'pos_counter' for storing positive word frequencies:
create 'pos_counter', 'info'
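The write path into these tables is not shown here; as one possible sketch, the happybase client (an assumption, not a stated project dependency) can put a decoded tweet into the 'tweets' table via the HBase Thrift server:

import happybase

# Thrift server address and port are assumptions.
connection = happybase.Connection('localhost', port=9090)
table = connection.table('tweets')

def store_tweet(tweet):
    # Row key is the tweet id; columns live in the 'tweet' column family.
    table.put(tweet['id'], {
        'tweet:text': tweet['text'],
        'tweet:source': tweet['source'],
        'tweet:lang': tweet['lang'],
    })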
Hive is used to integrate Spark SQL and HBase.
We use three external tables, one linked to each of the HBase tables above:
CREATE EXTERNAL TABLE tweets(key string, text string, cleaned_text string, source string, lang string, target string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,tweet:text,tweet:cleaned_text,tweet:source,tweet:lang,tweet:target")
TBLPROPERTIES ("hbase.table.name" = "tweets");
CREATE EXTERNAL TABLE neg_counter(key string, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:counter#b")
TBLPROPERTIES ("hbase.table.name" = "neg_counter");
CREATE EXTERNAL TABLE pos_counter(key string, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:counter#b")
TBLPROPERTIES ("hbase.table.name" = "pos_counter");
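Once the external tables exist, the mapping can be sanity-checked from the Hive shell before wiring up Spark SQL, for example:

SELECT * FROM tweets LIMIT 10;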
To connect Spark SQL with HBase (through the Hive metastore):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sentiment Query") \
.config("hive.metastore.uris", "thrift://localhost:9083") \
.enableHiveSupport().getOrCreate()
And then query the data:
query = spark.sql("SELECT target, count(target) AS count FROM tweets GROUP BY target")
# Visualize the class distribution as a pie chart
import matplotlib.pyplot as plt
df = query.toPandas()
df.plot(kind='pie', y='count', labels=df['target'], autopct='%1.1f%%', startangle=90, legend=False)
plt.show()