tienpham/sentiment_analysis

Twitter & Kafka

Use the Tweepy Python client to stream tweets into Kafka.

To install Tweepy:

pip install tweepy
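In a Tweepy stream listener, the on_status callback receives a full tweet object; before producing to Kafka it has to be reduced to the four fields of the Avro schema below. A minimal sketch of that mapping (the listener class and API credentials are omitted; status is mocked here with a plain object, and the attribute names follow the standard Tweet payload):

```python
from types import SimpleNamespace

def status_to_record(status):
    """Reduce a tweet object to the four fields of the TwitterEvent schema."""
    return {
        "id": status.id_str,
        "source": status.source,
        "text": status.text,
        "lang": status.lang,
    }

# Mock tweet standing in for a tweepy Status object
tweet = SimpleNamespace(id_str="123", source="Twitter Web App",
                        text="hello kafka", lang="en")
print(status_to_record(tweet))
```

The same dict can then be Avro-encoded and handed to the Kafka producer.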

Avro is used as the data format for communication between Kafka and Spark Streaming.

Avro schema:

{
    "name": "TwitterEvent",
    "type": "record",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "source", "type": "string"},
        {"name": "text", "type": "string"},
        {"name": "lang", "type": "string"}
    ]
}
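The avro library handles serialization, but for this flat schema the Avro binary layout is simple enough to illustrate by hand: fields are written in schema order with no names or markers in the payload, and each string is a zigzag-varint length followed by its UTF-8 bytes. A stdlib-only sketch of that wire format (an illustration, not a replacement for the library):

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a non-negative int as an Avro zigzag varint."""
    z = n << 1  # zigzag of a non-negative value
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

# Order matters: Avro writes fields in schema order, with no field names
FIELDS = ("id", "source", "text", "lang")

def encode_tweet(record: dict) -> bytes:
    """Binary-encode a TwitterEvent record: each string is length + UTF-8 bytes."""
    out = bytearray()
    for field in FIELDS:
        data = record[field].encode("utf-8")
        out += zigzag_varint(len(data)) + data
    return bytes(out)

print(encode_tweet({"id": "1", "source": "a", "text": "hi", "lang": "en"}))
# → b'\x021\x02a\x04hi\x04en'
```

Because the payload carries no schema, reader and writer must agree on the schema out of band; that is why the same schema reappears on the Spark Streaming side.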

To produce tweets to Kafka, use the Confluent Python client:

pip install confluent-kafka
pip install avro # for encoding data to Avro

Spark Streaming

Install pyspark library:

pip install pyspark

Code snippet to start Spark Streaming:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

batch_duration = 10  # seconds
topic = 'twitter'  # Kafka topic

sc = SparkContext("local[2]", "SentimentAnalysisWithSpark")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, batch_duration)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming',
                                      {topic: 1}, valueDecoder=decode_tweet)
...
ssc.start()
ssc.awaitTermination()

Note: for KafkaUtils to understand Avro, we need to pass a valueDecoder to the createStream method; by default the value is decoded as UTF-8 text.
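The decode_tweet function is not shown in the snippet above. In practice it would wrap avro.io.DatumReader with the TwitterEvent schema; assuming the producer writes the four string fields in schema order, each as a zigzag-varint length plus UTF-8 bytes (the standard Avro binary layout for this schema), a stdlib-only sketch of such a decoder looks like this:

```python
def decode_tweet(raw: bytes) -> dict:
    """valueDecoder for KafkaUtils.createStream: Avro binary -> dict.

    Assumes the TwitterEvent schema: four string fields written in order,
    each as a zigzag-varint length followed by UTF-8 bytes.
    """
    pos = 0

    def read_varint():
        nonlocal pos
        shift, value = 0, 0
        while True:
            b = raw[pos]
            pos += 1
            value |= (b & 0x7F) << shift
            if not b & 0x80:
                break
            shift += 7
        return value >> 1  # undo zigzag (lengths are non-negative)

    record = {}
    for field in ("id", "source", "text", "lang"):
        length = read_varint()
        record[field] = raw[pos:pos + length].decode("utf-8")
        pos += length
    return record

print(decode_tweet(b"\x021\x02a\x04hi\x04en"))
# → {'id': '1', 'source': 'a', 'text': 'hi', 'lang': 'en'}
```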

To run the code:

spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 python_file_path

This command is for Kafka v0.8 and Spark Streaming v2.3.0 on Scala 2.11. If you use different versions, change the package version accordingly.

HBase

HBase is used to store tweets.

We use three tables in the project, created in the HBase shell:

Table 'tweets' for storing tweets:

create 'tweets', 'tweet'

Table 'neg_counter' for storing negative word frequencies:

create 'neg_counter', 'info'

Table 'pos_counter' for storing positive word frequencies:

create 'pos_counter', 'info'
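The two counter tables use one row per word, with the running count in the info:counter column (stored as a binary bigint, matching the info:counter#b mapping in the Hive tables below). As an illustration of that data model, a batch of word frequencies from a streaming batch can be mapped to per-row increment operations; the counter_increments helper here is hypothetical, and the HBase client call itself is omitted:

```python
def counter_increments(word_counts, sentiment):
    """Map a batch of word frequencies to HBase increment operations.

    word_counts: dict of word -> frequency in this batch
    sentiment:   'pos' or 'neg', selecting pos_counter or neg_counter
    Returns (table, row_key, column, delta) tuples.
    """
    table = {"pos": "pos_counter", "neg": "neg_counter"}[sentiment]
    return [(table, word, "info:counter", count)
            for word, count in sorted(word_counts.items())]

ops = counter_increments({"great": 3, "awesome": 1}, "pos")
print(ops)
```

Each tuple would translate to one atomic increment on the corresponding row, so counts accumulate across batches without read-modify-write races.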

Hive

Hive is used to integrate Spark SQL and HBase.

We use three external tables, one linked to each HBase table:

CREATE EXTERNAL TABLE tweets(key string, text string, cleaned_text string, source string, lang string, target string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,tweet:text,tweet:cleaned_text,tweet:source,tweet:lang,tweet:target")
TBLPROPERTIES ("hbase.table.name" = "tweets");

CREATE EXTERNAL TABLE neg_counter(key string, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:counter#b")
TBLPROPERTIES ("hbase.table.name" = "neg_counter");

CREATE EXTERNAL TABLE pos_counter(key string, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:counter#b")
TBLPROPERTIES ("hbase.table.name" = "pos_counter");

Spark SQL

To connect Spark SQL to HBase through the Hive metastore:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sentiment Query") \
    .config("hive.metastore.uris", "thrift://localhost:9083") \
    .enableHiveSupport().getOrCreate()

Then query the data:

query = spark.sql("SELECT target, count(target) AS count FROM tweets GROUP BY target")

# visualize data
df = query.toPandas()
df.plot(kind='pie', labels=df['target'], y='count', autopct='%1.1f%%', startangle=90, legend=False)
