Use the Tweepy Python client to stream tweets to Kafka.
To install Tweepy:
pip install tweepy
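As a minimal sketch of the streaming side, assuming Tweepy 3.x's StreamListener API (the credential names and the forward_to_kafka helper are placeholders; the actual producer code is sketched further below):

import tweepy

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Hand each incoming tweet to the Kafka producer (sketched below).
        forward_to_kafka(status)

    def on_error(self, status_code):
        # Returning False on rate limiting (HTTP 420) disconnects the stream.
        return status_code != 420

# Placeholder Twitter API credentials.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=['#example'], languages=['en'])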
We use the Avro data format for communication between Kafka and Spark Streaming.
The Avro schema:
{
  "name": "TwitterEvent",
  "type": "record",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "source", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "lang", "type": "string"}
  ]
}
To produce tweets to Kafka, use the Confluent Python client:
pip install confluent-kafka
pip install avro # for encoding data to Avro
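As a sketch with those two libraries, a producer that Avro-encodes each tweet against the schema above and publishes it to the 'twitter' topic (the schema file name and broker address are assumptions):

import io
import avro.schema
from avro.io import DatumWriter, BinaryEncoder
from confluent_kafka import Producer

# Load the TwitterEvent schema shown above (file name is an assumption).
schema = avro.schema.parse(open('twitter_event.avsc').read())
writer = DatumWriter(schema)
producer = Producer({'bootstrap.servers': 'localhost:9092'})

def send_tweet(tweet):
    # Serialize one tweet dict to Avro bytes and publish it.
    buf = io.BytesIO()
    writer.write({'id': tweet['id'], 'source': tweet['source'],
                  'text': tweet['text'], 'lang': tweet['lang']},
                 BinaryEncoder(buf))
    producer.produce('twitter', value=buf.getvalue())
    producer.flush()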
Install the pyspark library:
pip install pyspark
Code snippet to start Spark Streaming:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

batch_duration = 10  # seconds
topic = 'twitter'  # Kafka topic
sc = SparkContext("local[2]", "SentimentAnalysisWithSpark")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, batch_duration)
# Connect via ZooKeeper and consume one partition of the topic;
# decode_tweet deserializes the Avro payload (see the note below).
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {topic: 1}, valueDecoder=decode_tweet)
...
ssc.start()
ssc.awaitTermination()
Note: for KafkaUtils to understand Avro, we need to pass a custom valueDecoder to the createStream method.
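A minimal decoder sketch, reading the raw message bytes back with the same schema file as the producer (some avro package versions spell the parse function avro.schema.Parse):

import io
import avro.schema
from avro.io import DatumReader, BinaryDecoder

schema = avro.schema.parse(open('twitter_event.avsc').read())
reader = DatumReader(schema)

def decode_tweet(raw_bytes):
    # Kafka hands the message value over as raw bytes; read them back
    # into a dict that matches the TwitterEvent schema.
    return reader.read(BinaryDecoder(io.BytesIO(raw_bytes)))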
To run the code:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 python_file_path
This command is for Kafka v0.8 and Spark Streaming v2.3.0 on Scala 2.11. If you use different versions, change the package version accordingly.
HBase is used to store tweets.
We use three tables in the project (created from the HBase shell):
Table 'tweets' for storing tweets:
create 'tweets', 'tweet'
Table 'neg_counter' for storing negative word frequencies:
create 'neg_counter', 'info'
Table 'pos_counter' for storing positive word frequencies:
create 'pos_counter', 'info'
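The write path into these tables is not shown here; as one possible sketch, the happybase client (an assumption, not a stated project dependency) can put a decoded tweet into the 'tweets' table via the HBase Thrift server:

import happybase

# Thrift server address and port are assumptions.
connection = happybase.Connection('localhost', port=9090)
table = connection.table('tweets')

def store_tweet(tweet):
    # Row key is the tweet id; columns live in the 'tweet' column family.
    table.put(tweet['id'], {
        'tweet:text': tweet['text'],
        'tweet:source': tweet['source'],
        'tweet:lang': tweet['lang'],
    })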
Hive is used to integrate Spark SQL and HBase.
We use three external tables, one linked to each of the HBase tables above:
CREATE EXTERNAL TABLE tweets(key string, text string, cleaned_text string, source string, lang string, target string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,tweet:text,tweet:cleaned_text,tweet:source,tweet:lang,tweet:target")
TBLPROPERTIES ("hbase.table.name" = "tweets");
CREATE EXTERNAL TABLE neg_counter(key string, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:counter#b")
TBLPROPERTIES ("hbase.table.name" = "neg_counter");
CREATE EXTERNAL TABLE pos_counter(key string, count bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:counter#b")
TBLPROPERTIES ("hbase.table.name" = "pos_counter");
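Once the external tables exist, the mapping can be sanity-checked from the Hive shell before wiring up Spark SQL, for example:

SELECT * FROM tweets LIMIT 10;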
To connect Spark SQL with HBase (through the Hive metastore):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Sentiment Query") \
.config("hive.metastore.uris", "thrift://localhost:9083") \
.enableHiveSupport().getOrCreate()
And then query the data:
query = spark.sql("SELECT target, count(target) AS count FROM tweets GROUP BY target")
# Visualize the class distribution as a pie chart
import matplotlib.pyplot as plt
df = query.toPandas()
df.plot(kind='pie', y='count', labels=df['target'], autopct='%1.1f%%', startangle=90, legend=False)
plt.show()