This project provides an introduction to using Apache Spark for data processing and analysis. It includes hands-on examples for working with Spark DataFrames and instructions for setting up Spark on your local machine.
Apache Spark is a powerful distributed computing framework widely used for big data processing, machine learning, and real-time analytics.
We can use Spark with structured data using Spark DataFrames, query data with SQL, or process unstructured data using RDDs (Resilient Distributed Datasets).
This guide has been tested with:
- Python 3.10.11 (the newest release, 3.12.4, does NOT work)
- PySpark 3.5.3
- Spark 3.5.3
- JDK 17
- Winutils for Spark 3
Read about Spark’s features and capabilities on the Apache Spark Homepage.
Visit the Apache Spark Examples Page to work through Spark’s official examples, including:
- DataFrame Example - demonstrates creating a Spark DataFrame and performing operations like filtering, aggregation, and adding columns.
- SQL Example - illustrates how to query data using Spark SQL.
- RDD Example - introduces RDDs for processing unstructured data.
- Streaming Example - shows how to handle real-time data-in-motion with structured streaming.
Explore how different Spark-based projects might be structured by checking the docs folder. Each project demonstrates the use of a scalable pipeline architecture tailored to a specific domain.
- Spark Sales Project - analyzes customer, product, and sales data to generate insights like sales trends and product performance.
- Spark Social Media Project - processes social media data to uncover user engagement patterns, hashtag trends, and sentiment analysis.
- Spark Internet of Things (IoT) Project - aggregates and visualizes IoT sensor data to detect anomalies, monitor device usage, and analyze trends.
Follow the instructions to set up your system first:
Examples of .gitignore and requirements.txt are provided (yours may vary).
The project assumes the smart sales repository is organized like the following (yours may vary - adjust your paths using pathlib to reflect your existing project layout).
project/
├── data/prepared/
│ ├── customers_data_prepared.csv
│ ├── products_data_prepared.csv
│ ├── sales_data_prepared.csv
├── scripts/
│ ├── step0_pipeline.py # Orchestrate the pipeline
│ ├── step1_extract.py # Extract stage: Read data from sources
│ ├── step2_transform.py # Transform stage: Process data for insights
│ ├── step3_load.py # Load stage: Save results to storage
│ └── step4_visualize.py # Visualize results using seaborn or matplotlib
├── notebooks/
│ ├── insights.ipynb # Notebook orchestrating extract-transform-load (ETL) + visualization
├── .gitignore
├── README.md
└── requirements.txt
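Adjusting paths with pathlib, as suggested above, might look like this near the top of a script in scripts/ (the constant names are just a convention, not required by anything):

```python
from pathlib import Path

# Project root is the parent of the scripts/ folder this file lives in.
# Fall back to the current working directory when __file__ is unavailable
# (e.g. when pasting into a notebook cell).
try:
    PROJECT_ROOT = Path(__file__).resolve().parents[1]
except NameError:
    PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data" / "prepared"
CUSTOMERS_CSV = DATA_DIR / "customers_data_prepared.csv"
PRODUCTS_CSV = DATA_DIR / "products_data_prepared.csv"
SALES_CSV = DATA_DIR / "sales_data_prepared.csv"
```

Because the paths are built relative to the script's own location, they work the same on Windows and macOS/Linux and regardless of which folder you launch the script from.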
Follow the instructions to manage your local virtual environment:
We keep our Python scripts in the scripts folder.
Create new files with these names in your scripts folder. Paste the contents from the file provided in this repo.
- In VS Code, open a PowerShell terminal in the root project folder.
- Activate your local project environment every time you open a terminal to work on the project.
.\.venv\Scripts\activate
Protip: After running the command once, you can usually get it back by typing just the initial dot and then hitting the right arrow key - or use the up arrow to access prior commands.
- Execute the script.
py scripts\step0_pipeline.py
Protip: After running the command once, you can usually get it back by typing just the initial py and then hitting the right arrow key - or use the up arrow to access prior commands.
If you get a Windows Firewall alert regarding the JDK, click Allow.
- In VS Code, open a terminal in the root project folder.
- Activate your local project environment every time you open a terminal to work on the project.
source .venv/bin/activate
- Execute the script.
python3 scripts/step0_pipeline.py
Add or update the files to make your own functionality.
Paste the contents from the file provided in this repo.
Execute your scripts - or experiment with a Jupyter notebook.
To verify your winutils setup on Windows, use PowerShell to confirm HADOOP_HOME is set, check that winutils.exe exists, and refresh the current terminal's PATH from the machine and user environment variables:

$Env:HADOOP_HOME
Test-Path "$Env:HADOOP_HOME\bin\winutils.exe"
$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")