This project provides an introduction to using Apache Spark for data processing and analysis. It includes hands-on examples for working with Spark DataFrames and instructions for setting up Spark on your local machine.
Apache Spark is a powerful distributed computing framework widely used for big data processing, machine learning, and real-time analytics.
We can use Spark with structured data using Spark DataFrames, query data with SQL, or process unstructured data using RDDs (Resilient Distributed Datasets).
This guide has been tested with:
- Python 3.10.11 (the newest release, 3.12.4, does NOT work)
- PySpark 3.5.3
- Spark 3.5.3
- JDK 17
- Winutils for Spark 3
Read about Spark’s features and capabilities on the Apache Spark Homepage.
Visit the Apache Spark Examples Page to work through Spark’s official examples, including:
- DataFrame Example - demonstrates creating a Spark DataFrame and performing operations like filtering, aggregation, and adding columns.
- SQL Example - illustrates how to query data using Spark SQL.
- RDD Example - introduces RDDs for processing unstructured data.
- Streaming Example - shows how to handle real-time data-in-motion with structured streaming.
Explore how different Spark-based projects might be structured by checking the docs folder. Each project demonstrates the use of a scalable pipeline architecture tailored to a specific domain.
- Spark Sales Project - analyzes customer, product, and sales data to generate insights like sales trends and product performance.
- Spark Social Media Project - processes social media data to uncover user engagement patterns, hashtag trends, and sentiment analysis.
- Spark Internet of Things (IoT) Project - aggregates and visualizes IoT sensor data to detect anomalies, monitor device usage, and analyze trends.
Follow the instructions to set up your system first:
Examples of .gitignore and requirements.txt are provided (yours may vary).
The project assumes the smart sales repository is organized like the following (yours may vary - adjust your paths using pathlib to reflect your existing project layout).
project/
├── data/prepared/
│ ├── customers_data_prepared.csv
│ ├── products_data_prepared.csv
│ ├── sales_data_prepared.csv
├── scripts/
│ ├── step0_pipeline.py # Orchestrate the pipeline
│ ├── step1_extract.py # Extract stage: Read data from sources
│ ├── step2_transform.py # Transform stage: Process data for insights
│ ├── step3_load.py # Load stage: Save results to storage
│ └── step4_visualize.py # Visualize results using seaborn or matplotlib
├── notebooks/
│ ├── insights.ipynb # Notebook orchestrating extract-transform-load (ETL) + visualization
├── .gitignore
├── README.md
└── requirements.txt
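Adjusting paths with pathlib, as suggested above, might look like this near the top of a script in scripts/ (the constant names are just a convention, not required by anything):

```python
from pathlib import Path

# Project root is the parent of the scripts/ folder this file lives in.
# Fall back to the current working directory when __file__ is unavailable
# (e.g. when pasting into a notebook cell).
try:
    PROJECT_ROOT = Path(__file__).resolve().parents[1]
except NameError:
    PROJECT_ROOT = Path.cwd()

DATA_DIR = PROJECT_ROOT / "data" / "prepared"
CUSTOMERS_CSV = DATA_DIR / "customers_data_prepared.csv"
PRODUCTS_CSV = DATA_DIR / "products_data_prepared.csv"
SALES_CSV = DATA_DIR / "sales_data_prepared.csv"
```

Because the paths are built relative to the script's own location, they work the same on Windows and macOS/Linux and regardless of which folder you launch the script from.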
Follow the instructions to manage your local virtual environment:
We keep our Python scripts in the scripts folder.
Create new files with these names in your scripts folder. Paste the contents from the file provided in this repo.
- In VS Code, open a PowerShell terminal in the root project folder.
- Activate your local project environment every time you open a terminal to work on the project.
.\.venv\Scripts\activate
Protip: After running the command once, you can usually get it back by typing just the initial dot and then hitting the right arrow key - or use the up arrow to access prior commands.
- Execute the script.
py scripts\step0_pipeline.py
Protip: After running the command once, you can usually get it back by typing just the initial py and then hitting the right arrow key - or use the up arrow to access prior commands.
If you get a Windows Firewall alert regarding the JDK, click Allow.
- In VS Code, open a terminal in the root project folder.
- Activate your local project environment every time you open a terminal to work on the project.
source .venv/bin/activate
- Execute the script.
python3 scripts/step0_pipeline.py
Add or update the files to make your own functionality.
Paste the contents from the file provided in this repo.
Execute your scripts - or experiment with a Jupyter notebook.
To verify your winutils setup on Windows, use PowerShell to confirm HADOOP_HOME is set, check that winutils.exe exists, and refresh the current terminal's PATH from the machine and user environment variables:

$Env:HADOOP_HOME
Test-Path "$Env:HADOOP_HOME\bin\winutils.exe"
$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")