
smart-sales-spark

This project provides an introduction to using Apache Spark for data processing and analysis. It includes hands-on examples for working with Spark DataFrames and instructions for setting up Spark on your local machine.

Apache Spark is a powerful distributed computing framework widely used for big data processing, machine learning, and real-time analytics.

We can work with structured data using Spark DataFrames, query it with Spark SQL, or process unstructured data using RDDs (Resilient Distributed Datasets).
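For a quick taste of all three, here is a minimal sketch (assuming PySpark is installed; the rows and column names are hypothetical sample data, not part of this project):

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("smart-sales-demo").getOrCreate()

# Structured data as a DataFrame (hypothetical sample rows)
df = spark.createDataFrame(
    [("Alice", 120.50), ("Bob", 75.25)],
    ["customer", "amount"],
)
df.filter(df.amount > 100).show()

# The same data queried with Spark SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT customer, amount FROM sales WHERE amount > 100").show()

# Lower-level processing via the underlying RDD
print(df.rdd.map(lambda row: row.customer.upper()).collect())

spark.stop()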

Versions Matter!

This guide has been tested with:

  • Python 3.10.11 (the newest, 3.12.4, does NOT work)
  • PySpark 3.5.3
  • Spark 3.5.3
  • JDK 17
  • Winutils for Spark 3

Apache Spark Homepage

Read about Spark’s features and capabilities on the Apache Spark Homepage.

Apache Spark Examples

Visit the Apache Spark Examples Page to work through Spark’s official examples, including:

  • DataFrame Example - demonstrates creating a Spark DataFrame and performing operations like filtering, aggregation, and adding columns.
  • SQL Example - illustrates how to query data using Spark SQL.
  • RDD Example - introduces RDDs for processing unstructured data.
  • Streaming Example - shows how to handle real-time data-in-motion with structured streaming.

Spark Pipeline Project Examples

Explore how different Spark-based projects might be structured by checking the docs folder. Each project demonstrates the use of a scalable pipeline architecture tailored to a specific domain.

FIRST: Set Up Your Machine

Follow the instructions to set up your system first:

SECOND: Activate your local Virtual Environment & Install Dependencies

Examples of .gitignore and requirements.txt are provided (yours may vary).
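As a rough sketch (an assumption based on the Versions Matter section above, not the exact file provided in this repo), a requirements.txt for this layout might pin versions like this:

pyspark==3.5.3
pandas
matplotlib
seaborn
jupyter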

The project assumes the smart sales repository is organized like the following (yours may vary - adjust your paths using pathlib to reflect your existing project layout; see the sketch after the layout below).

project/
├── data/prepared
│   ├── customers_data_prepared.csv
│   ├── products_data_prepared.csv
│   ├── sales_data_prepared.csv
├── scripts/
│   ├── step0_pipeline.py           # Orchestrate the pipeline
│   ├── step1_extract.py            # Extract stage: Read data from sources
│   ├── step2_transform.py          # Transform stage: Process data for insights
│   ├── step3_load.py               # Load stage: Save results to storage
│   └── step4_visualize.py          # Visualize results using seaborn or matplotlib
├── notebooks/
│   ├── insights.ipynb             # Notebook orchestrating extract-transform-load (ETL) + visualization
├── .gitignore
├── README.md
└── requirements.txt
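Here is one way to build those paths with pathlib (a sketch assuming the layout above and a script saved in the scripts folder; adjust the folder names to match yours):

from pathlib import Path

# Resolve paths relative to this script so the project runs from any working directory
SCRIPTS_DIR = Path(__file__).resolve().parent
PROJECT_ROOT = SCRIPTS_DIR.parent
DATA_DIR = PROJECT_ROOT / "data" / "prepared"

CUSTOMERS_CSV = DATA_DIR / "customers_data_prepared.csv"
PRODUCTS_CSV = DATA_DIR / "products_data_prepared.csv"
SALES_CSV = DATA_DIR / "sales_data_prepared.csv"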

Follow the instructions to manage your local virtual environment:

THIRD: Run PySpark Basic Script

We keep our Python scripts in the scripts folder.

Create new files with these names in your scripts folder, then paste in the contents from the corresponding files provided in this repo.
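If you want something runnable while you wire things up, a minimal extract-style script might look like this (a sketch, not the exact contents of the provided files; it assumes the data/prepared layout shown earlier):

from pathlib import Path

from pyspark.sql import SparkSession

# Build the path to one prepared CSV relative to this script
PROJECT_ROOT = Path(__file__).resolve().parent.parent
SALES_CSV = PROJECT_ROOT / "data" / "prepared" / "sales_data_prepared.csv"

spark = SparkSession.builder.appName("smart-sales-extract").getOrCreate()

# Read the CSV into a DataFrame, inferring column types from the data
sales_df = spark.read.csv(str(SALES_CSV), header=True, inferSchema=True)
sales_df.printSchema()
sales_df.show(5)

spark.stop()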

On a Windows Machine

  1. In VS Code, open a PowerShell terminal in the root project folder.

  2. Activate your local project environment every time you open a terminal to work on the project.

.\.venv\Scripts\activate

Protip: After running the command once, you can usually get it back by typing just the initial dot and then hitting the right arrow key - or use the up arrow to access prior commands.

  3. Execute the script.
py scripts\step0_pipeline.py

Protip: After running the command once, you can usually get it back by typing just the initial py and then hitting the right arrow key - or use the up arrow to access prior commands.

If you get a Windows Firewall alert regarding the JDK, click Allow.

On a Mac/Linux Machine

  1. In VS Code, open a terminal in the root project folder.

  2. Activate your local project environment every time you open a terminal to work on the project.

source .venv/bin/activate
 
  3. Execute the script.
python3 scripts/step0_pipeline.py

FOURTH: Enhance Functionality

Add or update the files to build out your own functionality.

Paste in the contents from the files provided in this repo as a starting point.

Execute your scripts - or experiment with a Jupyter notebook.
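For example, one possible enhancement is a transform step that aggregates sales per customer (a sketch; customer_id and sale_amount_usd are hypothetical column names, so check them against your own prepared data):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("smart-sales-transform").getOrCreate()

# Hypothetical column names below; replace with the ones in your prepared data
sales_df = spark.read.csv(
    "data/prepared/sales_data_prepared.csv", header=True, inferSchema=True
)

# Total and average sale amount per customer, largest spenders first
summary_df = (
    sales_df.groupBy("customer_id")
    .agg(
        F.sum("sale_amount_usd").alias("total_sales"),
        F.avg("sale_amount_usd").alias("avg_sale"),
    )
    .orderBy(F.desc("total_sales"))
)
summary_df.show(10)

spark.stop()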

Troubleshooting

In a PowerShell terminal, confirm that HADOOP_HOME is set:

$Env:HADOOP_HOME

Confirm that winutils.exe exists where Spark expects it:

Test-Path "$Env:HADOOP_HOME\bin\winutils.exe"

If a newly set environment variable is not visible, refresh the Path in the current session:

$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")
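You can also check the same variables from Python before launching Spark (a sketch; the winutils check only applies on Windows):

import os
from pathlib import Path

# Verify the environment variables Spark on Windows depends on
hadoop_home = os.environ.get("HADOOP_HOME")
java_home = os.environ.get("JAVA_HOME")
print(f"HADOOP_HOME = {hadoop_home}")
print(f"JAVA_HOME = {java_home}")

if hadoop_home:
    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    print(f"winutils.exe found: {winutils.exists()}")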

BYPASS LOCAL INSTALLATION - TRY IT ON THE WEB
