A crawler framework that scrapes data from famous Greek cuisine sites.
Current target sites:
This section contains the installation instructions for setting up a local development environment. The instructions have been validated on Ubuntu 20.04.
First, install all required software with the following commands:
sudo apt update
sudo apt install git python3 python3-pip python3-dev postgresql postgresql-contrib
The project dependencies are managed with pipenv. You can install it with:
pip3 install --user pipenv
`pipenv` should now be in your PATH. If not, log out and log in again. Then install all dependencies with:
pipenv install --dev
Then you can activate the Python environment with:
pipenv shell
All commands from this point forward require the Python environment to be active.
The project uses environment variables to keep private data such as user names and passwords out of source control. You can either set them at the system level or create a file named `.env` at the root of the repository.
The required environment variables for development are:
- `RECIPY_DATABASE_USER`: The database user
- `RECIPY_DATABASE_PASSWORD`: The database user's password
- `RECIPY_DATABASE_HOST`: The database host. For local development use `localhost`
- `RECIPY_DATABASE_NAME`: The database name
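For example, a development `.env` file could look like the following (the user name and password are placeholders; choose your own, and keep them consistent with the database user you create below):

RECIPY_DATABASE_USER=recipy_user
RECIPY_DATABASE_PASSWORD=change-me
RECIPY_DATABASE_HOST=localhost
RECIPY_DATABASE_NAME=recipy_development_db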
In order to run the project on your workstation, you must create a database named according to the value of the `RECIPY_DATABASE_NAME` environment variable, on the host specified by the `RECIPY_DATABASE_HOST` environment variable. You can create the database by running:
sudo -u postgres psql
postgres=# CREATE DATABASE recipy_development_db;
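If the role referenced by `RECIPY_DATABASE_USER` does not exist yet, you can create it and grant it access to the new database in the same `psql` session (the user name and password below are placeholders; use the values from your `.env`):

postgres=# CREATE USER recipy_user WITH PASSWORD 'change-me';
postgres=# GRANT ALL PRIVILEGES ON DATABASE recipy_development_db TO recipy_user;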
After you create the database, you can populate it with the initial schema by running:
python manage.py migrate
Now you can run the web server, exposing the API:
python manage.py runserver
The API is available at http://127.0.0.1:8000/api/v1/
The Swagger documentation page of the API is available at http://127.0.0.1:8000/api/swagger
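As a quick sanity check, you can request the API root from another terminal, for example:

curl http://127.0.0.1:8000/api/v1/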
To populate the database with data, you must run the crawlers. To do that, simply run the following:
cd crawlers
./deploy.sh
This will spawn a Scrapyd instance and execute all the crawlers concurrently.
The Scrapyd management page is available at http://127.0.0.1:6800
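You can also check the Scrapyd daemon from the command line through its JSON API (`daemonstatus.json` is part of Scrapyd's standard API and reports the number of pending, running and finished jobs):

curl http://127.0.0.1:6800/daemonstatus.json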
If you want to run a crawler separately, run:
scrapy crawl <crawler-name>
To build the project with Docker, first install Docker Engine and Docker Compose (see their official installation instructions).
Set up the `.env` file at the root of the repository. The required environment variables are:
- `RECIPY_DATABASE_USER`: The database user
- `RECIPY_DATABASE_PASSWORD`: The database user's password
- `RECIPY_DATABASE_HOST`: `db` (the host name must be `db`)
- `RECIPY_DATABASE_NAME`: The database name
Then just execute the following:
docker-compose up --build
Then you have the database, the API, the crawlers and the React frontend client up and running!
The database is exposed at jdbc:postgresql://localhost:5433/
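If you have the `psql` client installed on your host, you can connect to the containerised database with the following command (replace the placeholders with the values from your `.env`):

psql -h localhost -p 5433 -U <database-user> <database-name>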
The API, the Swagger page and the Scrapyd page are available at the same addresses mentioned above. The React client is available at http://127.0.0.1:5000/
The diagram below shows the structure and the main components of the ReciPy project.
The project consists mainly of:
- The Crawlers component, which gathers all the required data from the targeted websites
- A database in which the data are stored
- An API that accesses the data and serves it following the REST architecture
- A web application that serves as the user interface, from which users can search for recipes that exist on any of the targeted websites
Below is an example of the management console of Scrapyd showing the status of each crawler process:
The following endpoints were implemented in order to serve all the requests of the front-end application:
Below is the diagram of the database schema used to store the Recipes, Sites and Ingredients retrieved by the crawlers:
Finally, the screenshots below display the frontend application.
Search page:
Recipe detail page: