A crawler framework that scrapes data from famous Greek cuisine sites.
Current target sites:
This section contains the installation instructions for setting up a local development environment. The instructions have been validated on Ubuntu 20.04.
First, install all required software with the following commands:
sudo apt update
sudo apt install git python3 python3-pip python3-dev postgresql postgresql-contrib
The project dependencies are managed with pipenv. You can install it with:
pip3 install --user pipenv
`pipenv` should now be in your PATH. If not, log out and log in again. Then install all dependencies with:
pipenv install --dev
Then you can activate the Python environment with:
pipenv shell
All commands from this point forward require the Python environment to be active.
The project uses environment variables to keep private data such as user names and passwords out of source control. You can either set them at the system level or create a file named `.env` at the root of the repository.
The required environment variables for development are:
- `RECIPY_DATABASE_USER`: The database user
- `RECIPY_DATABASE_PASSWORD`: The database user's password
- `RECIPY_DATABASE_HOST`: The database host. For local development use `localhost`
- `RECIPY_DATABASE_NAME`: The database name
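For example, a development `.env` file could look like the following (the user name and password are placeholders; choose your own, and keep them consistent with the database user you create below):

RECIPY_DATABASE_USER=recipy_user
RECIPY_DATABASE_PASSWORD=change-me
RECIPY_DATABASE_HOST=localhost
RECIPY_DATABASE_NAME=recipy_development_db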
In order to run the project on your workstation, you must create a database named according to the value of the `RECIPY_DATABASE_NAME` environment variable, on the host specified by the `RECIPY_DATABASE_HOST` environment variable. You can create the database by running:
sudo -u postgres psql
postgres=# CREATE DATABASE recipy_development_db;
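If the role referenced by `RECIPY_DATABASE_USER` does not exist yet, you can create it and grant it access to the new database in the same `psql` session (the user name and password below are placeholders; use the values from your `.env`):

postgres=# CREATE USER recipy_user WITH PASSWORD 'change-me';
postgres=# GRANT ALL PRIVILEGES ON DATABASE recipy_development_db TO recipy_user;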
After you create the database, you can populate it with the initial schema by running:
python manage.py migrate
Now you can run the web server, exposing the API:
python manage.py runserver
The API is available at http://127.0.0.1:8000/api/v1/
The Swagger documentation page of the API is available at http://127.0.0.1:8000/api/swagger
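As a quick sanity check, you can request the API root from another terminal, for example:

curl http://127.0.0.1:8000/api/v1/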
To populate the database with data, you must run the crawlers. To do that, simply run the following:
cd crawlers
./deploy.sh
This will spawn a Scrapyd instance and execute all the crawlers concurrently.
The Scrapyd management page is available at http://127.0.0.1:6800
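You can also check the Scrapyd daemon from the command line through its JSON API (`daemonstatus.json` is part of Scrapyd's standard API and reports the number of pending, running and finished jobs):

curl http://127.0.0.1:6800/daemonstatus.json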
If you want to run a crawler separately, run:
scrapy crawl <crawler-name>
To build the project with Docker, first install Docker Engine and Docker Compose (see their official installation instructions).
Set up the `.env` file at the root of the repository. The required environment variables are:
- `RECIPY_DATABASE_USER`: The database user
- `RECIPY_DATABASE_PASSWORD`: The database user's password
- `RECIPY_DATABASE_HOST`: `db` (the host name must be `db`)
- `RECIPY_DATABASE_NAME`: The database name
Then just execute the following:
docker-compose up --build
Then you have the database, the API, the crawlers and the React frontend client up and running!
The database is exposed at jdbc:postgresql://localhost:5433/
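If you have the `psql` client installed on your host, you can connect to the containerised database with the following command (replace the placeholders with the values from your `.env`):

psql -h localhost -p 5433 -U <database-user> <database-name>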
The API, the Swagger page and the Scrapyd page are available at the same addresses mentioned above. The React client is available at http://127.0.0.1:5000/
The diagram below shows the structure and the main components of the ReciPy project.
The project consists mainly of:
- The Crawlers component, which gathers all the required data from the targeted websites
- A database in which the data are stored
- An API that accesses the data and serves it following the REST architecture
- A web application that serves as the user interface, from which users can search for recipes that exist on any of the targeted websites
Below is an example of the management console of Scrapyd showing the status of each crawler process:
The following endpoints were implemented in order to serve all the requests of the front-end application:
Below is the diagram of the database schema used to store the Recipes, Sites and Ingredients retrieved by the crawlers:
Finally, the screenshots below display the frontend application.
Search page:
Recipe detail page: