All notable changes to the Maven Crawler will be documented in this file. The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Adds command line arguments for running the crawler with various options.
- Sends Maven coordinates to a Kafka topic.
- Sends the extracted Maven coordinates as a JSON-compatible string.
- The extracted Maven coordinates have a timestamp based on Unix epochs.
- Avoids downloading and processing POM files if already exists on the disk.
- A non-recursive approach to extract the Maven coordinates.
- Saves and loads the crawled pages from a queue.
- Adds the URL of an extracted POM file to its JSON string.
- Users can set a limit to extract a specified number of Maven coordinates.
- In the case of no Kafka server, the extracted Maven coordinates can be saved in a file.
- Adds a
setup.py
script to install the crawler as a command-line tool.
- Solved the issue of
URLError
while crawling Maven repositories. - Solved an issue where the artifactID of parent package is extracted from a POM file.
- Solved an issue where a wrong groupID is extracted from a POM file.
- Solved an issue where the JSON string of some Maven coordinates have newlines, tabs or spaces.