This repository contains multiple HR policy documents in Greek, alongside a set of testing questions aimed at evaluating the performance of a Retrieval-Augmented Generation (RAG)-based bot. This tool processes, clusters, and manages HR policy documents, providing similarity search capabilities with Qdrant as a vector database.
The following questions are designed to test the bot's ability to handle overlapping content from HR policies:
- Πότε πρέπει να υποβάλω αίτηση για άδεια μητρότητας ή πατρότητας; (When should I apply for maternity or paternity leave?)
- Ποιοι κανόνες ασφάλειας και υγείας πρέπει να τηρούνται στον εργασιακό χώρο; (Which health and safety rules must be followed in the workplace?)
- Πώς αξιολογείται η απόδοσή μου και ποια είναι τα κριτήρια; (How is my performance evaluated, and what are the criteria?)
- Ποιες ευκαιρίες εκπαίδευσης και ανάπτυξης παρέχει η εταιρεία; (What training and development opportunities does the company provide?)
- Τι περιλαμβάνει η πολιτική εργασιακής ισορροπίας; (What does the work-life balance policy include?)
- Ποια προγράμματα υποστήριξης προσφέρονται για την υγεία και την ευεξία των εργαζομένων; (What support programs are offered for employee health and well-being?)
The `clustering_tool.py` script is designed to process, cluster, and manage HR policy documents in Greek. It uses advanced NLP techniques and integrates with Qdrant for efficient storage and retrieval of text embeddings.
- Document Processing and Chunking: Reads `.txt` HR policy documents and splits them into chunks, preserving context and managing overlapping content.
- Embedding Generation: Uses Sentence Transformers to generate high-dimensional embeddings for Greek text.
- Clustering: Determines the optimal number of clusters via the elbow method and performs KMeans clustering; visualizes the clusters with t-SNE for easy interpretation.
- Vector Database Integration: Stores embeddings and metadata in Qdrant for easy retrieval and persistence.
- Similarity Search: Finds passages related to a given query, returning relevant passages with document and cluster information.
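The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the actual logic in `clustering_tool.py`: it uses whitespace tokens as a crude proxy for token count and carries a configurable number of trailing sentences into the next chunk to preserve context.

```python
# Hypothetical sketch of sentence-based chunking with overlap.
# `max_tokens` caps the chunk size (counted as whitespace tokens here);
# `overlap` sentences are repeated at the start of the next chunk.
def chunk_sentences(sentences, max_tokens=128, overlap=1):
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # keep trailing sentences for context
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Carrying an overlap between chunks is what produces the "overlapping content" the test questions probe: a sentence near a chunk boundary appears in two consecutive chunks.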
- Python 3.7+: Ensure Python is installed on your system.
- Docker: Required for running the Qdrant vector database.
git clone https://github.com/Argyx/HR_Policy_Assignment.git
cd HR_Policy_Assignment
Use a virtual environment for isolation:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
Run Qdrant using Docker:
docker run -p 6333:6333 qdrant/qdrant
For data persistence:
docker run -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant
Place all `.txt` HR policy documents in the folder specified by `folder_path` within `clustering_tool.py`.
Run the script to process documents, generate embeddings, perform clustering, and store data in Qdrant.
python clustering_tool.py
- Document Processing: Reads and splits documents into manageable chunks while preserving context.
- Embedding Generation: Converts text chunks into embeddings.
- Clustering: Determines optimal clusters and assigns passages.
- Qdrant Integration: Stores embeddings and metadata for efficient retrieval.
- Visualization: Creates a t-SNE plot to visualize clusters.
- Similarity Search: Executes sample similarity searches for each query.
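The clustering step above can be illustrated with a small sketch. Here a simple second-difference heuristic stands in for the `kneed`-based elbow detection the script uses, and the function and parameter names are illustrative, not taken from `clustering_tool.py`:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_and_cluster(embeddings, k_max=10, seed=42):
    """Pick k via a crude elbow heuristic, then cluster (illustrative only)."""
    ks = range(1, min(k_max, len(embeddings)) + 1)
    inertias = [KMeans(n_clusters=k, random_state=seed, n_init=10)
                .fit(embeddings).inertia_ for k in ks]
    # Elbow heuristic: the k with the largest second difference of inertia,
    # i.e. where the curve bends most sharply.
    diffs = np.diff(inertias, 2)
    best_k = int(np.argmax(diffs)) + 2 if len(diffs) else 1
    labels = KMeans(n_clusters=best_k, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    return best_k, labels
```

In the actual pipeline the embeddings come from the sentence-transformer model, and the resulting labels are what get written to the CSV and stored alongside each passage in Qdrant.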
This CSV file contains all passages, cluster assignments, and metadata:
- Cluster: The assigned cluster ID.
- Document: The document name.
- Passage Index: The passage's index in the document.
- Passage: The text content of the passage.
The embeddings and associated metadata are stored in a Qdrant collection. This enables efficient similarity search queries and data persistence.
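Conceptually, a similarity search against the collection ranks stored passage vectors by cosine similarity to the query embedding; Qdrant performs this server-side via its client API. The following NumPy sketch shows only the underlying idea, with illustrative names rather than actual client calls:

```python
import numpy as np

def top_k_similar(query_vec, passage_vecs, k=3):
    """Return (index, score) pairs for the k most similar passages."""
    q = query_vec / np.linalg.norm(query_vec)
    P = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = P @ q                        # cosine similarity per passage
    order = np.argsort(scores)[::-1][:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in order]
```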
A t-SNE plot is generated to visualize the distribution of clusters. Each point represents a passage, colored by cluster assignment.
This HTML report is generated for each query, showing the top 3 most similar passages found. Within each passage, words similar to the query are highlighted, based on a configurable similarity threshold.
This HTML file provides detailed recommendations on overlapping content found across different documents. Each entry includes:
- Document A and B: The documents involved in the overlap.
- Passages: The respective overlapping passages.
- Similarity Score: The score indicating how closely the passages align.
- Recommendation: Suggestions for handling the overlap, such as merging or revising content.
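Cross-document overlaps like these can be detected from the passage embeddings by comparing all pairs and keeping high-similarity pairs that span different documents. This is a hedged sketch under assumed names, not the script's actual report-generation code:

```python
import numpy as np

def find_overlaps(embeddings, doc_names, threshold=0.85):
    """Return (i, j, score) for cross-document passage pairs above threshold."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E.T  # pairwise cosine similarity
    pairs = []
    n = len(doc_names)
    for i in range(n):
        for j in range(i + 1, n):
            # Only flag overlaps between *different* documents.
            if doc_names[i] != doc_names[j] and sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return sorted(pairs, key=lambda p: -p[2])
```

The similarity score of each flagged pair is what would drive the merge-or-revise recommendation in the report.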
In the similarity search results, we have enhanced the highlighting of similar text by introducing different colors to represent different levels of similarity between the query and the passages.
The `highlight_similar_text_using_model` function has been updated to assign different colors to words in the passages based on their semantic similarity to the words in the query.
- Red (>= 0.9): Very high similarity. Words that are almost identical or highly related to the query terms.
- Orange (>= 0.8): High similarity. Words that are strongly related to the query.
- Yellow (>= 0.7): Moderate similarity. Words that are somewhat related to the query.
- Green (>= 0.5): Low similarity. Words that have a loose connection to the query.
This color-coded highlighting makes it easier to quickly identify the most relevant parts of each passage in relation to the query.
You can adjust the similarity thresholds and corresponding colors in the `highlight_similar_text_using_model` function call within the script. The `thresholds` parameter accepts a list of tuples, each containing a threshold value and a color name.
```python
# Custom thresholds and colors
custom_thresholds = [
    (0.9, 'red'),     # Very high similarity
    (0.8, 'orange'),  # High similarity
    (0.7, 'yellow'),  # Moderate similarity
    (0.5, 'green')    # Low similarity
]

# Use the function with custom thresholds
highlighted_passage = highlight_similar_text_using_model(
    query,
    passage['passage'],
    model,
    greek_stopwords,
    thresholds=custom_thresholds
)
```
Each query section includes:
- Query: The query text displayed in bold.
- Similar Passages: The top 3 similar passages are shown in an ordered list, each including:
- Score: Similarity score between the query and the passage.
- Document: The name of the document containing the passage.
- Passage Index: Index of the passage within the document.
- Highlighted Passage: The passage text with similar words marked using HTML `<mark>` tags.
This structure is helpful for visual inspection of the model's performance and allows easy testing of various queries against the stored document embeddings.
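The threshold-to-color `<mark>` wrapping described above can be sketched as a toy function. Here the per-word similarity scores are passed in precomputed (in the script they come from the sentence-transformer model), and the function name and signature are illustrative:

```python
def highlight_words(words, sims, thresholds):
    """Wrap each word in a colored <mark> tag per its similarity score.

    thresholds: list of (value, color) pairs, highest threshold first.
    Words below every threshold are left unmarked.
    """
    out = []
    for word, score in zip(words, sims):
        color = next((c for t, c in thresholds if score >= t), None)
        out.append(f'<mark style="background-color:{color}">{word}</mark>'
                   if color else word)
    return " ".join(out)
```

Because the thresholds are checked highest-first, each word receives the color of the strongest band it clears.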
- Maximum Tokens per Chunk: Modify `max_tokens` in `clustering_tool.py` to control text chunk size, e.g. `max_tokens = 128`.
- Qdrant Configuration: Update `qdrant_host` and `qdrant_port` if Qdrant is hosted externally; the defaults are `qdrant_host = 'localhost'` and `qdrant_port = 6333`.
- Qdrant Issues: Verify Docker setup if connection errors occur.
- Large Documents: Adjust `max_tokens` or optimize sentence splitting for large files.
Contributions are welcome! Open issues or submit pull requests for enhancements.
This project is licensed under the MIT License.
- Sentence Transformers for embedding generation.
- Qdrant for vector database management.
- Kneed for elbow method detection.
- Scikit-learn for clustering and evaluation metrics.
- Matplotlib and Seaborn for data visualization.