This repository contains multiple HR policy documents in Greek, alongside a set of testing questions aimed at evaluating the performance of a Retrieval-Augmented Generation (RAG)-based bot. This tool processes, clusters, and manages HR policy documents, providing similarity search capabilities with Qdrant as a vector database.
The following questions are designed to test the bot's ability to handle overlapping content from HR policies:
- Πότε πρέπει να υποβάλω αίτηση για άδεια μητρότητας ή πατρότητας; (When should I apply for maternity or paternity leave?)
- Ποιοι κανόνες ασφάλειας και υγείας πρέπει να τηρούνται στον εργασιακό χώρο; (Which health and safety rules must be followed in the workplace?)
- Πώς αξιολογείται η απόδοσή μου και ποια είναι τα κριτήρια; (How is my performance evaluated, and what are the criteria?)
- Ποιες ευκαιρίες εκπαίδευσης και ανάπτυξης παρέχει η εταιρεία; (What training and development opportunities does the company provide?)
- Τι περιλαμβάνει η πολιτική εργασιακής ισορροπίας; (What does the work-life balance policy include?)
- Ποια προγράμματα υποστήριξης προσφέρονται για την υγεία και την ευεξία των εργαζομένων; (What support programs are offered for employee health and well-being?)
The `clustering_tool.py` script is designed to process, cluster, and manage HR policy documents in Greek. It uses advanced NLP techniques and integrates with Qdrant for efficient storage and retrieval of text embeddings.
- Document Processing and Chunking: Reads `.txt` HR policy documents and splits them into chunks, preserving context and managing overlapping content.
- Embedding Generation: Uses Sentence Transformers to generate high-dimensional embeddings for Greek text.
- Clustering: Determines the optimal number of clusters via the elbow method and performs KMeans clustering; visualizes the clusters with t-SNE for easy interpretation.
- Vector Database Integration: Stores embeddings and metadata in Qdrant for easy retrieval and persistence.
- Similarity Search: Finds passages related to a given query, returning relevant passages with document and cluster information.
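The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the actual logic in `clustering_tool.py`: it uses whitespace tokens as a crude proxy for token count and carries a configurable number of trailing sentences into the next chunk to preserve context.

```python
# Hypothetical sketch of sentence-based chunking with overlap.
# `max_tokens` caps the chunk size (counted as whitespace tokens here);
# `overlap` sentences are repeated at the start of the next chunk.
def chunk_sentences(sentences, max_tokens=128, overlap=1):
    chunks, current = [], []
    for sent in sentences:
        current.append(sent)
        if sum(len(s.split()) for s in current) >= max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # keep trailing sentences for context
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Carrying an overlap between chunks is what produces the "overlapping content" the test questions probe: a sentence near a chunk boundary appears in two consecutive chunks.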
- Python 3.7+: Ensure Python is installed on your system.
- Docker: Required for running the Qdrant vector database.
git clone https://github.com/Argyx/HR_Policy_Assignment.git
cd HR_Policy_Assignment
Use a virtual environment for isolation:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
Run Qdrant using Docker:
docker run -p 6333:6333 qdrant/qdrant
For data persistence:
docker run -p 6333:6333 -v qdrant_storage:/qdrant/storage qdrant/qdrant
Place all `.txt` HR policy documents in the folder specified by `folder_path` within `clustering_tool.py`.
Run the script to process documents, generate embeddings, perform clustering, and store data in Qdrant.
python clustering_tool.py
- Document Processing: Reads and splits documents into manageable chunks while preserving context.
- Embedding Generation: Converts text chunks into embeddings.
- Clustering: Determines optimal clusters and assigns passages.
- Qdrant Integration: Stores embeddings and metadata for efficient retrieval.
- Visualization: Creates a t-SNE plot to visualize clusters.
- Similarity Search: Executes sample similarity searches for each query.
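The clustering step above can be illustrated with a small sketch. Here a simple second-difference heuristic stands in for the `kneed`-based elbow detection the script uses, and the function and parameter names are illustrative, not taken from `clustering_tool.py`:

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_and_cluster(embeddings, k_max=10, seed=42):
    """Pick k via a crude elbow heuristic, then cluster (illustrative only)."""
    ks = range(1, min(k_max, len(embeddings)) + 1)
    inertias = [KMeans(n_clusters=k, random_state=seed, n_init=10)
                .fit(embeddings).inertia_ for k in ks]
    # Elbow heuristic: the k with the largest second difference of inertia,
    # i.e. where the curve bends most sharply.
    diffs = np.diff(inertias, 2)
    best_k = int(np.argmax(diffs)) + 2 if len(diffs) else 1
    labels = KMeans(n_clusters=best_k, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    return best_k, labels
```

In the actual pipeline the embeddings come from the sentence-transformer model, and the resulting labels are what get written to the CSV and stored alongside each passage in Qdrant.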
This CSV file contains all passages, cluster assignments, and metadata:
- Cluster: The assigned cluster ID.
- Document: The document name.
- Passage Index: The passage's index in the document.
- Passage: The text content of the passage.
The embeddings and associated metadata are stored in a Qdrant collection. This enables efficient similarity search queries and data persistence.
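Conceptually, a similarity search against the collection ranks stored passage vectors by cosine similarity to the query embedding; Qdrant performs this server-side via its client API. The following NumPy sketch shows only the underlying idea, with illustrative names rather than actual client calls:

```python
import numpy as np

def top_k_similar(query_vec, passage_vecs, k=3):
    """Return (index, score) pairs for the k most similar passages."""
    q = query_vec / np.linalg.norm(query_vec)
    P = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    scores = P @ q                        # cosine similarity per passage
    order = np.argsort(scores)[::-1][:k]  # highest scores first
    return [(int(i), float(scores[i])) for i in order]
```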
A t-SNE plot is generated to visualize the distribution of clusters. Each point represents a passage, colored by cluster assignment.
This HTML report is generated for each query, showing the top 3 most similar passages found. Within each passage, words similar to the query are highlighted, based on a configurable similarity threshold.
This HTML file provides detailed recommendations on overlapping content found across different documents. Each entry includes:
- Document A and B: The documents involved in the overlap.
- Passages: The respective overlapping passages.
- Similarity Score: The score indicating how closely the passages align.
- Recommendation: Suggestions for handling the overlap, such as merging or revising content.
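Cross-document overlaps like these can be detected from the passage embeddings by comparing all pairs and keeping high-similarity pairs that span different documents. This is a hedged sketch under assumed names, not the script's actual report-generation code:

```python
import numpy as np

def find_overlaps(embeddings, doc_names, threshold=0.85):
    """Return (i, j, score) for cross-document passage pairs above threshold."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E.T  # pairwise cosine similarity
    pairs = []
    n = len(doc_names)
    for i in range(n):
        for j in range(i + 1, n):
            # Only flag overlaps between *different* documents.
            if doc_names[i] != doc_names[j] and sims[i, j] >= threshold:
                pairs.append((i, j, float(sims[i, j])))
    return sorted(pairs, key=lambda p: -p[2])
```

The similarity score of each flagged pair is what would drive the merge-or-revise recommendation in the report.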
In the similarity search results, we have enhanced the highlighting of similar text by introducing different colors to represent different levels of similarity between the query and the passages.
The `highlight_similar_text_using_model` function has been updated to assign different colors to words in the passages based on their semantic similarity to the words in the query.
- Red (>= 0.9): Very high similarity. Words that are almost identical or highly related to the query terms.
- Orange (>= 0.8): High similarity. Words that are strongly related to the query.
- Yellow (>= 0.7): Moderate similarity. Words that are somewhat related to the query.
- Green (>= 0.5): Low similarity. Words that have a loose connection to the query.
This color-coded highlighting makes it easier to quickly identify the most relevant parts of each passage in relation to the query.
You can adjust the similarity thresholds and corresponding colors in the `highlight_similar_text_using_model` function call within the script. The `thresholds` parameter accepts a list of tuples, each containing a threshold value and a color name.
```python
# Custom thresholds and colors
custom_thresholds = [
    (0.9, 'red'),     # Very high similarity
    (0.8, 'orange'),  # High similarity
    (0.7, 'yellow'),  # Moderate similarity
    (0.5, 'green')    # Low similarity
]

# Use the function with custom thresholds
highlighted_passage = highlight_similar_text_using_model(
    query,
    passage['passage'],
    model,
    greek_stopwords,
    thresholds=custom_thresholds
)
```
Each query section includes:
- Query: The query text displayed in bold.
- Similar Passages: The top 3 similar passages are shown in an ordered list, each including:
- Score: Similarity score between the query and the passage.
- Document: The name of the document containing the passage.
- Passage Index: Index of the passage within the document.
- Highlighted Passage: The passage text with similar words marked using HTML `<mark>` tags.
This structure is helpful for visual inspection of the model's performance and allows easy testing of various queries against the stored document embeddings.
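The threshold-to-color `<mark>` wrapping described above can be sketched as a toy function. Here the per-word similarity scores are passed in precomputed (in the script they come from the sentence-transformer model), and the function name and signature are illustrative:

```python
def highlight_words(words, sims, thresholds):
    """Wrap each word in a colored <mark> tag per its similarity score.

    thresholds: list of (value, color) pairs, highest threshold first.
    Words below every threshold are left unmarked.
    """
    out = []
    for word, score in zip(words, sims):
        color = next((c for t, c in thresholds if score >= t), None)
        out.append(f'<mark style="background-color:{color}">{word}</mark>'
                   if color else word)
    return " ".join(out)
```

Because the thresholds are checked highest-first, each word receives the color of the strongest band it clears.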
- Maximum Tokens per Chunk: Modify `max_tokens` in `clustering_tool.py` to control text chunk size, e.g. `max_tokens = 128`.
- Qdrant Configuration: Update `qdrant_host` and `qdrant_port` if Qdrant is hosted externally; the defaults are `qdrant_host = 'localhost'` and `qdrant_port = 6333`.
- Qdrant Issues: Verify Docker setup if connection errors occur.
- Large Documents: Adjust `max_tokens` or optimize sentence splitting for large files.
Contributions are welcome! Open issues or submit pull requests for enhancements.
This project is licensed under the MIT License.
- Sentence Transformers for embedding generation.
- Qdrant for vector database management.
- Kneed for elbow method detection.
- Scikit-learn for clustering and evaluation metrics.
- Matplotlib and Seaborn for data visualization.