This project is part of the MLOps Zoomcamp offered by DataTalks.Club, 2024 cohort.
In today's competitive financial landscape, efficient loan approval processes are crucial. This project aims to develop and deploy a Machine Learning (ML) model to predict loan eligibility. By leveraging MLOps practices, we will build a robust and automated system for loan assessment. This will enable faster loan decisions, improve customer experience, and optimize risk management for the financial institution.
Traditional Loan Approval Process:
Currently, loan eligibility decisions are primarily made by human underwriters who assess various borrower data points like income, credit score, employment history, debt-to-income ratio, and collateral. This manual process can be time-consuming, prone to bias, and inconsistent, leading to potential delays and dissatisfied customers. Additionally, it can be challenging to accurately assess the creditworthiness of non-traditional borrowers who may lack an extensive credit history.
Machine Learning Approach:
This project proposes a Machine Learning (ML) model to automate and enhance the loan eligibility prediction process. The model will learn from historical loan data, identifying patterns differentiating approved and rejected loan applications. This data-driven approach can lead to:
- Faster Approvals: Automated predictions can significantly reduce processing time, allowing quicker loan decisions.
- Reduced Bias: ML models apply the same criteria to every application, reducing the influence of individual human judgment on loan decisions (though they must still be audited for bias inherited from historical data).
- Improved Efficiency: Streamlined loan assessment frees up underwriters' time for more complex cases.
- Enhanced Risk Management: The model can identify risk factors and predict potential defaults, allowing lenders to make informed decisions.
Tech Stack:
- Machine Learning: Scikit-learn
- Experiment tracking and model registry: CometML
- Cloud Infrastructure: Docker, Terraform, AWS (EC2 and S3)
- Linting and Formatting: Pylint, Flake8, autopep8
- Testing: Pytest
- Automation: GitHub Actions (CI/CD Pipeline)
- Orchestration: Prefect
Let's check the complete directory of the project.
Data Ingestion:
- The data was extracted from the Kaggle Loan Eligibility Dataset.
- Let's check the raw data
- Data cleaning procedures will ensure data quality and address missing values or inconsistencies.
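The cleaning step can be sketched as follows. This is a minimal illustration with a toy DataFrame; the column names only mimic the Kaggle Loan Eligibility schema and are assumptions, not the project's exact fields.

```python
import pandas as pd

# Toy sample with gaps, mimicking raw loan data (hypothetical columns).
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, None, 2583],
    "LoanAmount": [None, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, 1.0, None],
    "Loan_Status": ["Y", "N", "Y", "Y"],
})

# Basic cleaning: impute numeric gaps with the median, drop duplicate rows.
for col in ["ApplicantIncome", "LoanAmount", "Credit_History"]:
    df[col] = df[col].fillna(df[col].median())
df = df.drop_duplicates()

print(df.isna().sum().sum())  # no missing values remain
```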
Exploratory Data Analysis (EDA):
- Data visualizations will be used to understand the distribution of loan features, identify potential correlations, and uncover any hidden patterns.
- Feature importance analysis will assess the influence of each factor on loan eligibility.
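Typical first checks of this kind can be sketched with pandas on toy data (the fields below are hypothetical): class balance of the target and the correlation of each numeric feature with it.

```python
import pandas as pd

# Toy data illustrating the EDA checks; columns are assumptions.
df = pd.DataFrame({
    "Credit_History": [1, 1, 0, 1, 0, 1],
    "ApplicantIncome": [5000, 6000, 2000, 5500, 1800, 7000],
    "Loan_Status": [1, 1, 0, 1, 0, 1],  # 1 = approved
})

# Class balance of the target.
print(df["Loan_Status"].value_counts(normalize=True))

# Correlation of each feature with loan approval.
print(df.corr()["Loan_Status"].sort_values(ascending=False))
```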
- New features were created based on existing data to improve model performance.
- Data scaling was applied to ensure all features are on a similar scale.
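A minimal sketch of both steps, assuming (hypothetically) that a combined-income feature is one of the derived features and that scikit-learn's StandardScaler handles the scaling:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "ApplicantIncome": [5000, 2000, 8000],
    "CoapplicantIncome": [1500, 0, 2500],
    "LoanAmount": [120, 60, 200],
})

# Derived feature (a hypothetical example of "new features from existing data").
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# Scale features to zero mean / unit variance so they share a similar scale.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["TotalIncome", "LoanAmount"]])
print(scaled.mean(axis=0))  # per-column means are ~0 after scaling
```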
- The SelectFromModel method was applied to select the most relevant features.
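SelectFromModel keeps only the features whose importance (as measured by a fitted estimator) clears a threshold. A self-contained sketch on synthetic data, with a random forest as the importance source (the project's actual estimator and threshold may differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the loan dataset: 10 features, only 3 informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Fit a forest and keep features whose importance is above the mean (default).
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
X_sel = selector.fit_transform(X, y)

print(X_sel.shape[1], "features kept out of", X.shape[1])
```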
- Various ML algorithms, such as Logistic Regression, Random Forest, or Gradient Boosting, will be trained on a training split of the data and evaluated on a held-out test split.
- Model selection will be based on accuracy, precision, and recall metrics.
- Prefect orchestrated the workflow with the following pipeline. Note: you must provide an API key to use Prefect Cloud. You can check the Quickstart guide.
```
pip install -U prefect --pre
prefect cloud login -k '<my-api-key>'
```
- Data ingestion
- Feature engineering
- Feature selection
- Training
- Model registry
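The five stages above can be sketched as plain functions chained in order. In the real project each step would be a Prefect `@task` and `run_pipeline` a `@flow`; the function bodies below are hypothetical placeholders that only illustrate the data handoff between stages.

```python
def ingest():
    # Placeholder for reading the raw loan data.
    return [
        {"income": 5000, "approved": 1},
        {"income": 1200, "approved": 0},
        {"income": 7000, "approved": 1},
    ]

def engineer(rows):
    # Derive a feature (hypothetical threshold).
    for r in rows:
        r["high_income"] = int(r["income"] > 3000)
    return rows

def select_features(rows):
    # Keep only the selected feature and the label.
    return [{k: r[k] for k in ("high_income", "approved")} for r in rows]

def train(rows):
    # Placeholder "model": the majority class of the labels.
    labels = [r["approved"] for r in rows]
    return max(set(labels), key=labels.count)

def register(model):
    # Placeholder for pushing the model to the registry.
    return {"model": model, "version": 1}

def run_pipeline():
    return register(train(select_features(engineer(ingest()))))

print(run_pipeline())
```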
Experiment Tracking and Version Control:
- Comet ML was used to track the experiment.
- You need to set up an API key to use the package in the project.
- You can check the official Comet documentation.
```
pip install comet_ml
comet login
```
- Here is an example you can include in your project:
```python
# Get started in a few lines of code
import comet_ml

comet_ml.login()
exp = comet_ml.Experiment()

# Start logging your data with:
exp.log_parameters({"batch_size": 128})
exp.log_metrics({"accuracy": 0.82, "loss": 0.012})
```
Model registry and Version Control:
Model Testing:
The scripts were linted with Pylint and Flake8 and formatted with autopep8.
The model with the best performance was deployed using a Flask application.
The application was first tested on the host machine.
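A minimal sketch of such a Flask prediction endpoint. The route, payload fields, and the decision rule standing in for the trained model are all assumptions, not the project's actual code:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # Placeholder rule standing in for the trained model (hypothetical threshold).
    approved = (payload.get("credit_history", 0) == 1
                and payload.get("income", 0) > 3000)
    return jsonify({"eligible": bool(approved)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=9696)
```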
Model Deployment:
- Once the application was tested locally, a Makefile was created to containerize the app.
To build the image:
```
make build
```
To push the image to the Docker Hub repo:
```
make push
```
To run the image locally or on the cloud:
```
make run
```
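The three targets can be sketched in a Makefile like the one below; the image name, tag, and port are placeholders, not the project's actual values:

```makefile
# Hypothetical image name; replace with your own Docker Hub repo.
IMAGE := <dockerhub-user>/loan-eligibility:latest

build:
	docker build -t $(IMAGE) .

push:
	docker push $(IMAGE)

run:
	docker run -d -p 9696:9696 $(IMAGE)
```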
Note: provide Docker credentials in the terminal to pull the Docker image. You can check the image on your Docker Hub repo:
The production-ready model was deployed on AWS infrastructure (EC2 and S3), using Terraform as Infrastructure as Code (IaC) to manage computational resources. From the [app directory](src/deployment), run:
```
terraform init
terraform plan
terraform apply
```
Executing these commands will perform the following activities:
- Provision AWS infrastructure
- Enable TCP traffic (HTTP and HTTPS)
- Install and enable Docker on the EC2 instance
- Pull the Docker image from my Docker Hub repo
- Run the image on the EC2 instance
- Print the public IP address of the Flask app.
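A minimal sketch of the kind of Terraform configuration those steps imply. The region, AMI, instance type, user_data commands, and image name are all placeholders/assumptions:

```hcl
provider "aws" {
  region = "us-east-1" # placeholder region
}

# Allow inbound HTTP/HTTPS and all outbound traffic.
resource "aws_security_group" "web" {
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "app" {
  ami                    = "ami-xxxxxxxx" # placeholder AMI
  instance_type          = "t2.micro"
  vpc_security_group_ids = [aws_security_group.web.id]

  # Install Docker and run the app image at boot (hypothetical commands).
  user_data = <<-EOF
    #!/bin/bash
    yum install -y docker
    systemctl enable --now docker
    docker run -d -p 80:9696 <dockerhub-user>/loan-eligibility:latest
  EOF
}

output "public_ip" {
  value = aws_instance.app.public_ip
}
```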
GitHub Actions automates the CI/CD pipeline. The pipeline has the following steps:
- Checkout repository
- Set up Python
- Set up Terraform
- Run the Terraform init, plan, and apply tasks
- Print the public IP address of the Flask app.
- The Flask app is deployed automatically on the AWS cloud:
Note: The app will be available until the Attempt 2 review is completed.
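A workflow mirroring those steps could look like the sketch below; the trigger, Python version, working directory, and secret names are assumptions, not the project's actual workflow file:

```yaml
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - uses: hashicorp/setup-terraform@v3
      # Init, plan, and apply the Terraform config (hypothetical path).
      - run: terraform -chdir=src/deployment init
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      - run: terraform -chdir=src/deployment apply -auto-approve
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
      # Print the public IP address of the Flask app.
      - run: terraform -chdir=src/deployment output public_ip
```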
Monitoring and Continuous Improvement:
- The deployed model's performance will be continuously monitored through key metrics.
- Periodic retraining with new data will be conducted to ensure the model stays accurate and adapts to changing market conditions.
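One simple form of such a check can be sketched as follows: compare live accuracy on newly labeled loans against the validation accuracy seen at training time, and flag retraining when it drops. The baseline value, threshold, and sample labels below are toy assumptions:

```python
from sklearn.metrics import accuracy_score

# Validation accuracy recorded at training time (hypothetical value).
baseline_accuracy = 0.85

# Recent predictions with their eventual true outcomes (toy data).
y_true_recent = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred_recent = [1, 0, 1, 0, 0, 1, 1, 1]

live_accuracy = accuracy_score(y_true_recent, y_pred_recent)

# Flag retraining if live accuracy falls more than 5 points below baseline.
needs_retraining = live_accuracy < baseline_accuracy - 0.05
print(live_accuracy, needs_retraining)
```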
By implementing this data-driven approach, the project aims to significantly improve loan eligibility assessment, leading to faster decisions, enhanced customer satisfaction, and optimized risk management for the financial institution.