Projects

Implementing RAG System for Large Language Model (LLM) Applications at Procter & Gamble

As a Data Scientist at Procter & Gamble, I specialize in integrating advanced LLM frameworks into custom-built applications designed to optimize data retrieval and natural language processing (NLP). Currently, I am leading the development of a Retrieval-Augmented Generation (RAG) system that enhances Large Language Models (LLMs) by incorporating external data sources. This project leverages cutting-edge AI technologies such as OpenAI’s GPT models, Hugging Face Transformers, and LlamaIndex to optimize the retrieval of private CSV-based patent data and generate accurate, summary-based responses.
The project has multiple components:
Data Handling & Preprocessing:
I manage and process large-scale proprietary datasets stored in CSV format, ensuring that the data is securely handled and optimized for use in LLMs. This involves structuring the data and indexing it using LlamaIndex, allowing the retrieval system to quickly access the most relevant information while maintaining strict compliance with privacy standards.
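A minimal sketch of this indexing step, assuming llama-index 0.10+ and a hypothetical patents.csv with title and abstract columns (the real schema, embedding configuration, and storage backend differ):

```python
# Sketch only: build a searchable index over CSV patent data.
# "patents.csv" and its columns are hypothetical stand-ins.
import pandas as pd
from llama_index.core import Document, VectorStoreIndex

df = pd.read_csv("patents.csv")

# Wrap each row as a Document so retrieval can carry row-level metadata.
documents = [
    Document(text=str(row["abstract"]), metadata={"title": str(row["title"])})
    for _, row in df.iterrows()
]

# Uses the default embedding model (OpenAI) unless configured otherwise.
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="patent_index")  # persist for reuse
```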
RAG Integration with LLMs:
By integrating the LlamaIndex retrieval system with GPT-based LLMs, I enable the models to pull in the most relevant patent information and generate contextually accurate summaries. Llama.cpp is used to enhance the system’s efficiency, enabling it to run models on local hardware while maintaining high performance. The retrieval mechanism dynamically selects pertinent data, improving both the precision and relevance of the responses generated by the LLM.
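A hedged sketch of wiring a local llama.cpp model into the query path, assuming the llama-index-llms-llama-cpp integration package, the index persisted above, and a hypothetical local GGUF model file:

```python
# Sketch only: serve RAG queries with a local llama.cpp-backed LLM.
from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.llms.llama_cpp import LlamaCPP

# Hypothetical model path; any GGUF checkpoint llama.cpp supports works here.
Settings.llm = LlamaCPP(model_path="models/llama-2-7b.Q4_K_M.gguf")

# Reload the persisted patent index and expose it as a query engine.
storage = StorageContext.from_defaults(persist_dir="patent_index")
index = load_index_from_storage(storage)
query_engine = index.as_query_engine(similarity_top_k=5)

print(query_engine.query("Summarize patents related to enzyme-based detergents."))
```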
Interface Development with Flask & HTML:
To ensure ease of use and accessibility, I am building an intuitive web-based interface using Flask and HTML. This interface allows users to interact with the system by inputting queries and receiving structured, summary-based responses in real-time. The UI is designed to be user-friendly, providing smooth interactions and the ability to handle large text datasets efficiently. By coding the front-end in HTML, I ensure a responsive and accessible interface that can be easily customized for future needs.
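A minimal sketch of the Flask side of such an interface; the route and payload names are illustrative, and query_engine stands in for the RAG layer described above:

```python
# Sketch only: a JSON endpoint the HTML front end can POST queries to.
from flask import Flask, request, jsonify

# query_engine is assumed to be built at startup (see the RAG sketch above);
# here it only needs a .query(str) method.
app = Flask(__name__)

@app.route("/query", methods=["POST"])
def answer_query():
    payload = request.get_json(silent=True) or {}
    question = payload.get("question", "").strip()
    if not question:
        return jsonify({"error": "empty question"}), 400
    response = query_engine.query(question)
    return jsonify({"summary": str(response)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```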
Optimization for Summarization & Response Generation:
The system not only retrieves the most relevant documents but also leverages the power of LLMs to generate concise and informative summaries. I fine-tune the models to ensure domain-specific accuracy, especially in the context of patent data. The goal is to reduce response times while improving the quality of generated summaries, making the system highly efficient for research and decision-making.
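One concrete lever for summary quality, sketched under the same llama-index assumptions as above: switching the query engine to a hierarchical summarization response mode and widening retrieval.

```python
# Sketch only: trade latency for broader, more thorough summaries.
from llama_index.core import StorageContext, load_index_from_storage

index = load_index_from_storage(StorageContext.from_defaults(persist_dir="patent_index"))

# "tree_summarize" condenses many retrieved chunks hierarchically into one
# summary; a larger similarity_top_k pulls in more supporting documents.
summary_engine = index.as_query_engine(
    response_mode="tree_summarize",
    similarity_top_k=10,
)
print(summary_engine.query("Give a concise overview of recent enzyme-related filings."))
```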
Scalability & Deployment:
To make the system scalable and accessible across platforms, I use Flask to serve the application and deploy it using Docker, ensuring that it can scale up efficiently. The integration of RAG into the LLM pipeline enables the system to handle growing datasets while maintaining fast, accurate performance. I also integrate RESTful API endpoints to allow other systems or applications to interact with the retrieval model.
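For illustration, this is how another service might call such an endpoint, assuming the Flask app sketched earlier is reachable on port 5000 (the host, route, and payload shape are illustrative):

```python
# Sketch only: a downstream client of the RESTful retrieval endpoint.
import requests

resp = requests.post(
    "http://localhost:5000/query",  # hypothetical host and route
    json={"question": "Which patents mention biodegradable surfactants?"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["summary"])
```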
Challenges Overcome:
Data Privacy & Security: Handling proprietary patent data requires compliance with strict security guidelines. I have developed secure data pipelines that ensure encrypted and controlled access, protecting the integrity of the data throughout the entire process.
Efficient Scaling for Large Datasets: Optimizing the retrieval process for large-scale datasets while maintaining low latency was a core challenge. By improving indexing and leveraging Llama.cpp, I achieved significant performance gains without sacrificing accuracy.
Tools & Technologies:
Programming Languages & Frameworks: Python, Flask, HTML
LLM & NLP Technologies: OpenAI API, Hugging Face Transformers, PyTorch, TensorFlow
Data Management & Retrieval: LlamaIndex, spaCy, CSV-based data pipelines
Model Optimization: Llama.cpp for running models efficiently on local hardware
Deployment & Scalability: Flask (backend), HTML (frontend), Docker (scalable deployment), RESTful APIs

Predicting the Best Pill for Addiction Recovery: A Machine Learning Study

Developed a machine learning model to improve treatment options for individuals struggling with addiction. Using real data from Druglib Reviews, the project focused on analyzing patient feedback on various medications. Preprocessing involved cleaning text, tokenization, and sentiment analysis, ensuring the dataset was well prepared for training. The XGBoost model outperformed the alternatives tested, reaching 92% accuracy and an AUC of 0.95, and its outputs give clinicians data-driven support for personalizing addiction recovery plans. Tools used include Python, Scikit-learn, XGBoost, and NLTK.
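A minimal sketch of this kind of pipeline; the file and column names are hypothetical stand-ins for the Druglib data, and the actual preprocessing (tokenization, sentiment features) was richer than shown:

```python
# Sketch only: TF-IDF features over review text, classified with XGBoost.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("reviews.csv")  # hypothetical: "review" text, binary "effective" label
X_train, X_test, y_train, y_test = train_test_split(
    df["review"], df["effective"], test_size=0.2, random_state=42
)

vec = TfidfVectorizer(max_features=20000, stop_words="english")
Xtr, Xte = vec.fit_transform(X_train), vec.transform(X_test)

model = XGBClassifier(n_estimators=300, max_depth=6, eval_metric="logloss")
model.fit(Xtr, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(Xte)))
print("AUC:", roc_auc_score(y_test, model.predict_proba(Xte)[:, 1]))
```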

Churn Prediction Model: A Machine Learning Study

This project focuses on building a model to predict whether a customer is likely to churn based on user-provided data, enabling businesses to intervene and retain customers. The model uses features like customer demographics, interaction history, and service usage patterns to assess churn risk. Tools used include Python, Scikit-learn, XGBoost, Pandas, NumPy, Matplotlib, Seaborn, Jupyter Notebooks, SQL for data retrieval, and AWS SageMaker for model deployment. Data was preprocessed through normalization, feature engineering, and missing-data imputation.
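A hedged sketch of such a preprocessing-plus-model pipeline; the feature names are hypothetical, and the deployed SageMaker version adds serving code not shown here:

```python
# Sketch only: imputation + scaling/encoding feeding an XGBoost classifier.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from xgboost import XGBClassifier

numeric = ["tenure_months", "monthly_spend", "support_tickets"]  # hypothetical
categorical = ["plan_type", "region"]                            # hypothetical

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

churn_model = Pipeline([
    ("prep", preprocess),
    ("clf", XGBClassifier(n_estimators=200, eval_metric="logloss")),
])
# churn_model.fit(df[numeric + categorical], df["churned"])  # hypothetical frame
```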

Natural Language Processing (NLP) Project – Sentiment Analysis

Led a comprehensive NLP project to perform sentiment analysis on the Rotten Tomatoes movie reviews dataset. The project included end-to-end development, from data preprocessing (tokenization, stop-word removal, and lemmatization) to building and training a BERT-based model for accurate sentiment classification. Achieved high accuracy by fine-tuning the model on labeled sentiment data and deploying it for real-time predictions, enabling fast and reliable sentiment analysis. The project demonstrated expertise in BERT, NLP techniques, and deploying machine learning models for practical applications in text-based sentiment analysis.
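For illustration, a minimal inference-side sketch; the checkpoint named here is a public stand-in, not the project's fine-tuned model:

```python
# Sketch only: real-time sentiment prediction with a transformer pipeline.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # public stand-in
)
print(classifier("A sharp, funny, and quietly moving film."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```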

Work Experiences

R&D Data Scientist, Procter & Gamble (P&G)

As a Data Scientist at P&G, I specialize in building advanced frameworks around OpenAI’s GPT models, integrating them into a custom Retrieval-Augmented Generation (RAG) application that optimizes data retrieval and processing and supports more insightful, accurate decision-making.

I design and implement NLP models on transformer-based architectures like GPT to improve natural language understanding and generation, scaling Large Language Models (LLMs) to complex business challenges such as sentiment analysis, recommendation systems, and automated responses. I build, train, and fine-tune production models with PyTorch, TensorFlow, and Hugging Face Transformers, and deploy scalable NLP solutions with Flask and Docker to ensure efficient accessibility across platforms. By integrating RAG systems, I combine external knowledge sources with LLM-generated text for more robust responses.

I also manage large-scale text datasets, focusing on preprocessing, tokenization, and embedding generation with tools like LlamaIndex and spaCy; stay at the forefront of NLP and LLM research, experimenting with novel architectures to drive innovation; and collaborate closely with cross-functional teams to deliver NLP solutions tailored to specific project needs.

Graduate Research Assistant
Teaching Assistant

I was a teaching assistant for the Database Theory and Deep Learning courses. In Database Theory, I covered topics like Relational Algebra, Normalization, Indexing, and Transaction Management; for Deep Learning, the focus was on Neural Networks, CNNs, RNNs, and Transfer Learning. My responsibilities included grading, providing one-on-one support, conducting labs, and assisting students with implementing deep learning models using TensorFlow and PyTorch.

PHP Developer Intern

Worked on multiple projects creating and optimizing web tools for data processing and display, implementing efficient algorithms to manage data flow and ensure accurate visual representation.
Developed user-friendly interfaces with HTML, CSS, and Bootstrap, focusing on responsive design for seamless performance across devices, and used W3Schools as a reference to keep up with modern web development techniques and maintain code quality. Collaborated with front-end and back-end teams to streamline project development and optimize performance, enhancing the overall functionality and user experience of the applications.

August 9, 2024
Data Scientist NLP/LLM

Developing advanced frameworks and custom RAG applications using GPT models to optimize data retrieval and decision-making at P&G.

August 8, 2023
Data Scientist

I worked as a Data Scientist analyzing and cleaning data, focusing on data engineering tasks with SQL, Microsoft Azure, Python, and other tools to enhance data processing and model development.

July 1, 2022
Teaching Assistant

I was a teaching assistant for Database Theory and Deep Learning, covering Relational Algebra and Neural Networks. I graded assignments, supported students, and assisted with TensorFlow and PyTorch.

July 1, 2021
Graduate Research Assistant


Techniques, Software & Instruments

◦ Languages: SQL, R, Python, C++, LaTeX, PHP, Java, Bash Scripting
◦ Software: Power BI, JetBrains IDEs, Visual Studio Code, Adobe Illustrator, Notepad++
◦ Tools & Libs: OpenCV, Git
◦ Platforms: Linux, Apache Spark, Web, Windows, macOS, Raspberry Pi
◦ Soft Skills: Project Management, Leadership, Time Management, Communication, Problem-Solving
◦ Technical Skills: Data Analysis, Linear Algebra, Object-Oriented Programming, Multithreaded Programming

Certificates

Natural Language Processing: NLP With Transformers in Python

I recently completed the “Natural Language Processing: NLP With Transformers in Python” course on Udemy, where I gained a comprehensive understanding of modern NLP techniques using transformer models. This course covered a range of essential topics, including:
Understanding Transformers: I learned about the architecture of transformer models, including key concepts such as attention mechanisms, multi-head attention, and the significance of models like BERT.
NLP Frameworks: The course provided hands-on experience with various NLP frameworks such as HuggingFace’s Transformers, TensorFlow, PyTorch, spaCy, and NLTK.
Sentiment Analysis: I developed skills to perform sentiment analysis and language classification using state-of-the-art transformer models, applying techniques to analyze financial Reddit data.
Named Entity Recognition (NER): I explored named entity recognition using spaCy and transformers, enabling me to extract meaningful entities from text.
Question Answering Models: The course included building full-stack question-answering models, allowing me to understand how to create systems that can accurately respond to user queries.
Advanced Search Technologies: I learned about advanced search technologies like Elasticsearch and Facebook AI Similarity Search (FAISS), enhancing my ability to implement efficient retrieval systems.
Performance Evaluation: The course emphasized measuring the performance of NLP models using metrics such as ROUGE, precision, recall, and F1 scores, ensuring robust evaluation of model effectiveness (a small sketch follows this list).
Fine-tuning Models: I gained experience in fine-tuning transformer models for specialized use cases, broadening my capability to tailor models to specific tasks.
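As a small illustration of those classification metrics, a toy scikit-learn example with made-up labels:

```python
# Toy illustration only: precision, recall, and F1 on invented labels.
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(classification_report(y_true, y_pred, digits=3))
```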
Certificate

Python for Machine Learning & Data Science Masterclass

Data Science & Machine Learning with Python: Master essential data science skills and understand machine learning from top to bottom.
Data Pipeline Workflows: Create workflows to analyze, visualize, and derive insights from data.
Portfolio Development: Build a portfolio of real-world data science projects.
Data Analysis: Analyze your datasets to gain valuable insights.
NumPy for Numerical Processing: Learn to handle numerical data effectively.
Feature Engineering: Conduct feature engineering through real-world case studies.
Pandas for Data Manipulation: Master data manipulation techniques with Pandas.
Supervised Machine Learning: Create algorithms to predict classes and continuous values.
Data Visualization with Matplotlib & Seaborn: Develop customized visualizations and beautiful statistical plots.
Scikit-learn for Machine Learning: Utilize powerful algorithms for machine learning applications.
Deployment of Machine Learning Models: Learn to deploy models as interactive APIs and understand the full machine learning lifecycle.
Certificate

Programming for Everybody (Getting Started with Python)

Installation and Basics: Install Python and write your first program.
Variables: Use variables to store, retrieve, and calculate information.
Core Programming Tools: Utilize functions and loops.

Skills Gained
Algorithms
Computer Programming
Critical Thinking
Problem Solving
Programming Principles
Python Programming
Software Engineering
Theoretical Computer Science
Certificate