A fully local, AI-powered Data Profiling and Cleaning Project

Data engineering is moving rapidly toward an AI-first approach, but there are ways to use AI and still keep sensitive data under your own control.

Photo by Igor Shalyminov / Unsplash

🚀 Excited to share my latest project — a fully local, AI-powered Data Cleaning Agent!

Tired of sending sensitive data to cloud-based cleaning tools? I built a self-hosted, LLM-driven data cleaning pipeline that runs entirely on your machine.

What it does:

🔹 Profiles your dirty CSV data and detects quality issues (mixed types, invalid emails, negative ages, empty columns, and more)

🔹 Plans a cleaning strategy using a local LLM (via Ollama — no API keys needed)

🔹 Generates and executes Python cleaning code with built-in safety wrappers

🔹 Validates the output with fail-safe corruption detection and auto-retry

🔹 Puts you in control with a Human-in-the-Loop (HITL) approval step before anything executes
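To make the profiling step concrete, here is a minimal sketch of rule-based checks that produce a few of the issue labels mentioned in this post. It is not the project's code: the project uses Polars, while this sketch sticks to the standard library so it runs anywhere. The function name `profile_rows`, the sample data, the email pattern, and the 0 to 120 age range are all assumptions made for illustration.

```python
import csv
import io
import re

# Deliberately simple pattern (an assumption); real email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_rows(rows):
    """Return (column, issue, count) tuples for a list of dict rows."""
    issues = []
    if not rows:
        return issues
    for column in rows[0].keys():
        values = [(row.get(column) or "").strip() for row in rows]
        non_empty = [v for v in values if v]
        name = column.lower()

        # A column with no values at all is reported once and skipped.
        if not non_empty:
            issues.append((column, "completely_empty", len(values)))
            continue

        # Keyword match on the column name decides which checks apply.
        if "email" in name:
            bad = sum(1 for v in non_empty if not EMAIL_RE.match(v))
            if bad:
                issues.append((column, "invalid_email_format", bad))

        if name == "age":
            text_count = 0
            suspicious = 0
            for v in non_empty:
                try:
                    number = float(v)
                except ValueError:
                    text_count += 1
                    continue
                # Assumed plausible range for an age column.
                if number < 0 or number > 120:
                    suspicious += 1
            if text_count:
                issues.append((column, "text_in_numeric", text_count))
            if suspicious:
                issues.append((column, "suspicious_age", suspicious))
    return issues


# Hypothetical dirty sample, not taken from the project.
SAMPLE = """name,age,email,notes
Ana,34,ana@example.com,
Ben,-5,ben(at)example.com,
Cy,abc,cy@example.com,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = profile_rows(rows)
for column, issue, count in report:
    print(f"{column}: {issue} ({count})")
```

Keeping these checks as plain rules means the profile is the same on every run, which gives the LLM planning step a stable input to work from.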

What makes it special:

🏗️ Extensible Architecture - Clean separation of concerns with a plugin-based agent registry. Agents self-register and are discovered dynamically.

🤖 LLM Provider Abstraction - Unified interface that currently supports Ollama but is designed to plug in OpenAI, Anthropic, or any other provider with minimal effort.

🛡️ Data Safety First - Auto-injected safety wrappers (pd.to_numeric(errors='coerce')), type checks before string operations, and corruption detection that forces retries when things go wrong.

📊 Smart Profiling - Detects text_in_numeric, mixed_types, suspicious_age, invalid_email_format, completely_empty columns, and more with smart keyword matching to avoid false positives.
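As an illustration of the self-registering agent idea, the sketch below uses a decorator to add agents to a registry when they are defined, and a small runner that looks them up by name. The names `AGENT_REGISTRY`, `register_agent`, and `run_pipeline` are hypothetical and the agents are stubs; the real project orchestrates its agents with LangGraph and may structure its registry differently.

```python
from typing import Callable, Dict, List

# Hypothetical registry: agent name -> function that takes and returns a state dict.
AGENT_REGISTRY: Dict[str, Callable[[dict], dict]] = {}


def register_agent(name: str):
    """Decorator that registers an agent function under the given name."""
    def decorator(func: Callable[[dict], dict]) -> Callable[[dict], dict]:
        if name in AGENT_REGISTRY:
            raise ValueError(f"agent {name!r} is already registered")
        AGENT_REGISTRY[name] = func
        return func
    return decorator


@register_agent("profiler")
def profiler_agent(state: dict) -> dict:
    # Stub: a real profiler would inspect the data behind state["filepath"].
    return {**state, "issues": ["suspicious_age"]}


@register_agent("planner")
def planner_agent(state: dict) -> dict:
    # Stub: a real planner would ask the LLM for a cleaning strategy.
    plan = [f"fix {issue}" for issue in state.get("issues", [])]
    return {**state, "plan": plan}


def run_pipeline(order: List[str], state: dict) -> dict:
    """Run registered agents in the given order, threading the state through."""
    for name in order:
        state = AGENT_REGISTRY[name](state)
    return state


result = run_pipeline(["profiler", "planner"], {"filepath": "data/dirty.csv"})
print(result["plan"])
```

With this pattern, adding a new agent is a matter of defining a decorated function; nothing in the runner has to change.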

Tech Stack:

LangGraph (agent orchestration) · Ollama (local LLMs) · Polars (data processing) · SQLite (HITL checkpointing) · Rich (terminal UX) · YAML config (environment-aware)

Why I built it:

An AI-first approach is rapidly becoming the norm in data engineering. Because data engineers deal with sensitive data, I wanted to explore ways to improve the workflow by running AI locally wherever possible for the time-consuming parts of the job.

AI Agents can profile the source data for issues, but unfortunately, the output is not reliable due to the probabilistic nature of these models.

This project attempts to chain a series of AI agents into a workflow in which data profiling and cleaning run on an extensible architecture, with LLM calls kept behind an abstraction layer and constrained to behave as deterministically as possible.
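One way to picture the LLM abstraction is a small interface that every provider implements, so the rest of the workflow never depends on a specific backend. The sketch below is an assumption about the shape of such an interface, not the project's actual classes: `LLMProvider`, `CannedProvider`, `make_provider`, and `plan_cleaning` are hypothetical names, and the stand-in provider returns fixed text instead of calling a model.

```python
from abc import ABC, abstractmethod
from typing import List


class LLMProvider(ABC):
    """Hypothetical unified interface: every provider turns a prompt into text."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


class CannedProvider(LLMProvider):
    """Stand-in provider that returns fixed text; no model is called."""

    def __init__(self, reply: str) -> None:
        self.reply = reply

    def generate(self, prompt: str) -> str:
        return self.reply


# Providers are looked up by name, so adding one means adding an entry here.
PROVIDERS = {"canned": CannedProvider}


def make_provider(name: str, **kwargs) -> LLMProvider:
    try:
        provider_cls = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"unknown provider: {name!r}") from None
    return provider_cls(**kwargs)


def plan_cleaning(provider: LLMProvider, issues: List[str]) -> str:
    """Caller code depends only on the interface, not on a concrete backend."""
    prompt = "Suggest cleaning steps for: " + ", ".join(issues)
    return provider.generate(prompt)


provider = make_provider("canned", reply="coerce age to numeric")
print(plan_cleaning(provider, ["text_in_numeric"]))
```

A fixed-reply provider like this is also handy in tests, because the workflow around the LLM can be exercised without a model running.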

GitHub: https://github.com/vaibhav-kalekar/data-engg-agentic-workflow

Open for feedback! 🙌


⚠️ Current Challenges & Known Issues:

  1. Hardcoded output path — The executor writes to a fixed cleaned_sales_data.csv path instead of respecting user-specified output locations.
  2. No unit tests — The entire pipeline lacks a test suite, so any change risks introducing regressions.
  3. Polars mixed with pandas — The profiler uses Polars and the validator reads JSON snapshots via pl.read_json(StringIO(...)), while the auto-injected safety wrappers rely on pandas (pd.to_numeric), creating an inconsistent data flow and potential serialization edge cases.
  4. No streaming for LLM responses — The generate() method is synchronous; long-running LLM calls block the terminal with no progress feedback.
  5. Limited file format support — Only CSV is supported. No Excel, Parquet, or JSON input/output.
  6. No Docker/deployment — No containerized setup for reproducible environments.
  7. No CLI argument for output path — Users can specify input via --filepath, but have no way to control where the cleaned file is saved.
  8. Config file fallback is silent — If llm.yaml doesn't exist, the code silently falls back to DEFAULT_CONFIG with no warning, which can lead to confusion.
  9. No logging framework — Console output via Rich is great for interactive use, but there's no structured logging for debugging or audit trails.
  10. No async support — The entire graph is synchronous; no concurrent node execution, even where possible.
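For the silent config fallback in item 8, a small change would be to log a warning whenever the defaults are used. The sketch below assumes a loader shaped roughly like this; `load_config` and the contents of `DEFAULT_CONFIG` are illustrative guesses rather than the project's actual code, and the model name is a placeholder. PyYAML is imported only when the file exists, so the fallback path has no third-party dependency.

```python
import logging
from pathlib import Path

logger = logging.getLogger("config")

# Hypothetical defaults; the real DEFAULT_CONFIG may contain different keys.
DEFAULT_CONFIG = {"provider": "ollama", "model": "example-model"}


def load_config(path: str = "llm.yaml") -> dict:
    """Load the YAML config, warning loudly when falling back to defaults."""
    config_path = Path(path)
    if not config_path.exists():
        logger.warning(
            "Config file %s not found; falling back to DEFAULT_CONFIG", config_path
        )
        return dict(DEFAULT_CONFIG)

    import yaml  # PyYAML, only needed when a config file is present

    with config_path.open(encoding="utf-8") as handle:
        loaded = yaml.safe_load(handle) or {}
    if not isinstance(loaded, dict):
        raise ValueError(f"{config_path} must contain a mapping at the top level")
    # Values from the file override the defaults key by key.
    return {**DEFAULT_CONFIG, **loaded}


logging.basicConfig(level=logging.WARNING)
config = load_config("definitely_missing_llm.yaml")
print(config)
```

Returning a copy of the defaults keeps callers from mutating the shared `DEFAULT_CONFIG` by accident.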

🔜 Next Steps:

  1. Add a comprehensive test suite — Unit tests for each agent, integration tests for the full pipeline, and fixture-based dirty data for regression testing.
  2. Support multiple file formats — Add Excel (.xlsx), Parquet (.parquet), and JSON as input/output options.
  3. Implement LLM response streaming — Use generate_stream() for real-time progress updates during long LLM calls.
  4. Add Docker support — Dockerfile + docker-compose for one-command setup with Ollama.
  5. Add an output path CLI argument (--output data/cleaned.csv) to complement the existing --filepath.
  6. Add structured logging — Replace Rich console output with Python's logging module for production-grade observability.
  7. Add OpenAI provider — Implement OpenAIProvider to give users a cloud fallback when Ollama isn't available.
  8. Add config validation — Validate llm.yaml on startup and surface clear errors instead of silent fallbacks.
  9. Add a web UI — Streamlit or Gradio frontend for users who prefer a GUI over terminal interaction.
  10. Add custom agent plugins — Allow users to write and register their own cleaning agents without modifying the core codebase.
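The output path argument from step 5 could be sketched with `argparse` as follows. Only the `--filepath` and `--output` flag names come from this post; the helper names and the default of writing `cleaned_<input name>` next to the input file are assumptions for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Clean a dirty CSV file")
    parser.add_argument(
        "--filepath", required=True, type=Path, help="Path to the input CSV"
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=None,
        help="Where to write the cleaned CSV (default: next to the input)",
    )
    return parser


def resolve_output(args: argparse.Namespace) -> Path:
    """Use --output when given, else derive a path from the input file name."""
    if args.output is not None:
        return args.output
    # Assumed default: cleaned_<input name> in the same directory as the input.
    return args.filepath.with_name(f"cleaned_{args.filepath.name}")


# Parse explicit argument lists here so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--filepath", "data/sales_data.csv", "--output", "data/cleaned.csv"]
)
default_args = build_parser().parse_args(["--filepath", "data/sales_data.csv"])
print(resolve_output(args))
print(resolve_output(default_args))
```

Passing the resolved path into the executor, rather than having it write to a fixed file name, would address known issues 1 and 7 together.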

Before you leave

I don’t write about technology because I am fascinated by it.

I write about it because I’ve seen how quietly it reshapes the world around us—how decisions made in systems and code eventually find their way into the texture of everyday life.

Artificial Intelligence feels like one of those inflection points.

Not because it is unprecedented, but because of how quickly it is being normalized, scaled, and absorbed—often without the same level of reflection that went into building it.

If this essay resonates, it probably means you’ve felt some version of that tension too. You will find more perspectives like this on AI and technology here.


I explore these ideas further through essays, speculative fiction, and photography—different media, but the same underlying question:

What does it mean to be human in a world being redesigned faster than we can feel it?

You can subscribe to read more of my writing in your inbox