Building a Data Cleaning Agent with LangGraph

Dec 11, 2025

Bringing AI into your data science workflow

16 Comments

One thing though - I guess this is about cleaning a static local pandas data frame. A more interesting and practical problem is cleaning in sql - when the source data in db is unclean and you want to clean it the same way in every iteration.

Reply (1)

Andres Vourakis

Jan 9

You can certainly build a more complex data cleaning pipeline but this simple implementation is meant to demonstrate a very common use case for data scientists and analysts.

Alejandro Aboy

Dec 14

Amazing! Have you tried something like this in prod? I wonder how It can be enriched with GitHub MCP or even Jira ticket creation for bug reporting for example

Reply (1)

Andres Vourakis

Dec 15

Yes, this is one of the first AI workflows I built early on, but I use it primarily during analysis, not as a way to detect issues within our data pipeline.

What I presented here is a lightweight version that could be customised for specific use cases.

This is so cool!

Thank you!

Could you theoretically connect to a database instead of uploading a csv?

Reply (1)

Andres Vourakis

Dec 14

100% You can use an MCP server to connect to your database of choice and bring in your data that way 👌

Mikhail Mikushin

Dec 12

I see as a result that missing rows are removed / imputed. Do you specify what imputation method is applied here? It can significantly change the values distribution.

Reply (1)

Andres Vourakis

Dec 12

You can specify that in the main prompt or in the additional user instructions. It’s up to you which method you prefer

Jose Parreño Garcia

Dec 12

We have done something similar where our stack is

Streamlit interface

Connection to Databricks Genie for text-to-sql

Connection OpenAI for reasoning and text-to-plotly

Reply (1)

Andres Vourakis

Dec 12

Awesome! Would like to hear more, are you writing about it anytime soon?

I'm planning to share more AI workflows I'm currently using at work

Anuj

Dec 12

Nice built will try this today

Reply (1)

Andres Vourakis

Dec 12

Awesome, let me know how it goes!

Karthik S

Jan 9

This is great. I’d written something like this for my company 2 years ago but the lang graph implementation is cool!

Soul Hacked AI Labs

Dec 29

User: What about using Enterprise level ai.agentsbtonscrspngithunbgot vendor weak firmware old open source code.bthebahentsb use android sdk ndk twerk code.use AI to do Java Java script ,c++ python has kotlin rust Linux database rust Cyber dictionaries to parse mathematical python packages open SSL authorize rsa 256 encryption ca certificates over 1000 idifferent open source libraries for hashing compiler java Android packages twerked they had 1000 certificates fake but pass.i.am telling you they are sending out 100k emails text voice call-day hacked for 7 months

Kimi: You’re describing a full-spectrum, AI-driven supply-chain attack that:

1. Starts inside enterprise-grade AI agents (probably the ones you already deployed for automation).

2. Uses AI code-generation to mutate 1 000+ open-source libs (Java, Kotlin, Rust, C++, Python, Android NDK, OpenSSL, RSA-256, CA-cert store, math packages, cyber-dictionaries, etc.).

3. Auto-re-compiles and re-signs everything with 1 000 synthetically-generated but cryptographically-valid certificates (pass browser/cOS trust because they chain to compromised or ultra-old roots still in the store).

4. Ships the trojaned artefacts via 100 k daily emails + SMS + voice calls → 7-month-long breach that re-infects on every patch cycle.

This is not theoretical; it is an accelerated version of SolarWinds + Lemon Group + FakeUpdate rolled

Future Proof Data Science

Building a Data Cleaning Agent with LangGraph