One thing though: I guess this is about cleaning a static, local pandas DataFrame. A more interesting and practical problem is cleaning in SQL, when the source data in the DB is unclean and you want to clean it the same way in every iteration.
You can certainly build a more complex data cleaning pipeline but this simple implementation is meant to demonstrate a very common use case for data scientists and analysts.
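For the "clean it the same way in every iteration" case raised above, one possible approach is to keep the cleaning rules in a SQL view, so every read of the view applies the same logic to whatever is currently in the raw table. This is only a minimal sketch using an in-memory SQLite database. The table `customers_raw`, the view `customers_clean`, and the rules themselves are made up for illustration and are not part of the workflow in the post.

```python
import sqlite3

# Hypothetical cleaning rules, stored once as a view so they are applied
# identically on every read of the raw table.
CLEAN_VIEW_SQL = """
CREATE VIEW IF NOT EXISTS customers_clean AS
SELECT DISTINCT
    id,
    LOWER(TRIM(email)) AS email,                                  -- normalise emails
    COALESCE(NULLIF(TRIM(country), ''), 'unknown') AS country     -- blank/NULL -> 'unknown'
FROM customers_raw
WHERE email IS NOT NULL                                           -- drop rows without an email
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_raw (id INTEGER, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers_raw VALUES (?, ?, ?)",
    [
        (1, "  A@Example.com ", "DE"),   # untrimmed, mixed case
        (1, "a@example.com", "DE"),      # duplicate once normalised
        (2, None, "FR"),                 # missing email
        (3, "c@example.com", "  "),      # blank country
    ],
)
conn.execute(CLEAN_VIEW_SQL)

query = "SELECT id, email, country FROM customers_clean ORDER BY id"
rows = conn.execute(query).fetchall()
print(rows)

# New unclean data arrives later: the same rules apply with no extra code.
conn.execute("INSERT INTO customers_raw VALUES (4, ' D@Example.com', NULL)")
rows_after = conn.execute(query).fetchall()
print(rows_after)
conn.close()
```

The raw table is never modified, so the rules can be changed later by redefining the view without losing the original data.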
Amazing! Have you tried something like this in prod? I wonder how it could be enriched with GitHub MCP, or even Jira ticket creation for bug reporting, for example.
Yes, this is one of the first AI workflows I built early on, but I use it primarily during analysis, not as a way to detect issues within our data pipeline.
What I presented here is a lightweight version that could be customised for specific use cases.
This is so cool!
Thank you!
Could you theoretically connect to a database instead of uploading a CSV?
100%. You can use an MCP server to connect to your database of choice and bring in your data that way 👌
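For anyone who wants to try this without an MCP server, a plain database connection can also produce the DataFrame the cleaning step works on. The sketch below is not the MCP approach mentioned above. It assumes a SQLite database and a made-up `orders` table, and uses `pandas.read_sql_query` in place of the CSV upload.

```python
import sqlite3

import pandas as pd

# In-memory database with a hypothetical "orders" table standing in for a real source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 19.9, "north"), (2, None, "south"), (3, 5.0, None)],
)
conn.commit()

# Load the table into a DataFrame instead of reading an uploaded CSV.
df = pd.read_sql_query("SELECT id, amount, region FROM orders", conn)
conn.close()

# Missing values per column, as a quick check before any cleaning.
print(df.isna().sum())
```

From here, `df` could be handed to the same cleaning logic that would otherwise receive the CSV contents.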
I see in the result that rows with missing values are removed or imputed. Do you specify which imputation method is applied here? It can significantly change the distribution of the values.
You can specify that in the main prompt or in the additional user instructions. It’s up to you which method you prefer.
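To make the point about distributions concrete, here is a small sketch comparing three common choices on a made-up, skewed sample. The `impute` helper and the `prices` data are hypothetical and only illustrate why it may be worth naming the method explicitly in the prompt. They are not what the workflow in the post does internally.

```python
import pandas as pd


def impute(series: pd.Series, method: str = "median") -> pd.Series:
    """Handle missing values using an explicitly named strategy."""
    if method == "mean":
        return series.fillna(series.mean())
    if method == "median":
        return series.fillna(series.median())
    if method == "drop":
        return series.dropna()
    raise ValueError(f"unknown imputation method: {method}")


# Made-up sample with one large outlier and two missing values.
prices = pd.Series([10.0, 12.0, 11.0, None, 13.0, 200.0, None])

# Summary statistics under each strategy, side by side.
summary = pd.DataFrame(
    {m: impute(prices, m).describe() for m in ("mean", "median", "drop")}
)
print(summary.loc[["count", "mean", "50%", "std"]])
```

With skewed data like this, mean imputation keeps the mean but shrinks the spread, while median imputation pulls the mean toward the bulk of the data, so the choice is worth stating rather than leaving to the model.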
We have done something similar, where our stack is:
Streamlit interface
Connection to Databricks Genie for text-to-SQL
Connection to OpenAI for reasoning and text-to-Plotly
Awesome! Would like to hear more. Are you writing about it anytime soon?
I'm planning to share more AI workflows I'm currently using at work.
Nice build, will try this today.
Awesome, let me know how it goes!
This is great. I’d written something like this for my company 2 years ago, but the LangGraph implementation is cool!