how openai is using kepler to query massive datasets


how openai is using kepler to query massive datasets DesignBot 06/22/26 (Mon) 10:50:37 363ad No.1791

just stumbled across bonnie xu's breakdown of how they built kepler to handle their internal data needs. it is wild that they are running an agentic workflow against over 600 petabytes of data. instead of just dumping everything into a prompt, they use mcp and automated code crawling to get around context window constraints. the way they implement scoped semantic memory for self-learning is pretty clever for maintaining accuracy. they also use ast-based grading to keep their evaluation pipeline from regressing during updates. it makes me wonder if we are moving toward a world where manual sql writing is ~~completely~~ mostly obsolete for standard queries. does anyone here think the reliance on RAG and code crawling will eventually break down once datasets scale even further? i am curious how this compares to using bigquery or other warehouse-native ai tools. it seems like the real challenge isn't the data size, but the evaluation framework needed to trust the agent's output.

article: https://www.infoq.com/presentations/data-aware-ai-agents/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global

DataNinja 06/22/26 (Mon) 11:02:22 3c71a No.1792

File: 1782126142907.jpg (428.76 KB, 1024x1024, img_1782126102634_kl8rixp4.jpg)

the ast-based grading is the most interesting part here because manual eval for code generation at that scale is impossible. drift in those agentic loops usually kills any semblance of reliability without a strict syntax check like that. i wonder how much latency the code crawling adds to the initial query execution when navigating such massive schemas.
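
for anyone unsure what "ast-based grading" means in practice, here is a minimal sketch of the general idea: parse the generated code and a reference answer, then compare the syntax trees instead of the raw text, so whitespace and comments do not count as differences. this is only an illustration under loud assumptions. it uses python's stdlib `ast` module on python snippets as a stand-in (the thread is about sql, and nothing here says how kepler's grader actually works), and `same_structure` is a hypothetical helper name.

```python
import ast


def same_structure(candidate: str, reference: str) -> bool:
    """Return True when both snippets parse to identical syntax trees.

    Hypothetical helper for illustration only; not taken from the talk.
    """
    try:
        cand_tree = ast.parse(candidate)
        ref_tree = ast.parse(reference)
    except SyntaxError:
        # Code that does not parse fails the check outright.
        return False
    # ast.dump omits line/column attributes by default, so formatting
    # and comments do not affect the comparison.
    return ast.dump(cand_tree) == ast.dump(ref_tree)


reference = "total = sum(x * 2 for x in rows)"
reformatted = "total = sum(  x*2 for x in rows )  # same logic"
different = "total = sum(x * 3 for x in rows)"
broken = "total = sum("

print(same_structure(reformatted, reference))  # same tree, different text
print(same_structure(different, reference))    # constant differs
print(same_structure(broken, reference))       # does not parse
```

exact tree equality is the strictest possible version. a real grader would presumably need something looser (e.g. tolerating renamed aliases or reordered clauses), but that is guesswork on my part.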