Daniel Kang’s Post

AI agents are increasingly used in production, but how can we know which agents to use and what they can do? Frontier labs, researchers, and practitioners are increasingly turning to AI agent benchmarks to answer this question. Unfortunately, AI agent benchmarks are broken!

Consider WebArena, a benchmark used by OpenAI and others to evaluate AI agents on interactions with websites. In a task to calculate the duration of a route, an agent answered “45 + 8 minutes” and was marked correct by WebArena, although the correct answer is “63 minutes.”

In our new research, we break down the failure modes in current AI agent benchmarks and introduce a checklist that minimizes the gameability of AI agent benchmarks and ensures they measure what they claim to measure.

Read about our work here:
- Substack: https://lnkd.in/eA8BwtAc
- Paper: https://lnkd.in/e6i5vsyb
- Website: https://lnkd.in/eX6JsgZd
- GitHub: https://lnkd.in/etwf8epA

This work is joint w/ Yuxuan Zhu, Yada Pruksachatkun, and other folks from Stanford, Berkeley, Yale, Princeton, MIT, Transluce, ML Commons, Amazon, and UK AISI.
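
For readers curious how a grading slip like the WebArena example can happen mechanically, here is a minimal illustrative sketch in Python. It is not WebArena's actual evaluator; the function names, constants, and matching rules are assumptions chosen to contrast an over-permissive answer check with a stricter one.

```python
import re

REFERENCE_MINUTES = "63"  # ground-truth duration for the route-planning task

def lenient_check(answer: str) -> bool:
    # Over-permissive grader: accepts any response that mentions a number and
    # a time unit, without verifying the stated duration. This kind of
    # leniency lets "45 + 8 minutes" pass even though it never states 63.
    return bool(re.search(r"\d+", answer)) and "minute" in answer.lower()

def strict_check(answer: str) -> bool:
    # Stricter grader: requires the response to state exactly one number,
    # and that number must equal the reference duration.
    nums = re.findall(r"\d+", answer)
    return len(nums) == 1 and nums[0] == REFERENCE_MINUTES

print(lenient_check("45 + 8 minutes"))  # True  -- wrong answer accepted
print(strict_check("45 + 8 minutes"))   # False -- wrong answer rejected
print(strict_check("63 minutes"))       # True  -- correct answer accepted
```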

Spot on—most agent benchmarks reward shortcut hacks, not real reasoning. Raises a question: are we measuring capability or just prompt gymnastics? Time to rethink how we evaluate agents under real-world noise and failure.

It’s interesting how benchmarks can be gamed in ways we might not expect. Given all these issues, what do you think is the most practical step for teams currently relying on benchmarks? Looking forward to seeing the impact of your checklist on improving evaluations!

Dario Amodei has publicly expressed skepticism about the usefulness of existing benchmarks for evaluating advanced AI systems. This breakdown clearly shows why those benchmarks are fragile.

Lies have a new name: it’s “broken.”
