Daniel Kang • 6mo

AI agents are increasingly used in production, but how can we know which agents to use and what they can do? Frontier labs, researchers, and practitioners are increasingly turning to AI agent benchmarks to answer this question. Unfortunately, AI agent benchmarks are broken!

Consider WebArena, a benchmark used by OpenAI and others to evaluate AI agents on interactions with websites. In a task to calculate the duration of a route, an agent answered “45 + 8 minutes” and was marked correct by WebArena, although the correct answer is “63 minutes.”

In our new research, we break down the failure modes in current AI agent benchmarks and introduce a checklist that minimizes the gameability of AI agent benchmarks and ensures they measure what they claim to measure.

Read about our work here:
- Substack: https://lnkd.in/eA8BwtAc
- Paper: https://lnkd.in/e6i5vsyb
- Website: https://lnkd.in/eX6JsgZd
- GitHub: https://lnkd.in/etwf8epA

This work is joint w/ Yuxuan Zhu, Yada Pruksachatkun, and other folks from Stanford, Berkeley, Yale, Princeton, MIT, Transluce, ML Commons, Amazon, and UK AISI.
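A failure like the WebArena example above typically comes from an overly lenient answer checker. The sketch below is hypothetical (it is not WebArena's actual grading code): it contrasts a lenient check that a wrong answer can game with a stricter numeric check for the route-duration task.

```python
# Hypothetical sketch (not WebArena's actual grading code): how a lenient
# answer check lets a wrong agent answer pass, and what a stricter
# numeric check looks like for the route-duration task above.
import re

def lenient_check(agent_answer: str) -> bool:
    # Accepts anything that merely looks like a duration.
    return "minutes" in agent_answer

def strict_check(agent_answer: str, reference_minutes: int = 63) -> bool:
    # Requires exactly one number in the answer, equal to the reference.
    numbers = re.findall(r"\d+", agent_answer)
    return len(numbers) == 1 and int(numbers[0]) == reference_minutes

print(lenient_check("45 + 8 minutes"))  # True: the wrong answer is marked correct
print(strict_check("45 + 8 minutes"))   # False
print(strict_check("63 minutes"))       # True
```

The gameability the paper describes lives in gaps like the one between these two checkers: an agent can pass the first without ever producing the right answer.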
AI Agent Benchmarks are Broken (ddkang.substack.com)

Comments:

Nnamdi Iregbulem • 6mo
Seems like all AI benchmarking is broken ;) https://whoisnnamdi.substack.com/p/ai-benchmarking-broken

Sai Sandeep Kantareddy • 6mo
Spot on—most agent benchmarks reward shortcut hacks, not real reasoning. Raises a question: are we measuring capability or just prompt gymnastics? Time to rethink how we evaluate agents under real-world noise and failure.

Bilkis Jahan Eva • 6mo
It’s interesting how benchmarks can be gamed in ways we might not expect. Given all these issues, what do you think is the most practical step for teams currently relying on benchmarks? Looking forward to seeing the impact of your checklist on improving evaluations!

Achref Karoui • 5mo
Dario Amodei has publicly expressed skepticism about the usefulness of existing benchmarks for evaluating advanced AI systems. This breakdown clearly shows why those benchmarks are fragile.

Gourav Sengupta • 6mo
Lies have a new name: “broken.”

More Relevant Posts

Balaji Ganesan • 3mo

🌟 Recently, I learned about some really interesting AI concepts that made me appreciate how modern AI assistants actually think and act. I explored how different open-source frameworks like LangChain and LangGraph come together to make agentic systems work — where an AI can reason, retrieve, and use tools intelligently.

Here are a few key things I found fascinating:

💾 Retrieval-Augmented Generation (RAG): This helps an AI go beyond its built-in knowledge. Instead of relying only on its training data, it retrieves relevant documents from a knowledge base and uses them to generate accurate, up-to-date responses.
🗂️ Vector Databases (like ChromaDB): These store text as numerical “embeddings,” allowing the AI to search by meaning — so it can find semantically similar results even if the words differ.
⚙️ LangGraph for Orchestration: It defines the reasoning flow — helping the agent decide when to call a tool, when to search the knowledge base, or when to answer directly. It acts as the “brain’s decision map.”
🧩 LangChain for Integration: It connects all the moving parts — the LLM, tools, and vector store — making it easier to build modular pipelines for retrieval, reasoning, and action.
🧠 Open-Source LLMs (like Mistral via Ollama): These handle natural language understanding and reasoning locally, helping decide whether the query needs RAG, a tool, or a direct answer.
⚙️ Tools: The agent detects intent — if the user asks about math or weather, it invokes the right tool automatically.

It was fascinating to see how retrieval, orchestration, and reasoning come together. Every time I debugged or refined something, I got a clearer picture of how agentic AI systems actually think, plan, and act in real time.

Would love to connect with others who are exploring these topics — always open to sharing ideas and learning together! 🤝💬

#AI #AgenticAI #LangChain #LangGraph #RAG #VectorDatabase #LLM #Mistral #OpenSourceAI #MachineLearning #ArtificialIntelligence #LearningJourney

Minyang Jiang • 3mo

The recent paper GDPval from OpenAI measures how well top AI models perform complex expert tasks in more than 44 occupations across the top 9 sectors in the US, in an attempt to measure actual economic value. What is cool about this paper is that as the bar gets raised for human expertise, it is also being raised for AI models. I have written in the past about why AI is raising the bar for all, and how the nature of expertise itself has to change.
With this new paper, what it means to have "humans in the loop" is also changing. We’re entering a phase where the unit of value isn’t tokens or prompts or human hours; it is simply the deliverables that companies are willing to pay for. This means a new lens of efficiency is here, where AI deliverables against corporate objectives are the products, not how you get there.

The role of expertise is shifting: true experts aren’t replaced, but they need to be really clear on setting objectives and setting the bar for model/human collaboration. That means diagnosing model misses, deciding when to resample vs. take over, and preventing “fast but wrong” from reaching customers.

Willingness to pay is also changing. Future AI business models are likely to charge for accepted deliverables that pass defined checks (format, completeness, fidelity) rather than for hours or tokens. That pushes experts to be even better at specifying acceptance criteria up front.

Redesigning workflows is now every business's and everyone's problem. More than prompt engineering, each function and each organization will be under pressure to increase its own velocity through better-engineered workflows with AI, while keeping experts in house to check those workflows.

To me, this sounds more and more like a win for the garbage can theory of organizations: the "how" matters far less than the ability to use humans and AI to accurately identify, name, and align to the "what." Efficiency becomes less about saving cost and more about increasing the velocity with which expertise and AI products align to clear organizational objectives.

Divya Vudattu • 3mo

🧠 **Game-Changer Alert: AI Agents That Actually Learn From Experience**

Researchers at the University of Illinois and Google Cloud just cracked one of AI's biggest limitations: agents that forget everything between tasks.
Introducing **ReasoningBank**: a memory framework that turns every success AND failure into reusable knowledge.

**Why This Matters:**
• Current AI agents start from scratch each time
• They repeat the same mistakes endlessly
• No learning from past experiences

**The Breakthrough:**
✅ Distills strategies from both wins and failures
✅ 8.3% performance improvement on complex tasks
✅ Nearly 50% cost reduction in some scenarios
✅ Works with existing models (Gemini, Claude)

**Real Impact:**
Instead of an agent taking 8 trial-and-error steps to find the right product filter, it remembers "optimize search queries" and "use category filtering" from previous attempts.

**For Data Engineers & AI Teams:**
This isn't just another research paper - it's a practical pathway to building adaptive, cost-effective AI systems that get smarter over time.

The future of AI isn't just about bigger models. It's about systems that learn, adapt, and improve from every interaction.

**What's your take? Are we finally moving toward truly intelligent AI agents?**

#DataEngineering #AI #MachineLearning #AIAgents #Technology #Innovation

---

**Article Source:** VentureBeat - "New memory framework builds AI agents that can handle the real world's unpredictability" (published Oct 8, 2025) https://lnkd.in/gPt_eNHY

Eugene ☁ I. • 3mo

This post looks at the costs, benefits, and drawbacks of replacing services for agentic AI with direct database access, including services that work well and are proven in production as well as new services yet to be built. Explore the anatomy of an agentic AI application and what would factor into such decisions.
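As a toy illustration of the trade-off weighed in the post above, here is a hedged sketch of an agent tool that queries a database directly instead of going through a service layer. The schema and names (`orders`, `lookup_order`) are invented for illustration; the point is that safeguards a service layer would normally provide (validation, safe parameterization, access scoping) become the tool's own responsibility.

```python
# Hypothetical sketch: an agent "tool" with direct database access instead
# of a service API. Schema and names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'shipped'), (2, 'pending')")

def lookup_order(order_id: int) -> str:
    """Direct-DB tool: with no service in front of it, input validation
    and parameterized queries are now this function's job."""
    row = conn.execute(
        "SELECT status FROM orders WHERE id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else "not found"

print(lookup_order(1))   # shipped
print(lookup_order(99))  # not found
```

Cutting out the service layer saves a hop and a codebase to maintain, at the cost of re-implementing its guardrails inside every tool the agent can call.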
Key components of a data-driven agentic AI application | Amazon Web Services (aws.amazon.com)

Arpit Kumar • 2mo (edited)

It's crazy to see where AI is today. Being in the thick of AI, learning on the newer models has taken me so far that it almost feels unreal.

Based on our philosophy of anti-scarcity, we are figuring out how to build a solution that will not be crushed by the next AI update from a billion-dollar-funded LLM company. That is exactly what happened to us in the past. We were building something that would have been used by a million people, but come the next month, the whole project was crushed and trashed completely. What we considered our IP was destroyed, in and out, by a 10-minute video announcing a new LLM update.

I hesitate to call anything IP today. Whatever you are proud of, whatever you have built with your stamina, it is one update away from going to zero, especially when working on agentic AI. Bridging offline with online is still a safer business today; I am afraid to say that, even now, that is where most of the VC capital is if one is aiming for it.

From a systems view, the instability arises because most applications are dependent layers built on proprietary LLM APIs. They lack adaptive retraining or data ownership. Without localised fine-tuning or modular fallback models, the moment an upstream model shifts its embeddings or token semantics, the downstream system collapses. The half-life of a working AI product has shrunk from years to months, making persistent IP extremely fragile unless you own the underlying infrastructure or datasets. True resilience will only come from hybrid architectures: partial local inference, data versioning, and model-agnostic orchestration layers.

Concluding: were we having fun building? Yes. Were we sad that it broke? Yes. Did we smile after that? Yes. Are we continuing building?
Not sure. We are shifting the focus to something that makes a change on a greater level, using this knowledge to solve everyday problems faced by millions of people today.

PS - mind the language, written by a human.

Priyanka Jain • 3mo (edited)

AI doesn’t fail silently — it fails inconsistently.

When the Same AI Prompt Gave Five Different Answers — and Why AI Evals Matter

We were days away from a critical POC. Same AI prompt. Five different answers. The team was confident — until the model started improvising. We added context and enforced output structure. We even locked the AI’s randomness dial — hoping it would stop. It didn’t.

That’s when it hit us: we didn’t have a prompt problem — we had a trust problem.

Trust and reliability remain the biggest blockers in AI adoption. Not cost. Not capability. Trust. That’s why AI evals matter. They don’t just test models — they prove consistency, expose drift, and build confidence. Even though the term sounds new, the principle isn’t: trust comes from repeatable results and measurable reliability.

Because in the GenAI era, you can’t scale what you can’t trust. And you can’t trust what you can’t evaluate and measure.

https://lnkd.in/eVuGCtkk

#AILeadership #EvalOps #TrustInAI #GenAI #AIProductManagement #AIAdoption #AIEvals

When the Same AI Prompt Gave Five Different Answers (medium.com)

Superversive™ • 2mo

“The companies that consistently succeed have learned to curate their datasets with the same rigor they apply to their models. They deliberately seek out and label the hard cases: the scratches that barely register on a part, the rare disease presentation in a medical image, the one-in-a-thousand lighting condition on a production line, or the pedestrian darting out from between parked cars at dusk.
These are the cases that break models in deployment—and the cases that separate an adequate system from a production-ready one.”

“This is why #DataQuality is quickly becoming the real competitive advantage in visual #AI. Smart companies aren’t chasing sheer volume; they’re investing in tools to measure, curate, and continuously improve their datasets.”

The hidden data problem killing enterprise AI projects (fastcompany.com)

Ankit Gubrani • 3mo

Ever asked an AI model a question about your company’s data and gotten a confident but totally wrong answer? You’re not alone. Fine-tuning? Too expensive. Prompt engineering? Hit or miss.

That’s where RAG (Retrieval-Augmented Generation) comes in. It bridges the gap between data and AI, making answers factual, reliable, and actually useful.

I’m excited to share the first blog in my new series, “AI Building Blocks for the Modern Web,” where I’ll break down core AI concepts and how they apply to real-world web and marketing platforms.

This first post covers:
➔ What RAG is
➔ How it works under the hood
➔ Why it’s becoming the backbone of production AI systems

Read it here: https://lnkd.in/gRX2jjdn

#AI #GenerativeAI #LLM #RAG #WebDevelopment #MarTech

RAG: The Missing Link Between your Data and AI That actually Works (codebrains.co.in)

Rima Mittal • 3mo

RAG is used to enhance LLMs by retrieving relevant external or private data in real time to generate more accurate, up-to-date, and context-aware responses. It helps overcome the limitations of static model knowledge, reduces hallucinations, and allows the use of domain-specific or proprietary content without retraining the model.
RAG is ideal for tasks like document Q&A, customer support, enterprise search, and legal or medical assistants—anywhere reliable and grounded answers are needed.

Reposted: Ankit Gubrani (Staff Software Engineer @ Twilio | Exploring the Intersection of AI & Web | AEM & GenAI | Ex-Twitter & PlayStation) • 3mo (see the post above).
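The RAG loop described in the two posts above (retrieve relevant context, then generate a grounded answer) can be sketched minimally. Everything here is illustrative: the "embeddings" are toy bag-of-words vectors, the document store is a two-item list, and `generate` is a hypothetical stand-in for the LLM call. A real system would use an embedding model and a vector database such as ChromaDB.

```python
# Toy RAG sketch: retrieve the most relevant snippet by similarity, then
# ground the "generation" in it. The bag-of-words "embeddings" and the
# generate() stub are stand-ins for a real embedding model and LLM call.
from collections import Counter
import math

DOCS = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday through Friday.",
]

def embed(text: str) -> Counter:
    # Trivial bag-of-words "embedding".
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str) -> str:
    # Return the document most similar to the query.
    return max(DOCS, key=lambda d: cosine(embed(query), embed(d)))

def generate(query: str, context: str) -> str:
    # Hypothetical stand-in for an LLM call grounded in the context.
    return f"Based on our docs: {context}"

query = "How long do refunds take?"
print(generate(query, retrieve(query)))
```

The grounding is the whole point: the answer is constrained by the retrieved snippet rather than by whatever the model's static training data happens to contain.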