[Theme: we don’t know how deep AI intellectual capacity is, and we don’t know how many tasks require that depth. Of all the things our dinner identified as missing from current AIs, how hard will each be to supply, and might some of them be amenable to prompting + scaffolding + a touch of RL?]

Build on https://lemmata.substack.com/p/what-i-wish-i-knew-about-frontiermath to write about the need to measure creativity, and fuzzy capabilities more generally (also build on my initial writeup from our first dinner). Write a post questioning how real-world tasks break down into background, creativity, and execution. Highlight the questions about how much creativity the FrontierMath (FM) problems require.

My suspicion is that a significant chunk of FrontierMath problems can be solved by applying advanced mathematical techniques in relatively straightforward ways. If anything, this might obscure their difficulty to humans: most people don’t have the right knowledge, and without the right knowledge the problems seem impossible; but with the right knowledge, they aren’t so bad.

https://amistrongeryet.substack.com/p/were-finding-out-what-humans-are/comment/95456981

Ethan Mollick (@emollick) posted at 11:06 PM on Tue, Feb 18, 2025: As I have written many times, AI is not naturally a great tutor, it offers explanations but, without proper prompting, tends to tell your answers rather than engaging you in the process of understanding. I find explanations on demand very promising, but they aren't there yet. (https://x.com/emollick/status/1892108434171949321?t=JRIExSHeGTJitIUwXGVCDQ&s=03)

Taren: feels like he does say that this is basically a prompting/product problem?

Steve: I read this as "the student has to find the right questions to ask". Other parts of his tweet sound discouraging ("not naturally a great tutor", "they aren't there yet"). But I guess it is ambiguous.

From https://erictopol.substack.com/p/when-doctors-with-ai-are-outperformed: When A.I. systems attempted to gather patient information through direct interviews, their diagnostic accuracy plummeted — in one case from 82 percent to 63 percent. The study revealed that A.I. still struggles with guiding natural conversations and knowing which follow-up questions will yield crucial diagnostic information.

Taren: this is a great example, and/but i wonder how much of this is a product/prompting problem vs a capabilities problem... feels like a naive user of AI setting up the interview process, vs an expert user, could have a very different outcome here -- and hard to say which type it was in this case?

Note the difficulty our first-dinner participants had in deciding whether a given capability gap can be closed through prompting, data/scale, or architectural changes. Taren notes that all three routes could be viable, on different time scales.

Sigal Samuel (@SigalSamuel) posted at 10:00 AM on Fri, Feb 21, 2025:
The big AI story of the past 6 months is: Companies now claim that their AI models are capable of genuine reasoning.

Is that true?

I found that the best answer lies in between hype and skepticism.

https://t.co/b3ZuMjO0ZJ Thanks to @ajeya_cotra @RyanPGreenblatt @MelMitchell1
(https://x.com/SigalSamuel/status/1892997861886820474?t=x29SrzUJR8dq9mwmmnaiDQ&s=03)

https://arxiv.org/abs/2410.06992

Our analysis reveals some critical issues with the SWE-bench dataset: 1) 32.67% of the successful patches involve cheating as the solutions were directly provided in the issue report or the comments. We refer to as solution leakage problem. 2) 31.08% of the passed patches are suspicious patches due to weak test cases, i.e., the tests were not adequate to verify the correctness of a patch. When we filtered out these problematic issues, the resolution rate of SWE-Agent+GPT-4 dropped from 12.47% to 3.97%. We also observed that the same data quality issues also exist in the two variants of SWE-bench, i.e., SWE-bench Lite and SWE-Bench Verified. In addition, over 94% of the issues were created before LLM's knowledge cutoff dates, posing potential data leakage issues.

Review https://epochai.substack.com/p/ai-progress-is-about-to-speed-up. Note reference to Moravec's paradox. Also note expectation that capabilities which are weak today will continue to be weak.

https://aidanmclaughlin.notion.site/reasoners-problem

https://x.com/AndrewCritchPhD/status/1891887600102932629?t=0UjiKsyU97miKXKTPaKllg&s=03

[Zvi] Have o1-Pro give you a prompt to have Deep Research do Deep Research on Deep Research prompting, use that to create prompt templates for Deep Research. The results are here in case you want to try the final form.

https://news.ycombinator.com/item?id=43169586: Most of the time, most of the devs I know, including myself, are not really creating novelty with the code itself, but with the product.

A recent Matt Levine column talks about how humans do better than current AIs and traditional ML in out-of-distribution situations. See the section that ends with:

The stereotype about algorithmic trading and investing is something like “algorithms tend to learn on historical data and are poorly suited to dealing with regime changes, while humans are more flexible and have better gut instincts to handle sharp breaks with the past.” I have often been skeptical of that stereotype. Humans also learn on historical data, and less of it: If you’ve been trading for 10 years, in some sense you only really have access to 10 years of market history, while a computer can hold the last 200 years of data in its mind.

But Sasha Gill makes me rethink that. She has roughly zero years of market history, she barely knows what a yard of cable is, but she’s keeping an eye on Truth Social. She’s handling the regime change. If you are a computer trained on recent historical data, a sharp increase in FX volatility might catch you flat-footed. If you’re a human trader straight out of university, you’ll be like “ah yes time to fire up Truth Social.” The algorithm has never even heard of Truth Social! Good time to be a human FX trader.

https://x.com/littmath/status/1898461323391815820

First section of https://www.theintrinsicperspective.com/p/ai-plays-pokemon-but-so-does-teslas

https://x.com/slatestarcodex/status/1896457193215742274

Ethan Mollick (@emollick) posted at 11:24 AM on Sun, Mar 09, 2025: If it turns out LLMs are only capable of recombinatory innovation (finding novel connections among existing knowledge), that would still be very useful. Most innovation is recombination and one of the big issues in science is that fields are too vast for scientists to bridge them https://t.co/XrZUlbKkek (https://x.com/emollick/status/1898802108926661012?t=5xHM6k-VgnlfCiewN4E6dA&s=03)