Summarizer

Pelican on Bicycle Benchmark

Simon Willison's informal SVG generation test: discussion of whether models are being specifically trained on it, quality improvements in the latest models, and debate over its validity as a casual benchmark


Simon Willison’s "Pelican on Bicycle" SVG benchmark has evolved from a lighthearted personal test into a controversial lightning rod for debating whether AI labs are "benchmaxxing" by specifically training on famous informal prompts. While the latest model results show unprecedented technical coherence and artistic quality, critics argue that the benchmark's visibility creates a perverse incentive for companies to curate specialized training data, potentially misleading the public about a model’s general reasoning. Despite these concerns about manipulation, many enthusiasts maintain that the test remains a valuable ritual, arguing that its validity is easily defended by swapping the pelican for an "ocelot on a skateboard" to see if the model's underlying spatial logic holds up.

45 comments tagged with this topic

View on HN · Topics
Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks. The pelican benchmark is a good example, because it's been representative of models' ability to generate SVGs, not just pelicans on bikes.
View on HN · Topics
If you want that to get better, you need to produce a 3D model benchmark and popularize it. You can start with a pelican riding a bicycle with a working bicycle.
View on HN · Topics
Building a benchmark is a great idea, thanks. Maybe I will have a couple of days to spend on this soon.
View on HN · Topics
The pelican riding a bicycle is excellent. I think it's the best I've seen. https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/
View on HN · Topics
So, you've said multiple times in the past that you're not concerned about AI labs training for this specific test because if they did, it would be so obviously incongruous that you'd easily spot the manipulation and call them out. Which tbh has never really sat right with me, seemingly placing way too much confidence in your ability to differentiate organic vs. manipulated output in a way I don't think any human could be expected to. To me, this example is an extremely neat and professional SVG and so far ahead it almost seems too good to be true. But like with every previous model, you don't seem to have the slightest amount of skepticism in your review. I don't think I truly believe Google cheated here, but it's so good it does therefore make me question whether there could ever be an example of a pelican SVG in the future that actually could trigger your BS detector? I know you say it's just a fun/dumb benchmark that's not super important, but you're easily in the top 3 most well known AI "influencers" whose opinion/reviews about model releases carry a lot of weight, providing a lot of incentive with trillions of dollars flying around. Are you still not at all concerned by the amount of attention this benchmark receives now/your risk of unwittingly being manipulated?
View on HN · Topics
The other SVGs I tried from my private collection of prompts were all similarly impressive.
View on HN · Topics
Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...
View on HN · Topics
This benchmark outcome is actually really impressive given the difficulty of this task. It shows that this particular model manages to "think" coherently and maintain useful information in its context for what has to be an insane overall amount of tokens, likely across parallel "thinking" chains. Likely also has access to SVG-rendering tools and can "see" and iterate on the result via multimodal input.
View on HN · Topics
Wow. I wonder how it would do with pure CSS a la https://diana-adrianne.com/
View on HN · Topics
We've reached PGI
View on HN · Topics
I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.
View on HN · Topics
How likely is it that this problem is already in the training set by now?
View on HN · Topics
If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
View on HN · Topics
Why would they train on that? Why not just hire someone to make a few examples.
View on HN · Topics
I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
View on HN · Topics
Would it not be better to have 100 such tests, "Pelican on bicycle", "Tiger on stilts"..., and generate them all for every new model but only release a new one each time? That way you could show progression across all models, and attempts at benchmaxxing would be more obvious. Given the crazy money and vying for supremacy among AI companies right now, it does seem naive to believe that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards", but without a back catalog of ocelots on skateboards to publish it's one data point and leaves the AI companies with too much plausible deniability. The pelicans-on-bicycles is a bit of fun for you (and us!) but it has become a measure of the quality of models, so it's serious business for them. There is an asymmetry of incentives and a high risk you are being their useful idiot. Sorry to be blunt.
View on HN · Topics
But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.
View on HN · Topics
The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
View on HN · Topics
When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.
View on HN · Topics
Huh? AI labs are routinely spending millions to billions to various 3rd party contractors specializing in creating/labeling/verifying specialized content for pre/post-training. This would just be one more checkbox buried in hundreds of pages of requests, and compared to plenty of other ethical grey areas like copyright laundering with actual legal implications, leaking that someone was asked to create a few dozen pelican images seems like it would be at the very bottom of the list of reputational risks.
View on HN · Topics
Well, since we're all talking about sourcing training material to "benchmaxx" for social proof, and not litigating the whole "AI bubble" debate, just the entire cottage industry of data curation firms:
https://scale.com/data-engine
https://www.appen.com/llm-training-data
https://www.cogitotech.com/generative-ai/
https://www.telusdigital.com/solutions/data-for-ai-training/...
https://www.nexdata.ai/industries/generative-ai
P.S. Google Comms would have been consulted re putting a pelican in the I/O keynote :-) https://x.com/simonw/status/1924909405906338033
View on HN · Topics
The embarrassment of getting caught doing that would be expensive.
View on HN · Topics
For every combination of animal and vehicle? Very unlikely. The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
View on HN · Topics
No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
View on HN · Topics
You can easily make an RLAIF loop:
- Take a list of n animals * m vehicles
- Ask an LLM to generate an SVG for each of the n*m options
- Generate a PNG from the SVG
- Ask a model with vision to grade the result
- Update your weights accordingly
No need for a human to draw the dataset, no need for a human to evaluate.
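The loop described above can be sketched as follows. Everything here is hypothetical: `generate_svg`, `rasterize`, and `grade_image` stand in for the LLM being trained, an SVG renderer, and a vision-model grader, and the scored dataset is what would feed into the RL weight-update step.

```python
from itertools import product

ANIMALS = ["pelican", "ocelot", "seahorse", "platypus"]
VEHICLES = ["bicycle", "skateboard", "unicycle", "glider"]

def generate_svg(prompt: str) -> str:
    """Stub for the LLM being trained; returns SVG markup."""
    return '<svg xmlns="http://www.w3.org/2000/svg"></svg>'

def rasterize(svg: str) -> bytes:
    """Stub for an SVG-to-PNG renderer."""
    return svg.encode()

def grade_image(png: bytes) -> float:
    """Stub for a vision model scoring the render from 0 to 1."""
    return 0.5

def build_reward_dataset() -> list[tuple[str, str, float]]:
    """One (prompt, svg, score) triple per animal x vehicle combination."""
    dataset = []
    for animal, vehicle in product(ANIMALS, VEHICLES):
        prompt = f"Generate an SVG of a {animal} riding a {vehicle}"
        svg = generate_svg(prompt)
        dataset.append((prompt, svg, grade_image(rasterize(svg))))
    return dataset  # fed into the RL weight-update step
```

With n animals and m vehicles the grid gives n*m scored examples per pass, which is the commenter's point: no human ever has to draw or judge a pelican.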
View on HN · Topics
More likely you would just train on emitting SVG for some description of a scene and create training data from raster images.
View on HN · Topics
You can always ask for a tyrannosaurus driving a tank.
View on HN · Topics
Is there a list of these for each model, that you've catalogued somewhere?
View on HN · Topics
At the moment that's mostly my tag page here but I really need to formalize it: https://simonwillison.net/tags/pelican-riding-a-bicycle/
View on HN · Topics
The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)
View on HN · Topics
It's not actually, look up some photos of the sun setting over the ocean. Here's an example: https://stockcake.com/i/sunset-over-ocean_1317824_81961
View on HN · Topics
That’s only if the sun is above the horizon entirely.
View on HN · Topics
No, it's not. https://stockcake.com/i/serene-ocean-sunset_1152191_440307
View on HN · Topics
Yes, it is. In that photo the sun is clearly above the horizon, the bottom half is just obscured by clouds.
View on HN · Topics
Do you still have to keep banging on about this relentlessly? It was sort of humorous for maybe the first 2 iterations; now it's tacky, cheesy, and just relentless self-promotion. Again, like I said before, it's also a terrible benchmark.
View on HN · Topics
I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic, and frankly fun. Your comment, however, is a little harsh. Why mad?
View on HN · Topics
It being a terrible benchmark is the bit.
View on HN · Topics
Eh, I find it more of a not-very-informative but lighthearted commentary
View on HN · Topics
It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go! Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
View on HN · Topics
It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
View on HN · Topics
Maybe you're a pro vector artist, but I couldn't create such a cool one myself in Illustrator, tbh.
View on HN · Topics
Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.
View on HN · Topics
Highly disagree. I was expecting something more realistic... the true test of what you are doing is how representative the thing is in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view. If it doesn't relate to the real world, then it will most likely have no real effect on the real economy. Pure and simple.
View on HN · Topics
I disagree. The task asks for an SVG, which is a vector format associated with line drawings, clipart, and cartoons. I think it's good that models are picking up on that context. In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and they look terrible. I also think the prompt itself, of a pelican on a bicycle, is unrealistic and cartoonish, so making a cartoon is a good way to solve the task.
View on HN · Topics
The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.