Summarizer

LLM Input

llm/065c6e83-d0d5-4aca-be3d-92768a8a3506/topic-10-61ca1b6e-984c-40d3-864e-6cab27b04fc2-input.json

prompt

The following is content for you to summarize. Do not respond to the comments—summarize them.

<topic>
Determinism and Reproducibility # Concerns about non-deterministic LLM outputs. Discussion of whether software engineering can accommodate probabilistic tools. Comparisons to gambling and slot machines.
</topic>

<comments_about_topic>
1. Reproducing experimental results across models and vendors is trivial and cheap nowadays.

2. Except that merely surfacing them changes their behavior, like how you add that one printf() call and now your heisenbug is suddenly nonexistent.

3. Both can be true. I have personally experienced both.

On some problems, AI surprised me immensely with fast, elegant, efficient solutions and problem solving. I've also experienced AI doing totally absurd things that ended up taking multiple times longer than if I had done it manually. Sometimes in the same project.

4. You will never convince me that this isn't confirmation bias, or the equivalent of a slot machine player thinking the order in which they push buttons impacts the output, or some other gambler-esque superstition.

These tools are literally designed to make people behave like gamblers. And it's working, except the house in this case takes the money you give them and lights it on fire.

5. "The equivalent of saying, which slot machine were you sitting at It'll make me money"

6. That’s because it’s superstition.

Unless someone can come up with some kind of rigorous statistics on what the effect of this kind of priming is, it seems no better than claiming that sacrificing your firstborn will please the sun god into giving us a bountiful harvest next year.

Sure, maybe this supposed deity really is this insecure and needs a jolly good pep talk every time he wakes up. Or maybe you're just suffering from magical thinking that your incantations had any effect on the random-variable word machine.

The thing is, you could actually prove it. It's an optimization problem: you have a model, you can generate the statistics. But no one, as far as I can tell, has been terribly forthcoming with that, either because those that have tried have decided to keep their magic spells secret, or because it doesn't really work.

If it did work, well, the oldest trick in computer science is writing compilers; I suppose we will just have to write an English-to-pedantry compiler.
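For what it's worth, the experiment this comment asks for is easy to sketch. Below is a minimal, hypothetical harness in Python: `run_trial` is a placeholder (simulated here with made-up pass rates so the sketch runs) for actually executing an agent on a fixed task and checking the result, and a pooled two-proportion z-test says whether a "priming" prompt beats a plain one by more than chance:

```python
import math
import random

def run_trial(prompt: str, rng: random.Random) -> bool:
    # PLACEHOLDER: in a real experiment, run your agent on a fixed task
    # with this prompt and return whether the output passes your tests.
    # Simulated here with made-up pass rates so the sketch is runnable.
    base_rate = 0.55 if "world-class" in prompt else 0.50
    return rng.random() < base_rate

def success_rate(prompt: str, n: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    return sum(run_trial(prompt, rng) for _ in range(n)) / n

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    # Pooled two-proportion z-test: is the difference in pass rates
    # larger than sampling noise alone would explain?
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se else 0.0

n = 500
plain = success_rate("Fix the failing test.", n, seed=1)
primed = success_rate("You are a world-class engineer. Fix the failing test.", n, seed=2)
z = two_proportion_z(primed, plain, n, n)
print(f"plain={plain:.1%} primed={primed:.1%} z={z:.2f}")
# |z| > 1.96 would reject "it's just superstition" at roughly p < 0.05.
```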

7. It's a wild time to be in software development. Nobody(1) actually knows what causes LLMs to do certain things; we just pray the prompt moves the probabilities the right way enough that it mostly does what we want. This used to be a field that prided itself on deterministic behavior and reproducibility.

Now? We have AGENTS.md files that look like a parent talking to a child, with all the bold, all-caps, double emphasis, just praying that's enough to be sure they run the commands you want them to be running.

(1) Outside of some core ML developers at the big model companies.

8. It’s like playing a fretless instrument to me.

Practice playing songs by ear, and after two weeks my brain has developed an inference model of where my fingers should go to hit any given pitch.

Do I have any idea how my brain’s model works? No! But it tickles a different part of my brain and I like it.

9. i have like the faintest vague thread of "maybe this actually checks out" in a way that has shit all to do with consciousness

sometimes internet arguments get messy; people die on their hills and double or triple down on internet message boards. since historic internet data composes a bit of what goes into an llm, would it make sense that bad-juju prompting sends it to some dark corners of its training data if implementations don't properly sanitize certain negative words/phrases?

in some ways llm stuff is a very odd mirror that haphazardly regurgitates things resulting from the many shades of gray we find in human qualities... but presents results as matter-of-fact. the number of internet posts with possible code solutions and more, where people egotistically die on their respective hills, that have made it into these models is probably off the charts, even if the original content was a far cry from a sensible solution.

all in all llms really do introduce quite a bit of a black box. lots of benefits, but a ton of unknowns, and one must be hypervigilant to the possible pitfalls of these things... but more importantly be self-aware enough to understand the possible pitfalls they introduce to the person using them. they quite possibly capitalize, dangerously, on everyone's innate need to be a valued contributor. it's really common now to see so many people biting off more than they can chew, oftentimes lacking the foundations that would've normally had a competent engineer pumping the brakes. i have a lot of respect/appreciation for people who might be doing a bit of claude here and there but are upfront about it in their readme and very plainly state not to have any high expectations, because _they_ are aware of the risks involved. i also want to commend everyone who writes their own damn readme.md.

these things are, for better or worse, great at causing people to barrel forward through 'problem solving', which presents quite a bit of gray area on whether the problem is actually solved / how you can be sure / whether you understand how the fix/solution/implementation works (in many cases, no). this is why exceptional software engineers can use this technology insanely proficiently as a supplementary worker of sorts, while others find themselves in a design/architect seat for the first time and call tons of terrible shots throughout the course of whatever it is they are building.

i'd at least like to call out that people who feel like they "can do everything on their own and don't need to rely on anyone" anymore seem to have lost the plot entirely. there are facets of that statement that might be true, but less collaboration, especially in organizations, is quite frankly the first step some people take towards becoming delusional. and that is always a really sad state of affairs to watch unfold. doing stuff in a vacuum is fun on your own time, but forcing others to just accept things you built in a vacuum when you're in any sort of team structure is insanely immature and honestly very destructive/risky.

i would like to think absolutely no one here is surprised that some sub-orgs at Microsoft force people to use copilot or be fired; a very dangerous path they tread there as they bodyslam into place solutions that are not well understood. suddenly all the decisions leadership at many companies have made to once again bring back a before-times era of offshoring make sense: they think that with these technologies existing, the subordinate culture of overseas workers combined with these techs will deliver solutions no one can push back on. great savings, and also no one will say no.

10. I know why it works, to varying and unmeasurable degrees of success. Just like if I poke a bull with a sharp stick, I know it's gonna get its attention. It might choose to run away from me in one of any number of directions, or it might decide to turn around and gore me to death. I can't answer that question with any more certainty than you can.

The system is inherently non-deterministic. Just because you can guide it a bit, doesn't mean you can predict outcomes.

11. > The system is inherently non-deterministic.

The system isn't randomly non-deterministic; it is statistically probabilistic.

The next-token prediction and the attention mechanism are actually a rigorous, deterministic mathematical process. The variation in output comes from how we sample from that distribution, and from the temperature used to reshape it before sampling. Because the underlying probabilities are mathematically calculated, the system's behavior remains highly predictable within statistical bounds.

Yes, it's a departure from the fully deterministic systems we're used to. But that's no different from many real-world systems: weather, biology, robotics, quantum mechanics. Even the computer you're reading this on right now is full of probabilistic processes, abstracted away through sigmoid-like functions that push the extremes to 0s and 1s.
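To make that concrete, here is a minimal sketch of the sampling step this comment describes (the logits are made up, not from any real model): everything up to the probability distribution is fully deterministic, and temperature only reshapes that distribution before a random draw is taken from it.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample one token from a logit distribution.

    temperature < 1 sharpens the distribution (more repeatable picks);
    temperature > 1 flattens it (more varied picks); as temperature
    approaches 0 this approaches greedy argmax decoding.
    """
    rng = rng or random.Random()
    # Scale logits by temperature, then softmax into probabilities.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    peak = max(scaled.values())  # subtract max for numerical stability
    exps = {tok: math.exp(l - peak) for tok, l in scaled.items()}
    total = sum(exps.values())
    # Draw from the resulting categorical distribution.
    r, cum = rng.random() * total, 0.0
    for tok, e in exps.items():
        cum += e
        if r <= cum:
            return tok
    return tok  # guard against floating-point rounding

# Hypothetical logits for the token after "The weather is" -- purely
# illustrative values, just to show the mechanics.
logits = {"sunny": 2.0, "rainy": 1.5, "purple": -3.0}
print(sample_next_token(logits, temperature=0.7, rng=random.Random(42)))
```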

12. A lot of words to say that for all intents and purposes... it's nondeterministic.

> Yes, it's a departure from the fully deterministic systems we're used to.

A system either produces the same output given the same input[1], or doesn't.

LLMs are nondeterministic by design. Sure, you can configure them with a zero temperature, a static seed, and so on, but they're of no use to anyone in that configuration. The nondeterminism is what gives them the illusion of "creativity", and other useful properties.

Classical computers, compilers, and programming languages are deterministic by design, even if they do contain complex logic that may affect their output in unpredictable ways. There's a world of difference.

[1]: Barring misbehavior due to malfunction, corruption or freak events of nature (cosmic rays, etc.).

13. Humans are nondeterministic.

So this is a moot point and a futile exercise in arguing semantics.

14. But we can predict the outcomes, though. That's what we're saying, and it's true. Maybe not 100% of the time, but maybe it helps a significant amount of the time and that's what matters.

Is it engineering? Maybe not. But neither is knowing how to talk to junior developers so they're productive and don't feel bad. The engineering is at other levels.

15. > But we can predict the outcomes [...] Maybe not 100% of the time

So 60% of the time, it works every time.

... This fucking industry.

16. You've missed the point. This isn't engineering, it's gambling.

You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time. Just like you can roll dice the exact same way on the exact same table and you'll get two totally different results. People are doing their best to constrain that behavior by layering stuff on top, but the foundational tech is flawed (or at least ill suited for this use case).

That's not to say that AI isn't helpful. It certainly is. But when you are basically begging your tools to please do what you want with magic incantations, we've lost the fucking plot somewhere.

17. I think that's a pretty bold claim, that it'd be different every time. I'd think the output would converge on a small set of functionally equivalent designs, given sufficiently rigorous requirements.

And even a human engineer might not solve a problem the same way twice in a row, based on changes in recent inspirations or tech obsessions. What's the difference, as long as it passes review and does the job?

18. > You could take the exact same documents, prompts, and whatever other bullshit, run it on the exact same agent backed by the exact same model, and get different results every single time

This is more of an implementation detail, done this way to get better results. A neural network with fixed weights (and deterministic floating-point operations) returning a probability distribution, where you use a pseudorandom generator with a fixed seed called recursively, will always return the same output for the same input.
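A toy illustration of that point, with a stand-in "model" that is just fixed deterministic arithmetic over token ids rather than a real network: because both the "weights" and the sampler's seed are fixed, the decode loop prints the identical sequence on every run.

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat"]

def toy_model(token_ids):
    # Stand-in for a fixed-weight network: deterministically maps a
    # context to a probability distribution over the vocabulary.
    h = sum((i + 1) * t for i, t in enumerate(token_ids)) % 97
    weights = [(h * (j + 3)) % 11 + 1 for j in range(len(VOCAB))]
    total = sum(weights)
    return [w / total for w in weights]

def decode(prompt_ids, steps, seed):
    rng = random.Random(seed)  # fixed seed => identical draws every run
    ids = list(prompt_ids)
    for _ in range(steps):
        probs = toy_model(ids)  # the model is fed back its own output
        ids.append(rng.choices(range(len(VOCAB)), weights=probs)[0])
    return " ".join(VOCAB[i] for i in ids)

# Same input, same weights, same seed -> byte-identical output.
print(decode([0, 1], steps=5, seed=123))
print(decode([0, 1], steps=5, seed=123))
```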

19. Exactly; the original commenter seems determined to write off AI as "just not as good as me".

The original article is, to me, seemingly not that novel. Not because it's a trite example, but because I've begun to experience massive gains from following the same basic premise as the article. And I can't believe there are others who aren't using it like this.

I iterate the plan until it's seemingly deterministic, then I strip the plan of implementation, and re-write it following a TDD approach. Then I read all specs, and generate all the code to red->green the tests.

If this commenter is too good for that, then it's that attitude that'll keep him stuck. I already feel like my project backlog is achievable this year.

20. Strongly agree about the deterministic part. Even more important than a good design, the plan must not show any doubt, whether in the form of open questions or weasel words. 95% of the time those vague words mean I didn't think something through, and the model will do something hideous in order to make the plan work.

21. > Read deeply, write a plan, annotate the plan until it’s right, then let Claude execute the whole thing without stopping, checking types along the way.

As others have already noted, this workflow is exactly what the Google Antigravity agent (based on Visual Studio Code) was created for. Antigravity even includes specialized UI for a user to annotate selected portions of an LLM-generated plan before iterating on it.

One significant downside to Antigravity I have found so far is that even though it will properly infer a certain technical requirement and clearly note it in the plan it generates (for example, "this business reporting column needs to use a weighted average"), it will sometimes quietly downgrade such a specialized requirement (for example, to a non-weighted average) without even creating an appropriate "WARNING:" comment in the generated code. Especially so when the relevant codebase already includes a similar, but not exactly appropriate, API. My repeated prompts to ALWAYS ask about ANY implementation ambiguities WHATSOEVER go unanswered.

From what I gather, Claude Code seems to be better than other agents at always remembering to query the user about implementation ambiguities, so maybe I will give Claude Code a shot over Antigravity.

22. The author is quite far on their journey but would benefit from writing simple scripts to enforce invariants in their codebase. Invariant broken? Script exits with a non-zero exit code and some output that tells the agent how to address the problem. Scripts are deterministic, run in milliseconds, and use zero tokens. Put them in husky or pre-commit, install the git hooks, and your agent won’t be able to commit without all your scripts succeeding.

And "Don't change this function signature" should be enforced not by anticipating that your coding agent might change this function signature "so we better warn it not to," but rather via an end-to-end test that fails if the function signature is changed (because the other code that needs it not to change now has an error). That takes the author out of the loop: they need not watch for the change in order to issue said correction, and can instead sip coffee while the agent observes that it caused a test failure and then corrects it without intervention, probably by rolling back the function signature change and changing something else.
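As a sketch of the kind of invariant script this comment describes (the invariant, the `src/` layout, and the suggested replacement module are all hypothetical), a pre-commit check might look like:

```python
#!/usr/bin/env python3
"""Hypothetical invariant: nothing under src/ may import the deprecated
legacy_db package. Exits non-zero with output the agent can act on."""
import pathlib
import sys

violations = []
for path in pathlib.Path("src").rglob("*.py"):
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if line.strip().startswith(("import legacy_db", "from legacy_db")):
            violations.append(f"{path}:{lineno}: imports deprecated legacy_db")

if violations:
    print("Invariant broken: legacy_db is deprecated; use db.client instead.")
    print("\n".join(violations))
    sys.exit(1)  # non-zero exit blocks the commit via the git hook

sys.exit(0)
```

Registered with husky or pre-commit, the non-zero exit blocks the commit, and the printed message gives the agent a concrete, deterministic instruction on what to fix.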

23. It's worrying to me that nobody really knows how LLMs work. We create prompts with or without certain words and hope for the best. That's my perspective, anyway.

24. My workflow is a bit different.

* I ask the LLM for its understanding of a topic or an existing feature in code. It's not really planning; it's more like understanding the model first.

* Then based on its understanding, I can decide how great or small to scope something for the LLM

* An LLM showing good understanding can deal with a big task fairly well.

* An LLM showing bad understanding still needs to be prompted to get it right

* What helps a lot is reference implementations. Either I have existing code that serves as the reference or I ask for a reference and I review.

A few folks at my work do it OP's way, but here are my arguments for not doing it this way:

* Nobody is measuring the amount of slop within the plan. We only judge the implementation at the end.

* It's still non-deterministic: folks will have different experiences using OP's methods. If Claude updates its model, that outdates OP's suggestions by making them either better or worse. We don't evaluate when things get better; we only focus on things that haven't gone well.

* It's very token-heavy. LLM providers insist that you use many tokens to get the task done; it's in their best interest to get you to do this. For me, LLMs should be powerful enough to understand context with minimal tokens because of the investment in model training.

Both ways get the task done, and it just comes down to my preference for now.

For me, I treat the LLM as model training + post-processing + input tokens = output tokens. I don't think this is the best way to do non-deterministic software development. For me, we're still trying to shoehorn "old" deterministic programming into a non-deterministic LLM.
</comments_about_topic>

Write a concise, engaging paragraph (3-5 sentences) summarizing the key points and perspectives in these comments about the topic. Focus on the most interesting viewpoints. Do not use bullet points—write flowing prose.

topic

Determinism and Reproducibility # Concerns about non-deterministic LLM outputs. Discussion of whether software engineering can accommodate probabilistic tools. Comparisons to gambling and slot machines.

commentCount

24
