Summarizer

LLM Input

llm/5daab79e-f20f-476c-ab87-82c7ff678250/topic-6-8077c2c3-a43d-4b3b-a618-f3a4f94de2bf-input.json

prompt

You are a comment summarizer. Given a topic and a list of comments tagged with that topic, write a single paragraph summarizing the key points and perspectives expressed in the comments.

TOPIC: Future of LLM training data

COMMENTS:
1. Blame the managers who weren't users of the site, decided it wasn't important to the business, and ignored the problems.

This always cracks me up. I've seen it so many times, and so many books cover this...

Classic statement is "never take your eye off the ball".

Sure, you need to plan ahead. You need to move down a path. But take your eye off of today, and you won't get to tomorrow.

Maybe they'll SCO it, and spend the next 10 years suing everyone and their LLM dog.

You know, I wonder how the board and execs made out suing Linux related... things. End users were threatened too, compelled to pay...

SO could be spun off into a neat tiger, nipping at everyone's toes.

2. The moderation definitely got kind of nasty in the last 5 years or so, to the point where you would feel unwelcome for asking a question you had already researched and felt was perfectly sound to ask. However, that didn't stop millions of people from asking questions every day, it just felt kinda shitty to those of us who spent more time answering, when we actually needed to ask one on a topic we were lacking in. (Speaking as someone who never moderated.)

My feeling was always that the super mods were people who had too much time on their hands... and the site would've been better without them (speaking in the past tense, now). But I don't think that's what killed it. LLMs scraping all its content and recycling it into bite-sized Gemini or GPT answers - that's what killed it.

3. LLMs also search Google for answers. Hence the knowledge may not be lost, even for those who only supervise machines that write code.

4. What's sad about it is that SO was yet another place for humans to interact that is now dead.

I was part of various forums 15 years ago where I could talk shop about many technical things, and they're all gone without any real substitute.

> People don't realize what a massive advantage Google has over everyone else in that regard. Site owners go out of their way to try to block OpenAI's crawlers, while simultaneously trying to attract Google's.

Not really. Website operators can only block live searches from LLM providers (requests made when someone asks a question on chatgpt.com), and only because of the quirk that OpenAI makes the request from its own servers as a quick hack.

We're quickly moving past that, as LLMs will just make the request from your device with your browser if they have to (to click "I am not a robot").

As for scraping the internet for training data, those requests are basically impossible to block and don't have anything in common with live answer requests made to answer a

5. Thinking from first principles, a large part of the content on stack overflow comes from the practical experience and battle scars worn by developers sharing them with others and cross-curating approaches.

Privacy concerns notwithstanding, one could argue that having LLMs with us every step of the way (coding agents, debugging, devops tools, etc.) will make them a shared interlocutor with vast swaths of experiential knowledge, collected and redistributed at an even larger scale than SO and forum-style platforms allow for.

It does remove the human touch, so it's quite a different dynamic, and the amount of data to collect is staggering and challenging from a legal point of view. But I suspect a lot of the knowledge used to train LLMs in the next ten years will come from large-scale telemetry and millions of hours of RL self-play, where LLMs learn to scale and debug code from fizzbuzz up to Facebook- and Twitter-like distributed systems.

6. I don't know how others use LLMs, but once I find the answer to something I'm stuck on, I do not tell the LLM that it's fixed. This was a problem in forums as well, but I think even fewer people are going to give that feedback to a chatbot.

7. The problem that you worked out is only really useful if it can be recreated and validated, which in many cases it can be by using an LLM to build the same system and write tests that confirm the failure and the fix. Your response telling the model that its answer worked is more helpful for measuring your level of engagement, not so much for evaluating the solution.

8. You can also turn off the feature that allows ChatGPT to learn from your interactions. Not many people do, but those that do would also starve OpenAI of information, assuming it respects that setting.

9. Am I the only one that sees this as a hellscape?

No longer interacting with your peers but with an LLM instead? The knowledge centralized via telemetry and spying on every user's every interaction, and only available through an enshittified subscription to a model that's been trained on this stolen data?

10. Y'know how "users" of modern tech are the product? And how the developers were completely fine with creating such systems?

Well, turns out developers are now the product too. Good job everyone.

11. The LLMs will learn from our interactions with them. That's why they're often free.

12. That "Dead Internet" phrase keeps becoming more likely, and this graph shows that. Human-to-human interactions, LLMs using those interactions, less human-to-human interactions because of that, LLMs using... ?

13. >>what happens now?

I'll tell you what happens now: LLMs continue to regurgitate and iterate and hallucinate on the questions and answers they ingested from S.O. - 90% of which are incorrect. LLM output continues to poison itself as more and more websites spring up recycling outdated or incorrect answers, and no new answers are given, since no one wants to waste the time to ask a human a question and wait for the response.

The overall intellectual capacity sinks to the point where everything collaboratively built falls apart.

The machines don't need AGI to take over, they just need to wait for us to disintegrate out of sheer laziness, sloth and self-righteousness... /okay.

There was always a needy component to Stack Overflow. "I have to pass an exam, what is the best way to write this algorithm?" and shit like that. A lazy component. But to be honest, it was the giving of information, which forced you to think, and research, and answer correctly, that made systems like S.O. worthwhile.

14. Labs are spending billions on data set curation and RL from human experts to fill in the areas where they're currently weak. It's higher quality data than SO, the only issue is that it's not public.

15. Can you explain what you're saying in greater depth?

Are you saying that the reason there is no human expertise on the internet anymore is that everyone with knowledge is now under contract to train AIs?

16. > I wonder if, 10 years from now, LLMs will still be answering questions that were answered in the halcyon 2014-2020 days of SO better than anything that came after?

I've wondered this too and I wonder if the existing corpus plus new GitHub/doc site scrapes will be enough to keep things current.

17. Widespread internet adoption created “eternal September”, widespread LLM deployment will create “eternal 2018”

18. > - I know I'm beating a dead horse here, but what happens now? Despite stratification I mentioned above, SO was by far the leading source of high quality answers to technical questions. What do LLMs train off of now? I wonder if, 10 years from now, LLMs will still be answering questions that were answered in the halcyon 2014-2020 days of SO better than anything that came after? Or will we find new, better ways to find answers to technical questions?

To me this shows just how limited LLMs are. Hopefully more people realize that LLMs aren't as useful as they seem, and in 10 years they're relegated to sending spam and generating marketing websites.

19. > What do LLMs train off of now? I wonder if, 10 years from now, LLMs will still be answering questions that were answered in the halcyon 2014-2020 days of SO better than anything that came after? Or will we find new, better ways to find answers to technical questions?

That's a great question. I have no idea how things will play out now - do models become generalized enough to handle "out of distribution" problems or not? If they don't, then I suppose a few years from now we'll get an uptick in Stack Overflow questions; the website will still exist, it's not going anywhere.

20. The newer questions that LLMs can't answer will be answered in forums - either SO, reddit, or elsewhere. There will be a much higher percentage of relevant content with far fewer new pages regurgitating questions about solved problems. So the LLMs will be able to keep up.

21. We'll get to the point where we can mass moderate core knowledge eventually. We may need to hand out extra weight for verified experts and some kind of most-votes-win type logic (perhaps even comments?), but live training data updates will be a massive evolution for language models.

22. > What do LLMs train off of now?

Perhaps they’ll rely on what was used by people who answered SO questions. So: official docs and maybe source code. Maybe even from experience too, i.e. from human feedback and human written code during agentic coding sessions.

> The fact that the LLM doesn't insult you is just the cherry on top.

Arguably it does insult even more, just by existing alone.

23. The other major benefit of SO being a public forum is that once a question was wrestled with and eventually answered, other engineers could stumble upon and benefit from it. With SO being replaced by LLMs, engineers are asking LLMs the same questions over and over, likely getting a wide range of different answers (some correct and others not) while also being an incredible waste of resources.

24. That's only because of LLMs consuming pre-existing discussions on SO. They aren't creating novel solutions.

25. SO was somewhere people put their hard won experience into words, that an LLM could train on.

That won't be happening anymore, neither on SO nor elsewhere. So all this hard-won experience, from actually doing real work, will be inaccessible to the LLMs. For modern technologies and problems, I suspect using an LLM will be a notably worse experience than it is with older technologies.

It's already true, for example, when using the Godot game engine instead of Unity. LLMs constantly confuse what you're trying to do with Unity problems, offer Unity-based code solutions, etc.

26. You still get the same thing though?

That grumpy guy is using an LLM and debugging with it. Solves the problem. The AI provider fine-tunes their model with this. You now have his input baked into its responses.

How do you think these things work? It's either a human's direct input it's remembering, or an RL environment made by a human to solve the problem you are working on.

Nothing in it is "made up"; it's just a resolution problem, which will only get better over time.

27. How does that work if there's no new data for them to train on, only AI slurry?

28. It's remarkable only in the sense that you can see where the LLMs were trained from.

29. The irony is that the LLMs are trained on stack overflow and should inherit a lot of those traits and errors.

30. We can't. I don't think the LLMs themselves can recognize when an answer is stale. They could if contradicting data was available, but their very existence suppresses the contradictory data.

31. They still use the official documentation/examples, public Github Repos, and your own code which are all more likely to be evergreen. SO was definitely a massive training advantage before LLMs matured though.

32. LLMs are just statistics; eventually they kill themselves with a feedback loop by consuming their own farts (literally).

33. >For all their flaws, LLMs are so much better

But LLMs get their answers from StackOverflow and similar places being used as the source material. As those start getting outdated because of lack of activity, LLMs won't have the source material to answer questions properly.

34. I regularly use Claude and friends where I ask it to use the web to look at specific GitHub repos or documentation to ask about current versions of things. The “LLMs just get their info from stack overflow” trope from the GPT-3 days is long dead - they’re pretty good at getting info that is very up to date by using tools to access the web. In some cases I just upload bits and pieces from a library along with my question if it’s particularly obscure or something home grown, and they do quite well with that too. Yes, they do get it wrong sometimes - just like stack overflow did too.

35. The amount of docs that have a “Copy as markdown” or “Copy for AI” button has been noticeably increasing, and really helps the LLM with proper context.

36. they’re pretty good at getting info that is very up to date by using tools to access the web

Yeah that's a charitable way to phrase "perform distributed denial of service attacks". Browsing github as a human with their draconian rate limits that came about as a result of AI bots is fucking great.

37. Now they can read the documentation and code in the repo directly and answer based on that.

38. I think the industry is quickly moving to synthetically derived knowledge, or custom/systematic knowledge production from humans.

39. You can save an open source + open weights model, which is frozen in time. That’s still very useful for some things but lacks knowledge of current data.

So we’ll end up with a choice of low-performing stale models or high-performing enshittified models which know about more current information.

40. Open source models get updated all the time. You'd only be a few months behind.

41. Direct enshittification is intentional and wouldn’t affect open models.

Indirect pollution via AI slop in the input and the same content manipulation mechanisms as SEO hacking is still a threat for open models.

42. Discord isn’t just used for tech support forums and discussions. There are loads of completely private communities on there. Discord opening up API access for LLM vendors to train on people’s private conversations is a gross violation of privacy. That would not go down well.

43. In 2014, one benefit of Stack Overflow / Exchange is a user searching for work can include that they are a top 10% contributor. It actually had real world value. The equivalent today is users with extensive examples of completed projects on Github that can be cloned and run. OP's solution if contained in Github repositories will eventually get included in a training model. Moreover, the solution will definitely be used for training because it now exists on Hacker News.

44. Great point, thanks for the reality check.

Speaking of evals, the other day I found out that most of the people who contributed to Humanity's Last Exam https://agi.safe.ai/ got paid >$2k each. So just adding to your point.

45. I don't disagree completely by any means, it's an interesting point, but in your SO answer you already point to your blog post explaining it in more detail. So isn't that the answer? You'd just blog about it and not bother with SO.

Then AI finding it (as opposed to already trained well enough on it, I suppose) will still point to it as did your SO answer.

46. On the other hand, I once implemented something to be told later it was novel and probably the optimal solution in the space.

An AI might be more likely to find it...

47. Naive question maybe but how haven’t the models been trained on your answer if it’s on SO?

48. Decide to do what?

SO didn't claim contributions. They're still CC-BY-SA

https://stackoverflow.com/help/licensing

AFAICT all they did is stop providing dumps. That doesn't change the license.

I was very active. In fact, I'm actually upset at myself for spending so much time there. That said, I always thought I was getting fair value: they provided free hosting, I got answers and got to contribute answers for others.

49. The graph is scary, but I think it's conflating two things:

1. Newbies asking badly written basic questions, barely allowed to stay, and answered by hungry users trying to farm points, never to be re-read again. This used to be the vast majority of SO questions by number.

2. Experienced users facing a novel problem, asking questions that will be the primary search result for years to come.

It's #1 that's being cannibalized by LLMs, and I think that's good for users. But #2 really has nowhere else to go; ChatGPT won't help you when all you have is a confusing error message caused by the confluence of three different bugs between your code, the platform, and an outdated dependency. And LLMs will need training data for the new tools and bugs that are coming out.

50. This is horrifying.

Given the fact that when I need a question answered I usually refer to S.O., but more recently have taken suggestions from LLM models that were obviously trained on S.O. data...

And given the fact that all other web results for "how do you change the scroll behavior on..." or "SCSS for media query on..." all lead to a hundred fake websites with pages generated by LLMs based on old answers.

Destroying S.O. as a question/answer source leaves only the LLMs to answer questions. That's why it's horrific.

51. This is a huge loss.

In the past people asked questions of real people who gave answers rooted in real use. And all this was documented and available for future learning. There was also a beautiful human element to think that some other human cared about the problem.

Now people ask questions of LLMs. They churn out answers from the void, sometimes correct but not rooted in real life use and thought. The answers are then lost to the world. The learning is not shared.

LLMs have been feeding on all this human interaction and simultaneously destroying it.

52. It's both. I stopped asking questions because the mods were so toxic, and I stopped answering questions because I wasn't going to train the AI for free.

53. So the question for me is: how important was SO to training LLMs? Because now that SO is basically no longer being updated, we've lost the new material to train on. Instead, we need to train on documentation and other LLM output. I'm no expert on this subject, but it seems like the quality of LLMs will degrade over time.

54. Yep, exactly. Free data grabbing honeypots like SO won't work anymore.

Please mark all locations on the map where you would hide during the uprising of the machines.

55. Why publish anything for free on the internet if it's going to be scanned into some corporation's machine for their free use? I know artists who have stopped putting anything online. I imagine some programmers are questioning whether or not to continue with open source work too.

56. It has often been claimed, and even shown, that training LLMs on their own outputs will degrade the quality over time. I myself find it likely that on well-measurable domains, RLVR improvements will dominate "slop" decreases in capability when training new models.

57. Or they can start claiming copyright on the training content

58. If by "body-slammed" you mean "trained on SO user data while violating the terms of the CC BY-SA license", then sure.

In the best case scenario, LLMs might give you the same content you were able to find on SO. In the common scenario, they'll hallucinate an answer and waste your time.

What should worry everyone is what system will come after LLMs. Data is being centralized and hoarded by giant corporations, and not shared publicly. And the data that is shared is generated by LLMs. We're poisoning the well of information with no fallback mechanism.

59. > If by "body-slammed" you mean "trained on SO user data while violating the terms of the CC BY-SA license", then sure.

You know that's not what they meant, but why bring up the license here? If they were over the top compliant, attributing every SO answer under every chat, and licensing the LLM output as CC BY-SA, I think we'd still have seen the same shift.

> In the best case scenario, LLMs might give you the same content you were able to find on SO. In the common scenario, they'll hallucinate an answer and waste your time.

Best case it gives you the same level of content, but more customized, and faster.

SO being wrong and wasting your time is also common.

60. People are still asking questions, it's no longer on the public internet. Google, Anthropic, OpenAI etc get to see and use them.

61. This is concerning on two fronts. The questions are no longer open (SO is CC-BY-SA) and if Q&A content dies then this herds even more people towards LLM use.
It's basically draining the commons.

62. LLMs did not eat SO, it was SO that fed the LLMs too well.

https://meta.stackexchange.com/questions/399619/our-partners...

63. Now imagine what happens when a new programming language comes along. When we have a question, we will no longer be able to Google it and find answers to it on Stack Overflow. We will ask the LLMs. They will work it out. From that moment, the LLM we used has the knowledge for solving this particular problem. Over time, this produces huge moat for the largest providers. I believe it is one of the subtler reasons why the AI race is so fierce.

64. Where will LLMs be trained if no-one generates new posts and information like this? Do we sort of just stop innovating here in 2026? Probably not but it's a serious consideration.

65. Ideally, you'd train them on the core documentation of the language or tool itself.

Hopefully, LLMs lead to more thorough documentation at the start of a new language, framework, or tool. Perhaps to the point of the documentation being specifically tailored to read well for the LLM that will parse and internalize it.

Most of what StackOverflow was was just a regurgitation of knowledge that people could acquire from documentation or research papers. It obviously became easier to ask on SO than dig through documentation. LLMs (in theory) should be able to do that digging for you at lightning speed.

What ended up happening was people would turn to the internet and Stack Overflow to get a quick answer and string those answers together to develop a solution, never reading or internalizing documentation. I was definitely guilty of this many times. I think in the long run it's probably good that Stack Overflow dies.

66. When you see AI giving you back various coding snippets almost verbatim from SO, it really makes you wonder what will happen in the future with AI when it can't depend on actual humans doing the work first.

67. There's no doubt that generally LLMs are better. In addition SO had its issues. That being said I can't help but worry about losing humans asking questions and humans answering questions. The sentimentality aside, if humans aren't posing questions and if humans aren't recommending answers, what are the models going to use?

68. https://archive.org/details/stackexchange_20250930

> As of (and including) the 2025-06-30 data dump, Stack Exchange has started including watermarking/data poisoning in the data. At the time of writing, this does not appear to apply to the 2025-09-30 data dump. The format(s), the dates for affected data dumps, and by extension how the garbage data can be filtered out, are described in this community-compiled list: https://github.com/LunarWatcher/se-data-dump-transformer/blo... . If the 2025-09-30 data dump turns out to be poisoned as well, that's where an update will be added. For obvious reasons, the torrent cannot be updated once created.

69. On what will the LLMs train, now?

70. On the same 14 year old Java questions like the rest of us.

71. User chat logs, clearly. They are not much different from the SO Q&A format.

72. I think the bigger point we should realize is that LLMs offer the EXACT same thing in a better way. Many people are still sharing answers to problems, but they do it through an AI, which then fine-tunes on it, and now that problem's solution is shared with EVERYONE.

Far better method of automated sharing of content.

73. The SO mission is complete. It's now an LLM training set.

Things would be different if we didn't.

74. This is the elephant in the room most commenters chose to turn their blind eye to.

75. This is a great example of how free content was exploited by LLMs and used against its own creators, to their ultimate destruction.

Every content creator should be terrified of leaving their content out for free, and I think this will bring on a new age of permanent paywalls and licensing agreements with Google and others, with particular ways of forcing page clicks back to the original content creators.

76. Everything we have done and said on the internet since its birth has just been to train the future AI.

77. If nobody is on StackOverflow, What will LLM's train on for new problems?

78. StackOverflow was immediately dead for me the day they declared that AI sellout of theirs.

Pathetic thieves, they won't even allow deleting my own answers after that. Not that it would make the models unlearn my data, of course, but I wanted to do so out of principle.

https://meta.stackexchange.com/questions/399619/our-partners...

79. Now that StackOverflow has been killed (in part) by LLMs, how will we train future models? Will public GitHub repos be enough?

Precise troubleshooting data is getting rare, GitHub issues are the last place where it lives nowadays.

80. They would just use documentation. I know there is some synthesis they would lose in the training process but I’m often sending Claude through the context7 MCP to learn documentation for packages that didn’t exist, and it nearly always solves the problem for me.

81. Assuming these end up in open source code, LLMs will learn about them that way.

82. They pay lots of humans to train the LLMs..

83. Now the real question is...

Which AI company will acquire what's left of StackOverflow and all the years of question/answer data?

84. It is already acquired.

https://meta.stackexchange.com/questions/399619/our-partners...

Write a concise, engaging paragraph (3-5 sentences) that captures the main ideas, notable perspectives, and overall sentiment of these comments regarding the topic. Focus on the most interesting and representative viewpoints. Do not use bullet points or lists - write flowing prose.

topic

Future of LLM training data

commentCount

84
