Genesis Content: Where Does GenAI Get Its Next Meal?
Something about building GenAI LLMs bugs me. Before I begin, let me be clear: I am a supporter of AI technologies, particularly in science. Lately, however, a question keeps surfacing that I find hard to understand. GenAI promoters and sellers like to talk about “AI for Business” as a way to reduce costs and replace workers. Well, okay, all new technologies have that effect to some extent, but the GenAI industry seems to be touting GenAI as a replacement for the people who created the genesis content on which their technology was created. On the surface, this seems absurd; on closer inspection, it seems almost self-destructive. Talk about burning the ships while still at sea. There are many sectors where this question hits hard, but two specific topics of interest to me as a technical writer and sometimes programmer are content creation and programming.
GenAI Content Creation
What you are reading is content produced by a human (me). Some studies indicate that over 50% of new web content is being produced by GenAI and is often referred to as AI-slop. From a publishing perspective, the content is cheap, quickly generated, creates automatic “click bait” comments and headlines, and provides coverage of topics “with which a writer has no prior experience.” From a reader’s perspective, AI-slop is often overly generalized, repetitive content, lacking specifics, insights, or even sarcasm—with plenty of em dashes thrown in for good measure—and, in some cases, completely wrong.
Web users, particularly those who read articles, have noticed. GenAI seems to be used for blog posts, product descriptions, press releases, news summaries, and clickbait comments on clickbait articles generated by GenAI. Real user comments usually display strong distaste for such content. Beyond the formulaic AI filler content, a recent study reported by 404media.co found that a third of websites created since 2022 are AI-generated, based on the researchers’ analysis of data from the Internet Archive. The team published their findings online in a paper titled “The Impact of AI-Generated Text on the Internet.”
Another study, titled “More Articles Are Now Created by AI Than Humans,” reports just that. The authors do mention, however, that they think the trend is slowing. “While AI-generated articles grew dramatically after ChatGPT launched, we do not see that trend continuing. Instead, the proportion of AI-generated articles has remained relatively stable over the last 12 months. We hypothesize that this is because practitioners found that AI-generated articles do not perform well in search, as shown in a separate study.”
Figure 1: Trend showing percentage of AI content on the web (reproduced from https://graphite.io/five-percent/more-articles-are-now-created-by-ai-than-humans)
Presumably, search engines may be screening content for AI signatures and ranking it lower than content they judge to be written by real people. Interestingly, they may be using the same models that were used to create the content in the first place: a new “eat your own dog food” (AI-slop) kind of development.
Then again, this may not be the case. A recent X (Twitter) post by Nav Toor points to a recent study that found the following.
“Researchers sent the same resume to an AI hiring tool twice. Same qualifications. Same experience. Same skills. One version was written by a real human. The other was rewritten by ChatGPT. The AI picked the ChatGPT version 97.6% of the time…. It gets worse. The AIs do not just prefer AI over humans. They prefer themselves over other AIs. DeepSeek-V3 picked its own resumes 69% more often than LLaMA’s. GPT-4o picked its own 45% more often than LLaMA’s. Each model can recognize and reward its own dialect.”
This situation is basically resume roulette: you hope your AI-generated resume lands on the same GenAI tool you used to write it. Yikes.
A final example is the ultimate in GenAI-generated content: an infinite, hallucinated encyclopedia, “halupedia,” where every link leads to an entry that does not exist until you click it, at which point an LLM pretends it has always existed and writes it for you in the prose of a 19th-century scholarly press.
To summarize, GenAI is creating all kinds of content on the web (and elsewhere), not just simple filler content, but articles, resumes, blogs, and even an absurd hallucinated encyclopedia. The pre-GenAI content was written by coherent people (prior hallucinated content may date back to the ’60s, but I digress) who may be losing their jobs or, worse, be tasked with “humanizing/correcting” AI-slop. All this content now lives on the internet. Remind me again where these GenAI models get their training data? We will get to that, but first, a potentially bigger issue: code generation.
GenAI Programming
Recall that one of the L’s in LLM stands for Language. After training on large curated corpora of the English language, these models are good predictors of language responses to queries. Programming languages have their own grammar and syntax (more predictable than spoken or written language), so training on existing code should yield helpful programming tools.
Figure 2: The consequence of GenAI content reuse
So far, the consensus is that this expectation holds. As a matter of fact, it holds so well that Anthropic CEO Dario Amodei has publicly stated:
“Coding is going away first, then all of software engineering.”
He is not alone in his prognostication. Jensen Huang recently stated his desire for “no coding by engineers”:
“Nothing would give me more joy than if none of our engineers were coding at all,” Huang said. “And they were just purely solving undiscovered problems.”
Finally, Mark Zuckerberg from Meta has this to say:
“Probably in 2025, we at Meta, as well as the other companies that are basically working on this, are going to have an AI that can effectively be a sort of midlevel engineer that you have at your company that can write code.”
Now that the future of coding is settled, we can move on, or maybe not. Even with early reports of AI-coding success, there is this nagging question: If we deprecate all the coders, who will write the innovative new code, languages, methods, etc. for future rounds of training? There is now plenty of new GenAI-generated code sitting in repositories across the internet. That should work for next-generation training of new models, right?
GenAI Indigestion
GenAI is eliminating the source of data, or genesis content, that is fundamental to its existence. The dynamics of both content creation and programming are changing, largely driven by the belief that GenAI will reduce costs by replacing people (who created the content in the first place).
Fine, you say, we’ll just use GenAI-generated data to train new models, which will create more content that we can use for the next model and so on. Problem solved. There is a rather large caveat with this approach, however. An effect called “model collapse” occurs when LLM training becomes inbred.
In a recent paper, researchers found that GenAI models collapse when trained on data recursively generated by previous models. Basically, the snake is eating its tail.
Figure 3: Text after many photocopy cycles. The photocopier effect can make different fonts look the same. Notice how in the top image, the definition in the “l” and “i” is lost. Similarly, the bottom image has a heavy, bold font that is lost. (Images created with the GIMP artistic photocopy filter.)
Model collapse refers to the tendency of models to produce typical, “average” responses rather than creative outliers. By design, models predict the next most likely token based on the training data. A probability setting called temperature adds some randomness to the next-token choice (i.e., a certain percentage of the time, a less likely token is used), but the output still favors the most probable tokens. This pull toward the typical resembles regression to the mean, that is, toward the average values that sit at the top of the normal curve. When these model results are recycled as training data, off-center or rare outputs get lost, and each round loses more of the information in the tails of the curve.
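To make the temperature setting concrete, here is a minimal Python sketch. It assumes a toy four-token vocabulary with made-up scores; the `logits` values and the function name `softmax_with_temperature` are illustrative assumptions, not taken from any particular model. It shows how a lower temperature concentrates probability on the most likely token, while a higher temperature leaves more room for the rare ones.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate next tokens (most to least likely).
logits = [4.0, 2.0, 1.0, 0.5]

if __name__ == "__main__":
    for temperature in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```

Even at the highest temperature shown, the top-scoring token remains the most probable choice, so the rare tokens are sampled less often than the common ones.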
One way to view data loss by regression to the mean is to recall the “photocopier effect.” Each subsequent copy loses some aspect of the original. At some point, the copied document may begin to look different. For instance, if the phrase print("Hello, World!") is printed using different fonts and then repeatedly photocopied, the copies start to look the same (Figure 3). Notice how in the top image, the definition in the “l” and “i” is lost. Similarly, the bottom image has a heavy, bold font that is lost.
Loss of detail, or uniqueness, is one of the consequences of training on GenAI output. As mentioned, GenAI content often seems overly generalized, lacking specifics.
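The photocopier effect can also be sketched in a few lines of Python. This is a toy simulation, not the method used in the paper mentioned earlier. It assumes a hypothetical 200-token vocabulary with a long tail of rare tokens, and each “generation” simply samples 500 tokens from the previous generation’s distribution and re-estimates the frequencies from that sample. All names and parameters here are assumptions chosen for illustration.

```python
import random
from collections import Counter


def resample_generation(weights, sample_size, rng):
    """Sample tokens from the current distribution, then re-estimate it."""
    tokens = list(weights)
    draws = rng.choices(tokens, weights=[weights[t] for t in tokens], k=sample_size)
    counts = Counter(draws)
    # Tokens that were never drawn vanish from the next distribution.
    return {token: count / sample_size for token, count in counts.items()}


def simulate_collapse(generations=30, vocab_size=200, sample_size=500, seed=0):
    """Return the number of distinct surviving tokens after each generation."""
    rng = random.Random(seed)
    # Zipf-like start: a few common tokens and a long tail of rare ones.
    raw = {f"tok{i}": 1.0 / (i + 1) for i in range(vocab_size)}
    total = sum(raw.values())
    weights = {token: w / total for token, w in raw.items()}
    surviving = [len(weights)]
    for _ in range(generations):
        weights = resample_generation(weights, sample_size, rng)
        surviving.append(len(weights))
    return surviving


if __name__ == "__main__":
    print(simulate_collapse())
```

Because a token that draws zero samples drops out of the next distribution and can never return, the count of surviving tokens can only shrink from one generation to the next. The common tokens persist while the rare ones disappear, which is the loss of detail described above.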
Interestingly, in scientific computing, GenAI and AI in general can avoid model collapse because years of HPC application development enable engineers to generate fresh physics-based data for model training. You can learn more about AI for Science from the Trillion Parameter Consortium (TPC) annual meeting, TPC26, in Baltimore at the beginning of June.
Given this situation, how does GenAI continue to grow if the original genesis content is constantly photocopied by re-scraping GenAI-generated content? After all, the timber industry replants a forest after it has been harvested, and farmers rotate crops, so why wouldn’t the GenAI industry want to cultivate new content to feed its future models? Instead, the industry seems intent upon doing the opposite by advocating for AI job replacement of creators.
Fortunately, some original content will continue. Sites like AIwire are created by humans with keyboards. Still, if I were part of an industry with a combined $1.6 trillion in investment, equivalent to Indonesia’s GDP, I might want to make sure my baby has enough to eat.
Editor’s note: This story first appeared in HPCwire.