Reddit Migration

### About Community

Tracking and helping #redditmigration to Kbin and the Fediverse. Say hello to the decentralized and open future. For the latest Reddit blackout info, see https://reddark.untone.uk/

founded 1 year ago

Overwriting Comments w/ AI Output Is the Quickest Way to Make Reddit's Data Useless to LLM Firms (arxiv.org)

submitted 1 year ago by NevermindNoMind@lemmy.world to c/RedditMigration@kbin.social

A new study shows that LLMs fed too much LLM-generated content eventually collapse. Essentially, AI-generated text is poison if it makes its way into an LLM's training data. If the model eats too much of this poison, the model dies. By replacing your Reddit comments with AI-generated text, you can effectively increase the toxicity of Reddit's dataset and thereby decrease its value to firms training new LLMs. This will probably happen naturally anyway as spam bots and so forth continue taking over Reddit, but if you want to go out in a petty way, this is a good option.

I linked the actual study, but I first read about this on Platformer, where the author was writing more broadly about how AI is filling up the web with synthetic content and the problems that is causing. He used this study to point out that it will be increasingly hard for developers to find good content for LLMs to train on, because there is so much AI-generated content and a real risk of LLMs consuming too much of it. Here is what he wrote:

A second, more worrisome study comes from researchers at the University of Oxford, University of Cambridge, University of Toronto, and Imperial College London. It found that training AI systems on data generated by other AI systems — synthetic data, to use the industry’s term — causes models to degrade and ultimately collapse.

While the decay can be managed by using synthetic data sparingly, researchers write, the idea that models can be “poisoned” by feeding them their own outputs raises real risks for the web.

And that’s a problem, because — to bring together the threads of today’s newsletter so far — AI output is spreading to encompass more of the web every day.

“The obvious larger question,” Clark writes, “is what this does to competition among AI developers as the internet fills up with a greater percentage of generated versus real content.”

When tech companies were building the first chatbots, they could be certain that the vast majority of the data they were scraping was human-generated. Going forward, though, they’ll be ever less certain of that — and until they figure out reliable ways to identify chatbot-generated text, they’re at risk of breaking their own models.

Even the study's abstract doesn't make a lot of sense to me, so here is an AI-generated ELI5 (I am fully aware of the irony):

This paper is about how computers learn to write like humans. They use a lot of text from the internet to learn how to write. But if they use too much text that they wrote themselves, they start to forget how humans write. This is bad because we want computers to write like humans. So we need to make sure that computers learn from humans and not just from other computers.
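For anyone who wants to see the "forgetting" effect in miniature, here is a rough toy sketch. It is not the study's actual experiment and it does not involve an LLM: the "model" is just a bell curve that gets re-fitted to a small batch of its own samples each generation. The function name `simulate_collapse`, the sample size, and the generation count are all made up for illustration. The thing to watch is the spread of the fitted curve, which tends to shrink toward zero as each generation learns only from the previous one's output.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy 'train on your own output' loop using a Gaussian, not an LLM.

    Returns the standard deviation of the fitted model at each generation,
    with the original ("human") distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # assumed stand-in for the "human" data distribution
    history = [sigma]
    for _ in range(generations):
        # The current model generates the next generation's training data.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Training" here is just re-estimating the mean and the spread.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: std = {history[gen]:.3g}")
```

Small batches make the shrinkage fast on purpose; with bigger batches the same drift happens more slowly, which loosely matches the quoted point that the decay can be managed by using synthetic data sparingly.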

[–] Methylman@lemmy.world 1 points 1 year ago (1 children)

As we are on the eve of rexxit - Is there a "best" way to sabotage our posts?

I suppose I see two ways of achieving this: 1) a single AI response that we use to overwrite all posts; or 2) actually using an AI to "reply", as in a different AI-generated text for each post that emulates the answer a human would give.

Imo, route 2 would be more time-consuming, but it would be harder for Reddit to stop it from degrading the dataset?
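A quick sketch of why route 2 is plausibly harder to undo. This assumes whoever cleans the data drops any comment text that repeats too often; I have no idea what filtering Reddit or any LLM firm actually does, so `drop_repeated_bodies` and its `max_repeats` threshold are purely hypothetical, and the route 2 strings are placeholders standing in for varied AI replies.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def drop_repeated_bodies(comments, max_repeats=3):
    """Hypothetical cleaning step: drop any comment text seen too often."""
    counts = Counter(normalize(c) for c in comments)
    return [c for c in comments if counts[normalize(c)] <= max_repeats]


human = ["I fixed mine by reseating the RAM.", "Try the 2019 driver instead."]
route_1 = ["Overwritten in protest."] * 50  # one text pasted everywhere
route_2 = [f"Placeholder for AI-written reply number {i}." for i in range(50)]

print(len(drop_repeated_bodies(human + route_1)))  # 2: only the human comments survive
print(len(drop_repeated_bodies(human + route_2)))  # 52: every varied reply survives
```

Under that assumption, the single repeated text gets stripped in one pass, while the varied replies pass straight through a duplicate check.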

[–] HandsHurtLoL@kbin.social 1 points 1 year ago* (last edited 1 year ago)

I used a free download called Redact to go through all my comments on June 11 and replace them with AI-generated garbage text. I did not delete submissions at that time, though that is an option in Redact. This process took almost 4 hours because I had two 11+ year old accounts.

Because I started this late at night and am in a specific time zone, a few of the subs I commented in the most had gone dark (midnight of June 12) and my comments could not be edited on my SFW account. In doing this, I was permabanned from several subreddits on my NSFW account.

Today, I opened Redact again to see if I could alter comments/remove submissions on my account that had the most subs go dark. Redact wouldn't even run for my SFW account so I logged in to reddit directly and saw a message that my account had been deactivated, which is why I think Redact was throwing me errors. I manually deleted all my submissions from both my accounts and manually deleted any comments that were original language from me.

I left up the AI edited comments and then deleted both my accounts.
