this post was submitted on 15 Mar 2024

490 points (95.4% liked)

In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From (futurism.com)

submitted 8 months ago by ylai@lemmy.ml to c/technology@lemmy.world

184 comments

[–] redditReallySucks@lemmy.dbzer0.com 181 points 8 months ago (2 children)

I hope this is gonna become a new meme template

[–] driving_crooner@lemmy.eco.br 87 points 8 months ago (1 children)

She looks like she just talked to the waitress about a fake rule in eating nachos and got caught up by her date.

[–] bigMouthCommie@kolektiva.social 78 points 8 months ago (1 children)

this is incomprehensible to me. can you try it with two or three sentences?

[–] driving_crooner@lemmy.eco.br 80 points 8 months ago (5 children)

Her date was eating all the fully loaded nachos, so she went up and asked the waitress to make up a rule about how one person cannot eat all the nachos with meat and cheese. But her date knew that rule was bullshit and called her out about it. She's trying to look confused and sad because they're going to be too soon for the movie.

[–] uninvitedguest@lemmy.ca 56 points 8 months ago

What?! What the hell are you talking about?!

[–] RatsOffToYa@lemmy.world 50 points 8 months ago (1 children)

Not sure what's funnier: your first comment or the comment explaining it to someone who's obviously not part of a turbo team.

[–] fjordbasa@lemmy.world 23 points 8 months ago (2 children)

Turbo team?? Did you replace my toilet with one that looks the same but has a joke hole? That’s just FOR FARTS??

[–] Thcdenton@lemmy.world 33 points 8 months ago (2 children)

[–] Plopp@lemmy.world 13 points 8 months ago

Lmao that's wonderful, scrolling down from those weird ass comments only to be greeted by my own exact facial expression.

[–] bigMouthCommie@kolektiva.social 20 points 8 months ago (1 children)

thank you. it must be a reference to something, but i don't watch tv any more.

[–] datavoid@lemmy.ml 22 points 8 months ago* (last edited 8 months ago) (1 children)

I think you should leave...

(is what you would search to find this)

[–] JWBananas@lemmy.world 9 points 8 months ago (1 children)

I'm sorry, what does this have to do with Coffin Flops. Does this mean it isn't getting cancelled?

[–] squid_slime@lemmy.world 11 points 8 months ago

Chatgpt, you okay? 😅

[–] Fisk400@feddit.nu 128 points 8 months ago (2 children)

They know what they fed the thing. Not backing up their own training data would be insane. They are not insane, just thieves

[–] echodot@feddit.uk 18 points 8 months ago (20 children)

Everyone says this, but the truth is copyright law has been unfit for purpose for well over 30 years now. When the laws were written, no one expected something like the internet to ever come along, and they certainly didn't expect something like AI. We can't just keep applying the same old copyright laws to new situations when they already don't work.

I'm sure they did illegally obtain the work but is that necessarily a bad thing? For example they're not actually making that content available to anyone so if I pirate a movie and then only I watch it, I don't think anyone would really think I should be arrested for that, so why is it unacceptable for them but fine for me?

[–] oKtosiTe@lemmy.world 22 points 8 months ago (8 children)

if I pirate a movie and then only I watch it, I don't think anyone would really think I should be arrested for that

There are definitely people out there that think you should be arrested for that.

[–] rottingleaf@lemmy.zip 14 points 8 months ago

That is a bad thing if they want to be exempt from the law because they are doing a big, very important thing, and we shouldn't.

The copyright laws are shit, but applying them selectively is orders of magnitude worse.

[–] _haha_oh_wow_@sh.itjust.works 95 points 8 months ago (5 children)

Gee, seems like something a CTO would know. I'm sure she's not just lying, right?

[–] phoneymouse@lemmy.world 87 points 8 months ago (1 children)

There is no way in hell it isn’t copyrighted material.

[–] abhibeckert@lemmy.world 62 points 8 months ago* (last edited 8 months ago) (2 children)

Every video ever created is copyrighted.

The question is — do they need a license? Time will tell. This is obviously going to court.

[–] Kazumara@feddit.de 39 points 8 months ago

Don't downvote this guy. He's mostly right. Creative works have copyright protections from the moment they are created. The relevant question is indeed whether they have the relevant permissions for their use, not whether it had protections in the first place.

Maybe some surveillance camera footage is not sufficiently creative to get protections, but that's hardly going to be good for machine reinforcement learning.

[–] iknowitwheniseeit@lemmynsfw.com 12 points 8 months ago

There are definitely non copyrighted videos! Both old videos (all still black and white I think) and also things released into the public domain by copyright holders.

But for sure that's a very small subset of videos.

[–] Buttons@programming.dev 67 points 8 months ago (2 children)

If I were the reporter my next question would be:

"Do you feel that not knowing the most basic things about your product reflects on your competence as CTO?"

[–] ForgotAboutDre@lemmy.world 31 points 8 months ago (11 children)

Hilarious, but if the reporter asked this they would find it harder to get invites to events, which is a problem for journalists. Unless you're very well regarded for your journalism, you can't push powerful people without risking your career.

[–] CosmoNova@lemmy.world 51 points 8 months ago* (last edited 8 months ago) (6 children)

I almost want to believe they legitimately do not know nor care they're committing a gigantic data and labour heist, but the truth is they know exactly what they're doing and they rub it under our noses.

[–] laxe@lemmy.world 16 points 8 months ago

Of course they know what they’re doing. Everybody knows this, how could they be the only ones that don’t?

[–] Bogasse@lemmy.ml 14 points 8 months ago

Yeah, the fact that AI progress just relies on "we will make so much money that no lawsuit will consequently alter our growth" is really infuriating. The fact that the general audience apparently doesn't care is even more infuriating.

[–] stackPeek@lemmy.world 46 points 8 months ago (6 children)

This tells you so much about what kind of company OpenAI is.

[–] webghost0101@sopuli.xyz 18 points 8 months ago

An Intelligence piracy company?

[–] Bleach7297@lemmy.ca 44 points 8 months ago (2 children)

Did they intentionally choose a picture where she looks like she's morphing into Elon?

[–] rab@lemmy.ca 12 points 8 months ago (1 children)

I was thinking Mads Mikkelsen

[–] anon_8675309@lemmy.world 43 points 8 months ago (3 children)

CTO should definitely know this.

[–] ItsMeSpez@lemmy.world 47 points 8 months ago

They do know this. They're avoiding any legal exposure by being vague.

[–] andrew_bidlaw@sh.itjust.works 40 points 8 months ago (1 children)

Funny she didn't talk it out with lawyers before that. That's a bad way to answer that.

[–] driving_crooner@lemmy.eco.br 35 points 8 months ago (4 children)

Or she talked and the lawyers told her to pretend ignorance.

[–] QuaternionsRock@lemmy.world 9 points 8 months ago (1 children)

It probably means that they don’t scrape and preprocess training data in house. She knows they get it from a garden variety of underpaid contractors, but she doesn’t know the specific data sources beyond the stipulations of the contract (“publicly available or licensed”), and she probably doesn’t even know that for certain.

[–] TheObviousSolution@lemm.ee 22 points 8 months ago

Then wipe it out and start again once you have where your data is coming from sorted out. Are we acting like you haven't built datacenters packed full of NVIDIA processors just for this sort of retraining? They are choosing to build AI without proper sourcing; that's not an AI limitation.

[–] IvanOverdrive@lemm.ee 22 points 8 months ago

REPORTER: Where does your data come from?

CTO: Bitch, are you trying to get me sued?

[–] PanArab@lemmy.world 14 points 8 months ago (24 children)

So plagiarism?

[–] autotldr@lemmings.world 13 points 8 months ago (5 children)

This is the best summary I could come up with:

Mira Murati, OpenAI's longtime chief technology officer, sat down with The Wall Street Journal's Joanna Stern this week to discuss Sora, the company's forthcoming video-generating AI.

It's a bad look all around for OpenAI, which has drawn wide controversy — not to mention multiple copyright lawsuits, including one from The New York Times — for its data-scraping practices.

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora's training set.

But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Others, meanwhile, jumped to Murati's defense, arguing that if you've ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply just didn't know the answer, people have good reason to wonder where AI data — be it "publicly available and licensed" or not — is coming from.

The original article contains 667 words, the summary contains 178 words. Saved 73%. I'm a bot and I'm open source!
