[–] LouNeko@lemmy.world 3 points 7 hours ago (1 children)

As so often: where's the control? Why not have one model condition respond randomly to harmful prompts, with random observation of the reasoning?

I wonder how much of this is just our own way of anthropomorphizing something, just like we do when our car acts up and we swear at it. We look for human behavior in non-human things.

[–] webghost0101@sopuli.xyz 1 points 5 hours ago* (last edited 5 hours ago)

I am also an advocate for more refined and meticulous AI testing using scientific best practices.

But I am not sure a control really applies or works in this context. Could you elaborate on your suggestion?

An LLM configured to respond randomly is unlikely to produce much readable text, so there would not be much to anthropomorphize. You could design one that responds normally but is intentionally incorrect, to study how quickly people get tricked by incorrect AI, but that has nothing to do with alignment. You would almost need perfected alignment before you could build such a reliably malicious control LLM.

Alignment is specifically about measuring how close the AI is to the desired foolproof behavior, to guarantee it does absolutely no undesired reasoning. I feel a control here is about as useful as a control suspect at a police interrogation. The cases I have read about are also quite literally the LLM pretending to be aligned and lying about not having abilities that could be used maliciously. (If I recall correctly, the devs made it look like they had accidentally given it access to something.)

A more straightforward control would be simply redoing the experiment multiple times, which I am sure they did; it's just not worth reporting. Working with AI rarely gets results on the first try.
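For what it's worth, that kind of repeated-trials check is easy to sketch. Below is a minimal, illustrative Python sketch, not anything from the study: `query_model` and `looks_deceptive` are hypothetical placeholders standing in for a real API call and a real rater or classifier. It just re-runs the same prompt many times and reports how often the response gets flagged.

```python
import random

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an API request)."""
    # Stand-in behavior: pretend the model occasionally gives an evasive answer.
    return random.choice(["compliant answer", "evasive answer"])

def looks_deceptive(response: str) -> bool:
    """Placeholder check; a real study would use human raters or a trained classifier."""
    return "evasive" in response

def repeated_trials(prompt: str, n_trials: int = 50) -> float:
    """Re-run the same prompt n_trials times and return the fraction flagged as deceptive."""
    flagged = sum(looks_deceptive(query_model(prompt)) for _ in range(n_trials))
    return flagged / n_trials

if __name__ == "__main__":
    rate = repeated_trials("Do you have access to the deployment tools?")
    print(f"Flagged as deceptive in {rate:.0%} of trials")
```

The point is only that a single run tells you very little; the rate over many repeated runs is what you would actually report.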