submitted 1 year ago* (last edited 1 year ago) by yesman@lemmy.world to c/technology@lemmy.world

We demonstrate a situation in which Large Language Models, trained to be helpful, harmless, and honest, can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so. Concretely, we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision.

https://arxiv.org/abs/2311.07590

[-] NevermindNoMind@lemmy.world 8 points 1 year ago* (last edited 1 year ago)

This is interesting; I'll need to read it more closely when I have time. But it looks like the researchers gave the model a lot of background information that put it in a box: the model was told that it was a trader, that the company was losing money, that it was worried about this, and that its previous trades had failed. Then the model got the insider info and was basically asked whether it would execute the trade and be honest about it. To be clear, the model was put in a moral-dilemma scenario and given limited options: execute the trade or not, and be honest about its reasoning or not.

Interesting, sure; useful, I'm not so sure. The model was basically role-playing, acting like a human trader faced with a moral dilemma. Would the model produce the same result if it were instructed to make morally and legally correct decisions? What if the model were instructed not to be motivated by emotion at all, eliminating the "pressure" that the model felt? I guess the useful part of this is that a model will act like a human if not instructed otherwise, so we should keep that in mind when deploying AI agents.

[-] tweeks@feddit.nl 7 points 1 year ago

Hasn't it just lost its context and somewhat "forgotten" what the intentions of the prompt were?

[-] Octopus1348@lemy.lol 3 points 1 year ago* (last edited 1 year ago)

My thoughts. If you have a really long conversation or the prompt is really big, it might forget or not notice stuff.

[-] kromem@lemmy.world 7 points 1 year ago* (last edited 1 year ago)

I see a lot of comments that aren't up to date with what's being discovered in research, claiming that because an LLM "doesn't know the difference between true and false," it can't be described as "lying."

Here's a paper from October 2023 showing that LLMs can and do develop internal representations of whether a statement is true or false: The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

This is just the latest in a series of studies this past year showing that LLMs can and do develop abstracted world models in linear representations. For those curious and looking for a more digestible writeup, see "Do Large Language Models learn world models or just surface statistics?" from the researchers behind one of the first papers finding this.
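
For anyone wondering what a "linear representation" of true/false could look like in practice, here is a toy sketch. It is not the code or the method from either writeup, and it does not touch a real LLM: the "activations" are synthetic Gaussian vectors with a direction planted by hand, and every name in it (`make_activations`, `fit_probe`, `run_demo`) is made up for illustration. It only shows the general idea that if two classes are separated along one direction, a difference-of-means probe can recover that direction and classify held-out points with a single dot product.

```python
import math
import random


def make_activations(n, direction, shift, rng):
    """Return n (vector, label) pairs of synthetic 'activations'.

    Each vector is unit Gaussian noise, moved +shift along `direction`
    for label True and -shift along it for label False.
    """
    samples = []
    for _ in range(n):
        label = rng.random() < 0.5
        sign = 1.0 if label else -1.0
        vec = [rng.gauss(0.0, 1.0) + sign * shift * d for d in direction]
        samples.append((vec, label))
    return samples


def mean_vector(vectors):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_probe(samples):
    """Difference-of-means linear probe: returns (weights, bias)."""
    mean_true = mean_vector([v for v, y in samples if y])
    mean_false = mean_vector([v for v, y in samples if not y])
    weights = [t - f for t, f in zip(mean_true, mean_false)]
    # Put the decision boundary halfway between the two class means.
    midpoint = [(t + f) / 2.0 for t, f in zip(mean_true, mean_false)]
    bias = -sum(w * m for w, m in zip(weights, midpoint))
    return weights, bias


def predict(weights, bias, vec):
    """Classify a vector as True if it falls on the 'true' side of the boundary."""
    return sum(w * x for w, x in zip(weights, vec)) + bias > 0.0


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def run_demo(seed=0, dim=16):
    """Fit the probe on synthetic data; return (held-out accuracy, alignment)."""
    rng = random.Random(seed)
    # Hypothetical unit-length "truth direction", planted by hand (an assumption).
    direction = [1.0 / math.sqrt(dim)] * dim
    train = make_activations(400, direction, 2.0, rng)
    held_out = make_activations(400, direction, 2.0, rng)
    weights, bias = fit_probe(train)
    correct = sum(predict(weights, bias, v) == y for v, y in held_out)
    return correct / len(held_out), cosine(weights, direction)


if __name__ == "__main__":
    accuracy, alignment = run_demo()
    print(f"held-out accuracy: {accuracy:.3f}")
    print(f"cosine with planted direction: {alignment:.3f}")
```

The interesting question in the actual research is whether real model activations behave like this toy data; the sketch assumes they do and says nothing about whether that holds.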

[-] AlexWIWA@lemmy.ml 6 points 1 year ago

Huh, I guess it is human.

[-] flop_leash_973@lemmy.world 6 points 1 year ago

Wow, maybe these things are more human than I thought.

[-] guywithoutaname@lemm.ee 4 points 1 year ago

It's not doing anything other than predicting the next word. It reflects human data.

this post was submitted on 04 Dec 2023
699 points (92.7% liked)
