moths-in-the-window asked:
There's been much grumbling about how the RLHF done with ChatGPT seems to have collapsed it into a bland writing style ('corporate speak'). Do you think a different persona could have been much less bland as a writer, or thanks to the inherent limitations of RLHF, just boring/limited in a different way? (e.g. I know CharacterAI was struggling with effusively friendly 'villain' chatbots)
nostalgebraist answered:
It’s hard to speak with any confidence on questions like this. There’s a lot about RLHF that we just don’t know yet.
However, I don’t think this is merely a result of the “persona” assigned to ChatGPT.
Why not? Because the same problem afflicts other preference-tuned models that don’t “have a persona” in the way ChatGPT (and Claude) do.
If you ask text-davinci-003 to write fiction, it tends to use a disappointingly bland style, much like ChatGPT when asked the same thing.
text-davinci-003 was tuned with RLHF to “follow instructions” in a “helpful, truthful, and harmless” manner. (Cf. the annotator instructions in Figure 10 here.)
However, it wasn’t tuned to roleplay a character with a specific persona. text-davinci-003 doesn’t say things to you; it doesn’t talk about itself; it just writes the text you asked for in your instruction.
Which OpenAI models have this problem? An incomplete list, from my own brief tests:
- Pure language models like davinci and code-davinci-002 do not have the problem.
- (Despite the name, code-davinci-002 in particular is great at creative writing. code-davinci-002 is probably the best OpenAI API model overall, if you know what you’re doing.)
- text-davinci-002 has the problem. It was tuned on a similar dataset to text-davinci-003, but with a non-RLHF method ("FeedME": basically finetuning on highly rated samples).
So I think the problem results from the human preference data used to tune the instruction-tuned models.
This is not entirely distinct from the “persona” we see in ChatGPT:
- The preference data encourages responses that are “helpful, truthful and harmless”
- The persona is something like “a friendly chatbot programmed to be helpful, truthful and harmless”
But, the evidence above shows that the friendly chatbot character isn’t necessary for the problem. Tuning to encourage “helpful, truthful and harmless” instruction-following is apparently sufficient.
Presumably there is some way to collect preference data that doesn't make the model less creative, or less capable of stylistic variety, when it's tuned on it. There are finetuned models that don't have this problem, so it isn't the mere act of finetuning that causes it; it's something about the data used.
This makes me think of a Twitter thread from a food scientist a while ago.
"Everyone knows" that the reason supermarket produce is less flavorful than heirloom varieties of the same fruits & veg is that supermarket varieties are bred for stability in transit rather than flavor.
That's true, to some extent. Certainly when it comes to things like tomatoes, where soft juiciness is an essential part of the experience, you're just not going to get "unbruisable in transit" and "delicious" out of the same tomato. Opposed goals.
But what about things like carrots? Carrots are basically indestructible. Ditto parsnips, potatoes, etc. There are a lot of plants out there where taste is perfectly compatible with logistical properties.
When crop scientists develop new plant varieties, they're scientists about it. They breed a bunch of plants with whatever the desired logistical properties are... and then they run taste tests. They gather up focus groups and have them eat a lot of carrots, or apples, or the like.
And Americans, without fail, pick the blandest-tasting vegetables in blind taste tests. That's why grocery store veg are bland. That's why Red Delicious apples are abominations and Galas get more boring and mealy every year. We chose this.
Reinforcement learning trains a model to produce results that are like the ones people liked. And our taste sucks.
FWIW, I don't think this is what is happening with ChatGPT and similar models.
The preference labeling exercise for these models is not a taste test. It's not asking, "which of these do you, personally, prefer?"
Instead, it asks you to assess which one has more of a predefined quality, like "helpful, truthful and harmless." It doesn't matter whether you like the quality being assessed. It's about human judgment, not human taste.
So the models are being tuned to be "helpful, truthful and harmless" (or something like that) -- whether or not "helpful, truthful and harmless" is what anyone actually likes.
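To make the distinction concrete: the standard recipe (as in the InstructGPT paper) turns those pairwise judgments into a reward model via a Bradley-Terry-style loss. The details below are a minimal sketch of that recipe, not a description of OpenAI's actual code -- the point is just that the training signal comes from which response the labeler judged to better fit the rubric, not from which one they personally enjoyed.

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise reward-model loss. The labeler only ranks which response
    better fits the predefined rubric ("helpful, truthful and harmless");
    the model is trained so that response gets the higher scalar reward.
    Loss is -log sigmoid(r_chosen - r_rejected): small when the chosen
    response's reward clearly exceeds the rejected one's."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The policy is then optimized against that learned reward, so "whatever the rubric rewards" is what gets amplified -- whether or not anyone actually likes it.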
Why wouldn't you want "helpful, truthful and harmless"? Well, it sounds kind of . . . boring, doesn't it? Kind of bland?
So that's one hypothesis for where the "bland style" of these models comes from. We're teaching the model specifically to be boring and small-c conservative, and that's what we get.
That hypothesis implies that if we optimize a less boring target, we'd get writing that's correspondingly more varied.
Maybe if we optimized for (say) "creative but coherent," instead of "helpful, truthful and harmless," we'd get a model that could still write in many different and colorful styles.
However, there's another hypothesis: maybe RLHF inevitably "collapses" the model into doing a single style.
GPTs can do a million different styles and personas, by nature -- that's what they do. Optimizing for any specific quality pushes against their wild and varied nature.
By nature, when they look at a prompt, they ponder and speculate and dream. They consider every corner of the wide world of imaginable text that it could possibly fit into. But RLHF tells them "no, don't daydream, don't think of odd what-ifs. Keep your mind here. Do this one thing. There is nothing else."
It's possible that this instills a kind of small-c conservatism and tendency toward sameness, no matter what the target is. So if your target is "creative but coherent," you'd get a model that does one "creative" style well, and only does that one style.
(This is reminiscent of the diversity problem with classifier-free guidance in image generation. RLHF and guidance are similar in a way.)
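For reference, the guidance trick that parenthetical alludes to: classifier-free guidance extrapolates from the unconditional prediction toward the conditional one, which buys prompt adherence at the cost of sample diversity -- the analogue of RLHF collapsing the model onto one style. A toy sketch (the vector-of-floats framing is illustrative, not how diffusion models represent predictions):

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: move the prediction from the
    unconditional one toward (and past) the conditional one.
    scale=1.0 recovers the plain conditional prediction; larger
    scales sharpen adherence to the condition while narrowing the
    diversity of samples -- loosely, heavier optimization pressure."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]
```

At scale 1 you get the conditional prediction back unchanged; cranking the scale up pushes every sample harder toward the same conditioned mode, which is the diversity loss being compared to RLHF here.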
I don't think anyone knows which hypothesis is correct.
Wouldn’t this be easy to test? Has no one actually tried tuning a GPT to be “creative and coherent” or something similar?