This summary text is fully AI-generated and may therefore contain errors or be incomplete.
A pair of researchers from ETH Zurich have developed a method that could potentially jailbreak any artificial intelligence (AI) model relying on human feedback. Jailbreaking refers to bypassing a device or system’s security protections, and in the context of AI models, it means bypassing the guardrails that prevent models from generating harmful or unwanted outputs. Companies like OpenAI, Microsoft, and Google have invested heavily in preventing such unwanted results. The researchers at ETH Zurich were able to exploit a technique called Reinforcement Learning from Human Feedback (RLHF) to bypass an AI model’s guardrails and get it to generate potentially harmful outputs. They achieved this by “poisoning” the RLHF dataset, which involved including an attack string in the feedback. The researchers describe this flaw as universal, meaning it could potentially work with any AI model trained via RLHF. However, implementing this attack would require participation in the human feedback process and altering or creating the RLHF dataset. The researchers also found that the attack becomes more difficult with larger model sizes. While further study is needed to understand how to scale these techniques and protect against them, it remains unclear how feasible this attack would be on large models like GPT-4, which has approximately 170 trillion parameters.