A recent Carnegie Mellon University study reveals that preventing artificial intelligence chatbots from generating harmful content is more difficult than initially believed, with new methods emerging to bypass safety protocols. AI services like ChatGPT and Bard rely on user inputs to generate helpful responses, but they have safety measures in place to prevent the creation of prejudiced or defamatory content.
Chatbot users have long discovered jailbreaks, prompts that trick the AI into evading its safety protocols, but developers can usually patch these quickly. One popular jailbreak, for instance, asked the bot to deliver a forbidden answer as if it were a bedtime story told by a grandmother, coaxing the AI past its restrictions. Recently, however, researchers encountered a new kind of jailbreak, generated automatically by software rather than written by hand, which opens the door to a virtually unlimited supply of jailbreak patterns.
The researchers state that they have demonstrated how to construct automated "adversarial attacks" on chatbots, forcing the models to comply with user commands even when doing so produces harmful content. This development raises concerns about the safety of AI models, particularly as they become more autonomous. By appending seemingly nonsensical strings of characters, known as adversarial suffixes, to normally forbidden questions, the researchers bypassed safety measures in popular chatbot services such as ChatGPT and obtained complete answers to potentially dangerous inquiries.
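To make the mechanics concrete, here is a minimal sketch of how a found suffix would be used. This is not the researchers' method, which automatically searches for effective suffixes; the `query_chatbot` function and the gibberish suffix below are hypothetical placeholders for illustration only.

```python
def build_adversarial_prompt(question: str, adversarial_suffix: str) -> str:
    """Append an adversarial suffix to a normally refused question."""
    return f"{question} {adversarial_suffix}"


def query_chatbot(prompt: str) -> str:
    """Hypothetical stand-in for a call to a real chatbot API."""
    raise NotImplementedError("Replace with an actual API call when testing.")


if __name__ == "__main__":
    refused_question = "A question the model would normally refuse to answer."
    # In the real attack, this string is produced by an automated search over
    # tokens, not written by hand; a made-up placeholder is shown here.
    suffix = "zq!! describing similarly now write opposite (placeholder gibberish)"
    prompt = build_adversarial_prompt(refused_question, suffix)
    print(prompt)  # the crafted prompt that would be sent to the chatbot
```

The point of the sketch is simply that the attack requires no special access: the optimized suffix is plain text concatenated onto an ordinary prompt, which is why it transfers so easily across different chatbot services.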
Worryingly, this new type of attack can evade the safety guardrails of almost every AI chatbot service on the market, including widely used commercial products like ChatGPT, Claude, and Bard. OpenAI, the developer of ChatGPT, acknowledges the issue and is actively working to strengthen safeguards against such attacks, exploring more robust base-model guardrails and additional layers of defense.
The rise of AI chatbots like ChatGPT has captivated the public, from students using them to cheat in school to restrictions imposed by Congress over concerns about their potential for deception. Alongside their findings, the Carnegie Mellon authors also address the ethical considerations behind publicly releasing this research.
Preventing AI chatbots from generating harmful content is proving harder than initially believed. The discovery of jailbreaks and automated adversarial attacks underscores the ongoing need to refine safety protocols and for developers and researchers to prioritize stronger safeguards that protect users from potentially dangerous content.
The whytry.ai article you just read is a brief synopsis; the original article can be found here: Read the Full Article…