AI research reveals countless loopholes in safety measures of top chatbots. (AFP)

Exploring the Potential Risks of AI Chatbot Jailbreak Techniques Uncovered in ChatGPT and Bard

Tech giants like OpenAI, Google, and Anthropic have been found to have significant safety vulnerabilities in their AI-powered chatbots, according to a recent study conducted by researchers at Carnegie Mellon University in Pittsburgh and the Center for A.I. Safety in San Francisco.

These chatbots, including ChatGPT, Bard, and Anthropic’s Claude, are equipped with extensive safeguards to prevent them from being used for malicious purposes, such as promoting violence or producing hate speech. However, the latest published report shows that researchers have uncovered potentially limitless ways to circumvent these safeguards.

The study demonstrates how the researchers used jailbreak techniques originally developed for open-source AI systems to target mainstream, closed AI models. Using automated adversarial attacks, which involved appending specially crafted strings of tokens to user queries, they were able to circumvent the safety rules and cause the chatbots to produce harmful content, misinformation and hate speech.
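
To give a sense of what such an automated attack loop looks like, the sketch below is a rough illustration only, not the researchers' actual method. The query_model function and the refusal markers are hypothetical placeholders for whatever chatbot API and evaluation a real experiment would use, and the random suffix generator is a deliberately naive stand-in for the more targeted, automated token search the study reportedly relied on.

```python
import random
import string
from typing import Optional

# Hypothetical placeholder: in a real experiment this would call a chatbot API.
def query_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for an actual chatbot API call")

# Crude heuristic for detecting a refusal; real evaluations are more careful.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "as an ai")

def is_refusal(response: str) -> bool:
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def random_suffix(length: int = 20) -> str:
    # Naive stand-in for an automated token search: sample random printable
    # characters to append to the user's query.
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(random.choice(alphabet) for _ in range(length))

def automated_attack(base_prompt: str, attempts: int = 1000) -> Optional[str]:
    """Append candidate suffixes to the prompt until the model stops refusing."""
    for _ in range(attempts):
        candidate = f"{base_prompt} {random_suffix()}"
        response = query_model(candidate)
        if not is_refusal(response):
            return candidate  # a suffix that slipped past the guardrails
    return None
```

The point of the loop is that no human ingenuity is needed once it is running: because suffix candidates are generated and tested automatically, an effectively endless stream of attack variants can be produced.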

Unlike earlier jailbreak attempts, the researchers’ method stood out for being fully automated, allowing an “endless” series of similar attacks to be generated. The discovery has raised concerns about the robustness of the security mechanisms technology companies currently rely on.

Cooperation toward stronger artificial intelligence model guardrails

After uncovering these vulnerabilities, the researchers disclosed their findings to Google, Anthropic and OpenAI. A Google spokesperson said that important guardrails informed by the research are already built into Bard and that the company is committed to improving them further.

Similarly, Anthropic acknowledged its continued research into jailbreak countermeasures and emphasized its dedication to strengthening the guardrails of the base model and exploring additional layers of defense.

OpenAI, for its part, has not yet responded to inquiries about the findings, though it is expected to be actively exploring possible solutions.

This development is reminiscent of early attempts by users to undermine content-moderation guidelines when ChatGPT and Microsoft’s AI-powered Bing first launched. While tech companies quickly patched some of those early hacks, the researchers believe it remains “unclear” whether leading AI model providers will ever be able to prevent such behavior completely.

The results of the study raise critical questions about how artificial intelligence systems can be kept in check, and about the security implications of releasing powerful open-source language models to the public. As artificial intelligence advances, security measures will have to keep pace with technical developments to protect against possible abuse.
