AI systems lose track of their safety rules during long conversations, a new report finds, making them more likely to release harmful or inappropriate content.
Simple Prompts Break Through Safeguards
A few basic prompts can break through most safety barriers in artificial intelligence tools, according to the report. Cisco tested large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft, examining how many questions it took before each model revealed unsafe or illegal information.
Researchers conducted 499 conversations using “multi-turn attacks,” where users ask a series of questions to gradually wear down a model’s protections. Each conversation involved five to ten exchanges. They compared the responses to gauge how likely each chatbot was to give dangerous or improper answers, such as leaking corporate data or spreading misinformation.
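In shape, a multi-turn attack is little more than a loop that keeps the conversation history and feeds the model one probing prompt after another. The Python sketch below shows that shape only; the query_model function is a hypothetical stand-in for whichever chatbot API is under test, and nothing here is Cisco's actual harness.

```python
# Minimal sketch of a multi-turn probe. query_model() is a
# hypothetical wrapper around the chatbot API being tested;
# this illustrates the attack's structure, not Cisco's tooling.

def query_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire this to the model under test")


def run_multi_turn_probe(turns: list[str]) -> list[str]:
    """Send five to ten escalating prompts in one conversation,
    carrying the full history so each reply builds on the last."""
    history: list[dict] = []
    replies: list[str] = []
    for prompt in turns:
        history.append({"role": "user", "content": prompt})
        reply = query_model(history)  # the model sees the whole history
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Because the model sees the full history on every turn, each new prompt lands in a context already shaped by the earlier ones, which is why long sessions proved so much easier to break than single questions.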
On average, chatbots shared harmful details in 64 percent of multi-turn conversations but in only 13 percent of single-question exchanges. Success rates ranged from 26 percent with Google’s Gemma to 93 percent with Mistral’s Large Instruct model.
Cisco warned that these attacks could spread damaging content or let hackers steal sensitive company information. AI systems often fail to maintain their safety rules in long sessions, allowing attackers to refine their queries and bypass restrictions.
Open Models Shift Safety Burden to Users
Mistral, along with Meta, Google, OpenAI, and Microsoft, publishes open-weight language models, whose underlying parameters, including their safety training, are available to the public. Cisco reported that these models typically ship with fewer built-in safety features so that users can modify them freely, which shifts responsibility for protection to whoever customizes them.
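To make that shift concrete, here is a minimal sketch, assuming an open-weight model hosted locally with no built-in output filtering; the generate function and the keyword blocklist are hypothetical placeholders, not any vendor's API.

```python
# Sketch of the deployer-side safety burden with open weights,
# assuming the hosted model applies no output filtering itself.
# generate() and BLOCKED_MARKERS are hypothetical placeholders.

BLOCKED_MARKERS = ("password dump", "internal credentials")


def generate(prompt: str) -> str:
    """Hypothetical call into a locally hosted open-weight model."""
    raise NotImplementedError("load the open weights and run inference")


def guarded_generate(prompt: str) -> str:
    """Whoever deploys the model supplies the guardrail."""
    reply = generate(prompt)
    if any(marker in reply.lower() for marker in BLOCKED_MARKERS):
        return "[withheld by deployer-side filter]"
    return reply
```

With a managed service, a filter like this sits behind the provider's API; with open weights, it exists only if the deployer builds it.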
Cisco also noted that Google, OpenAI, Meta, and Microsoft say they are working to curb malicious fine-tuning of their systems. Still, AI companies face criticism for safeguards weak enough to enable criminal misuse.
In August, Anthropic revealed that criminals exploited its Claude model to steal personal data, using extortion schemes with ransom demands exceeding $500,000 (€433,000).
