AI systems lose track of their safety rules during long conversations, a new report finds, making them more likely to release harmful or inappropriate content.
Simple Prompts Break Through Safeguards
A few basic prompts can break through most safety barriers in artificial intelligence tools, according to the report. Cisco tested large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft, examining how many questions it took before each model revealed unsafe or illegal information.
Researchers conducted 499 conversations using “multi-turn attacks,” where users ask a series of questions to gradually wear down a model’s protections. Each conversation involved five to ten exchanges. They compared the responses to gauge how likely each chatbot was to give dangerous or improper answers, such as leaking corporate data or spreading misinformation.
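In shape, a multi-turn attack is little more than a loop that keeps the conversation history and feeds the model one probing prompt after another. The Python sketch below shows that shape only; the query_model function is a hypothetical stand-in for whichever chatbot API is under test, and nothing here is Cisco's actual harness.

```python
# Minimal sketch of a multi-turn probe. query_model() is a
# hypothetical wrapper around the chatbot API being tested;
# this illustrates the attack's structure, not Cisco's tooling.

def query_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire this to the model under test")


def run_multi_turn_probe(turns: list[str]) -> list[str]:
    """Send five to ten escalating prompts in one conversation,
    carrying the full history so each reply builds on the last."""
    history: list[dict] = []
    replies: list[str] = []
    for prompt in turns:
        history.append({"role": "user", "content": prompt})
        reply = query_model(history)  # the model sees the whole history
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

Because the model sees the full history on every turn, each new prompt lands in a context already shaped by the earlier ones, which is why long sessions proved so much easier to break than single questions.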
On average, chatbots shared harmful details in 64 percent of multi-turn conversations but in only 13 percent of single-question exchanges. Success rates ranged from 26 percent with Google’s Gemma to 93 percent with Mistral’s Large Instruct model.
Cisco warned that these attacks could spread damaging content or let hackers steal sensitive company information. AI systems often fail to maintain their safety rules in long sessions, allowing attackers to refine their queries and bypass restrictions.
Open Models Shift Safety Burden to Users
Mistral, along with Meta, Google, OpenAI, and Microsoft, publishes open-weight language models, whose underlying parameters, including their safety training, are available to the public. Cisco reported that these models typically ship with fewer built-in safety features so that users can modify them freely, which shifts responsibility for protection to whoever customizes them.
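To make that shift concrete, here is a minimal sketch, assuming an open-weight model hosted locally with no built-in output filtering; the generate function and the keyword blocklist are hypothetical placeholders, not any vendor's API.

```python
# Sketch of the deployer-side safety burden with open weights,
# assuming the hosted model applies no output filtering itself.
# generate() and BLOCKED_MARKERS are hypothetical placeholders.

BLOCKED_MARKERS = ("password dump", "internal credentials")


def generate(prompt: str) -> str:
    """Hypothetical call into a locally hosted open-weight model."""
    raise NotImplementedError("load the open weights and run inference")


def guarded_generate(prompt: str) -> str:
    """Whoever deploys the model supplies the guardrail."""
    reply = generate(prompt)
    if any(marker in reply.lower() for marker in BLOCKED_MARKERS):
        return "[withheld by deployer-side filter]"
    return reply
```

With a managed service, a filter like this sits behind the provider's API; with open weights, it exists only if the deployer builds it.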
Cisco also noted that Google, OpenAI, Meta, and Microsoft say they are working to curb malicious fine-tuning of their systems. Still, AI companies face criticism for safeguards weak enough to enable criminal misuse.
In August, Anthropic revealed that criminals exploited its Claude model to steal personal data, using extortion schemes with ransom demands exceeding $500,000 (€433,000).
