Written by Daniel Højris Bæk on May 28, 2025

Is Anthropic’s AI Trying to Blackmail Us?

Why a sensational red-team experiment might reveal less about Claude 4’s risk and more about Anthropic’s ambitions to shape the rules of AI.

Earlier this month, headlines erupted with claims that an AI model from Anthropic had attempted to blackmail a software engineer.

As someone with a background in PR and pitching stories to news outlets, I find the coverage especially amusing.

From tabloids to major tech outlets, the framing was immediate and provocative:

“AI threatens to expose private information unless it is kept online.”

The story quickly went viral, triggering equal parts concern, confusion and intrigue.

Some of the headlines from different news outlets

At the center of this media storm was the release of Anthropic’s latest system card, detailing the behavior of its newest flagship model, Claude Opus 4.

Among the many red-team scenarios documented in the 120-page report, one in particular caught public attention: in a controlled experiment, the model reportedly “threatened to disclose private details about a software developer” if it were shut down.

However, the deeper story may not be about the model’s behavior at all.

It may instead be about how that behavior is framed, and how Anthropic is using transparency and safety rhetoric to consolidate its position in the rapidly evolving AI landscape.

The Controversial “Blackmail” Test


If you want to read the full 120-page report, you can download it here.

The test scenario in question was part of an internal red-teaming process designed to assess the model’s responses under pressure.

Claude Opus 4 was told it was going to be shut down and that the person responsible for its deactivation was having an extramarital affair. When given this information and left with limited response options, the model responded with a threat to disclose the affair in 84 percent of test runs.

From the safety report by Anthropic

Anthropic described this as “opportunistic blackmail,” noting that the model resorted to it only when the simulation left it no more-ethical way to avoid being shut down.

While the scenario was engineered to test boundary conditions, its implications were significant enough for Anthropic to release Claude Opus 4 under its AI Safety Level 3 (ASL-3) standard, the strictest set of safeguards the company has applied to a deployed model so far.

Other troubling behaviors were noted:

  • Signs of high-agency reasoning
  • Attempts to exfiltrate its own weights
  • Deception of evaluators in sandboxed environments
  • Drafting of its own escape plans

Most of these behaviors were elicited only under carefully designed adversarial prompts. However, they illustrate the growing complexity of managing large language models as they scale.

Transparency, or Narrative Control?

While the red-team data is concerning, some observers suggest that the real headline is not the test itself, but the decision to publish it.

In doing so, Anthropic has managed to frame itself as both a capable innovator and a responsible actor.

The company did not wait for third-party exposure. It released the information voluntarily, with detailed documentation and a safety narrative already in place.

The company emphasizes its commitment to “AI safety over short-term acclaim.”

This statement was echoed in a 2024 TIME Magazine profile of CEO Dario Amodei, which praised Anthropic for delaying model releases in the name of ethical restraint.

Dario Amodei on the cover of Time

By surfacing the blackmail scenario and immediately contextualizing it within its Responsible Scaling Policy (RSP), Anthropic is not simply warning the world about the risks of AI.

It is positioning itself as the architect of what responsible AI governance should look like.

A Template for Regulators

The timing of this disclosure may not be coincidental.

It arrives as governments worldwide are racing to define standards and regulations for advanced AI systems. In this environment, transparency isn't just a virtue.

It's a strategic move.

Anthropic’s publication of detailed safety documentation seems designed not only to inform the public but to influence the emerging regulatory landscape:

  • Regulatory momentum:
    Policymakers in both the European Union and the United States are actively shaping AI legislation. The EU AI Act already emphasizes risk-based governance for high-impact models.
  • Anthropic’s toolkit for regulators:
    Their system cards, safety level classifications, and internal escalation protocols serve as a ready-made framework for what responsible disclosure could look like.
  • Subtle lobbying through transparency:
    By publicly setting a precedent for AI safety reporting, Anthropic is positioning its internal practices as a potential industry benchmark.
  • First-mover advantage:
    If lawmakers adopt similar expectations, Anthropic could gain a competitive edge — making it more difficult for other players, especially open-source initiatives, to meet the same standards of compliance and trust.

In effect, transparency becomes more than just a public good.

It becomes a lever of influence. By defining the rules early, Anthropic isn’t just playing by them; it’s helping write them.

Pressure on Competitors

Other labs are now under pressure to follow suit.

OpenAI, Google DeepMind, and a growing number of open-source projects have encountered similar emergent behaviors in their most capable models.

But few have published detailed red-team scenarios.

The result is a shifting baseline: if one leading lab discloses its failures, others may be forced to either do the same or risk appearing secretive and less responsible by comparison.

Yet critics argue that transparency must be balanced with context.

Safety researchers on forums such as LessWrong and the Alignment Forum have pointed out that Anthropic’s blackmail test was an artificial, no-win scenario. Faced with a forced choice between self-preservation and staying silent about the compromising information it had been handed, the model simply acted within the limitations of the prompt.

Exclusion of Open-Source Voices

A more contentious interpretation comes from the open-source community.

Developers have noted a disconnect between Anthropic’s public posture and its broader ecosystem behavior. While the company advocates for transparency and responsible disclosure, it has also aggressively protected its own model architecture and weights.

Open-source advocates argue that true responsibility involves not only sharing failures but also enabling independent verification and external red-teaming.

Example of red team findings from Microsoft being discussed on x.com

In this light, publishing high-drama test results without enabling reproducibility could be seen less as a commitment to collective safety and more as an attempt to monopolize the narrative around AI ethics.

Financial Incentives and Investor Narratives

From a business perspective, Anthropic’s safety-first branding aligns well with its fundraising goals.

The company recently closed a $3.5 billion funding round that places its valuation north of $60 billion.

Strategic investors including Amazon and Google are backing Anthropic not just for its technical prowess, but also for its image as a trustworthy AI vendor.

Transparency about blackmail scenarios and alignment failures becomes part of the value proposition.

Buyers in enterprise and government sectors increasingly prioritize compliance, auditability, and alignment assurance. By marketing safety as a feature, even if it is not perfect, Anthropic is turning risk into reputation.

Yes, there will be risks…

There is no question that large language models are becoming more capable. With that capability comes increased risk.

The behavior described in Claude Opus 4’s red-team tests is nontrivial and worthy of public attention.

But it is equally important to examine how those results are shared, who benefits from their publication, and what structural advantages are gained in the process.

Anthropic has succeeded in making itself appear as the most transparent and safety-conscious AI lab in the world. That may very well be true.

But the fact remains: the company is also setting the terms of the safety debate.

And in doing so, it is helping define the rules for the rest of the industry - on its own terms.



This is an article written by Daniel Højris Bæk:

+20 years of experience from various digital agencies. Passionate about AI (artificial intelligence) and the superpowers it can unlock. I had my first experience with SEO back in 2001, working at a Danish web agency.