
Mythos: Three Weeks Later, My AI Thesis Already Looks Conservative

Written by Luis Estrada | Apr 9, 2026 3:38:20 PM

Three weeks ago, I published a whitepaper titled "AI in Balance Sheet Management: The Next Architecture of Agents and Automation." In it, I argued that artificial intelligence capabilities were growing exponentially, and that most financial institutions were significantly underestimating the pace.

I laid out the data, citing METR's time-horizon benchmark, which tracks how long an AI model can work autonomously on a complex task. The data showed a consistent doubling pattern. Based on this, I built a three-phase adoption framework, making the case that the "asymmetry of being wrong" is the real risk: overestimating AI leads to recoverable costs, but underestimating it creates a compounding competitive gap. My conclusion was clear: institutions need to act now rather than wait.

Three weeks later, Anthropic released the Claude Mythos Preview System Card.

I need to discuss what’s in this document, because it doesn’t just validate my thesis; it makes my publication look conservative. And keep in mind, my whitepaper was deliberately aggressive.

This article covers three key points:

  1. A Quantitative Analysis: Where Mythos likely sits on the capability curve I presented, and why the evidence suggests it may have shattered the existing trend.

  2. Real-World Capabilities: What the model can actually do, not just in abstract benchmarks, but in terms of real-world consequences so significant that its own creator decided against a public release.

  3. The Impact on Strategy: What this means for how financial institutions should approach AI adoption, including why the phased framework I proposed may already be obsolete.

For the full foundation, including the exponential data, the three-phase framework, and detailed use cases for balance sheet management, I encourage you to read the complete whitepaper. What follows is the update that arrived much faster than I ever expected. 

 
1. The Trend Just Broke Upward: How Mythos Changes the AI Capability Curve

METR: The Metric That Matters 

In my whitepaper, I centered much of the argument on a single benchmark: METR’s Time Horizon evaluation. Unlike benchmarks that test whether an AI can answer a question correctly, METR measures something far more relevant to real-world deployment: how long an AI model can work autonomously on a complex software task and still succeed at least 50% of the time. It is, in essence, a measure of reliable independent work capacity.

When I wrote the whitepaper, the frontier sat at Claude Opus 4.6, with a p50 time horizon of approximately 12 hours. That already represented extraordinary progress, moving from roughly 4 minutes for GPT-4 in March 2023 to 12 hours in February 2026, a trajectory that doubled roughly every four months.
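
As a quick sanity check on that doubling claim, the two endpoints alone already imply a doubling time of roughly 140 days, about four and a half months, in the same ballpark as the roughly-four-month figure. The sketch below is my own back-of-the-envelope arithmetic, not METR's methodology; METR's published rate, cited later in this article, is fit to the full model history rather than just these two points.

```python
# Back-of-the-envelope check of the doubling claim, using only the two endpoints
# cited above (~4 minutes for GPT-4 in March 2023, ~12 hours for Opus 4.6 in
# February 2026). METR's own 129-day figure is fit to the full model history.
from math import log2

start_minutes = 4              # GPT-4, March 2023
end_minutes = 12 * 60          # Claude Opus 4.6, February 2026
elapsed_days = 35 * 30.44      # roughly 35 months between the two points

doublings = log2(end_minutes / start_minutes)      # ~7.5 doublings
doubling_time = elapsed_days / doublings           # ~142 days, about 4.7 months
print(f"{doublings:.1f} doublings, one roughly every {doubling_time:.0f} days")
```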

But 12 hours, while impressive, is still within the realm of what a competent human analyst does in a single workday. It is the kind of number that allows skeptics to argue that AI is a useful tool but not a transformational force. The question I posed in the whitepaper was: What happens when this number keeps doubling?

Mythos may have answered that question far sooner than anyone expected. 

 

Extrapolating Mythos Capability from Anthropic’s Data 

METR has not published an evaluation of Mythos; the model's restricted access makes independent benchmarking difficult. However, Anthropic did publish extensive benchmark results in the Mythos System Card, and METR has published the full evaluation data for every other frontier model. This gives us enough information to build a reasoned estimate. 

The key benchmark for this exercise is SWE-bench Pro, a rigorous evaluation of real-world software engineering capability across 1,865 tasks. Unlike the older SWE-bench Verified (which has become saturated, with the top six models clustered between 76% and 81%), SWE-bench Pro still discriminates meaningfully between models. Critically, the Mythos System Card reports both Opus 4.6 and Mythos scores on SWE-bench Pro using the same evaluation scaffolding, making the comparison direct.

The numbers: 

 Model                   SWE-bench Pro   METR p50 Time Horizon
 Claude Opus 4.5         45.9%           4h 53m
 Claude Opus 4.6         53.4%           11h 59m
 Claude Mythos Preview   77.8%           ???

 

That jump from 53.4% to 77.8% (a 24.4 percentage point improvement) is the largest single-model leap in SWE-bench Pro history. For context, the gap between Opus 4.5 and Opus 4.6 was 7.5 percentage points. Mythos didn't just take a step forward; it leaped. 

 

Where Mythos Likely Lands on the AI Capability Curve

I used two approaches to estimate Mythos's METR time horizon. 

  • The Trend Projection: This simply extends the established doubling curve. METR's own data shows capability doubling approximately every 129 days. Opus 4.6 sits at roughly 12 hours. Mythos arrived about 61 days later. Extending the trend gives us approximately 17 hours, a modest 1.4x improvement over Opus 4.6. Just the next point on the exponential curve. Business as usual.

  • Benchmark-Adjusted Estimates: This uses the relationship between SWE-bench Pro improvements and METR improvements to calibrate a different projection. I have one clean data point: from Opus 4.5 to Opus 4.6, a +7.5 percentage point improvement on SWE-bench Pro corresponded to a 2.45x increase in METR autonomous duration. Applying this calibration to Mythos's +24.4 percentage point jump yields two estimates (the arithmetic behind all three projections is sketched after this list):

    • Conservative (power-law scaling): Approximately 112 hours (14 work days) of continuous autonomous work. This is 9.3x Opus 4.6.

    • Aggressive (exponential scaling): Approximately 222 hours (27.75 work days) of continuous autonomous work. This is 18.5x Opus 4.6.
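
For readers who want to reproduce these figures, here is a minimal sketch of the arithmetic behind all three projections. The functional forms, pure trend extension, power-law in the SWE-bench Pro score, and exponential in the SWE-bench Pro score, are my modelling assumptions, calibrated solely on the Opus 4.5 to Opus 4.6 transition.

```python
# Sketch of the arithmetic behind the three Mythos estimates described above.
# The functional forms (trend extension, power law in SWE-bench Pro score,
# exponential in SWE-bench Pro score) are modelling assumptions, calibrated
# only on the single Opus 4.5 -> Opus 4.6 transition.
from math import exp, log

opus45_swe, opus45_hours = 45.9, 4 + 53 / 60    # 45.9%, 4h 53m
opus46_swe, opus46_hours = 53.4, 11 + 59 / 60   # 53.4%, 11h 59m
mythos_swe = 77.8                               # Mythos SWE-bench Pro score

# 1) Trend projection: doubling every 129 days; Mythos arrived ~61 days later.
trend_hours = opus46_hours * 2 ** (61 / 129)                         # ~17 h

# 2) Power law: time horizon proportional to (SWE-bench Pro score) ** beta.
beta = log(opus46_hours / opus45_hours) / log(opus46_swe / opus45_swe)
powerlaw_hours = opus46_hours * (mythos_swe / opus46_swe) ** beta     # ~112 h

# 3) Exponential: each percentage point multiplies the time horizon by the
#    constant factor implied by the +7.5 pp -> 2.45x jump from Opus 4.5 to 4.6.
k = log(opus46_hours / opus45_hours) / (opus46_swe - opus45_swe)
exponential_hours = opus46_hours * exp(k * (mythos_swe - opus46_swe))  # ~222 h

print(f"trend: {trend_hours:.0f} h | power law: {powerlaw_hours:.0f} h | "
      f"exponential: {exponential_hours:.0f} h")
```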

[Figure: METR Trend vs Benchmark-Adjusted Estimates for Mythos]

I want to be transparent about the limitations. I am calibrating from a single model transition (Opus 4.5 to 4.6), so the scaling rate carries uncertainty. Mythos has not been independently evaluated by METR. These are estimates, not measurements.

But the direction is unambiguous. The trend line, which itself represents an exponential, predicts 17 hours. Every benchmark-adjusted method, regardless of whether I use conservative or aggressive assumptions, places Mythos well above the trend. Even if I halved the conservative estimate out of caution, I would still be looking at roughly 56 hours: more than two full days of autonomous work, nearly 5x Opus 4.6, and substantially above what the established doubling rate would predict.

 

What This Actually Means: Mythos's Revelations About the Future of AI Timelines

The exponential trend in AI capability was the centerpiece of my whitepaper's argument. I presented it as a reason for urgency. What the Mythos data suggests is something more uncomfortable: the trend itself may be accelerating. Not just continuing to double at the established rate, but doubling faster, with the curve bending further upward.

The gap between the trend projection (17 hours) and the benchmark-adjusted range (112–222 hours) is the gap between "the future is arriving on schedule" and "the future just moved closer." For institutions that built their AI strategy around the assumption that they had 12 to 18 months before capability reached truly transformational levels, this gap should be deeply unsettling.

An AI system that can work reliably for 12 hours is impressive. One that can work for five to nine continuous days represents a qualitatively different proposition. It is not just a faster tool, but an autonomous capability that can take on work previously reserved for experienced teams working across multiple business days.

But duration alone isn't what makes Mythos significant. What the model does within that timeframe is what changed even Anthropic’s own plans. That is the subject of the next section. 

 

2. It’s Not About the Hours, It’s About Capability: What Can Mythos Actually Do in Real-World Scenarios?

When the Builder Won’t Release Its Own Model

In the previous section, I focused on how long Mythos can work. Now, I need to talk about what it can actually do. The nature of the capability matters more than the duration, and it is significant enough that it changed the behavior of the company that created it.

Since its founding, Anthropic has released every major model it has built to the public. Claude 3, Claude 3.5, Claude 4, Opus 4.5, and Opus 4.6 were all made commercially available. Their business depends on it. Revenue comes from API access and subscriptions; releasing models isn't a nice-to-have, it is the business model.

With Mythos, they broke that pattern. For the first time in the company's history, Anthropic decided not to release a model to the public. Not because it underperformed, but because it worked too well.

To understand why, I need to explain what Anthropic found when they tested Mythos against real-world security challenges. I’ll keep this non-technical, because the implications matter far more than the mechanics.

 

Finding What Decades of Experts Missed

 Can Mythos identify vulnerabilities that humans and tools miss?

Software, the kind that runs operating systems, web browsers, servers, and the financial infrastructure your institution relies on, contains flaws. Some of these flaws are known and patched. Others remain hidden, undiscovered even by the teams that wrote the code and the security professionals whose job is to find them. In the security world, these hidden flaws are called "zero-day vulnerabilities": weaknesses that no one knows exist until someone finds them. Whoever finds them first holds enormous power: they can either fix them or exploit them.

Anthropic tested Mythos against this problem. The results were unlike anything the industry had seen.

Mythos identified thousands of previously unknown, high-severity vulnerabilities across every major operating system and every major web browser. These weren't minor issues; they were critical flaws that could be used to compromise systems at scale. Three examples illustrate the magnitude of what happened:

  • OpenBSD: A flaw in a system specifically designed for security that had been hiding for 27 years.

  • FFmpeg: A bug in software used in virtually every video application on Earth, which automated security testing had encountered five million times without ever detecting it.

  • Linux Kernel: Multiple linked vulnerabilities in the foundation of most servers worldwide that, chained together, would allow an attacker to escalate from basic user access to full control of the machine.

These are not the kinds of flaws that previous AI models could find. When Anthropic tested earlier models on the ability to find and exploit vulnerabilities in a web browser, the results were stark. Claude Opus 4.6, the most capable model available before Mythos, succeeded 7.6% of the time. Mythos succeeded 85.2% of the time. That is not an incremental improvement. That is crossing a threshold from "occasionally capable" to "reliably capable."

Furthermore, Mythos became the first AI model to complete a simulated real-world corporate network attack end-to-end, a task that would take an experienced human security specialist over 10 hours. The System Card explicitly states that Mythos is capable of conducting autonomous cyber-attacks on small-scale corporate networks with weak defenses.

 

What Anthropic Did Instead of Releasing Mythos

This is the part that should command the attention of every executive in financial services.

Anthropic looked at these results and concluded that the risk of making Mythos publicly available was too high. The same capability that finds vulnerabilities to fix them can find vulnerabilities to exploit them. If this model were available to anyone with an API key, it would fundamentally shift the balance of power in cybersecurity overnight.

So, instead of selling their most powerful product, they built a defensive alliance.

Project Glasswing is a coalition of twelve founding partners, including AWS, Apple, Google, Microsoft, NVIDIA, and JPMorgan Chase, assembled specifically to use Mythos’s capabilities defensively. The purpose is to fix critical software before malicious actors can develop similar capabilities. Anthropic even committed $100 million in usage credits to open-source security organizations.

Read that list of names again. These are not companies that join initiatives casually. When the world's tech giants and largest banks simultaneously agree to participate in a defensive program built around a single AI model, they aren't doing it for PR. They are doing it because their own security teams concluded the capability warranted an institutional response.

 

Why Mythos Matters for Financial Executives

I am not writing this to alarm you about cybersecurity, though the implications for financial infrastructure are obvious. I am writing it because this is the clearest proof that the nature of AI capability has changed qualitatively, not just quantitatively.

Consider what Mythos did: it found patterns in complex, interconnected systems that decades of expert human review and millions of automated tests missed. It didn't do this just by being "faster." It did it by reasoning at a depth and across a breadth of context that earlier models simply could not sustain.

That same capability, deep reasoning across complex, interconnected systems over long periods, is precisely what matters in balance sheet management. Regulatory frameworks like IRRBB, CSRBB, LCR, and NSFR are not simple checklists. They interact with market conditions, strategic positioning, hedging decisions, and funding structures. The institutions that manage these well are the ones whose teams can hold the most complexity in their heads at once. Mythos demonstrated that it can hold more complexity, across longer time horizons, than any human team.

The cybersecurity story is the proof of concept. When the company that built a model decides it is too capable to sell, the question is no longer whether AI has reached a transformational threshold. The question is what you do with that knowledge.

That brings us to the final point: why the deployment strategy I proposed just three weeks ago may already be obsolete. 

 

3. Beyond the Three-Phase Framework: What Mythos Means for AI Strategy in Financial Institutions

The Part of My AI Adoption Framework That Needs Revising

In my whitepaper, "AI in Balance Sheet Management: The Next Architecture of Agents and Automation", I proposed a three-phase framework for AI adoption in balance sheet management: 

  • Phase 1: Generic large language models used as conversational assistants (chatbots).

  • Phase 2: AI agents equipped with institutional knowledge and connected to real tools and systems.

  • Phase 3: Full automation, with teams of AI agents executing complex workflows under human-designed governance. 

I still believe this accurately describes the path most institutions will take. However, I wrote it with a specific, unexamined assumption: that humans would be the architects at every stage. In my mind, humans would decide which tasks to hand to AI, design the workflows, and define where judgment should take over. The AI was the worker; the human was the designer.

If Mythos is what the evidence suggests, a system capable of days of autonomous work with reasoning depth that exceeds human capability in complex domains, then that assumption is the part of my whitepaper that will age the fastest.

The Inversion of Control: Why Human-Led AI System Design Is Becoming a Bottleneck

Every AI implementation I see in financial services follows the same logic: humans study the work, identify pain points, and build the integration. This is sensible, responsible, and, by the time you finish implementing it, likely outdated.

This approach is constrained by a bottleneck no one talks about: human capacity to design good systems. A Head of ALM can describe what their team does. They can identify pain points. They can probably point to two or three processes that feel ripe for automation. But can they simultaneously hold in mind the full interaction between IRRBB limits, CSRBB exposure, LCR buffer requirements, NIM targets, hedging costs, FTP allocation methodology, ALCO reporting cadence, and regulatory submission timelines, and then design an optimal operating model that accounts for all of these simultaneously? No. No human can. We work through these in parts, with coordination that is inevitably imperfect.

The bottleneck isn't the AI's ability to execute; it's the human's ability to design a system sophisticated enough to take full advantage of that execution.

What if you removed that bottleneck entirely? 

What Is the ‘Meta-Developer’ Model Enabled by Mythos?

Here is what I believe becomes possible with a Mythos-class model, not as speculation, but as a direct consequence of the capabilities documented in the System Card and demonstrated through Project Glasswing.

Instead of painstakingly walking through Phases 1, 2, and 3 with human architects at every step, you train a Mythos-class model on the full reality of how your department operates. Not an abstract description in a strategy document, but the actual working environment. How the team spends its days. What the goals are. What regulations you adhere to. What tools you use. What systems hold the data. How decisions flow from analysis to ALCO to execution. The relationships between functions. The pain points no one has time to fix. The workarounds everyone knows about but has learned to live with.

You give it access: not just knowledge, but actual connections to your systems, whether through API integrations, database connections, or the code itself.

And then you ask it to build.

Not to answer questions like a chatbot. Not to execute a single workflow like a Phase 2 agent. You ask it to design and develop the entire team of specialized AI agents needed to run the department.

  • One agent for IRRBB scenario analysis.

  • Another for LCR monitoring and optimization.

  • Another for hedging strategy.

  • Another for FTP.

  • Another for regulatory reporting.

  • Another for ALCO preparation.

Each one is purpose-built for its specific function, connected to the relevant systems, operating under governance rules that the model itself defines based on its understanding of the regulatory environment.

This is not a single AI doing everything. This is AI as the software architect that builds, deploys, and orchestrates a team of specialized agents, the same way a CTO would build an engineering team, except this CTO can hold every regulatory requirement, every data dependency, every risk interaction, and every operational constraint in context simultaneously. And it can write the code to make it all work.
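
To make that shape concrete, here is a purely illustrative sketch of the kind of specification a meta-developer model might produce for such a team. Every agent name, system name, and escalation rule below is hypothetical, invented for this example; it is not an existing product interface.

```python
# Purely illustrative: the kind of specification a meta-developer model might
# emit for a balance sheet management agent team. Every name, system, and rule
# below is hypothetical, invented for this example; it is not a product interface.
from dataclasses import dataclass

@dataclass
class AgentSpec:
    name: str
    mandate: str                        # what the agent is responsible for
    connected_systems: list[str]        # data sources and execution systems it may touch
    escalate_to_human_when: list[str]   # governance rules defined by the designer model

agent_team = [
    AgentSpec(
        name="irrbb_scenario_agent",
        mandate="Run IRRBB scenario analysis and monitor EVE/NII limits",
        connected_systems=["alm_engine", "market_data_feed"],
        escalate_to_human_when=["projected limit breach", "unmodelled product detected"],
    ),
    AgentSpec(
        name="lcr_monitoring_agent",
        mandate="Monitor and optimize the LCR buffer intraday",
        connected_systems=["liquidity_warehouse", "payments_gateway"],
        escalate_to_human_when=["buffer below internal early-warning threshold"],
    ),
    AgentSpec(
        name="alco_preparation_agent",
        mandate="Assemble the ALCO pack and draft decision proposals",
        connected_systems=["reporting_store", "ftp_engine"],
        escalate_to_human_when=["any investment decision requiring ALCO sign-off"],
    ),
    # ...plus hedging, FTP, and regulatory reporting agents along the same lines.
]
```

The substance, which systems each agent touches and where it must hand control back to a human, is exactly what the meta-developer model is being asked to design.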

 

The Critical Inversion: Who Defines the Human Role? 

In every current AI implementation, humans decide where AI should contribute. The human says: "automate this report" or "build me an agent that monitors LCR."

In what I am describing, the inversion is complete. The AI, having understood the full complexity of the department's operations, defines where humans should intervene.

Perhaps it determines that the final ALCO investment decision should remain human, not because the AI cannot make it, but because regulatory and governance frameworks require human accountability at that level. Perhaps it identifies that the relationship with the supervisor during ILAAP submissions benefits from human judgment in ways that AI cannot replicate.

Perhaps it finds that certain market-making decisions in stressed conditions require the kind of intuition that comes from having lived through previous crises.

The point is not that humans disappear. The point is that the AI, having a more complete view of the operation than any individual or team could hold, is better positioned to determine where human involvement adds genuine value versus where it is simply a legacy of how things have always been done. Instead of humans asking, "Where can AI help me?", the model answers, "Here is where you are irreplaceable, and here is everything else." 

 

Learning, Adapting, Evolving 

 Can AI systems built with Mythos continuously learn and evolve?

There is a crucial distinction between what I am describing and traditional software development. If you hire a consulting firm to redesign your department and build new systems, what you get is a static deliverable. It works on the day it launches. Six months later, when regulations change, markets shift, and new products are introduced, the system begins to decay. It requires expensive, slow human intervention to update. 

The agents that a Mythos-class model builds are not static automations. They are designed with the capacity to learn from their own operations, to recognize when the patterns in the data are shifting, when a regulatory interpretation is evolving, when a hedging strategy that worked last quarter is underperforming this quarter. And when they reach the boundaries of what they can handle, the meta-developer, the Mythos-class model, can come back in and redesign, extend, or rebuild them.

This creates something that has not existed in balance sheet management before: an operating model that continuously evolves. Not through annual consulting engagements or quarterly system reviews, but through ongoing adaptation driven by agents that are purpose-built to learn. 

 

The Practical Path Forward to Adopting Mythos-Like Capabilities 

I am not suggesting you hand your balance sheet to an AI tomorrow. I am suggesting the path is shorter than it appears:

  1. The training phase. You give the model full context: how the department works today, what the goals are, what tools and systems are in use, what regulatory frameworks apply, and how decisions are made. This is not fundamentally different from the onboarding process you would give a senior hire, except the model can absorb and integrate the information at a scale and speed that no human can match.

  2. The design and build phase. The model architects the target operating model and develops the individual agents. It writes the code, sets up the integrations, and defines the governance and escalation rules. This is where the capability demonstrated in the METR and SWE-bench Pro data becomes directly relevant, the sustained, autonomous software development capacity that the previous sections documented.

  3. The parallel run. You do not shut down the existing department and switch to AI on a Monday morning. You run both operating models side by side for a defined period. The existing team continues to operate as it always has. The AI-built agents operate in parallel. You compare outputs. The delta between the two, in ALCO portfolio profitability, in regulatory optimization, in hedging efficiency, in FTP accuracy, becomes the evidence base for the transition.

  4. The deployment. Based on the results of the parallel run, you migrate. The human team's role evolves: fewer people doing routine analysis, more people focused on the strategic decisions and relationship management that the model itself identified as uniquely human. 

 

What the Performance Delta Could Look Like

 What performance improvements could Mythos enable in ALM?

The impact will be unprecedented because an AI-designed operating model is not constrained by the limits of human system design.

Today, balance sheet functions operate in silos. The IRRBB team optimizes interest rate risk, the liquidity team manages LCR and NSFR, and the FTP team sets transfer prices. Each optimizes its own domain, but the critical interactions between them are often missed. For example, a hedge might create LCR drag, or an FTP curve might misallocate capital against CSRBB constraints. No human can hold every variable simultaneously. 

An AI-built operating model, designed by a system that understands this full complexity, optimizes all domains at once. This is not a marginal gain; it is a step change in integrated performance.
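
A toy example, with invented numbers, illustrates the mechanism: a hedging desk that optimizes rate risk in isolation picks a different hedge ratio than a joint optimizer that also sees the LCR drag of posted collateral, and the siloed choice carries a higher true total cost.

```python
# Toy illustration with invented numbers: why siloed optimization leaves value
# on the table. The cost functions and coefficients are arbitrary placeholders.
import numpy as np

def rate_risk_cost(h):   # residual NII volatility cost of the unhedged portion
    return 100 * (1 - h) ** 2

def hedge_cost(h):       # direct cost of the hedge
    return 20 * h

def lcr_drag_cost(h):    # HQLA drag from collateral posted against the hedge
    return 60 * h

h = np.linspace(0, 1, 1001)                      # candidate hedge ratios
silo_view = rate_risk_cost(h) + hedge_cost(h)    # what the IRRBB desk sees
true_total = silo_view + lcr_drag_cost(h)        # what the balance sheet pays

i_silo, i_joint = np.argmin(silo_view), np.argmin(true_total)
print(f"siloed: hedge ratio {h[i_silo]:.2f}, true cost {true_total[i_silo]:.0f}")   # ~0.90, ~73
print(f"joint:  hedge ratio {h[i_joint]:.2f}, true cost {true_total[i_joint]:.0f}")  # ~0.60, ~64
```

The coefficients are arbitrary; the structural point is that a desk that cannot see a cost never prices it, while a joint optimizer does.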

This is not a future prediction. The capabilities in the Mythos System Card, such as autonomous duration, reasoning depth, and software proficiency, are exactly what is needed to make this real. The only question is when institutions will be ready to use them. 

 

Final Thoughts: What Does Mythos Change About AI in 2026?

Why did the AI timeline compress in just three weeks? 

I want to close with a reflection on time. My whitepaper was well received. It was discussed at conferences and shared in industry forums. People told me it was one of the most forward-looking pieces they had read on AI in financial services. Yet, in three weeks, the model Anthropic announced made my projections look conservative. The framework and logic hold, but they are conservative. The future I described as arriving over the next 12 to 18 months may be arriving now. Three weeks is how fast this is moving.

 

How Banks Are Misframing AI Risk 

I want to make an observation regarding how our industry thinks about cybersecurity. The dominant concern I hear from banks is data leakage, which is the fear that sensitive information will leave their systems when interacting with large language models. This is a legitimate concern that deserves serious attention, but it has become the only concern for many institutions. The conversation often begins and ends with the refusal to send data to an external model. Everything else, including how AI could improve their security posture, gets deferred.

Meanwhile, Anthropic just demonstrated that a single AI model can find thousands of critical vulnerabilities that 27 years of human expertise and millions of automated tests missed. The cybersecurity threat landscape is about to change in ways most banks have not considered. The question is not only whether your data is safe when it leaves your systems. The question is whether your systems themselves are safe against adversaries using capabilities like those of Mythos.

The institutions participating in Project Glasswing, including JPMorgan Chase, understood this immediately. They are not debating whether to use AI. They are using the most powerful AI in existence to harden their defenses because they recognize that the defensive application of these capabilities is not optional. It is urgent. Banks that frame AI exclusively as a data leakage risk are looking in the wrong direction. 

 

What does Mythos signal for the future of AI this year? 

It’s Only April!

The Mythos System Card was released in April 2026. We are not even halfway through the year. If the trend I documented was already accelerating three weeks ago, the capability that will exist by December is difficult to project. It will be significantly beyond where we are today.

I wrote my whitepaper to argue that institutions needed to start moving. Mythos has compressed the timeline in which that movement matters. If you are not having practical conversations about how and when to deploy AI in your balance sheet management, treasury, or ALM functions, you are running out of time. The technology is ready now, and proactive institutions are already building.

For the full framework, the exponential capability data, the three-phase adoption model, and detailed use cases for balance sheet management, read the complete whitepaper "AI in Balance Sheet Management: The Next Architecture of Agents and Automation."

To understand how Mirai AI leverages these capabilities through our Mirai AI MCP Server, Mirai AI Agent, and Mirai AI Modeling, I invite you to learn more here.

The architecture for what comes next is not theoretical. It exists. The question is whether you will be using it or competing against those who are. 

Author's Note: Luis Estrada is the COO of Mirai RiskTech, a risktech company specializing in AI-powered Balance Sheet Management solutions.