
The Architecture of Autonomous Intelligence

When Machines Learn to Think Twice, Measure Their Own Minds, and Open Their Own Wallets

March 19, 2026 — Tech Deep Digest

"The question is not whether machines can think. The question is whether they can think about thinking—and then pay someone else to think for them."

Chapter One

Three Layers and a Miracle

Somewhere in France, on an evening that probably smelled of coffee and quiet obsession, a developer named Alain sat in front of two AMD consumer graphics cards—an RX 7900 XT and an RX 6950 XT, neither of them particularly exotic by 2026 standards—and did something that would have sounded like a joke at a machine learning conference three years ago. He took a 24-billion-parameter language model, duplicated three of its forty layers, and watched its logical reasoning score jump from 0.22 to 0.76.

No retraining. No new data. No gradient descent. No fine-tuning of any kind. He literally copied and pasted a small section of the model's own architecture back into itself, like a musician discovering that if you play the bridge of a song twice, the whole piece suddenly makes more sense.

The improvement was not marginal. On the BBH Logical Deduction benchmark—a standard test of whether a model can follow chains of if-then reasoning—the jump was 245 percent. On GSM8K, a math benchmark, it went up 33 percent. On code generation, a more modest but still meaningful 8 percent. The cost? An extra 1.5 gigabytes of VRAM and roughly 7.5 percent slower inference. The kind of trade-off that makes engineers say "shut up and take my memory."

This was not a fluke. Alain was replicating and extending a method called RYS—Repeat Your Steps—originally developed by David Ng, based on a startling insight: certain contiguous blocks of layers inside a large language model act as indivisible cognitive units. They are not just sequential processing stages like floors in a building. They are more like subroutines in a program—discrete reasoning circuits that the model runs through to arrive at conclusions. Duplicate the circuit, and the model effectively reasons twice: same weights, same training, but an additional pass through its own logic.
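Stripped of the file-format plumbing, the core of the trick is index arithmetic over the layer stack. The sketch below is a minimal Python illustration on an abstract list of layers; the function name is my own, and real RYS experiments rewrite model files rather than Python lists.

```python
def duplicate_block(layers, start, end, repeats=2):
    """Return a layer sequence in which layers[start:end] run `repeats` times.

    Nothing is retrained: the same block simply appears more than
    once in the forward pass, sharing its original weights.
    """
    block = layers[start:end]
    return layers[:end] + block * (repeats - 1) + layers[end:]

# A 40-layer model with layers 12-14 duplicated once becomes 43 layers deep.
stack = list(range(40))
doubled = duplicate_block(stack, 12, 15)
```

Because only three of forty layers run a second time, the extra compute stays small, which is why the inference slowdown lands in the single digits rather than doubling.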

If that sounds too simple, that is because it is too simple. And that is exactly why it is interesting.

In science, the most important discoveries are often the ones that make you feel slightly embarrassed for the entire field. Why did nobody try this before? The answer, in this case, is that people did try it—David Ng published the RYS method first—but on different hardware and with less systematic methodology. What Alain added was rigor: a structured sweep across layer ranges, stride sizes, and duplication counts, generating a map of how each region of the model contributes to specific cognitive abilities. The result is not just a trick for making one model better. It is a toolkit for understanding what different parts of any model actually do.

To put this in perspective: Anthropic has a team of dozens of researchers working on "mechanistic interpretability," the study of what individual neurons and circuits inside neural networks actually compute. Google DeepMind has a similar team. OpenAI has published foundational work in the same area. These are well-funded, multi-year research programs involving some of the best minds in AI. Alain, working alone on consumer hardware in one evening, used a different approach—functional rather than mechanistic, macro rather than micro—and arrived at a complementary set of insights. He did not figure out what individual neurons do. He figured out what groups of layers do, by the simple expedient of running them twice and seeing what improves.

This is the kind of result that makes the open-source AI community so valuable. Not because it competes with corporate research labs on their own terms, but because it asks different questions using different methods and sometimes gets answers that the labs missed.


Chapter Two

The Week That Machines Grew Up

The week of March 17, 2026, was one of those weeks that, in hindsight, people will mark as a turning point. Not because of any single breakthrough—technology rarely works that way—but because five separate stories, from five separate corners of the tech world, all pointed at the same thing: the infrastructure for autonomous machine intelligence is no longer speculative. It is being built, deployed, and used right now, by people who are not waiting for permission.

Alain's layer duplication experiment was one. Then there was Google DeepMind publishing a cognitive framework for measuring progress toward AGI—not the hype-laden kind of AGI that gets tweeted about, but a serious, psychology-grounded attempt to define what it would actually mean for a machine to think at human level. NVIDIA released NemoClaw, an open-source framework for running AI agents inside secure sandboxes—the digital equivalent of giving a powerful creature a safe room to operate in. A developer named Thijs Verreck launched Tmux-IDE, a tool that treats Claude and other AI agents as literal team members in a terminal workspace. And perhaps most consequentially, Stripe and a startup called Tempo unveiled the Machine Payments Protocol—an open standard that gives AI agents the ability to spend money autonomously.

Each of these stories is interesting on its own. Together, they describe something much larger: the emergence of a complete stack for machine autonomy. Understanding at the bottom (how models reason), measurement in the middle (how we know what they can do), infrastructure for action (how they operate safely), collaborative tools (how they work alongside us), and economic agency at the top (how they transact). This is not a roadmap; it is a construction site. The cranes are already up.

To understand why this matters, consider a historical analogy. In 1994, the World Wide Web existed as a set of hyperlinked documents, readable through Mosaic and Netscape. It was interesting, even exciting, but it was not clear why it would transform the economy. Then, over the course of about two years, a series of developments made the implications obvious: SSL encryption made secure transactions possible. Cookie-based sessions made persistent logins possible. Server-side scripting (CGI, then PHP, then Java servlets) made dynamic content possible. PayPal (and before it, SET and other protocols) made online payments possible. Suddenly, the web was not just a library. It was a platform for commerce, communication, and collaboration.

No single one of those developments was transformative by itself. SSL without dynamic content is just encrypted static pages. Payments without sessions are one-off transactions with no customer relationship. Sessions without payments are login portals with no business model. But together, they formed a stack that enabled entirely new categories of businesses: Amazon, eBay, Google, Facebook. The stack was more important than any individual component, because it was the stack—not any single technology—that made the new business models possible.

The same thing is happening now, but faster and with higher stakes. The five stories from this week are not isolated breakthroughs. They are layers in a new stack, and the businesses that will be built on top of this stack are not yet visible. What is visible is the stack itself, and that is enough to start building.


Chapter Three

The Neuroscience of Artificial Minds

To understand why duplicating three layers can triple a model's reasoning ability, you need to let go of a metaphor that has dominated AI discourse for years: the idea that neural networks are black boxes. They are not. Or at least, they are becoming less opaque every month, thanks to a growing field that might be called computational neuroscience for artificial minds.

The key insight behind the RYS method is structural. A transformer model—the architecture behind GPT, Claude, Gemini, and every other large language model you have heard of—processes text by passing it through a sequence of layers. Each layer applies attention (what should I focus on?) and transformation (what should I do with what I focused on?). In a 40-layer model like Devstral-Small-2-24B, the text passes through all 40 layers in order, emerging at the end as a prediction of what comes next.

For years, the assumption was that these layers work like a pipeline: each one does a little bit of processing, and the information gradually transforms from raw input to refined output. This is true in a broad sense, but it misses something crucial. Alain's experiments, building on Ng's original work, suggest that the layers are not uniformly important. Specific contiguous blocks of layers—as few as three in a row—form discrete reasoning circuits. These circuits perform something like complete cognitive operations: taking a partially processed thought and running it through a full cycle of logical reasoning.

The methodology Alain used was elegantly systematic. A three-phase sweep: first, test large blocks of eight layers with wide strides to identify "hot zones" where duplication helps. Then, narrow down to blocks of three to five layers with a stride of one to find exact boundaries. Finally, try multi-pass, interleaved, and compound configurations to squeeze out every drop of improvement. All of this running on consumer hardware, with modified GGUF files written to tmpfs for testing speed, using llama.cpp with Vulkan acceleration. The whole experiment completed in one evening.

The results were not just about magnitude but about specificity. Different duplication patterns yielded distinct cognitive profiles from identical model weights. Duplicate layers 13 through 16 twice and you get a math specialist. Triple the same layers and the model becomes an emotional intelligence specialist. Use an interleaved pattern and you get a pure math mode where mathematical ability skyrockets but emotional reasoning drops. Same weights, same training data, same model—but different cognitive configurations based solely on which circuits you repeat and how many times.

This is not how we expected artificial intelligence to work. It is, however, remarkably similar to how biological brains work.

Consider the numbers on Qwen2.5-Coder-32B, a 64-layer model designed for code generation. Alain found the optimal circuit at layers 7 through 9—much earlier in the model than Devstral's layers 12 through 14. Duplicating this early block once pushed the reasoning composite score from 76.5 percent to 94.1 percent, a 23 percent improvement. Even emotional intelligence, a dimension you might not expect to be affected by architectural surgery, went from 92.1 to 93.6. The model became not just smarter but, in some measurable sense, more emotionally aware. From copying three layers.

The "modes" discovery is perhaps the most provocative finding, and it rewards spelling out. A double pass through layers 13 through 16 of Devstral created what Alain calls a "math specialist": a version of the model that excelled at mathematical reasoning above all else. A triple pass through the same layers, paradoxically, produced an "emotional intelligence specialist" that scored higher on EQ benchmarks. An interleaved pattern, repeating the layers in a non-sequential order, created a "pure math mode" in which mathematical ability spiked dramatically while emotional reasoning actually declined.

Think about what this means. The same neural network, with the exact same trained weights, can exhibit fundamentally different cognitive profiles depending purely on how its architecture is arranged at inference time. This is not like adjusting the temperature parameter or the system prompt. This is rearranging the circuitry of the mind itself, and getting different minds as a result. It is as if you could take a single human brain and, by rerouting some neural pathways, turn a poet into a mathematician or an empath into a logician—reversibly, at will, in minutes.

If this sounds like science fiction, remember that the entire experiment ran on hardware you can buy at your local computer store, using open-source software, in a single evening. The democratization is the story as much as the discovery.

A Detour Through Neuroscience

In the human brain, different cognitive abilities are not distributed uniformly across the cortex. They are localized in specific regions, and those regions are organized in circuits—loops of neural activity that run through multiple brain areas in sequence. Working memory, for instance, involves a circuit that loops between the prefrontal cortex and the basal ganglia. Emotional processing runs through the amygdala-prefrontal circuit. Spatial reasoning has its own hippocampal-parietal loop.

When neuroscientists study these circuits, they find something that sounds eerily familiar: the same circuit can produce different cognitive modes depending on how many times the signal loops through it and what modulatory signals are present. A single loop through the prefrontal-basal ganglia circuit might give you a quick, intuitive answer. Multiple loops give you deliberate, effortful reasoning. This is the biological basis of what Daniel Kahneman called System 1 (fast) and System 2 (slow) thinking.

Alain's layer duplication is, in a rough but meaningful sense, giving the model a second or third pass through its own System 2. The model's "reasoning circuit"—layers 12 through 14 in Devstral, layers 7 through 9 in Qwen2.5-Coder—runs once in the base model. Duplicate those layers and it runs twice: same operation, applied again to its own output, like checking your work on an exam. The improvement in logical deduction (from 0.22 to 0.76) is not the model becoming smarter. It is the model being allowed to think longer.

This has profound implications. It means that at least some of the "intelligence" we attribute to larger models is not about having more knowledge or better representations. It is about having more computational depth—more passes through reasoning circuits. A 24-billion-parameter model with three duplicated layers can, in some tasks, match or exceed models with far more parameters. The bottleneck was never knowledge; it was processing time.

There is a historical parallel worth drawing. In the early days of computing, the prevailing belief was that faster processors would make programs run better. Then, in the 1970s and 1980s, researchers discovered that cache architecture—the arrangement of small, fast memory close to the processor—mattered more than raw clock speed for most real-world workloads. A slower processor with a well-designed cache could dramatically outperform a faster processor with a poor one. The bottleneck was not computation; it was memory access patterns.

Something analogous may be happening with language models. We have spent years scaling models by adding parameters—more knowledge, more capacity, more "brain." Alain's experiments suggest that for reasoning tasks, the bottleneck might not be how much the model knows but how many times it can loop through its own reasoning process. If that is the case, then the current arms race to build ever-larger models may be partially misguided. A smaller model with the right architectural modifications could match or exceed a larger one, at a fraction of the compute cost.

This does not invalidate the scaling hypothesis. Larger models genuinely know more and can handle more complex inputs. But it does suggest that the relationship between model size and capability is not linear, and that architectural innovations at inference time—requiring zero additional training—could be a powerful and underexplored lever for improving AI performance.


Chapter Four

How Do You Measure a Mind?

The same week that Alain was duplicating layers, Google DeepMind published a paper that attacks a different but deeply related problem: if machines can reason, how do we know how well they reason? And how do we compare different machines' reasoning abilities in a way that is not self-serving corporate marketing?

The paper, "Measuring Progress Toward AGI: A Cognitive Framework," is DeepMind's most serious attempt yet to bring scientific rigor to a question that has been dominated by vibes and benchmarks that everyone games. The framework identifies ten core cognitive abilities that, together, constitute general intelligence: perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem-solving, and social cognition.

This is not a random list. These ten abilities come directly from cognitive psychology and neuroscience—they are the same dimensions that psychologists use to assess human cognition. What DeepMind has done is take decades of research on human cognitive architecture and apply it as a measurement framework for artificial systems.

The evaluation protocol has three stages. First, evaluate AI systems across a broad suite of cognitive tasks covering each of the ten abilities. Second, collect human baselines for the same tasks from a demographically representative sample of adults. Third—and this is the critical innovation—map each AI system's performance relative to the distribution of human performance in each ability. Not against a single "average human" benchmark, but against the full range of human capability.

This is important because it moves the conversation from binary (is this AGI or not?) to continuous (how does this system's cognitive profile compare to the human population?). A model might be at the 95th percentile for reasoning but the 30th percentile for social cognition. That is not AGI; it is a very specific cognitive profile, just like humans have specific cognitive profiles.
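Mapping a score onto the human distribution is a standard percentile-rank computation. A minimal sketch, assuming raw human scores are available for each task; the function name and the midpoint handling of ties are my choices, not DeepMind's protocol code.

```python
def percentile_rank(human_scores, ai_score):
    """Percentile of ai_score within the human distribution (0-100).

    Ties count for half, the usual midpoint convention.
    """
    below = sum(1 for h in human_scores if h < ai_score)
    equal = sum(1 for h in human_scores if h == ai_score)
    return 100.0 * (below + 0.5 * equal) / len(human_scores)

# A cognitive profile is this computation repeated per ability.
humans = {"reasoning": [40, 55, 60, 70, 85], "social": [50, 65, 75, 80, 90]}
profile = {ability: percentile_rank(scores, 72)
           for ability, scores in humans.items()}
```

Note that the same raw score of 72 lands at different percentiles in the two abilities, which is exactly the point: capability is relative to where humans cluster on each dimension, not to a single threshold.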

DeepMind also builds on their earlier "Levels of AGI" framework from 2023, which defined five performance tiers: emerging, competent, expert, virtuoso, and superhuman. The cognitive framework adds dimensionality to these levels. A system might be "expert" at reasoning but only "emerging" at metacognition. This matters because it shows us where the gaps are—and right now, the gaps are illuminating.

The five abilities where current AI systems lag most behind humans are precisely the ones you might expect: learning (acquiring new knowledge efficiently from few examples), metacognition (knowing what you know and what you do not), attention (selectively focusing on relevant information in complex environments), executive functions (planning, inhibiting impulses, switching between tasks), and social cognition (understanding other minds). To address this, DeepMind launched a Kaggle hackathon with a $200,000 prize pool, inviting the community to design evaluations for exactly these five abilities.

Notice something? The abilities where AI lags are not the "hard" ones in the traditional computer science sense. They are not mathematical reasoning or code generation or language translation. Those, AI already does at expert to virtuoso levels. The hard problems are the ones that a three-year-old handles effortlessly: knowing when you do not understand something, paying attention to the right thing in a noisy room, understanding that other people have different beliefs than you do.

The Moravec Paradox, Redux

In 1988, roboticist Hans Moravec observed that the things humans find easy—walking, recognizing faces, understanding sarcasm—are computationally hard, while the things we find hard—chess, calculus, formal logic—are computationally easy. This became known as Moravec's Paradox, and for decades it applied mainly to robotics and computer vision.

DeepMind's cognitive framework reveals that Moravec's Paradox has a second chapter. Even in the domain of pure cognition—no bodies, no physical world, just thinking—the easy things are still hard. Metacognition, the ability to think about your own thinking, is something humans do constantly and unconsciously. Every time you say "I'm not sure about this" or "let me think about that differently," you are exercising metacognition. Current AI systems are strikingly bad at this. They will confidently generate wrong answers, hallucinate references, and fail to recognize the boundaries of their own knowledge.

This is where Alain's work and DeepMind's framework connect in an unexpected way. What layer duplication does, at a functional level, is give the model something like a metacognitive loop—a second pass through its own reasoning, which allows it to catch errors and refine conclusions. It is crude metacognition, achieved through architectural repetition rather than genuine self-awareness. But it works. And the fact that it works tells us something important: even rough approximations of higher cognitive abilities can produce dramatic improvements in performance.

The connection goes deeper than analogy. DeepMind's framework identifies ten cognitive abilities, but current benchmarks cover only some of them well. Reasoning and generation are well-benchmarked because they are easy to test: give the model a logic problem, see if it gets the right answer. Metacognition, attention, and social cognition are poorly benchmarked because they are hard to test: how do you measure whether a model knows that it does not know something?

This is why DeepMind launched the Kaggle hackathon with its $200,000 prize pool. They are effectively admitting that they do not know how to build good evaluations for the hardest cognitive abilities, and they are crowdsourcing the solution. The five target abilities—learning, metacognition, attention, executive functions, and social cognition—are not just the ones where AI lags behind humans. They are the ones where we do not even have good instruments to measure how far behind AI is.

This is an extraordinarily honest position for one of the world's leading AI labs to take. In an industry where every company claims their latest model is "the most capable ever" based on benchmark suites that everyone knows are gamed, DeepMind is saying: we do not have the right benchmarks, the ones we have are measuring the wrong things, and we need help building better ones. The fact that they are grounding their framework in cognitive psychology rather than in computer science benchmarks is a deliberate choice. Psychologists have been measuring human cognition for over a century. Computer scientists have been measuring machine cognition for about five years. One of these fields has more mature instruments.

The practical implication is that the cognitive profiles of AI systems are about to become much more detailed and much harder to fake. When evaluation covers ten dimensions rather than three, and when those dimensions are measured against the full distribution of human performance rather than against a single threshold, it becomes much harder to game your way to impressive-looking numbers. A model that excels at the benchmarks that happen to be in the current evaluation suite but fails at metacognition and social cognition will no longer be able to hide behind cherry-picked results. The cognitive framework, if adopted widely, would do for AI evaluation what unit testing did for software quality: not make bad products impossible, but make them much harder to ship without anyone noticing.


Chapter Five

Cages for Digital Creatures

Understanding how AI reasons and measuring what it can do are necessary but not sufficient. The next piece of the puzzle is infrastructure: if we are going to let these systems act in the world, we need safe environments for them to operate in. This is where NVIDIA's NemoClaw enters the picture.

NemoClaw is an open-source stack—currently in alpha—for running what NVIDIA calls "always-on assistants" safely. The name is a concatenation of Nemo (NVIDIA's AI platform) and Claw (the grab-and-act metaphor for agents that do things). At its core, NemoClaw provides secure sandboxed environments where AI agents can operate with real tools—browsing the web, reading files, executing code—while being constrained by declarative security policies that prevent them from doing anything you have not explicitly allowed.

The architecture has four components. A TypeScript CLI plugin that manages the lifecycle of agent sandboxes. A Python blueprint system that defines what the agent can and cannot do as versioned, reproducible artifacts. An OpenShell container that provides the actual isolated runtime. And an inference gateway that routes all of the agent's API calls through a controlled pipeline, preventing it from reaching any model endpoint you have not approved.

The security model is layered in a way that will be familiar to anyone who has worked with container security, but with additions specific to AI agents. The network layer blocks unauthorized outbound connections and is hot-reloadable, meaning you can change policies while the agent is running. The filesystem layer restricts the agent to /sandbox and /tmp. The process layer prevents privilege escalation. And the inference layer—this is the novel one—intercepts all model API calls and reroutes them through controlled providers, preventing the agent from secretly talking to a different model or exfiltrating data through inference requests.

When an agent tries to connect to a host that is not on its allowlist, NemoClaw does not just block the request. It surfaces it in a terminal UI for the human operator to review and approve or deny. This is the "human in the loop" pattern implemented at the infrastructure level, rather than at the application level where it is fragile and easily bypassed.
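The pattern is simple to express: every outbound request either matches a declarative allowlist or gets escalated to a human. The sketch below illustrates the idea in a few lines; the policy shape and the function names are hypothetical, not NemoClaw's actual blueprint format.

```python
import fnmatch

# Hypothetical blueprint fragment: hosts the agent may reach unreviewed.
POLICY = {"allowed_hosts": ["api.github.com", "*.pypi.org"]}

def check_outbound(host, policy):
    """Return 'allow' for allowlisted hosts, 'review' to escalate to a human."""
    for pattern in policy["allowed_hosts"]:
        if fnmatch.fnmatch(host, pattern):
            return "allow"
    return "review"  # the point where a terminal UI would prompt the operator
```

Because the policy is plain data rather than code, hot-reloading it while the agent runs amounts to re-reading a file, which is presumably how the hot-reloadable network layer works in practice.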

If this sounds like a lot of paranoia for software that is, after all, just predicting the next token—well, that is exactly the point. The teams building these containment systems are the ones who understand best what these systems can do. NVIDIA did not build NemoClaw because they are afraid of AI. They built it because they know their customers want to deploy agents that take real actions—sending emails, writing code, accessing databases—and doing that without a security framework is the equivalent of running a web server as root with no firewall.

The 8,300 stars on GitHub in the first weeks of release suggest that the demand is real.

The Container Security Lesson

NemoClaw's design philosophy mirrors a lesson that the software industry learned the hard way with containers. When Docker launched in 2013, the initial enthusiasm was all about convenience: package your application with its dependencies and run it anywhere. Security was an afterthought. The early Docker ecosystem was rife with containers running as root, images with embedded secrets, and exposed Docker sockets that gave any container full control over the host machine.

It took years of high-profile breaches and a gradual maturation of the ecosystem (seccomp profiles, user namespaces, rootless containers, image signing) before container security became a first-class concern. By then, enormous technical debt had accumulated. Organizations that had deployed thousands of insecure containers had to retrofit security onto systems that were never designed for it.

NemoClaw represents an attempt to avoid repeating this mistake with AI agents. By building security into the agent runtime from the start—rather than bolting it on after deployment—it establishes a baseline that the industry can adopt before the inevitable wave of "just get the agent working, we'll add security later" deployments creates the same kind of technical debt that plagued the container ecosystem.

The inference gateway is the component that makes this genuinely novel rather than just "containers for AI." Traditional container security does not need to worry about what model the application is talking to or whether it might be exfiltrating data through API calls disguised as inference requests. Agent security does. NemoClaw's approach—intercepting all model API calls and routing them through a controlled gateway—addresses a threat vector that is unique to AI systems and that no existing container security tool handles.

Whether the industry will adopt this level of rigor is another question. The history of software security suggests that most organizations will only take agent security seriously after the first major agent-related breach. NemoClaw's value may ultimately be less about preventing that breach than about providing a proven solution that everyone can adopt in its aftermath.


Chapter Six

Agents as Colleagues

While NVIDIA was building cages, Thijs Verreck was building something more intimate: a development environment that treats AI agents not as tools but as team members.

Tmux-IDE is, on the surface, a simple concept. It is a command-line tool that configures tmux terminal layouts optimized for multi-agent development. One pane for a "lead" Claude instance, multiple panes for "teammate" Claude instances, and additional panes for dev servers and other tools. The layout is defined in a YAML configuration file, and changes update the environment in real time.

But the simplicity is deceptive. What Tmux-IDE actually implements is a model of AI collaboration that is fundamentally different from the "assistant" paradigm that has dominated since ChatGPT launched in late 2022. In the assistant paradigm, you have one human and one AI. The human gives instructions; the AI executes them. It is a command-and-response model, and even the most sophisticated implementations (Claude Code, GitHub Copilot, Cursor) follow this basic pattern.

Tmux-IDE proposes something different: a team model. The lead Claude instance does not just respond to human instructions. It coordinates other Claude instances, delegating tasks, tracking progress through shared task lists and messages, and dynamically reorganizing the team based on what needs to happen. The human developer is a participant in the team, not the sole source of direction. You can tell the lead Claude "we need to refactor the authentication module and add tests for the new API endpoints" and it will recruit teammate instances, assign specific tasks, and manage the parallel workflow.

This is not science fiction; it is running on people's machines right now. The tool auto-detects project frameworks (Next.js, Vite, Python, Go) and configures itself accordingly. It registers as a Claude Code skill, so Claude can set up its own workspace. It is reproducible across machines through YAML configuration. And because everything runs in tmux, you can SSH into your development machine, start a multi-agent session, disconnect, and come back later to find the team still working.

The philosophical shift here is significant. When we moved from single-threaded to multi-threaded programming, we had to rethink how we wrote software. When we moved from single-server to distributed systems, we had to rethink how we built infrastructure. Tmux-IDE suggests we are at the beginning of a similar transition in how we think about software development itself: from a single developer with AI assistance to a development team where some members happen to be artificial.

The Coordination Problem

Multi-agent development introduces a class of problems that single-agent workflows never had to deal with: coordination. When two Claude instances are working on the same codebase, how do they avoid conflicting changes? When the lead agent delegates a task to a teammate, how does it verify that the task was completed correctly? When three agents are working in parallel, how do they share context without overwhelming each other with information?

Tmux-IDE's approach is deliberately low-tech: shared task lists and plain-text messages. The agents communicate through files, not through API calls or message queues. This is crude but robust. It means coordination does not depend on any external service, it works over SSH, and it degrades gracefully when one agent is slower or gets stuck. It is the tmux philosophy applied to AI collaboration: simple, composable, text-based, and resilient.

But the crudeness also reveals the frontier. As multi-agent development matures, we will need better primitives for agent coordination. How do you handle merge conflicts when two agents edit the same file simultaneously? How do you implement code review between agents? How do you handle the equivalent of a "standup meeting" where agents share status and surface blockers? These are not hypothetical questions. Anyone who has used Tmux-IDE for more than a few hours has encountered them.

The interesting thing is that these are not new problems. They are the same problems that human development teams have been solving for decades, with tools like Git, Jira, Slack, and code review systems. The question is whether the same solutions work when some team members are artificial, or whether the differences in how AI agents work—faster but less creative, tireless but less adaptable, precise but less intuitive—require fundamentally different coordination patterns.

The answer is almost certainly "both." Some existing patterns will transfer directly. Others will need to be reinvented. And some entirely new patterns will emerge that have no human equivalent, because human teams never had the option of spinning up three extra developers who share the same knowledge base and can work in perfect parallel.


Chapter Seven

When Machines Get Wallets

Every piece of infrastructure described so far—reasoning circuits, cognitive measurement, secure sandboxes, collaborative development environments—leads inevitably to a question that makes economists both excited and nervous: what happens when machines can spend money?

On March 18, 2026, the same week as everything else we have been discussing, Stripe and a startup called Tempo launched the Machine Payments Protocol, or MPP. Tempo's blockchain went live on mainnet the same day, after three and a half months of testing. The protocol was co-authored by Tempo's cofounder Matt Huang (who is also managing partner at Paradigm, the crypto investment firm) and Stripe, which is arguably the most important payments company in the world. Visa immediately signed on to support MPP by enabling card-based payments through the protocol.

Stripe. Visa. A $5 billion blockchain startup. All of them building infrastructure for a world where AI agents autonomously conduct financial transactions. This is not a crypto speculation play or a Silicon Valley thought experiment. This is the actual payments industry preparing for a future they believe is imminent.

The problem MPP solves is mundane but critical. Right now, when an AI agent needs to pay for something—a headless browser session, an API call, cloud computing time—it has to simulate being a human. Navigate to a pricing page. Select a plan. Enter credit card details. Handle authentication flows designed for people with eyeballs and fingers. This is absurd and fragile and does not scale.

MPP replaces this with a four-step protocol. The agent requests a resource from a service. The service responds with a payment request. The agent authorizes the payment. The resource is delivered. That is it. No account creation, no pricing page navigation, no subscription tier selection. Machine-native commerce.
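The four steps can be made concrete with a small sketch. Everything here is invented for illustration (the class names, the `paid:` token, the pricing) and is not the real MPP wire format; the point is the shape of the exchange: request, quote, authorize, deliver.

```python
from dataclasses import dataclass

@dataclass
class PaymentRequest:
    resource: str
    amount_usd: float

class Service:
    PRICE = {"browser-session": 0.05}  # hypothetical pricing

    def request(self, resource: str) -> PaymentRequest:
        # Steps 1-2: agent asks for a resource, service answers with a quote.
        return PaymentRequest(resource, self.PRICE[resource])

    def deliver(self, req: PaymentRequest, authorization: str) -> str:
        # Step 4: resource handed over once payment is authorized.
        assert authorization == f"paid:{req.amount_usd}"
        return f"<handle to {req.resource}>"

class Agent:
    def __init__(self, budget_usd: float):
        self.budget = budget_usd

    def authorize(self, req: PaymentRequest) -> str:
        # Step 3: agent checks its own policy (here, a simple budget) and pays.
        if req.amount_usd > self.budget:
            raise RuntimeError("over budget")
        self.budget -= req.amount_usd
        return f"paid:{req.amount_usd}"

service, agent = Service(), Agent(budget_usd=1.00)
quote = service.request("browser-session")                  # steps 1-2
resource = service.deliver(quote, agent.authorize(quote))   # steps 3-4
print(resource, round(agent.budget, 2))
```

Note what is absent: no account, no login, no pricing page. The quote and the authorization are the entire negotiation.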

The technical implementation is elegant. Payments can be made in stablecoins (via Tempo's blockchain) or in fiat currency through traditional cards and buy-now-pay-later methods. A key innovation is the "sessions" primitive: an agent can authorize a spending limit upfront and then stream micropayments continuously without requiring an on-chain transaction for each interaction. This enables use cases that would be economically impractical with traditional payment rails—paying a fraction of a cent per API call, per minute of compute time, per page of content accessed.
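A minimal sketch of the sessions idea, with invented names (MPP's real session API is not reproduced here): pre-authorize a cap, stream tiny charges as local accounting, settle the total once.

```python
class PaymentSession:
    """Toy model of a pre-authorized spending session: many micro-charges,
    one settlement. Illustrative only."""

    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, amount_usd: float) -> bool:
        # Each API call or minute of compute streams a tiny charge.
        # No on-chain transaction here -- just local accounting.
        if self.spent + amount_usd > self.limit:
            return False  # cap reached; agent must re-authorize
        self.spent += amount_usd
        return True

    def settle(self) -> float:
        # One settlement transaction covers all accumulated micro-charges.
        total, self.spent = self.spent, 0.0
        return total

session = PaymentSession(limit_usd=0.10)
accepted = sum(session.charge(0.003) for _ in range(50))
print(accepted)                    # 33 calls fit under the $0.10 cap
print(round(session.settle(), 3))  # 0.099 settled in one transaction
```

This is why fractions-of-a-cent pricing becomes viable: the expensive step (settlement) is amortized over thousands of cheap ones.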

The businesses already using MPP give a vivid picture of what the agent economy looks like in practice. Browserbase lets agents pay per headless browser session—an agent that needs to scrape a website can spin up a browser, use it, and pay for exactly what it consumed. PostalForm lets agents pay to print and send physical mail—an AI assistant that needs to mail a legal document can do so without human intervention. And in perhaps the most entertaining use case, Prospect Butcher Co. lets agents order sandwiches for human pickup or delivery anywhere in New York City. Your AI assistant can now literally buy you lunch.

For Stripe, this is a natural extension of their existing infrastructure. Payments from agents show up in the same dashboard as payments from humans. Tax calculation, fraud protection, reporting, accounting integrations, and refunds all work identically. From Stripe's perspective, a dollar spent by a machine is the same as a dollar spent by a person. This is significant because it means businesses do not need separate infrastructure for agent commerce—they can accept agent payments with a few lines of code added to their existing Stripe integration.

The Numbers That Matter

The economic projections for the agent economy are staggering. The World Economic Forum estimates that AI agents could be worth $236 billion by 2034. Industry analysts predict that agentic AI will represent 10 to 15 percent of IT spending by the end of 2026, and that 33 percent of enterprise software applications will include agentic AI by 2028. Stripe's backing of MPP, combined with Visa's participation, represents a bet that a significant fraction of all digital commerce will eventually be agent-initiated.

Tempo's $500 million funding round at a $5 billion valuation (led by Paradigm and Thrive Capital) tells you what the smart money thinks about the size of this opportunity. For context, Stripe itself was valued at $50 billion at the time it invested in Tempo. When the payments infrastructure company invests in the payments protocol company, they are not making a speculative bet on a possible future. They are building the plumbing for a future they see clearly.

The Micropayment Revolution That Actually Works

For as long as the internet has existed, people have dreamed of micropayments. The idea is simple: instead of subscription models or advertising, content and services could be paid for in tiny increments—a tenth of a cent for a page view, a cent for an API call, a nickel for a song. Every few years, a startup tries to make this happen, and every few years it fails. The transaction costs of moving small amounts of money through the traditional financial system eat up the entire payment, making the economics unworkable.

MPP might actually solve this, not because the technology is fundamentally different but because the customer is. Micropayments for humans failed because humans do not want to make decisions about whether to spend $0.001 on every page they read. The cognitive overhead of constant micro-decisions is worse than the financial cost. Humans prefer flat-rate subscriptions precisely because they eliminate the need to think about cost on every interaction.

Machines do not have this problem. An AI agent has no cognitive overhead. It can evaluate whether a $0.003 API call is worth making as easily as it can evaluate a $3,000 purchase. The "sessions" primitive in MPP—where an agent pre-authorizes a spending limit and then streams micropayments continuously—is a perfect fit for how agents actually operate: making thousands of small requests rather than a few large ones.

This means that an entire category of business models that never worked for human customers might work perfectly for agent customers. Pay-per-query search. Pay-per-page content access. Pay-per-minute compute. Pay-per-call API access without subscription tiers. The agent economy could be the native habitat for micropayments, decades after humans proved to be a poor fit.

Consider the implications for software as a service. Today, most SaaS businesses charge monthly subscriptions because that is the only payment model that works reliably at scale for human customers. A CRM charges $25 per user per month, regardless of whether the user logs in three times or three hundred times. This is a compromise: the business sacrifices per-use pricing precision for the predictability of recurring revenue, and the customer sacrifices per-use efficiency for the simplicity of a flat rate.

In an agent economy, this compromise dissolves. An AI agent using a CRM does not need pricing simplicity. It can evaluate the cost-benefit of every API call and pay accordingly. This means SaaS businesses could offer per-call pricing alongside their subscription tiers, capturing revenue from agent usage that currently either does not happen or gets lumped into existing subscriptions where it is invisible and undermonetized.

The flip side is that agents are ruthlessly efficient shoppers. A human might stay on a more expensive plan out of inertia or because switching costs feel high. An agent will switch to the cheapest provider for every individual request, unless constrained by policy. This creates competitive pressure that could be deflationary for service pricing—great for consumers, challenging for businesses that rely on switching costs and customer inertia for their margins. Whether the net effect is positive or negative for the economy is an open question. What is clear is that MPP and protocols like it are going to change the pricing structure of digital services fundamentally.
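"Cheapest provider, unless constrained by policy" is a one-liner in practice. The provider records below are hypothetical; the selection rule is the point: filter by policy first (approved vendors, latency budget), then minimize cost over what remains.

```python
# Hypothetical provider catalog -- names and numbers are invented.
providers = [
    {"name": "fastllm",  "usd_per_call": 0.004,  "p95_ms": 120, "approved": True},
    {"name": "cheapllm", "usd_per_call": 0.001,  "p95_ms": 900, "approved": True},
    {"name": "shadyllm", "usd_per_call": 0.0005, "p95_ms": 80,  "approved": False},
]

def pick_provider(providers, max_p95_ms: int):
    """Cheapest provider that satisfies policy: approved vendors only,
    p95 latency under the caller's budget."""
    eligible = [
        p for p in providers
        if p["approved"] and p["p95_ms"] <= max_p95_ms
    ]
    return min(eligible, key=lambda p: p["usd_per_call"], default=None)

print(pick_provider(providers, max_p95_ms=1000)["name"])  # cheapllm
print(pick_provider(providers, max_p95_ms=200)["name"])   # fastllm
```

An agent re-runs this on every request, which is exactly the ruthlessness no human shopper can sustain.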

The crypto angle is worth noting but not overemphasizing. Tempo's blockchain is optimized for high-speed stablecoin transactions, which makes sense for the throughput and cost characteristics of agent payments. But MPP also supports fiat through Visa and traditional card networks. Matt Huang, Tempo's cofounder, described the protocol design philosophy as "the most elegant, minimal, efficient protocol that anyone can extend without our permission." This is the approach that wins: not forcing a particular payment method but providing a protocol that works across all of them. The agent does not care whether it is paying in USDC or in dollars debited from a Visa card. It cares that the payment is fast, cheap, and programmatic.


Chapter Eight

The Cambrian Explosion Analogy (And Why It Is Both Right and Wrong)

It is tempting to reach for the Cambrian Explosion as a metaphor for what is happening in AI right now. About 540 million years ago, in the span of roughly 20 million years—an evolutionary eyeblink—most of the major animal body plans that exist today appeared. Before the Cambrian, life was simple: single-celled organisms, basic multicellular creatures, soft-bodied things that left faint impressions in mud. After the Cambrian, the world had eyes, legs, shells, claws, brains. The complexity of life increased by orders of magnitude in a geological instant.

The standard explanation for why this happened involves two factors. First, oxygen levels in the atmosphere rose enough to support metabolically expensive structures like brains and muscles. Second, once some organisms evolved eyes, predation became possible, which created intense selection pressure for defensive structures, mobility, and counter-intelligence. An evolutionary arms race, powered by newly available energy, compressed billions of years of prior gradual change into an explosive burst of innovation.


The analogy to AI is that we are in a similar moment. The "oxygen" is compute: the availability of powerful GPUs and the scaling laws that tell us bigger models trained on more data produce better results. The "eyes" are the transformer architecture: a general-purpose structure that can perceive and generate across modalities. And the "Cambrian explosion" is what we are seeing now—a burst of diverse forms: reasoning circuits, cognitive frameworks, agent sandboxes, payment protocols, team development environments, all appearing within the same week.

But here is where the metaphor breaks down, and the breakdown is instructive. In the Cambrian Explosion, the organisms had no idea what was happening. They were not aware that they were participating in a phase transition. They could not measure their own capabilities, build containment structures for more dangerous organisms, or create economic systems for inter-species trade.

We can. And we are.

DeepMind's cognitive framework is an attempt to measure the explosion as it happens. NemoClaw is an attempt to contain it. MPP is an attempt to channel it into economic structures. Tmux-IDE is an attempt to make it collaborative rather than adversarial. And Alain's layer duplication experiments, humble as they are, represent something that no organism in the Cambrian ever had: the ability to modify the architecture of intelligence itself, deliberately and reversibly, on consumer hardware, in an evening.

We are not just witnessing an explosion of AI capabilities. We are engineering the explosion while simultaneously building the measurement tools, the containment structures, and the economic rails. This is unprecedented. There is no biological analogy that captures it adequately, because biology never had a species that could modify its own cognitive architecture while watching itself do it.

If there is a useful analogy, it might be from a completely different domain: the history of nuclear energy. In the 1940s, physicists simultaneously discovered how to split the atom (understanding), developed instruments to measure radiation (measurement), built containment structures for reactors (safety), created organizational frameworks for collaborative research (Manhattan Project), and established the economic infrastructure for nuclear power (the Atomic Energy Commission). The entire stack, from fundamental physics to commercial deployment, was assembled in less than a decade.

The AI autonomy stack is assembling faster. And like nuclear energy, the same infrastructure that enables extraordinary productivity also enables extraordinary risk. The measurement frameworks, the containment systems, and the economic rails are not luxuries or afterthoughts. They are the difference between harnessing the explosion and being consumed by it. The fact that they are being built concurrently with the capabilities themselves—rather than decades after, as with nuclear weapons and nuclear safety regulations—is perhaps the most encouraging thing about this particular moment in technology.

Perhaps. Or perhaps the speed is the problem. Nuclear energy took a decade. The AI autonomy stack is taking months. Whether that leaves enough time to get the containment right is a question that this week's developments pose but cannot answer.


Chapter Nine

The Borges Problem: Intelligence Without Understanding

There is a Jorge Luis Borges short story called "Funes the Memorious," about a man who, after a head injury, acquires a perfect memory. He remembers everything: every leaf on every tree, every word ever spoken to him, every sensation he has ever experienced. At first, this seems like a superpower. But Borges slowly reveals that Funes is trapped. He cannot generalize. He cannot abstract. He cannot see the forest because he remembers every individual tree in excruciating detail. Perfect memory, zero understanding.

This is, in a surprisingly literal sense, the situation with current large language models. They have been trained on essentially all of the text humanity has produced. They can recall, recombine, and generate language with astonishing facility. But do they understand what they are saying? The DeepMind framework gives us a way to ask this question precisely, rather than philosophically. And the answer appears to be: it depends on what you mean by "understand," and the answer is different for each cognitive dimension.

On reasoning—following logical chains, doing mathematics, writing code—current models show expert-level performance. On perception and generation, they are virtuoso-level. But on metacognition, learning efficiency, and social cognition, they are emerging at best. They can reason but they cannot reflect on their reasoning. They can generate but they cannot evaluate whether their generation is appropriate for the audience. They can learn patterns from trillions of tokens but they cannot learn a new concept from three examples the way a child can.

Alain's circuit duplication experiment offers a tantalizing hint that some of these limitations might be architectural rather than fundamental. If giving a model a second pass through its reasoning circuit dramatically improves logical deduction, what would happen if you could give it a metacognitive circuit—a set of layers that evaluates the output of the reasoning layers before it becomes the final answer? We do not know yet. But the tool to find out—systematic layer duplication on consumer hardware—is now in the hands of anyone with a $500 GPU and an evening to spare.

This democratization of AI research is itself a major story. Five years ago, understanding the internal structure of language models was the exclusive domain of teams with access to clusters of A100 GPUs and millions of dollars in compute budgets. Today, a solo developer with consumer AMD cards can run systematic experiments on 24-billion-parameter models and discover cognitive circuits that improve performance by 245 percent. The barriers to entry for understanding (and modifying) AI are collapsing faster than the barriers to entry for building it.

Borges would have found this exquisitely ironic. His Funes could not forget anything but could not understand anything either. Our language models remember everything but struggle to know what they do not know. And now, a solo developer in France has shown that you can make them understand more—not by teaching them new things, but by giving them time to think about what they already know. The cure for perfect memory is not more memory; it is more processing. The cure for Funes would not have been to forget, but to loop—to pass his memories through a second circuit of reflection that could abstract, generalize, and compress.

There is something hopeful in this. The gap between what AI can recall and what AI can understand is not a fixed property of the technology. It is, at least partially, an architectural choice. And architectural choices can be changed—by anyone with a GPU, a text editor, and the willingness to try.


Chapter Ten

What This Means If You Build Things

If you are a developer, an architect, or anyone who builds software for a living, the stories from this week converge on a set of practical implications that are worth taking seriously.

First: the model is no longer a fixed object. Alain's work shows that the performance characteristics of a language model can be significantly altered at inference time, without retraining, by modifying its architecture. This means that the model you deploy does not have to be the model you downloaded. For specific tasks—logical reasoning, math, code generation—you can create specialized configurations from a single base model by duplicating the right layers. This is cheaper, faster, and more reversible than fine-tuning. If you are running local models for any purpose, the RYS method and tools like llm-circuit-finder should be on your radar.
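The structural idea behind RYS is simple enough to show in miniature. The real method operates on the layer tensors of a GGUF model via llama.cpp; this toy stands each "layer" in as a function and only demonstrates the transformation: insert a second copy of a contiguous block so the forward pass runs that circuit twice with the same weights.

```python
# Toy illustration of the RYS transformation -- not the actual GGUF tooling.

def make_layer(i):
    return lambda x: x + [i]  # each layer appends its index to the trace

layers = [make_layer(i) for i in range(8)]  # a "model" with 8 layers

def duplicate_block(layers, start, end):
    """Insert a second copy of layers[start:end] right after the block.
    Same weights, same order -- the circuit just runs twice."""
    return layers[:end] + layers[start:end] + layers[end:]

def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x

print(forward(layers, []))                         # [0, 1, 2, 3, 4, 5, 6, 7]
print(forward(duplicate_block(layers, 3, 6), []))  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
```

The trace makes the key property visible: nothing outside the duplicated block changes, which is why the modification is cheap, reversible, and safe to sweep systematically.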

Second: measurement is maturing. DeepMind's cognitive framework will not be the final word on how to evaluate AI systems, but it represents a significant step toward standardized, psychology-grounded evaluation. If you are choosing between models for a specific application, the framework gives you a vocabulary for what matters. Do you need strong reasoning? Most models have that. Do you need metacognition—the ability to know when the model is uncertain? That is much rarer, and you should evaluate specifically for it. The Kaggle hackathon (submissions open through April 16) is an opportunity to contribute to how these evaluations are built.

Third: agent security is now a first-class concern. NemoClaw establishes a baseline for what "safe agent deployment" looks like: sandboxed execution, network policy enforcement, filesystem restrictions, and inference routing. If you are building or deploying agents that take real-world actions, these are not optional features. They are the minimum viable security posture. The fact that NVIDIA has open-sourced this means you do not need to build it yourself, but you do need to understand and implement it.
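Two of those controls, network policy and filesystem restrictions, reduce to allowlist checks. The policy shape below is invented for illustration (NemoClaw's actual configuration format is not reproduced here); what matters is that every outbound request and every write is checked against an explicit allowlist, with paths resolved first so `..` tricks fail.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical policy -- illustrative shape only.
POLICY = {
    "allowed_hosts": {"api.github.com", "pypi.org"},
    "writable_roots": [Path("/workspace")],
}

def check_network(url: str) -> bool:
    """Allow outbound requests only to explicitly listed hosts."""
    return urlparse(url).hostname in POLICY["allowed_hosts"]

def check_write(path: str) -> bool:
    """Allow writes only under approved roots; resolve() normalizes
    symlinks and '..' so escapes are caught."""
    p = Path(path).resolve()
    return any(p.is_relative_to(root) for root in POLICY["writable_roots"])

print(check_network("https://api.github.com/repos"))  # True
print(check_network("https://evil.example.com/x"))    # False
print(check_write("/workspace/src/main.py"))          # True
print(check_write("/etc/passwd"))                     # False
```

Deny-by-default is the whole design: anything not on a list is refused, which is the only posture that makes sense when the caller is an autonomous agent.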

Fourth: multi-agent development is here. Tmux-IDE is a first-generation tool, but the pattern it establishes—multiple AI agents coordinating on a shared project, with a human developer as one participant among several—is likely to become standard. If your development workflow is still "one developer, one AI assistant," you are leaving productivity on the table. Start experimenting with multi-agent setups, even if the tooling is still rough.

Fifth, and most importantly: agent commerce is real. Stripe and Visa backing MPP is not a signal to build experimental agent-payment prototypes. It is a signal that machine-to-machine payments are going to be a standard part of how software works. If you build APIs, think about what it means to have agent clients alongside human clients. If you build services, consider MPP integration alongside traditional payment flows. The businesses already on MPP—Browserbase, PostalForm, Parallel—are not futurists. They are pragmatists who saw the demand and built for it.

There is a sixth implication, less technical but equally important: the lone genius is over as a competitive model. Every one of these stories is about infrastructure—tools that make other tools possible, protocols that enable other protocols, frameworks that support other frameworks. The value is in the stack, not in the individual component. A developer who understands the full stack—from model internals to cognitive measurement to agent security to multi-agent coordination to machine payments—has a compounding advantage over one who is expert in only one layer. This was true of the web stack in 2005, and it is true of the autonomy stack now.

The practical question is not "should I learn about this?" It is "where in the stack do I start?" If you are running local models, start with layer duplication. If you are deploying agents, start with NemoClaw. If you are building APIs, start with MPP. If you are managing a development team, start with multi-agent workflows. The stack is modular by design; you can enter at any layer and expand from there.


Chapter Eleven

The Stack of Autonomy

Let us step back and look at the full picture. In a single week, we got five pieces that together form what we might call the Autonomy Stack:

Layer 5: Economic Agency — Stripe MPP. Machines can transact.
Layer 4: Collaboration — Tmux-IDE. Machines work in teams.
Layer 3: Safe Operation — NemoClaw. Machines act within constraints.
Layer 2: Measurement — DeepMind Cognitive Framework. We can assess what machines can do.
Layer 1: Understanding — LLM Circuit Finder. We can see how machines think.

This is not an accident. Each layer enables the ones above it. You cannot build safe agent operation without understanding how models reason (because you need to predict what they will do). You cannot build meaningful collaboration without measurement (because you need to know what to delegate). You cannot build economic agency without safe operation (because money requires trust). The stack is coherent, even though the people building each layer are mostly unaware of each other.

The history of computing is, in many ways, a history of stacks. The hardware/OS/application stack of the PC era. The client/server/database stack of the enterprise era. The browser/API/cloud stack of the web era. Each stack took decades to mature and standardize. The autonomy stack is assembling itself in months.

There is a pattern in how stacks mature that is worth understanding, because it tells us what happens next. In the early days of every major technology stack, the layers are tightly coupled and poorly defined. Early websites mixed HTML, CSS, and business logic in the same file. Early distributed systems had no clear separation between compute, storage, and networking. The initial burst of innovation is messy, pragmatic, and driven by people solving immediate problems without worrying about architectural purity.

Then, gradually, interfaces emerge. HTTP became the universal interface between clients and servers. SQL became the universal interface between applications and databases. REST (and later GraphQL) became the universal interface between frontends and backends. These interfaces are what transform a collection of ad-hoc solutions into a genuine stack—a set of interchangeable, composable layers that different teams can build on independently.

The autonomy stack is in the messy, pre-interface phase. Alain's circuit finder works on GGUF files through llama.cpp. NemoClaw works through OpenShell containers. Tmux-IDE works through tmux panes and shared files. MPP works through Stripe's API. These are all different mechanisms, different protocols, different assumptions. There is no universal interface between understanding and measurement, between measurement and safe operation, between safe operation and collaboration, between collaboration and economic agency.

When those interfaces emerge—and they will—the stack will become something much more powerful than the sum of its parts. Imagine a standard protocol that lets a cognitive measurement system automatically configure an agent's security sandbox based on its assessed capabilities. Or an interface that lets a multi-agent coordination tool query a model's cognitive profile to determine which tasks to delegate. Or a payment protocol that automatically adjusts pricing based on the assessed quality of the agent making the request. None of this exists yet. But the pieces are on the table, and the people who define the interfaces between them will shape the next decade of computing.

Whether this speed is exhilarating or terrifying depends on your temperament. But whether you are excited or worried, the practical response is the same: understand the stack, use the layers you need, and do not pretend it is not happening. The developers who understood the web stack early in the 2000s built the companies that dominate today. The developers who understand the autonomy stack now have a similar window.


Chapter Twelve

What to Make of All This

There is a tendency, when confronted with a week like this one, to either hype everything into inevitability or dismiss everything as premature. Both reactions are wrong, and for the same reason: they substitute a narrative for analysis.

The hype narrative says: autonomous AI agents are coming, they will transform everything, resistance is futile, invest now. This ignores the enormous engineering challenges that remain. NemoClaw is in alpha. Tmux-IDE is version 1.2. MPP has a handful of merchants. DeepMind's framework is a proposal, not a standard. Alain's experiments work on specific models for specific benchmarks. Generalizing from any of these to "the future is here" is premature.

The dismissal narrative says: this is all vaporware and hype, AI agents cannot really do anything useful, the benchmarks are gamed, the payment volumes are trivial. This ignores the trajectory. Every one of these projects represents real, working code solving real problems. The trajectory is clear, even if the timeline is not.

The useful response is neither hype nor dismissal. It is engagement. Read Alain's code on GitHub. Run the circuit-finding experiment on a model you care about. Look at DeepMind's cognitive framework and think about which dimensions matter for your applications. Deploy NemoClaw if you are running agents. Try Tmux-IDE if you want to see what multi-agent development feels like. Integrate MPP if you are building an API that agents might want to use.

The week of March 17, 2026, was not the week that everything changed. It was the week that the infrastructure for everything changing became visible. The difference matters. Infrastructure does not have launch dates. It has adoption curves. And the best time to understand an adoption curve is at the beginning, when the slope is still gentle and there is still time to build on top of it.

A developer in France duplicated three layers of a neural network and tripled its reasoning ability. Google DeepMind proposed a scientific framework for measuring machine cognition. NVIDIA open-sourced a security framework for autonomous agents. A developer in Belgium launched a tool that turns Claude into a team member. Stripe and Visa started letting machines spend money.

None of these stories, alone, changes the world. Together, they describe the world that is being built.

There is an old joke in computer science: "There are only two hard problems—cache invalidation, naming things, and off-by-one errors." The autonomy stack adds a few more. How do you cache a reasoning circuit? What do you call a machine that can think twice but not reflect? How do you count the cognitive abilities of a system that has no fixed cognitive profile?

These are not jokes. They are the research agenda for the next five years, compressed into the infrastructure announcements of a single week. The answers will come not from the corporate labs that dominate the current AI landscape but from the edges: solo developers like Alain running experiments on consumer hardware, open-source communities building coordination tools, startups like Tempo designing protocols that machines can use without human intermediation.

The center of gravity in AI is shifting. For the past three years, it has been in the training labs: OpenAI, Google DeepMind, Anthropic, Meta. The models they produce are powerful and transformative, and they will continue to improve. But the action is increasingly moving to the infrastructure layer—to the people building the tools that determine how those models are understood, measured, contained, coordinated, and commercialized. The model is the engine, but the stack is the car. And the car is what people actually drive.

The question is not whether you find it interesting. The question is what you are going to build on top of it.

"We shape our tools, and thereafter our tools shape us."
— John Culkin, describing Marshall McLuhan
- THE END -