The end of free hardware
Dennard scaling broke twenty years ago. The transistor density curve still has runway. The bill nobody warned us about is the complexity tax, and it is coming due.

My first machine was a Commodore VIC-20. One MHz, three and a half kilobytes of RAM after BASIC took its cut. That last clause is the first half of the lesson. There was no compiler. I wrote in a language that ran inside a runtime that lived in my memory budget, competing with my program for every byte. The first thing I learned was not how to write code. It was that the language and the runtime were not free, and if I did not account for what they were costing me, my program did not fit.
The second half of the lesson came when BASIC ran out of room and I dropped down to 6502 assembly. That is where I learned what a language actually is. A layer of cost over a CPU that has its own rules. Registers, zero page, memory-mapped I/O, the way the chip does not know what your variables are because it only knows addresses and opcodes. Once you have seen the metal, you do not stop seeing it through whatever you write on top of it.
That lens has not gone away. Forty-five years later it is the same lens I use to read a slow .NET service and find the answer is the GC, or the JIT tier, or boxing, or a closure capture the developer never noticed. The runtime is not free. It never was. The hardware just got fast enough for long enough that most of us got to pretend it was.
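To make that concrete, here is the kind of thing I mean by a cost the developer never noticed. The names are invented for illustration, but the pattern is everywhere in ordinary C#: a lambda that captures a local quietly allocates a closure object, and a struct passed where an object is expected gets boxed onto the heap. Neither looks like an allocation at the call site. Both are work the GC has to clean up later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class HiddenCosts
{
    // A small value type. Passing it where 'object' is expected boxes it.
    readonly record struct OrderId(int Value);

    static int CountMatches(List<OrderId> orders, OrderId wanted)
    {
        // The lambda captures 'wanted', so the compiler allocates a closure
        // object (plus a delegate) on the heap every time this method runs.
        return orders.Count(o => o.Equals(wanted));
    }

    static void Log(object value)
    {
        // An 'object' parameter forces callers to box any struct they pass.
        Console.WriteLine(value);
    }

    static void Main()
    {
        var orders = new List<OrderId> { new(1), new(2), new(2) };

        Console.WriteLine(CountMatches(orders, new OrderId(2))); // closure + boxed enumerator
        Log(new OrderId(2));                                     // boxing allocation
    }
}
```

None of this is exotic. It is what straightforward C# compiles to unless you go looking.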
That run is over. It has been over since 2005. Most of the industry is still acting like it is not.
The question someone asked me last week was whether we are at the end of CPU hardware growth because of physics. The honest answer is more interesting than yes or no.
Dennard scaling broke in 2005, and we never stopped pretending
From 1974 through about 2005, transistors got smaller, ran faster, and used less power per cycle, all on the same curve. That was Dennard scaling. It is the reason a 1995 PC felt slow next to a 1999 PC and a 2003 PC felt like a different category of machine. The chip was not just denser, it was also faster, and because each transistor used less power as it shrank, you could run all of them flat out without melting the die. Free lunch, every two years.
Around 2005 the lunch ended. Voltage stopped scaling. Leakage current and heat density caught up with the geometry, and you could no longer just keep pushing the clock. The artifact is on every clock-speed graph since. A Pentium 4 in 2005 hit 3.8 GHz. A 2026 Ryzen sits around 5.5 GHz. After twenty years of process improvements, the clock has not even doubled. Single-thread performance now grows about ten percent per generation in a good year, where the Dennard era doubled it every couple of years.
The industry's response was to stop selling clock speed and start selling cores. Two, four, eight, sixteen, thirty-two, sixty-four. That works as a marketing answer. It does not work as a substitute for single-threaded performance for code that was never written to be parallel, which is most code, written by most developers, running in most production systems on Earth.
Density still has runway, but the bill has changed hands
Transistor density itself has not stopped. It has slowed. The "3nm" and "2nm" labels are marketing names, and the actual gate pitch is larger than the numbers suggest, but real density gains continue. TSMC's N2 process shipped in late 2025 with gate-all-around nanosheet transistors. The roadmap has A14 and A10 with backside power delivery and CFETs mapped into the early 2030s. The physical wall, where atomic-scale features stop working because of quantum tunneling, is still out there, but it is a soft wall, not a brick one. There is a decade of runway before we hit it on conventional CMOS.
The closer wall is economic. A leading-edge fab now costs more than 20 billion dollars to build, and the latest TSMC and Intel fabs are running closer to 40. A 2nm mask set runs into the hundreds of millions before tape-out. Verification eats more of the schedule than design does. Three or four companies in the world can afford to play at the leading edge, and the rest of the industry buys what those three or four ship. That is not a physics problem. It is a capital concentration problem with the same effect: fewer choices and slower change.
The growth has moved sideways, not up
Most of the actual performance growth in the last ten years has not come from faster general-purpose CPUs at all. It has come from chiplets, 3D stacking, HBM stacked next to logic dies, and specialized silicon built for specific workloads. GPUs started as graphics hardware and ended up the substrate for every parallel arithmetic problem worth solving. Tensor cores and NPUs and the on-die matrix units in Apple Silicon make on-device inference cheap enough to ship in a phone.
This is the part most working developers have missed. An LLM inference workload sees order-of-magnitude gains because it gets to ride the accelerator curve. A typical .NET LOB application sits on conventional cores and sees almost nothing. The two workloads are not on the same hardware curve anymore, and neither are the careers of the people who write them.
The vertically integrated companies figured this out first and acted on it. Apple owns the chip, the OS, the compiler, the frameworks, and the apps that ship with the device. Google built TPUs to run Google's workloads on Google's racks. Amazon built Graviton because they got tired of paying Intel margins and they could co-design the chip with the services it would run. Tesla built FSD silicon for the car they were already shipping. The pattern is the same in every case. If you control the layers, you can co-design across them and get gains nobody assembling parts can match.
If you are working in a Dell plus Intel plus Microsoft plus ISV stack, you are an ISV. You are building on top of someone else's choices, and the headroom you have is whatever those choices left for you.
The memory wall is the actual ceiling
The number that matters most in modern CPU performance is not clock speed and not core count. It is the time between "I need this byte" and "I have it" when the byte is sitting in main memory.
That number has barely moved in twenty-five years. A DDR1 access in 2000 took roughly 80 to 100 nanoseconds. A DDR5 access in 2026 takes 60 to 90. Bandwidth has scaled enormously over that same period, DDR4 to DDR5 roughly doubled it, HBM3 on a current GPU pushes more than 800 GB per second. But latency, the actual time you wait for a specific value, has stayed flat. It is a physics-and-protocol problem, not a parallelism problem. You cannot widen your way around it.
What 80 nanoseconds means in CPU cycles, on a 5 GHz core, is roughly 400 cycles of doing nothing useful. In that window the core could have run 1500 or more instructions if the data had been in L1 cache. Instead it sits and waits.
The cache hierarchy in numbers, roughly:
- L1 cache: about 1 ns, 32 to 128 KB per core
- L2 cache: 3 to 10 ns, 0.5 to 2 MB per core
- L3 cache: 10 to 40 ns, tens of MB, shared
- Main memory (DRAM): 60 to 100 ns
- NVMe SSD: 10,000 to 100,000 ns
- Spinning disk: about 10,000,000 ns
Each step down that list costs you somewhere between a few times and a few orders of magnitude. Almost everything we call modern CPU performance engineering since the mid-2000s is really cache hierarchy and prefetcher engineering. Wider out-of-order windows, deeper speculation, smarter branch predictors, larger reorder buffers, better prefetch heuristics: all of it is fundamentally about hiding the latency of going to main memory. When the trick fails, when you take a cache miss the prefetcher could not predict, you see what the bare hardware can actually do, and the answer is not impressive.
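One way to feel those numbers from C#, rather than take them on faith, is to walk the same array in order and then in a shuffled order. This is a rough sketch, not a rigorous benchmark (use BenchmarkDotNet if you want real numbers), but the shape of the result holds: the sequential pass streams through the cache with the prefetcher's help, and the shuffled pass pays something close to full DRAM latency on most steps.

```csharp
using System;
using System.Diagnostics;

class CacheWalk
{
    static void Main()
    {
        const int N = 1 << 24;                 // 16M ints = 64 MB, well past any L3
        var data = new int[N];
        var order = new int[N];
        var rng = new Random(42);

        for (int i = 0; i < N; i++) { data[i] = 1; order[i] = i; }

        // Shuffle the visit order (Fisher-Yates) so the prefetcher cannot help.
        for (int i = N - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (order[i], order[j]) = (order[j], order[i]);
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++) sum += data[i];            // sequential: cache-friendly
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < N; i++) sum += data[order[i]];     // shuffled: mostly DRAM latency
        Console.WriteLine($"shuffled:   {sw.ElapsedMilliseconds} ms  (sum = {sum})");
    }
}
```

Same number of adds, same data. On commodity hardware the shuffled pass is typically several times slower, and every bit of the difference is the memory wall.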
The memory wall got its name in 1995, in a paper by Wulf and McKee called "Hitting the Memory Wall: Implications of the Obvious." They were exactly right and a lot earlier than most of the industry was willing to admit.
Complexity is the real tax
Physics is the headline. Complexity is the bigger story.
A leading-edge SoC is now somewhere between 50 and 100 billion transistors, and on the high end well past that. It is not really a chip anymore, it is a system in a package. Multiple dies, multiple process nodes, custom interconnect fabrics, power delivery networks that are themselves serious engineering work. The verification schedules and mask costs from the last section live here. And the number of people on Earth who can do this work at the leading edge is, optimistically, in the low thousands.
Software has not kept up with any of it. A modern x86 core does wide out-of-order execution, deep pipelining, multi-level cache coherency across many cores, NUMA, P-cores and E-cores with different performance profiles, simultaneous multithreading, vector units of varying widths, and increasingly matrix and AI extensions. A C# developer writing straightforward business logic captures maybe five to ten percent of what the silicon can theoretically do. The JIT and the CLR do real work to bridge that gap. Mainstream code is fundamentally not written to feed these machines, and the gap between peak and achieved has been widening for twenty years.
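For a sense of the gap, compare the loop most business code writes with the same reduction expressed through System.Numerics.Vector<T>, which lets the JIT use whatever SIMD width the core actually has. This is a sketch, and it quietly assumes the inputs are small enough that the int lanes do not overflow; the point is the shape, not the specific numbers.

```csharp
using System.Numerics;

static class Sums
{
    // The loop most line-of-business code writes: one element per iteration.
    public static long ScalarSum(int[] values)
    {
        long total = 0;
        for (int i = 0; i < values.Length; i++) total += values[i];
        return total;
    }

    // The same reduction over Vector<int>, which the JIT maps to the widest
    // SIMD registers the CPU offers (4, 8, or 16 int lanes).
    // Assumes the values are small enough that the int lanes do not overflow.
    public static long VectorSum(int[] values)
    {
        var acc = Vector<int>.Zero;
        int lanes = Vector<int>.Count;
        int i = 0;

        for (; i <= values.Length - lanes; i += lanes)
            acc += new Vector<int>(values, i);    // one add covers 'lanes' elements

        long total = 0;
        for (int lane = 0; lane < lanes; lane++) total += acc[lane];
        for (; i < values.Length; i++) total += values[i];   // leftover tail
        return total;
    }
}
```

The vector version is not hard to write. The problem is that most workloads are not shaped like a sum over an array, and nobody restructures working business logic to feed the vector units. That is where the other ninety percent goes.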
Then there is the security tax that nobody planned for. Spectre and Meltdown were disclosed in January 2018, and in the years since we have had an entire parade of side-channel attacks against the same speculative execution machinery that drove most of the single-thread gains from the late 90s onward. The mitigations cost real performance, anywhere from five to thirty percent on specific workloads. That performance is gone. New designs have to assume side-channel resistance from the ground up, which constrains the architectural moves that used to deliver speed.
Underneath all of this is a stack between your C# and the transistors that nobody fully understands end to end. Compilers, drivers, firmware, microcode, schedulers, hypervisors, container runtimes, sidecars. Bugs at any layer can erase silicon gains. The companies winning right now are the ones who can co-design across that stack. The rest of us are downstream of decisions we did not make.
What this means for the rest of us
The era of "wait 18 months, your code gets faster for free" is over and has been for a while. Two paths forward.
The first path is co-design. Pick a workload, pick the hardware that suits it, control as much of the stack as you can. Apple does this. The hyperscalers do this. At hobbyist scale it is what I am doing with GlyphDeck. I picked the SBC, I am writing the OS image, I designed the comms protocol, I own the rendering layer and the API on the host. That is not because I want to build everything. It is because the only way to get a smooth UI on a Radxa Zero 3W with A55 in-order cores and a modest memory subsystem is to know exactly what every layer is costing me.
The second path is to learn to use the hardware you already have. To stop assuming that next year's machine will paper over this year's lazy code, because it will not. To remember that allocations cost something. That cache lines exist. That a List<MyClass> and a MyStruct[] are not interchangeable just because they hold the same data. Most developers have never had to think about any of this. The hardware curve protected them from having to. That protection has run out.
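The List<MyClass> versus MyStruct[] point deserves one sketch, because the source looks almost identical and the memory layout is not. The types here are placeholders. The class version is an array of references, each pointing at its own heap object; the struct array is one contiguous block the prefetcher can stream.

```csharp
using System.Collections.Generic;

// Reference type: a List<PointClass> stores references, and every element is
// its own heap object with an object header, placed wherever the GC put it.
class PointClass { public float X, Y, Z; }

// Value type: a PointStruct[] is one contiguous block, three floats per
// element, several elements per 64-byte cache line.
struct PointStruct { public float X, Y, Z; }

static class Layout
{
    public static float SumX(List<PointClass> points)
    {
        float sum = 0;
        foreach (var p in points) sum += p.X;                // pointer chase per element
        return sum;
    }

    public static float SumX(PointStruct[] points)
    {
        float sum = 0;
        for (int i = 0; i < points.Length; i++) sum += points[i].X;   // sequential reads
        return sum;
    }
}
```

Scanning the struct array touches consecutive cache lines. Scanning the list chases a pointer per element, and after the GC has moved things around, those pointers can land anywhere.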
What worries me more than the physics or the economics is what abstraction has done to developer instincts. We have become complacent, and the stack has gotten deep enough that complacency is rational. Hardware underneath, then an operating system, then a runtime, then a framework, then a browser engine, then Electron on top of the browser to ship the same app as a desktop application. Five or six layers before you get to the code somebody actually wrote. A modern chat client renders a text box by booting a full Chromium engine and spends two hundred megabytes of memory doing it. The same job ran instantly on a 286 with a couple of kilobytes. Most developers shipping that chat client have never had to think about any of it, because the layers all worked well enough, and the hardware curve absorbed the cost. That is the part that has changed.
For years we papered over slow code because papering over it was cheap. We could get more RAM, or more CPU, or faster storage, and the hardware would catch our slop. The runway is ending. We need to start being frugal again.
I will write the second post about the second path. The practical rules. What costs what in .NET. Why a small trimmed self-contained binary is a feature and not a curiosity. How to write code in 2026 that respects the machine it actually runs on.
I learned the lesson on a 1 MHz CPU with three and a half kilobytes of RAM. The hardware has come a very long way since then. The lesson has not changed.