Frugal again - rules for writing code that respects the machine
The hardware curve has flattened. The papering-over is ending. Here are the rules for writing code that respects the machine it actually runs on, with .NET specifics.

Picture a Radxa Zero 3W: a single-board Advanced RISC Machines (ARM) Linux machine the size of a stick of gum, with a modest Central Processing Unit (CPU) and just enough Random Access Memory (RAM) to think with. On it runs a small rendering engine I wrote, parsing GlyphDeck Screen Definition Language (SDL) files and drawing them frame by frame to the macro pad's display. (SDL here is GlyphDeck's own format: a small JavaScript Object Notation (JSON) dialect describing what to draw, not the Simple DirectMedia Layer 2 (SDL2) C graphics library that shares the initials.) On hardware this thin there is no headroom to absorb mistakes. Every frame I draw, every byte I allocate, the machine answers honestly. If I write a hot loop that allocates, I see it. If I reach for Language Integrated Query (LINQ) where a for loop would do, I see it.
That kind of feedback used to be normal. The C64 I grew up on had 64 kilobytes and one job: do exactly what you told it to. Forty-five years later the machines are unrecognizable but the discipline is identical. The runtime is not free. The hardware is not unlimited. The developer who keeps both of those facts in view ships smaller, faster, more predictable code.
This is the practical companion to cpu-physics-end-of-runway. That post made the case that the free hardware ride is over and we have grown complacent because the abstraction stack got too deep. This one is the answer: here is what to actually do. Either post stands on its own. You may have arrived at this one first; that is fine.
Audience: working .NET developers writing Line of Business (LOB) systems, small services, Command-Line Interface (CLI) tools, desktop apps. Not chip architects. Not game-engine teams. Not High-Frequency Trading (HFT) shops. The people who have never needed to think about cache lines and now might want to start.
What the compiler does for you, and what it does not
The single biggest mistake developers make about .NET performance is overestimating the runtime. The Just-In-Time (JIT) compiler and the Garbage Collector (GC) are good. They are not as good as you think. They are doing a hard job under tight constraints, and most of what makes a program fast is decisions you have already made before the JIT ever sees the code.
Things RyuJIT and the modern Common Language Runtime (CLR) do well:
- Register allocation and instruction scheduling
- Inlining of small methods, dead-code elimination, constant folding
- Loop unrolling and basic loop optimizations
- Some auto-vectorization, getting better release-on-release but still conservative compared to a hand-written Vector<T> path
- Tiered compilation: hot methods get re-JITted with more aggressive optimizations once they prove themselves
- Limited escape analysis, with stack allocation expanding meaningfully in .NET 10
Things it will not do for you:
- Choose your data structures
- Decide between class and struct
- Reorder fields for cache-line alignment
- Pool allocations on your behalf
- Avoid boxing
- Choose access patterns (row-major vs column-major)
- Write Single Instruction Multiple Data (SIMD) code for you
Everything below is the "will not do for you" list, organized into rules you can actually apply.
The rules, in order of impact
1. Know the cost of an allocation, and work with the GC, not against it
Heap allocations are not free, but they are also not uniformly expensive. The generational GC is built around an assumption: most objects either die young or live forever. If your object dies in Gen0, it costs almost nothing. If it lives long enough to be promoted to Gen1 or Gen2 and then becomes garbage, you pay for a full collection of that generation.
The expensive case is the middle. Objects that linger just long enough to get promoted are the silent killer. A request handler that captures a lambda, holds onto it for the duration of a HyperText Transfer Protocol (HTTP) call, then discards it can survive Gen0 if the request is slow, get promoted, and turn a routine endpoint into a Gen1 collection generator under load.
A class is a heap allocation. A struct on the stack is not. A struct in an array is not. A boxed struct is.
Anything 85 KB or larger lands on the Large Object Heap. The LOH is collected with Gen2 and is not compacted by default, which means LOH allocations fragment your address space and force expensive Gen2 work. Long-lived large buffers belong in ArrayPool<T>.Shared, not in a fresh new byte[size] on every call.
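A minimal sketch of the rent-and-return pattern; ProcessChunk is a hypothetical consumer standing in for real work:

```csharp
using System;
using System.Buffers;

// Rent a buffer big enough for the chunk; the pool may return a larger array.
byte[] buffer = ArrayPool<byte>.Shared.Rent(100_000);
try
{
    // Only the first 100_000 bytes are yours; slice to the requested length.
    ProcessChunk(buffer.AsSpan(0, 100_000));
}
finally
{
    // Return it so the next caller reuses it. Pass clearArray: true for sensitive data.
    ArrayPool<byte>.Shared.Return(buffer);
}

// Hypothetical consumer, stands in for real work.
static void ProcessChunk(Span<byte> chunk) { }
```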
(Sources: Ben Watson's Writing High-Performance .NET Code via the breder.org reading notes, and Adam Sitnik on pooling large arrays.)
2. Choose struct vs class with intent, and don't box value types into object slots
The default is class. Reach for struct when you have a small, value-semantic, allocation-hot shape and you have measured the win. readonly struct to prevent defensive copies. ref struct for stack-only types like Span<T>. Pass and return by ref when the struct is large enough that the copy starts to cost.
The boxing trap is older than .NET itself but it has not gone away. Generics killed the worst offender; nobody puts an int into an ArrayList anymore. But boxing happens any time a value type lands in an object slot:
- An interface reference holding a struct (unless the interface methods are devirtualized)
- params object[] array overloads
- Old string.Format overloads (the modern interpolated-string handler is fine)
- A Dictionary<,> lookup with a value-type key hitting the wrong comparer
- IEnumerable<T> returning structs through the boxing iterator path
The cost of a single box is around twenty times a reference assignment, and the unbox cast is roughly four times an assignment. Per-call that is small. In a hot loop it is the entire performance budget.
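A compact illustration of the first two boxing sites; Point and BoxingDemo are invented for the example:

```csharp
using System;

// Point is an invented value type. Implementing IEquatable<Point> keeps
// Dictionary lookups box-free; without it, EqualityComparer<Point>.Default
// falls back to Object.Equals and boxes the key on every comparison.
struct Point : IEquatable<Point>
{
    public int X, Y;
    public bool Equals(Point other) => X == other.X && Y == other.Y;
    public override int GetHashCode() => HashCode.Combine(X, Y);
}

class BoxingDemo
{
    static void Main()
    {
        var p = new Point { X = 1, Y = 2 };

        object slot = p;          // box: heap allocation plus a copy
        IEquatable<Point> i = p;  // also a box: interface reference to a struct

        var x = ((Point)slot).X;  // unbox: the cheaper half of the round trip
        Console.WriteLine(x + (i.Equals(p) ? 1 : 0));
    }
}
```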
(Sources: Sitnik on value vs reference types and ref returns and ref locals; Microsoft Learn's .NET Performance Tips, still right on the boxing rule even if the example is dated.)
3. Use Span<T> and Memory<T> like you mean it
The single most important addition to the .NET performance toolkit since 2018. Span<T> is a stack-only view over a contiguous region of memory. It can point at a stack buffer, an unmanaged buffer, an array, or a slice of a larger array. Slicing is free. Iteration is JIT-friendly. Parsing patterns that used to allocate intermediate strings can now run with zero allocation.
The ReadOnlySpan<char> parsing pattern, in particular, has rewritten how the Base Class Library (BCL) handles strings internally. Most of the allocation reductions in the annual Toub posts trace back to spans being threaded through more APIs. If you are still calling string.Split and discarding most of the result, you are doing it the slow way.
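A sketch of the shape, using the span-based int.Parse overload (SumCsv is invented for illustration):

```csharp
using System;

// Sum comma-separated integers with no intermediate string allocations.
// Contrast with input.Split(','), which allocates one string per field
// plus the array that holds them.
static int SumCsv(ReadOnlySpan<char> input)
{
    int sum = 0;
    while (!input.IsEmpty)
    {
        int comma = input.IndexOf(',');
        ReadOnlySpan<char> field = comma < 0 ? input : input[..comma];
        sum += int.Parse(field);  // span-based overload, no string created
        input = comma < 0 ? default : input[(comma + 1)..];
    }
    return sum;
}

Console.WriteLine(SumCsv("10,20,30"));  // 60
```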
(Source: Sitnik's Span explainer.)
4. Stop allocating in hot paths
ArrayPool<T>.Shared for arrays you will return when done. MemoryPool<T> for the same with pooled IMemoryOwner<T> semantics. Pooled StringBuilder instances. Reuse buffers across iterations. The async state machine is itself an allocation; ValueTask exists so that the common synchronous-completion case does not pay for one.
String concatenation deserves its own beat because the old advice has aged. The 2017-era rule was "always use StringBuilder in loops." That is still right for unbounded or large concatenation. For everything else, modern .NET has changed the picture:
- A single-expression concatenation a + b + c + d is folded by Roslyn into String.Concat(a, b, c, d), which beats StringBuilder for fixed-N
- Interpolated strings ($"...") since .NET 6 use DefaultInterpolatedStringHandler, which is heavily optimized and frequently beats StringBuilder for small fixed concatenations
- For genuinely hot paths, string.Create(length, state, callback) or writing into a Span<char> beats both
- StringBuilder is still right when the final length is unknown and you are appending in a loop
Pick the tool for the shape of the work, not the rule of thumb from 2010.
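A sketch of the string.Create option; the key format here is invented:

```csharp
using System;

// Build an 8-character key ("ID-00042") with exactly one allocation:
// the final string itself, written in place.
static string MakeKey(int id) =>
    string.Create(8, id, static (span, value) =>
    {
        "ID-".AsSpan().CopyTo(span);             // 3 chars of prefix
        value.TryFormat(span[3..], out _, "D5"); // 5 zero-padded digits
    });

Console.WriteLine(MakeKey(42)); // ID-00042
```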
5. Watch what LINQ costs
Every chained operator in a LINQ pipeline is at minimum a closure capture, often an allocation, and a delegate invocation per element. Where().Select().ToList() over a thousand-item list will allocate the closure for Where, the closure for Select, the iterator state machine, the result List<T>, and pay a virtual call per element. In a request handler that runs once per request, that is invisible. In a hot inner loop that runs ten thousand times a second, that is the bug.
LINQ in a hot loop is the single most common quiet performance bug in .NET code. The fix is almost always a for loop or a foreach over the underlying collection type. Save LINQ for the places where the readability win is real and the call site is cold.
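The same filter-and-project written both ways (Projections is an invented container; the comments describe costs qualitatively, not benchmark results):

```csharp
using System.Collections.Generic;
using System.Linq;

static class Projections
{
    // LINQ shape: a delegate invocation per element, two iterator state
    // machines, plus the result list. With captured locals it would also
    // allocate closures.
    public static List<int> WithLinq(List<int> items) =>
        items.Where(x => x > 10).Select(x => x * 2).ToList();

    // Loop shape: allocates only the result list. foreach over List<int>
    // uses the struct enumerator, so there is no interface dispatch.
    public static List<int> WithLoop(List<int> items)
    {
        var result = new List<int>(items.Count);
        foreach (int x in items)
        {
            if (x > 10)
                result.Add(x * 2);
        }
        return result;
    }
}
```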
6. Async is not free
Each async method generates a state machine. The state machine is a struct, but if the method ever actually goes async (awaits a non-completed task), the state machine boxes onto the heap so it can survive the suspension. ValueTask<T> exists so that a method that usually completes synchronously can return without allocating.
The corollary: do not make every method async by reflex. A wrapper that awaits a single task and returns the result has paid for a state machine to do nothing. If the inner call is genuinely async, propagate the Task directly. If it is sometimes synchronous, return ValueTask.
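A minimal sketch of the sometimes-synchronous pattern; PriceLookup and FetchAsync are invented stand-ins, and the cache is deliberately naive:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Invented example: a lookup that usually hits a local cache. Not
// thread-safe; a single-threaded sketch of the shape only.
class PriceLookup
{
    private readonly Dictionary<string, decimal> _cache = new();

    public ValueTask<decimal> GetPriceAsync(string symbol)
    {
        // Common case: synchronous completion, no Task allocation at all.
        if (_cache.TryGetValue(symbol, out decimal cached))
            return new ValueTask<decimal>(cached);

        // Rare case: genuinely asynchronous, wraps the Task it already needs.
        return new ValueTask<decimal>(FetchAsync(symbol));
    }

    private async Task<decimal> FetchAsync(string symbol)
    {
        await Task.Delay(100);  // placeholder for the real I/O call
        decimal price = 42m;
        _cache[symbol] = price;
        return price;
    }
}
```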
7. Bound your concurrency. Do not fan out unbounded.
await Task.WhenAll(items.Select(i => DoWorkAsync(i))) over ten thousand items is a perennial production incident. It is not "parallel," it is "all at once, and now everything is starving." The thread pool, the Input/Output (I/O) completion ports, the database connection pool, and whatever you are calling are all under simultaneous attack from your one process.
Use Parallel.ForEachAsync with a sensible MaxDegreeOfParallelism, or gate the work with a SemaphoreSlim. The right number of in-flight requests is almost never "all of them." It is usually small: 8, 16, or 32, depending on the workload. Measure against the actual downstream and pick a number that keeps the pipe full without saturating it.
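Both options in sketch form; DoWorkAsync is a hypothetical placeholder for the real per-item call:

```csharp
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

int[] items = Enumerable.Range(0, 10_000).ToArray();

// Option 1: Parallel.ForEachAsync with an explicit ceiling.
await Parallel.ForEachAsync(
    items,
    new ParallelOptions { MaxDegreeOfParallelism = 16 },
    async (item, ct) => await DoWorkAsync(item, ct));

// Option 2: a SemaphoreSlim gate around the familiar WhenAll shape.
using var gate = new SemaphoreSlim(16);
await Task.WhenAll(items.Select(async item =>
{
    await gate.WaitAsync();
    try { await DoWorkAsync(item, CancellationToken.None); }
    finally { gate.Release(); }
}));

// Hypothetical stand-in for the real per-item operation.
static Task DoWorkAsync(int item, CancellationToken ct) => Task.Delay(10, ct);
```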
8. Reach for concurrent collections only when there is real contention
ConcurrentDictionary<,> and ConcurrentBag<T> exist for write-heavy shared state. They are not free; they pay for synchronization on every access. Read-mostly or read-only data accessed from multiple threads does not need them. If the data is built once at startup and then read, a regular Dictionary<,> (or FrozenDictionary<,> since .NET 8) is faster on every lookup.
The lazy default of "I have multiple threads, so I need a ConcurrentDictionary" is a quiet tax on every read. Reach for it when you actually have writers competing with readers, not as a precaution.
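The build-once shape, assuming .NET 8 or later:

```csharp
using System;
using System.Collections.Frozen;
using System.Collections.Generic;

// Build once at startup, read forever. ToFrozenDictionary (.NET 8+) spends
// extra time at construction so that every later lookup is cheaper.
var source = new Dictionary<string, int>
{
    ["GET"] = 1, ["POST"] = 2, ["PUT"] = 3, ["DELETE"] = 4,
};

FrozenDictionary<string, int> methods = source.ToFrozenDictionary();

// Safe to read from any thread with no synchronization, because nothing
// writes to it after construction.
Console.WriteLine(methods["POST"]);  // 2
```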
A few more, compressed into prose
Three rules deserve mention but do not warrant their own slot in the numbered list.
Struct layout matters when you are operating on millions of small structs. Field ordering, [StructLayout(LayoutKind.Sequential, Pack = N)], and the choice between Array of Structs (AoS) and Struct of Arrays (SoA) determine the cache hit rate. This rarely shows up outside tight numerical work, but where it shows up it dominates.
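A schematic of the two layouts, with invented Particle types:

```csharp
using System.Runtime.InteropServices;

// Array of Structs: summing only X still drags Y, Z, and W through the
// cache, because all four travel together in each 16-byte Particle.
[StructLayout(LayoutKind.Sequential)]
struct Particle { public float X, Y, Z, W; }

// Struct of Arrays: each field is contiguous, so a pass over Xs touches
// a quarter of the memory and vectorizes cleanly.
struct Particles
{
    public float[] Xs, Ys, Zs, Ws;
}
```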
Reflection is a last resort. Cache delegates if you must reflect. Prefer source generators (System.Text.Json, LoggerMessage, regex source generators), which produce the dispatch at compile time. Avoid dynamic outside genuine interop scenarios.
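One concrete instance of the trade, the regex source generator (available since .NET 7; Validators is an invented name):

```csharp
using System.Text.RegularExpressions;

// The pattern is compiled into generated C# at build time: no runtime
// Regex construction, and nothing reflection-driven for a trimmer to chase.
public static partial class Validators
{
    [GeneratedRegex(@"^\d{4}-\d{2}-\d{2}$")]
    public static partial Regex IsoDate();
}

// Usage: Validators.IsoDate().IsMatch("2026-01-15")
```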
And measure, do not guess. BenchmarkDotNet is the only honest answer. Run the workload, read the numbers, pick the rule the numbers actually support. Sitnik's sample performance investigation is a worked end-to-end example.
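A minimal harness, assuming the BenchmarkDotNet package is referenced; the two benchmarks reuse the string shapes from rule 4:

```csharp
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// MemoryDiagnoser adds allocation columns next to the timing columns.
[MemoryDiagnoser]
public class ConcatBenchmarks
{
    private readonly int _id = 42;

    [Benchmark(Baseline = true)]
    public string Interpolated() => $"ID-{_id:D5}";

    [Benchmark]
    public string Builder() =>
        new StringBuilder("ID-").Append(_id.ToString("D5")).ToString();
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<ConcatBenchmarks>();
}
```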
Trimmed self-contained binaries
I ship single-file self-contained binaries as the default for personal projects. The size, startup, and distribution cases are clear: one file, no SDK install on the target machine, deterministic deployment. With PublishTrimmed=true the binary shrinks dramatically because the trimmer drops every type and member it can prove unreachable from your entry point.
That word "prove" is doing all the work. The trimmer is a static analyzer. Reflection, dynamic assembly loading, and serialization-by-name all defeat the proof. If the trimmer cannot tell that MyType is reachable, MyType does not ship.
Writing code the trimmer will not reject
A working list of habits that keep trimming clean:
- Annotate dynamic access. [DynamicallyAccessedMembers] on parameters, fields, and return types tells the trimmer which members to keep. A method that takes a Type parameter and pokes its properties needs the attribute; otherwise the trimmer assumes nothing about that type is needed. (A sketch of both attributes follows this list.)
- Mark the trim-unsafe boundary. [RequiresUnreferencedCode] on the methods that genuinely need it. The warning bubbles up through every caller, which makes the unsafe surface visible at compile time instead of blowing up in production on the path the trimmer pruned.
- Prefer source generators over reflection. System.Text.Json source generation, LoggerMessage, regex source generators. Each one removes a reflection path the trimmer would otherwise have to preserve or fail on.
- Avoid Activator.CreateInstance, Type.GetType(string), Assembly.Load, and dynamic in trimmed code. If you cannot avoid them, isolate them behind a trim-attributed boundary.
- Turn warnings on, treat them as errors. Set <TrimmerSingleWarn>false</TrimmerSingleWarn> and <TreatWarningsAsErrors>true</TreatWarningsAsErrors>. The trimmer emits Intermediate Language (IL) warning codes IL2026 / IL2070 / IL2075 for every unsafe pattern it sees. Silencing them ships a broken binary that fails only on the pruned path.
- Test the trimmed output, not the JIT debug build. A debug build passing tells you nothing about the trimmed publish. Run the published artifact in Continuous Integration (CI).
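Both annotation attributes in one sketch; TrimSafeReflection and its methods are invented for illustration:

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

static class TrimSafeReflection
{
    // The attribute tells the trimmer: whatever Type flows in here needs
    // its public properties kept. Callers passing typeof(SomeType) then
    // keep those members in the trimmed output.
    public static void DumpProperties(
        [DynamicallyAccessedMembers(DynamicallyAccessedMemberTypes.PublicProperties)]
        Type type)
    {
        foreach (var prop in type.GetProperties())
            Console.WriteLine(prop.Name);
    }

    // Anything the analysis cannot follow is fenced off explicitly; every
    // caller inherits the IL2026 warning instead of a silent runtime failure.
    [RequiresUnreferencedCode("Loads plugin types by name at runtime.")]
    public static object LoadPlugin(string assemblyQualifiedName)
    {
        var type = Type.GetType(assemblyQualifiedName, throwOnError: true)!;
        return Activator.CreateInstance(type)!;
    }
}
```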
Where it works: GDC, WebGet, Notify, WhisperTranscribe
The CLI tools and the headless service all trim cleanly. GlyphDeckComs (GDC) is the daemon that bridges the macro pad to the host machine. Trimmed publish, ARM Linux target, runs on a Radxa Zero 3W with bytes to spare. WebGet, Notify, WhisperTranscribe: same story. Small surface area, controlled dependencies, libraries that ship trimmer annotations or are simple enough not to need them.
Where it does not: Avalonia
PublishTrimmed=true against a real Avalonia application is a fight. Avalonia's XAML pipeline, styling system, and data templates all rely on reflection paths the trimmer cannot follow without extensive annotation. The Avalonia team has been improving this release on release, but in practice on GlyphDeck.UI I gave up on full trimming. Symptoms: missing styles at runtime, controls that render blank, data bindings that silently fail.
Partial trimming with a hand-written TrimmerRootDescriptor keeps the app alive but the size win shrinks fast once you start re-rooting whole assemblies. The honest answer for Avalonia today: ship un-trimmed self-contained, take the size hit, revisit when the framework's trim story matures. Ahead-of-Time (AOT) compilation is the same conversation, harder. Avoid Native AOT entirely if your project depends on Microsoft.Graph or Azure.Identity, both of which use reflection patterns that AOT cannot follow.
What is genuinely expensive in .NET
A reference list of operations that quietly cost more than they look like they cost. None of these are absolute prohibitions; they are calls you should make on purpose, not by reflex.
- Boxing. A struct stuffed into an object slot. ~20× a reference assignment.
- LINQ chains in tight loops. Closure allocations, delegate invocation per element, hidden iterator state machines.
- string.Split and string.Replace on large strings. Each allocates. Use Span<char>-based parsing where possible.
- IEnumerable<T> traversal where the underlying collection is List<T>. Interface dispatch, missed JIT optimizations. Iterate the concrete type when you have it.
- Dictionary<,> access with a string key when alternatives exist. FrozenDictionary for read-only sets. Alternate-lookup APIs for ReadOnlySpan<char> keys.
- Async state machines for trivial methods. A method that just awaits and returns has paid for a state machine to do nothing.
- Activator.CreateInstance and reflection-driven object construction. Slow, allocates, and breaks trimming.
- Closures that capture local variables. Heap allocation per invocation.
- params arrays. Allocation per call site. The new params Span<T> overloads in .NET 9+ remove this for many BCL methods.
- Exceptions on hot paths. Two to three orders of magnitude slower than returning a result. Throw on truly exceptional conditions, not on parse failure of user input (see the sketch after this list).
- Empty finalizers. A finalizer, even an empty one, opts your object into the finalization queue, costs an extra GC cycle, and triggers F-reachable queue work for nothing. In 2026 you almost never need a finalizer at all; SafeHandle and the surrounding ecosystem have eaten almost every legitimate use case.
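The exceptions rule in code, with an invented TryReadQuantity helper:

```csharp
using System;

// Malformed input costs a branch, not a stack unwind. The try/catch
// around int.Parse would be orders of magnitude slower on bad input.
static bool TryReadQuantity(string? input, out int quantity) =>
    int.TryParse(input, out quantity) && quantity > 0;

Console.WriteLine(TryReadQuantity("17", out var q) ? $"ok: {q}" : "rejected");
Console.WriteLine(TryReadQuantity("seventeen", out _) ? "ok" : "rejected");
```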
When not to optimize
The other half of the discipline. Most LOB code is I/O bound and the cost sits in Structured Query Language (SQL) queries, HTTP calls, and disk reads. Optimizing in-memory operations on the application server when the database is taking 200 ms per query is wasted effort. Knowing when to leave code alone is part of the skill.
For my government LOB work, almost none of these rules matter. The bottleneck is the network round-trip to the database. The right move is a better query, a covering index, a cache, not a Span<char> rewrite of the screen-rendering path.
For GlyphDeck on a Radxa Zero 3W, every rule matters every frame. The screen is being redrawn at 60 Hz from a CPU with no headroom.
For Whisper.NET pulling a twenty-hour audio file through Compute Unified Device Architecture (CUDA), performance matters in the Graphics Processing Unit (GPU) pipeline. The CLI argument parser can stay sloppy; nobody cares.
The rule of rules: profile first, optimize second. Knowing which path is hot is more valuable than knowing every optimization in this post.
Where to learn this properly
Three sources are worth your time.
Stephen Toub's annual "Performance Improvements in .NET X" series on the official .NET Blog. The .NET 10 edition is around 232 pages and reads like a short book on what the runtime team has done in the last release. It assumes you already know the rules above; it tells you which ones the runtime is now doing for you and which ones it still expects from your code.
Ben Watson's Writing High-Performance .NET Code (2nd ed., 2018) is the closest thing to a practitioner rule book. Each chapter follows the same shape: here is the cost, here is the rule, here is the measurement. It predates the modern Span and ValueTask vocabulary in places, but the GC, allocation, and concurrency rules are still the canonical statement of them. The free breder.org reading notes condense it for citation purposes.
Adam Sitnik's blog at adamsitnik.com is the practitioner middle ground between Toub's release notes and Watson's deep book. Scenario-driven, current, with worked examples and benchmark numbers. The articles on Span, value vs reference types, ref returns, and ArrayPool are direct support for the rules in this post.
The official Microsoft Learn ".NET Performance Tips" page exists, but it has not been meaningfully updated since 2017. It captures three rules (boxing, StringBuilder, finalizers) and zero of the modern Span / ArrayPool / source-generator era. That gap, between what the official docs cover and what a working developer actually needs in 2026, is part of why this post exists.
Closing
The discipline is not new. The C64 taught it forty-five years ago. The machine is bigger, the abstractions are taller, the names of the tools have changed, but the lesson is identical: the runtime is not free, the hardware is not unlimited, and the developer who keeps both of those facts in view ships smaller, faster, more predictable code than the one who does not.
The hardware curve is no longer going to make up the difference. We get to be frugal again. Not as a constraint. As a craft.