Danny Mundy

Shipping a 25K-Line LLVM Obfuscation Suite in 30 Days

February 2026


I was working on a game server project and didn't want people cheating. The obvious move is to throw VMProtect on your binaries and call it a day, but I was bored and got curious about how obfuscation actually works at the compiler level. So I looked at what existed in the open-source LLVM space. The original Obfuscator-LLVM targets LLVM 4. Hikari stopped at 8. Pluto gets you to 14, Polaris to 16. None of them have a VM pass. And LLVM's IR changes enough between major versions that porting isn't trivial: opaque pointers, funclet-based exception handling, comdat changes. Every release breaks something. I wanted something that worked with current LLVM, had a real VM virtualizer, and ran on Windows. Nothing like that existed, so I wrote Kilij: 12 composable transforms and a bytecode virtual machine, about 25,000 lines of C++, built in about a month.

Opaque predicates solvers can't solve within a reasonable analysis budget

Opaque predicates are branch conditions that always evaluate one way but look like real branches to a reverse engineer. The classic example is x*(x+1) % 2 == 0. Always true. The problem is that tools already know this. IDA plugins like D-810 maintain databases of known predicate patterns. Z3 (a solver that can prove whether an expression is always true or always false) can simplify most arithmetic identities in under a second. If your predicates are in the database or solvable by the solver, they buy you nothing.

Kilij has seven predicate families. Four are standard: Collatz sequences (which resist simplification because solvers must unroll loop iterations rather than apply algebraic identities), Fermat's little theorem via modular exponentiation, difference-of-squares, and XOR equivalence. These resist casual inspection but a dedicated analyst with Z3 will clear them. The question I kept coming back to was: what's always true but still doesn't simplify under an SMT solver within a reasonable analysis budget?

The q-binomial theorem. For a fixed n of 6, the product (1 + t)(1 + t·q)···(1 + t·q^(n-1)) equals a sum built from Gaussian binomial coefficients: the sum over k of q^(k(k-1)/2) · [n choose k]_q · t^k. Two different expansions of the same polynomial, subtracted, always returning zero. In the implementation, t and q are derived from a runtime seed through per-module constants (FNV-1a of the module seed, variant tag, and splitmix64 expansion). The output is a long chain of multiplications and additions that always nets to zero.

The product form and the sum form compute the same value through completely different algebraic paths. There's no term-by-term cancellation for Z3's simplifier to exploit. In practice, proving the difference is zero seems to force a much harder proof search over 64-bit modular arithmetic. I ran this against Z3 4.15 with default tactics and a 300-second timeout per query; at n=6, 8-bit is UNSAT (proved), while 16/32/64-bit all time out at 300s. The identity doesn't appear in any existing pattern database either.
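The two sides can be sketched in plain C++ over wraparound uint64_t arithmetic, which models i64 with the usual mod-2^64 semantics. This is an illustration of the identity itself, not Kilij's emitted IR, and `qbinom`/`q_binomial_residual` are names I made up:

```cpp
#include <cstdint>

// Gaussian binomial [n choose k]_q via the Pascal-style recurrence
// [i,j]_q = [i-1,j-1]_q + q^j * [i-1,j]_q, entirely in wraparound
// (mod 2^64) arithmetic -- no division, so it is valid over Z/2^64.
static uint64_t qbinom(int n, int k, uint64_t q) {
    uint64_t c[8][8] = {};                       // enough for n <= 7
    for (int i = 0; i <= n; ++i) {
        c[i][0] = 1;
        uint64_t qj = 1;
        for (int j = 1; j <= i; ++j) {
            qj *= q;                             // qj = q^j
            c[i][j] = (j < i) ? c[i - 1][j - 1] + qj * c[i - 1][j]
                              : c[i - 1][j - 1]; // [i,i]_q = 1
        }
    }
    return c[n][k];
}

// Product side minus sum side of the q-binomial theorem:
//   prod_{r=0}^{n-1} (1 + t*q^r) == sum_{k=0}^{n} q^(k(k-1)/2)*[n,k]_q*t^k
// The residual is 0 for every t and q, but the two sides share no
// term-by-term cancellation a simplifier can exploit.
uint64_t q_binomial_residual(uint64_t t, uint64_t q, int n) {
    uint64_t prod = 1, qr = 1;
    for (int r = 0; r < n; ++r) { prod *= 1 + t * qr; qr *= q; }

    uint64_t sum = 0, tk = 1;
    for (int k = 0; k <= n; ++k) {
        uint64_t qe = 1;
        for (int e = 0; e < k * (k - 1) / 2; ++e) qe *= q; // q^(k(k-1)/2)
        sum += qe * qbinom(n, k, q) * tk;
        tk *= t;
    }
    return prod - sum;                           // always nets to zero
}
```

The recurrence form of the Gaussian binomial matters: the textbook quotient formula involves division, which doesn't exist mod 2^64, while the Pascal-style recurrence stays inside ring operations.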

A second family uses Grassmann-Plucker relations on the 2x2 minors of a 2x4 matrix. The relation d12*d34 - d13*d24 + d14*d23 == 0 holds for all values over any commutative ring. This one looks intimidating in a disassembler and doesn't appear in pattern databases like D-810. But I ran it against Z3 and it solves in under a millisecond at any bitvector width. The terms cancel algebraically, so the solver doesn't even need to reason about the identity. Grassmann-Plucker buys you resistance to pattern matching but not to SMT.
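The whole relation fits in a few lines. This is an illustrative helper with names of my choosing, not Kilij code; `d(i,j)` is the 2x2 minor taken from columns i and j of the two rows:

```cpp
#include <cstdint>

// Grassmann-Plucker relation on the 2x2 minors of a 2x4 matrix with
// rows a[] and b[]. The alternating sum d12*d34 - d13*d24 + d14*d23
// vanishes over any commutative ring, including Z/2^64 wraparound
// arithmetic, for all eight inputs.
uint64_t pluecker_residual(const uint64_t a[4], const uint64_t b[4]) {
    auto d = [&](int i, int j) { return a[i] * b[j] - a[j] * b[i]; };
    return d(0, 1) * d(2, 3) - d(0, 2) * d(1, 3) + d(0, 3) * d(1, 2);
}
```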

A third family computes both residuals and mixes them as a*r1 + b*r2 + c*(r1^r2) with per-module constants. An analyst who clears the Grassmann-Plucker component still faces the q-series, which Z3 doesn't simplify within a reasonable analysis time budget.

Every instance gets unique constants. Per-site salt is FNV-1a of the module seed, function name, instruction index, and source debug location. Runtime entropy comes from hardware cycle counters (rdtsc on x86, CNTVCT_EL0 on ARM), with stack-address ASLR as a fallback. Two instances of the same predicate family in the same binary share the same algebraic structure but have completely different constants, so pattern matching on literal values won't link them.
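A sketch of that derivation chain, using the standard FNV-1a and splitmix64 constants. The exact fields Kilij hashes and the way they're combined are assumptions here; the point is hash-the-context, then expand:

```cpp
#include <cstdint>
#include <string_view>

// 64-bit FNV-1a over a byte string, seedable via the initial hash value.
static uint64_t fnv1a(std::string_view s,
                      uint64_t h = 0xcbf29ce484222325ULL) {
    for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
    return h;
}

// splitmix64: each call advances the state and returns one mixed word.
static uint64_t splitmix64(uint64_t& s) {
    uint64_t z = (s += 0x9e3779b97f4a7c15ULL);
    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
    z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
    return z ^ (z >> 31);
}

// Derive per-site constants: hash the identifying context, then expand
// with splitmix64. Field names are illustrative, not Kilij's exact inputs.
void site_constants(uint64_t module_seed, std::string_view fn_name,
                    unsigned inst_index, uint64_t out[4]) {
    uint64_t h = fnv1a(fn_name, module_seed ^ 0xcbf29ce484222325ULL);
    uint64_t state = h ^ (uint64_t)inst_index * 0x100000001b3ULL;
    for (int i = 0; i < 4; ++i) out[i] = splitmix64(state);
}
```

The property this buys is the one the text describes: same structure, different literals. Two call sites that differ only in instruction index get unrelated-looking constant sets.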

The honest weakness is dynamic analysis. Run any of these predicates on a thousand random inputs and observe that they always return the same value. A concolic executor or dynamic taint tracker will flag them. Opaque predicates as a technique are fundamentally vulnerable to this. The algebraic novelty only buys you resistance to static symbolic analysis.

The virtual machine

About 15,000 of the 25,000 lines are the VM. It translates a function's LLVM IR into custom bytecode and replaces the original function with an interpreter loop. The interpreter reads instructions, dispatches to handlers, and operates on a uniform register file where every value is an i64. Pointers get packed via ptrtoint, floats via bitcast to i32 then zero-extend. The analyst sees a flat array of 64-bit integers with no type information.
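The value packing can be sketched as follows, assuming IEEE-754 floats; these helpers are illustrative, not the interpreter's actual code:

```cpp
#include <cstdint>
#include <cstring>

// Pack every VM value into a uniform u64 slot: pointers via the
// equivalent of ptrtoint, floats via bitcast to i32 then zero-extend.
static uint64_t pack_ptr(const void* p) { return (uint64_t)(uintptr_t)p; }

static uint64_t pack_f32(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);    // bitcast float -> i32
    return (uint64_t)bits;                  // zero-extend to i64
}

static float unpack_f32(uint64_t slot) {
    uint32_t bits = (uint32_t)slot;         // truncate back to i32
    float f;
    std::memcpy(&f, &bits, sizeof f);       // bitcast i32 -> float
    return f;
}
```

After this flattening, a float 1.5 and the integer 0x3FC00000 are indistinguishable in the register file, which is exactly the point.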

There are three execution modes. Opcode mode virtualizes everything: every IR instruction becomes bytecode. Basic-block mode dispatches at the VM level but keeps block internals as native code, so hot loops still run at full speed. Region mode only virtualizes cold paths (selected by profile data or annotation), leaving performance-sensitive code untouched. Opcode maximizes obfuscation depth. Region minimizes runtime cost. BB mode sits in between.

On top of the base interpreter, there are encoding layers. Affine encoding transforms register values with per-register a*x + b pairs before storing and inverts on load, so encoded registers are opaque at rest. MBA encoding replaces the affine arithmetic with mixed boolean-arithmetic expressions that resist algebraic simplification. Feistel encoding layers a multi-round network on top of affine, so the register keys can't be recovered with a single linear solve. The bytecode stream gets separate hardening: per-instruction keys derived through a MurmurHash3 mixer, randomized opcode tables, and shuffled field layouts. Stripping one doesn't get you to plaintext.
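The affine layer looks roughly like this. The modular inverse via Newton's iteration is a standard trick for odd multipliers mod 2^64; function names are mine, and the real pass emits this as IR rather than calling helpers:

```cpp
#include <cstdint>

// Per-register affine encoding: store a*x + b, load (y - b) * a^-1.
// The multiplier must be odd so it is invertible mod 2^64.
static uint64_t inv_mod64(uint64_t a) {          // a must be odd
    uint64_t x = a;                              // 3 correct low bits
    for (int i = 0; i < 5; ++i) x *= 2 - a * x;  // doubles correct bits
    return x;                                    // 3->6->12->24->48->96
}

static uint64_t affine_store(uint64_t x, uint64_t a, uint64_t b) {
    return a * x + b;
}

static uint64_t affine_load(uint64_t y, uint64_t a, uint64_t b) {
    return (y - b) * inv_mod64(a);
}
```

Affine alone is weak on its own terms, since one known plaintext/ciphertext pair plus one more gives a linear solve for a and b; that's why the MBA and Feistel layers exist above it.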

Hard mode turns on the strongest option for each axis with one flag: indirect dispatch (function-pointer table instead of switch), randomized opcode assignment per build, bogus trap handlers mixed into the dispatch table, and an affine-encoded program counter. The encoded PC means a reverse engineer can't just read the counter to know which instruction is executing.
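A toy version of hard-mode dispatch, with a hypothetical 3-opcode table; the real interpreter's handlers, table size, and encoding details differ, but the shape is the same: the plaintext pc exists only transiently inside the loop body:

```cpp
#include <cstdint>

// Newton inverse for the odd affine multiplier (see affine encoding).
static uint64_t inv_mod64(uint64_t a) {
    uint64_t x = a;
    for (int i = 0; i < 5; ++i) x *= 2 - a * x;
    return x;
}

struct Vm {
    uint64_t enc_pc, a, b;      // affine-encoded program counter + keys
    const uint8_t* code;
    uint64_t acc = 0;
    bool halt = false;
};

using Handler = void (*)(Vm&);
static void op_inc (Vm& v) { v.acc += 1; }
static void op_dbl (Vm& v) { v.acc *= 2; }
static void op_halt(Vm& v) { v.halt = true; }
static const Handler table[3] = { op_inc, op_dbl, op_halt };

uint64_t run(Vm& v) {
    uint64_t a_inv = inv_mod64(v.a);
    while (!v.halt) {
        uint64_t pc = (v.enc_pc - v.b) * a_inv;  // decode the counter
        uint8_t op = v.code[pc];
        v.enc_pc = v.a * (pc + 1) + v.b;         // advance, re-encode
        table[op](v);                            // indirect dispatch
    }
    return v.acc;
}
```

A debugger watching `enc_pc` sees values that jump around the full 64-bit range even though execution is strictly sequential.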

The growth budget system prevents compilation from exploding. MBA encoding can triple the VM-IR instruction count on complex functions. vm-mba-max-growth defaults to 200%, so if MBA would more than triple the instructions, the pass falls back to affine-only encoding. vm-encode-max-growth caps total encoding growth at 300%. These exist because I was compiling real binaries and watching some functions take minutes without the caps.
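The percentage semantics are worth pinning down: 200% growth means the pass may add up to twice the original count, i.e. up to 3x total. A minimal check (hypothetical helper, mirroring the documented defaults rather than Kilij's internals):

```cpp
#include <cstddef>

// True if growing from `before` to `after` instructions stays within
// the budget: maxGrowthPct = 200 permits (after - before) <= 2*before,
// i.e. up to 3x the original count.
bool withinGrowthBudget(size_t before, size_t after, unsigned maxGrowthPct) {
    if (after <= before) return true;            // shrinking is always fine
    return (after - before) * 100 <= before * (size_t)maxGrowthPct;
}
```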

The other 9,000 lines

Control flow flattening replaces a function's branch structure with a switch-based dispatch loop. Standard approach, originally from OLLVM. The difference is that Kilij's dispatch variable goes through a 4-round Feistel network with random keys, and the case IDs are Fisher-Yates shuffled, so the encoded values don't correlate with basic block order. The Feistel decoder is emitted as inline IR at the dispatch point. An analyst trying to map switch cases back to original blocks has to recover the keys first.
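A 32/32-split Feistel with this shape looks like the sketch below. The round function and keys are illustrative; the structural point is that the round function never needs to be invertible, because the network itself supplies invertibility:

```cpp
#include <cstdint>

// Round function: any deterministic mix of half-block and key will do.
static uint32_t feistel_round(uint32_t r, uint32_t k) {
    r += k;
    r ^= r >> 13;
    r *= 0x9E3779B1u;
    return r ^ (r >> 16);
}

// 4-round Feistel over a 64-bit dispatch value, split into 32-bit halves.
uint64_t feistel_encode(uint64_t x, const uint32_t k[4]) {
    uint32_t L = (uint32_t)(x >> 32), R = (uint32_t)x;
    for (int i = 0; i < 4; ++i) {
        uint32_t t = R;
        R = L ^ feistel_round(R, k[i]);
        L = t;
    }
    return ((uint64_t)L << 32) | R;
}

// Decoding runs the rounds in reverse key order; each step undoes one
// swap-and-xor exactly.
uint64_t feistel_decode(uint64_t y, const uint32_t k[4]) {
    uint32_t L = (uint32_t)(y >> 32), R = (uint32_t)y;
    for (int i = 3; i >= 0; --i) {
        uint32_t t = L;
        L = R ^ feistel_round(L, k[i]);
        R = t;
    }
    return ((uint64_t)L << 32) | R;
}
```

With per-build random keys, the same source function flattens to different case-ID sequences on every compile.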

String obfuscation uses XOR with an xorshift64 keystream. It's not cryptographic -- xorshift64 is trivially invertible from known plaintext -- but lazy decode (first access triggers it) keeps the plaintext window small, which is the actual protection. Decoding is thread-safe via compare-and-swap on a flag word: 0 means undecoded, 1 means another thread is decoding, 2 means done. There's an FNV-1a integrity hash per string that re-decodes from the ciphertext on mismatch, which catches both corruption and partial decode races.
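A sketch of the lazy-decode state machine, with the integrity hash omitted; struct layout and names are illustrative, not Kilij's runtime:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// xorshift64 keystream: invertible, not cryptographic. Seed must be
// nonzero, since zero is a fixed point of the generator.
static uint64_t xorshift64(uint64_t& s) {
    s ^= s << 13; s ^= s >> 7; s ^= s << 17;
    return s;
}

// XOR the buffer against the keystream; applying it twice with the
// same seed restores the original bytes.
static void xor_keystream(unsigned char* buf, size_t n, uint64_t seed) {
    uint64_t k = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i % 8 == 0) k = xorshift64(seed);     // fresh 8-byte block
        buf[i] ^= (unsigned char)(k >> (8 * (i % 8)));
    }
}

// Lazy thread-safe decode: 0 = undecoded, 1 = in progress, 2 = done.
struct LazyString {
    std::atomic<uint32_t> state{0};
    uint64_t seed = 0;
    unsigned char* data = nullptr;
    size_t len = 0;

    const unsigned char* get() {
        uint32_t expected = 0;
        if (state.compare_exchange_strong(expected, 1)) {
            xor_keystream(data, len, seed);       // we won the race
            state.store(2, std::memory_order_release);
        } else {
            while (state.load(std::memory_order_acquire) != 2) {} // spin
        }
        return data;
    }
};
```

The CAS guarantees exactly one thread performs the XOR; everyone else spins until state 2, so no reader ever sees a half-decoded string.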

MBA obfuscation replaces arithmetic operations (add, sub, xor, and, or) with algebraic equivalents. Multiple rewrite variants per operation, chosen randomly. The pass skips instructions with nsw/nuw/exact flags to avoid miscompiles, which matters because LLVM's optimizer uses those flags for assumptions about undefined behavior.
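One standard variant per operator, to make the rewrites concrete (Kilij picks randomly among several per operation; these particular identities are the classic ones from the MBA literature, not necessarily the exact set it ships):

```cpp
#include <cstdint>

// Mixed boolean-arithmetic rewrites. Each is a true identity on
// two's-complement integers, so the rewrite preserves semantics
// for every input, including overflow cases.
uint64_t mba_add(uint64_t a, uint64_t b) { return (a ^ b) + 2 * (a & b); }
uint64_t mba_sub(uint64_t a, uint64_t b) { return (a ^ b) - 2 * (~a & b); }
uint64_t mba_xor(uint64_t a, uint64_t b) { return (a | b) - (a & b); }
uint64_t mba_and(uint64_t a, uint64_t b) { return (a | b) - (a ^ b); }
uint64_t mba_or (uint64_t a, uint64_t b) { return (a & b) + (a ^ b); }
```

Chained a few layers deep (the output of one rewrite feeding another), these produce expressions that are semantically trivial but syntactically opaque.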

How it got built

AI did a lot of the work on this project. I used ChatGPT Pro and Claude for design, Codex for implementation, and Claude Opus through Revenant for automated attack testing. I had friends in industry who saved me from expensive mistakes early, one who understood OLLVM internals, another who helped me think through VM design. I think most people's "I used AI" stories either undersell it or oversell themselves, so I'll be specific about what I did.

The design came from everywhere, but the calls were mine. Running the VM pass first in the pipeline so later passes obfuscate the interpreter itself? That idea came out of a conversation, but I decided to commit to it and live with the complexity. The growth budget system that caps instruction explosion so compilation doesn't take ten minutes? That was me, because I was the one actually watching builds. The three execution modes came from a real question: which functions actually need heavy protection, and which ones just need to resist casual reading?

Here's how a typical day worked. I'd set a direction and hand implementation to Codex. Codex would build the plugin and run the test suite. The suite has two parts: unit tests that check things I've broken before (annotation handling, musttail calls, callbr, MSVC exception handling, a smoke test for each pass, i686 stdcall, UEFI), and a fuzzer that throws random LLVM IR from llvm-stress at every pass config and checks if anything crashes or trips the verifier. Kilij runs the LLVM verifier after every transform by default, so most bugs surface as hard failures, not silent miscompiles. If the tests passed, the pipeline compiled a known binary with full obfuscation, loaded it into IDA Pro, and handed it to Revenant, a reverse engineering agent I built that controls IDA over RPC using Claude Opus. Revenant would decompile functions, try to identify the VM dispatcher, look for patterns in the handler table, trace the bytecode decoding. It was the automated red team. Every few days I'd sit down and try to crack the binary myself in IDA. Sometimes I found things Revenant missed. Sometimes it found things I missed because it was more systematic.

When things broke, they broke in ways AI couldn't fix. Codex would get stuck on a bug for 20+ hours. I'm not exaggerating. I would go to sleep, wake up, and it would still be spinning, trying increasingly absurd approaches to the same problem. What worked was me stepping in: stop, make a minimal repro, add it to the test suite, trace step by step. That usually solved it within an hour. Most of the unit tests are there because something broke and I turned the fix into a regression test. BCF was breaking musttail adjacency, flattening was corrupting asm goto, funclets and obfuscation were fighting each other. The fuzzer came later, when I realized I was only catching failures I'd already seen. Throwing random IR at every pass config catches the ones I haven't. The worst bugs were miscompiles. A function returns the wrong value after obfuscation, and Claude suggests plausible fixes that are wrong because it doesn't understand nsw versus nuw flags on an add, or how opaque pointers changed GEP semantics in LLVM 17+. Those were always me reading LLVM source and tracing IR by hand.

Tradeoffs I'd revisit

The VM interpreter is generated per-module. Ten source files means ten interpreters, which bloats binary size. A shared interpreter with per-module bytecode tables would work better for real projects.

The predicate family weights are tuned by hand. The Grassmann-Plucker and q-series predicates are heavier at runtime than the arithmetic identities, but I don't have data on how much analysis time each buys. Ideally the weight distribution would be learned by profiling protected binaries against real deobfuscation tools to maximize time-to-break per cycle spent.

The fuzzer helps, but it only catches crashes and verifier failures. It doesn't catch silent miscompiles where the output runs but returns the wrong value. Fuzzing with execution comparison (compile with and without obfuscation, run both, diff the output) would catch more.

And the Codex workflow needs guardrails. Letting it spin for 20 hours is a waste. Something that detects "this agent has been trying the same fix for two hours" and forces a pause would have saved me a lot of babysitting.

What I learned

The gap between "this compiles" and "this compiles correctly" is massive in compiler work. A pass that silently miscompiles one edge case in a thousand is worse than one that crashes. I spent more time on correctness testing than on any individual feature.

Feistel networks are underrated. Simple and invertible, with per-use parameterization that makes each instance unique. I used them in two places (flattening dispatch and VM register encoding) and they worked in both.

I didn't write most of the code. I didn't even come up with all of the design. What I did was set up a process where AI built things, AI attacked those things, and I was the person who watched it all and knew when to change direction. Most of the time that meant letting it run. Sometimes it meant telling Codex to stop and start over.

Obfuscation is an arms race against specific tools. Opaque predicates that look complex in a hex editor but fold in Z3 aren't doing anything. The question for each transform is: which analysis tool does this break, and how does an analyst work around it? If you can't answer both halves, you're guessing.

If you want to talk about this, email me at [email protected]. Source is on GitHub. If you'd rather text, Phone.