Danny Mundy

Shipping a 23K-Line LLVM Obfuscation Suite in 30 Days

February 2026


I was working on a game server project and didn't want people cheating. The obvious move is to throw VMProtect on your binaries and call it a day, but I was bored and got curious about how obfuscation actually works at the compiler level. So I looked at what existed in the open-source LLVM space. The original Obfuscator-LLVM targets LLVM 4. Hikari stopped at 8. Pluto gets you to 14, Polaris to 16. None of them have a VM pass. And LLVM's IR changes enough between major versions that porting isn't trivial: opaque pointers, funclet-based exception handling, comdat changes. Every release breaks something. I wanted something that worked with current LLVM, had a real VM virtualizer, and ran on Windows. Nothing like that existed, so I wrote Kilij—15 composable transforms and a bytecode virtual machine, about 23,000 lines of C++, built in about a month.

What it does

Kilij has 15 passes: control flow flattening, MBA (mixed boolean-arithmetic), opaque predicates, string encryption, constant substitution, indirect calls, IAT obfuscation, and more. Each one tries to be non-trivial to undo. Flattening dispatch goes through a Feistel network. Opaque predicates use four math families salted with rdtsc so every instance is unique. String decryption is lazy and thread-safe.
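
To give a feel for what one of those predicate families looks like, here's a toy version; the identity, constants, and the way the salt gets mixed in are illustrative, not the code the pass emits. The point is that the branch is always taken, the comparison looks data-dependent to a decompiler, and the salt (standing in for the rdtsc-derived value) makes each emitted instance look different.

    #include <cstdint>
    #include <cstdio>

    // Toy opaque predicate: n*(n+1) is always even, so the comparison below is
    // always true, even with 64-bit wraparound. The salt stands in for the
    // rdtsc-derived value mixed in when the predicate is emitted, so every
    // instance uses different constants without changing the truth value.
    static bool opaque_true(uint64_t x, uint64_t salt) {
        uint64_t y = x + salt;
        return ((y * (y + 1)) & 1) == 0;
    }

    int main(int argc, char**) {
        uint64_t runtime_value = static_cast<uint64_t>(argc);  // something an analyst can't constant-fold
        if (opaque_true(runtime_value, 0x9e3779b97f4a7c15ULL)) {
            std::puts("real code path");   // always executes
        } else {
            std::puts("dead path");        // where the pass would park junk code
        }
        return 0;
    }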

14,000 of those 23,000 lines are the VM. The idea: take a function's LLVM IR, translate it into custom bytecode, and replace the original function with an interpreter over that bytecode. There are three execution modes: full opcode virtualization, a basic block mode that dispatches at the VM level but keeps block internals native, and a region mode that only virtualizes cold paths. The register file is uniform i64, so pointers, floats, everything gets packed into 64-bit integers. On top of that there's affine encoding on register values, MBA encoding layered over the affine logic, and a Feistel network over the bytecode itself. Hardening mode adds an affine-encoded program counter, encoded handler table lookups, bogus trap handlers, and randomized opcode assignment per build.
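
To make the register encoding concrete, here's a stripped-down sketch with invented opcodes and constants. The real interpreter is generated per module from LLVM IR and layers the MBA and Feistel encodings on top; this is just the basic shape: a uniform i64 register file whose values sit affine-encoded at rest and only get decoded inside the handler that touches them.

    #include <cstdint>
    #include <cstdio>
    #include <array>
    #include <vector>

    static uint64_t inv_mod_2_64(uint64_t a) {
        // Newton-Hensel iteration: for odd a, each step doubles the correct low bits.
        uint64_t x = a;
        for (int i = 0; i < 5; ++i) x *= 2 - a * x;
        return x;
    }

    constexpr uint64_t A = 0x9e3779b97f4a7c15ULL;  // odd, so invertible mod 2^64
    constexpr uint64_t B = 0x5851f42d4c957f2dULL;

    static uint64_t enc(uint64_t v) { return A * v + B; }
    static uint64_t dec(uint64_t v, uint64_t a_inv) { return (v - B) * a_inv; }

    enum Op : uint8_t { LOADI, ADD, MUL, RET };
    struct Insn { Op op; uint8_t dst, s1, s2; uint64_t imm; };

    // Interprets a tiny program; registers never hold plaintext values at rest.
    static uint64_t run(const std::vector<Insn>& code) {
        const uint64_t a_inv = inv_mod_2_64(A);
        std::array<uint64_t, 8> reg{};          // uniform i64 register file
        for (size_t pc = 0; pc < code.size(); ++pc) {
            const Insn& i = code[pc];
            switch (i.op) {
            case LOADI: reg[i.dst] = enc(i.imm); break;
            case ADD:   reg[i.dst] = enc(dec(reg[i.s1], a_inv) + dec(reg[i.s2], a_inv)); break;
            case MUL:   reg[i.dst] = enc(dec(reg[i.s1], a_inv) * dec(reg[i.s2], a_inv)); break;
            case RET:   return dec(reg[i.dst], a_inv);
            }
        }
        return 0;
    }

    int main() {
        // r0 = 6; r1 = 7; r2 = r0 * r1 + r0  ->  48
        std::vector<Insn> prog = {
            {LOADI, 0, 0, 0, 6}, {LOADI, 1, 0, 0, 7},
            {MUL, 2, 0, 1, 0},   {ADD, 2, 2, 0, 0},   {RET, 2, 0, 0, 0},
        };
        std::printf("%llu\n", (unsigned long long)run(prog));  // prints 48
        return 0;
    }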

This isn't VMProtect. Some of the protection adds real complexity for a reverse engineer. Some of it is closer to security theater—it looks intimidating in a decompiler but a patient analyst with the right tools would get through it. The layering helps, but the individual passes aren't all battle-tested against serious attackers. It's a month-old project, not a decade-old commercial product.

How it got built

AI did a lot of the work on this project. I used ChatGPT Pro and Claude for design, Codex for implementation, and Claude Opus through Revenant for automated attack testing. I had friends in industry who saved me from expensive mistakes early on: one who understood OLLVM internals, another who helped me think through VM design. I think most people's "I used AI" stories either undersell it or oversell themselves, so I'll be specific about what I did.

The design came from everywhere, but the calls were mine. Running the VM pass first in the pipeline so later passes obfuscate the interpreter itself? That idea came out of a conversation, but I was the one who decided to commit to it and live with the complexity. The growth budget system that caps instruction explosion so compilation doesn't take ten minutes? That was me, because I was the one actually watching builds. The three execution modes came from a real question: which functions actually need heavy protection, and which ones just need to not be trivially readable?
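
The growth budget itself is nothing exotic. The sketch below is roughly its shape, with the class name, numbers, and policy invented for illustration; the real pass works on LLVM instruction counts rather than bare counters. A transform asks before it expands anything, and a refusal means it skips that site instead of blowing up compile times.

    #include <cstdio>
    #include <cstddef>

    class GrowthBudget {
        size_t original_;    // instruction count before obfuscation
        double max_factor_;  // e.g. at most 8x the original size
        size_t current_;
    public:
        GrowthBudget(size_t original, double max_factor)
            : original_(original), max_factor_(max_factor), current_(original) {}

        // A pass calls this with its estimated expansion; false means "skip this
        // transform here", which degrades protection instead of compile time.
        bool tryCharge(size_t added_insns) {
            if (current_ + added_insns > original_ * max_factor_) return false;
            current_ += added_insns;
            return true;
        }
        size_t used() const { return current_; }
    };

    int main() {
        GrowthBudget budget(/*original=*/200, /*max_factor=*/8.0);
        std::printf("MBA pass allowed: %d\n", budget.tryCharge(600));    // 800  <= 1600: yes
        std::printf("flattening allowed: %d\n", budget.tryCharge(700));  // 1500 <= 1600: yes
        std::printf("second MBA layer: %d\n", budget.tryCharge(600));    // 2100 >  1600: no
        return 0;
    }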

Here's how a typical day worked. I'd set a direction and hand implementation to Codex. Codex would build, compile, and run the stress test suite: 38 tests covering Fibonacci, 100-case switch dispatch, function pointers, nested loops, mixed arithmetic, and a growing pile of edge cases. If the tests passed, the pipeline compiled a known binary with full obfuscation, loaded it into IDA Pro, and handed it to Revenant, a reverse engineering agent I built that controls IDA over RPC using Claude Opus. Revenant would decompile functions, try to identify the VM dispatcher, look for patterns in the handler table, trace the bytecode decoding. It was the automated red team. Every few days I'd sit down and try to crack the binary myself in IDA. Sometimes I found things Revenant missed. Sometimes it found things I missed because it was more systematic.

When things broke, they broke in ways AI couldn't fix. Codex would get stuck on a bug for 20+ hours. I'm not exaggerating. I would go to sleep, wake up, and it would still be spinning, trying increasingly absurd approaches to the same problem. What worked was me stepping in: stop, make a minimal repro, add it to the test suite, trace step by step. That usually solved it within an hour. The suite grew from 10 to 38 tests over the month, each one there because something broke and I made sure it couldn't break the same way again. The worst bugs were miscompiles. A function returns the wrong value after obfuscation, and Claude suggests plausible fixes that are wrong because it doesn't understand nsw versus nuw (no-signed-wrap versus no-unsigned-wrap) flags on an add, or how opaque pointers changed GEP semantics in LLVM 17+. Those were always me reading LLVM source and tracing IR by hand.

What I'd change

The VM interpreter is generated per module. Ten source files means ten interpreters, which bloats binary size. A shared interpreter with per-module bytecode tables would work better for real projects.

The test suite grew from bugs I hit, which means it only covers failure modes I've already seen. Fuzzing the obfuscator with random LLVM IR would catch edge cases faster than waiting to stumble into them.

Some of the passes need evaluation against real reverse engineers, not just Revenant. I'd want to put obfuscated binaries in front of people who do this for a living and see what they actually struggle with versus what they cut through in minutes.

And the Codex workflow needs guardrails. Letting it spin for 20 hours is a waste. Something that detects "this agent has been trying the same fix for two hours" and forces a pause would have saved me a lot of babysitting.

What I learned

The gap between "this compiles" and "this compiles correctly" is massive in compiler work. A pass that silently miscompiles one edge case in a thousand is worse than one that crashes. I spent more time on correctness testing than on any individual feature.

Feistel networks are underrated. Simple, invertible, parameterizable. I used them in three places (flattening dispatch, VM bytecode encoding, VM register encoding) and they worked in all of them.
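
In case that sounds abstract, here's the kind of construction I mean; the round function and keys below are throwaway, not what ships. Four keyed rounds over two 32-bit halves, and running the same rounds with the key order reversed undoes it exactly. The round function itself never has to be invertible, which is what makes the whole thing so easy to parameterize.

    #include <cstdint>
    #include <cstdio>
    #include <array>

    static uint32_t round_fn(uint32_t half, uint32_t key) {
        // Any deterministic mixing works; it never has to be inverted itself.
        uint32_t x = half ^ key;
        x *= 0x9e3779b1u;
        return x ^ (x >> 15);
    }

    static uint64_t feistel(uint64_t v, const std::array<uint32_t, 4>& keys, bool decrypt) {
        uint32_t L = static_cast<uint32_t>(v >> 32);
        uint32_t R = static_cast<uint32_t>(v);
        for (int i = 0; i < 4; ++i) {
            uint32_t k = decrypt ? keys[3 - i] : keys[i];  // reverse key order to invert
            uint32_t newL = R;
            uint32_t newR = L ^ round_fn(R, k);
            L = newL;
            R = newR;
        }
        // Undo the final swap so decryption with reversed keys is an exact inverse.
        return (static_cast<uint64_t>(R) << 32) | L;
    }

    int main() {
        std::array<uint32_t, 4> keys = {0xdeadbeef, 0x1337c0de, 0xcafebabe, 0x8badf00d};
        uint64_t state = 0x0123456789abcdefULL;  // e.g. a flattening dispatch value
        uint64_t enc = feistel(state, keys, /*decrypt=*/false);
        uint64_t dec = feistel(enc, keys, /*decrypt=*/true);
        std::printf("enc=%016llx roundtrip_ok=%d\n",
                    (unsigned long long)enc, dec == state);
        return 0;
    }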

The bigger takeaway is about how to use AI on a project like this. I didn't write most of the code. I didn't even come up with all of the design. What I did was set up a process where AI built things, AI attacked those things, and I was the person who watched it all and knew when to change direction. Most of the time that meant letting it run. Sometimes it meant telling Codex to stop and start over. That's a different kind of work than what I expected going in, and I think it's closer to what a lot of engineering is going to look like soon.

Source is at github.com/dannyisbad/kilij.