Designing Protocols for Servers You Don't Trust
I built RankedServer as the backend for a small community game that needed competitive matchmaking. One process runs an HTTP server for auth and profiles, plus two TCP channels -- one for server-to-server communication (S2S) and one for player matchmaking (RDV, for rendezvous). The TCP channels handle everything that needs persistent connections and server-initiated messages.
The design decision that shaped everything else was the threat model. Game servers run on machines you don't fully control. Players poke at them constantly. It only takes one exploit for an attacker to own a game server with a valid, authenticated connection to your master. When that happens, what's the blast radius?
The wire format
S2S and RDV use the same binary framing. Every message has a 57-byte header:
pub struct MessageHeader {
    pub payload_len: u32,   // 4 bytes
    pub hmac: [u8; 32],     // 32 bytes - HMAC-SHA256
    pub nonce: [u8; 12],    // 12 bytes - AES-GCM nonce
    pub msg_type: u8,       // 1 byte
    pub timestamp_ms: u64,  // 8 bytes
}
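The header packs into a fixed 57-byte buffer with explicit offsets. Here's a minimal sketch of that packing, repeating the struct so it's self-contained; the field order matches the struct above, but little-endian integer encoding is my assumption, not something the post specifies:

```rust
pub struct MessageHeader {
    pub payload_len: u32,
    pub hmac: [u8; 32],
    pub nonce: [u8; 12],
    pub msg_type: u8,
    pub timestamp_ms: u64,
}

impl MessageHeader {
    pub const LEN: usize = 57; // 4 + 32 + 12 + 1 + 8

    // Pack the header into its fixed 57-byte wire form.
    // Endianness is assumed little-endian for illustration.
    pub fn to_bytes(&self) -> [u8; Self::LEN] {
        let mut buf = [0u8; Self::LEN];
        buf[0..4].copy_from_slice(&self.payload_len.to_le_bytes());
        buf[4..36].copy_from_slice(&self.hmac);
        buf[36..48].copy_from_slice(&self.nonce);
        buf[48] = self.msg_type;
        buf[49..57].copy_from_slice(&self.timestamp_ms.to_le_bytes());
        buf
    }
}
```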
Payloads are encrypted with AES-256-GCM. The non-HMAC header fields are passed as additional authenticated data, so the GCM tag binds the header to the ciphertext. A separate HMAC-SHA256 covers those same header fields plus the ciphertext. That's redundant with the GCM tag, but it means rejection doesn't depend on a single primitive. Both use the same session key, which I'd fix with HKDF-derived subkeys if I were doing this again.
Each connection tracks the last 512 nonces in a rolling window and rejects duplicates. Timestamps can't drift more than 120 seconds from the server's clock. 120 seconds is generous because clock sync across machines you don't control is unreliable, but tight enough that a replayed message from five minutes ago gets rejected.
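Both checks can be sketched as a small per-connection guard. The 512-entry window and 120-second bound come from the text; the data structures and FIFO eviction order are my assumptions:

```rust
use std::collections::{HashSet, VecDeque};

const WINDOW: usize = 512;          // nonces remembered per connection
const MAX_DRIFT_MS: u64 = 120_000;  // 120 seconds of allowed clock drift

struct ReplayGuard {
    seen: HashSet<[u8; 12]>,   // fast duplicate lookup
    order: VecDeque<[u8; 12]>, // FIFO eviction when the window rolls
}

impl ReplayGuard {
    fn new() -> Self {
        Self { seen: HashSet::new(), order: VecDeque::new() }
    }

    // Accept a message only if its timestamp is within the drift bound
    // and its nonce hasn't appeared in the last WINDOW messages.
    fn check(&mut self, nonce: [u8; 12], timestamp_ms: u64, now_ms: u64) -> bool {
        if now_ms.abs_diff(timestamp_ms) > MAX_DRIFT_MS {
            return false; // stale or future-dated message
        }
        if !self.seen.insert(nonce) {
            return false; // duplicate nonce: replay
        }
        self.order.push_back(nonce);
        if self.order.len() > WINDOW {
            // Roll the window: forget the oldest nonce.
            if let Some(old) = self.order.pop_front() {
                self.seen.remove(&old);
            }
        }
        true
    }
}
```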
What a compromised server can do
A compromised game server has a valid certificate and a live encrypted connection to the master. It can send any message the protocol allows -- heartbeats, match results, log entries -- including fabricated outcomes, fake player stats, or claims about matches it isn't hosting. That's the realistic threat, and the protocol is designed to contain it.
Every game server gets a certificate signed by the master's Ed25519 signing key, with a unique UUID. The cert embeds a 32-byte secret, and the master stores an encrypted copy. During auth, the server proves it has the secret by HMAC-signing a challenge. The secret itself never goes over the wire. An attacker sniffing the connection sees the HMAC proof but can't extract the secret from it, so they can't forge an auth handshake for a different server.
The cert secret isn't used directly as an HMAC key either. There's an embedded pepper compiled into both binaries. The auth key is HMAC(pepper, cert_secret), so even if someone extracts the raw secret from a .cert file, they need the pepper to complete the handshake. The pepper lives in the binary, so it's not a secret against anyone who can reverse the executable. But it means stealing a .cert file from disk isn't enough to forge a handshake.
Once authenticated, the master doesn't trust anything the game server claims about its own identity. If a heartbeat arrives with a username that doesn't match the authenticated certificate owner, the master overrides it and logs the mismatch. A compromised server can't impersonate another server's identity through the heartbeat path. All match stats from game servers are clamped to reasonable ranges (kills 0-100, deaths 0-100, rounds 0-30, and similar bounds on other fields). A compromised server can still lie within those ranges, but it can't inject absurd values that blow up the rating math.
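The clamping itself is plain range enforcement before the numbers reach the rating engine. A sketch using the bounds quoted above; the struct and field names are illustrative, not the real report format:

```rust
// Hypothetical match-report shape; field names are mine.
struct MatchStats {
    kills: i64,
    deaths: i64,
    rounds: i64,
}

// Clamp everything a game server reports into sane bounds
// so fabricated values can't blow up the rating math.
fn sanitize(s: MatchStats) -> MatchStats {
    MatchStats {
        kills: s.kills.clamp(0, 100),
        deaths: s.deaths.clamp(0, 100),
        rounds: s.rounds.clamp(0, 30),
    }
}
```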
When a server gets compromised, there are three ways to kill it. Revoke its certificate UUID in the database and its next heartbeat gets rejected. Let the cert expire and it dies on schedule. Or rotate the master's Ed25519 signing key and every certificate in the system dies at once. Single server, time-based, or nuclear. There's a CLI tool for issuing, revoking, and inspecting certificates so you're not doing revocations by hand during an incident.
Rate limiting fires before any crypto work. S2S allows 30 auth attempts per IP per minute, RDV allows 60, the admin interface allows 30. If someone is brute-forcing auth, they burn through their rate limit before the server spends a single cycle on HMAC verification or database lookups.
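A fixed-window per-IP counter is enough to sketch the idea. The limits are from the text; the fixed-window design (as opposed to, say, a token bucket) is my assumption:

```rust
use std::collections::HashMap;
use std::net::IpAddr;

struct RateLimiter {
    limit: u32,                          // attempts allowed per window
    window_ms: u64,                      // window length (one minute)
    hits: HashMap<IpAddr, (u64, u32)>,   // ip -> (window start, count)
}

impl RateLimiter {
    fn new(limit: u32) -> Self {
        Self { limit, window_ms: 60_000, hits: HashMap::new() }
    }

    // Called before any HMAC verification or database lookup,
    // so brute-forcers never cost the server crypto work.
    fn allow(&mut self, ip: IpAddr, now_ms: u64) -> bool {
        let entry = self.hits.entry(ip).or_insert((now_ms, 0));
        if now_ms - entry.0 >= self.window_ms {
            *entry = (now_ms, 0); // new window, reset the count
        }
        entry.1 += 1;
        entry.1 <= self.limit
    }
}
```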
Session key isolation
Even with valid certs, one compromised session shouldn't give you the keys to another. The session key for each connection is derived fresh:
use hmac::{Hmac, Mac};
use sha2::Sha256;

type HmacSha256 = Hmac<Sha256>;

pub fn derive_session_key(
    auth_key: &[u8; 32],
    client_challenge: &[u8; 32],
    server_challenge_resp: &[u8; 32],
) -> [u8; 32] {
    // Session key = HMAC(auth_key, client_challenge || challenge_response)
    let mut mac = HmacSha256::new_from_slice(auth_key)
        .expect("HMAC key");
    mac.update(client_challenge);
    mac.update(server_challenge_resp);
    mac.finalize().into_bytes().into()
}
The client sends a random 32-byte challenge. The server computes a challenge response (HMAC(auth_key, client_challenge)) and sends it back. The session key is then HMAC(auth_key, client_challenge || challenge_response). The client's random challenge means each connection gets a different session key, even if the same certificate reconnects immediately. Compromising the underlying auth key is a different story (see What I'd change).
The challenge-response also provides mutual authentication. The client can verify the server knows the auth key by checking the challenge response before proceeding. This is separate from the Ed25519 signature on the cert file itself, which proves the cert was issued by the real master.
S2S and RDV both use this derivation for the final session key, but they derive their auth keys from different roots. S2S derives the auth key from the cert secret through the embedded pepper. RDV derives it from a player's login ticket: the client holds a random ticket, the server stores only its SHA-256 hash (the raw ticket is never persisted), and the hash is used directly as the HMAC key. No pepper. Same session key derivation, different trust roots.
Matchmaking
The RDV channel handles player matchmaking. Players connect after logging in through the HTTP API, authenticate using their ticket, and send queue/dequeue messages. The server pushes match assignments back. The matchmaker runs on a 500ms background tick inside a SQLite BEGIN IMMEDIATE transaction to prevent race conditions across queue operations.
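The point of BEGIN IMMEDIATE is that the write lock is taken at transaction start, so no other writer can slip in between reading the queue and assigning a match. A sketch of what one tick's queue work looks like, with hypothetical table and column names:

```sql
BEGIN IMMEDIATE;  -- take the write lock up front, not at first write

-- Pick two queued players closest in rating (illustrative query).
SELECT player_id FROM queue ORDER BY rating LIMIT 2;

-- Assign them to a match and drop them from the queue atomically.
INSERT INTO matches (player_a, player_b, server_id) VALUES (?, ?, ?);
DELETE FROM queue WHERE player_id IN (?, ?);

COMMIT;
```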
One ordering detail that took me a while to get right: the matchmaker sends CmdAllowPlayers to the game server before telling clients where to connect. Without this, a fast client can join and get rejected because the game server hasn't received the allowlist yet. Players who disconnect and reconnect get routed back to their existing match if one is active, skipping the queue entirely.
There's also a TrueSkill-inspired rating engine built into the matchmaker. I wrote a separate post about that.
What I'd change
I rolled my own crypto framing because I wanted to understand how it works. In production I'd use TLS 1.3 and skip maintaining nonce windows and HMAC ordering myself. TLS 1.3 would also give me forward secrecy via ephemeral Diffie-Hellman, which the current scheme doesn't have. If someone compromises a cert's auth key, they can derive session keys for any connection they've recorded. The custom framing gives me more control over the wire format, but it's also more surface area to screw up.
SQLite is fine for matchmaking state at this scale, but it wouldn't survive hundreds of concurrent game servers. I'd move to Postgres with advisory locks if this needed to grow. The matchmaker tick is single-threaded by design, which keeps it simple but caps throughput.
The whole thing runs as a single process. Easy to deploy, but a single point of failure. No clustering, no failover. For a small community that's fine. For anything bigger you'd want at least a hot standby.
What I learned
Trust modeling is the hard part of protocol design. The crypto primitives are well-understood. AES-GCM, HMAC-SHA256, Ed25519 all have good libraries and clear usage guidelines. The hard part is deciding what happens when a node you trusted turns hostile: what can it see, what can it forge, what's the blast radius, and how fast can you cut it off. Those questions shape the protocol more than the choice of cipher.
The other thing I'd underestimated was how much operational tooling matters. The cert CLI, the rate limiters, the stat sanitization, the admin dashboard with IP allowlisting, mTLS, and audit logging. None of that is in the protocol spec. All of it is what you actually reach for at 2am when something goes wrong.
If you want to talk about this, email me at [email protected]. Source is on GitHub.