I built RankedServer this year as the backend for a small community game that needed competitive matchmaking. The project needed a rating system, and the constraint that shaped everything was playerbase size: with fewer than 500 concurrent players, keeping queues moving matters more than decimal precision in your skill estimates. I spent a week reading papers and staring at spreadsheets, then wrote my own.
Elo was designed for chess: 1v1, no teams, and a static K-factor. In a 5v5 shooter, everyone on the winning team gets the same bump regardless of whether they carried or got carried. And Elo has no concept of confidence: a brand-new player and a 500-game veteran move by the same amount, which means new players take forever to reach their real rank.
Glicko-2 adds a rating deviation that shrinks as you play more, which partially solves the new player problem. But it was still designed for 1v1 and doesn't natively handle teams. You can hack team support onto it, but then you're maintaining someone else's math with your own duct tape.
TrueSkill is the closest to what I wanted. It handles teams and has Bayesian uncertainty tracking. I ended up borrowing a lot from it. But the full implementation uses factor graphs and message passing, which is overkill for a small community, and the rating it produces (mu minus three sigma) is invisible to players and hard to reason about. Try explaining to a tilted player why their number went down even though their "true skill" went up because their sigma shrank.
So I took TrueSkill's Bayesian uncertainty as the foundation and put a visible MMR number on top, one that feels responsive, rewards upsets, accounts for individual performance, and doesn't punish you for a teammate disconnecting in round 3.
The system maintains two separate ratings per player.
The hidden layer is a TrueSkill-style Gaussian. Every player starts at mu = 25.0 with sigma = 8.33 (mu/3). After each match, both update using standard Bayesian formulas: comparing team skill totals, computing a surprise factor, adjusting. Sigma shrinks as confidence grows and has a floor at 2.0 so it never fully locks. A small dynamics factor (tau = sigma/100) adds a tiny amount of uncertainty each game so the system can adapt if someone improves.
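The hidden state is tiny. Here's a sketch of the struct and the dynamics step; the names are mine rather than the repo's, and the actual team update (which needs the Gaussian CDF) is omitted:

```rust
// Illustrative names and structure, not lifted from the repo.
const MU0: f64 = 25.0;
const SIGMA0: f64 = MU0 / 3.0; // 8.33
const SIGMA_MIN: f64 = 2.0;    // sigma never fully locks

#[derive(Clone, Copy, Debug)]
struct Rating {
    mu: f64,    // estimated skill
    sigma: f64, // uncertainty about that estimate
}

impl Rating {
    fn new() -> Self {
        Rating { mu: MU0, sigma: SIGMA0 }
    }

    /// Inject a little uncertainty before each match so the system
    /// can track improvement over time.
    fn pre_match_dynamics(&mut self) {
        let tau = self.sigma / 100.0;
        self.sigma = (self.sigma * self.sigma + tau * tau).sqrt();
    }

    /// Applied after the Bayesian update: sigma shrinks with evidence
    /// but never drops below the floor.
    fn clamp_sigma(&mut self) {
        self.sigma = self.sigma.max(SIGMA_MIN);
    }
}
```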
The visible layer is the number players see. It starts at 2500 and moves using a modified K-factor formula that draws on the hidden layer but follows its own rules. This is where the product design lives.
The hidden layer answers "what is this player's true skill?" The visible layer answers "what number makes this player feel correctly ranked?" Those are different questions.
The K-factor (how many points you gain or lose per match) scales with sigma, the uncertainty value from the hidden layer:
u = (sigma - SIGMA_MIN) / (SIGMA0 - SIGMA_MIN)
k = K_MIN + (K_MAX - K_MIN) * u
A brand new player has sigma near 8.33, so u is close to 1.0 and their K-factor is near 160. They move fast. A veteran with sigma near 2.0 has K near 50. They move slow. This is calibration for free: no "placement matches" needed, no special mode. The system naturally moves new players quickly and settles experienced ones.
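In code the mapping is a couple of lines. The constants are the ones quoted in this post; the function name is illustrative:

```rust
const K_MIN: f64 = 50.0;
const K_MAX: f64 = 160.0;
const SIGMA0: f64 = 25.0 / 3.0; // starting uncertainty
const SIGMA_MIN: f64 = 2.0;     // uncertainty floor

fn k_factor(sigma: f64) -> f64 {
    // Normalize sigma into [0, 1]: 1.0 for a brand-new player, 0.0 at the floor.
    let u = ((sigma - SIGMA_MIN) / (SIGMA0 - SIGMA_MIN)).clamp(0.0, 1.0);
    K_MIN + (K_MAX - K_MIN) * u
}

fn main() {
    println!("new player: K = {:.0}", k_factor(SIGMA0)); // 160
    println!("veteran:    K = {:.0}", k_factor(2.0));    // 50
}
```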
K_MIN is set to 50 rather than something lower like 16 or 24. In a small community, if gains feel too small, people stop playing. A 50-point floor means even a stable player gets a noticeable bump from a win. I'm choosing engagement over convergence speed at the top end.
One of the worst feelings in competitive gaming is beating a team you were "supposed to lose to" and getting 12 points for it. The system handles this with an upset multiplier.
The expected win probability is computed two different ways. The hidden layer uses the real Gaussian CDF, which gives an honest probability. The visible MMR layer flattens that probability toward even odds by a scale factor of 2.0, then clamps it so that even extremely lopsided matches award at least 10% of the base K.
When the actual result diverges sharply from the expected outcome, the delta gets amplified by up to 3x. The amplification follows a power curve (exponent 1.5) so minor upsets barely budge the multiplier, but big upsets get a real bonus. Wins are capped at 150 points for upsets versus 100 for normal wins. Losses are always capped at 80. The asymmetry is intentional. Big losses feel worse than big wins feel good, so I tax losses less.
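Here's a sketch of both pieces. The exact flattening in the repo isn't spelled out above, so this assumes the probability is pulled toward even odds by the 2.0 factor, and that the multiplier only ramps once a result is genuinely surprising:

```rust
// All names are mine; the constants are the ones quoted in this post.
const FLATTEN_SCALE: f64 = 2.0;
const MIN_EXPECTED: f64 = 0.10; // lopsided wins still award >= 10% of base K
const UPSET_MAX: f64 = 3.0;     // amplification cap
const UPSET_EXP: f64 = 1.5;     // power curve: minor upsets barely register

fn flattened_expectation(p_honest: f64) -> f64 {
    // Pull the honest Gaussian-CDF probability toward 0.5...
    let p = 0.5 + (p_honest - 0.5) / FLATTEN_SCALE;
    // ...and clamp so the favorite's win is never nearly worthless.
    p.clamp(MIN_EXPECTED, 1.0 - MIN_EXPECTED)
}

fn upset_multiplier(expected: f64, won: bool) -> f64 {
    // "Surprise" is how far the actual result sits from the expectation.
    let surprise = if won { 1.0 - expected } else { expected };
    // No upset below even odds; above it, ramp on a power curve.
    let t = ((surprise - 0.5) * 2.0).max(0.0).powf(UPSET_EXP);
    1.0 + (UPSET_MAX - 1.0) * t
}

fn cap_delta(delta: f64, won: bool, upset: bool) -> f64 {
    match (won, upset) {
        (true, true) => delta.min(150.0),  // upset wins
        (true, false) => delta.min(100.0), // normal wins
        (false, _) => delta.max(-80.0),    // losses always capped at 80
    }
}
```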
This was the hardest part to get right. The game I was optimizing for is a tactical shooter, and I didn't want a system that rewards kill-farming while ignoring the player who planted the bomb in a 1v3.
Each player gets an "impact score" computed per round:
1.00 * kills + 0.70 * assists + 3.00 * plants + 3.50 * defuses + 0.40 * ln(1 + gadget_destroys) + 1.20 * revives - 0.75 * deaths
Deaths are penalized at 0.75 instead of 1.0 because entry fraggers die a lot but create value; weighting deaths as heavily as kills would systematically underrate aggressive players. Plants and defuses are weighted 3x and 3.5x because they directly win rounds. Gadget destroys use ln(1+x) to saturate: destroying 3 cameras is useful, but farming 20 drones shouldn't make you the MVP.
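Transcribed into code, with an illustrative stats struct (only the weights are the system's):

```rust
struct RoundStats {
    kills: f64,
    assists: f64,
    plants: f64,
    defuses: f64,
    gadget_destroys: f64,
    revives: f64,
    deaths: f64,
}

fn impact_score(s: &RoundStats) -> f64 {
    1.00 * s.kills
        + 0.70 * s.assists
        + 3.00 * s.plants
        + 3.50 * s.defuses
        + 0.40 * (1.0 + s.gadget_destroys).ln() // saturates: gadget farming stops paying
        + 1.20 * s.revives
        - 0.75 * s.deaths // cheaper than a kill so entry fraggers aren't buried
}
```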
Within each team, the impact scores are converted to robust z-scores using median and MAD (Median Absolute Deviation) instead of mean and standard deviation. With only 5 players per team, one outlier can badly skew a mean; MAD handles that. The z-scores become small multipliers (capped at plus or minus 12%) that redistribute MMR within the team. The total team delta stays the same: performance adjustments are zero-sum, so nobody gets extra points conjured from nowhere. On a win, the best performer takes a bit from the worst; on a loss, the worst eats a bit more of the hit.
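A sketch of the redistribution, with one loud caveat: the ±12% cap is real, but the z-to-multiplier scale (0.04 below) and the exact re-centering are my guesses:

```rust
fn median(values: &mut [f64]) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = values.len();
    if n % 2 == 1 { values[n / 2] } else { (values[n / 2 - 1] + values[n / 2]) / 2.0 }
}

/// Robust z-scores -> capped multipliers, re-centered to exactly zero-sum.
/// Assumes everyone on the team starts from the same base delta, so
/// zero-sum multipliers give zero-sum adjustments.
fn performance_multipliers(impacts: &[f64]) -> Vec<f64> {
    let med = median(&mut impacts.to_vec());
    let mut devs: Vec<f64> = impacts.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&mut devs).max(1e-9); // guard against zero spread

    let mults: Vec<f64> = impacts
        .iter()
        .map(|x| {
            // 1.4826 * MAD approximates the standard deviation for normal data.
            let z = (x - med) / (1.4826 * mad);
            (z * 0.04).clamp(-0.12, 0.12) // placeholder scale, real +/-12% cap
        })
        .collect();

    // Re-center so the adjustments cancel out across the team.
    let mean = mults.iter().sum::<f64>() / mults.len() as f64;
    mults.iter().map(|m| m - mean).collect()
}
```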
After redistribution, the per-player deltas get rounded to integers while preserving the exact team sum. There's a round_preserve_sum function that nudges individual deltas by at most 1 point each, sorting by rounding remainder, so no team total drifts due to floating point. Small thing, but it means players can verify: the five numbers on the scoreboard always add up exactly.
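The rounding step is the easiest part to sketch. This is a reconstruction of the described behavior (largest-remainder rounding), not the function's actual body:

```rust
fn round_preserve_sum(deltas: &[f64]) -> Vec<i64> {
    let target: i64 = deltas.iter().sum::<f64>().round() as i64;
    // Floor everything, then measure how many points are still owed.
    let mut rounded: Vec<i64> = deltas.iter().map(|d| d.floor() as i64).collect();
    let mut shortfall = target - rounded.iter().sum::<i64>();

    // Hand the remaining points to the entries that lost the most in flooring,
    // so each delta moves by at most 1 from its exact value.
    let mut order: Vec<usize> = (0..deltas.len()).collect();
    order.sort_by(|&a, &b| {
        let ra = deltas[a] - deltas[a].floor();
        let rb = deltas[b] - deltas[b].floor();
        rb.partial_cmp(&ra).unwrap()
    });
    for &i in &order {
        if shortfall == 0 { break; }
        rounded[i] += 1;
        shortfall -= 1;
    }
    rounded
}
```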
Above 4200 MMR, gains start shrinking and losses start growing slightly. At 5000, wins are multiplied by 0.8 and losses by 1.05. This creates a soft ceiling that compresses the leaderboard top without a hard cap.
In a small community, the top 5 players would otherwise inflate indefinitely because they keep beating the same people. Dampening turns the leaderboard into a "king of the hill" where staying at the top requires consistent play, not just having queued more games.
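The interpolation between the 4200 and 5000 anchors might look like this; the linear ramp is an assumption, since only the endpoints are stated above:

```rust
const ELITE_START: f64 = 4200.0; // dampening begins
const ELITE_FULL: f64 = 5000.0;  // full multipliers

fn elite_multiplier(mmr: f64, won: bool) -> f64 {
    let t = ((mmr - ELITE_START) / (ELITE_FULL - ELITE_START)).clamp(0.0, 1.0);
    if won {
        1.0 - 0.2 * t  // 1.0 at 4200, 0.8 at 5000: gains shrink
    } else {
        1.0 + 0.05 * t // 1.0 at 4200, 1.05 at 5000: losses grow slightly
    }
}
```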
Disconnected players are always treated as losses for rating purposes, even if their team wins. Their hidden skill (mu) also takes a hit, scaled by their participation. If someone played 4 out of 9 rounds and left, their sigma gets a small boost back up (less certainty about their skill for that match), and their weight in the team skill calculation is reduced so they don't drag the team's aggregate rating down as much.
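Something like this, where every coefficient is a placeholder since only the shape of the scaling is described above:

```rust
// Illustrative only: the magnitudes below are placeholders, not the repo's values.
fn penalize_leaver(mu: &mut f64, sigma: &mut f64, rounds_played: u32, rounds_total: u32) -> f64 {
    let participation = rounds_played as f64 / rounds_total as f64;
    // The mu hit is scaled by how much of the match they actually played.
    *mu -= 0.5 * participation;
    // A partial match is weaker evidence, so uncertainty creeps back up.
    *sigma *= 1.0 + 0.05 * (1.0 - participation);
    // The returned weight shrinks their contribution to the team skill sum.
    participation
}
```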
On the punishment side, each abandon adds 3 leaver points, which decay at 10% per day. Accumulating points triggers escalating lockouts: 10 minutes, then 30 minutes, then 2, 12, and 48 hours. The decay means a single bad disconnect is forgiven within a week, but serial leavers hit multi-day bans.
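The bookkeeping is exponential decay plus a tier table. The mapping from accumulated points to tiers below is illustrative; only the durations come from the system:

```rust
const POINTS_PER_ABANDON: f64 = 3.0;
const DAILY_DECAY: f64 = 0.90; // points retain 90% of their value per day

fn decayed_points(points: f64, days_since_update: f64) -> f64 {
    points * DAILY_DECAY.powf(days_since_update)
}

/// Escalating lockout durations, in minutes. How points map to tiers
/// is an assumption here.
fn lockout_minutes(tier: u32) -> u64 {
    match tier {
        0 => 0,
        1 => 10,
        2 => 30,
        3 => 120,   // 2 hours
        4 => 720,   // 12 hours
        _ => 2880,  // 48 hours
    }
}
```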
The matchmaker doesn't sort players by MMR and try to build balanced teams. It fills servers FIFO: first in, first out. When a server is idle, the next batch of queued players gets assigned.
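The whole assignment step fits in a few lines; the types and tick shape here are mine:

```rust
use std::collections::VecDeque;

struct Matchmaker {
    queue: VecDeque<u64>, // player ids, in join order
}

impl Matchmaker {
    /// Called each tick per queue type: pop the next full batch of players
    /// for every idle server, strictly in join order.
    fn try_fill(&mut self, idle_servers: usize, match_size: usize) -> Vec<Vec<u64>> {
        let mut batches = Vec::new();
        for _ in 0..idle_servers {
            if self.queue.len() < match_size {
                break; // not enough players yet; wait for the next tick
            }
            batches.push(self.queue.drain(..match_size).collect());
        }
        batches
    }
}
```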
Why? Because queue times kill small communities faster than unbalanced matches do. With fewer than 500 concurrent players spread across queue types and regions, trying to enforce skill brackets would mean 15-minute waits. And 15-minute waits mean people close the game. I'd rather have a match start in 30 seconds with a 300 MMR spread than have players alt-tab and never come back.
The rating system compensates for this. If a 4000 MMR player gets matched against a 2500, the expected-win flattening and upset amplification ensure that the 2500 player gets a big reward for winning and the 4000 barely loses anything. The math absorbs the matchmaking imbalance instead of the matchmaking absorbing the math.
If the community ever grows, adding skill-based bucketing to the queue would be straightforward since the matchmaker tick loop already iterates per queue type. But premature optimization of fairness at the cost of queue time is how you end up with a perfectly balanced game that nobody plays.
The performance weights are tuned by hand. The impact score weights (1.0 for kills, 3.0 for plants, etc.) are educated guesses based on watching games. Ideally they'd be learned from data: run a logistic regression on which per-round stats best predict round wins, and use those coefficients. I didn't have enough match data at launch to do this, but the weights are all constants at the top of the file, so swapping them out is trivial.
There's no party/stack detection. A 5-stack of coordinated players will perform differently than five solo players at the same MMR. Right now the system treats them the same. Adjusting the expected outcome when one team is a full party would improve accuracy.
The elite dampening curve is a magic number. The 4200 threshold and the 0.8/1.05 multipliers were chosen by simulating a few hundred games in a spreadsheet. With a larger playerbase I'd want to set these dynamically based on percentile distribution.
Rating systems are half math and half product design. The Bayesian update is the easy part. The hard part is deciding how a +47 should feel versus a -31, how fast new players should be allowed to climb, what happens when someone's internet dies, and whether a support player who went 2-5-8 with two clutch defuses should gain more than the fragger who went 11-4-0. Those are not math questions. Those are game design questions with math as the implementation detail.
The full implementation is about 500 lines of Rust: an error function approximation, some clamping, and a lot of constants I tuned by playing games and watching what felt wrong.
Source is at github.com/dannyisbad/ranked-server.