Blog

Building a 2KB Histogram with Sub -0.2% Percentile Error

drmingdrmerJul 1, 2026

The Problem: Tracking Request Latency Without Slowing Things Down

For a cloud data warehouse, performance is not just about average query time. What often matters more is tail latency, predictability, and the ability to pinpoint where things go wrong.In a cloud-native data warehouse like Databend, a single request may pass through multiple stages: SQL planning, distributed execution, remote storage, Raft logging, and state machine apply. Tail latency in any one of these stages can affect the query stability users actually experience.

That means we need a way to continuously track latency distributions inside the system — lightweight enough to stay off the hot path, accurate enough to be useful, and cheap enough to run everywhere.

This article walks through the design of base2histogram, the lightweight histogram library we built for that purpose.

Consider the lifecycle of a single Raft log entry. It passes through several stages, each with its own latency profile:

Received → written to storage
Persisted to local disk
Replicated to remote nodes
Acknowledged by a majority quorum
Committed → applied to the state machine

A histogram is a natural fit here: put latency on the x-axis and request count on the y-axis, and you get an immediate view of where time is being spent.

This kind of visibility helps you identify bottlenecks and fix the right part of the system.

But there is a catch: collecting metrics must not get in the way of doing actual work. The histogram needs to be:

O(1) to record — no sorting, no rebalancing, and nothing that can stall a hot path
Tiny in memory — a system may run hundreds or thousands of histograms at once
Queryable for percentiles — P50, P95, P99

Let's walk through how we designed a histogram that meets all three requirements.

Recording: Getting Samples Into Buckets

Why Log-Scale Buckets

Most requests cluster around a typical latency, with a few outliers on both ends. This often produces a log-normal distribution: take the log of the latency values, and the shape becomes a classic bell curve.

The signature shape is a peak at lower values, followed by a gradual long tail to the right.

To build a histogram, we divide the x-axis into buckets and count how many samples fall into each one.

The key question is how to size those buckets.

Equal-width buckets work well for a normal distribution, but latency is often log-normal. The data only looks roughly uniform on a logarithmic scale, so the buckets should grow on a log scale, not a linear one.

The simplest version is to make each bucket twice as wide as the previous one:

[0,1), [1,2), [2,4), [4,8), [8,16), ...

Why powers of 2? Because multiplying by 2 is cheap on a CPU, and mapping a value to its bucket takes a single leading zero count instruction.

If we simulate a log-normal workload and plot bucket counts with the bucket index on the x-axis — effectively applying a log transform — the result is a clean bell curve:

This is great for storage: 65 buckets cover the entire

u64

range.

But the resolution is poor. The last bucket spans half of all possible values, so everything that lands there becomes a blur.

A Tempting Fix We Passed On

An obvious improvement is to use a smaller growth factor, such as 1.1× instead of 2×. That gives us more buckets and finer resolution:

The problem is cost. Finding the right bucket for a value

means solving for the smallest

where

1 + 1.1 + 1.1^2 + ... + 1.1^x >= l

, which requires floating-point logarithms. That is real overhead on a hot path.

We wanted to stay in the world of integers and bit operations.

The Trick: Float-Like Encoding

Here is the idea that makes the design work: keep bucket sizes roughly exponential, but encode each bucket using a fixed number of bits — a parameter we call WIDTH.

Think of a bucket's lower bound as a tiny floating-point number.

The MSB position gives the exponent, which tells us which bucket group the value belongs to. The next few bits give the offset within that group.

With WIDTH=3, the default configuration, a bucket boundary looks like this in binary:

00..00 1 xx 00..00
       |
       MSB
<- significant

The leading

selects the group. The two bits that follow select the bucket within the group.

Here is what the first few groups look like. Each bucket is fully described by just 3 bits:

WIDTH = 3:

range     bucket index        bucket size
[0, 1)     0  0b0 ..... 000    1
[1, 2)     1  0b0 ..... 001    1
[2, 3)     2  0b0 ..... 010    1
[3, 4)     3  0b0 ..... 011    1

[4, 5)     4  0b0 ..... 100    1
[5, 6)     5  0b0 ..... 101    1
[6, 7)     6  0b0 ..... 110    1
[7, 8)     7  0b0 ..... 111    1

[8, 10)    8  0b0 .... 1000    2
[10, 12)   9  0b0 .... 1010    2
[12, 14)  10  0b0 .... 1100    2
[14, 16)  11  0b0 .... 1110    2

[16, 20)  12  0b0 ... 10000    4
[20, 24)  13  0b0 ... 10100    4
[24, 28)  14  0b0 ... 11000    4
[28, 32)  15  0b0 ... 11100    4

[32, 40)  16  0b0 .. 100000    8
[40, 48)  17  0b0 .. 101000    8
[48, 56)  18  0b0 .. 110000    8
[56, 64)  19  0b0 .. 111000    8

The pattern is simple:

Each group contains
```
2^(WIDTH-1) = 4
```
buckets
The two bits after the MSB select the bucket within the group
It behaves like a 3-bit float: 1 implicit leading bit + 2 fractional bits

Bucket sizes still grow roughly logarithmically, but computing the bucket index is now just a matter of extracting the top WIDTH bits — a handful of integer and bit operations. Recording a sample is O(1).

Walk-through with latency = 42:

value = 42 (binary: 0b101010)
  MSB position: 5
  group: 5 - 2 = 3
  2 bits after MSB: 01 (from 1[01]010)
  offset in group: 1
  Bucket index: 4 + (3 × 4) + 1 = 17

Tuning WIDTH: The Precision–Memory Knob

WIDTH controls how many buckets each group contains:

2^(WIDTH-1)

The number of groups is capped at 64, so the histogram still covers the full

u64

range. Increasing WIDTH gives each group more buckets, improving resolution at the cost of memory.

Here is the trade-off:

WIDTH	Buckets	Mem/slot	Buckets per group
1	65	520 B	1
2	128	1.0 KB	2
3	252	2.0 KB	4 (default)
4	496	3.9 KB	8
5	976	7.6 KB	16
6	1920	15.0 KB	32

At the default WIDTH=3, one histogram uses 2 KB and records every sample in O(1).

That covers the write path. Now let's look at the read path.

Percentile Estimation: Getting Answers Out

Once we have collected the counts, we want to query percentiles: at what latency have 50% of requests completed (P50)? What about 90% (P90) or 99% (P99)?

Locating the Right Bucket

The basic idea is simple.

For P50, count the total number of samples, take 50% to get a target rank

, then scan the buckets from the beginning and accumulate counts until you pass

. That gives you the target bucket.

But a bucket spans a range, not a single point. We still need to estimate where inside the bucket the percentile falls.

Here are a few options, from rough to more accurate.

All error numbers below come from a log-normal distribution that models API latency, using WIDTH=3 and 1,000,000 samples.

Midpoint: return

(min + max) / 2

Many histogram libraries do this, including iopsystems/histogram. It is a blind guess: it ignores how samples are distributed within the bucket.

	P50	P95	P99
midpoint	5.018%	7.732%	4.861%

Uniform interpolation: assume samples are evenly spread across the bucket, then interpolate linearly:

estimate = min + (max - min) × rank / count

This is better than midpoint because it uses the target rank within the bucket. But the assumption is still rough: log-normal data is skewed, even inside a single bucket.

Trapezoid Interpolation (Our Approach)

Uniform interpolation treats density inside a bucket as flat. In reality, density is often sloped: higher on the side closer to the peak of the distribution.

If we can infer the direction and steepness of that slope, we can replace the rectangle with a trapezoid and get much closer to the true value.

Each bucket stores only a count, and we do not want to add any extra fields. So where does the slope information come from?

From the neighboring buckets.

The densities of the left and right buckets tell us how the density is likely to slope through the current bucket.

Here is the recipe.

Compute the average density of the left bucket,

d0 = c0/(x1-x0)

, and treat it as the density at that bucket's midpoint,

m0

Do the same for the right bucket:

d2 = c2/(x3-x2)

at midpoint

m2

Then assume density changes linearly from

m0

m2

. Over this short range, this is a reasonable approximation. It gives us the slope

Inside the target bucket, the density now forms a trapezoid: a sloped line with slope

, anchored so that the density at the target bucket's midpoint

(x1+x2)/2

equals the bucket's own average density

d1 = c1/(x2-x1)

. For a linear function, the midpoint value is equal to the average over the interval.

To estimate the percentile, we solve for the x-position where the trapezoid area from

x1

equals the target rank.

Same distribution, same buckets — here is how the results compare:

	P50	P95	P99
midpoint	5.018%	7.732%	4.861%
trapezoid	0.000%	0.080%	0.086%

That is two orders of magnitude better, with zero additional storage.

The three-bucket layout:

Variable	Meaning
`x0, x1, x2, x3`	Boundaries of the three adjacent buckets
`w0, w1, w2`	Bucket widths: `w0 = x1-x0` `w1 = x2-x1` `w2 = x3-x2`
`c0, c1, c2`	Sample counts in each bucket
`rank`	How many samples into the target bucket the percentile falls

d0 = c0 / w0       -- left bucket density
d1 = c1 / w1       -- target bucket density
d2 = c2 / w2       -- right bucket density

Midpoints of the left and right buckets:

m0 = (x0+x1)/2

m2 = (x2+x3)/2

Slope:

k = (d2 - d0) / (m2 - m0)

Then solve for the x-position where the trapezoid's cumulative area from

x1

equals the target rank.

The whole calculation uses only three counts and their bucket boundaries. Nothing else is stored, and nothing else is needed.

Benchmarks: Seven Distributions, Six WIDTH Settings

We tested the algorithm across seven representative distributions, each with 1,000,000 samples, using trapezoid interpolation.

The rows to focus on are LN-API and LN-DB at W=3. These are the real-world latency cases under the default 2 KB configuration:

|                   W=1      W=2      W=3      W=4      W=5      W=6
| ------------------------------------------------------------------
| Uniform P50    0.108%   0.028%   0.012%   0.018%   0.019%   0.002%
|         P95    2.317%   1.988%   1.035%   0.475%   0.005%   0.005%
|         P99    4.290%   4.129%   3.706%   1.486%   0.298%   0.162%
|
| LN-API  P50    2.281%   0.182%   0.000%   0.000%   0.000%   0.000%
|         P95   20.256%   3.963%   0.080%   0.040%   0.040%   0.000%
|         P99   11.951%   3.594%   0.086%   0.000%   0.029%   0.000%
|
| Bimodal P50    1.381%   0.394%   0.394%   0.197%   0.197%   0.197%
|         P95    3.918%   0.172%   0.012%   0.028%   0.038%   0.008%
|         P99    1.521%   1.344%   0.543%   0.078%   0.016%   0.014%
|
| Expon   P50    1.012%   0.000%   0.145%   0.145%   0.145%   0.000%
|         P95   10.989%   0.200%   0.000%   0.000%   0.033%   0.033%
|         P99   18.665%   4.574%   0.824%   0.022%   0.022%   0.022%
|
| LN-DB   P50    2.018%   0.034%   0.000%   0.000%   0.000%   0.034%
|         P95    2.027%   0.368%   0.039%   0.006%   0.019%   0.026%
|         P99    3.764%   1.066%   0.187%   0.007%   0.003%   0.062%
|
| Sequent P50    0.095%   0.000%   0.000%   0.000%   0.000%   0.000%
|         P95    2.271%   1.967%   1.011%   0.496%   0.000%   0.000%
|         P99    4.272%   4.118%   3.696%   1.521%   0.305%   0.169%
|
| Pareto  P50   10.127%   1.899%   0.633%   0.633%   0.633%   0.000%
|         P95    9.239%   0.272%   0.000%   0.136%   0.000%   0.000%
|         P99    3.517%   0.879%   0.231%   0.093%   0.046%   0.046%
|
| ------------------------------------------------------------------
| Buckets            65      128      252      496      976     1920
| Mem/slot        520 B   1.0 KB   2.0 KB   3.9 KB   7.6 KB  15.0 KB
| Mem total      1.0 KB   2.0 KB   3.9 KB   7.8 KB  15.2 KB  30.0 KB

What each distribution models:

Uniform (uniform distribution): synthetic benchmarks
LN-API (log-normal** σ=0.5)**: API and microservice latency
Bimodal (bimodal distribution): cache hit/miss — 90% fast path around 500 μs, 10% slow path around 50 ms
Expon (exponential distribution): network and I/O waits
LN-DB (log-normal σ=1.0): database query latency with a wider tail
Sequent (sequential): adversarial worst case
Pareto (Pareto distribution** α=1.5)**: heavy-tailed workloads, such as request sizes

For the latency distributions we care about most — LN-API and LN-DB — WIDTH=3 delivers sub -0.2% error with only 2 KB of memory.

Summary

2 KB memory: WIDTH=3, 252 buckets of
```
u64
```
, with P50/P95/P99 error under 0.2% for log-normal latency workloads
O(1) recording, O(buckets) querying
Trapezoid interpolation delivers over 10× better accuracy than midpoint, with zero extra storage
WIDTH is tunable: from 520 B for minimal tracking to 15 KB for maximum precision

A histogram may be a small piece of infrastructure, but it supports a much larger goal for Databend: making cloud data warehouse performance more observable, easier to reason about, and easier to optimize.

When a system can continuously record distributions such as P50, P95, and P99 at very low cost, engineering teams can trace tail latency much faster — whether it comes from storage, the network, Raft, the execution pipeline, or the query itself.

For users, that ultimately means more stable queries, more predictable performance, and a clearer path to cost optimization.

Check out the project: https://github.com/drmingdrmer/base2histogram

Share this post

Subscribe to our newsletter

Stay informed on feature releases, product roadmap, support, and cloud offerings!

Trace Is Evals: Data Engineering for Agent Trace Analysis and Attribution

Building a 2KB Histogram with Sub -0.2% Percentile Error

drmingdrmerJul 1, 2026

The Problem: Tracking Request Latency Without Slowing Things Down

Recording: Getting Samples Into Buckets

Why Log-Scale Buckets

A Tempting Fix We Passed On

The Trick: Float-Like Encoding

Tuning WIDTH: The Precision–Memory Knob

Percentile Estimation: Getting Answers Out

Locating the Right Bucket

Trapezoid Interpolation (Our Approach)

Benchmarks: Seven Distributions, Six WIDTH Settings

Summary

Subscribe to our newsletter

Get Started

Talk to us

Products

Resources

Community

Company

Solutions

Partners

Building a 2KB Histogram with Sub -0.2% Percentile Error

drmingdrmerJul 1, 2026Copy Page

The Problem: Tracking Request Latency Without Slowing Things Down

Recording: Getting Samples Into Buckets

Why Log-Scale Buckets

A Tempting Fix We Passed On

The Trick: Float-Like Encoding

Tuning WIDTH: The Precision–Memory Knob

Percentile Estimation: Getting Answers Out

Locating the Right Bucket

Trapezoid Interpolation (Our Approach)

Benchmarks: Seven Distributions, Six WIDTH Settings

Summary

Subscribe to our newsletter

Get Started

Talk to us

Products

Resources

Community

Company

Solutions

Partners

drmingdrmerJul 1, 2026