Frequently asked questions about histograms and OpenHistogram technology
What’s a histogram?
A histogram is a data structure (or a model) that describes the distribution of a set of samples. Imagine you wanted to know how much daily sleep humans on Earth enjoyed. That’s a lot of data points. One way to model that tersely is to catalog how many hours each person slept and simply keep a running tally of how many people slept between 0 and 1 hours, 1 and 2 hours, etc. There are billions of people in the world, but the resulting model would be limited to 24 counts (because one cannot sleep less than zero or more than 24 hours in a day).
This relatively tiny model still allows us to extract valuable information such as estimates of the minimum, maximum, median, average, standard deviation, modality, and arbitrary quantiles (and their inverses), as well as an understanding of the “shape” of the distribution. Basically, a lot of value for a little cost.
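As a minimal sketch of the sleep example, the Python below tallies samples into 24 one-hour bins and then estimates a median and a mean from the counts alone. The function names and the sample values are hypothetical, and the estimates assume samples are spread evenly within each bin.

```python
def build_histogram(samples, num_bins=24):
    """Tally hours of sleep into one-hour bins [0,1), [1,2), ..., [23,24]."""
    counts = [0] * num_bins
    for hours in samples:
        # A sample of exactly 24.0 is clamped into the last bin.
        index = min(int(hours), num_bins - 1)
        counts[index] += 1
    return counts


def estimate_quantile(counts, q):
    """Estimate the q-quantile (0 <= q <= 1) from bin counts.

    Assumes samples are spread evenly inside each bin.
    """
    target = q * sum(counts)
    seen = 0
    for index, count in enumerate(counts):
        if count and seen + count >= target:
            return index + (target - seen) / count
        seen += count
    return float(len(counts))


def estimate_mean(counts):
    """Estimate the mean by treating every sample as its bin midpoint."""
    total = sum(counts)
    return sum((i + 0.5) * c for i, c in enumerate(counts)) / total


# Hypothetical sample data: hours slept by eight people.
samples = [6.5, 7.2, 7.9, 8.1, 5.4, 7.0, 9.3, 6.1]
counts = build_histogram(samples)
print(estimate_quantile(counts, 0.5), estimate_mean(counts))
```

The model stays at 24 counters no matter how many samples are recorded; only the counts grow.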
What is a time series histogram?
A time series histogram is a histogram over different discrete periods of observation. Imagine tracking daily rainfall in millimeters and storing a histogram for a year. You would have 365 (or 366) samples in the bins of your histogram describing how often each day produced that particular amount of rainfall. If you were to model the last one hundred years separately, you would end up with 100 histograms – one for each year. Looking at the histograms historically, you can perform an analysis of how rainfall has changed over time. From this series of space-efficient models, you could look at the yearly averages, or yearly minimums or maximums. The statistical power of having a distribution model combined with the ability to analyze change over time provides the opportunity to gain valuable insights.
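The rainfall example can be sketched as one histogram per year, keyed by year. This is only an illustration: the 10 mm bin width, the helper names, and the readings are all made up, and each year holds just a handful of days to keep it short.

```python
from collections import Counter


def rainfall_histogram(daily_mm, bin_width_mm=10):
    """Count days per rainfall bin; each key is the bin's lower edge in mm."""
    hist = Counter()
    for mm in daily_mm:
        lower_edge = int(mm // bin_width_mm) * bin_width_mm
        hist[lower_edge] += 1
    return hist


def wettest_bin(hist):
    """Lower edge of the highest occupied bin (an estimate of the maximum)."""
    return max(hist)


def estimated_mean(hist, bin_width_mm=10):
    """Mean daily rainfall, treating each day as its bin midpoint."""
    days = sum(hist.values())
    return sum((edge + bin_width_mm / 2) * n for edge, n in hist.items()) / days


# Hypothetical readings in mm (a few days per year, for brevity).
readings_by_year = {
    1923: [0, 0, 3, 12, 27, 0, 8],
    2023: [0, 0, 0, 41, 55, 2, 0],
}

# The time series histogram: one small model per period of observation.
series = {year: rainfall_histogram(days) for year, days in readings_by_year.items()}

for year, hist in sorted(series.items()):
    print(year, wettest_bin(hist), round(estimated_mean(hist), 2))
```

Each year's histogram can be analyzed on its own, and comparing the per-year statistics shows how the distribution shifts over time.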
What do time series histograms look like in IT?
They allow us to collapse a vast amount of sample data into a small, space-efficient model. This changes what is possible in some profound ways. Here are a few examples of things that become both valuable and relatively inexpensive to model over time:
- The nanoseconds of latency of every system call performed on each system.
- The microseconds of latency of every read and write I/O operation against each piece of media (spinning disks, SSD, NVMe) on each system.
- The microseconds of latency of each API call or web page service on a platform.
- The arrival rate of new connections (microseconds of delay between each new connection).
Each of these is recorded second by second, allowing one to robustly understand how the behavior of one's systems changes over time. These are just a few examples. If you have high-frequency measurements coming in, chances are you can achieve greater value at lower cost by leveraging time series histograms.
Why is binning important?
Binning is important because it allows multiple samples “near” each other to be represented as a single range, which allows for significant compression within the model. The real question is what matters when choosing bin boundaries. There are many academic papers on how to select good bin boundaries for specific domains of data (and even some patents), but the fundamental problem with time series histograms (as opposed to singular histograms) is that one must be able to merge (add) these histograms together. If we store histograms for each second within an hour, but would like to analyze the whole hour as a single distribution, we must be able to aggregate all the data. This might sound simple, but if your separate histograms have different, overlapping bins, there is no way to merge them without introducing more error, and it can get bad, fast.
Not only do we need to make sure the bins in our separate histograms are mergeable (without introduced error) over time, we also need to ensure they are cleanly mergeable across signals. Imagine having a time series histogram for I/O latency on spinning disks, SSDs, and NVMe devices. They have quite different latency potentials and, in isolation, it might make sense to use a different set of bin boundaries for each so that your relative error is smaller. But if they are not cleanly mergeable, you cannot look at “storage” as a single distribution without merge error and some serious statistics knowledge. What’s the answer? One universal set of bins.
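When every histogram uses the same bin boundaries, merging reduces to adding counts bin by bin, and the result is identical to a histogram built from the pooled samples. The sketch below illustrates that property; the fixed 100-microsecond `shared_bin` function and the latency values are hypothetical stand-ins for illustration, not the circllhist scheme.

```python
from collections import Counter


def shared_bin(latency_us):
    """Hypothetical shared scheme: lower edge of a fixed 100-microsecond bin."""
    return (latency_us // 100) * 100


def to_histogram(samples, bin_of):
    """Count samples per bin, keyed by whatever bin_of returns."""
    return Counter(bin_of(s) for s in samples)


def merge(*histograms):
    """Merge histograms that share one set of bin boundaries by adding counts."""
    merged = Counter()
    for hist in histograms:
        merged.update(hist)  # Counter.update adds counts instead of replacing
    return merged


# Hypothetical latencies (microseconds) observed in two consecutive seconds.
second_1 = [120, 180, 950]
second_2 = [130, 4200]

merged = merge(to_histogram(second_1, shared_bin), to_histogram(second_2, shared_bin))
pooled = to_histogram(second_1 + second_2, shared_bin)

# With shared boundaries, the merge introduces no additional error.
print(merged == pooled)
```

If the two inputs used different boundaries, a count from one histogram could straddle several bins of the other, and no bin-by-bin addition could reproduce the pooled histogram exactly.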
What is universal binning?
Universal binning means accepting one set of bin boundaries for all data before any data is collected. Whether you are looking at instruction latency (sub-nanosecond), time to finish an online purchase (minutes), or the age of an open ticket (perhaps months), every sample is placed into the same bins. What’s the catch? We want a limited amount of error. If we had one bin from zero to one and another from one to infinity, we could represent everything easily, but such a model says very little about the underlying data and the error introduced is profound. What is the solution to one universal binning scheme? Circonus log-linear histograms. Are there other solutions? Tautologically no: if you have more than one solution, it isn’t universal.
What is Circonus log-linear binning?
OpenHistogram uses the Circonus reference implementation of log-linear histograms, called circllhist (pronounced “circle hist”). Circllhist describes a set of bin boundaries that have some amazing properties:
- Both positive and negative values
- Both small and large values, from 10^-128 to 10^127
- It controls relative error well across its entire domain: at most 4.7%, and empirically usually less than 0.1%
- It operates in the base-10 realm making it very simple for human beings to reason about
- Recording samples into a histogram is CPU-efficient (< 10ns)
- The implementation is usable without floating point operations making it useful in small embedded systems
- It’s free
The bin boundaries are 0 and every number that can be written as [+/-]X.Y x 10^Z, where X is an integer from 1 to 9 inclusive, Y is an integer from 0 to 9 inclusive, and Z is an integer from -128 to 127 inclusive. Each bin spans two consecutive boundaries. Examples?
- 0 -> [0,0]
- 42 -> [4.2x10^1,4.3x10^1)
- 0.148 -> [1.4x10^-1,1.5x10^-1)
- 1923475 -> [1.9x10^6,2.0x10^6)
- -3.2 -> [-3.2x10^0,-3.3x10^0)
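The mapping from a value to its bin can be sketched in a few lines of Python. This is an illustrative sketch based only on the description above, not the circllhist reference implementation: the function name `circllhist_bin` is hypothetical, it uses `Decimal` rather than the integer-only arithmetic the real library supports, and it raises an error for exponents outside -128 to 127 instead of modeling whatever the library does at the edges of its range.

```python
from decimal import Decimal


def circllhist_bin(value):
    """Return (closed_edge, open_edge) of the two-significant-digit bin.

    For negative values the closed edge is the one nearer zero, mirroring
    the notation used in the examples above.
    """
    d = Decimal(str(value))
    if d == 0:
        return (Decimal(0), Decimal(0))

    sign = -1 if d < 0 else 1
    magnitude = abs(d)

    # Exponent Z of the most significant digit: 42 -> 1, 0.148 -> -1.
    exponent = magnitude.adjusted()
    if not -128 <= exponent <= 127:
        # Simplification for this sketch: reject out-of-range values.
        raise ValueError("exponent outside -128..127")

    # Scale into [1, 10) and keep the first two digits X.Y as an integer 10..99.
    scaled = magnitude.scaleb(-exponent)
    tenths = int(scaled * 10)  # int() truncates toward zero

    near = Decimal(tenths).scaleb(exponent - 1)
    far = Decimal(tenths + 1).scaleb(exponent - 1)
    return (sign * near, sign * far)


for sample in (0, 42, 0.148, 1923475, -3.2):
    print(sample, circllhist_bin(sample))
```

Working in decimal keeps the boundaries exact, which matches the base-10 reasoning the scheme is designed around; a binary floating-point `log10` would risk misplacing values that sit exactly on a boundary.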
Why circllhist and not some other?
Is universal binning more valuable than the slight contextual advantages one implementation has over another? Is circllhist the best in every situation? Of course not, but it is in many and it’s quite good in all. Ensuring all data everywhere can be merged (and exported from one tool/vendor to another) without introduced error is incredibly valuable to the owner of that data and thus invaluable to our industry.
We have a growing portfolio of patents around time-series distribution data (histograms) and it is important to us that the end-users of all this magic are empowered. We make our histogram patents freely available to anyone who adheres precisely to our binning format so that end-user data is portable from platform to platform and that more end-users have access to the power of this technology.
Do other approaches infringe on Circonus patents?
Maybe. That’s for you and your counsel to decide. Regardless of whether another approach infringes, the undeniable cost is a lack of interoperability.