Reposted from: http://blog.pkhamre.com/2012/07/24/understanding-statsd-and-graphite/
After a short conversation on the #logstash channel at Freenode, I realized that I did not know how my data was sent and how it was stored in Graphite. I knew that StatsD collects and aggregates my metrics, and that it ships them off to Graphite, which stores the time-series data and enables us to render graphs based on it.
What I did not know was whether my HTTP-access graphs displayed requests per second, average requests per retention period, or something else.
It was time to research how these things worked in order to get a complete understanding.
StatsD
To get a full understanding of how StatsD works, I started to read the source code. I knew StatsD was a simple application, but I did not know it was this simple. Just over 300 lines of code in one file, and around 150 lines in another.
Concepts in StatsD
StatsD has a few concepts listed in the documentation that should be understood.
Buckets
Each stat is in its own “bucket”. They are not predefined anywhere. Buckets can be named anything that will translate to Graphite (periods make folders, etc)
Values
Each stat will have a value. How it is interpreted depends on modifiers. In general, values should be integers.
Flush interval
After the flush interval timeout (default 10 seconds), stats are aggregated and sent to an upstream backend service.
Metric types
Counters
Counters are simple. A counter adds a value to a bucket, where it stays in memory until the flush interval.
Let's take a look at the source code that generates the counter stats that get flushed to the backend.
```javascript
for (key in counters) {
  var value = counters[key];
  var valuePerSecond = value / (flushInterval / 1000); // calculate "per second" rate

  statString += 'stats.'        + key + ' ' + valuePerSecond + ' ' + ts + "\n";
  statString += 'stats_counts.' + key + ' ' + value          + ' ' + ts + "\n";

  numStats += 1;
}
```
First, StatsD iterates over the counters it has received, assigning two variables for each: one holds the counter value, and one holds the per-second rate. It then adds both values to statString and increments the numStats variable.
With the default flush interval of 10 seconds, sending StatsD 7 increments on a counter within one interval gives a counter value of 7 and a per-second value of 0.7. No magic.
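The arithmetic above can be sketched in a few lines of Ruby, using the example values (this is an illustration, not StatsD code):

```ruby
# Sketch of the counter flush math described above, using the example values.
flush_interval = 10_000                        # default flush interval, in ms
counter = 7                                    # seven increments received
value_per_second = counter / (flush_interval / 1000.0)

puts counter           # => 7
puts value_per_second  # => 0.7
```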
Timers
Timers collect numbers. They do not necessarily need to contain a value of time. You can collect bytes read, the number of objects in some storage, or anything else that is a number. A good thing about timers is that you get the mean, the sum, the count, and the upper and lower values for free. Feed StatsD a timer and this gets calculated automatically before it is flushed to Graphite. Oh, I almost forgot to mention that you also get the 90th percentile calculated for the mean, sum, and upper values. You can also configure StatsD with an array of percentiles, which means you can get the 50th, 90th, and 95th percentiles calculated for you if you want.
The source code for timer stats is a bit more advanced than the code for the
counters.
```javascript
for (key in timers) {
  if (timers[key].length > 0) {
    var values = timers[key].sort(function (a, b) { return a - b; });
    var count = values.length;
    var min = values[0];
    var max = values[count - 1];

    var cumulativeValues = [min];
    for (var i = 1; i < count; i++) {
      cumulativeValues.push(values[i] + cumulativeValues[i - 1]);
    }

    var sum = min;
    var mean = min;
    var maxAtThreshold = max;

    var message = "";

    var key2;

    for (key2 in pctThreshold) {
      var pct = pctThreshold[key2];
      if (count > 1) {
        var thresholdIndex = Math.round(((100 - pct) / 100) * count);
        var numInThreshold = count - thresholdIndex;

        maxAtThreshold = values[numInThreshold - 1];
        sum = cumulativeValues[numInThreshold - 1];
        mean = sum / numInThreshold;
      }

      var clean_pct = '' + pct;
      clean_pct.replace('.', '_'); // note: replace() returns a new string, so the result is discarded here
      message += 'stats.timers.' + key + '.mean_'  + clean_pct + ' ' + mean           + ' ' + ts + "\n";
      message += 'stats.timers.' + key + '.upper_' + clean_pct + ' ' + maxAtThreshold + ' ' + ts + "\n";
      message += 'stats.timers.' + key + '.sum_'   + clean_pct + ' ' + sum            + ' ' + ts + "\n";
    }

    sum = cumulativeValues[count - 1];
    mean = sum / count;

    message += 'stats.timers.' + key + '.upper ' + max   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.lower ' + min   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.count ' + count + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.sum '   + sum   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.mean '  + mean  + ' ' + ts + "\n";
    statString += message;

    numStats += 1;
  }
}
```
StatsD iterates over each timer and processes it if it has received at least one value. It sorts the array of values, counts them, and picks out the minimum and maximum values. An array of cumulative values is built and a few variables are assigned before it iterates over the percentile-threshold array, calculating the percentiles and appending the corresponding lines to the message. When the percentile calculation is done, the final sum and mean are computed and the complete message is appended to the statString variable.
If you send the following timer values to StatsD during the default flush interval
- 450
- 120
- 553
- 994
- 334
- 844
- 675
- 496
StatsD will calculate the following values
- mean_90 496
- upper_90 844
- sum_90 3472
- upper 994
- lower 120
- count 8
- sum 4466
- mean 558.25
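These numbers can be reproduced with a short Ruby sketch. The `timer_stats` helper is hypothetical, not StatsD code, but it follows the same algorithm as the JavaScript above:

```ruby
# Reimplementation sketch of StatsD's timer aggregation for one percentile.
def timer_stats(values, pct = 90)
  sorted = values.sort
  count  = sorted.length
  # Drop the top (100 - pct)% of samples, as StatsD does.
  threshold_index = (((100 - pct) / 100.0) * count).round
  kept            = sorted.first(count - threshold_index)

  {
    "mean_#{pct}"  => kept.reduce(:+) / kept.length.to_f,
    "upper_#{pct}" => kept.last,
    "sum_#{pct}"   => kept.reduce(:+),
    'upper'        => sorted.last,
    'lower'        => sorted.first,
    'count'        => count,
    'sum'          => sorted.reduce(:+),
    'mean'         => sorted.reduce(:+) / count.to_f
  }
end

stats = timer_stats([450, 120, 553, 994, 334, 844, 675, 496])
stats.each { |name, value| puts "#{name} #{value}" }
```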
Gauges
A gauge simply indicates an arbitrary value at a point in time and is the simplest metric type in StatsD. It just takes any number and ships it to the backend.
The source code for gauge stats is just four lines.
```javascript
for (key in gauges) {
  statString += 'stats.gauges.' + key + ' ' + gauges[key] + ' ' + ts + "\n";
  numStats += 1;
}
```
Feed StatsD a number and it sends it unprocessed to the backend. A thing to note is that only the last value of a gauge during a flush interval is flushed to the backend. That means that if you send the following gauge values to StatsD during a flush interval
- 643
- 754
- 583
The only value that gets flushed to the backend is 583. The value of this gauge will be kept in memory in StatsD and be sent to the backend at the end of every flush interval.
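That overwrite behavior is trivial to sketch (illustration only, not StatsD code):

```ruby
# Sketch: a gauge keeps only the most recent value within a flush interval.
gauge = nil
[643, 754, 583].each { |value| gauge = value }  # each value overwrites the last

puts gauge  # => 583
```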
Graphite
Now that we know how our data is sent from StatsD, let's take a look at how it is stored and processed in Graphite.
Overview
In the Graphite documentation we can find an overview. It sums up Graphite with these two simple points.
- Graphite stores numeric time-series data.
- Graphite renders graphs of this data on demand.
Graphite consists of three parts.
- carbon - a daemon that listens for time-series data.
- whisper - a simple database library for storing time-series data.
- webapp - a (Django) webapp that renders graphs on demand.
The format for time-series data in Graphite looks like this:

```
<metric path> <metric value> <metric timestamp>
```
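As a sketch, such a datapoint line can be built and shipped to carbon over plain TCP. The `carbon_line` helper is hypothetical, and carbon's default plaintext port 2003 is assumed:

```ruby
require 'socket'

# Build a datapoint in carbon's plaintext "path value timestamp" format.
def carbon_line(path, value, timestamp = Time.now.to_i)
  "#{path} #{value} #{timestamp}"
end

line = carbon_line('stats.gauges.user_registrations', 583, 1343038130)
puts line  # => stats.gauges.user_registrations 583 1343038130

# With a carbon daemon running locally, the line could be shipped like this:
# TCPSocket.open('localhost', 2003) { |sock| sock.puts line }
```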
Storage schemas
Graphite uses configurable storage schemas to define retention rates for storing data. It matches data paths against a pattern and specifies the frequency and history of the data to store.
The following configuration example is taken from the StatsD documentation.
```
[stats]
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
```
These retentions will be used for every entry with a key matching the defined pattern. The retention format is frequency:history, so this configuration lets us store 10-second data for 6 hours, 1-minute data for 1 week, and 10-minute data for 5 years.
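The durations follow directly from multiplying each frequency by its history; a quick sketch (not Graphite code):

```ruby
# Sketch: expand each "frequency:history" retention into a total duration.
retentions = '10:2160,60:10080,600:262974'

retentions.split(',').each do |retention|
  frequency, history = retention.split(':').map(&:to_i)
  hours = frequency * history / 3600.0
  puts "#{frequency}s datapoints kept for #{hours} hours"
end
```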
Visualizing a timer in Graphite
Knowing all this, we can now take a look at my simple Ruby script that collects timings for HTTP requests.
```ruby
#!/usr/bin/env ruby

require 'rubygems' if RUBY_VERSION < '1.9.0'
require './statsdclient.rb'
require 'typhoeus'

Statsd.host = 'localhost'
Statsd.port = 8125

def to_ms time
  (1000 * time).to_i
end

while true
  start_time = Time.now.to_f

  resp = Typhoeus::Request.get ''

  end_time = Time.now.to_f

  elapsed_time = (1000 * end_time) - (to_ms start_time)
  response_time = to_ms resp.time
  start_transfer_time = to_ms resp.start_transfer_time
  app_connect_time = to_ms resp.app_connect_time
  pretransfer_time = to_ms resp.pretransfer_time
  connect_time = to_ms resp.connect_time
  name_lookup_time = to_ms resp.name_lookup_time

  Statsd.timing('http_request.elapsed_time', elapsed_time)
  Statsd.timing('http_request.response_time', response_time)
  Statsd.timing('http_request.start_transfer_time', start_transfer_time)
  Statsd.timing('http_request.app_connect_time', app_connect_time)
  Statsd.timing('http_request.pretransfer_time', pretransfer_time)
  Statsd.timing('http_request.connect_time', connect_time)
  Statsd.timing('http_request.name_lookup_time', name_lookup_time)

  sleep 10
end
```
Let's take a look at the visualized Graphite render of this data. The data is from the last 2 minutes, using the elapsed_time target from the script above.
Image visualization
Render URL used for the image below:

```
/render/?width=586&height=308&from=-2minutes&target=stats.timers.http_request.elapsed_time.sum
```
Rendered image from Graphite, a simple graph visualizing elapsed_time for http requests over time.
JSON-data
Render URL used for the JSON data below:

```
/render/?width=586&height=308&from=-2minutes&target=stats.timers.http_request.elapsed_time.sum&format=json
```
In the results below, we can see the raw data from Graphite. It contains 12 data points, which corresponds to 2 minutes with the StatsD 10-second flush interval. It really is this simple; Graphite just visualizes its data.
The JSON data has been beautified for viewing purposes.
```json
[
  {
    "target": "stats.timers.http_request.elapsed_time.sum",
    "datapoints": [
      [53.449951171875, 1343038130],
      [50.3916015625, 1343038140],
      [50.1357421875, 1343038150],
      [39.601806640625, 1343038160],
      [41.5263671875, 1343038170],
      [34.3974609375, 1343038180],
      [36.3818359375, 1343038190],
      [35.009033203125, 1343038200],
      [37.0087890625, 1343038210],
      [38.486572265625, 1343038220],
      [45.66064453125, 1343038230],
      [null, 1343038240]
    ]
  }
]
```
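Consuming such a response is straightforward; here is a sketch using a shortened, hypothetical sample (note that null datapoints must be filtered out):

```ruby
require 'json'

# A shortened, hypothetical Graphite JSON response.
raw = '[{"target": "stats.timers.http_request.elapsed_time.sum",
         "datapoints": [[53.4, 1343038130], [50.3, 1343038140], [null, 1343038150]]}]'

series = JSON.parse(raw).first
values = series['datapoints'].map { |value, _timestamp| value }.compact

puts values.length                      # => 2
puts values.reduce(:+) / values.length  # average of the non-null values
```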
Visualizing a gauge in Graphite
The following simple script ships a gauge to StatsD, simulating a number of
user registrations.
```ruby
#!/usr/bin/env ruby

require './statsdclient.rb'

Statsd.host = 'localhost'
Statsd.port = 8125

user_registrations = 1

while true
  user_registrations += Random.rand 128
  Statsd.gauge('user_registrations', user_registrations)
  sleep 10
end
```
Image visualization - Number of user registrations
Render URL used for the image below:

```
/render/?width=586&height=308&from=-20minutes&target=stats.gauges.user_registrations
```
Another simple graph, just showing the total number of registrations.
Image visualization - Number of user registrations per minute
By using the derivative function in Graphite, we can get the number of user registrations per minute.
Render URL used for the image below:

```
/render/?width=586&height=308&from=-20minutes&target=derivative(stats.gauges.user_registrations)
```
A graph based on the same data as above, but with the derivative function applied to visualize a per-minute rate.
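Graphite's derivative function essentially takes the difference between consecutive datapoints; a sketch of the idea (not Graphite's actual code):

```ruby
# Sketch: derivative() as per-point change between consecutive values.
def derivative(series)
  previous = nil
  series.map do |value|
    change = previous.nil? ? nil : value - previous
    previous = value
    change
  end
end

p derivative([1, 4, 9, 16])  # => [nil, 3, 5, 7]
```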
Conclusion
Knowing more about how StatsD and Graphite work makes it a lot easier to know what kind of data to ship to StatsD, how to ship it, and how to read the data back out of Graphite.
Got any comments or questions? Let me know in the comment section below.