Reposted from: http://blog.pkhamre.com/2012/07/24/understanding-statsd-and-graphite/
After a short conversation on the #logstash channel at Freenode, I realized that I did not know how my data was sent and how it was stored in Graphite. I knew that StatsD collects and aggregates my metrics, and that it ships them off to Graphite, which stores the time-series data and enables us to render graphs based on it.
What I did not know was whether my HTTP-access graphs displayed requests per second, average requests per retention period, or something else.
It was time to research how these things worked in order to get a complete understanding.
StatsD
To get a full understanding of how StatsD works, I started to read the source code. I knew StatsD was a simple application, but I did not know it was this simple. Just over 300 lines of code in one file, and around 150 lines in another.
Concepts in StatsD
StatsD has a few concepts listed in the documentation that should be understood.
Buckets
Each stat is in its own “bucket”. They are not predefined anywhere. Buckets can be named anything that will translate to Graphite (periods make folders, etc)
Values
Each stat will have a value. How it is interpreted depends on modifiers. In general, values should be integers.
Flush interval
After the flush interval timeout (default 10 seconds), stats are aggregated and sent to an upstream backend service.
Metric types
Counters
Counters are simple. A counter adds a value to a bucket, where it stays in memory until the flush interval.
Let's take a look at the source code that generates the counter stats that get flushed to the backend.
```javascript
for (key in counters) {
  var value = counters[key];
  var valuePerSecond = value / (flushInterval / 1000); // calculate "per second" rate

  statString += 'stats.'        + key + ' ' + valuePerSecond + ' ' + ts + "\n";
  statString += 'stats_counts.' + key + ' ' + value          + ' ' + ts + "\n";

  numStats += 1;
}
```
First, StatsD iterates over the counters it has received, assigning two variables for each: one holds the counter value, and one holds the per-second rate. It then adds both values to statString and increments the numStats variable.
With the default flush interval of 10 seconds, sending StatsD 7 increments on a counter within one interval gives a counter value of 7 and a per-second value of 0.7. No magic.
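The arithmetic above can be sketched in a few lines of Ruby, using the example values (this is an illustration, not StatsD code):

```ruby
# Sketch of the counter flush math described above, using the example values.
flush_interval = 10_000                        # default flush interval, in ms
counter = 7                                    # seven increments received
value_per_second = counter / (flush_interval / 1000.0)

puts counter           # => 7
puts value_per_second  # => 0.7
```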
Timers
Timers collect numbers. They do not necessarily need to contain a value of time. You can collect bytes read, the number of objects in some storage, or anything else that is a number. A good thing about timers is that you get the mean, the sum, the count, and the upper and lower values for free. Feed StatsD a timer and this gets calculated automatically before it is flushed to Graphite. Oh, I almost forgot to mention that you also get the 90th percentile calculated for the mean, sum, and upper values. You can also configure StatsD with an array of percentiles, which means you can get the 50th, 90th, and 95th percentiles calculated for you if you want.
The source code for timer stats is a bit more advanced than the code for the
counters.
```javascript
for (key in timers) {
  if (timers[key].length > 0) {
    var values = timers[key].sort(function (a, b) { return a - b; });
    var count = values.length;
    var min = values[0];
    var max = values[count - 1];

    var cumulativeValues = [min];
    for (var i = 1; i < count; i++) {
      cumulativeValues.push(values[i] + cumulativeValues[i - 1]);
    }

    var sum = min;
    var mean = min;
    var maxAtThreshold = max;

    var message = "";

    var key2;

    for (key2 in pctThreshold) {
      var pct = pctThreshold[key2];
      if (count > 1) {
        var thresholdIndex = Math.round(((100 - pct) / 100) * count);
        var numInThreshold = count - thresholdIndex;

        maxAtThreshold = values[numInThreshold - 1];
        sum = cumulativeValues[numInThreshold - 1];
        mean = sum / numInThreshold;
      }

      var clean_pct = '' + pct;
      clean_pct.replace('.', '_'); // note: replace() returns a new string, so the result is discarded here
      message += 'stats.timers.' + key + '.mean_'  + clean_pct + ' ' + mean           + ' ' + ts + "\n";
      message += 'stats.timers.' + key + '.upper_' + clean_pct + ' ' + maxAtThreshold + ' ' + ts + "\n";
      message += 'stats.timers.' + key + '.sum_'   + clean_pct + ' ' + sum            + ' ' + ts + "\n";
    }

    sum = cumulativeValues[count - 1];
    mean = sum / count;

    message += 'stats.timers.' + key + '.upper ' + max   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.lower ' + min   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.count ' + count + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.sum '   + sum   + ' ' + ts + "\n";
    message += 'stats.timers.' + key + '.mean '  + mean  + ' ' + ts + "\n";
    statString += message;

    numStats += 1;
  }
}
```
StatsD iterates over each timer and processes it if it has received at least one value. It sorts the array of values, counts them, and picks out the minimum and maximum values. An array of cumulative values is built and a few variables are assigned before it iterates over the percentile-threshold array, calculating the percentiles and appending the corresponding lines to the message. When the percentile calculation is done, the final sum and mean are computed and the complete message is appended to the statString variable.
If you send the following timer values to StatsD during the default flush interval
- 450
- 120
- 553
- 994
- 334
- 844
- 675
- 496
StatsD will calculate the following values
- mean_90 496
- upper_90 844
- sum_90 3472
- upper 994
- lower 120
- count 8
- sum 4466
- mean 558.25
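These numbers can be reproduced with a short Ruby sketch. The `timer_stats` helper is hypothetical, not StatsD code, but it follows the same algorithm as the JavaScript above:

```ruby
# Reimplementation sketch of StatsD's timer aggregation for one percentile.
def timer_stats(values, pct = 90)
  sorted = values.sort
  count  = sorted.length
  # Drop the top (100 - pct)% of samples, as StatsD does.
  threshold_index = (((100 - pct) / 100.0) * count).round
  kept            = sorted.first(count - threshold_index)

  {
    "mean_#{pct}"  => kept.reduce(:+) / kept.length.to_f,
    "upper_#{pct}" => kept.last,
    "sum_#{pct}"   => kept.reduce(:+),
    'upper'        => sorted.last,
    'lower'        => sorted.first,
    'count'        => count,
    'sum'          => sorted.reduce(:+),
    'mean'         => sorted.reduce(:+) / count.to_f
  }
end

stats = timer_stats([450, 120, 553, 994, 334, 844, 675, 496])
stats.each { |name, value| puts "#{name} #{value}" }
```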
Gauges
A gauge simply indicates an arbitrary value at a point in time and is the simplest metric type in StatsD. It just takes any number and ships it to the backend.
The source code for gauge stats is just four lines.
```javascript
for (key in gauges) {
  statString += 'stats.gauges.' + key + ' ' + gauges[key] + ' ' + ts + "\n";
  numStats += 1;
}
```
Feed StatsD a number and it sends it unprocessed to the backend. A thing to note is that only the last value of a gauge during a flush interval is flushed to the backend. That means that if you send the following gauge values to StatsD during a flush interval
- 643
- 754
- 583
The only value that gets flushed to the backend is 583. The value of this gauge will be kept in memory in StatsD and be sent to the backend at the end of every flush interval.
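That overwrite behavior is trivial to sketch (illustration only, not StatsD code):

```ruby
# Sketch: a gauge keeps only the most recent value within a flush interval.
gauge = nil
[643, 754, 583].each { |value| gauge = value }  # each value overwrites the last

puts gauge  # => 583
```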
Graphite
Now that we know how our data is sent from StatsD, let's take a look at how it is stored and processed in Graphite.
Overview
In the Graphite documentation we can find an overview. It sums up Graphite with these two simple points.
- Graphite stores numeric time-series data.
- Graphite renders graphs of this data on demand.
Graphite consists of three parts.
- carbon - a daemon that listens for time-series data.
- whisper - a simple database library for storing time-series data.
- webapp - a (Django) webapp that renders graphs on demand.
The format for time-series data in Graphite looks like this:

```
<metric path> <metric value> <metric timestamp>
```
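As a sketch, such a datapoint line can be built and shipped to carbon over plain TCP. The `carbon_line` helper is hypothetical, and carbon's default plaintext port 2003 is assumed:

```ruby
require 'socket'

# Build a datapoint in carbon's plaintext "path value timestamp" format.
def carbon_line(path, value, timestamp = Time.now.to_i)
  "#{path} #{value} #{timestamp}"
end

line = carbon_line('stats.gauges.user_registrations', 583, 1343038130)
puts line  # => stats.gauges.user_registrations 583 1343038130

# With a carbon daemon running locally, the line could be shipped like this:
# TCPSocket.open('localhost', 2003) { |sock| sock.puts line }
```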
Storage schemas
Graphite uses configurable storage schemas to define retention rates for storing data. It matches data paths against a pattern and specifies the frequency and history of the data to store.
The following configuration example is taken from the StatsD documentation.
```
[stats]
pattern = ^stats\..*
retentions = 10:2160,60:10080,600:262974
```
These retentions will be used for every entry with a key matching the defined pattern. The retention format is frequency:history, so this configuration lets us store 10-second data for 6 hours, 1-minute data for 1 week, and 10-minute data for 5 years.
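The durations follow directly from multiplying each frequency by its history; a quick sketch (not Graphite code):

```ruby
# Sketch: expand each "frequency:history" retention into a total duration.
retentions = '10:2160,60:10080,600:262974'

retentions.split(',').each do |retention|
  frequency, history = retention.split(':').map(&:to_i)
  hours = frequency * history / 3600.0
  puts "#{frequency}s datapoints kept for #{hours} hours"
end
```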
Visualizing a timer in Graphite
Knowing all this, we can now take a look at my simple Ruby script that collects timings for HTTP requests.
```ruby
#!/usr/bin/env ruby

require 'rubygems' if RUBY_VERSION < '1.9.0'
require './statsdclient.rb'
require 'typhoeus'

Statsd.host = 'localhost'
Statsd.port = 8125

def to_ms time
  (1000 * time).to_i
end

while true
  start_time = Time.now.to_f

  resp = Typhoeus::Request.get ''

  end_time = Time.now.to_f

  elapsed_time = (1000 * end_time) - (to_ms start_time)
  response_time = to_ms resp.time
  start_transfer_time = to_ms resp.start_transfer_time
  app_connect_time = to_ms resp.app_connect_time
  pretransfer_time = to_ms resp.pretransfer_time
  connect_time = to_ms resp.connect_time
  name_lookup_time = to_ms resp.name_lookup_time

  Statsd.timing('http_request.elapsed_time', elapsed_time)
  Statsd.timing('http_request.response_time', response_time)
  Statsd.timing('http_request.start_transfer_time', start_transfer_time)
  Statsd.timing('http_request.app_connect_time', app_connect_time)
  Statsd.timing('http_request.pretransfer_time', pretransfer_time)
  Statsd.timing('http_request.connect_time', connect_time)
  Statsd.timing('http_request.name_lookup_time', name_lookup_time)

  sleep 10
end
```
Let's take a look at the visualized Graphite render of this data. The data is from the last 2 minutes, using the elapsed_time target from the script above.
Image visualization
Render URL used for the image below:

```
/render/?width=586&height=308&from=-2minutes&target=stats.timers.http_request.elapsed_time.sum
```
Rendered image from Graphite, a simple graph visualizing elapsed_time for http requests over time.
JSON-data
Render URL used for the JSON data below:

```
/render/?width=586&height=308&from=-2minutes&target=stats.timers.http_request.elapsed_time.sum&format=json
```
In the results below, we can see the raw data from Graphite. It contains 12 data points, which corresponds to 2 minutes with the StatsD 10-second flush interval. It really is this simple; Graphite just visualizes its data.
The JSON data has been beautified for viewing purposes.
```json
[
  {
    "target": "stats.timers.http_request.elapsed_time.sum",
    "datapoints": [
      [53.449951171875, 1343038130],
      [50.3916015625, 1343038140],
      [50.1357421875, 1343038150],
      [39.601806640625, 1343038160],
      [41.5263671875, 1343038170],
      [34.3974609375, 1343038180],
      [36.3818359375, 1343038190],
      [35.009033203125, 1343038200],
      [37.0087890625, 1343038210],
      [38.486572265625, 1343038220],
      [45.66064453125, 1343038230],
      [null, 1343038240]
    ]
  }
]
```
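Consuming such a response is straightforward; here is a sketch using a shortened, hypothetical sample (note that null datapoints must be filtered out):

```ruby
require 'json'

# A shortened, hypothetical Graphite JSON response.
raw = '[{"target": "stats.timers.http_request.elapsed_time.sum",
         "datapoints": [[53.4, 1343038130], [50.3, 1343038140], [null, 1343038150]]}]'

series = JSON.parse(raw).first
values = series['datapoints'].map { |value, _timestamp| value }.compact

puts values.length                      # => 2
puts values.reduce(:+) / values.length  # average of the non-null values
```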
Visualizing a gauge in Graphite
The following simple script ships a gauge to StatsD, simulating a number of
user registrations.
```ruby
#!/usr/bin/env ruby

require './statsdclient.rb'

Statsd.host = 'localhost'
Statsd.port = 8125

user_registrations = 1

while true
  user_registrations += Random.rand 128
  Statsd.gauge('user_registrations', user_registrations)
  sleep 10
end
```
Image visualization - Number of user registrations
Render URL used for the image below:

```
/render/?width=586&height=308&from=-20minutes&target=stats.gauges.user_registrations
```
Another simple graph, just showing the total number of registrations.
Image visualization - Number of user registrations per minute
By using the derivative function in Graphite, we can get the number of user registrations per minute.
Render URL used for the image below:

```
/render/?width=586&height=308&from=-20minutes&target=derivative(stats.gauges.user_registrations)
```
A graph based on the same data as above, but with the derivative function applied to visualize a per-minute rate.
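Graphite's derivative function essentially takes the difference between consecutive datapoints; a sketch of the idea (not Graphite's actual code):

```ruby
# Sketch: derivative() as per-point change between consecutive values.
def derivative(series)
  previous = nil
  series.map do |value|
    change = previous.nil? ? nil : value - previous
    previous = value
    change
  end
end

p derivative([1, 4, 9, 16])  # => [nil, 3, 5, 7]
```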
Conclusion
Knowing more about how StatsD and Graphite work makes it a lot easier to know what kind of data to ship to StatsD, how to ship it, and how to read the data back out of Graphite.
Got any comments or questions? Let me know in the comment section below.