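A few days ago I posted an article on monitoring a Storm cluster with Ganglia. This post covers using Ganglia to monitor a MongoDB cluster, because we want Ganglia to be the single monitoring system for everything.
1. Ganglia's extension mechanism
To monitor a MongoDB cluster with Ganglia, you first need to understand how Ganglia is extended. Ganglia plugins give us two ways to extend its monitoring:
1) Adding in-band plugins, mainly through the gmetric command.
This is the most common approach: a cron job calls Ganglia's gmetric command to feed data to gmond, which brings the data into the unified monitoring. It is simple and works well for a small number of metrics, but with large-scale custom monitoring the data becomes hard to manage in one place.
2) Adding extra scripts that monitor the system, mainly through the C or Python interface.
Since Ganglia 3.1.x there is a C and a Python interface. With it you can write custom data collection modules that plug directly into gmond to monitor your own applications.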
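To make the first approach concrete, here is a minimal sketch of a script that a cron job could run to push one value through gmetric. It assumes the gmetric binary is on PATH; the metric name and value used in the note below are made up for illustration.

```python
import subprocess


def build_gmetric_command(name, value, value_type="uint32", units=""):
    """Build the argument list for a single gmetric call."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", value_type]
    if units:
        cmd += ["--units", units]
    return cmd


def send_metric(name, value, value_type="uint32", units=""):
    """Publish one sample through gmetric; raises if the command fails."""
    subprocess.check_call(build_gmetric_command(name, value, value_type, units))
```

A cron entry would then call something like send_metric("mongodb_connections_current", 42, units="Connections") once a minute. Every new metric needs its own call and its own schedule, which is why this approach gets hard to manage as the number of metrics grows.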
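2. Monitoring MongoDB with a Python script
We monitor the MongoDB cluster with a Python script, since extending Ganglia through Python is convenient: to collect more data, you only add the metric to the corresponding .py script. It is easy to extend and easy to port.
2.1 Environment setup
To extend Ganglia monitoring with Python scripts, first confirm that the modpython.so file exists. This file is the shared library through which Ganglia calls Python, and it must be compiled and installed before you can develop Ganglia plugins against the Python interface. modpython.so is located in lib/ganglia/ (or lib64/ganglia/) under the Ganglia install directory. If it is there, you can go on to write the scripts below. If it is not, you need to recompile and reinstall gmond with the "--with-python" option.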
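The check described above can be scripted. The sketch below assumes the install prefix /usr/local/ganglia that appears in the configuration files later in this post; adjust it to your own prefix.

```python
import os


def find_modpython(prefix="/usr/local/ganglia"):
    """Return the path of modpython.so under prefix, or None if it is absent."""
    for libdir in ("lib64", "lib"):
        candidate = os.path.join(prefix, libdir, "ganglia", "modpython.so")
        if os.path.isfile(candidate):
            return candidate
    return None
```

If find_modpython() returns None, gmond was most likely built without Python support and needs to be rebuilt as described above.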
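2.2 Writing the monitoring scripts
Open etc/gmond.conf under the Ganglia install directory. In the client configuration you will find the line include ("/usr/local/ganglia/etc/conf.d/*.conf"), which means gmond scans that directory for monitoring configuration files. Our monitoring configuration therefore belongs in etc/conf.d/. Configurations for Python modules use the .pyconf extension and are loaded through modpython.conf, shown next, so the MongoDB configuration file will be named mongodb.pyconf.
1) Check the modpython.conf file
modpython.conf is located in etc/conf.d/. Its contents are: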
modules {
  module {
    name = "python_module"    # main module name
    path = "modpython.so"     # shared library Ganglia needs to run Python extension scripts
    params = "/usr/local/ganglia/lib64/ganglia/python_modules"    # directory holding the Python scripts
  }
}

include ("/usr/local/ganglia/etc/conf.d/*.pyconf")    # where the Python extension configuration files live
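So to monitor MongoDB through a Python extension, put the configuration file and the .py script in the corresponding directories and restart the Ganglia service. The following sections show how to write both files.
2) Create the mongodb.pyconf file
Note that you need root privileges to create and edit this file. Save it in the conf.d directory. Which MongoDB metrics to collect is up to you; add or remove entries to suit your needs.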
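As an illustration of what such a file can look like, the sketch below renders a small mongodb.pyconf from a list of metric names. The module name "mongodb" has to match the file name of the Python script without its extension. The collect_every and time_threshold values and the helper render_pyconf are assumptions made for this example, not settings from the original setup.

```python
def render_pyconf(metrics, module="mongodb", collect_every=30, time_threshold=90):
    """Render a .pyconf file for the given (metric name, title) pairs."""
    lines = [
        "modules {",
        "  module {",
        '    name = "%s"' % module,
        '    language = "python"',
        "  }",
        "}",
        "",
        "collection_group {",
        "  collect_every = %d" % collect_every,
        "  time_threshold = %d" % time_threshold,
    ]
    for name, title in metrics:
        lines += [
            "  metric {",
            '    name = "%s"' % name,
            '    title = "%s"' % title,
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling render_pyconf([("mongodb_opcounters_insert", "Inserts"), ("mongodb_mem_resident", "Process Resident Size")]) returns text that can be saved as mongodb.pyconf. Each metric name must match a name returned by the Python module.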
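This configuration file uses the same syntax as gmond.conf, so if anything is unclear, refer to how gmond.conf is written.
3) Create the mongodb.py script
Save mongodb.py in the lib64/ganglia/python_modules directory. That directory already contains many Python scripts, for example for monitoring disk, memory, network, MySQL and Redis, and we can model mongodb.py on them. If you open a few of them, you will see that each one defines a function metric_init(params); the parameters set in mongodb.pyconf are passed to this function.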
#!/usr/bin/env python
import json
import os
import re
import socket
import string
import time
import copy
from functools import reduce

NAME_PREFIX = 'mongodb_'
# Replace /path/to/bin/mongo and host with the values for your own installation.
PARAMS = {
    'server_status' : '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(db.serverStatus())"',
    'rs_status' : '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(rs.status())"'
}
METRICS = {
    'time' : 0,
    'data' : {}
}
LAST_METRICS = copy.deepcopy(METRICS)
METRICS_CACHE_TTL = 3

def flatten(d, pre = '', sep = '_'):
    """Flatten a dict (i.e. dict['a']['b']['c'] => dict['a_b_c'])"""
    new_d = {}
    for k,v in d.items():
        if type(v) == dict:
            new_d.update(flatten(d[k], '%s%s%s' % (pre, k, sep)))
        else:
            new_d['%s%s' % (pre, k)] = v
    return new_d

def get_metrics():
    """Return all metrics"""
    global METRICS, LAST_METRICS
    if (time.time() - METRICS['time']) > METRICS_CACHE_TTL:
        metrics = {}
        for status_type in PARAMS.keys():
            # get raw metric data
            o = os.popen(PARAMS[status_type])
            # clean up
            metrics_str = ''.join(o.readlines()).strip() # convert to string
            metrics_str = re.sub(r'\w+\((.*)\)', r"\1", metrics_str) # remove functions
            # convert to flattened dict
            try:
                if status_type == 'server_status':
                    metrics.update(flatten(json.loads(metrics_str)))
                else:
                    metrics.update(flatten(json.loads(metrics_str), pre='%s_' % status_type))
            except ValueError:
                metrics = {}

        # update cache
        LAST_METRICS = copy.deepcopy(METRICS)
        METRICS = {
            'time': time.time(),
            'data': metrics
        }
    return [METRICS, LAST_METRICS]

def get_value(name):
    """Return a value for the requested metric"""
    # get metrics
    metrics = get_metrics()[0]
    # get value
    name = name[len(NAME_PREFIX):] # remove prefix from name
    try:
        result = metrics['data'][name]
    except Exception:
        result = 0
    return result

def get_rate(name):
    """Return change over time for the requested metric"""
    # get metrics
    [curr_metrics, last_metrics] = get_metrics()
    # get rate
    name = name[len(NAME_PREFIX):] # remove prefix from name
    try:
        rate = float(curr_metrics['data'][name] - last_metrics['data'][name]) / \
               float(curr_metrics['time'] - last_metrics['time'])
        if rate < 0:
            rate = float(0)
    except Exception:
        rate = float(0)
    return rate

def get_opcounter_rate(name):
    """Return change over time for an opcounter metric"""
    master_rate = get_rate(name)
    repl_rate = get_rate(name.replace('opcounters_', 'opcountersRepl_'))
    return master_rate + repl_rate

def get_globalLock_ratio(name):
    """Return the global lock ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'globalLock_lockTime') / \
                 get_rate(NAME_PREFIX + 'globalLock_totalTime') * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_indexCounters_btree_miss_ratio(name):
    """Return the btree miss ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'indexCounters_btree_misses') / \
                 get_rate(NAME_PREFIX + 'indexCounters_btree_accesses') * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_connections_current_ratio(name):
    """Return the percentage of connections used"""
    try:
        result = float(get_value(NAME_PREFIX + 'connections_current')) / \
                 float(get_value(NAME_PREFIX + 'connections_available')) * 100
    except ZeroDivisionError:
        result = 0
    return result

def get_slave_delay(name):
    """Return the replica set slave delay"""
    # get metrics
    metrics = get_metrics()[0]
    # no point checking my optime if i'm not replicating
    if 'rs_status_myState' not in metrics['data'] or metrics['data']['rs_status_myState'] != 2:
        result = 0
    # compare my optime with the master's
    else:
        master = {}
        slave = {}
        try:
            for member in metrics['data']['rs_status_members']:
                if member['state'] == 1:
                    master = member
                if member['name'].split(':')[0] == socket.getfqdn():
                    slave = member
            result = max(0, master['optime']['t'] - slave['optime']['t']) / 1000
        except KeyError:
            result = 0
    return result

def get_asserts_total_rate(name):
    """Return the total number of asserts per second"""
    return float(reduce(lambda memo,obj: memo + get_rate('%sasserts_%s' % (NAME_PREFIX, obj)),
                        ['regular', 'warning', 'msg', 'user', 'rollovers'], 0))

def metric_init(lparams):
    """Initialize metric descriptors"""
    global PARAMS
    # set parameters
    for key in lparams:
        PARAMS[key] = lparams[key]
    # define descriptors
    time_max = 60
    groups = 'mongodb'
    descriptors = [
        {
            'name': NAME_PREFIX + 'opcounters_insert',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Inserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Inserts',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_query',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Queries/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Queries',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_update',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Updates/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Updates',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_delete',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Deletes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Deletes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_getmore',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Getmores/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Getmores',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_command',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Commands/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Commands',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'backgroundFlushing_flushes',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Flushes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Flushes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_mapped',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Memory-mapped Data',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_virtual',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Virtual Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_resident',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Resident Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'extra_info_page_faults',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Faults/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Page Faults',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_ratio',
            'call_back': get_globalLock_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Global Write Lock Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'indexCounters_btree_miss_ratio',
            'call_back': get_indexCounters_btree_miss_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'BTree Page Miss Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Operations Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Readers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Writers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Active Clients',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Readers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Writers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Connections',
            'slope': 'both',
            'format': '%u',
            'description': 'Open Connections',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current_ratio',
            'call_back': get_connections_current_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Percentage of Connections Used',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'slave_delay',
            'call_back': get_slave_delay,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Seconds',
            'slope': 'both',
            'format': '%u',
            'description': 'Replica Set Slave Delay',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'asserts_total',
            'call_back': get_asserts_total_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Asserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Asserts',
            'groups': groups
        }
    ]
    return descriptors

def metric_cleanup():
    """Cleanup"""
    pass

# the following code is for debugging and testing
if __name__ == '__main__':
    descriptors = metric_init(PARAMS)
    while True:
        for d in descriptors:
            print(('%s = %s' % (d['name'], d['format'])) % (d['call_back'](d['name'])))
        print('')
        time.sleep(METRICS_CACHE_TTL)
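The least obvious step in the script is how it turns the mongo shell output into metric names. The sketch below isolates that step: it strips shell-only wrappers such as NumberLong(...) and ISODate(...) with the same regular expression, parses the result as JSON, and flattens the nested document. The sample text is invented for illustration and is far shorter than real serverStatus output.

```python
import json
import re


def flatten(d, pre='', sep='_'):
    """Flatten nested dicts: {'a': {'b': 1}} becomes {'a_b': 1}."""
    flat = {}
    for key, value in d.items():
        if isinstance(value, dict):
            flat.update(flatten(value, '%s%s%s' % (pre, key, sep), sep))
        else:
            flat['%s%s' % (pre, key)] = value
    return flat


def parse_shell_output(text):
    """Remove function-style wrappers, then parse and flatten the JSON."""
    cleaned = re.sub(r'\w+\((.*)\)', r'\1', text)
    return flatten(json.loads(cleaned))


# Made-up sample in the style of printjson(db.serverStatus()).
SAMPLE = '''{
    "localTime" : ISODate("2014-01-01T00:00:00Z"),
    "globalLock" : {
        "lockTime" : NumberLong(12345),
        "currentQueue" : { "total" : 0 }
    },
    "mem" : { "resident" : 10 }
}'''

flat = parse_shell_output(SAMPLE)
```

Here flat contains keys such as globalLock_lockTime and mem_resident. Adding the mongodb_ prefix gives the metric names used in the descriptors. The regular expression relies on printjson placing at most one wrapped value per line.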
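A Python extension script must define two functions: metric_init(params) and metric_cleanup().
metric_init() is called when the module is initialized. It must return a metric descriptor dictionary or a list of such dictionaries; mongodb.py returns a list.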
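The metric dictionary is defined as follows:
d = {'name': '<metric name>',    # this name must match the name in the pyconf file
     'call_back': <callback function>,
     'time_max': int(<maximum seconds between collections>),
     'value_type': '<string | uint | float | double>',
     'units': '<units>',
     'slope': '<zero | positive | negative | both>',
     'format': '<printf-style format string>',
     'description': '<description>'
}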
metric_cleanup() is called when the module shuts down and returns nothing.
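Put together, the smallest module that satisfies this contract looks roughly like the sketch below. The metric name example_random and its callback are hypothetical and only show the shape of the interface.

```python
import random


def sample_value(name):
    """Callback: receives the metric name and returns the current value."""
    return random.randint(0, 100)


def metric_init(params):
    """Called once when the module loads; returns the metric descriptors."""
    descriptor = {
        'name': 'example_random',    # hypothetical; must match the pyconf entry
        'call_back': sample_value,
        'time_max': 60,
        'value_type': 'uint',
        'units': 'N',
        'slope': 'both',
        'format': '%u',
        'description': 'Random number between 0 and 100',
        'groups': 'example',
    }
    return [descriptor]


def metric_cleanup():
    """Called once when the module shuts down; nothing to release here."""
    pass
```

The callback is invoked with the metric name each time a value is collected, which is how mongodb.py serves many metrics from a few shared functions.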
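4) View the monitoring statistics in the web UI
After finishing the scripts, restart the gmond service. The new metrics then appear in the Ganglia web front end under the mongodb group.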
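Before opening the web UI, you can check that gmond is reporting the new metrics by reading the XML it serves on its TCP port. The sketch below assumes gmond listens on localhost at the default port 8649; the helper names are made up for this example.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649):
    """Read the XML document gmond sends to any client that connects."""
    chunks = []
    sock = socket.create_connection((host, port), timeout=5)
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def metrics_with_prefix(xml_text, prefix="mongodb_"):
    """Return {metric name: value string} for metrics starting with prefix."""
    root = ET.fromstring(xml_text)
    return dict(
        (m.get("NAME"), m.get("VAL"))
        for m in root.iter("METRIC")
        if m.get("NAME", "").startswith(prefix)
    )
```

Running metrics_with_prefix(read_gmond_xml()) on a monitored host should list the mongodb_ metrics once gmond has collected them at least once. An empty result usually points to a mismatch between the names in mongodb.pyconf and those returned by metric_init().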