Monitoring a MongoDB Cluster with Ganglia - chinaboywg - ChinaUnix Blog

Monitoring a MongoDB Cluster with Ganglia

Category: System Operations

2015-04-21 15:39:18

Original post: Monitoring a MongoDB Cluster with Ganglia. Author: yueys_canedy

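A few days ago I posted an article on monitoring a Storm cluster with Ganglia. This post covers monitoring a MongoDB cluster with Ganglia, since we want Ganglia to be our single monitoring system.
1. Ganglia's extension mechanism
To monitor a MongoDB cluster with Ganglia, you first need to understand how Ganglia is extended. Ganglia plugins give us two ways to add monitoring capabilities:

1) Adding in-band plugins, mainly through the gmetric command.

This is the commonly used method: a cron job calls Ganglia's gmetric command to feed data to gmond, which brings the data into the unified monitoring view. It is simple and works well for a small number of metrics, but with large-scale custom monitoring the data becomes hard to manage consistently.

2) Adding extra scripts that monitor the system, mainly through the C or Python interface.

    Starting with Ganglia 3.1.x, a C interface and a Python interface are available. With them you can write custom data collection modules that plug directly into gmond to monitor your own applications.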
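
As a rough illustration of the first method, the sketch below builds and runs one gmetric call from Python, the kind of script a cron job could invoke. The helper names build_gmetric_cmd and send_metric are hypothetical, and the sketch assumes that gmetric is on the PATH and that your Ganglia build accepts the long options --name, --value, --type and --units.

```python
import subprocess


def build_gmetric_cmd(name, value, value_type="uint32", units=""):
    """Build the argument list for one gmetric call (hypothetical helper)."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", value_type]
    if units:
        cmd += ["--units", units]
    return cmd


def send_metric(name, value, value_type="uint32", units=""):
    """Run gmetric and return its exit code.

    Assumes the gmetric binary is on PATH and can reach gmond.
    """
    return subprocess.call(build_gmetric_cmd(name, value, value_type, units))
```

A cron job would call something like send_metric("mongodb_connections_demo", 42, "uint32", "Connections"), where the metric name and value are made up for illustration. Every metric needs its own call, which is why this approach gets hard to manage as the number of metrics grows.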
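2. Monitoring MongoDB with a Python script
    We use a Python script to monitor the MongoDB cluster. Extending Ganglia through Python is convenient: to add a metric you only edit the corresponding .py script, and the module is easy to extend and to port.
2.1 Environment setup
    To extend Ganglia with a Python script, first check that modpython.so exists. This file is the shared library through which gmond calls Python, and it must be compiled and installed before you can develop Ganglia plugins against the Python interface. modpython.so lives in lib/ganglia/ (or lib64/ganglia/) under the Ganglia installation directory. If it is there, go on to write the scripts below. If it is not, rebuild and reinstall gmond with the "--with-python" option.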
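2.2 Writing the monitoring scripts
   Open etc/gmond.conf under the Ganglia installation directory. The client configuration contains include ("/usr/local/ganglia/etc/conf.d/*.conf"), which means gmond scans that directory for monitoring configuration files, so our configuration has to go in etc/conf.d/. Python modules are loaded through modpython.conf in that directory, which in turn includes the *.pyconf files, so the MongoDB configuration file is named mongodb.pyconf.
    1) Look at modpython.conf
    modpython.conf is located in the etc/conf.d/ directory. Its contents are: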

modules {
  module {
    name = "python_module" # name of the main module
    path = "modpython.so" # shared library Ganglia needs to run Python modules
    params = "/usr/local/ganglia/lib64/ganglia/python_modules" # directory that holds the Python scripts
  }
}
include ("/usr/local/ganglia/etc/conf.d/*.pyconf") # configuration files for the Python modules

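    So to monitor MongoDB through Ganglia's Python extension, put the configuration file and the .py script in the corresponding directories and restart the Ganglia service. The following steps show how to write both files.
    2) Create mongodb.pyconf
    You need root privileges to create and edit this file, which goes in the conf.d directory. The MongoDB metrics collected below are a reference set; add or remove entries to fit your needs.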

modules {
module {
name = "mongodb" # module name; it must match the name of the Python script in the python_modules directory given by params in modpython.conf
language = "python" # the module is written in Python
# Parameter list; all params are passed as one dict (i.e. a map) to the Python script's metric_init(params) function.
param server_status{
value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(db.serverStatus())'"
}
param rs_status{
value = "/path/to/mongo --host host --port 27017 --quiet --eval 'printjson(rs.status())'"
}
}
}
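# List of metrics to collect; a single module can expose any number of metrics
collection_group {
collect_every = 30
time_threshold = 90 # maximum interval between sends
metric {
name = "mongodb_opcounters_insert" # name of the metric as defined in the module
title = "Inserts" # title shown in the web UI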
}
metric {
name = "mongodb_opcounters_query"
title = "Queries"
}
metric {
name = "mongodb_opcounters_update"
title = "Updates"
}
metric {
name = "mongodb_opcounters_delete"
title = "Deletes"
}
metric {
name = "mongodb_opcounters_getmore"
title = "Getmores"
}
metric {
name = "mongodb_opcounters_command"
title = "Commands"
}
metric {
name = "mongodb_backgroundFlushing_flushes"
title = "Flushes"
}
metric {
name = "mongodb_mem_mapped"
title = "Memory-mapped Data"
}
metric {
name = "mongodb_mem_virtual"
title = "Process Virtual Size"
}
metric {
name = "mongodb_mem_resident"
title = "Process Resident Size"
}
metric {
name = "mongodb_extra_info_page_faults"
title = "Page Faults"
}
metric {
name = "mongodb_globalLock_ratio"
title = "Global Write Lock Ratio"
}
metric {
name = "mongodb_indexCounters_btree_miss_ratio"
title = "BTree Page Miss Ratio"
}
metric {
name = "mongodb_globalLock_currentQueue_total"
title = "Total Operations Waiting for Lock"
}
metric {
name = "mongodb_globalLock_currentQueue_readers"
title = "Readers Waiting for Lock"
}
metric {
name = "mongodb_globalLock_currentQueue_writers"
title = "Writers Waiting for Lock"
}
metric {
name = "mongodb_globalLock_activeClients_total"
title = "Total Active Clients"
}
metric {
name = "mongodb_globalLock_activeClients_readers"
title = "Active Readers"
}
metric {
name = "mongodb_globalLock_activeClients_writers"
title = "Active Writers"
}
metric {
name = "mongodb_connections_current"
title = "Open Connections"
}
metric {
name = "mongodb_connections_current_ratio"
title = "Percentage of Connections Used"
}
metric {
name = "mongodb_slave_delay"
title = "Replica Set Slave Delay"
}
metric {
name = "mongodb_asserts_total"
title = "Asserts per Second"
}
}

    As you can see, this configuration file uses the same syntax as gmond.conf, so if anything is unclear you can refer to gmond.conf.
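
Because every metric name in the pyconf file has to match a descriptor returned by the Python module, a small consistency check can catch typos before gmond is restarted. The sketch below is one possible way to do it; the function names are hypothetical, and the regular expression assumes that name is the first entry inside each metric { } block, as in the file above.

```python
import re

# Matches: metric { name = "some_name"   (whitespace and newlines allowed)
METRIC_NAME_RE = re.compile(r'metric\s*\{\s*name\s*=\s*"([^"]+)"')


def pyconf_metric_names(pyconf_text):
    """Return the metric names declared in metric { ... } blocks, in order."""
    return METRIC_NAME_RE.findall(pyconf_text)


def missing_descriptors(pyconf_text, descriptors):
    """Return pyconf metric names that have no matching descriptor dict."""
    defined = set(d["name"] for d in descriptors)
    return [n for n in pyconf_metric_names(pyconf_text) if n not in defined]
```

Calling missing_descriptors(open("mongodb.pyconf").read(), metric_init({})) with the module from the next step should return an empty list when the two files agree.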
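    3) Create mongodb.py
    Put mongodb.py in the lib64/ganglia/python_modules directory. That directory already contains many Python scripts, for example ones that monitor disk, memory, network, MySQL and Redis, and they are useful references when writing mongodb.py. If you open a few of them you will see that each defines a function metric_init(params). As mentioned above, the parameters from mongodb.pyconf are passed to this function.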

#!/usr/bin/env python
import json
import os
import re
import socket
import string
import time
import copy
from functools import reduce  # builtin in Python 2, lives in functools in Python 3

NAME_PREFIX = 'mongodb_'

# Default commands; replace /path/to/bin/mongo and host with your own values.
# Params passed in from mongodb.pyconf override these in metric_init().
PARAMS = {
    'server_status' : '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(db.serverStatus())"',
    'rs_status' : '/path/to/bin/mongo --host host --port 27017 --quiet --eval "printjson(rs.status())"'
}
METRICS = {
    'time' : 0,
    'data' : {}
}
LAST_METRICS = copy.deepcopy(METRICS)
METRICS_CACHE_TTL = 3


def flatten(d, pre = '', sep = '_'):
    """Flatten a dict (i.e. dict['a']['b']['c'] => dict['a_b_c'])"""
    new_d = {}
    for k,v in d.items():
        if isinstance(v, dict):
            new_d.update(flatten(d[k], '%s%s%s' % (pre, k, sep)))
        else:
            new_d['%s%s' % (pre, k)] = v
    return new_d


def get_metrics():
    """Return all metrics"""
    global METRICS, LAST_METRICS
    # only query MongoDB again once the cached sample has expired
    if (time.time() - METRICS['time']) > METRICS_CACHE_TTL:
        metrics = {}
        for status_type in PARAMS.keys():
            # get raw metric data
            o = os.popen(PARAMS[status_type])
            # clean up
            metrics_str = ''.join(o.readlines()).strip() # convert to string
            metrics_str = re.sub(r'\w+\((.*)\)', r"\1", metrics_str) # remove functions
            # convert to flattened dict
            try:
                if status_type == 'server_status':
                    metrics.update(flatten(json.loads(metrics_str)))
                else:
                    metrics.update(flatten(json.loads(metrics_str), pre='%s_' % status_type))
            except ValueError:
                metrics = {}
        # update cache
        LAST_METRICS = copy.deepcopy(METRICS)
        METRICS = {
            'time': time.time(),
            'data': metrics
        }
    return [METRICS, LAST_METRICS]


def get_value(name):
    """Return a value for the requested metric"""
    # get metrics
    metrics = get_metrics()[0]
    # get value
    name = name[len(NAME_PREFIX):] # remove prefix from name
    try:
        result = metrics['data'][name]
    except Exception:
        result = 0
    return result


def get_rate(name):
    """Return change over time for the requested metric"""
    # get metrics
    [curr_metrics, last_metrics] = get_metrics()
    # get rate
    name = name[len(NAME_PREFIX):] # remove prefix from name
    try:
        rate = float(curr_metrics['data'][name] - last_metrics['data'][name]) / \
               float(curr_metrics['time'] - last_metrics['time'])
        if rate < 0:
            rate = float(0)
    except Exception:
        rate = float(0)
    return rate


def get_opcounter_rate(name):
    """Return change over time for an opcounter metric"""
    master_rate = get_rate(name)
    repl_rate = get_rate(name.replace('opcounters_', 'opcountersRepl_'))
    return master_rate + repl_rate


def get_globalLock_ratio(name):
    """Return the global lock ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'globalLock_lockTime') / \
                 get_rate(NAME_PREFIX + 'globalLock_totalTime') * 100
    except ZeroDivisionError:
        result = 0
    return result


def get_indexCounters_btree_miss_ratio(name):
    """Return the btree miss ratio"""
    try:
        result = get_rate(NAME_PREFIX + 'indexCounters_btree_misses') / \
                 get_rate(NAME_PREFIX + 'indexCounters_btree_accesses') * 100
    except ZeroDivisionError:
        result = 0
    return result


def get_connections_current_ratio(name):
    """Return the percentage of connections used"""
    try:
        result = float(get_value(NAME_PREFIX + 'connections_current')) / \
                 float(get_value(NAME_PREFIX + 'connections_available')) * 100
    except ZeroDivisionError:
        result = 0
    return result


def get_slave_delay(name):
    """Return the replica set slave delay"""
    # get metrics
    metrics = get_metrics()[0]
    # no point checking my optime if i'm not replicating
    if 'rs_status_myState' not in metrics['data'] or metrics['data']['rs_status_myState'] != 2:
        result = 0
    # compare my optime with the master's
    else:
        master = {}
        slave = {}
        try:
            for member in metrics['data']['rs_status_members']:
                if member['state'] == 1:
                    master = member
                if member['name'].split(':')[0] == socket.getfqdn():
                    slave = member
            # integer division keeps the result usable with the 'uint' value type
            result = max(0, master['optime']['t'] - slave['optime']['t']) // 1000
        except KeyError:
            result = 0
    return result


def get_asserts_total_rate(name):
    """Return the total number of asserts per second"""
    return float(reduce(lambda memo,obj: memo + get_rate('%sasserts_%s' % (NAME_PREFIX, obj)),
                        ['regular', 'warning', 'msg', 'user', 'rollovers'], 0))


def metric_init(lparams):
    """Initialize metric descriptors"""
    global PARAMS
    # set parameters
    for key in lparams:
        PARAMS[key] = lparams[key]
    # define descriptors
    time_max = 60
    groups = 'mongodb'
    descriptors = [
        {
            'name': NAME_PREFIX + 'opcounters_insert',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Inserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Inserts',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_query',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Queries/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Queries',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_update',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Updates/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Updates',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_delete',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Deletes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Deletes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_getmore',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Getmores/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Getmores',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'opcounters_command',
            'call_back': get_opcounter_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Commands/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Commands',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'backgroundFlushing_flushes',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Flushes/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Flushes',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_mapped',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Memory-mapped Data',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_virtual',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Virtual Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'mem_resident',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'MB',
            'slope': 'both',
            'format': '%u',
            'description': 'Process Resident Size',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'extra_info_page_faults',
            'call_back': get_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Faults/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Page Faults',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_ratio',
            'call_back': get_globalLock_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Global Write Lock Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'indexCounters_btree_miss_ratio',
            'call_back': get_indexCounters_btree_miss_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'BTree Page Miss Ratio',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Operations Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Readers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_currentQueue_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Operations',
            'slope': 'both',
            'format': '%u',
            'description': 'Writers Waiting for Lock',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_total',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Total Active Clients',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_readers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Readers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'globalLock_activeClients_writers',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Clients',
            'slope': 'both',
            'format': '%u',
            'description': 'Active Writers',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current',
            'call_back': get_value,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Connections',
            'slope': 'both',
            'format': '%u',
            'description': 'Open Connections',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'connections_current_ratio',
            'call_back': get_connections_current_ratio,
            'time_max': time_max,
            'value_type': 'float',
            'units': '%',
            'slope': 'both',
            'format': '%f',
            'description': 'Percentage of Connections Used',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'slave_delay',
            'call_back': get_slave_delay,
            'time_max': time_max,
            'value_type': 'uint',
            'units': 'Seconds',
            'slope': 'both',
            'format': '%u',
            'description': 'Replica Set Slave Delay',
            'groups': groups
        },
        {
            'name': NAME_PREFIX + 'asserts_total',
            'call_back': get_asserts_total_rate,
            'time_max': time_max,
            'value_type': 'float',
            'units': 'Asserts/Sec',
            'slope': 'both',
            'format': '%f',
            'description': 'Asserts',
            'groups': groups
        }
    ]
    return descriptors


def metric_cleanup():
    """Cleanup"""
    pass


# the following code is for debugging and testing
if __name__ == '__main__':
    descriptors = metric_init(PARAMS)
    while True:
        for d in descriptors:
            print(('%s = %s' % (d['name'], d['format'])) % d['call_back'](d['name']))
        print('')
        time.sleep(METRICS_CACHE_TTL)
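
To make the data flow in get_metrics() concrete, here is a small standalone sketch of the clean-up and flattening steps. The SAMPLE text is made up in the shape of printjson() output and is not real server output, and parse_status is a hypothetical wrapper that does not exist in mongodb.py.

```python
import json
import re


def flatten(d, pre="", sep="_"):
    """Flatten nested dicts: d['a']['b'] becomes d['a_b']."""
    out = {}
    for k, v in d.items():
        if isinstance(v, dict):
            out.update(flatten(v, "%s%s%s" % (pre, k, sep), sep))
        else:
            out["%s%s" % (pre, k)] = v
    return out


def parse_status(raw):
    """Strip shell wrappers such as NumberLong(5), then parse and flatten."""
    cleaned = re.sub(r"\w+\((.*)\)", r"\1", raw)
    return flatten(json.loads(cleaned))


# Made-up sample in the shape of printjson() output (one field per line).
SAMPLE = """{
    "localTime" : ISODate("2015-04-21T07:39:18Z"),
    "opcounters" : {
        "insert" : NumberLong(5),
        "query" : 12
    }
}"""
```

parse_status(SAMPLE) yields keys such as opcounters_insert, which is the key get_value() and get_rate() look up after removing the mongodb_ prefix from the metric name. Note that the regular expression is greedy and does not cross line breaks, so it relies on each wrapped value sitting on its own line.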

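A Python extension module must define two functions: metric_init(params) and metric_cleanup().
metric_init() is called when the module is initialized and must return a metric descriptor dict or a list of such dicts; mongodb.py returns a list.

A metric descriptor dict is defined as follows:

d = {
    'name': '',          # must match the metric name in the pyconf file
    'call_back': ,       # function that returns the metric value
    'time_max': int(),
    'value_type': '',
    'units': '',
    'slope': '',
    'format': '',
    'description': ''
}
    metric_cleanup() is called when the module shuts down and returns nothing.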
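
Stripped of everything MongoDB-specific, a module that satisfies this contract can be very small. The sketch below is only an illustration: the metric demo_random, its group and its callback are hypothetical and not part of mongodb.py.

```python
import random

NAME = "demo_random"  # hypothetical metric name, not part of mongodb.py


def get_random(name):
    """Callback: gmond passes the metric name and expects the value back."""
    return random.randint(0, 100)


def metric_init(params):
    """Called once when gmond loads the module; returns the descriptors."""
    return [{
        "name": NAME,
        "call_back": get_random,
        "time_max": 60,
        "value_type": "uint",
        "units": "N",
        "slope": "both",
        "format": "%u",
        "description": "Random demo value",
        "groups": "demo",
    }]


def metric_cleanup():
    """Called when gmond unloads the module; nothing to release here."""
    pass
```

A matching pyconf file would declare a module with the same name as the script and one metric block whose name is demo_random.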
    4) View the monitoring statistics in the web UI
    Once the scripts are written, restart the gmond service.
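
Besides looking at the web UI, one way to confirm that the new metrics are being reported is to read the XML that gmond serves over TCP and list the metric names in it. This is a sketch under assumptions: it presumes gmond accepts TCP connections on port 8649 (the usual tcp_accept_channel default, so check your gmond.conf), and the function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def read_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond sends to a TCP client before closing."""
    chunks = []
    sock = socket.create_connection((host, port), timeout)
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def metric_names(xml_bytes, prefix="mongodb_"):
    """Return the sorted METRIC names in the dump that start with prefix."""
    root = ET.fromstring(xml_bytes)
    names = set(m.get("NAME") for m in root.iter("METRIC"))
    return sorted(n for n in names if n and n.startswith(prefix))
```

After the restart, metric_names(read_gmond_xml()) should list the mongodb_ metrics declared in mongodb.pyconf once gmond has collected at least one sample.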
