MongoDB log:
2018-05-22T17:49:52.493+0800 I COMMAND [conn21061] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after oplog: 5178, after repl: 5178, after security: 5178, after sharding: 5178, after storageEngine: 5178, after tcmalloc: 5178, after wiredTiger: 5178, at end: 5178 }
2018-05-22T17:49:52.493+0800 I COMMAND [conn21061] command local.oplog.rs command: serverStatus { serverStatus: 1, oplog: 1, tcmalloc: 1 } numYields:0 reslen:23716 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 5177ms
2018-05-22T17:50:09.373+0800 I COMMAND [conn21061] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after oplog: 12052, after repl: 12052, after security: 12052, after sharding: 12052, after storageEngine: 12052, after tcmalloc: 12052, after wiredTiger: 12052, at end: 12052 }
2018-05-22T17:50:09.373+0800 I COMMAND [conn21061] command local.oplog.rs command: serverStatus { serverStatus: 1, oplog: 1, tcmalloc: 1 } numYields:0 reslen:23712 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 12058ms
serverStatus was very slow
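To gauge how frequent these slow entries were, they can be counted straight out of the mongod log. A minimal sketch; the sample file, its path, and the abbreviated log lines are illustrative, so point grep at your real mongod log:

```shell
# Count occurrences of the slow-serverStatus warning in the mongod log.
# /tmp/mongod_sample.log reproduces (abbreviated) lines from above; in
# practice grep the real log, e.g. /var/log/mongodb/mongod.log.
cat > /tmp/mongod_sample.log <<'EOF'
2018-05-22T17:49:52.493+0800 I COMMAND [conn21061] serverStatus was very slow: { ... at end: 5178 }
2018-05-22T17:49:52.493+0800 I COMMAND [conn21061] command local.oplog.rs command: serverStatus ... 5177ms
2018-05-22T17:50:09.373+0800 I COMMAND [conn21061] serverStatus was very slow: { ... at end: 12052 }
EOF
grep -c 'serverStatus was very slow' /tmp/mongod_sample.log   # -> 2
```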
Checked memory: plenty still free.
Checked I/O: everything normal.
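The memory and I/O spot checks can be done with stock Linux tools; a sketch (iostat from the sysstat package is the usual I/O tool; /proc/diskstats is the dependency-free fallback used here):

```shell
# Memory check: is there plenty of free/available memory? (there was)
free -m
# I/O check: `iostat -x 1` (sysstat package) shows per-device utilization
# and await; without sysstat, the raw counters live in /proc/diskstats
# (field 3 = device, 4 = reads completed, 8 = writes completed):
awk '{ printf "%-10s reads=%s writes=%s\n", $3, $4, $8 }' /proc/diskstats | head -5
```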
mongostat:
insert query update delete getmore command dirty  used flushes vsize  res qrw arw net_in net_out conn     set repl                time
    16   241     95     *0     443    50|0  0.4% 80.0%       0  232G 230G 0|0 3|0   203k    445k  563 RelName  PRI May 22 17:52:13.054
    40   320     92     *0     346    75|0  0.4% 80.0%       1  232G 230G 0|0 1|0   237k   6.08m  563 RelName  PRI May 22 17:52:14.054
    17   303    114     *0     470    69|0  0.4% 80.0%       0  232G 230G 0|0 1|0   236k    445k  563 RelName  PRI May 22 17:52:15.055
    13   204     66     *0     309    41|0  0.3% 80.0%       0  232G 230G 0|0 1|0   151k    422k  563 RelName  PRI May 22 17:52:16.053
The data read/write volume isn't high either. The only suspicious point: getmore is relatively high.
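High getmore rates on a primary usually come from tailable cursors repeatedly fetching from the oplog (replicating secondaries, or oplog-tailing tools). To put numbers on it, the getmore column can be pulled out of captured mongostat output; a sketch over the sample rows above ($5 assumes mongostat's default column order):

```shell
# Sum the getmore column (5th field) over the captured mongostat samples.
# The rows reproduce the mongostat output shown above.
cat > /tmp/mongostat_sample.txt <<'EOF'
16 241 95 *0 443 50|0 0.4% 80.0% 0 232G 230G 0|0 3|0 203k 445k 563 RelName PRI May 22 17:52:13.054
40 320 92 *0 346 75|0 0.4% 80.0% 1 232G 230G 0|0 1|0 237k 6.08m 563 RelName PRI May 22 17:52:14.054
17 303 114 *0 470 69|0 0.4% 80.0% 0 232G 230G 0|0 1|0 236k 445k 563 RelName PRI May 22 17:52:15.055
13 204 66 *0 309 41|0 0.3% 80.0% 0 232G 230G 0|0 1|0 151k 422k 563 RelName PRI May 22 17:52:16.053
EOF
awk '{ sum += $5 } END { print "total getmore over 4 samples:", sum }' /tmp/mongostat_sample.txt
# -> total getmore over 4 samples: 1568
```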
Next, use strace to see what the MongoDB process is actually doing:
strace -c -T -f -p <mongod pid>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.06    7.630402        6348      1202           recvfrom
 16.35    1.598607         775      2064       332 futex
  2.73    0.266954       14050        19        15 restart_syscall
  1.87    0.182972       45743         4           select
  0.44    0.042993        2687        16           epoll_wait
  0.31    0.030707         162       189           fdatasync
  0.10    0.010210          17       601           sendto
  0.09    0.008929           8      1074           clock_gettime
  0.02    0.002010           5       428           pwrite
  0.01    0.001016         102        10           read
  0.00    0.000321          19        17           madvise
  0.00    0.000000           0        11           open
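For quick comparison across hosts, the top syscall by time share can be extracted from a saved `strace -c` summary; a sketch (the sample file reproduces the first rows of the table above and assumes strace's default column layout):

```shell
# Pull the top syscall by time share out of a saved `strace -c` summary.
# NR > 2 skips the header and separator lines; strace sorts by % time,
# so the first data row is the heaviest syscall.
cat > /tmp/strace_summary.txt <<'EOF'
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.06    7.630402        6348      1202           recvfrom
 16.35    1.598607         775      2064       332 futex
EOF
awk 'NR > 2 { print $NF; exit }' /tmp/strace_summary.txt   # -> recvfrom
```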
Around 80% of the time is spent in
recvfrom
What on earth are you up to, MongoDB? Are there really that many network requests?
This MongoDB is a one-primary, two-secondary replica set. Comparing it against other MongoDB deployments with the same architecture, none of them show anywhere near this many recvfrom calls.
So who is hammering MongoDB's oplog?
Next, check the network traffic with iftop:
iftop
Now this is interesting: one IP's traffic is abnormal. A normal secondary's traffic is only about 3Kb, but one IP is pulling 7.95Mb. That deserves a closer look.
Log in to that server:
A quick netstat pinpoints it: several mongo-connector processes are connected to MongoDB.
Fine, that must be it!
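The netstat step can be made more precise by grouping established connections on port 27017 by remote IP; a sketch over sample output (the IPs below are placeholders, not from the incident; on the real server this would be fed from `netstat -ant | grep 27017`):

```shell
# Group established MongoDB connections by remote IP, busiest first.
# Sample `netstat -ant`-style lines: $4 = local address, $5 = remote.
cat > /tmp/netstat_sample.txt <<'EOF'
tcp 0 0 10.0.0.5:27017 10.0.0.21:52210 ESTABLISHED
tcp 0 0 10.0.0.5:27017 10.0.0.21:52214 ESTABLISHED
tcp 0 0 10.0.0.5:27017 10.0.0.33:40112 ESTABLISHED
EOF
awk '{ split($5, a, ":"); count[a[1]]++ }
     END { for (ip in count) print count[ip], ip }' /tmp/netstat_sample.txt | sort -rn
# -> 2 10.0.0.21
#    1 10.0.0.33
```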
That evening we stopped the mongo-connector service. And then?
Damn it, nothing changed at all; still a pile of 'very slow' log entries. Infuriating.
Trying again: connect to the MongoDB primary, and notice a problem: running db.printReplicationInfo() takes several seconds to return. What is going on? Memory, CPU, and disk are all fine, so why several seconds? Connecting to several larger instances and running the same command, every one of them returns instantly. Is this where the problem lies?
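To put a number on "takes several seconds", any shell command can be timed in milliseconds; a self-contained sketch in which sleep stands in for the slow mongo shell call (the host in the comment is a placeholder):

```shell
# Measure a command's wall-clock time in milliseconds. In the incident the
# slow command was (host is a placeholder):
#   mongo --host <primary> --quiet --eval 'db.printReplicationInfo()'
# sleep stands in here so the sketch runs anywhere.
start=$(date +%s%N)
sleep 0.2                     # stand-in for the slow shell command
end=$(date +%s%N)
echo "elapsed_ms=$(( (end - start) / 1000000 ))"
```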
Time for a manual failover: rs.stepDown(). Damn, it refuses to step down, even with force. Fine; the application traffic had mostly been stopped anyway, so out comes the last-resort command: kill.
After the forced kill, the failover succeeded. After watching the new primary's logs for a few minutes, the 'very slow' entries never reappeared. That solved it, though the root cause remains a mystery.
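The kill step follows the usual graceful-then-force pattern; a self-contained sketch in which a sleep process stands in for mongod (on a real server the cleaner first choice is the db.shutdownServer call in the comment, with SIGKILL only as a last resort, as it was here):

```shell
# Graceful-then-force termination, demonstrated on a sleep process standing
# in for mongod. On a real mongod prefer a clean shutdown first:
#   mongo admin --eval 'db.shutdownServer({ force: true })'
sleep 300 &
pid=$!
kill -TERM "$pid"                 # polite request: clean shutdown
sleep 1
if kill -0 "$pid" 2>/dev/null; then
  kill -KILL "$pid"               # last resort: cannot be ignored
fi
wait "$pid" 2>/dev/null || true
kill -0 "$pid" 2>/dev/null && echo "still running" || echo "terminated"
```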
If you run into the same problem, maybe give this a try!