2013
全部博文(65)
分类: Mysql/postgreSQL
2012-03-30 13:44:13
数据文件大约200GB,平时备份(mysqldump)压缩后的大小大约20GB,正常备份时间在2个小时内。一台服务器上,安装了多个mysql实例,这个数据库是其中一个。
从库用mysqldump 备份时,error log 中出现很多InnoDB: Warning: a long semaphore wait 信息(详见:附1)
备份是每天定时执行的,自从error log 出现InnoDB: Warning: a long semaphore wait后,备份时间一天比一天慢,本来正常2个小时内备份完成的,到后面需要5、6小时,而且有时会导致mysql重启。
环境:IBM X3650M3、redhat5.4、mysql 5.1.50 多实例部署
分析
这台服务器上,安装了多个mysql实例。其他实例也是从库,mysqldump备份的时候也正常,未出现任何warning 和 重启mysql的情况,cpu、io 基本上处于比较空闲的状态,swap 使用较少。从系统资源上看好像一切正常,只跟这个数据库有关?参数不对?
再看看mysql日志,有很多 InnoDB: Warning: a long semaphore wait ,从日志上看好像是加锁导致innodb存储引擎自杀。但mysqldump的时候,已停止同步的线程,备份数据的时候,只有mysqldump进程,应该不至于加锁等待这么久。
在备份的时候,show processlist 中 state 处于 Sending data ,此时的磁盘IO并不高。
假设一:服务器硬件有问题?
服务器有问题,不可能只有一个实例受影响。而且一直监控服务器的硬件相关信息,并无发现异常,dmesg、message 等未发现异常。IBM的raid卡无电,使硬盘读写变慢?
通过查看raid卡信息,发现Current Cache Policy: WriteBack,并非这个原因。
# /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aALL |grep -i "Current Cache Policy"
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
假设二: 受mysqldump备份参数影响?
尝试调整参数: mysqldump 时加上 single-transaction、quick 。 现象依旧
假设三: 受其他实例影响?
尝试将mysql 锁到内存,调整参数 memlock=1 这个参数会强迫mysqld进程的地址空间一直被锁定在物理内存上。现象依旧
尝试在备份的时候,重启其他实例,备份的速度恢复正常,但等下次备份时,又变得很慢,同时一直再报 InnoDB: Warning: a long semaphore wait ,隔了好几天,又出现一次mysql重启。
正当束手无策之时,在查找mysql多实例是否会互相影响,看到网上几篇博文是关于NUMA(Non-Uniform Memory Access)非一致内存访问。难道mysql备份变慢以及导致的重启跟numa有关?
先来看看NUMA的内存分配策略
有四种:
1.缺省(default):总是在本地节点分配(分配在当前进程运行的节点上);
2.绑定(bind):强制分配到指定节点上;
3.交叉(interleave):在所有节点或者指定的节点上交织分配;
4.优先(preferred):在指定节点上分配,失败则在其他节点上分配。
因为NUMA默认的内存分配策略是优先在进程所在CPU的本地内存中分配,会导致CPU节点之间内存分配不均衡,当某个CPU节点的内存不足时,会导致swap产生,而不是从远程节点分配内存。这就是所谓的swap insanity 现象。
MySQL采用了线程模式,对于NUMA特性的支持并不好,如果单机只运行一个MySQL实例,我们可以选择关闭NUMA,关闭的方法有三种:
1.硬件层,在BIOS中设置关闭
2.OS内核,启动时设置numa=off;
3.可以用numactl命令将内存分配策略修改为interleave(交叉)。
如果单机运行多个MySQL实例,我们可以将MySQL绑定在不同的CPU节点上,并且采用绑定的内存分配策略,强制在本节点内分配内存,这样既可以充分利用硬件的NUMA特性,又避免了单实例MySQL对多核CPU利用率不高的问题。
再看看服务器上的numa信息
做一个简单收集numastat信息,没隔10分钟收集一次:
$ more /home/mysql/scripts/numastat_out.sh
#!/bin/sh
[ -f ~/.bash_profile ] && . ~/.bash_profile || export PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/mysql/bin
umask 027
logfile=$(cd $(dirname $0);pwd)/log/numastat_out.log
NS=`which numastat`
NC=`which numactl`
echo "############ `date` ############" >> $logfile
$NS >> $logfile
$NC --hardware >> $logfile
echo "######################################################" >> $logfile
结果如下:
############ Mon Mar 26 11:50:01 CST 2012 ############
node0 node1
numa_hit 8363816504 5592890156
numa_miss 1980772 878592
numa_foreign 878592 1980772
interleave_hit 2294388 2354739
local_node 8361757791 5573044252
other_node 4039485 20724496
available: 2 nodes (0-1)
node 0 size: 32275 MB
node 0 free: 151 MB
node 1 size: 32320 MB
node 1 free: 20 MB
node distances:
node 0 1
0: 10 21
1: 21 10
######################################################
注:参数说明
Numa policy hit/miss statistics
/sys/devices/system/node/node*/numastat
All units are pages. Hugepages have separate counters.
numa_hit A process wanted to allocate memory from this node, and succeeded.
numa_miss A process wanted to allocate memory from another node, but ended up with memory from this node.
numa_foreign A process wanted to allocate on this node, but ended up with memory from another one.
local_node A process ran on this node and got memory from it.
other_node A process ran on this node and got memory from another node.
interleave_hit Interleaving wanted to allocate from this node and succeeded.
通过收集2天的数据做对比,interleave_hit 所对应的次数变化很小。说明numa的分配策略是缺省(default):总是在本地节点分配。在服务器上查询numa的 policy 是 default:
$ numactl --show
policy: default
preferred node: current
尝试更改numa策略
尝试将3306 备份慢的实例,启动时前面加上numactl --interleave=all 如下:
numactl --interleave=all /etc/init.d/mysqld_multi start 3306
效果
通过上面的脚本,收集到的 interleave_hit 的次数在备份时段增加得特别快:
############ Wed Mar 28 07:20:01 CST 2012 ############
interleave_hit 5558259 3760066
############ Wed Mar 28 07:30:01 CST 2012 ############
interleave_hit 7661687 5322428
############ Wed Mar 28 07:40:01 CST 2012 ############
interleave_hit 9686299 6776337
############ Wed Mar 28 07:50:01 CST 2012 ############
interleave_hit 11697365 8311399
############ Wed Mar 28 08:00:01 CST 2012 ############
interleave_hit 12499526 8949833
############ Wed Mar 28 08:10:01 CST 2012 ############
interleave_hit 14707608 10587130
############ Wed Mar 28 08:20:01 CST 2012 ############
interleave_hit 16442433 11951893
############ Wed Mar 28 08:30:01 CST 2012 ############
interleave_hit 17824729 12954106
并且,此时error log 无任何Warning。此后几天,再未出现warning信息。
(可以考虑,将实例分别绑定到固定的节点,比如:)
附1
InnoDB: Warning: a long semaphore wait:
--Thread 1183631680 has waited at ../../storage/innobase/include/btr0btr.ic line 28 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x2aacda28d898 created in file buf/buf0buf.c line 550
a writer (thread id 1183631680) has reserved it in mode exclusive
number of readers 0, waiters flag 1
Last time read locked in file btr/btr0pcur.c line 246
Last time write locked in file buf/buf0buf.c line 1823
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 1, pwrites 0
=====================================
120217 10:40:47 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 27 seconds
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 62196, signal count 62194
--Thread 1183631680 has waited at ../../storage/innobase/include/btr0btr.ic line 28 for 256.00 seconds the semaphore:
S-lock on RW-latch at 0x2aacda28d898 created in file buf/buf0buf.c line 550
a writer (thread id 1183631680) has reserved it in mode exclusive
number of readers 0, waiters flag 1
Last time read locked in file btr/btr0pcur.c line 246
Last time write locked in file buf/buf0buf.c line 1823
Mutex spin waits 0, rounds 98353, OS waits 1500
RW-shared spins 118450, OS waits 59162; RW-excl spins 1996, OS waits 1404
------------
TRANSACTIONS
------------
Trx id counter 0 91046918
Purge done for trx's n < 0 91046756 undo n < 0 0
History list length 21
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 0 91046818, not started, process no 4553, OS thread id 1183832384
MySQL thread id 154, query id 294386 localhost root
---TRANSACTION 0 91046875, ACTIVE 1300 sec, process no 4553, OS thread id 1183631680 fetching rows, thread declared inside InnoDB 500
mysql tables in use 1, locked 0
MySQL thread id 134, query id 294200 localhost backup Sending data
SELECT /*!40001 SQL_NO_CACHE */ * FROM `table1`
Trx read view will not see trx with id >= 0 91046876, sees < 0 91046876
--------
FILE I/O
--------
I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: doing file i/o (read thread) ev set
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 64, aio writes: 0,
ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
192399 OS file reads, 256617 OS file writes, 7538 OS fsyncs
1 pending preads, 0 pending pwrites
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
-------------------------------------
INSERT BUFFER AND ADAPTIVE HASH INDEX
-------------------------------------
Ibuf: size 1, free list len 20875, seg size 20877,
53381 inserts, 53381 merged recs, 31096 merges
Hash table size 18563003, node heap has 1 buffer(s)
0.00 hash searches/s, 0.00 non-hash searches/s
---
LOG
---
Log sequence number 43 3634229620
Log flushed up to 43 3634229620
Last checkpoint at 43 3634229620
0 pending log writes, 0 pending chkp writes
125902 log i/o's done, 0.00 log i/o's/second
----------------------
BUFFER POOL AND MEMORY
----------------------
Total memory allocated 9945977032; in additional pool allocated 13424640
Dictionary memory allocated 648760
Buffer pool size 549824
Free buffers 0
Database pages 549823
Modified db pages 0
Pending reads 64
Pending writes: LRU 0, flush list 0, single page 0
Pages read 3798473, created 3084, written 142124
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
--------------
ROW OPERATIONS
--------------
1 queries inside InnoDB, 0 queries in queue
2 read views open inside InnoDB
Main thread process no. 4553, id 1182828864, state: waiting for server activity
Number of rows inserted 90188, updated 33099, deleted 0, read 249911055
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
----------------------------
END OF INNODB MONITOR OUTPUT
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
其中还出现过重启:
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
120217 8:45:16 InnoDB: Assertion failure in thread 1162561856 in file srv/srv0srv.c line 2236
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
InnoDB: about forcing recovery.
120217 8:45:16 - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=267108864
read_buffer_size=3145728
max_used_connections=3
max_threads=300
threads_connected=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2110308 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
thd: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = (nil) thread_stack 0x30000
/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x20) [0xa16470]
/usr/local/mysql/bin/mysqld(handle_segfault+0x368) [0x609c34]
/lib64/libpthread.so.0 [0x3d5d60e7c0]
/usr/local/mysql/bin/mysqld(srv_error_monitor_thread+0x2f3) [0x94ca0d]
/lib64/libpthread.so.0 [0x3d5d6064a7]
/lib64/libc.so.6(clone+0x6d) [0x3d5cad3c2d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
120217 08:45:17 mysqld_safe Number of processes running now: 0
120217 08:45:17 mysqld_safe mysqld restarted
120217 8:45:17 [Note] Plugin 'FEDERATED' is disabled.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
120217 8:45:22 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Last MySQL binlog file position 0 547, file name /usr/local/mysql/log/mysql-bin.000035
120217 8:45:27 InnoDB: Started; log sequence number 43 3565657939
120217 8:45:27 [Note] Recovering after a crash using /usr/local/mysql/log/mysql-bin
120217 8:45:27 [Note] Starting crash recovery...
120217 8:45:27 [Note] Crash recovery finished.
120217 8:45:42 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000037' at position 89665957, relay log '/usr/local/mysql/log/relay-
bin.000238' position: 89666102
120217 8:45:42 [Note] Slave I/O thread: connected to master 'user@host:3306',replication started in log 'mysql-bin.000037' at position 131046151
120217 8:45:42 [Note] Event Scheduler: Loaded 0 events
120217 8:45:42 [Note] /usr/local/mysql/bin/mysqld: ready for connections.
Version: '5.1.50-log' socket: '/usr/local/mysql/tmp/mysql.sock' port: 3306 MySQL Community Server (GPL)