一、现状描述
11月25日晚上8点40分接到现场电话,说我们公司所有员工的im不能正常登陆,im数据库服务器(mysql数据库)的公司表中对应的我们公司名称被删除,需要处理。接到这个异常后,第一想到的是下午下班前,收到一封ogg的警告邮件说ogg进程终止,然后我登陆数据库查看的时候,发现ogg已经工作正常,所以也就没有太多关注。既然已经出现了这个问题,那么先解决问题,再分析原因。因为这些都是ogg从oracle端同步过来的一些数据,所以直接从oracle那边初始化一份过来,然后重设同步程序就可以了。
二、错误分析
1、出现这个问题,第一想到的就是binlog,因为公司表的数据是从我们oracle那边同步过去的,而且oracle那边没有任何关于这个表的删除操作,所以我定位这个表的delete操作
[mysql@ezgclient mysqllog]$ mysqlbinlog mysqlbin.000150>/tmp/11_25.txt
[mysql@ezgclient mysqllog]$ vi /tmp/11_25.txt
……
# at 1396789
# at 1396875
#111125 16:15:31 server id 2 end_log_pos 1396875 Table_map: `a`.`abc` mapped to number 5304
#111125 16:15:31 server id 2 end_log_pos 1397000 Delete_rows: table id 5304 flags: STMT_END_F
BINLOG '
o07PThMCAAAAVgAAAItQFQAAALgUAAAAAAAABnNydGFpcgAPdGFiX3V1bV9jb21wYW55AA4DD/b2
CAwMCA8PDw8PCBAAARQAFgCAAIAAQABAAEAA+D8=
o07PThkCAAAAfQAAAAhRFQAQALgUAAAAAAEADv//AP7iAAAAFAC6vNbdysC1vL/GvLzT0M/euavL
voAAAAAAOLkeQYAAAAAAAAAAAAEpAAAAAAAAANkqumA8EgAAO8kaakoSAAABAAAAAAAAAA9CVTAw
MDAwMDAzNTgwNzg=
……
2、因为binlog_format采用的是row模式,所以需要进一步解析binglog
[mysql@ezgclient mysqllog]$ mysqlbinlog -v -v mysqlbin.000150>/tmp/11_25.txt
[mysql@ezgclient mysqllog]$ vi /tmp/11_25.txt
#111125 16:15:31 server id 2 end_log_pos 1396875 Table_map: `a`.`abc` mapped to number 5304
#111125 16:15:31 server id 2 end_log_pos 1397000 Delete_rows: table id 5304 flags: STMT_END_F
BINLOG '
o07PThMCAAAAVgAAAItQFQAAALgUAAAAAAAABnNydGFpcgAPdGFiX3V1bV9jb21wYW55AA4DD/b2
CAwMCA8PDw8PCBAAARQAFgCAAIAAQABAAEAA+D8=
o07PThkCAAAAfQAAAAhRFQAQALgUAAAAAAEADv//AP7iAAAAFAC6vNbdysC1vL/GvLzT0M/euavL
voAAAAAAOLkeQYAAAAAAAAAAAAEpAAAAAAAAANkqumA8EgAAO8kaakoSAAABAAAAAAAAAA9CVTAw
MDAwMDAzNTgwNzg=
'/*!*/;
### DELETE FROM a.abc
### WHERE
### @1=226 /* INT meta=0 nullable=0 is_null=0 */
### @2='*****有限公司' /* VARSTRING(256) meta=256 nullable=0 is_null=0 */
### @3=951656001 /* DECIMAL(20,0) meta=5120 nullable=0 is_null=0 */
### @4=000000001 /* DECIMAL(22,0) meta=5632 nullable=1 is_null=0 */
### @5=41 /* LONGINT meta=0 nullable=1 is_null=0 */
### @6=2005-05-30 15:11:29 /* DATETIME meta=0 nullable=1 is_null=0 */
### @7=2011-08-17 02:02:19 /* DATETIME meta=0 nullable=1 is_null=0 */
### @8=1 /* LONGINT meta=0 nullable=1 is_null=0 */
### @9='BU0000000358078' /* VARSTRING(128) meta=128 nullable=1 is_null=0 */
### @10=NULL /* VARSTRING(128) meta=128 nullable=1 is_null=1 */
### @11=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */
### @12=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */
### @13=NULL /* VARSTRING(128) meta=64 nullable=1 is_null=1 */
### @14=NULL /* VARSTRING(128) meta=0 nullable=1 is_null=1 */
# at 1397000
#111125 16:15:31 server id 2 end_log_pos 1397027 Xid = 79238866
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
通过上面的分析和这里的日志情况显示,很明显有人误删除了这条记录,导致我们公司所有员工不能登录im(登录在线的,不会使用到这条记录,这个也就是导致了我们到晚上八点多才发现这个异常)
3、检查error日志
检查这个日志,发现一个很明显的bug,这个是导致数据库重启,以及那个时间因为数据库重启导致ogg进程异常收到邮件
111125 16:15:35 InnoDB: Assertion failure in thread 1095162176 in file row/row0mysql.c line 1534
InnoDB: Failing assertion: index->type & DICT_CLUSTERED
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.1/en/forcing-recovery.html
InnoDB: about forcing recovery.
111125 16:15:35 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=8388608
read_buffer_size=1048576
max_used_connections=30
max_threads=1000
threads_connected=14
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 9234379 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
thd: 0xb8bf170
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x4146d100 thread_stack 0x30000
/opt/mysql/product/5.1/bin/mysqld(my_print_stacktrace+0x2e)[0x8a74ce]
/opt/mysql/product/5.1/bin/mysqld(handle_segfault+0x322)[0x5dc992]
/lib64/libpthread.so.0[0x357980eb10]
/lib64/libc.so.6(gsignal+0x35)[0x3578c30265]
/lib64/libc.so.6(abort+0x110)[0x3578c31d10]
/opt/mysql/product/5.1/bin/mysqld(row_unlock_for_mysql+0x2f2)[0x7f4a52]
/opt/mysql/product/5.1/bin/mysqld(row_search_for_mysql+0x22e1)[0x802591]
/opt/mysql/product/5.1/bin/mysqld(_ZN11ha_innobase10index_readEPhPKhj16ha_rkey_function+0x192)[0x7724d2]
/opt/mysql/product/5.1/bin/mysqld(_ZN7handler16read_range_firstEPK12st_key_rangeS2_bb+0xbe)[0x6caa9e]
/opt/mysql/product/5.1/bin/mysqld(_ZN7handler22read_multi_range_firstEPP18st_key_multi_rangeS1_jbP17st_handler_buffer+0xce)[0x6c85be]
/opt/mysql/product/5.1/bin/mysqld(_ZN18QUICK_RANGE_SELECT8get_nextEv+0x127)[0x6aa557]
/opt/mysql/product/5.1/bin/mysqld[0x6c415d]
/opt/mysql/product/5.1/bin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP11st_sql_listyyb+0x86c)[0x66fc5c]
/opt/mysql/product/5.1/bin/mysqld(_Z21mysql_execute_commandP3THD+0x38bf)[0x5f03af]
/opt/mysql/product/5.1/bin/mysqld(_Z11mysql_parseP3THDPKcjPS2_+0x357)[0x5f25e7]
/opt/mysql/product/5.1/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xe63)[0x5f3453]
/opt/mysql/product/5.1/bin/mysqld(_Z10do_commandP3THD+0xe6)[0x5f3d16]
/opt/mysql/product/5.1/bin/mysqld(handle_one_connection+0x236)[0x5e66d6]
/lib64/libpthread.so.0[0x357980673d]
/lib64/libc.so.6(clone+0x6d)[0x3578cd3d1d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0x2aab4890fcd0 is an invalid pointer
thd->thread_id=62259
thd->killed=NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
111125 16:15:35 mysqld_safe Number of processes running now: 0
111125 16:15:35 mysqld_safe mysqld restarted
InnoDB: Log scan progressed past the checkpoint lsn 0 694228728
111125 16:15:36 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
InnoDB: Doing recovery: scanned up to log sequence number 0 694229872
111125 16:15:36 InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percents: 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 5
6 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
InnoDB: Apply batch completed
InnoDB: Last MySQL binlog file position 0 1397027, file name /opt/mysql/mysqldata/mysqllog/mysqlbin.000149
111125 16:15:37 InnoDB: Started; log sequence number 0 694229872
111125 16:15:37 [Note] Recovering after a crash using /opt/mysql/mysqldata/mysqllog/mysqlbin
111125 16:15:37 [Note] Starting crash recovery...
111125 16:15:37 [Note] Crash recovery finished.
而且还有个巧合就是查看binlog发现,DELETE FROM a.abc where ……之后,数据库就因为这个bug自动重启了。
4、网友解释
# /opt/mysql/product/5.1/bin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP11st_sql_listyyb+0x86c)[0x66fc5c]
# /opt/mysql/product/5.1/bin/mysqld(_Z21mysql_execute_commandP3THD+0x38bf)[0x5f03af]
---从这信息看是整理簇索引,导致表空间出现损坏
---分析的情况,你的系统应该正在做一个DELETE操作,而且应该无索引可走,删除的数据量也比较大
---可能是大量数据被缓存在innodb_buffer_pool_size中,并且其内部有创建自适应的hash索引,因删除数据而不得不重新创建,
---以及你的服务器当时IO出现瓶颈,导致一时无法响应Innodb master thread,而出现问题,并且InnoDB引擎在此方面出现过BUG
---解决版本是5.1.37之后,所以建议使用:5.1.40版本,较稳定
阅读(2826) | 评论(0) | 转发(0) |