排错经历：全局变量被多次析构-weixiaobao7-ChinaUnix博客

weixiaobao7的ChinaUnix博客

首页　| 　博文目录　| 　关于我

weixiaobao7

博客访问： 96039
博文数量： 55
博客积分： 0
博客等级：民兵
技术积分： 335
用户组：普通用户
注册时间： 2015-01-07 14:27

文章分类

全部博文（55）

未分配的博文（55）

文章存档

2015年（55）

我的朋友

相关博文

排错经历：全局变量被多次析构

分类： LINUX

2015-03-26 11:06:39

我们team有一套C++写的server程序，最近发现它在每次退出的时候会崩溃，core dump文件的栈如下：

(gdb) bt

#0 0x0000003ea4e32925 in raise () from /lib64/libc.so.6

#1 0x0000003ea4e34105 in abort () from /lib64/libc.so.6

#2 0x0000003ea4e70837 in __libc_message () from /lib64/libc.so.6

#3 0x0000003ea4e76166 in malloc_printerr () from /lib64/libc.so.6

#4 0x0000003ea729d4c9 in std::basic_string, std::allocator >::~basic_string() ()

from /usr/lib64/libstdc++.so.6

#5 0x0000003ea4e35e22 in exit () from /lib64/libc.so.6

#6 0x0000003ea4e1ed24 in __libc_start_main () from /lib64/libc.so.6

#7 0x0000000000400629 in _start ()

下面介绍一下我是如何找到出问题的代码。

请注意，因为编译器优化的缘故，这个栈是不完整的。安装完调试符号后，栈应该是这样：

(gdb) bt

#0 0x0000003ea4e32925 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64

#1 0x0000003ea4e34105 in abort () at abort.c:92

#2 0x0000003ea4e70837 in __libc_message (do_abort=2, fmt=0x3ea4f58aa0 “*** glibc detected *** %s: %s: 0x%s ***\n”)

at ../sysdeps/unix/sysv/linux/libc_fatal.c:198

#3 0x0000003ea4e76166 in malloc_printerr (action=3, str=0x3ea4f58d48 “double free or corruption (fasttop)”,

ptr=) at malloc.c:6336

#4 0x0000003ea729d4c9 in _M_dispose (this=, __in_chrg=)

at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.h:236

#5 std::basic_string, std::allocator >::~basic_string (this=,

__in_chrg=)

at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.h:503

#6 0x0000003ea4e35e22 in __run_exit_handlers (status=0) at exit.c:78

#7 exit (status=0) at exit.c:100

#8 0x0000003ea4e1ed24 in __libc_start_main (main=0x4006f0 , argc=1, ubp_av=0x7fffffffe488,

init=, fini=, rtld_fini=, stack_end=0x7fffffffe478)

at libc-start.c:258

#9 0x0000000000400629 in _start ()

可惜我们服务器上的glibc是自定制的，找不到调试符号。所以我就只好黑灯瞎火的搞。崩溃发生在main函数返回之后。exit()函数执行全局变量的析构，然后就崩溃了。看上去应该跟内存的释放有关，但是我即便用valgrind也得不到任何有用的信息。只能知道是同一块内存被释放了两次，而且这是一个std::string，而且是个全局变量。但是我们的代码有几十万行，从何查起啊…… 于是首要任务是，先找到这个std::string的指针以及它所存的字符串的内容。

于是我首先切换到和std::basic_string相关的一桢

(gdb) f 5

#5 std::basic_string, std::allocator >::~basic_string (this=,

__in_chrg=)

at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.h:503

503 { _M_rep()->_M_dispose(this->get_allocator()); }

(gdb) p *this

No symbol “this” in current context.

可惜this指针被优化掉了。如果我能找到this指针的值，就能直接用gdb的info sym命令得到symbol name，从而得知是哪个变量出错了。不管怎样，先在这断下来再说。

但是每次我尝试给这个函数加断点的时候，gdb就崩溃了

../../gdb/breakpoint.c:5721: internal-error: set_raw_breakpoint: Assertion `sal.pspace != NULL’ failed.

A problem internal to GDB has been detected,

further debugging may prove unreliable.

Quit this debugging session? (y or n) y

虽然后来我摸索出来了怎么不让gdb崩溃也能加断点的办法（直接b *addr)，但那是后话了。就算我在这里加了断点，也很难设置条件。这个函数在exit中被调用几百次呢。

于是我就在它的下一层的函数加了断点

(gdb) b _ZNSs4_Rep10_M_destroyERKSaIcE

Breakpoint 1 at 0x3ea729c320

然后每次运行到这里的时候，可以看见this指针以及它的内容

Breakpoint 1, std::basic_string, std::allocator >::_Rep::_M_destroy (this=0x601e10,

__a=…)

at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.tcc:450

450 _Raw_bytes_alloc(__a).deallocate(reinterpret_cast(this), __size);

(gdb) p *this

$1 = {, std::allocator >::_Rep_base> = {_M_length = 3,

_M_capacity = 3, _M_refcount = -1}, static _S_max_size = 4611686018427387897, static _S_terminal = 0 ‘\000’,

static _S_empty_rep_storage = {0, 0, 0, 0}}

_M_refcount是一个引用计数器。正常情况下，走到这里时_M_refcount应该等于-1。如果发生了重复析构，那么_M_refcount就不等于-1。我想让进程停在异常发生的时候，可惜这个函数被调用的太频繁了。所以我给它加个条件断点来找异常的情况。

(gdb) b _ZNSs4_Rep10_M_destroyERKSaIcE if this->_M_refcount==-2

结果失败了。gdb运行的时候说

Error in testing breakpoint condition:

value has been optimized out

于是我把这个对象的内存打出来，找出这个成员变量的偏移地址：

(gdb) x/16 this

0x601e10: 3 0 3 0

0x601e20: -1 0 7895160 0

0x601e30: 0 0 131537 0

0x601e40: 0 0 0 0

从上面的输出可以看出，它的地址和this指针相差4个int的大小。因为this指针存放在$edi中，所以this->_M_refcount的位置就是(int*)$edi+4。

所以上述的条件“if this->_M_refcount==-2”就可以重新为“if *((int*)$edi+4)!=-1”

(gdb) b _ZNSs4_Rep10_M_destroyERKSaIcE if *((int*)$edi+4)!=-1

然后成功捕获到了

Breakpoint 12, std::basic_string, std::allocator >::_Rep::_M_destroy (this=0x4b4b510, __a=…) at /usr/src/debug/gcc-4.4.7-20120601/obj-x86_64-redhat-linux/x86_64-redhat-linux/libstdc++-v3/include/bits/basic_string.tcc:450

450_Raw_bytes_alloc(__a).deallocate(reinterpret_cast(this), __size);

$4201 = {

, std::allocator >::_Rep_base> = {

_M_length = 78993104,

_M_capacity = 30,

_M_refcount = -2

members of std::basic_string, std::allocator >::_Rep:

static _S_max_size = 4611686018427387897,

static _S_terminal = 0 ‘\000’,

static _S_empty_rep_storage = {0, 0, 0, 0}

}

但是字符串的内容在哪呢？查了下代码，发现它就在Rep这个对象的后面。可以直接用gdb的x命令打印，但是为了打印的更漂亮点，把这段内存像金山游侠那样打印出来，我在网上找到了一个办法

首先定义一个名为xxd的函数

(gdb) define xxd

Redefine command “xxd”? (y or n) y

Type commands for definition of “xxd”.

End with a line saying just “end”.

>dump binary memory dump.bin $arg0 ((char*)$arg0)+$arg1

>shell xxd dump.bin

>end

第一个参数是要打印的内存的起始地址，第二个参数是内存区的长度。

然后调用这个函数。

(gdb) xxd this 64

0000000: 0000 0000 0000 0000 1e00 0000 0000 0000 ................

0000010: feff ffff 0000 0000 6170 706c 6963 6174 ........applicat

0000020: 696f 6e2f 786d 6c3b 2063 6861 7273 6574 ion/xml; charset

0000030: 3d55 5446 2d38 0000 b101 0200 0000 0000 =UTF-8..........

一目了然，是”application/xml; charset=UTF-8”这个字符串被析构了两次。然后去代码中grep。可惜，如果能直接找到string的地址的话，用info sym (addr) 就可以直接知道符号名了，不用去代码中grep。

下面我把出问题的代码精简出来，给大家看看：

main.cpp:它只管载入两个so，其它什么都不做

#include 
#include 

int main(int argc,char* argv[]){
  void* h1=dlopen("./libt1.so",RTLD_NOW|RTLD_GLOBAL);
  if (!h1) {
    fprintf(stderr, "%s:%d load t1 fail %s\n", __FILE__,(int)__LINE__,dlerror());
    return -1;
  }
  void* h2=dlopen("./libt2.so",RTLD_NOW|RTLD_GLOBAL);
  if (!h2) {
    fprintf(stderr, "%s:%d load t2 fail %s\n", __FILE__,(int)__LINE__,dlerror());
    dlclose(h1);
    return -1;
  }
  return 0;
}

libt1.cpp: libt1.so的源文件

#include 

std::string msg("application/xml; charset=UTF-8");

libt2.cpp: libt2.so的源文件，和libt1.cpp完全相同

编译命令

$ g++ -g3 -o t main.cpp -ldl

$ g++ -O2 -fPIC -shared -g3 -o libt1.so libt1.cpp

$ g++ -O2 -fPIC -shared -g3 -o libt2.so libt1.cpp

然后执行主程序，每次必崩溃。

$ ./t

Aborted (core dumped)

这是因为，msg这个变量是一个global的变量，当解析这个符号的时候，后加载的so会覆盖前面。并且，每个so在加载的时候，会向 exit函数注册一个hook，因为这个全局变量需要在main函数退出后被析构。exit函数的hooks是一个链表。这两个so各在其中注册了一条。所以，当exit函数执行的时候，第二个so的msg这个变量所指向的内存会被析构两次。我讲的不是很清楚，你可以按照上面的代码实验一下，很快的。

This article is from: https://www.sunchangming.com/blog/post/4648.html

本文来自：

阅读(1601) | 评论(0) | 转发(0) |

上一篇：CI框架入门中的简单MVC例子

下一篇：Criterion - 一个简单可扩展的 C 语言测试框架

给主人留下些什么吧！~~

感谢所有关心和支持过ChinaUnix的朋友们

16024965号-6