Fixing a pthread_join segmentation fault when using Tair - guojun07 - ChinaUnix blog

Fixing a pthread_join segmentation fault when using Tair

Category: Servers and Storage

2014-04-27 21:37:49

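Recently, a program of mine that accesses Tair kept crashing. Tracing and analysis turned up the following cause.
     tair_client_impl::retrieve_server_addr makes these calls:
            thread.start(this, reinterpret_cast<void *>(heart_type));
            response_thread.start(this, reinterpret_cast<void *>(response_type));
    Thread creation failed here, but the failure was not handled. Later, tair_client_impl::close makes these calls:
             thread.join();
             response_thread.join();
    Because the thread was never created, joining it triggers a segmentation fault.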
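One way to make close() safe regardless of why creation failed is to remember whether the thread was actually created and to skip the join otherwise. The following is only a minimal sketch of that idea: GuardedThread and increment are hypothetical names made up for illustration, not part of tbsys or Tair, and the real CThread class may be structured differently.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical wrapper (not tbsys::CThread): it records whether
// pthread_create succeeded so that join() never touches a thread
// that does not exist.
class GuardedThread {
public:
    GuardedThread() : tid_(), started_(false) {}

    // Returns true only if the thread was actually created.
    bool start(void *(*fn)(void *), void *arg) {
        int ret = pthread_create(&tid_, NULL, fn, arg);
        started_ = (ret == 0);
        return started_;
    }

    // Joins only a thread that was started; otherwise does nothing
    // and returns false. A second join() is also a harmless no-op.
    bool join() {
        if (!started_) {
            return false;
        }
        started_ = false;
        return pthread_join(tid_, NULL) == 0;
    }

private:
    pthread_t tid_;
    bool started_;
};

// Trivial thread body used only for illustration.
static void *increment(void *arg) {
    ++*static_cast<int *>(arg);
    return NULL;
}
```

With a guard like this, a destructor or close() can call join() unconditionally: a thread that was never started is simply skipped instead of being passed to pthread_join.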

The detailed analysis and fix went as follows:
1. Debug the core dump with gdb:
        The stack obtained from the core dump:


#0 0x0000003a14c07fc3 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000000004abe6f in join (this=0x7f1df3ffe130) at /home/guojun8/lib/lib/include/tbsys/thread.h:51
#2 tair::tair_client_impl::close (this=0x7f1df3ffe130) at tair_client_api_impl.cpp:247
#3 0x00000000004b07a7 in tair::tair_client_impl::~tair_client_impl (this=0x7f1df3ffe130, __in_chrg=<value optimized out>) at tair_client_api_impl.cpp:83
#4 0x00000000004a58f0 in tair::new_tair_client (master_addr=<value optimized out>, slave_addr=<value optimized out>, group_name=<value optimized out>)
at tair_client_api.cpp:584
#5 0x00000000004a5b43 in tair::tair_client_api::startup (this=0x7f1dd4001170, master_addr=0x7f1dd40010d8 "127.0.0.1:5198",
slave_addr=0x7f1dd4001108 "127.0.0.1:5198", group_name=<value optimized out>) at tair_client_api.cpp:72
#6 0x0000000000447126 in imagestorage::Tair_Handler::Connect (this=0x7f1dd4000f90) at imagestorage/tair_handler.cc:10
#7 0x00000000004502cc in imagestorage::ImageHandler::FetchImage (this=0x1e8cb90, image_name=0x7f1dd4000908 "h00731dcfb73d42acc95f5a54e6088df117",
image_norm_name=0x1e8e9a0 "\270\347\350\001", image_buffer=0x7f1de19fc010 "", image_size=0x7f1df3ffe894, schema="plaza", err_msg="")
at imagestorage/image_handler.cc:213
....

2. Inspect the failing frame in gdb:


(gdb) f 2
#2 0x00000000004b3c72 in tair::tair_client_impl::close (this=0x7fa2bf4f3120) at tair_client_api_impl.cpp:248
warning: Source file is more recent than executable.
248 response_thread.join();
(gdb) p response_thread
$1 = {tid = 140336135386880, pid = 0, runnable = 0x7fa2bf4f3128, args = 0x1} <=============== pid = 0
(gdb)

Looking at the source:


static void *hook(void *arg) {
    CThread *thread = (CThread*) arg;
    // If the thread had started successfully, pid would not be 0,
    // so I suspected that thread creation had failed.
    thread->pid = gettid();
    if (thread->getRunnable()) {
        thread->getRunnable()->run(thread, thread->getArgs());
    }
    return (void*) NULL;
}

3. Add logging:


ret_thread = thread.start(this, reinterpret_cast<void *>(heart_type));
if (!ret_thread) {
    TBSYS_LOG(ERROR, "create thread failed.");
}
ret_thread = response_thread.start(this, reinterpret_cast<void *>(response_type));
if (!ret_thread) {
    TBSYS_LOG(ERROR, "create response_thread failed.");
}

Running the program again produced the log output below, which confirmed that thread creation was failing.


[2013-10-29 18:07:21.531977] WARN parse_invalidate_server (tair_client_api_impl.cpp:3449) [140336971073280] no invalid server info found.
[2013-10-29 18:07:21.532869] ERROR retrieve_server_addr (tair_client_api_impl.cpp:3434) [140336971073280] create response_thread failed.
[2013-10-29 18:07:21.532915] INFO transport.cpp:394 [140336976336640] ADDIOC, SOCK: 24, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa270802ea0
[2013-10-29 18:07:21.532941] INFO transport.cpp:394 [140337076029184] ADDIOC, SOCK: 25, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa2a0803c50

4. Get the failure reason from pthread_create:


// pthread_create returns the error number directly; it does not set errno.
int ret = pthread_create(&tid, NULL, CThread::hook, this);
if (ret != 0) {
    printf("pthread_create failed, ret = %s\n", strerror(ret));
}
assert(ret == 0);
return 0 == ret;
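The same check can be packaged as a small standalone helper. This is a sketch only: pthread_error_text, create_and_join and noop are hypothetical names introduced here and do not come from tbsys. It relies on the fact that pthread functions report failure through their return value, so the code passes that value to strerror rather than reading errno.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: converts a pthread_* return code into text.
// Note that std::strerror is not guaranteed to be thread-safe; a
// production version might prefer strerror_r.
std::string pthread_error_text(int ret) {
    if (ret == 0) {
        return "ok";
    }
    return std::string(std::strerror(ret));
}

// Thread body that does nothing; used only for illustration.
static void *noop(void *) {
    return NULL;
}

// Tries to create one thread, stores a readable status in *message,
// joins the thread if it exists, and returns the pthread_create code.
int create_and_join(std::string *message) {
    pthread_t tid;
    int ret = pthread_create(&tid, NULL, noop, NULL);
    *message = pthread_error_text(ret);
    if (ret == 0) {
        pthread_join(tid, NULL);
    }
    return ret;
}
```

Returning the raw code alongside the message lets the caller decide whether to retry, log, or abort, instead of asserting inside the thread wrapper.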

The resulting output was:
     pthread_create failed, ret = Resource temporarily unavailable

5. Solution:
Looking up the error gives:
       EAGAIN not enough system resources to create a process for the new
              thread.

       EAGAIN more than PTHREAD_THREADS_MAX threads are already active.

./asm/errno.h:14:#define        EAGAIN          11      /* Try again */

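I suspected that the current user had exceeded its process limit (on Linux, each thread counts toward this limit):
    [sre@WDDS-DEV-016 ~]$ ulimit -u
    1024
I raised the default value in /etc/security/limits.d/90-nproc.conf to 10240; for details, see the post "ulimit限制之nproc问题" (on the nproc problem with ulimit limits).
After the change the limit is 10240:
     [sre@WDDS-DEV-016 ~]$ ulimit -u
     10240
With the per-user process limit raised, the problem was solved.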
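The limit that `ulimit -u` prints can also be read from inside the program, which makes it possible to log it at startup next to any thread-creation error. Below is a minimal sketch using getrlimit with RLIMIT_NPROC (available on Linux and the BSDs, not part of POSIX); read_nproc_limit and print_nproc_limit are hypothetical helper names, not Tair or tbsys functions.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cstdio>

// Hypothetical helper: reads the per-user process/thread limit
// (the value reported by `ulimit -u`). Returns false on failure.
bool read_nproc_limit(rlim_t *soft, rlim_t *hard) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NPROC, &rl) != 0) {
        return false;
    }
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return true;
}

// Prints the soft limit, e.g. once at program startup.
void print_nproc_limit() {
    rlim_t soft = 0;
    rlim_t hard = 0;
    if (!read_nproc_limit(&soft, &hard)) {
        std::perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY) {
        std::printf("nproc soft limit: unlimited\n");
    } else {
        std::printf("nproc soft limit: %llu\n",
                    static_cast<unsigned long long>(soft));
    }
}
```

Logging this value alongside an EAGAIN from pthread_create would have pointed at the nproc limit without a separate shell session.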
