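Recently, a program that accesses Tair kept crashing. Tracing and analysis turned up the following cause.
tair_client_impl::retrieve_server_addr makes these calls:
thread.start(this, reinterpret_cast<void *>(heart_type));
response_thread.start(this, reinterpret_cast<void *>(response_type));
Thread creation failed here, but the failure was not handled. Later, tair_client_impl::close makes these calls:
thread.join();
response_thread.join();
Because the thread was never created, joining it causes a segmentation fault.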
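One way to avoid this class of crash is to remember whether the thread was really created and to skip the join otherwise. The following is a minimal sketch of that idea, not the tbsys CThread implementation: the GuardedThread class, its started_ flag, and the noop function are hypothetical names introduced only for illustration.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical wrapper (not tbsys::CThread): it records whether
// pthread_create succeeded so that join() never touches a thread
// that was not created.
class GuardedThread {
public:
    GuardedThread() : tid_(), started_(false) {}

    // Returns true only when pthread_create reported success.
    bool start(void *(*fn)(void *), void *arg) {
        assert(fn != NULL);
        if (started_) return false;          // already running
        int ret = pthread_create(&tid_, NULL, fn, arg);
        if (ret != 0) {
            // pthread_create returns the error number directly.
            fprintf(stderr, "pthread_create failed: %s\n", strerror(ret));
            return false;
        }
        started_ = true;
        return true;
    }

    // Joins only a thread that was really created; otherwise returns false.
    bool join() {
        if (!started_) return false;
        int ret = pthread_join(tid_, NULL);
        started_ = false;
        return ret == 0;
    }

    bool started() const { return started_; }

private:
    pthread_t tid_;
    bool started_;
};

// Trivial thread body used only to exercise the wrapper.
static void *noop(void *) { return NULL; }
```

With a guard like this, a close function could call join() unconditionally: the call on a thread that failed to start simply returns false instead of passing an invalid thread ID to pthread_join.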
The detailed analysis and the fix are as follows:
1. Debug the core dump with gdb:
The stack obtained from the core dump:
#0 0x0000003a14c07fc3 in pthread_join () from /lib64/libpthread.so.0
#1 0x00000000004abe6f in join (this=0x7f1df3ffe130) at /home/guojun8/lib/lib/include/tbsys/thread.h:51
#2 tair::tair_client_impl::close (this=0x7f1df3ffe130) at tair_client_api_impl.cpp:247
#3 0x00000000004b07a7 in tair::tair_client_impl::~tair_client_impl (this=0x7f1df3ffe130, __in_chrg=<value optimized out>) at tair_client_api_impl.cpp:83
#4 0x00000000004a58f0 in tair::new_tair_client (master_addr=<value optimized out>, slave_addr=<value optimized out>, group_name=<value optimized out>)
    at tair_client_api.cpp:584
#5 0x00000000004a5b43 in tair::tair_client_api::startup (this=0x7f1dd4001170, master_addr=0x7f1dd40010d8 "127.0.0.1:5198",
    slave_addr=0x7f1dd4001108 "127.0.0.1:5198", group_name=<value optimized out>) at tair_client_api.cpp:72
#6 0x0000000000447126 in imagestorage::Tair_Handler::Connect (this=0x7f1dd4000f90) at imagestorage/tair_handler.cc:10
#7 0x00000000004502cc in imagestorage::ImageHandler::FetchImage (this=0x1e8cb90, image_name=0x7f1dd4000908 "h00731dcfb73d42acc95f5a54e6088df117",
    image_norm_name=0x1e8e9a0 "\270\347\350\001", image_buffer=0x7f1de19fc010 "", image_size=0x7f1df3ffe894, schema="plaza", err_msg="")
    at imagestorage/image_handler.cc:213
....
2. Inspect the failing frame in gdb:
(gdb) f 2
#2 0x00000000004b3c72 in tair::tair_client_impl::close (this=0x7fa2bf4f3120) at tair_client_api_impl.cpp:248
warning: Source file is more recent than executable.
248 response_thread.join();
(gdb) p response_thread
$1 = {tid = 140336135386880, pid = 0, runnable = 0x7fa2bf4f3128, args = 0x1}    <== pid = 0
(gdb)
Looking at the source of the thread entry point:
static void *hook(void *arg) {
    CThread *thread = (CThread*) arg;
    // If the thread had started successfully, pid would not be 0,
    // so thread creation is suspected to have failed.
    thread->pid = gettid();

    if (thread->getRunnable()) {
        thread->getRunnable()->run(thread, thread->getArgs());
    }

    return (void*) NULL;
}
3. Add logging:
ret_thread = thread.start(this, reinterpret_cast<void *>(heart_type));
if(!ret_thread) {
    TBSYS_LOG(ERROR, "create thread failed.");
}
ret_thread = response_thread.start(this, reinterpret_cast<void *>(response_type));
if(!ret_thread) {
    TBSYS_LOG(ERROR, "create response_thread failed.");
}
Rerunning the program produced the log below, which confirms that creating response_thread failed.
[2013-10-29 18:07:21.531977] WARN parse_invalidate_server (tair_client_api_impl.cpp:3449) [140336971073280] no invalid server info found.
[2013-10-29 18:07:21.532869] ERROR retrieve_server_addr (tair_client_api_impl.cpp:3434) [140336971073280] create response_thread failed.
[2013-10-29 18:07:21.532915] INFO transport.cpp:394 [140336976336640] ADDIOC, SOCK: 24, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa270802ea0
[2013-10-29 18:07:21.532941] INFO transport.cpp:394 [140337076029184] ADDIOC, SOCK: 25, 127.0.0.1:5198, RON: 1, WON: 1, IOCount:1, IOC:0x7fa2a0803c50
4. Get the failure reason from pthread_create:
int ret = pthread_create(&tid, NULL, CThread::hook, this);
if(ret != 0) {
    // pthread_create returns the error number directly, so pass it to strerror.
    printf("pthread_create failed, ret = %s\n", strerror(ret));
}
assert(ret == 0);
return 0 == ret;
The resulting output:
pthread_create failed, ret = Resource temporarily unavailable
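"Resource temporarily unavailable" is the strerror text for EAGAIN. As a rough sketch, the error code could be turned into a more actionable log message like this; describe_pthread_create_error is a hypothetical helper name, and the hint text is only a suggestion, not something taken from Tair or tbsys.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: converts the return value of pthread_create
// into a log-friendly message. pthread_create reports errors through
// its return value, not through errno.
std::string describe_pthread_create_error(int ret) {
    assert(ret >= 0);  // 0 on success, a positive error number on failure
    if (ret == 0) {
        return "ok";
    }
    std::string msg = std::string("pthread_create failed: ") + std::strerror(ret);
    if (ret == EAGAIN) {
        // EAGAIN points at a resource or thread-count limit.
        msg += " (EAGAIN: resource or thread limit reached, check `ulimit -u`)";
    }
    return msg;
}
```

Logging the string returned by this helper instead of only "create thread failed" would point directly at the limit that needs checking.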
5. Fix:
Looking up the error code gives:
EAGAIN not enough system resources to create a process for the new thread.
EAGAIN more than PTHREAD_THREADS_MAX threads are already active.
./asm/errno.h:14:#define EAGAIN 11 /* Try again */
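This suggested that the current user had exceeded its process limit:
[sre@WDDS-DEV-016 ~]$ ulimit -u
1024
Change the default value in /etc/security/limits.d/90-nproc.conf to 10240 (for details, see the article "ulimit限制之nproc问题").
After the change the limit is 10240:
[sre@WDDS-DEV-016 ~]$ ulimit -u
10240
With the per-user process limit raised, the problem was resolved.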
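The same limit can also be read from inside a program, which makes it possible to log it at startup. The sketch below uses getrlimit with RLIMIT_NPROC, the resource that `ulimit -u` reports; on Linux, threads count toward this limit. The function name read_nproc_limit is hypothetical.

```cpp
#include <sys/resource.h>
#include <cassert>
#include <cstdio>

// Hypothetical helper: reads the per-user process/thread limit.
// *soft receives the value that `ulimit -u` prints, *hard the ceiling
// up to which an unprivileged process may raise the soft limit.
bool read_nproc_limit(rlim_t *soft, rlim_t *hard) {
    assert(soft != NULL && hard != NULL);
    struct rlimit lim;
    if (getrlimit(RLIMIT_NPROC, &lim) != 0) {
        perror("getrlimit(RLIMIT_NPROC)");
        return false;
    }
    *soft = lim.rlim_cur;
    *hard = lim.rlim_max;
    return true;
}

// Prints the limits; RLIM_INFINITY means no limit is configured.
void print_nproc_limit() {
    rlim_t soft = 0, hard = 0;
    if (!read_nproc_limit(&soft, &hard)) return;
    if (soft == RLIM_INFINITY) {
        printf("nproc soft limit: unlimited\n");
    } else {
        printf("nproc soft limit: %llu\n", (unsigned long long) soft);
    }
    if (hard == RLIM_INFINITY) {
        printf("nproc hard limit: unlimited\n");
    } else {
        printf("nproc hard limit: %llu\n", (unsigned long long) hard);
    }
}
```

Calling print_nproc_limit() before the client starts its threads would show whether the process is running with the low default of 1024 or the raised value.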