分类:
2008-04-07 18:59:31
在 Solaris 系统上,我通常使用 Sun Studio dbx 工具解决 JVM 中的崩溃和挂起等问题。对于大多数此类问题,我们从客户那里接收到一个磁心存储文件。该文件使用由 JVM 构建的产品所生成。由于 JVM 构建的产品并不包含任何调试信息,所以 dbx 就不能显示 JVM 的对象细节,这使得调试和解密地址非常困难。
dbx 调试器有一个非常有用的功能。使用这项功能我们可以“加载”和“使用”一个交替共享对象,而不是在进程运行中所使用的对象。
(dbx) loadobject -load
alternate-lib
(dbx) loadobject -use
alternate-lib
这项功能对于调试没有调试信息的进程或磁心存储文件非常有用。使用 -g
标志对我们所希望调试的库进行重新编译,然后将库加载到 dbx 调试会话中,这是可能的。
在发生挂起的情况下,在装有 JDK 版本 1.5.0_08 的 Solaris x86 系统上客户面临一个硬挂起,甚至进程不响应 kill -3
。客户向我们发送一个挂起进程的磁心存储文件。
对于具有调试信息的 dbx,我们使用 -g
标志重新构建 JDK 版本 1.5.0_08 的产品版本 libjvm.so
,并将之加载,然后使用带有 dbx 的 .so
。做了这样的更改,dbx 就能够获得使用 libjvm.so
生成的原本不带任何调试信息的磁心存储文件的所有调试信息。
在磁心存储文件上的 pstack
显示,VMThread
正试图获取 Threads_lock
,主线程在 Threads_lock
上正处于等待状态,另一个线程 3992 也试图攫取 Threads_lock
。那么这个锁看起来会引起死锁。下面给出了这个磁心存储文件。
----------------- lwp# 1 / thread# 1 -------------------- fef4009b lwp_park (0, 0, 0) fef3a909 cond_wait_queue (807c540, 807c528, 0, 0) + 3b fef3ae02 _cond_wait (807c540, 807c528) + 66 fef3ae44 cond_wait (807c540, 807c528) + 21 fe92fc64 __1cHMonitorEwait6Mil_i_ (807c4d0, 0, 0) + 2a4 fecf526b __1cHThreadsKdestroy_vm6F_i_ (8073fb9, 0, 806dc14, fee4d33c, fee4a2a0, 10004) + 7b feb66ea3 jni_DestroyJavaVM (fee43560) + 123 08051f8e main (2, 807017c, 80432b4) + bfb 080512fa ???????? (86, 8043468, 804346d, 8043494, 80434a8, 80434c6) ----------------- lwp# 11 / thread# 11 -------------------- fef4009b lwp_park (0, 0, 0) fef399bb slow_lock (fbc92000, 807c528, 0) + 3d fef39ab1 mutex_lock_impl (807c528, 0) + ec fef39bbd mutex_lock (807c528) + 1a fe8c9a00 __1cFMutexElock6MpnGThread__v_ (807c4d0, 8261480) + c0 fe8cc1fe __1cFMutexElock6M_v_ (807c4d0) + 4e fe996295 __1cUSafepointSynchronizeFbegin6F_v_ (fee493fc, 8261480, fedee000, f8199b74, 3, f8199bbc) + 85 fea659c4 __1cIVMThreadEloop6M_v_ (8261480) + 134 fea66478 __1cIVMThreadDrun6M_v_ (8261480) + 88 fecbe0f2 __1cG_start6Fpv_0_ (8261480) + d2 fef3fd52 _thr_setup (fbc92000) + 4e fef40040 _lwp_start (fbc92000, 0, 0, f8199ff8, fef40040, fbc92000) ----------------- lwp# 3992 / thread# 3992 -------------------- fef4009b lwp_park (0, 0, 0) fef399bb slow_lock (6c173c00, 807c528, 0) + 3d fef39ab1 mutex_lock_impl (807c528, 0) + ec fef39bbd mutex_lock (807c528) + 1a fe8ea95e __1cFMutexbClock_without_safepoint_check6M_v_ (807c4d0) + 7e fe95c56e __1cUSafepointSynchronizeFblock6FpnKJavaThread__v_ (a27f780) + ee fecb6825 __1cGParkerEpark6Mix_v_ (a27f908, 0, d964b198, 45) + 1e5 fed0963d Unsafe_Park (a27f840, 6807aba4, 0, d964b198, 45, 9480bcb8) + 1dd f876f2f9 ???????? (0, 0, a27c330, 0, 9ae69a8, 0) 00000000 ???????? (fecf6330, fea09d40, fea03c80, fe964f40, fe91cc40, fe939c50) 00000000 ???????? (fffffd2c, 3badd9be, f468fe16, e78b7254, fe1dbbcc, 0) 00000000 ???????? () |
现在让我们使用 dbx 看一看 Threads_lock
。
(dbx) print -fx Threads_lock Threads_lock = 0x807c4d0 (dbx) print -r *((Monitor*)0x807c4d0) *((class Monitor *) 0x807c4d0) = { Monitor::Mutex::_lock_count = 0 Monitor::Mutex::_lock_event = 0x807c500 Monitor::Mutex::_suppress_signal = 0 Monitor::Mutex::_owner = 0x97b0978 <---------------- Monitor::Mutex::_name = 0xfed97a0f "Threads_lock" Monitor::Mutex::INVALID_THREAD = 0 Monitor::_event = (nil) Monitor::_counter = 0 Monitor::_waiters = 1 Monitor::_tickets = 0 } (dbx) print -r *((Thread*)0x97b0978) *((class Thread *) 0x97b0978) = { Thread::ThreadShadow::_pending_exception = (nil) Thread::_exception_file = (nil) Thread::_exception_line = 0 Thread::_real_malloc_address = 0x97b0d28 Thread::_rawmonitor_list = (nil) Thread::_SR_lock = 0x9b7b890 Thread::_vm_suspend_depth = 0 Thread::_suspend_flags = 0 Thread::_active_handles = 0x97b0b48 Thread::_free_handle_block = (nil) Thread::_last_handle_mark = 0x65c92618 Thread::_oops_do_parity = 0 Thread::_highest_lock = 0x65c92cb8 "x^I{^I\xe8^K{^I\xb0\xd6\x82\xf4\xa0H\xbeu" Thread::_unboosted_priority = NoPriority Thread::_tlab = { Thread::ThreadLocalAllocBuffer::_start = 0x762476d8 Thread::ThreadLocalAllocBuffer::_top = 0x7624c750 Thread::ThreadLocalAllocBuffer::_end = 0x76257308 Thread::ThreadLocalAllocBuffer::_size = 14078U Thread::ThreadLocalAllocBuffer::_refill_waste_limit = 219U Thread::ThreadLocalAllocBuffer::_number_of_refills = 6U Thread::ThreadLocalAllocBuffer::_fast_refill_waste = 0 Thread::ThreadLocalAllocBuffer::_slow_refill_waste = 202U Thread::ThreadLocalAllocBuffer::_gc_waste = 0 Thread::ThreadLocalAllocBuffer::_slow_allocations = 0 Thread::ThreadLocalAllocBuffer::_allocation_fraction = { Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_average = 0.006992618 Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_sample_count = 1U Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_weight = 35U Thread::ThreadLocalAllocBuffer::AdaptiveWeightedAverage::_last_sample = 0.006992618 } Thread::ThreadLocalAllocBuffer::_target_refills = 50 Thread::ThreadLocalAllocBuffer::_global_stats = 136222936 } Thread::_vm_operation_started_count = 1 Thread::_vm_operation_completed_count = 1 Thread::_current_pending_monitor = (nil) Thread::_current_pending_monitor_is_from_java = 1 Thread::_current_waiting_monitor = (nil) Thread::_osthread = 0x97b0b00 <------------- Thread::_resource_area = 0x8c55f50 Thread::_handle_area = 0x98dafc0 Thread::_stack_base = 0x65c93000 " |
在前面的的文件显示,Threads_lock
的所有者是具有 ID 11982 的线程。现在我们看一下这个线程的堆栈跟踪。
(dbx) thread t@11982 t@11982 (l@11982) stopped in __lwp_park at 0xfef4009b 0xfef4009b: __lwp_park+0x000b: jae __lwp_park+0x19 [ 0xfef400a9, .+0xe ] (dbx) where current thread: t@11982 =>[1] __lwp_park(0x0, 0x0), at 0xfef4009b [2] mutex_lock_queue(0x67366400, 0x0, 0x88a0598, 0x0), at 0xfef3911f [3] slow_lock(0x67366400, 0x88a0598, 0x0), at 0xfef399bb [4] mutex_lock_impl(0x88a0598, 0x0), at 0xfef39ab1 [5] __mutex_lock(0x88a0598), at 0xfef39bbd [6] Mutex::wait_for_lock_blocking_implementation(0xa27f908, 0x97b0978), at 0xfe9d93ba [7] Mutex::lock(0xa27f908, 0x97b0978), at 0xfe8c99ea [8] Mutex::lock(0xa27f908), at 0xfe8cc1fe [9] Parker::unpark(0xa27f908), at 0xfecb69bb [10] os::interrupt(0xa27f780), at 0xfecbf6bb [11] Thread::interrupt(0xa27f780), at 0xfecf2ed9 [12] JVM_Interrupt(0x97b0a38, 0x65c923fc, 0x65c923d0, 0x0, 0x65c923fc, 0xf4873020, 0x0, 0xf482f600, 0x65c923fc, 0x65c92424, 0xf8202b6b, 0x0, 0xf8206399, 0x9593fdc0, 0x1, 0x0, 0x65c92400, 0xf482db6b, 0x65c92438, 0xf4873020), at 0xfeb92255 [13] 0xf820843a(0x0, 0xf8206399, 0x9593fdc0, 0x1, 0x0, 0x65c92400), at 0xf820843a [14] 0xf8202b6b(0x0, 0x0, 0x97f59c18, 0x9593fdc0, 0x65c9243c, 0xf5746e95), at 0xf8202b6b [15] 0xf8202b6b(0x967a0c40, 0x9, 0x968f6eb8, 0x65c92464, 0xf57472be, 0x65c92498), at 0xf8202b6b [16] 0xf8202b6b(0x0, 0x968f6eb8, 0x967a0c40, 0x65c9249c, 0xf5845a0c, 0x65c924c4), at 0xf8202b6b [17] 0xf8202b6b(0x1b58, 0x967a0c40, 0x65c924c8, 0xf5845674, 0x65c92524, 0xf58a1f78), at 0xf8202b6b [18] 0xf8202b6b(0x7624c168, 0x7624c138, 0x7624c108, 0x7624c0d8, 0x7624bee8, 0x7624bd00), at 0xf8202b6b |
线程正试图获取 Monitor 0xa27f908。
现在,让我们看一下 Monitor。
(dbx) print -r *((Monitor*)0xa27f908) *((class Monitor *) 0xa27f908) = { Monitor::Mutex::_lock_count = 0 Monitor::Mutex::_lock_event = 0x88a0570 <----------- Monitor::Mutex::_suppress_signal = 0 Monitor::Mutex::_owner = (nil) Monitor::Mutex::_name = 0xfed97c99 "Park Semaphore" Monitor::Mutex::INVALID_THREAD = 0 Monitor::_event = (nil) Monitor::_counter = 0 Monitor::_waiters = 0 Monitor::_tickets = 0 } (dbx) print -r *((Event*)0x88a0570) More than one identifier 'Event'. Select one of the following: 0) Cancel 1) class os::Solaris::Event 2) class Solaris::Event > 1 *((class os::Solaris::Event *) 0x88a0570) = { Event::_count = 0 Event::_nParked = 0 Event::CachePad = (-6.7529205832009e-205, -6.6220952168053e-205, -1.2786650622506e-204, -1.9668306807887e-204) Event::_mutex = ( { Event::_lwp_mutex::flags = { Event::_lwp_mutex::flag1 = 4U Event::_lwp_mutex::flag2 = '\0' Event::_lwp_mutex::ceiling = '\0' Event::_lwp_mutex::mbcp_type_un = { Event::_lwp_mutex::bcptype = 0 Event::_lwp_mutex::mtype_rcount = { Event::_lwp_mutex::count_type1 = '\0' Event::_lwp_mutex::count_type2 = '\0' } } Event::_lwp_mutex::magic = 19800U } Event::_lwp_mutex::lock = { Event::_lwp_mutex::lock64 = { Event::_lwp_mutex::pad = "" } Event::_lwp_mutex::lock32 = { Event::_lwp_mutex::ownerpid = 0 Event::_lwp_mutex::lockword = 16842752U } Event::_lwp_mutex::owner64 = 72339069014638592ULL } Event::_lwp_mutex::data = 1813462016ULL } ) Event::_cond = ( { Event::_lwp_cond::flags = { Event::_lwp_cond::flag = "" Event::_lwp_cond::type = 0 Event::_lwp_cond::magic = 17238U } Event::_lwp_cond::data = 0 } ) Event::FreeNext = (nil) Event::Immortal = 0 } |
这里,_lwp_mutex->data
是 0x6C173C00。它是 ulwp_t)
线程的所有者。
接下来,让我们看一下该线程。
(dbx) x 0x6C173C00/30X 0x6c173c00: 0x6c173c00 0x00000000 0x00000000 0x00000000 0x6c173c10: 0x00000000 0x00000000 0x00000000 0x00000000 0x6c173c20: 0x00000000 0x00000000 0x00000000 0xfef6b740 0x6c173c30: 0x00000000 0x00000000 0x6c174000 0x6c173800 0x6c173c40: 0x00000000 0x00000000 0x00000000 0x6803b000 0x6c173c50: 0x00040000 0x00000000 0x6807b000 0x00040000 0x6c173c60: 0x6803b000 0x00040000 0x00000000 0x00000398 0x6c173c70: 0x00000f98 0x0000007f |
线程 0x00000f98 ( 是 3992) 是持有 Event::mutex 的线程 ID。
现在,让我们看一下线程 3992。
(dbx) thread t@3992 t@3992 (l@3992) stopped in __lwp_park at 0xfef4009b 0xfef4009b: __lwp_park+0x000b: jae __lwp_park+0x19 [ 0xfef400a9, .+0xe ] (dbx) where current thread: t@3992 =>[1] __lwp_park(0x0, 0x0), at 0xfef4009b [2] mutex_lock_queue(0x6c173c00, 0x0, 0x807c528, 0x0), at 0xfef3911f [3] slow_lock(0x6c173c00, 0x807c528, 0x0), at 0xfef399bb [4] mutex_lock_impl(0x807c528, 0x0), at 0xfef39ab1 [5] __mutex_lock(0x807c528), at 0xfef39bbd [6] Mutex::lock_without_safepoint_check(0x807c4d0), at 0xfe8ea95e [7] SafepointSynchronize::block(0xa27f780), at 0xfe95c56e [8] Parker::park(0xa27f908, 0x0, 0xd964b198, 0x45), at 0xfecb6825 [9] Unsafe_Park(0xa27f840, 0x6807aba4, 0x0, 0xd964b198, 0x45, 0x9480bcb8), at 0xfed0963d |
它正等待锁定 Threads_lock
。
因此线程 11982 和 3992 发生了死锁:
Thread_lock
并等待获取 Monitor 0xa27f908。
Threads_lock
。 这个问题在 Java SE 5.0 更新 15 和 Java SE 6 中使用 CR 6601725 已经得到了修复。
以上文章转自于 : http://developers.sun.com.cn/