Category: SOLARIS

2016-08-15 10:48:50

Abstract: Abnormal termination of a process will trigger a core dump file. A core dump file is very helpful to programmers or support engineers for determining the root cause of abnormal termination, because it provides invaluable information about the runtime status at crash time. This article provides information about core dumps, as well as features and analysis tools in the Solaris Operating System that can be used to manage core dumps.

Note: The information provided in this article is mainly for the Solaris 10 OS.


Types of Core Dumps: Process and System
 

A core dump is a file that records the memory contents of a process along with other useful information, such as the values of the processor registers. There are two types of core dumps: system core dumps and process core dumps. They differ in many aspects, such as the manner in which they are created and the method used to analyze them.

Cause of Process Core Dumps
 

When an application process receives a specific signal and terminates, the system generates a core dump and stops the process. In most cases, the signal leading to the application crash is SIGSEGV or SIGBUS.

SIGSEGV indicates that the application is accessing an invalid memory address. This situation often occurs in C/C++ programs if there are code errors in pointer manipulation.

On the Solaris OS, you can use the libumem(3LIB) library as the user-mode memory allocator instead of libc. The libumem library can help find memory leaks, buffer overflows, attempts to use freed data, and many other memory allocation errors. Also, its memory allocator is very fast and scalable with multithreaded applications.

SIGBUS indicates that the application is accessing a memory address that does not conform to the CPU's memory alignment rules. This usually happens on a system with an UltraSPARC processor. Systems with x86/x64 CPUs can handle unaligned memory addresses, but there is a performance impact.

The Sun Studio C/C++ compiler has the -xmemalign option, which can be used to adjust the behavior of the UltraSPARC CPU when there are unaligned memory addresses that can be determined at compile time. The -xmemalign option causes the compiler to generate additional load/store instructions for unaligned memory access. However, the -xmemalign option cannot handle unaligned memory access during runtime. If unaligned memory access happens during runtime, the developer needs to change the source code.

There are other signals whose default disposition is to create a core dump, for example, SIGFPE, which indicates a floating point exception. The signal(3HEAD) man page provides more details.
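To see how a shell reports a child that was terminated by one of these signals, here is a minimal sketch. It assumes a POSIX-style shell such as bash, which encodes "killed by signal N" as an exit status of 128 + N. The child shell sends SIGSEGV to itself; whether a core file is actually written depends on the core dump settings and resource limits on the machine.

```shell
# Start a child shell that sends SIGSEGV to itself, then capture its status.
status=0
sh -c 'kill -s SEGV $$' || status=$?

# Assumption: the parent shell reports "killed by signal N" as 128 + N.
if [ "$status" -gt 128 ]; then
    signum=$((status - 128))
    echo "child was terminated by signal $signum (SIG$(kill -l "$signum"))"
else
    echo "child exited normally with status $status"
fi
```

The same check can be applied to the exit status of any program started from a script, which is a quick way to tell a crash apart from an ordinary error exit.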

How to Manage a Process Core Dump
 

The Solaris OS attempts to create up to three core dump files for each abnormally terminated process. One of the core dump files, which is called the per-process core file, is located in the current directory. Another core dump file, which is called the global core file, is created in the system-wide location. If the process is running in a local zone, a third core file is created in the global zone's location.

You can use the coreadm(1M) command to manage the core dumps. All the settings are saved in the /etc/coreadm.conf configuration file.

Below is a typical scenario, which shows the current system configuration for core dumps:

-bash-3.00# coreadm
    global core file pattern:
    global core file content: default
    init core file pattern: core
    init core file content: default
    global core dumps: disabled
    per-process core dumps: enabled
    global setid core dumps: disabled
    per-process setid core dumps: disabled
    global core dump logging: disabled

In the previous output:

  • The global core dumps: disabled line indicates that no global core dump will be generated.
  • The per-process core dumps: enabled line indicates that a per-process core dump will be generated for each abnormally terminated process.
  • The init core file content line indicates which parts of the process will be gathered from the live process and written to the per-process core dump.
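If you want to check one of these settings from a script, the output can be filtered by label. The following is a minimal sketch: core_setting is a hypothetical helper, not part of coreadm, and the sample text is copied from the output shown earlier so that the example does not depend on a Solaris host.

```shell
# Hypothetical helper: print the value of one setting from coreadm output.
# $1 is the label to look for; the coreadm output is read from stdin.
core_setting() {
    sed -n "s/^ *$1: *//p"
}

# Sample text copied from the article; on a Solaris host you would pipe the
# real command instead:  coreadm | core_setting 'global core dumps'
sample='global core file pattern:
global core file content: default
init core file pattern: core
init core file content: default
global core dumps: disabled
per-process core dumps: enabled
global setid core dumps: disabled
per-process setid core dumps: disabled
global core dump logging: disabled'

printf '%s\n' "$sample" | core_setting 'global core dumps'       # disabled
printf '%s\n' "$sample" | core_setting 'per-process core dumps'  # enabled
```

Matching on the full label keeps similar lines apart, for example "global core dumps" and "global setid core dumps".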

You can also use the coreadm command to control the core dump file name:

-bash-3.00# coreadm -i core.%f.%p

            

This command causes the per-process core file name to include the program file name (%f) and the runtime process ID (%p). A core dump file will be generated in the current working directory of the process.

-bash-3.00# coreadm -g /globalcore/core.%f.%p -e global

            

By default, the global core dump is disabled. You need to use the coreadm command with the -e global option to enable it. The -g option sets the global core file pattern; here it places the file in /globalcore and appends the program name (%f) and the runtime process ID (%p) to the core file name.
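To preview what file name a pattern will produce, the substitution can be imitated in the shell. This is only an illustration: expand_core_pattern is a hypothetical helper that handles just the %f and %p tokens used in this article, and the program name and process ID are example values. The real expansion is performed by the system when the core file is written.

```shell
# Hypothetical helper: imitate the expansion of %f and %p in a core pattern.
# $1 = pattern, $2 = program file name, $3 = process ID.
# Limitation: names containing "|" or "&" would confuse sed in this sketch.
expand_core_pattern() {
    pattern=$1
    progname=$2
    pid=$3
    printf '%s\n' "$pattern" | sed -e "s|%f|$progname|g" -e "s|%p|$pid|g"
}

expand_core_pattern 'core.%f.%p' tServer 2580
# -> core.tServer.2580
expand_core_pattern '/globalcore/core.%f.%p' tServer 2580
# -> /globalcore/core.tServer.2580
```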

As indicated previously, coreadm can specify the parts of the process that will be saved to the core file. In the past, when you performed a post-mortem analysis, you needed to obtain the exact versions of all the dependent libraries and runtime modules, because the core dump file did not contain this text information. Recreating the environment of the original machine was quite a headache for programmers.

With the default configuration, the Solaris OS applies the "default" content setting to each process core dump, which means the process core dump contains stack, heap, text, shared memory (SHM), intimate shared memory (ISM), and dynamic intimate shared memory (DISM) information, plus other information. The text part of the process core dump also contains a partial symbol table (dynsym), which helps you get a readable stack trace directly from the core file alone, without the dependent libraries. If the dynsym is insufficient, you can use coreadm to include all symbol tables, as follows:

-bash-3.00# coreadm -G all -I all

            

The previous command makes both the global core file (-G) and the per-process core file (-I) contain all the parts of the process.

Here's how to use coreadm to verify the changes:

-bash-3.00# coreadm
    global core file pattern: /globalcore/core.%f.%p
    global core file content: all
    init core file pattern: core.%f.%p
    init core file content: all
    global core dumps: enabled
    per-process core dumps: enabled
    global setid core dumps: disabled
    per-process setid core dumps: disabled
    global core dump logging: disabled

The coreadm command is used to edit the configuration file of the coreadm service, which is managed by the Service Management Facility (SMF) with this service identifier: svc:/system/coreadm:default.

How to Create a Process Core Dump Manually
 

The Solaris OS provides the gcore(1) command in case you need to create a core dump manually for a live process for analysis purposes:

-bash-3.00# echo $$
    2770
-bash-3.00# gcore $$
    gcore: core.2770 dumped

The live process ID is appended automatically to the name of the generated core dump. In the previous example, the process of the current shell is dumped and its process ID is 2770.

Note: There are other constraints you need to take into account when generating a core dump, for example, the existence of the destination directory, the write permissions on it, the file system mount options, and the process resource limits. For resource limit information, refer to the man pages for setrlimit(2) and ulimit(1).
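Some of these constraints can be checked ahead of time. The following minimal sketch covers three of them: the directory exists, it is writable, and the core file size limit of the current shell is not zero. check_core_prereqs is a hypothetical helper name, and the sketch does not examine file system mount options.

```shell
# Hypothetical pre-flight check for writing a core file into directory $1.
check_core_prereqs() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "directory not writable: $dir"
        return 1
    fi
    # ulimit -c reports the core file size limit ("unlimited" or a number).
    limit=$(ulimit -c)
    if [ "$limit" = "0" ]; then
        echo "core file size limit is 0"
        return 1
    fi
    echo "ok: $dir (core file size limit: $limit)"
}

check_core_prereqs "$PWD" || echo "a core dump may not be written here"
```

A limit of 0 is a common reason for a missing core file; it can be raised for the current shell and its children with ulimit -c.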

Another useful tool called AppCrash is available. It automatically collects diagnostic and debugging information when any application crashes under the Solaris OS. This article does not address its usage. For more information on using AppCrash, refer to Greg Nakhimovsky's blog.

How to Analyze a Process Core Dump File
 

The Solaris OS offers several tools for analyzing core dump files, including dbx(1), mdb(1), and pstack(1). The most convenient method is to use the pstack tool to print the process stack. This tool handles multithreaded programs as well:

-bash-3.00# pstack core.2580  | more
    core 'core.2580' of 2580:       java_vm
    -----------------  lwp# 1 / thread# 1  --------------------
    fef40a27 read     (b, 804280c, 1)
    feb11ba8 __1cDhpiEread6FipvI_I_ (b, 804280c, 1) + a8
    feb11aef JVM_Read (b, 804280c, 1) + 2f
    fe77045e ???????? (80685b8, 8042864, 22)
    ...
    feb1d55c jni_CallStaticVoidMethod (80685b8, 8069020, 80e8355, 0) + 14c
    080516c2 main     (2, 8047168, 8047174) + 50c
    08050daa ???????? (2, 80472cc, 80472d4, 0, 80472d5, 8047301)
    -----------------  lwp# 2 / thread# 2  --------------------
    fef40d27 lwp_cond_wait (8067ae8, 8067ad0, fb3a9c08, 0)
    fef2de3f _lwp_cond_timedwait (8067ae8, 8067ad0, fb3a9c50) + 35
    ...
    fef3fc32 _thr_setup (fef82400) + 4e
    fef3ff20 _lwp_start (fef82400, 0, 0, fb3a9ff8, fef3ff20, fef82400)
    -----------------  lwp# 3 / thread# 3  --------------------
    fef40d27 lwp_cond_wait (8116588, 8116570, 0, 0)
    feab737c __1cCosHSolarisFEventEpark6M_v_ (8116548) + 4c
    ...

In general, if the program's symbol table is not stripped and its runtime stack trace is available, you can expect almost 50 percent of the problems to be resolved.
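For a process with many threads, it helps to reduce the pstack output to the innermost frame of each thread. The following is a minimal sketch: top_frames is a hypothetical helper, and the sample lines are taken from the output shown earlier so that the example runs without a core file.

```shell
# Hypothetical helper: print the innermost function of every LWP.
# It expects pstack output on stdin: each "lwp#" separator line is followed
# by frames in the form "address function (arguments)".
top_frames() {
    awk '/lwp#/ { lwp = $3; if ((getline) > 0) print "lwp " lwp ": " $2 }'
}

# Sample lines copied from the article; with a real core file you would run:
#   pstack core.2580 | top_frames
sample='-----------------  lwp# 1 / thread# 1  --------------------
fef40a27 read     (b, 804280c, 1)
feb11aef JVM_Read (b, 804280c, 1) + 2f
-----------------  lwp# 2 / thread# 2  --------------------
fef40d27 lwp_cond_wait (8067ae8, 8067ad0, fb3a9c08, 0)'

printf '%s\n' "$sample" | top_frames
# -> lwp 1: read
# -> lwp 2: lwp_cond_wait
```

Threads blocked in a wait call such as lwp_cond_wait are usually idle, so the thread whose innermost frame is something else is often the first one worth examining.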

dbx is a free source-level debugging tool provided with Sun Studio software. Sun Studio software includes free, optimizing C, C++, and Fortran compilers that can be used on both the Solaris OS and Linux. dbx not only helps you inspect the state of your program, but it also collects program performance data. For more details on dbx, refer to the Sun Studio dbx documentation. Here is a typical scenario for analyzing a core file using dbx:

-bash-3.00# /opt/SUNWspro/bin/dbx   tServer   core
    For information about new features see 'help changes'
    To remove this message, put 'dbxenv suppress_startup_message 7.5'
    in your .dbxrc
    Reading tServer
    core file header read successfully
    Reading ld.so.1
    Reading libpthread.so.1
    Reading librt.so.1
    Reading libsocket.so.1
    Reading libnsl.so.1
    Reading libc.so.1
    Reading libthread.so.1
    Reading libCrun.so.1
    Reading libm.so.1
    Reading libkstat.so.1
    t@1 (l@1) program terminated by signal SEGV (no mapping at the fault address)
    0xffffffff7ce3ce90: strcmp+0x0014:      ldub     [%i1], %i5
    Current function is txnAtomMatchRqst
    177  && strcmp(pMsg->inHeader.msgVer, "01" == 0)) {
    (dbx) threads                    ** show all the threads
    o>    t@1  a  l@1   ?()   signal SIGSEGV in  strcmp()
          t@2  b  l@2   tTimerThread()   LWP suspended in  __pollsys()
    (dbx) thread -info t@1           ** show the thread information
    Thread t@1 (0xffffffff7a500000) at priority 0
    state: active on    l@1
    base function: 0x0: 0x0000000000000000() stack: 0xffffffff80000000[8388608]
    flags: (none)
    masked signals: (none)
    Currently active in strcmp
    (dbx) where                      ** show the thread stack
    current thread: t@1
    [1] strcmp(0x100263d63, 0x0, 0xac, 0x0, 0x30, 0x31), at 0xffffffff7ce3ce90
    =>[2] tAtomMatchRqst(), line 177 in "tAtomMatchRqst.c"
    [3] tFlow(), line 96 in "tFlow.c"
    [4] tServer(rqst = 0x1001e6c58), line 73 in "tServer.c"
    [5] _tsvcdsp(0x1700, 0x0, 0x10004ca60, 0x1001e55c0, 0x0, 0x1001d9440), at 0xffffffff7e15d138
    [6] _trunserver(0x1001e3844, 0x1001da958, 0x0, 0xffffffff7e3525c8, 0x1400, 0x1001ee400), at 0xffffffff7e180ea0
    [7] _tstartserver(0x0, 0xffffffff7ffff568, 0x1001bcc38, 0x1001d9440, 0x0, 0x0), at 0xffffffff7e15be28
    [8] main(0xf, 0xffffffff7ffff568, 0xffffffff7ffff5e8, 0x0, 0x0, 0x100000000), at 0x1000099ec
    (dbx) quit
-bash-3.00#

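From the previous example, you can use dbx to determine the abnormal thread, which is marked with "o>", and its root cause by showing the source code. Here, line 177 has a misplaced parenthesis: the expression "01" == 0 evaluates to 0, so strcmp() receives a null pointer as its second argument, which frame [1] of the stack trace confirms (the second argument is 0x0). Of course, dbx can show the source line only if the application source code is available and the program was compiled with debug information.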

If you are familiar with assembly language and hardware specifications, you can use mdb to debug the core file, because mdb is a low-level debugging utility for both programs and the Solaris OS.

Cause of System Core Dumps
 

There are many reasons why the Solaris OS might crash and produce a core dump. Not only software problems, such as faulty drivers and programs, but also hardware errors can induce a system core dump.

How a System Core Dump Is Created
 

When it detects that data integrity has been compromised or that a fatal hardware error has occurred, the Solaris OS invokes panic(). The panic() routine interrupts all processes, as if the OS were suspended. It then generates a system core dump, which is a copy of the OS in memory, and saves it to the dump device. During the next boot after a crash, the OS uses savecore(1M) to retrieve the core dump from the dump device and write it to the savecore directory. The savecore routine generates two files: one whose name begins with unix., which is an OS symbol table list, and one whose name begins with vmcore., which is the core dump data file. By default, the dump device is a swap disk partition and the savecore directory is located under /var/crash/. The trailing part of each file name is an integer that grows every time savecore runs.
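Because the numeric suffix grows with every savecore run, the most recent dump is the pair with the highest number. The following minimal sketch picks that pair out of a directory. latest_crash_pair is a hypothetical helper, and /var/crash is only a placeholder default; pass your actual savecore directory as the argument.

```shell
# Hypothetical helper: print the newest unix.N / vmcore.N pair in directory $1.
latest_crash_pair() {
    dir=$1
    # Collect the numeric suffixes of the vmcore.* files and keep the largest.
    n=$(ls "$dir" 2>/dev/null |
        sed -n 's/^vmcore\.\([0-9][0-9]*\)$/\1/p' | sort -n | tail -n 1)
    [ -n "$n" ] || return 1
    echo "$dir/unix.$n $dir/vmcore.$n"
}

# /var/crash is a placeholder; use the directory configured on your system.
latest_crash_pair "${1:-/var/crash}" || echo "no saved crash dumps found"
```

The suffixes are sorted numerically rather than alphabetically, so vmcore.10 is correctly treated as newer than vmcore.2.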

How to Manage a System Core Dump
 

You can use dumpadm(1M) to manage dump devices and the savecore directory:

-bash-3.00# dumpadm -d /dump  -s /savecore
    Dump content: kernel pages
    Dump device: /dump (dedicated)
    Savecore directory: /savecore
    Savecore enabled: yes

To verify this or see the current configuration, run dumpadm with no arguments:

-bash-3.00# dumpadm
    Dump content: kernel pages
    Dump device: /dump (dedicated)
    Savecore directory: /savecore
    Savecore enabled: yes

You can also use dumpadm to set the dump content and to enable or disable the savecore(1M) operation during boot.

All the configuration information is saved in the /etc/dumpadm.conf configuration file. The system crash dump service is also managed by SMF with this service identifier: svc:/system/dumpadm:default.

How to Create a System Core Dump Manually
 

In some cases, you need to save a core dump manually to take a snapshot of the live system. The Solaris OS offers several ways to do this. For example, you can use reboot -d to force a core dump as part of a reboot, or you can use savecore -L to create a core dump of the live OS. If you want to use savecore(1M) to create a live core dump, you must first use dumpadm to set a non-swap device as the dump device, because on a live system the swap device is part of the virtual memory that is being dumped.

Sometimes, the system will hang without crashing. If you are using a Sun UltraSPARC processor-based machine, you can press Stop-A to enter OpenBoot PROM (OBP) mode, and then use the sync OBP command to force a crash core dump.

On x86 platforms, there is no corresponding OBP unit. However, you can use kmdb(1). To use kmdb to create a core dump, you need to load its module during system boot.

Here are the steps for the Solaris 10 1/06 OS or later.

  1. Edit the /boot/grub/menu.lst file and append the -k string to the kernel line, as follows:
    title Solaris 10 11/06 s10x_u3wos_10 X86
    root (hd0,1,a)
    kernel /platform/i86pc/multiboot -k
    module /platform/i86pc/boot_archive

    This will make the OS boot with kmdb.

  2. Then restart the machine manually.

Alternatively, if you are using the Solaris 10 GA OS, just enter b -k when you see the Select (B)oot or (I)nterpreter: system prompt during the system boot stage.

After performing these steps, press F1-A to break into kmdb. This action must be performed in console mode, because kmdb suspends the system and GUI applications. If you are using a desktop system, the Solaris OS will fail to switch to console mode and your desktop will appear to hang. However, kmdb is running and you can still type commands.

$<systemdump

The systemdump command generates the core dump file for you. The dump device and savecore directory for this operation are still constrained by dumpadm.

Sometimes, the system will hang without any response even when you use kmdb or OBP. In this case, use the "deadman timer." The deadman timer allows the OS to force a kernel panic in the event of a system hang. This feature is available on x86 and SPARC systems. Add the following line to /etc/system and reboot so the deadman timer will be enabled.

set snooping=1

            

The enabled deadman timer will perform a level 15 interrupt once a second. It will check whether the kernel lbolt variable is updated. If the deadman timer notices that the lbolt variable has not been incremented for a period of time (the default is 50 seconds), it will cause a panic. The period of time can be configured in /etc/system. The following example makes the deadman timer wait 120 seconds for the lbolt variable update:

set snoop_interval=120000000
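The value in this example corresponds to 120 seconds expressed in microseconds, which is an inference from the figures above (120 seconds and 120000000). Under that assumption, the line for any other timeout can be computed as follows. snoop_interval_for is a hypothetical helper name.

```shell
# Hypothetical helper: print the /etc/system line for a timeout in seconds.
# Assumption: snoop_interval is expressed in microseconds, as the example
# above (120 seconds -> 120000000) implies.
snoop_interval_for() {
    seconds=$1
    echo "set snoop_interval=$((seconds * 1000000))"
}

snoop_interval_for 120
# -> set snoop_interval=120000000
snoop_interval_for 50
# -> set snoop_interval=50000000
```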

            

Solaris Dynamic Tracing (DTrace) was introduced with the release of the Solaris 10 OS. DTrace allows you to understand and explore applications or the operating system. DTrace contains a feature called Anonymous Tracing, which provides device driver developers with a way to debug and trace system activity that occurs during system boot. If the Solaris OS hangs, you can use this feature to generate a core dump and capture other information you want. For information on using DTrace, refer to the DTrace documentation.

How to Analyze a System Core Dump File
 

This article cannot provide solutions for fixing a system core dump, because such an analysis requires a great deal of low-level knowledge of the OS kernel and of the hardware. However, here are some basic guidelines for your reference:

  1. Check the system console and the /var/adm/messages file, because they contain valuable information for identifying the problem that the system encountered.
  2. Use the strings(1) command to process the core dump file. This command prints out the ASCII strings in any binary file, including a core dump file. You need to look at these ASCII strings.
  3. Look up the error you encounter in Sun's support resources, or use the free Solaris Crash Analysis Tool (CAT) to help you investigate further.
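The strings step in guideline 2 can produce a very large amount of output, so it is usually filtered. The following minimal sketch keeps only lines that mention a keyword and shows the most frequent ones first. scan_dump_strings is a hypothetical helper, the keyword "panic" is an assumption about what is worth searching for, and the file name in the usage comment is an example only.

```shell
# Hypothetical helper: list the most frequent panic-related strings in a dump.
# $1 is the dump file. The keyword "panic" is an assumption; adjust it to the
# messages you expect to find.
scan_dump_strings() {
    strings "$1" | grep -i 'panic' | sort | uniq -c | sort -rn | head -n 20
}

# Usage on a saved dump (file name is an example only):
#   scan_dump_strings vmcore.0
```

Comparing these strings with the messages in /var/adm/messages from guideline 1 is a reasonable first cross-check.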
Acknowledgments
 

Thanks to Xinfeng Liu and Oliver Yang, Professional Engineers from the Sun China Engineering & Research Institute, for their invaluable comments.

