Category: LINUX
2015-08-21 14:25:32
IMO, after some investigation, BUG86475 can be summarized as follows.
1. A signal handler must not call non-reentrant functions, either directly or
indirectly. calloc and malloc are examples, because they operate on the global
memory allocation table.
2. However, the function "certd_sig_handler" (certd.c:7617), which handles
SIGCHLD, calls the function "create_manifest" (certd-cluster-funcs.c:339),
which in turn invokes "calloc". As a result, the behavior is unpredictable
when the signal SIGCHLD is delivered.
3. Now let us analyse the stack trace that follows.
The daemon "certd" called the function "malloc", shown in frame #9. Before
malloc returned, the signal handler "certd_sig_handler" was invoked and
reached calloc, shown in frame #2.
My guess is that #9 had acquired the mutex that protects the global memory
allocation table, and #2 then tried to acquire the same mutex before #9 could
release it. Both frames are on the same stack, so #9 cannot resume until the
handler returns, and the handler cannot return until #9 releases the mutex.
The result is a deadlock: the process blocks forever, as the backtrace and the
strace output below show.
Although the probability of hitting this window in a short-lived signal
handler is very small, it is easy to reproduce on an SMP device under high
workload.
# gdb -p 3698
(gdb) bt
#0 0x00002aaab0a8fd4e in ?? () from /lib64/libc.so.6
#1 0x00002aaab0a1ba51 in ?? () from /lib64/libc.so.6
#2 0x00002aaab0a19dc1 in calloc () from /lib64/libc.so.6
#3 0x00002aaaad9cb9b6 in create_manifest () from /lib64/libhashfiles.so
#4 0x0000000000406db3 in ?? ()
#5 0x00002aaaad7c4a08 in signal_sigaction () from /lib64/libsignal.so
#6 <signal handler called>
#7 0x00002aaab0a15eb5 in ?? () from /lib64/libc.so.6
#8 0x00002aaab0a172b8 in ?? () from /lib64/libc.so.6
#9 0x00002aaab0a19440 in malloc () from /lib64/libc.so.6
#10 0x00002aaaad01d663 in CRYPTO_malloc () from /lib64/libcrypto.so.1.0.0
#11 0x00002aaaad092ad4 in BUF_MEM_grow () from /lib64/libcrypto.so.1.0.0
#12 0x00002aaaad0d4920 in PEM_read_bio () from /lib64/libcrypto.so.1.0.0
#13 0x00002aaaad0d4ec6 in PEM_bytes_read_bio () from /lib64/libcrypto.so.1.0.0
#14 0x00002aaaad0d682f in PEM_ASN1_read_bio () from /lib64/libcrypto.so.1.0.0
#15 0x00002aaaac47c8fb in load_cert () from /lib64/libpkicli.so
#16 0x00002aaaac4897c9 in wg_get_cert_info_by_purpose () from /lib64/libpkicli.so
#17 0x000000000040e33a in ?? ()
#18 0x000000000040ea0f in ?? ()
#19 0x0000000000408b3c in ?? ()
#20 0x0000000000408b7e in ?? ()
#21 0x0000000000408b3c in ?? ()
#22 0x000000000040f203 in ?? ()
#23 0x00002aaaabc3108b in ?? () from /lib64/liblistener.so
#24 0x00002aaaabc3003c in ListenLoop () from /lib64/liblistener.so
#25 0x0000000000404fec in ?? ()
#26 0x00002aaab09bebb5 in __libc_start_main () from /lib64/libc.so.6
#27 0x0000000000405415 in ?? ()
[Certd hang]
# strace -p 3698
Process 3698 attached - interrupt to quit
futex(0x2aaab0d41600, FUTEX_WAIT_PRIVATE, 2, NULL
Process 3698 detached
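
The guess in step 3 can be modelled without touching the real allocator. The
sketch below is not certd or glibc code: "fake_heap_lock", "model_handler" and
"simulate_signal_inside_malloc" are hypothetical names, and an atomic flag
stands in for the allocator mutex. It only shows that a handler interrupting
the lock holder on the same thread finds the lock already taken, which is the
point where a real calloc would block forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-in for the allocator's internal lock. */
static atomic_flag fake_heap_lock = ATOMIC_FLAG_INIT;

/* Set to 1 by the handler if it found the lock already held. */
static volatile sig_atomic_t handler_saw_lock_held = 0;

static void model_handler(int signo)
{
    (void)signo;
    /* test_and_set returns the previous value: true means "already locked". */
    if (atomic_flag_test_and_set(&fake_heap_lock)) {
        /* A blocking lock would wait here forever: the owner is the code
         * this handler interrupted, so it cannot run to release the lock. */
        handler_saw_lock_held = 1;
    } else {
        atomic_flag_clear(&fake_heap_lock);
    }
}

/* Returns 1 if a signal delivered "inside malloc" finds the lock held,
 * 0 if not, and -1 if the handler could not be installed. */
int simulate_signal_inside_malloc(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = model_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    handler_saw_lock_held = 0;
    atomic_flag_test_and_set(&fake_heap_lock);  /* "malloc" takes the lock */
    raise(SIGUSR1);                             /* signal arrives mid-call */
    atomic_flag_clear(&fake_heap_lock);         /* "malloc" releases it    */
    return handler_saw_lock_held;
}
```

The model uses a non-blocking test in the handler so that it terminates; the
real handler has no such escape, which matches the futex wait seen in strace.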
4. Teamtrack ID (Bug/RFE/Task):
BUG86475: certd sometimes gets stuck because of a non-async-signal-safe
signal handler
Root Cause (Bug) or Purpose (RFE/Task):
(1). The signal handler calls non-reentrant functions such as "calloc" and
"free". This is likely to lead to a deadlock.
(2). The main thread is currently likely to leave zombie processes, because
its own "SIGCHLD" signals and the "SIGCHLD" signals raised in the function
"wgut_system" are not handled as expected.
Solution:
(1). Create a pipe. When a signal is delivered, the signal handler uses only
reentrant (async-signal-safe) functions to write the corresponding signal
value to the pipe. This is similar to the top half of interrupt handling.
The main thread always listens for read events on the pipe. When a read event
arrives, it finishes all the rest of the work, so most of the signal
processing, including the non-reentrant function calls, is done there. This is
similar to the bottom half of interrupt handling.
(2). To avoid zombie processes, the main thread creates child processes only
after the operation on the signal "SIGCHLD".
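
The pipe-based split described in the solution can be sketched as follows.
This is a minimal illustration and not the actual certd patch: the names
"sig_pipe", "forward_signal", "sigpipe_setup", "sigpipe_wait" and
"sigpipe_drain" are hypothetical, and reaping with waitpid(WNOHANG) in the
main thread is an assumption about what "the rest of the work" would include
for SIGCHLD.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int sig_pipe[2] = { -1, -1 };    /* [0] read end, [1] write end */

/* Top half: runs in signal context, so it only calls write(), which
 * POSIX lists as async-signal-safe. errno is saved and restored. */
static void forward_signal(int signo)
{
    int saved_errno = errno;
    unsigned char byte = (unsigned char)signo;
    ssize_t n = write(sig_pipe[1], &byte, 1);
    (void)n;                /* a full pipe just drops this notification */
    errno = saved_errno;
}

/* Creates the non-blocking pipe and installs the SIGCHLD handler.
 * Returns 0 on success, -1 on failure. */
int sigpipe_setup(void)
{
    struct sigaction sa;
    int i;

    if (pipe(sig_pipe) != 0)
        return -1;
    for (i = 0; i < 2; i++) {
        int fl = fcntl(sig_pipe[i], F_GETFL);
        if (fl == -1 || fcntl(sig_pipe[i], F_SETFL, fl | O_NONBLOCK) == -1)
            return -1;
    }
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = forward_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Waits for a read event on the pipe. Returns 1 if a notification is
 * pending, 0 on timeout, -1 on error. */
int sigpipe_wait(int timeout_ms)
{
    struct pollfd pfd = { .fd = sig_pipe[0], .events = POLLIN };
    int r;

    do {
        r = poll(&pfd, 1, timeout_ms);
    } while (r == -1 && errno == EINTR);    /* poll is never auto-restarted */
    return r;
}

/* Bottom half: runs in the main thread, outside signal context, so
 * malloc, calloc and free would be safe here. Returns children reaped. */
int sigpipe_drain(void)
{
    unsigned char buf[64];
    int saw_sigchld = 0;
    int reaped = 0;
    ssize_t n, i;

    while ((n = read(sig_pipe[0], buf, sizeof buf)) > 0)
        for (i = 0; i < n; i++)
            if (buf[i] == SIGCHLD)
                saw_sigchld = 1;

    if (saw_sigchld) {
        /* Several exits may share one notification, so loop. */
        while (waitpid(-1, NULL, WNOHANG) > 0)
            reaped++;
    }
    return reaped;
}
```

In a daemon with an existing event loop, the read end of the pipe would
presumably be registered with that loop instead of being polled on its own.
Calling sigpipe_setup() before the first fork() means no child exit can go
unnoticed.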