Actions
Bug #19831
closedrgw: segfault during the shutdown procedure
Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%
Source:
Development
Tags:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
On a very recent master:
^C2017-05-03 12:42:00.880081 7f74bf98f700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory 2017-05-03 12:42:00.880159 7f74bf98f700 -1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0 2017-05-03 12:42:00.880193 7f74bf98f700 1 handle_sigterm 2017-05-03 12:42:00.880262 7f74f22f8640 -1 shutting down 2017-05-03 12:42:00.880322 7f74bf98f700 1 handle_sigterm set alarm for 120 2017-05-03 12:42:01.040245 7f748c110700 0 ERROR: FCGX_Accept_r returned -4 2017-05-03 12:42:01.040417 7f74bf98f700 -1 received signal: User defined signal 1 from PID: 31209 task name: bin/radosgw -n client.radosgw.gateway -c ./ceph-rgw.conf --rgw_socket_path /tmp/rgw.fcgi.sock -d -d --debug_rgw 20 --rgw_swift_enforce_content_length=true --rgw_swift_account_in_url=true --rgw_keystone_implicit_tenants=true --rgw_keystone_accepted_admin_roles admin --rgw_swift_versioning_enabled=true UID: 1000 2017-05-03 12:42:01.040425 7f74bf98f700 1 handle_sigterm 2017-05-03 12:42:01.044111 7f748c110700 20 cleaning up fcgx connections 2017-05-03 12:42:01.044892 7f74c3997700 0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125 2017-05-03 12:42:01.044917 7f74c3997700 5 run(): was stopped, exiting 2017-05-03 12:42:01.044920 7f74c3997700 20 clearing stack on run() exit: stack=0x7f74f35c1960 nref=1 2017-05-03 12:42:01.044929 7f74c3997700 20 run(stacks) returned r=-125 2017-05-03 12:42:01.052328 7f74f22f8640 20 remove_watcher() i=0 2017-05-03 12:42:01.052343 7f74f22f8640 2 removed watcher, disabling cache 2017-05-03 12:42:01.056263 7f74f22f8640 20 remove_watcher() i=1 2017-05-03 12:42:01.060180 7f74f22f8640 20 remove_watcher() i=2 2017-05-03 12:42:01.063779 7f74f22f8640 20 remove_watcher() i=3 2017-05-03 12:42:01.067242 7f74f22f8640 20 remove_watcher() i=4 2017-05-03 12:42:01.071057 7f74f22f8640 20 remove_watcher() i=5 2017-05-03 12:42:01.074296 7f74f22f8640 20 remove_watcher() i=6 2017-05-03 12:42:01.079278 7f74f22f8640 20 remove_watcher() i=7 2017-05-03 12:42:01.081615 7f74c1192700 20 BucketsSyncThread: done 2017-05-03 12:42:01.082347 7f74c0991700 20 UserSyncThread: done 2017-05-03 12:42:01.086479 7f74f22f8640 1 final shutdown *** Caught signal (Segmentation fault) ** in thread 7f74c0190700 thread_name:rgw_swift_k_rev ceph version 12.0.1-1890-gd0b3d4a (d0b3d4aee014402032ffc65ae3faf6f19d1ba77d) 1: (()+0x1a4047) [0x7f74f1db3047] 2: (()+0x10340) [0x7f74e807c340] 3: (pthread_mutex_lock()+0x4) [0x7f74e8076414] 4: (PR_Lock()+0x9) [0x7f74e79fb519] 5: (()+0x26f04) [0x7f74e7a00f04] 6: (()+0x26fc3) [0x7f74e7a00fc3] 7: (()+0x7f82) [0x7f74e8073f82] 8: (()+0x8195) [0x7f74e8074195] 9: (clone()+0x6d) [0x7f74e698b47d] Segmentation fault (core dumped)
Updated by Radoslaw Zarzynski almost 7 years ago
(gdb) handle SIGINT nostop pass SIGINT is used by the debugger. Are you sure you want to change it? (y or n) y Signal Stop Print Pass to program Description SIGINT No Yes Yes Interrupt (gdb) cont Continuing. ... Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7f2240ba0700 (LWP 16210)] __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66 66 ../nptl/pthread_mutex_lock.c: No such file or directory. (gdb) bt #0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66 #1 0x00007f226840b519 in PR_Lock () from /usr/lib/x86_64-linux-gnu/libnspr4.so #2 0x00007f2268410f04 in ?? () from /usr/lib/x86_64-linux-gnu/libnspr4.so #3 0x00007f2268410fc3 in ?? () from /usr/lib/x86_64-linux-gnu/libnspr4.so #4 0x00007f2268a83f82 in __nptl_deallocate_tsd () at pthread_create.c:158 #5 0x00007f2268a84195 in start_thread (arg=0x7f2240ba0700) at pthread_create.c:325 #6 0x00007f226739b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Updated by Radoslaw Zarzynski almost 7 years ago
After pulling the debug symbols for libnspr4:
(gdb) cont Continuing. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7f407461b700 (LWP 12544)] __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66 66 ../nptl/pthread_mutex_lock.c: No such file or directory. (gdb) bt full #0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66 __PRETTY_FUNCTION__ = "__pthread_mutex_lock" type = 0 #1 0x00007f409be864d9 in PR_Lock (lock=0x0) at ptsynch.c:177 No locals. #2 0x00007f409be8bf04 in _pt_thread_death_internal (arg=0x7f40a7a94880, callDestructors=1) at ptthread.c:880 thred = 0x7f40a7a94880 #3 0x00007f409be8bfc3 in _pt_thread_death (arg=0x7f40a7a94880) at ptthread.c:865 thred = <optimized out> #4 0x00007f409c4fef82 in __nptl_deallocate_tsd () at pthread_create.c:158 data = 0x0 idx = 0 cnt = 139914807196248 #5 0x00007f409c4ff195 in start_thread (arg=0x7f407461b700) at pthread_create.c:325 pd = 0x7f407461b700 now = <optimized out> unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139914807195392, -1974657238859091938, 0, 0, 139914807196096, 139914807195392, 1880379800789872670, 1880713728613969950}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = <optimized out> pagesize_m1 = <optimized out> sp = <optimized out> freesize = <optimized out> __PRETTY_FUNCTION__ = "start_thread" #6 0x00007f409ae1647d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 No locals.
_pt_thread_death_internal of NSPR calls PR_Lock with pt_book.ml that is NULL at the moment. The member is being cleaned in PR_Cleanup:
/*
* I am not sure if it's safe to delete the cv and lock here,
* since there may still be "system" threads around. If this
* call isn't immediately prior to exiting, then there's a
* problem.
*/
if (0 == pt_book.system)
{
PR_DestroyCondVar(pt_book.cv); pt_book.cv = NULL;
PR_DestroyLock(pt_book.ml); pt_book.ml = NULL;
}
If that's the case, the PR_Cleanup is called before stopping the Keystone revocation thread. Let's verify that:
(gdb) break PR_Cleanup Breakpoint 1 at 0x7fd7e22c9960: file ptthread.c, line 1101. (gdb) cont Continuing. ... Breakpoint 1, PR_Cleanup () at ptthread.c:1101 1101 ptthread.c: No such file or directory. (gdb) bt #0 PR_Cleanup () at ptthread.c:1101 #1 0x00007fd7e31dad75 in ceph::crypto::shutdown (shared=<optimized out>) at /work/ceph-4/src/common/ceph_crypto.cc:88 #2 0x00007fd7e3187de5 in CephContext::~CephContext (this=0x7fd7edc28000, __in_chrg=<optimized out>) at /work/ceph-4/src/common/ceph_context.cc:663 #3 0x00007fd7e3187e1a in CephContext::put (this=0x7fd7edc28000) at /work/ceph-4/src/common/ceph_context.cc:670 #4 0x00007fd7ec6295e5 in ~intrusive_ptr (this=0x7ffc64219b50, __in_chrg=<optimized out>) at /work/ceph-4/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:97 #5 main (argc=<optimized out>, argv=<optimized out>) at /work/ceph-4/src/rgw/rgw_main.cc:274 (gdb) cont Continuing. Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fd7baa58700 (LWP 14013)] __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66 66 ../nptl/pthread_mutex_lock.c: No such file or directory. (gdb) bt #0 __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66 #1 0x00007fd7e22c34d9 in PR_Lock (lock=0x0) at ptsynch.c:177 #2 0x00007fd7e22c8f04 in _pt_thread_death_internal (arg=0x7fd7edc86640, callDestructors=1) at ptthread.c:880 #3 0x00007fd7e22c8fc3 in _pt_thread_death (arg=0x7fd7edc86640) at ptthread.c:865 #4 0x00007fd7e293bf82 in __nptl_deallocate_tsd () at pthread_create.c:158 #5 0x00007fd7e293c195 in start_thread (arg=0x7fd7baa58700) at pthread_create.c:325 #6 0x00007fd7e125347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Updated by Radoslaw Zarzynski almost 7 years ago
- Regression changed from No to Yes
The bug is a regression. It has been introduced in: b22977dfd27967def1b6d4caf83694a2264fc825 (PR 14801). After reverting the commit the segfault is gone.
Updated by Radoslaw Zarzynski almost 7 years ago
- Status changed from New to Fix Under Review
Updated by Daniel Gryniewicz about 4 years ago
- Status changed from Fix Under Review to Resolved
Actions