Bug #19831: rgw: segfault during the shutdown procedure - rgw - Ceph

Actions

Copy link

Bug #19831

closed

rgw: segfault during the shutdown procedure

Added by Radoslaw Zarzynski almost 7 years ago. Updated about 4 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Target version:

% Done:

Source:

Development

Tags:

Backport:

Regression:

Yes

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

On a very recent master:

^C2017-05-03 12:42:00.880081 7f74bf98f700 -1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
2017-05-03 12:42:00.880159 7f74bf98f700 -1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
2017-05-03 12:42:00.880193 7f74bf98f700  1 handle_sigterm
2017-05-03 12:42:00.880262 7f74f22f8640 -1 shutting down
2017-05-03 12:42:00.880322 7f74bf98f700  1 handle_sigterm set alarm for 120
2017-05-03 12:42:01.040245 7f748c110700  0 ERROR: FCGX_Accept_r returned -4
2017-05-03 12:42:01.040417 7f74bf98f700 -1 received  signal: User defined signal 1 from  PID: 31209 task name: bin/radosgw -n client.radosgw.gateway -c ./ceph-rgw.conf --rgw_socket_path /tmp/rgw.fcgi.sock -d -d --debug_rgw 20 --rgw_swift_enforce_content_length=true --rgw_swift_account_in_url=true --rgw_keystone_implicit_tenants=true --rgw_keystone_accepted_admin_roles admin --rgw_swift_versioning_enabled=true  UID: 1000
2017-05-03 12:42:01.040425 7f74bf98f700  1 handle_sigterm
2017-05-03 12:42:01.044111 7f748c110700 20 cleaning up fcgx connections
2017-05-03 12:42:01.044892 7f74c3997700  0 ERROR: failed to clone shard, completion_mgr.get_next() returned ret=-125
2017-05-03 12:42:01.044917 7f74c3997700  5 run(): was stopped, exiting
2017-05-03 12:42:01.044920 7f74c3997700 20 clearing stack on run() exit: stack=0x7f74f35c1960 nref=1
2017-05-03 12:42:01.044929 7f74c3997700 20 run(stacks) returned r=-125
2017-05-03 12:42:01.052328 7f74f22f8640 20 remove_watcher() i=0
2017-05-03 12:42:01.052343 7f74f22f8640  2 removed watcher, disabling cache
2017-05-03 12:42:01.056263 7f74f22f8640 20 remove_watcher() i=1
2017-05-03 12:42:01.060180 7f74f22f8640 20 remove_watcher() i=2
2017-05-03 12:42:01.063779 7f74f22f8640 20 remove_watcher() i=3
2017-05-03 12:42:01.067242 7f74f22f8640 20 remove_watcher() i=4
2017-05-03 12:42:01.071057 7f74f22f8640 20 remove_watcher() i=5
2017-05-03 12:42:01.074296 7f74f22f8640 20 remove_watcher() i=6
2017-05-03 12:42:01.079278 7f74f22f8640 20 remove_watcher() i=7
2017-05-03 12:42:01.081615 7f74c1192700 20 BucketsSyncThread: done
2017-05-03 12:42:01.082347 7f74c0991700 20 UserSyncThread: done
2017-05-03 12:42:01.086479 7f74f22f8640  1 final shutdown
*** Caught signal (Segmentation fault) **
 in thread 7f74c0190700 thread_name:rgw_swift_k_rev
 ceph version 12.0.1-1890-gd0b3d4a (d0b3d4aee014402032ffc65ae3faf6f19d1ba77d)
 1: (()+0x1a4047) [0x7f74f1db3047]
 2: (()+0x10340) [0x7f74e807c340]
 3: (pthread_mutex_lock()+0x4) [0x7f74e8076414]
 4: (PR_Lock()+0x9) [0x7f74e79fb519]
 5: (()+0x26f04) [0x7f74e7a00f04]
 6: (()+0x26fc3) [0x7f74e7a00fc3]
 7: (()+0x7f82) [0x7f74e8073f82]
 8: (()+0x8195) [0x7f74e8074195]
 9: (clone()+0x6d) [0x7f74e698b47d]
Segmentation fault (core dumped)

Actions

Copy link

Updated by Radoslaw Zarzynski almost 7 years ago

(gdb) handle SIGINT nostop pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop    Print    Pass to program    Description
SIGINT        No    Yes    Yes        Interrupt
(gdb) cont
Continuing.
...
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f2240ba0700 (LWP 16210)]
__GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
66    ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
#1  0x00007f226840b519 in PR_Lock () from /usr/lib/x86_64-linux-gnu/libnspr4.so
#2  0x00007f2268410f04 in ?? () from /usr/lib/x86_64-linux-gnu/libnspr4.so
#3  0x00007f2268410fc3 in ?? () from /usr/lib/x86_64-linux-gnu/libnspr4.so
#4  0x00007f2268a83f82 in __nptl_deallocate_tsd () at pthread_create.c:158
#5  0x00007f2268a84195 in start_thread (arg=0x7f2240ba0700) at pthread_create.c:325
#6  0x00007f226739b47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Actions

Copy link

Updated by Radoslaw Zarzynski almost 7 years ago

After pulling the debug symbols for libnspr4:

(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f407461b700 (LWP 12544)]
__GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
66    ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt full
#0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
        __PRETTY_FUNCTION__ = "__pthread_mutex_lock" 
        type = 0
#1  0x00007f409be864d9 in PR_Lock (lock=0x0) at ptsynch.c:177
No locals.
#2  0x00007f409be8bf04 in _pt_thread_death_internal (arg=0x7f40a7a94880, callDestructors=1) at ptthread.c:880
        thred = 0x7f40a7a94880
#3  0x00007f409be8bfc3 in _pt_thread_death (arg=0x7f40a7a94880) at ptthread.c:865
        thred = <optimized out>
#4  0x00007f409c4fef82 in __nptl_deallocate_tsd () at pthread_create.c:158
        data = 0x0
        idx = 0
        cnt = 139914807196248
#5  0x00007f409c4ff195 in start_thread (arg=0x7f407461b700) at pthread_create.c:325
        pd = 0x7f407461b700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139914807195392, -1974657238859091938, 0, 0, 139914807196096, 139914807195392, 
                1880379800789872670, 1880713728613969950}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, 
              cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread" 
#6  0x00007f409ae1647d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.

_pt_thread_death_internal of NSPR calls PR_Lock with pt_book.ml that is NULL at the moment. The member is being cleaned in PR_Cleanup:

        /*
         * I am not sure if it's safe to delete the cv and lock here,
         * since there may still be "system" threads around. If this
         * call isn't immediately prior to exiting, then there's a
         * problem.
         */
        if (0 == pt_book.system)
        {
            PR_DestroyCondVar(pt_book.cv); pt_book.cv = NULL;
            PR_DestroyLock(pt_book.ml); pt_book.ml = NULL;
        }

If that's the case, the PR_Cleanup is called before stopping the Keystone revocation thread. Let's verify that:

(gdb) break PR_Cleanup
Breakpoint 1 at 0x7fd7e22c9960: file ptthread.c, line 1101.
(gdb) cont
Continuing.
...
Breakpoint 1, PR_Cleanup () at ptthread.c:1101
1101    ptthread.c: No such file or directory.
(gdb) bt
#0  PR_Cleanup () at ptthread.c:1101
#1  0x00007fd7e31dad75 in ceph::crypto::shutdown (shared=<optimized out>) at /work/ceph-4/src/common/ceph_crypto.cc:88
#2  0x00007fd7e3187de5 in CephContext::~CephContext (this=0x7fd7edc28000, __in_chrg=<optimized out>) at /work/ceph-4/src/common/ceph_context.cc:663
#3  0x00007fd7e3187e1a in CephContext::put (this=0x7fd7edc28000) at /work/ceph-4/src/common/ceph_context.cc:670
#4  0x00007fd7ec6295e5 in ~intrusive_ptr (this=0x7ffc64219b50, __in_chrg=<optimized out>)
    at /work/ceph-4/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:97
#5  main (argc=<optimized out>, argv=<optimized out>) at /work/ceph-4/src/rgw/rgw_main.cc:274
(gdb) cont
Continuing.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fd7baa58700 (LWP 14013)]
__GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
66    ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0  __GI___pthread_mutex_lock (mutex=0x0) at ../nptl/pthread_mutex_lock.c:66
#1  0x00007fd7e22c34d9 in PR_Lock (lock=0x0) at ptsynch.c:177
#2  0x00007fd7e22c8f04 in _pt_thread_death_internal (arg=0x7fd7edc86640, callDestructors=1) at ptthread.c:880
#3  0x00007fd7e22c8fc3 in _pt_thread_death (arg=0x7fd7edc86640) at ptthread.c:865
#4  0x00007fd7e293bf82 in __nptl_deallocate_tsd () at pthread_create.c:158
#5  0x00007fd7e293c195 in start_thread (arg=0x7fd7baa58700) at pthread_create.c:325
#6  0x00007fd7e125347d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Actions

Copy link

Updated by Radoslaw Zarzynski almost 7 years ago

Regression changed from No to Yes

The bug is a regression. It has been introduced in: b22977dfd27967def1b6d4caf83694a2264fc825 (PR 14801). After reverting the commit the segfault is gone.

Actions

Copy link