
Bug #20988

client: dual client segfault with racing ceph_shutdown

Added by Jeff Layton about 2 years ago. Updated over 1 year ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
08/12/2017
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
jewel,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:

Description

I have a testcase that I'm working on that has two threads, each with their own ceph_mount_info. If those threads end up doing racing ceph_shutdown calls, I see crashes:

Core was generated by `./bin/ceph_test_libcephfs --gtest_filter=LibCephFS.Delegation'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  lockdep_will_unlock (name=0x7fffc008a950 "PerfCountersCollection", id=<optimized out>) at /home/jlayton/git/ceph/src/common/lockdep.cc:369
369      lockdep_dout(20) << "_will_unlock " << name << dendl;
[Current thread is 1 (Thread 0x7fffb2ffd700 (LWP 8319))]
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.25-7.fc26.x86_64 libblkid-2.30.1-1.fc26.x86_64 libgcc-7.1.1-3.fc26.x86_64 libstdc++-7.1.1-3.fc26.x86_64 libuuid-2.30.1-1.fc26.x86_64 lttng-ust-2.9.0-2.fc26.x86_64 nspr-4.15.0-1.fc26.x86_64 nss-3.31.0-1.1.fc26.x86_64 nss-softokn-3.31.0-1.0.fc26.x86_64 nss-softokn-freebl-3.31.0-1.0.fc26.x86_64 nss-util-3.31.0-1.0.fc26.x86_64 sqlite-libs-3.19.3-1.fc26.x86_64 userspace-rcu-0.9.3-2.fc26.x86_64 zlib-1.2.11-2.fc26.x86_64
(gdb) bt
#0  lockdep_will_unlock (name=0x7fffc008a950 "PerfCountersCollection", id=<optimized out>) at /home/jlayton/git/ceph/src/common/lockdep.cc:369
#1  0x00007fffedb85820 in Mutex::_will_unlock (this=0x7fffc008a858) at /home/jlayton/git/ceph/src/common/Mutex.h:62
#2  Mutex::Unlock (this=this@entry=0x7fffc008a858) at /home/jlayton/git/ceph/src/common/Mutex.cc:120
#3  0x00007fffedb812b0 in Mutex::Locker::~Locker (this=<synthetic pointer>, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/common/Mutex.h:118
#4  PerfCountersCollection::remove (this=0x7fffc008a850, l=<optimized out>) at /home/jlayton/git/ceph/src/common/perf_counters.cc:62
#5  0x00007ffff7b87e6e in ObjectCacher::perf_stop (this=0x7fffc0125fd0) at /home/jlayton/git/ceph/src/osdc/ObjectCacher.cc:688
#6  0x00007ffff7b9edc1 in ObjectCacher::~ObjectCacher (this=0x7fffc0125fd0, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/osdc/ObjectCacher.cc:638
#7  0x00007ffff7b218ea in std::default_delete<ObjectCacher>::operator() (this=<optimized out>, __ptr=0x7fffc0125fd0) at /usr/include/c++/7/bits/unique_ptr.h:78
#8  std::unique_ptr<ObjectCacher, std::default_delete<ObjectCacher> >::~unique_ptr (this=0x7fffc01227a8, __in_chrg=<optimized out>)
    at /usr/include/c++/7/bits/unique_ptr.h:268
#9  Client::~Client (this=0x7fffc0121d40, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/client/Client.cc:301
#10 0x00007ffff7b21c99 in StandaloneClient::~StandaloneClient (this=0x7fffc0121d40, __in_chrg=<optimized out>) at /home/jlayton/git/ceph/src/client/Client.cc:13420
#11 0x00007ffff7ad2323 in ceph_mount_info::shutdown (this=0x7fffc008e590) at /home/jlayton/git/ceph/src/libcephfs.cc:164
#12 ceph_shutdown (cmount=0x7fffc008e590) at /home/jlayton/git/ceph/src/libcephfs.cc:357
#13 0x0000555555607f70 in breaker_func (filename=<optimized out>) at /home/jlayton/git/ceph/src/test/libcephfs/deleg.cc:53
#14 0x00007ffff6fb002f in ?? () from /lib64/libstdc++.so.6
#15 0x00007ffff78a836d in start_thread () from /lib64/libpthread.so.0
#16 0x00007ffff6706b8f in clone () from /lib64/libc.so.6

If I instead ensure that the calls are serialized, they don't crash.

0001-client-test-shutdown-race.patch - Updated testcase (1.27 KB) Jeff Layton, 08/21/2017 06:14 PM


Related issues

Copied to fs - Backport #21525: luminous: client: dual client segfault with racing ceph_shutdown Resolved
Copied to fs - Backport #21526: jewel: client: dual client segfault with racing ceph_shutdown Closed

History

#1 Updated by Patrick Donnelly almost 2 years ago

  • Subject changed from segfault with racing libcephfs ceph_shutdown calls to client: dual client segfault with racing ceph_shutdown
  • Source set to Development

Jeff, just confirming this bug is with two client instances and not one instance with two threads?

#2 Updated by Jeff Layton almost 2 years ago

Correct. I'll see if I can roll up a testcase for this when I get a few mins.

#3 Updated by Jeff Layton almost 2 years ago

  • File 0001-client-test-shutdown-race.patch added

Here's a testcase that seems to trigger it fairly reliably. You may have to run it a few times to get it to crash but it should eventually.

#4 Updated by Jeff Layton almost 2 years ago

  • File deleted (0001-client-test-shutdown-race.patch)

#6 Updated by Jeff Layton almost 2 years ago

  • Project changed from fs to Ceph

Moving this to the main "Ceph" project as it looks more like a problem in the AdminSocket code. It mainly seems to crash here in lockdep_will_unlock:

lockdep_dout(20) << "_will_unlock " << name << dendl;

The name pointer is OK, so I think we must be hitting a use-after-free in lockdep_dout(). That's just defined as:

#define lockdep_dout(v) lsubdout(g_lockdep_ceph_ctx, lockdep, v)

gdb says we crashed at +141:

   0x00007f2aaf03c9ab <+107>:    nopl   0x0(%rax,%rax,1)
   0x00007f2aaf03c9b0 <+112>:    lea    0x870f909(%rip),%rdi        # 0x7f2ab774c2c0 <_ZL13lockdep_mutex>
   0x00007f2aaf03c9b7 <+119>:    callq  0x7f2aaeca6390 <pthread_mutex_lock@plt>
   0x00007f2aaf03c9bc <+124>:    mov    0x870f8dd(%rip),%rcx        # 0x7f2ab774c2a0 <_ZL18g_lockdep_ceph_ctx>
   0x00007f2aaf03c9c3 <+131>:    movabs $0xcccccccccccccccd,%rsi
=> 0x00007f2aaf03c9cd <+141>:    mov    0x8(%rcx),%rax
   0x00007f2aaf03c9d1 <+145>:    mov    0x170(%rax),%rdx
   0x00007f2aaf03c9d8 <+152>:    mov    0x178(%rax),%rax
   0x00007f2aaf03c9df <+159>:    sub    %rdx,%rax

%rcx is NULL. That looks like a chain of pointer dereferences, and it happened just after the mutex lock. My money is on this line in dout_impl:

  if (cct->_conf->subsys.should_gather(sub, v)) {                       \      

That cct value is g_lockdep_ceph_ctx...and indeed:

  (gdb) p g_lockdep_ceph_ctx
  $2 = (CephContext *) 0x0

That thing is a shared resource, and I think it needs some sort of refcounting to ensure that it doesn't get freed while other contexts might still be using it.

#7 Updated by Zheng Yan almost 2 years ago

I found a workaround: we can create a single CephContext and share it across multiple ceph_mounts.

#define _FILE_OFFSET_BITS 64
#include <pthread.h>
#include <features.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <assert.h>
#include <random>
#include <cephfs/libcephfs.h>

static struct ceph_mount_info *init_cmount = NULL;

void *foo_func(void *x_void_ptr)
{
        CephContext *cct = ceph_get_mount_context(init_cmount);

        unsigned n = 0;
        while (true) {
                struct ceph_mount_info *cmount1 = NULL;

                ceph_create_with_context(&cmount1, cct);
                ceph_conf_read_file(cmount1, "/etc/ceph/ceph.conf");
                ceph_conf_parse_env(cmount1, NULL);
                assert(ceph_mount(cmount1, NULL) == 0);

                ceph_shutdown(cmount1);
                if (++n % 100 == 0)
                        printf("%ld: %u\n", (long)x_void_ptr, n);
        }
        return NULL;
}

#define ARRAY_SIZE(a) (sizeof(a) / sizeof(a[0]))

int main()
{
        ceph_create(&init_cmount, "admin");

        pthread_t threads[128];
        for (int i = 0; i < ARRAY_SIZE(threads); i++) {
                if(pthread_create(&threads[i], NULL, foo_func, (void*)(long)i)) {
                        fprintf(stderr, "Error creating thread\n");
                        return 1;
                }
        }
        for (int i = 0; i < ARRAY_SIZE(threads); i++) {
                pthread_join(threads[i], NULL);
        }
        return 0;
}

#8 Updated by Jeff Layton almost 2 years ago

Hmm. I'm not sure that really helps. Here's the doc comment over ceph_create_with_context:

/**                                                                             
 * Create a mount handle from a CephContext, which holds the configuration      
 * for the ceph cluster.  A CephContext can be acquired from an existing ceph_mount_info
 * handle, using the @ref ceph_get_mount_context call.  Note that using the same CephContext
 * for two different mount handles results in the same client entity id being used.
 *                                                                              
 * @param cmount the mount info handle to initialize                            
 * @param conf reuse this pre-existing CephContext config                       
 * @returns 0 on success, negative error code on failure                        
 */                                                                             
int ceph_create_with_context(struct ceph_mount_info **cmount, struct CephContext *conf);

In the situation where I noticed this, I really did not want the same client entity ID being used, so this workaround would not have been helpful.

This might however be sufficient for ganesha. We could create a context early on and use the same context for all the mounts.

#9 Updated by Jeff Layton almost 2 years ago

cc'ing Matt on this bug, as it may have implications for the new code that can fetch config info out of RADOS:

Basically, we need to ensure that when we're dipping into ceph libraries we only ever create a single CephContext, as the object is tied to the lifetime of some state that is global -- lockdep in particular, but potentially other things too.

I think we may need to come up with a way to keep around a refcounted CephContext pointer that is populated when we first need it and torn down when the last user goes away.

#10 Updated by Jeff Layton almost 2 years ago

  • Assignee set to Jeff Layton

I spent a couple of hours today crawling over the code in ganesha and ceph that handles the CephContext. We have routines that can create a cephfs or rados "mount" from a CephContext pointer, but there is no API call to create a CephContext by itself.

So, we really have two choices here:

1/ keep some sort of global pointer to the cct, and refcount it. When a cct doesn't exist yet, create one using the "normal" ceph_create or rados_create call, and then fetch the pointer out of it afterward to populate the global pointer. Ensure that when the last reference to the cct is put, the pointer is zeroed out.

2/ add a new API (probably to the ceph-common lib) for creating a cct by itself, and for eventually releasing it. Have ganesha create and release the cct as needed, and change it to using ceph_create_with_context and rados_create_with_context.

2 is the simplest method, IMO, but it's not backward compatible with older libcephfs.

#11 Updated by Jeff Layton almost 2 years ago

We could also fix this less invasively: try to make g_lockdep_ceph_ctx a refcounted object pointer, and then fix up whatever other similar bugs we find.

But...the more I look at this code, the more I think that we ought to just make the CephContext a singleton. Is there ever any reason why a single process image would need more than one CephContext?

We can also opt for something less invasive. We could just fix up common_preinit to keep a refcounted pointer, and then fix ceph_mount_info destructor and the rados equivalent to just put the reference to it.

#12 Updated by Jeff Layton almost 2 years ago

I started a discussion on ceph-devel and I think the consensus is that we can't make CephContext a singleton.

I went in and disabled lockdep and was able to run the reproducer in a loop for several minutes. I did hit the open file limit at one point, but once that was resolved it worked. So, that seems to indicate that lockdep is the main problem here.

Looking at that code though, it's heavily reliant on global variables, and I'm not sure how to move that to per-ctx structures. More importantly, would doing that be sufficient anyway? It may be possible to concoct a scenario that involves mutexes under two different CephContexts.

What we may have to settle for is making lockdep's structures a singleton somehow.

#13 Updated by Jeff Layton almost 2 years ago

Ok, I finally settled on just keeping things more or less limping along like they are now with lockdep, and just ensuring that we can handle it sanely when lockdep's cct is unregistered while other threads are flogging locks. So far, this passes my simple testcase for this:

https://github.com/ceph/ceph/pull/17738

#14 Updated by Jeff Layton almost 2 years ago

Note that I think we do need to convert over programs like ganesha and samba to only keep a single CephContext and share it amongst all mounts/clients/etc. Anything else seems pretty wasteful, as you need extra threads, sockets, etc for each CephContext.

#15 Updated by Greg Farnum almost 2 years ago

What resources are actually duplicated across a CephContext, which we don't want duplicated? When I think of duplicated resources the top of my list includes the messenger and stuff like that, but right now we don't want to share them if they're using different cephx entities...

#16 Updated by Jeff Layton almost 2 years ago

Greg Farnum wrote:

What resources are actually duplicated across a CephContext, which we don't want duplicated? When I think of duplicated resources the top of my list includes the messenger and stuff like that, but right now we don't want to share them if they're using different cephx entities...

CephContext encompasses a lot of stuff, and I'm not very familiar with all of it. Seems like maybe some of the observers would be shareable? IDK...

The big issue with this turns out to be with the lockdep code, which was just never designed for a world where you have multiple CephContexts. For now, the patches I have should at least keep the existing code limping along, but we may need to re-think how that should work with multiple cct's eventually if that's the sort of design we're aiming for.

#17 Updated by Kefu Chai almost 2 years ago

  • Status changed from New to Need Review

#18 Updated by Kefu Chai almost 2 years ago

the PR has been merged.

But I am not sure if we should backport it. I am pinging Jeff for his insights on it at https://github.com/ceph/ceph/pull/17738. We should mark this ticket "resolved" or "pending backport" once it's confirmed.

#19 Updated by Jeff Layton almost 2 years ago

  • Status changed from Need Review to Pending Backport

I think we do want to backport this to any currently maintained releases. It really shouldn't change any behavior and it does close some race conditions.

#20 Updated by Kefu Chai almost 2 years ago

  • Tags set to jewel, luminous

#21 Updated by Nathan Cutler almost 2 years ago

  • Tags deleted (jewel, luminous)
  • Backport set to jewel, luminous

#22 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #21525: luminous: client: dual client segfault with racing ceph_shutdown added

#23 Updated by Nathan Cutler almost 2 years ago

  • Copied to Backport #21526: jewel: client: dual client segfault with racing ceph_shutdown added

#24 Updated by Jeff Layton almost 2 years ago

To be clear, I think we may want to leave off the patch that adds the new testcase from this series as it's uncovering some other races (see tracker #21512). Until we shake those out, let's not add the new testcase to the stable releases just yet (not sure if that's routinely done or not).

#25 Updated by Patrick Donnelly over 1 year ago

  • Project changed from Ceph to fs
  • Component(FS) Client added

#26 Updated by Patrick Donnelly over 1 year ago

  • Status changed from Pending Backport to Resolved
  • Target version set to v13.0.0
  • Backport changed from jewel, luminous to jewel,luminous

#27 Updated by Nathan Cutler over 1 year ago

Jeff Layton wrote:

To be clear, I think we may want to leave off the patch that adds the new testcase from this series as it's uncovering some other races (see tracker #21512). Until we shake those out, let's not add the new testcase to the stable releases just yet (not sure if that's routinely done or not).

I just noticed #20988-24 (quoted above) where Jeff warns about the possible problem encountered with the test case.

Unfortunately, the test case has already been merged to luminous, so I'll open a PR to revert.

#28 Updated by Nathan Cutler over 1 year ago
