Bug #6674 (closed): Busted client locking

Added by Greg Farnum over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f10ee7fc700 time 2013-10-29 00:05:43.531369
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: common/lockdep.cc: 181: FAILED assert(0)
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  ceph version 0.71-359-gce84576 (ce8457680d9c4002c7168912e2a7ca0967486d0f)
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  1: (lockdep_will_lock(char const*, int)+0x625) [0x7f11005ce935]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  2: (Mutex::Lock(bool)+0x104) [0x7f1100574104]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  3: (CephContextServiceThread::entry()+0x1a1) [0x7f11005b0371]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  4: (()+0x7e9a) [0x7f1100129e9a]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  5: (clone()+0x6d) [0x7f10ff940ccd]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: terminate called after throwing an instance of 'ceph::FailedAssertion'
2013-10-29T00:05:43.619 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: Aborted (core dumped)
2013-10-29T00:05:43.620 INFO:teuthology.task.workunit:Stopping libcephfs/test.sh on client.0...

http://qa-proxy.ceph.com/teuthology/teuthology-2013-10-28_23:01:26-fs-master-testing-basic-plana/73869/

Maybe the semaphore replacement isn't quite as safe as we'd hoped?


Related issues (1): 0 open, 1 closed

Has duplicate: Ceph - Bug #6738: lockdep segfault when running rados test.sh (Duplicate, Samuel Just, 11/08/2013)

Actions #1

Updated by Noah Watkins over 10 years ago

Looking at initializations of CephContext, I don't see any explicit freeing, which raises the question of which thread ends up running ~CephContext(). If it happens to be the context's own service thread, it would end up locking itself via exit_thread(). I'm not sure exactly what guarantees (if any) there are about which thread runs destructors on exit.

martini:ceph nwatkins$ git grep delete | grep cct
src/include/Context.h:    mydout(cct,10) << "C_Gather " << this << ".delete" << dendl;
src/librbd/internal.cc:      ldout(cct, 2) << "trim_image objects " << delete_start << " to " 
src/msg/SimpleMessenger.cc:    ldout(cct,10) << "reaper deleted pipe " << p << dendl;
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_pool_snap; pool: " << pool << "; snap: " << snap_name << dendl;
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_selfmanaged_snap; pool: " << pool << "; snap: " 
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_pool " << pool << dendl;
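
For illustration, here is a minimal, hypothetical sketch of the hazard described above. This is not Ceph code; ServiceThread and stop_and_join are invented names. The point is just that if the teardown path were ever executed on the service thread itself, stopping that thread would become a self-join, which std::thread reports as resource_deadlock_would_occur (and which would deadlock or abort with raw pthreads).

#include <atomic>
#include <cassert>
#include <thread>

class ServiceThread {
public:
  void start() {
    worker_ = std::thread([this] { run(); });
  }

  // Safe only when called from a *different* thread: joining the worker
  // from inside the worker is exactly the self-teardown hazard described
  // in the comment above.
  void stop_and_join() {
    assert(std::this_thread::get_id() != worker_.get_id());
    stop_ = true;
    worker_.join();
  }

private:
  void run() {
    while (!stop_) {
      std::this_thread::yield();
    }
  }

  std::thread worker_;
  std::atomic<bool> stop_{false};
};

int main() {
  ServiceThread t;
  t.start();
  t.stop_and_join();  // fine here: called from main, not from the worker
  return 0;
}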
Actions #2

Updated by Noah Watkins over 10 years ago

I don't think the theory from the previous comment has any merit! I wonder if it's possible that lockdep is missing an unlock, especially since g_lockdep isn't read under any kind of mutex, although I'm failing to come up with an interleaving that would cause this.

diff --git a/src/common/ceph_context.cc b/src/common/ceph_context.cc
index 7822407..d0b4bc8 100644
--- a/src/common/ceph_context.cc
+++ b/src/common/ceph_context.cc
@@ -296,12 +296,12 @@ CephContext::CephContext(uint32_t module_type_)

 CephContext::~CephContext()
 {
+  join_service_thread();
+
   if (_conf->lockdep) {
     lockdep_unregister_ceph_context(this);
   }

-  join_service_thread();
-
   _admin_socket->unregister_command("perfcounters_dump");
   _admin_socket->unregister_command("perf dump");
   _admin_socket->unregister_command("1");
Actions #3

Updated by Samuel Just over 10 years ago

  • Project changed from CephFS to Ceph
  • Priority changed from High to Urgent
Actions #4

Updated by Samuel Just over 10 years ago

  • Assignee set to Samuel Just
Actions #5

Updated by Greg Farnum over 10 years ago

c7d975aeadf908d11577c480fa0a2e831d069c55 is the one I was discussing.

Actions #6

Updated by Noah Watkins over 10 years ago

Has this been triggered again since the first time we saw it?

Actions #7

Updated by Greg Farnum over 10 years ago

I just saw it a couple more times going over fs nightlies from this week, and it broke several rados tests last night (see #6738).

Actions #8

Updated by Noah Watkins over 10 years ago

Here is my theory. When CephContext is being cleaned up, we first disable lockdep and then wait for the service thread to exit:

CephContext::~CephContext()
{
  if (_conf->lockdep) {
    lockdep_unregister_ceph_context(this);
  }

  join_service_thread();

The unregistration itself is done under a mutex:

void lockdep_unregister_ceph_context(CephContext *cct)
{
  pthread_mutex_lock(&lockdep_mutex);
  if (cct == g_lockdep_ceph_ctx) {
    // this cct is going away; shut it down!
    g_lockdep = false;
    g_lockdep_ceph_ctx = NULL;
  }
  pthread_mutex_unlock(&lockdep_mutex);
}

But the g_lockdep flag is not checked under that mutex:

void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep) _will_lock();

So if the service thread is migrated between CPUs, I don't think x86 makes any guarantee that the cache lines will be updated. At first glance, then, it looks entirely possible that the service thread could see stale or inconsistent values of g_lockdep, corrupting the lockdep state.
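
For illustration, here is a minimal stand-alone sketch of the suspected pattern. It is hypothetical code, not Ceph's: g_flag, g_flag_mutex, and service_loop are invented stand-ins for g_lockdep, lockdep_mutex, and the service thread. One thread clears a plain bool under a mutex while another thread reads it without ever taking that mutex, so there is no happens-before relationship between the write and the read; the reordering proposed in comment #2 amounts to joining the reader before the flag is cleared.

#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Stand-ins for g_lockdep and lockdep_mutex in the theory above.
static bool g_flag = true;
static std::mutex g_flag_mutex;

// Plays the role of CephContextServiceThread: its lock path consults
// g_flag without holding g_flag_mutex, just as Mutex::Lock() checks
// g_lockdep outside lockdep_mutex.
static void service_loop(std::atomic<bool>& stop)
{
  while (!stop.load()) {
    if (g_flag) {                      // unsynchronized read
      // lockdep-style bookkeeping would happen here
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}

int main()
{
  std::atomic<bool> stop{false};
  std::thread svc(service_loop, std::ref(stop));

  // Plays the role of lockdep_unregister_ceph_context(): the write is
  // protected, but the reader never takes the same mutex, so the two
  // are not ordered with respect to each other -- a data race.
  {
    std::lock_guard<std::mutex> l(g_flag_mutex);
    g_flag = false;
  }

  // The reordering proposed in comment #2 is equivalent to doing this
  // join *before* touching the flag, so the reader is gone by the time
  // the lockdep state is torn down.
  stop = true;
  svc.join();
  return 0;
}

Either joining the reader first, reading the flag under the same mutex, or making it a std::atomic<bool> would remove the race in this sketch.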

Actions #9

Updated by Samuel Just over 10 years ago

  • Status changed from New to Resolved

1212a2119f3681de40cf947dae9c3b0d3f19e6fe
