
Bug #6674

Busted client locking

Added by Greg Farnum almost 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
10/29/2013
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f10ee7fc700 time 2013-10-29 00:05:43.531369
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: common/lockdep.cc: 181: FAILED assert(0)
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  ceph version 0.71-359-gce84576 (ce8457680d9c4002c7168912e2a7ca0967486d0f)
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  1: (lockdep_will_lock(char const*, int)+0x625) [0x7f11005ce935]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  2: (Mutex::Lock(bool)+0x104) [0x7f1100574104]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  3: (CephContextServiceThread::entry()+0x1a1) [0x7f11005b0371]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  4: (()+0x7e9a) [0x7f1100129e9a]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  5: (clone()+0x6d) [0x7f10ff940ccd]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: terminate called after throwing an instance of 'ceph::FailedAssertion'
2013-10-29T00:05:43.619 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: Aborted (core dumped)
2013-10-29T00:05:43.620 INFO:teuthology.task.workunit:Stopping libcephfs/test.sh on client.0...

http://qa-proxy.ceph.com/teuthology/teuthology-2013-10-28_23:01:26-fs-master-testing-basic-plana/73869/

Maybe the semaphore replacement isn't quite as safe as we'd hoped?


Related issues

Duplicated by Ceph - Bug #6738: lockdep segfault when running rados test.sh Duplicate 11/08/2013

Associated revisions

Revision 1212a211 (diff)
Added by Samuel Just almost 6 years ago

CephContext: unregister lockdep after stopping service thread

Fixes: #6769
Fixes: #6674
Signed-off-by: Samuel Just <>
Reviewed-by: Josh Durgin <>

History

#1 Updated by Noah Watkins almost 6 years ago

Looking at initializations of CephContext, I don't see any explicit freeing, which raises the question of which thread is running ~CephContext(). If it happens to be the context thread, then it will be locking itself via exit_thread(). I'm not sure exactly what guarantees (if any) there are about who runs destructors at exit.

martini:ceph nwatkins$ git grep delete | grep cct
src/include/Context.h:    mydout(cct,10) << "C_Gather " << this << ".delete" << dendl;
src/librbd/internal.cc:      ldout(cct, 2) << "trim_image objects " << delete_start << " to " 
src/msg/SimpleMessenger.cc:    ldout(cct,10) << "reaper deleted pipe " << p << dendl;
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_pool_snap; pool: " << pool << "; snap: " << snap_name << dendl;
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_selfmanaged_snap; pool: " << pool << "; snap: " 
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_pool " << pool << dendl;

#2 Updated by Noah Watkins almost 6 years ago

I don't think the theory from the previous comment has any merit! I wonder if it's possible that lockdep is missing an unlock, especially since g_lockdep isn't read under any mutex, although I'm failing to come up with any interleaving that would cause this.

diff --git a/src/common/ceph_context.cc b/src/common/ceph_context.cc
index 7822407..d0b4bc8 100644
--- a/src/common/ceph_context.cc
+++ b/src/common/ceph_context.cc
@@ -296,12 +296,12 @@ CephContext::CephContext(uint32_t module_type_)

 CephContext::~CephContext()
 {
+  join_service_thread();
+
   if (_conf->lockdep) {
     lockdep_unregister_ceph_context(this);
   }

-  join_service_thread();
-
   _admin_socket->unregister_command("perfcounters_dump");
   _admin_socket->unregister_command("perf dump");
   _admin_socket->unregister_command("1");

#3 Updated by Samuel Just almost 6 years ago

  • Project changed from fs to Ceph
  • Priority changed from High to Urgent

#4 Updated by Samuel Just almost 6 years ago

  • Assignee set to Samuel Just

#5 Updated by Greg Farnum almost 6 years ago

c7d975aeadf908d11577c480fa0a2e831d069c55 is the one I was discussing.

#6 Updated by Noah Watkins almost 6 years ago

Has this been triggered again since the first time we saw it?

#7 Updated by Greg Farnum almost 6 years ago

I just saw it a couple more times going over fs nightlies from this week, and it broke several rados tests last night (see #6738).

#8 Updated by Noah Watkins almost 6 years ago

This is my theory: when CephContext is being cleaned up, we first disable lockdep and only then wait for the service thread to exit.

CephContext::~CephContext()
{
  if (_conf->lockdep) {
    lockdep_unregister_ceph_context(this);
  }

  join_service_thread();

When the context is unregistered, it occurs under a mutex.

void lockdep_unregister_ceph_context(CephContext *cct)
{
  pthread_mutex_lock(&lockdep_mutex);
  if (cct == g_lockdep_ceph_ctx) {
    // this cct is going away; shut it down!
    g_lockdep = false;
    g_lockdep_ceph_ctx = NULL;
  }
  pthread_mutex_unlock(&lockdep_mutex);
}

But, the g_lockdep flag is not checked under the mutex.

void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep) _will_lock();

So, if the context thread is migrated between CPUs, I don't think x86 makes any guarantee that the cache lines will be updated. At first glance it looks entirely possible that the service thread could see a stale value of g_lockdep, corrupting the lockdep state.

#9 Updated by Samuel Just almost 6 years ago

  • Status changed from New to Resolved

1212a2119f3681de40cf947dae9c3b0d3f19e6fe
