Bug #6674 (closed): Busted client locking

Added by Greg Farnum over 10 years ago. Updated over 10 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: common/lockdep.cc: In function 'int lockdep_will_lock(const char*, int)' thread 7f10ee7fc700 time 2013-10-29 00:05:43.531369
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: common/lockdep.cc: 181: FAILED assert(0)
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  ceph version 0.71-359-gce84576 (ce8457680d9c4002c7168912e2a7ca0967486d0f)
2013-10-29T00:05:43.532 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  1: (lockdep_will_lock(char const*, int)+0x625) [0x7f11005ce935]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  2: (Mutex::Lock(bool)+0x104) [0x7f1100574104]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  3: (CephContextServiceThread::entry()+0x1a1) [0x7f11005b0371]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  4: (()+0x7e9a) [0x7f1100129e9a]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  5: (clone()+0x6d) [0x7f10ff940ccd]
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]:  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2013-10-29T00:05:43.533 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: terminate called after throwing an instance of 'ceph::FailedAssertion'
2013-10-29T00:05:43.619 INFO:teuthology.task.workunit.client.0.err:[10.214.132.38]: Aborted (core dumped)
2013-10-29T00:05:43.620 INFO:teuthology.task.workunit:Stopping libcephfs/test.sh on client.0...

http://qa-proxy.ceph.com/teuthology/teuthology-2013-10-28_23:01:26-fs-master-testing-basic-plana/73869/

Maybe the semaphore replacement isn't quite as safe as we'd hoped?


Related issues (1): 0 open, 1 closed

Has duplicate: Ceph - Bug #6738: lockdep segfault when running rados test.sh (Duplicate, Samuel Just, 11/08/2013)

Actions #1

Updated by Noah Watkins over 10 years ago

Looking at initializations of CephContext, I don't see any explicit freeing, which raises the question of which thread ends up running ~CephContext(). If it happens to be the context's own service thread, it would end up locking itself via exit_thread(). I'm not sure exactly what guarantees (if any) there are about which thread runs destructors on exit.

martini:ceph nwatkins$ git grep delete | grep cct
src/include/Context.h:    mydout(cct,10) << "C_Gather " << this << ".delete" << dendl;
src/librbd/internal.cc:      ldout(cct, 2) << "trim_image objects " << delete_start << " to " 
src/msg/SimpleMessenger.cc:    ldout(cct,10) << "reaper deleted pipe " << p << dendl;
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_pool_snap; pool: " << pool << "; snap: " << snap_name << dendl;
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_selfmanaged_snap; pool: " << pool << "; snap: " 
src/osdc/Objecter.cc:  ldout(cct, 10) << "delete_pool " << pool << dendl;
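
For illustration, here is a minimal, hypothetical sketch of the hazard described above. This is not Ceph code; ServiceThread and stop_and_join are invented names. The point is just that if the teardown path were ever executed on the service thread itself, stopping that thread would become a self-join, which std::thread reports as resource_deadlock_would_occur (and which would deadlock or abort with raw pthreads).

#include <atomic>
#include <cassert>
#include <thread>

class ServiceThread {
public:
  void start() {
    worker_ = std::thread([this] { run(); });
  }

  // Safe only when called from a *different* thread: joining the worker
  // from inside the worker is exactly the self-teardown hazard described
  // in the comment above.
  void stop_and_join() {
    assert(std::this_thread::get_id() != worker_.get_id());
    stop_ = true;
    worker_.join();
  }

private:
  void run() {
    while (!stop_) {
      std::this_thread::yield();
    }
  }

  std::thread worker_;
  std::atomic<bool> stop_{false};
};

int main() {
  ServiceThread t;
  t.start();
  t.stop_and_join();  // fine here: called from main, not from the worker
  return 0;
}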
Actions #2

Updated by Noah Watkins over 10 years ago

I don't think the theory from the previous comment has any merit! I wonder if it's possible that lockdep is missing an unlock, especially since g_lockdep isn't read under any kind of mutex, although I'm failing to come up with an interleaving that would cause this.

diff --git a/src/common/ceph_context.cc b/src/common/ceph_context.cc
index 7822407..d0b4bc8 100644
--- a/src/common/ceph_context.cc
+++ b/src/common/ceph_context.cc
@@ -296,12 +296,12 @@ CephContext::CephContext(uint32_t module_type_)

 CephContext::~CephContext()
 {
+  join_service_thread();
+
   if (_conf->lockdep) {
     lockdep_unregister_ceph_context(this);
   }

-  join_service_thread();
-
   _admin_socket->unregister_command("perfcounters_dump");
   _admin_socket->unregister_command("perf dump");
   _admin_socket->unregister_command("1");
Actions #3

Updated by Samuel Just over 10 years ago

  • Project changed from CephFS to Ceph
  • Priority changed from High to Urgent
Actions #4

Updated by Samuel Just over 10 years ago

  • Assignee set to Samuel Just
Actions #5

Updated by Greg Farnum over 10 years ago

c7d975aeadf908d11577c480fa0a2e831d069c55 is the one I was discussing.

Actions #6

Updated by Noah Watkins over 10 years ago

Has this been triggered again since the first time we saw it?

Actions #7

Updated by Greg Farnum over 10 years ago

I just saw it a couple more times going over fs nightlies from this week, and it broke several rados tests last night (see #6738).

Actions #8

Updated by Noah Watkins over 10 years ago

Here is my theory. When CephContext is being cleaned up, we first disable lockdep and then wait for the service thread to exit:

CephContext::~CephContext()
{
  if (_conf->lockdep) {
    lockdep_unregister_ceph_context(this);
  }

  join_service_thread();

The unregistration itself is done under a mutex:

void lockdep_unregister_ceph_context(CephContext *cct)
{
  pthread_mutex_lock(&lockdep_mutex);
  if (cct == g_lockdep_ceph_ctx) {
    // this cct is going away; shut it down!
    g_lockdep = false;
    g_lockdep_ceph_ctx = NULL;
  }
  pthread_mutex_unlock(&lockdep_mutex);
}

But the g_lockdep flag is not checked under that mutex:

void Mutex::Lock(bool no_lockdep) {
  if (lockdep && g_lockdep && !no_lockdep) _will_lock();

So if the service thread is migrated between CPUs, I don't think x86 makes any guarantee that the cache lines will be updated. At first glance, then, it looks entirely possible that the service thread could see stale or inconsistent values of g_lockdep, corrupting the lockdep state.
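
For illustration, here is a minimal stand-alone sketch of the suspected pattern. It is hypothetical code, not Ceph's: g_flag, g_flag_mutex, and service_loop are invented stand-ins for g_lockdep, lockdep_mutex, and the service thread. One thread clears a plain bool under a mutex while another thread reads it without ever taking that mutex, so there is no happens-before relationship between the write and the read; the reordering proposed in comment #2 amounts to joining the reader before the flag is cleared.

#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Stand-ins for g_lockdep and lockdep_mutex in the theory above.
static bool g_flag = true;
static std::mutex g_flag_mutex;

// Plays the role of CephContextServiceThread: its lock path consults
// g_flag without holding g_flag_mutex, just as Mutex::Lock() checks
// g_lockdep outside lockdep_mutex.
static void service_loop(std::atomic<bool>& stop)
{
  while (!stop.load()) {
    if (g_flag) {                      // unsynchronized read
      // lockdep-style bookkeeping would happen here
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}

int main()
{
  std::atomic<bool> stop{false};
  std::thread svc(service_loop, std::ref(stop));

  // Plays the role of lockdep_unregister_ceph_context(): the write is
  // protected, but the reader never takes the same mutex, so the two
  // are not ordered with respect to each other -- a data race.
  {
    std::lock_guard<std::mutex> l(g_flag_mutex);
    g_flag = false;
  }

  // The reordering proposed in comment #2 is equivalent to doing this
  // join *before* touching the flag, so the reader is gone by the time
  // the lockdep state is torn down.
  stop = true;
  svc.join();
  return 0;
}

Either joining the reader first, reading the flag under the same mutex, or making it a std::atomic<bool> would remove the race in this sketch.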

Actions #9

Updated by Samuel Just over 10 years ago

  • Status changed from New to Resolved

1212a2119f3681de40cf947dae9c3b0d3f19e6fe
