Backport #22069

Updated by Nathan Cutler over 6 years ago

I encountered the bug in #13937. I wanted to help test PR12088, and may have encountered an unrelated bug as a result. 

 <pre> 
      0> 2016-12-06 14:33:35.773259 7f3357278700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)' thread 7f3357278700 time 2016-12-06 14:33:35.758593 
 osd/ReplicatedPG.cc: 10740: FAILED assert(0) 

  ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5619aedc4af0] 
  2: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa3f) [0x5619ae87843f] 
  3: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xc2e) [0x5619ae87ffee] 
  4: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x372) [0x5619ae6f3d72] 
  5: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x20) [0x5619ae742090] 
  6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x5619aedb6cc1] 
  7: (ThreadPool::WorkThread::entry()+0x10) [0x5619aedb7dc0] 
  8: (()+0x770a) [0x7f33819c970a] 
  9: (clone()+0x6d) [0x7f337fa4282d] 
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 
 </pre> 

 After applying the PR at https://github.com/ceph/ceph/pull/12088 and building ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9) on both Ubuntu 14.04 and 16.04 using the steps at http://docs.ceph.com/docs/jewel/install/build-ceph/ ... 

 I started the OSDs that had been marked "out" as per the discussion related to #13937. Shortly thereafter, the same OSDs that were crashing on 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b) started crashing again, but with a new error. I am attaching the entire log from one such OSD below, captured with debug settings at 0/20. 

 As usual, please let me know what other information I can provide or tests I can run to help troubleshoot :)
