Backport #22069
Updated by Nathan Cutler over 6 years ago
I ran into the bug in #13937 and wanted to help test PR 12088. In doing so, I may have hit an unrelated bug.
<pre>
0> 2016-12-06 14:33:35.773259 7f3357278700 -1 osd/ReplicatedPG.cc: In function 'int ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)' thread 7f3357278700 time 2016-12-06 14:33:35.758593
osd/ReplicatedPG.cc: 10740: FAILED assert(0)
ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5619aedc4af0]
2: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa3f) [0x5619ae87843f]
3: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&, int*)+0xc2e) [0x5619ae87ffee]
4: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x372) [0x5619ae6f3d72]
5: (ThreadPool::WorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x20) [0x5619ae742090]
6: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x5619aedb6cc1]
7: (ThreadPool::WorkThread::entry()+0x10) [0x5619aedb7dc0]
8: (()+0x770a) [0x7f33819c970a]
9: (clone()+0x6d) [0x7f337fa4282d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
</pre>
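In case it helps anyone triaging this, here is a minimal sketch for pulling the numbered frames out of a backtrace like the one above so the offsets can be compared across crashing OSDs. It is not part of Ceph; the names `FRAME_RE`, `parse_frames` and `SAMPLE` are hypothetical, and the regex assumes the frame layout shown in this particular log (`N: (symbol+0xoffset) [0xaddress]`).

```python
import re

# Assumed frame layout, taken from the log above:
#   2: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa3f) [0x5619ae87843f]
FRAME_RE = re.compile(r"^\s*(\d+): \((.*)\+(0x[0-9a-f]+)\) \[(0x[0-9a-f]+)\]\s*$")


def parse_frames(trace):
    """Return one dict per numbered stack frame found in `trace`."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m is None:
            continue  # skip the assert line, version line, NOTE, etc.
        index, symbol, offset, address = m.groups()
        frames.append({
            "index": int(index),
            "symbol": symbol,
            "offset": int(offset, 16),
            "address": int(address, 16),
        })
    return frames


# Two frames copied from the backtrace above, plus a non-frame line.
SAMPLE = """osd/ReplicatedPG.cc: 10740: FAILED assert(0)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5619aedc4af0]
2: (ReplicatedPG::recover_replicas(int, ThreadPool::TPHandle&)+0xa3f) [0x5619ae87843f]
"""

if __name__ == "__main__":
    for frame in parse_frames(SAMPLE):
        print(frame["index"], frame["symbol"], hex(frame["offset"]))
```

The symbol plus offset pairs are only meaningful against the exact binary that produced the trace, as the NOTE in the log says.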
After applying the PR at https://github.com/ceph/ceph/pull/12088, I built ceph version 10.2.3-366-g289696d (289696d533038c2248c1fe0c8ee03adad343cfa9) on both Ubuntu 14.04 and 16.04 using the steps at http://docs.ceph.com/docs/jewel/install/build-ceph/ ...
I started the OSDs that had been marked "out" as per the discussion related to #13937. Shortly thereafter, the same OSDs that were crashing on 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b) started crashing again, but with a new error. I am attaching the entire log from one such OSD below, with debug settings at 0/20.
As usual, please let me know what other information I can provide or tests I can run to help troubleshoot :)