Bug #4556
Status: Closed
OSDs crash with OSD::handle_op during recovery
Description
While tracking down #3816 I stumbled upon this one multiple times.
I tried upgrading to 0.56.4 to be sure, but that didn't change anything.
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
 1: /usr/bin/ceph-osd() [0x788fba]
 2: (()+0xfcb0) [0x7f083e63ecb0]
 3: (gsignal()+0x35) [0x7f083cffd425]
 4: (abort()+0x17b) [0x7f083d000b8b]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f083d94f69d]
 6: (()+0xb5846) [0x7f083d94d846]
 7: (()+0xb5873) [0x7f083d94d873]
 8: (()+0xb596e) [0x7f083d94d96e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1df) [0x8343af]
 10: (OSD::handle_op(std::tr1::shared_ptr<OpRequest>)+0x12d8) [0x624668]
 11: (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0xe9) [0x62cba9]
 12: (OSD::do_waiters()+0x1a5) [0x62d105]
 13: (OSD::ms_dispatch(Message*)+0x1c2) [0x636a82]
 14: (DispatchQueue::entry()+0x349) [0x8c7399]
 15: (DispatchQueue::DispatchThread::entry()+0xd) [0x81fbad]
 16: (()+0x7e9a) [0x7f083e636e9a]
 17: (clone()+0x6d) [0x7f083d0bacbd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
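(For anyone trying to map the raw addresses above back to source lines: a rough sketch, assuming you have the exact 0.56.4 ceph-osd binary with debug symbols at hand. The binary path is taken from frame 1 and the address from frame 10.)

    # disassemble with interleaved source, as the NOTE suggests
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm

    # or resolve a single frame address to function and file:line
    addr2line -Cfe /usr/bin/ceph-osd 0x624668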
This cluster has already sustained a lot of issues and some OSDs have been down and out for quite some time now.
I added the logs of two OSDs:
- osd.2
- osd.38
It goes wrong during the peering process. All 40 OSDs come up and try to recover, but one by one they keep crashing until I eventually end up with just 11 OSDs and a cluster in a very bad state.
osdmap e21977: 40 osds: 11 up, 11 in
In the end only 11 OSDs survive, but not always the same ones; I attached the output of "ceph osd tree".
The attached logs were produced with "debug osd = 20".
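(For anyone reproducing this, a minimal sketch of how to get that verbosity: the ceph.conf stanza needs a daemon restart, while injectargs applies it at runtime; osd.2 is just an example target.)

    ; in ceph.conf, then restart the daemons
    [osd]
        debug osd = 20

    # or inject at runtime into a single daemon
    ceph tell osd.2 injectargs '--debug-osd 20'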
From what I can make out of the logs, it goes wrong when an OSD transitions to the Primary state for a PG; that transition seems to fail and the OSD crashes.
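(Frame 9 in the backtrace shows this is a failed assertion rather than a stray segfault: ceph::__ceph_assert_fail prints the failed condition and aborts, so the interesting frame is the one after it in the listing, OSD::handle_op. A generic C++ sketch of that pattern, purely illustrative and not Ceph's actual code:)

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical assert handler in the style of ceph::__ceph_assert_fail;
    // names and message format are illustrative only.
    [[noreturn]] void assert_fail(const char* expr, const char* file,
                                  int line, const char* func) {
        std::fprintf(stderr, "%s: %d: FAILED assert(%s) in %s\n",
                     file, line, expr, func);
        std::abort();  // raises SIGABRT -> the gsignal/abort frames above
    }

    #define MY_ASSERT(expr) \
        ((expr) ? (void)0 : assert_fail(#expr, __FILE__, __LINE__, __func__))

    int main() {
        bool pg_state_valid = false;  // stand-in for whatever handle_op checks
        MY_ASSERT(pg_state_valid);    // fails and aborts, like frames 9/10 above
    }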
Files