Bug #8495
closed
osd: bad state machine event on backfill request
Added by Dmitry Smirnov almost 10 years ago.
Updated almost 10 years ago.
Description
Three OSDs crashed together for no apparent reason during routine backfilling/remapping.
Situation: 12 OSDs on 5 hosts; 1 OSD down/in; 1 OSD up/out.
I decided to replace one OSD, so I ran the command 'ceph osd out 3' (see the sketch below).
Some hours later, three OSDs crashed simultaneously.
See attached logs.
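For reference, a minimal sketch of the replacement sequence described above, using standard ceph CLI commands (only 'ceph osd out 3' is quoted from this report; the monitoring commands are generic, not taken from the attached logs):

ceph osd out 3        # mark osd.3 out; its PGs start remapping/backfilling elsewhere
ceph -w               # watch cluster events while the backfill proceeds
ceph health detail    # list any PGs stuck in backfilling/recovery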
Can you reproduce this with
debug osd = 20
debug ms = 1
debug filestore = 20
on all osds and attach all of the logs?
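These settings can either go into ceph.conf on every OSD host (followed by a daemon restart) or be injected into the running daemons; a sketch of both, using standard Ceph mechanisms rather than anything quoted from this ticket:

# ceph.conf on each OSD host, then restart the ceph-osd daemons
[osd]
    debug osd = 20
    debug ms = 1
    debug filestore = 20

# or inject into all running OSDs without restarting them
ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'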
- Subject changed from "0.80.1: multiple simultaneous OSD crashes" to "osd: bad state machine event on backfill request"
Also, is there any chance that this is a mixed cluster?
Samuel Just wrote:
Also, is there any chance that this is a mixed cluster?
I don't know what a "mixed cluster" is. My cluster configuration is very straightforward -- there are only three replicated pools. All hosts are on one subnet, so the CRUSH map is flat and simple, etc.
Dmitry Smirnov wrote:
Samuel Just wrote:
Also, is there any chance that this is a mixed cluster?
I don't know what a "mixed cluster" is. My cluster configuration is very straightforward -- there are only three replicated pools. All hosts are on one subnet, so the CRUSH map is flat and simple, etc.
Sam is referring to mixed versions -- that is, different versions of the ceph-osd daemon participating in the same cluster, as opposed to every daemon (and client) running the same release.
You can check this with
for f in `ceph osd ls`; do ceph osd metadata $f ; done | grep ceph_version
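A small extension of the same loop (identical commands, just summarized) that counts how many OSDs report each version, so a mixed cluster shows up as more than one output line:

for f in `ceph osd ls`; do ceph osd metadata $f; done | grep ceph_version | sort | uniq -c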
I see... Nice command, by the way, thanks.
All cluster components are v0.80.1.
- Status changed from New to Duplicate