Bug #8495
osd: bad state machine event on backfill request
Description
Three OSDs crashed together for no apparent reason during routine backfilling/remapping.
Situation: 12 OSDs on 5 hosts; 1 OSD down/in; 1 OSD up/out.
I decided to replace one OSD, so I used the command 'ceph osd out 3'.
A few hours later, three OSDs crashed all at once.
See attached logs.
Related issues
History
#1 Updated by Samuel Just almost 10 years ago
Can you reproduce this with
debug osd = 20
debug ms = 1
debug filestore = 20
on all osds and attach all of the logs?
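The requested debug levels can be raised at runtime without restarting the daemons; a minimal sketch (assuming admin access to the cluster) using `ceph tell`:

```shell
# Raise logging verbosity on all OSDs at runtime (no restart needed);
# these are the levels requested above.
ceph tell osd.* injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'

# Alternatively, make the settings persistent by adding them to the
# [osd] section of ceph.conf and restarting the OSDs:
#   debug osd = 20
#   debug ms = 1
#   debug filestore = 20
```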
#2 Updated by Sage Weil almost 10 years ago
- Subject changed from 0.80.1: multiple simultaneious OSD crashes to osd: bad state machine event on backfill request
#3 Updated by Samuel Just almost 10 years ago
Also, is there any chance that this is a mixed cluster?
#4 Updated by Dmitry Smirnov almost 10 years ago
Samuel Just wrote:
Also, is there any chance that this is a mixed cluster?
I don't know what "mixed cluster" is. My cluster configuration is very straightforward -- there are only three replicated pools. All hosts are in one subnet so crush map is flat and simple, etc.
#5 Updated by Sage Weil almost 10 years ago
Dmitry Smirnov wrote:
Samuel Just wrote:
Also, is there any chance that this is a mixed cluster?
I don't know what "mixed cluster" is. My cluster configuration is very straightforward -- there are only three replicated pools. All hosts are in one subnet so crush map is flat and simple, etc.
Sam is referring to mixed versions. So, different versions of ceph-osd daemons participating in the same cluster, vs every single daemon (and client) running the same release.
You can check this with
for f in `ceph osd ls`; do ceph osd metadata $f ; done | grep ceph_version
#6 Updated by Dmitry Smirnov almost 10 years ago
I see... Nice command, by the way; thanks.
All cluster components are v0.80.1.
#7 Updated by Sage Weil over 9 years ago
- Status changed from New to Duplicate