Bug #4279
mon: received but didn't forward osd boot message
Description
on a cluster of 6 osds, the monitors receive osd boot messages from ~4 osds but forward only some of them to the leader. this reproduced 2 times out of 200 runs of this job:
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      osd:
        debug filestore: 20
        debug ms: 1
        debug osd: 20
        debug journal: 20
    fs: xfs
    log-whitelist:
    - slow request
    branch: wip-pglog
roles:
- - mon.a
  - mon.b
  - osd.0
  - osd.1
  - osd.2
- - mon.c
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock: null
- install:
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    timeout: 1200
- rados:
    clients:
    - client.0
    objects: 500
    op_weights:
      delete: 50
      read: 100
      rollback: 50
      snap_create: 50
      snap_remove: 50
      write: 100
    ops: 4000
History
#1 Updated by Sage Weil about 11 years ago
- Priority changed from High to Urgent
I suspect this also happened with run ubuntu@teuthology:/a/samuelj-2013-03-08_16:03:20-regression-wip_omap_snaps-testing-basic/18696
#2 Updated by Joao Eduardo Luis about 11 years ago
These logs appear to be missing. Either that or I'm at a complete loss on where to find them. teuthology's archive directory doesn't really have anything else besides teuthology's logs and files, and the remote servers where the test was run appear to have been reused/cleaned up. Any ideas where I can find the logs?
#3 Updated by Sage Weil about 11 years ago
yeah :( i think we need to wait for it to happen again. i tried to reproduce it explicitly with a targeted job and got nothing after 700 iterations.
#4 Updated by Sage Weil about 11 years ago
- Assignee changed from Joao Eduardo Luis to Sage Weil
#5 Updated by Sage Weil about 11 years ago
- Assignee changed from Sage Weil to Joao Eduardo Luis
teuthology-2013-03-10_01:00:05-regression-master-testing-gcov/20546
- blogbench workload hung, waiting for osds to be up during ceph_health
#6 Updated by Joao Eduardo Luis about 11 years ago
Those logs appear to have gone AWOL as well, so I've been trying to reproduce this for approx. 24 hours now without any joy.
#7 Updated by Sage Weil about 11 years ago
i'm going to crank up mon logs across all the qa runs so that next time this happens we will have it.
the trick is that the job hangs, so we'll have to be careful about nuking hung runs.
#8 Updated by Sage Weil about 11 years ago
- Status changed from 12 to In Progress
#9 Updated by Joao Eduardo Luis about 11 years ago
- Status changed from In Progress to Need More Info
Managed to get a bunch of runs in which one of the monitors didn't forward one osd_boot message.
The thing is, none of those runs hung. They always ended up going on with business as usual. Haven't been able to reproduce this bug yet.
On another note, something caught my eye. All the runs in that subset (the ones that ended up not forwarding one osd_boot message) followed exactly the same pattern:
- osd.X sends osd_boot to mon.peon1
- mon.peon1 forwards osd_boot to mon.leader
- osd.X sends osd_boot to mon.peon2 two seconds after sending the first osd_boot message
- mon.peon2 drops osd_boot as the osdmap has been updated in the meantime and osd.X is up -- message considered a duplicate
Furthermore, in all the runs where this happened, the osd_boot message dropped by mon.peon2 is always the second osd_boot ever received. Might have something to do with teuthology being really predictable. Nothing appears to break, or be broken, aside from the fact that this is the only osd_boot message sent twice to different monitors in such a short span of time.
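The drop described in the pattern above can be sketched as a simple check: a peon forwards an osd_boot to the leader only if its current osdmap does not already show that osd as up. This is a minimal hypothetical sketch of that rule, not the actual Ceph monitor code; OSDMapView and should_forward_boot are made-up names for illustration.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the peon's view of the current osdmap.
struct OSDMapView {
  std::set<int> up_osds;  // ids of osds currently marked up
  bool is_up(int osd) const { return up_osds.count(osd) != 0; }
};

// Forward the boot message only if the osd is not already up;
// otherwise treat it as a duplicate and drop it.
bool should_forward_boot(const OSDMapView& map, int osd_id) {
  return !map.is_up(osd_id);
}
```

Under this sketch, mon.peon1 sees osd.X down and forwards; once the leader commits the map update marking osd.X up, mon.peon2's check fails for the second osd_boot two seconds later, so it drops the message as a duplicate.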
#10 Updated by Sage Weil about 11 years ago
- Status changed from Need More Info to Resolved
7aec13f749035b9bef5e398c1ac3d56ceec8eb81 and two follow-on commits.