mon: pg creations don't get queued on mon startup
PGMonitor::send_pg_creates() also divvies up pg creations among the current osds they map to. This happens from update_from_paxos(), and presumably also when osdmaps update. On startup, it happens from preinit() -> init_paxos(), which calls PGMonitor::update_from_paxos() before the OSDMonitor's, which means the OSDMap is not loaded and everything maps to no OSD. Until there is an osdmap update, a reconnecting osd will fail to see creations queued for it.
The fix is probably to break the divvying out of send_pg_creates(), and then ensure that it is called at some other point during startup.
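A toy model may make the ordering problem easier to see. This is only a sketch: ToyOSDMap, ToyPGMonitor, map_pg_creates() and buggy_startup() are hypothetical stand-ins, not the real Ceph classes, and the modulo placement is an assumption standing in for the real PG-to-OSD mapping.

```cpp
// Toy model of the startup ordering problem. All names are hypothetical.
#include <cstddef>
#include <map>
#include <set>
#include <vector>

struct ToyOSDMap {
    unsigned epoch = 0;          // 0 means "not loaded yet"
    std::vector<int> up_osds;    // OSDs that PGs can map to

    // Map a PG to one OSD, or -1 if there is nothing to map to.
    int primary_for(int pg) const {
        if (up_osds.empty()) return -1;
        return up_osds[static_cast<std::size_t>(pg) % up_osds.size()];
    }
};

struct ToyPGMonitor {
    std::set<int> creating_pgs;                   // PGs waiting to be created
    std::map<int, std::set<int>> creates_by_osd;  // osd -> PGs queued for it

    // The "divvying" step, split out from the sending step.
    void map_pg_creates(const ToyOSDMap& osdmap) {
        creates_by_osd.clear();
        for (int pg : creating_pgs) {
            int osd = osdmap.primary_for(pg);
            if (osd >= 0) creates_by_osd[osd].insert(pg);
        }
    }

    // Number of creates a reconnecting OSD would be told about.
    std::size_t queued_for(int osd) const {
        auto it = creates_by_osd.find(osd);
        return it == creates_by_osd.end() ? 0 : it->second.size();
    }
};

// Buggy order: PG state is refreshed before the OSD map is loaded.
ToyPGMonitor buggy_startup(const std::set<int>& creating,
                           const ToyOSDMap& on_disk) {
    ToyOSDMap osdmap;              // still empty, epoch 0
    ToyPGMonitor pgmon;
    pgmon.creating_pgs = creating;
    pgmon.map_pg_creates(osdmap);  // everything maps to no OSD
    osdmap = on_disk;              // the OSD map arrives too late
    return pgmon;
}
```

In this model, calling map_pg_creates() again once the OSD map is loaded is what repopulates the per-OSD queues.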
The result is that a pool creation that races with a mon restart will hang. Some other path also gets in this state, or there is a different bug, since it was triggered by the job below (osd thrashing only). In any case, after that hang, restarting the mon got into this buggy state, so it should get fixed regardless.
ubuntu@teuthology:/a/sage-2013-04-06_09:10:56-rados-wip-osd-throttle-testing-basic/9833$ cat orig.config.yaml
kernel:
  kdb: true
  sha1: b0bb70d12c365872547f10d185bf88eba3ed6083
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      global:
        ms inject delay max: 1
        ms inject delay probability: 0.005
        ms inject delay type: osd
        ms inject socket failures: 2500
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    fs: ext4
    log-whitelist:
    - slow request
    sha1: aca0aea1bfbafba9cab1b2c693760b824bd82d30
  s3tests:
    branch: master
  workunit:
    sha1: aca0aea1bfbafba9cab1b2c693760b824bd82d30
roles:
- - mon.a
  - mon.c
  - osd.0
  - osd.1
  - osd.2
- - mon.b
  - mds.a
  - osd.3
  - osd.4
  - osd.5
  - client.0
tasks:
- chef: null
- clock.check: null
- install: null
- ceph:
    log-whitelist:
    - wrongly marked me down
    - objects unfound and apparently lost
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- ceph-fuse: null
- workunit:
    clients:
      client.0:
      - rados/test.sh
mon: remap creating pgs on startup
After Monitor::init_paxos() has loaded all of the PaxosService state,
we should then map creating pgs to osds. This ensures we do so after the
osdmap has been loaded and the pgs actually map somewhere meaningful.
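The idea in this commit message could be sketched roughly as follows. This is not the actual patch: ToyMon, ToyOSDs, ToyPGs and map_creates() are hypothetical names, and the modulo placement is an assumption used only to keep the example small.

```cpp
// Sketch: load every service first, then map creating PGs exactly once.
// All names here are hypothetical, for illustration only.
#include <cstddef>
#include <map>
#include <vector>

struct ToyOSDs {
    std::vector<int> up;  // empty until loaded
    void load(const std::vector<int>& stored) { up = stored; }
};

struct ToyPGs {
    std::vector<int> creating;              // PG ids awaiting creation
    std::map<int, std::vector<int>> queue;  // osd -> PGs to create

    // Loading no longer computes any mapping.
    void load(const std::vector<int>& stored) { creating = stored; }

    // Separate step: only meaningful once the OSD state is loaded.
    void map_creates(const ToyOSDs& osds) {
        queue.clear();
        if (osds.up.empty()) return;
        for (int pg : creating) {
            int osd = osds.up[static_cast<std::size_t>(pg) % osds.up.size()];
            queue[osd].push_back(pg);
        }
    }
};

struct ToyMon {
    ToyPGs pgs;
    ToyOSDs osds;

    void init(const std::vector<int>& stored_pgs,
              const std::vector<int>& stored_osds) {
        // The load order of the services does not affect the mapping...
        pgs.load(stored_pgs);
        osds.load(stored_osds);
        // ...because the mapping runs after all state is in place.
        pgs.map_creates(osds);
    }
};
```

The design point is that load() has no dependency on another service's state, so the dependency is confined to one explicit post-load step.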
#3 Updated by Greg Farnum almost 6 years ago
- Status changed from Need Review to Need More Info
Okay, I've looked at the patches and I've looked at the bug description and I can't tell what the problem is here. The effective change from the patches is to queue PG creates less frequently. Each monitor calls update_from_paxos() when it boots or wins an election, and when the OSDMonitor updates then it tells the PGMonitor to check the map, which calculates these mappings again. So it should all be fine with or without the patches (which would not have any impact on the stuck-creating PGs that I see when I go look at the teuthology archive).
Unfortunately there are no logs that I can find, but I see that there were 8 PGs creating and it manages one of them; do we know this is a pool create and not a split? Why do we believe the issue is with the monitors?
#5 Updated by Sage Weil almost 6 years ago
- Status changed from Need More Info to Need Review
- Assignee changed from Sage Weil to Greg Farnum
the problem is that update_from_paxos() is called on startup when the osdmap isn't loaded yet, so it remaps everything to no osd and no creates are queued. then the osdmap does get loaded, the mon starts up, and the osds reconnect.. but if there are then no osdmap updates, the creates never get recalculated with a non-broken value. unfortunately it's a difficult case to reproduce; i only saw it once with the hung qa run last week. easy fix though.
the quick fix is just the if get_epoch() != 0 checks in the second patch. the first patch can wait.. eventually we'll want to be more explicit about when we recalc the mapping and when we send, although as you say that's not needed for cuttlefish.
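for illustration, a guard of that kind might look something like the sketch below. this is an assumption about the shape of the check, not the code from the second patch: ToyOSDMap and remap_creates() are hypothetical, and only the get_epoch() != 0 test comes from the description above.

```cpp
// Sketch of an epoch guard: skip the remap while the OSD map is unloaded.
// ToyOSDMap and remap_creates() are hypothetical names.
#include <cstddef>
#include <map>
#include <set>
#include <vector>

struct ToyOSDMap {
    unsigned epoch = 0;        // 0 means "not loaded yet"
    std::vector<int> up_osds;
    unsigned get_epoch() const { return epoch; }
};

// Returns false and leaves `queue` untouched if there is no usable OSD map.
bool remap_creates(const ToyOSDMap& osdmap,
                   const std::set<int>& creating,
                   std::map<int, std::set<int>>& queue) {
    if (osdmap.get_epoch() == 0 || osdmap.up_osds.empty())
        return false;  // nothing sensible to map to yet
    queue.clear();
    for (int pg : creating) {
        // Assumed placement rule, standing in for the real mapping.
        int osd = osdmap.up_osds[static_cast<std::size_t>(pg) %
                                 osdmap.up_osds.size()];
        queue[osd].insert(pg);
    }
    return true;
}
```

the return value lets a caller tell "mapped to nothing because the map is empty" apart from "mapped successfully", so it can retry once a real epoch is available.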
#6 Updated by Greg Farnum almost 6 years ago
Okay, but an OSD booting creates a new OSD Map, which will lead to PGMonitor::check_osd_map(), which will lead to send_pg_creates(). I do see that we won't actually calculate them again when the OSDMap initially loads since we'll have seen it previously, so that assumption of mine wasn't quite right, but as soon as we have an OSD boot we're good.
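The "we'll have seen it previously" behavior can be modeled with a small sketch. ToyPGState and its fields are hypothetical, and the real PGMonitor bookkeeping may differ; the only assumption taken from the discussion is that an OSD map epoch that was already processed does not trigger another recalculation.

```cpp
// Sketch: a remap is only triggered by an OSD map epoch not seen before.
// ToyPGState, last_osdmap_epoch and remaps are hypothetical names.
struct ToyPGState {
    unsigned last_osdmap_epoch = 0;  // newest OSD map epoch already processed
    int remaps = 0;                  // how many times creates were recalculated

    // Returns true if this epoch caused a recalculation.
    bool check_osd_map(unsigned epoch) {
        if (epoch <= last_osdmap_epoch)
            return false;            // already seen, nothing is recalculated
        last_osdmap_epoch = epoch;
        ++remaps;
        return true;
    }
};
```

Under this model, a monitor that restarts with last_osdmap_epoch already equal to the current epoch recalculates nothing until a newer epoch appears, for example when an OSD boots.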
So, thus the thrashing monitors but not thrashing OSDs. Okay, I see it now. But this won't actually fix that race either — the only callers of send_pg_creates are PGMonitor::update_from_paxos() and PGMonitor::check_osd_map(). The second is called whenever the OSD Map changes and the PGMap hasn't seen it before, but that's true for the same cases with and without these patches. PGMonitor::update_from_paxos() is also going to happen at the same times with or without these patches. So I'm still not seeing how these do anything. Just pushed a wip-4675-model that might handle this better; check it out?
Also I think the actual referenced hang here is a different issue than this race.