Upgrading 0.94.6 -> 0.94.9 saturating mon node networking
While upgrading a 1200+ OSD cluster from 0.94.6 to 0.94.9, serious performance issues are seen every time an OSD is restarted. The monitors are already upgraded and running 0.94.9. Restarting the OSDs as part of the upgrade causes several minutes of network saturation on all three monitor nodes, which in turn causes thousands of slow requests.
Initially monitor logs were flooded with the following messages:
2016-09-14 15:51:12.174478 osd.405 126.96.36.199:6805/41332 329 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.174635 osd.220 188.8.131.52:6816/92203 301 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.178740 osd.872 184.108.40.206:6816/235917 55 : cluster [WRN] failed to encode map e727238 with expected crc
After 'clog_to_monitors false' was set, these messages no longer occur, but the network still gets saturated when OSDs are restarted.
The above issue is discussed on the following community thread:
It appears that, starting with 0.94.7, the osdmap encoding changed (which was unexpected by the developers). When this happens, all the 0.94.6 OSDs report the CRC problem back to the mons, but the newer 0.94.9 OSDs don't.
Ceph users list discussion on this current issue:
The current theory is that the down-rev OSDs are continually pulling osdmaps from the upgraded mons.
The CRC mismatch warning is expected: pg_pool_t is a field in OSDMap itself. In 0.94.6, pg_pool_t is encoded with the v17 scheme, while in 0.94.9 it is encoded with v21. After the upgrade, the monitors encode the (incremental) osdmap using the new scheme, while an OSD running 0.94.6 still re-encodes the full osdmap using v17 and then compares the CRC of its re-encoded full map with the CRC of the original full map encoded using v21. That is why the CRCs mismatch.
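The mismatch described above can be reproduced with a toy model. This is only a sketch, not Ceph's encoder: the record layout, the `extra` field, and the helper names are made up for illustration, and `zlib.crc32` stands in for the checksum Ceph actually uses.

```python
import struct
import zlib

NEW_FIELD_SINCE = 21  # toy rule: the extra field is only written by scheme >= 21


def encode_pool(pool: dict, version: int) -> bytes:
    """Encode a toy pool record: version byte, size, and (v21+) one extra byte."""
    buf = struct.pack("<BI", version, pool["size"])
    if version >= NEW_FIELD_SINCE:
        buf += struct.pack("<B", pool.get("extra", 0))  # hypothetical new field
    return buf


def decode_pool(buf: bytes, reader_version: int) -> dict:
    """Decode only the fields that a reader of the given version understands."""
    _encoded_version, size = struct.unpack_from("<BI", buf, 0)
    pool = {"size": size}
    if reader_version >= NEW_FIELD_SINCE and len(buf) > 5:
        pool["extra"] = buf[5]
    return pool


def osd_crc_matches(mon_buf: bytes, expected_crc: int, osd_version: int) -> bool:
    """Mimic the OSD: decode, re-encode with its own scheme, compare CRCs."""
    reencoded = encode_pool(decode_pool(mon_buf, osd_version), osd_version)
    return zlib.crc32(reencoded) == expected_crc


# The upgraded monitor encodes with v21 and advertises the CRC of those bytes.
mon_bytes = encode_pool({"size": 3, "extra": 1}, 21)
expected_crc = zlib.crc32(mon_bytes)

print("new OSD matches:", osd_crc_matches(mon_bytes, expected_crc, 21))
print("old OSD matches:", osd_crc_matches(mon_bytes, expected_crc, 17))
```

The old reader produces different bytes from the same logical map, so its CRC cannot agree with the one computed over the v21 encoding.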
In a large cluster, resending the full map can be a burden on the monitors and saturate the cluster network. Maybe we can do one of the following:
- We do have the machinery to re-encode the osdmap for old clients, but we need to do this explicitly, i.e.
  - add CEPH_FEATURE_RESERVED (the non-existent feature bit) to the feature bits
  - encode the MOSDMap message in OSDMonitor::send_incremental() before sending it down to the messenger, which will then just put the pre-encoded incremental maps and full maps into the payload buffer. (downside: larger memory footprint)
- Or, we can add an option to disable the CRC checking (or the full-map request upon CRC mismatch) on the OSD side, so that it can be disabled at run time when performance degrades due to this problem. (downside: yet another knob)
Option 1: re-encode the MOSDMap message if the GMT_HITSET feature bit is missing. Downsides:
- larger memory footprint
- wastes CPU cycles re-encoding the OSDMap again and again for the mon clients without GMT_HITSET.
- the assert for detecting the caller of Incremental::encode() is disabled. This is just a precaution, but it would be good to have it.
// only a select set of callers should *ever* be encoding new
// OSDMaps. others should be passing around the canonical encoded
// buffers from on high. select out those callers by passing in an
// "impossible" feature bit.
// assert(features & CEPH_FEATURE_RESERVED);
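A rough sketch of what option 1 amounts to on the monitor side, under heavy simplification. The class, the method names, the bit position chosen for GMT_HITSET, and the "v17:"/"v21:" byte prefixes are all hypothetical and only illustrate the control flow, not the real OSDMonitor code.

```python
GMT_HITSET = 1 << 0  # hypothetical bit position, for illustration only


class ToyMonitor:
    def __init__(self, canonical: bytes):
        # Map encoded once with the newest scheme; normally passed around as-is.
        self.canonical = canonical
        self.reencode_count = 0

    def _reencode_for_old_peer(self) -> bytes:
        # Stand-in for re-encoding with the old scheme; costs CPU on every call.
        self.reencode_count += 1
        return b"v17:" + self.canonical.split(b":", 1)[1]

    def payload_for(self, peer_features: int) -> bytes:
        """Pick the bytes to send based on the peer's advertised feature bits."""
        if peer_features & GMT_HITSET:
            return self.canonical
        return self._reencode_for_old_peer()


mon = ToyMonitor(b"v21:e727238")
new_peer_payload = mon.payload_for(GMT_HITSET)
old_peer_payload = mon.payload_for(0)
second_old_payload = mon.payload_for(0)
print(new_peer_payload, old_peer_payload, mon.reencode_count)
```

In this sketch every send to a peer without the feature bit triggers another re-encode, which is the CPU downside listed above.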
Option 2: add an option to disable the CRC checking (or the full-map request upon CRC mismatch) on the OSD side, so that it can be disabled at run time when performance degrades due to this problem. (downside: yet another knob)
- downside: it requires users to apply a hotfix to the OSDs just for upgrading the mons or OSDs, to prevent the old OSDs from asking for full osdmaps.
I think this does not make sense and is not a viable solution, so I guess the fix on the monitor side is better.
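For comparison, option 2 would look roughly like the following on the OSD side. This is a sketch under assumptions: the class, the option name `request_full_on_crc_mismatch`, and the return strings are hypothetical, and `zlib.crc32` again stands in for the real checksum.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    request_full_on_crc_mismatch: bool = True  # hypothetical option name
    full_map_requests: list = field(default_factory=list)

    def handle_map(self, epoch: int, reencoded: bytes, expected_crc: int) -> str:
        """Compare the CRC of the re-encoded map with the one the monitor sent."""
        if zlib.crc32(reencoded) == expected_crc:
            return "ok"
        if self.request_full_on_crc_mismatch:
            # Default behavior: ask the monitor for the full map of this epoch.
            self.full_map_requests.append(epoch)
            return "requested_full_map"
        # Knob turned off: keep the locally built map, send nothing to the mon.
        return "mismatch_ignored"


reencoded = b"toy-map"
bad_crc = zlib.crc32(reencoded) ^ 1  # guaranteed to differ from the real CRC

default_osd = ToyOSD()
quiet_osd = ToyOSD(request_full_on_crc_mismatch=False)
print(default_osd.handle_map(727238, reencoded, bad_crc))
print(quiet_osd.handle_map(727238, reencoded, bad_crc))
```

With the knob off, the mismatch generates no traffic to the monitors, but the old OSDs only gain this behavior after they have been patched, which is the objection raised above.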
So we need three patches to fix this problem:
- For this specific issue, we need to re-encode the OSDMap when sending it to a client if the GMT_HITSET feature bit is missing. This fix is not supposed to be merged into master.
- Add an option on the OSD side so it won't ask for the full map upon CRC mismatch, and document the suggested steps for upgrading from 0.94.6 to 0.94.6+: upgrade the OSD (and other monitor client) side first with this option enabled, then upgrade the monitors.
- Add an option on the OSD side so it asks for the latest full map upon boot, so the osdmaps won't diverge over time.
Sorry for the latency. Right, for clarification, for jewel:
- https://github.com/ceph/ceph/pull/11258 // not needed anymore.
- https://github.com/ceph/ceph/pull/11180 // this is tracked by #17365, which will also be backported to jewel. As Sage commented below, it is a prereq for PR#11610.
#15 Updated by Loic Dachary 9 months ago
https://github.com/ceph/ceph/pull/11284/files needs the matching ceph-qa-suite commit https://github.com/ceph/ceph-qa-suite/commit/322363a41741d212dd82c02aec148647e447d055