Bug #17386
closed
Upgrading 0.94.6 -> 0.94.9 saturating mon node networking
Description
While attempting to upgrade a 1200+ OSD cluster from 0.94.6 to 0.94.9, serious performance issues are seen every time an OSD is restarted. The monitors are already upgraded and running 0.94.9. Restarting the OSDs as part of the upgrade causes several minutes of network saturation on all three monitor nodes, which in turn causes thousands of slow requests.
Initially monitor logs were flooded with the following messages:
2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : cluster [WRN] failed to encode map e727238 with expected crc
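For anyone triaging a similar flood, here is a rough sketch of how the warnings above could be tallied to see which OSDs are complaining about which osdmap epoch. The regular expression is only inferred from the three sample lines in this report, and the function name and `SAMPLE` list are made up for illustration; this is not part of any Ceph tooling.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample log lines in this report (an assumption,
# not an official description of the cluster log format).
CRC_WARNING = re.compile(
    r"^\S+ \S+ (?P<osd>osd\.\d+) \S+ \d+ : cluster \[WRN\] "
    r"failed to encode map e(?P<epoch>\d+) with expected crc"
)


def osds_reporting_crc_mismatch(lines):
    """Group the OSDs that logged the crc warning by osdmap epoch."""
    by_epoch = defaultdict(set)
    for line in lines:
        match = CRC_WARNING.match(line.strip())
        if match:
            by_epoch[int(match.group("epoch"))].add(match.group("osd"))
    return dict(by_epoch)


# The three lines quoted in this report, used as sample input.
SAMPLE = [
    "2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : "
    "cluster [WRN] failed to encode map e727238 with expected crc",
    "2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : "
    "cluster [WRN] failed to encode map e727238 with expected crc",
    "2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : "
    "cluster [WRN] failed to encode map e727238 with expected crc",
]

if __name__ == "__main__":
    for epoch, osds in sorted(osds_reporting_crc_mismatch(SAMPLE).items()):
        print(f"e{epoch}: {len(osds)} OSDs: {', '.join(sorted(osds))}")
```

Run against a full monitor log, the per-epoch OSD sets would show whether the warnings come only from OSDs that have not yet been upgraded.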
After 'clog_to_monitors false' was set, these messages no longer occur, but the network still gets saturated when OSDs are restarted.
The above issue is discussed on the following community thread:
http://ceph-users.ceph.narkive.com/rPGrATpE/v0-94-7-hammer-released
It appears that, starting with 0.94.7, the osdmap encoding changed (which was unexpected by developers). When this happens, all the 0.94.6 OSDs report the crc problem back to the mons, but the newer 0.94.9 OSDs don't.
Ceph users list discussion on this current issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html
Current theory is that downrev OSDs appear to be continually pulling osdmaps from the upgraded mons.