Bug #17386

closed

Upgrading 0.94.6 -> 0.94.9 saturating mon node networking

Added by Michael Hackett over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Support
Tags:
Backport: jewel, hammer
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While upgrading a cluster of 1200+ OSDs from 0.94.6 to 0.94.9, serious performance issues are seen every time an OSD is restarted. The monitors are already upgraded and running 0.94.9. Restarting the OSDs as part of the upgrade causes several minutes of network saturation on all three monitor nodes, which in turn causes thousands of slow requests.

Initially monitor logs were flooded with the following messages:

2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : cluster [WRN] failed to encode map e727238 with expected crc
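
To gauge how many OSDs are reporting the mismatch and for which map epochs, the warnings can be tallied from a saved copy of the cluster log. The following is a minimal Python sketch, not an official Ceph tool: it assumes every line has exactly the layout of the three samples above, and the names `LINE_RE` and `summarize_crc_warnings` are made up for illustration.

```python
import re
from collections import Counter

# Matches cluster-log lines shaped like the three samples quoted above.
# Assumption: all warnings of this kind share exactly this layout.
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<osd>osd\.\d+) (?P<addr>\S+) (?P<seq>\d+) : cluster \[WRN\] "
    r"failed to encode map e(?P<epoch>\d+) with expected crc$"
)


def summarize_crc_warnings(lines):
    """Count crc-mismatch warnings per reporting OSD and per osdmap epoch."""
    per_osd = Counter()
    per_epoch = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a crc warning
        per_osd[match.group("osd")] += 1
        per_epoch[int(match.group("epoch"))] += 1
    return per_osd, per_epoch


# The three log lines from this report.
SAMPLE = [
    "2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : "
    "cluster [WRN] failed to encode map e727238 with expected crc",
    "2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : "
    "cluster [WRN] failed to encode map e727238 with expected crc",
    "2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : "
    "cluster [WRN] failed to encode map e727238 with expected crc",
]

per_osd, per_epoch = summarize_crc_warnings(SAMPLE)
print(sorted(per_osd))   # ['osd.220', 'osd.405', 'osd.872']
print(dict(per_epoch))   # {727238: 3}
```

On the sample lines, all three warnings refer to the same epoch (e727238), which is consistent with many OSDs complaining about a single map at once.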

After 'clog_to_monitors false' was set, these messages no longer occur, but the network still gets saturated whenever OSDs are restarted.

The above issue is discussed in the following community thread:
http://ceph-users.ceph.narkive.com/rPGrATpE/v0-94-7-hammer-released

It appears that the osdmap encoding changed starting with 0.94.7 (which was unexpected by developers). When this happens, all the 0.94.6 OSDs report the crc problem back to the mons, but the newer 0.94.9 OSDs don't.
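
During a rolling upgrade it can help to list which OSDs still predate the encoding change, since those are the ones expected to report the crc problem. The Python sketch below only illustrates the version comparison. The `osd_versions` mapping is a hypothetical inventory (how it is collected is outside this sketch), the helper names are invented, and the 0.94.7 cutoff is taken from the observation above rather than from the Ceph source.

```python
def parse_version(text):
    """Turn a dotted version string such as '0.94.6' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Assumption from this report: the osdmap encoding changed starting with 0.94.7.
ENCODING_CHANGE = parse_version("0.94.7")


def predates_encoding_change(osd_version):
    """True if an OSD at this version still uses the older osdmap encoding."""
    return parse_version(osd_version) < ENCODING_CHANGE


def split_by_encoding(osd_versions):
    """Split a {osd_name: version} mapping into (older, newer) OSD name lists."""
    older = sorted(n for n, v in osd_versions.items() if predates_encoding_change(v))
    newer = sorted(n for n, v in osd_versions.items() if not predates_encoding_change(v))
    return older, newer


# Hypothetical inventory; the version assignments are made up for illustration.
osd_versions = {"osd.405": "0.94.6", "osd.220": "0.94.6", "osd.12": "0.94.9"}
older, newer = split_by_encoding(osd_versions)
print(older)  # ['osd.220', 'osd.405']
print(newer)  # ['osd.12']
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "0.94.10" sorting before "0.94.9".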

Ceph users list discussion on this current issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html

The current theory is that downrev OSDs are continually pulling osdmaps from the upgraded mons.


Related issues 4 (1 open, 3 closed)

Related to Ceph - Bug #17365: mon: forwarded message is encoded with sending client's features (Resolved, Sage Weil, 09/21/2016)
Related to RADOS - Bug #63389: Failed to encode map X with expected CRC (Pending Backport, Radoslaw Zarzynski)
Copied to Ceph - Backport #17534: hammer: doc: document the changed upgrade steps for hammer (Resolved, Kefu Chai)
Copied to Ceph - Backport #17734: jewel: Upgrading 0.94.6 -> 0.94.9 saturating mon node networking (Resolved, Loïc Dachary)
