Bug #17386

Upgrading 0.94.6 -> 0.94.9 saturating mon node networking

Added by Michael Hackett 12 months ago. Updated 11 months ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 09/22/2016
Due date:
% Done: 0%
Source: Support
Tags:
Backport: jewel, hammer
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rados
Release: hammer
Needs Doc: No

Description

While attempting to upgrade a 1200+ OSD cluster from 0.94.6 to 0.94.9, serious performance issues are seen every time an OSD is restarted. The monitors are already upgraded and running 0.94.9. Restarting the OSDs as part of the upgrade causes several minutes of network saturation on all three monitor nodes, which in turn causes thousands of slow requests.

Initially monitor logs were flooded with the following messages:

2016-09-14 15:51:12.174478 osd.405 24.161.248.95:6805/41332 329 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.174635 osd.220 24.161.248.119:6816/92203 301 : cluster [WRN] failed to encode map e727238 with expected crc
2016-09-14 15:51:12.178740 osd.872 24.161.248.104:6816/235917 55 : cluster [WRN] failed to encode map e727238 with expected crc

After 'clog_to_monitors false' was set, these messages no longer occur, but the network still gets saturated when OSDs are restarted.

The above issue is discussed in the following community thread:
http://ceph-users.ceph.narkive.com/rPGrATpE/v0-94-7-hammer-released

It appears that the osdmap encoding changed starting with 0.94.7 (which was unexpected by developers). When this happens, all the 0.94.6 OSDs report the CRC problem back to the mons, but the newer 0.94.9 OSDs don't.

Ceph users list discussion on this current issue:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/013216.html

The current theory is that the down-rev OSDs are continually pulling osdmaps from the upgraded mons.


Related issues

Related to Ceph - Bug #17365: mon: forwarded message is encoded with sending client's features Resolved 09/21/2016
Copied to Ceph - Backport #17534: hammer: doc: document the changed upgrade steps for hammer Resolved
Copied to Ceph - Backport #17734: jewel: Upgrading 0.94.6 -> 0.94.9 saturating mon node networking Resolved

History

#1 Updated by Kefu Chai 12 months ago

It appears that starting with 0.94.7 that the osdmap encoding changed (which was unexpected by developers)

the CRC mismatch warning is expected:

pg_pool_t is a field in both OSDMap::Incremental and OSDMap itself. In 0.94.6, pg_pool_t is encoded with the v17 scheme, while in 0.94.9 it is encoded using v21. After the upgrade, the monitors encode the (incremental) osdmap using the new scheme, while an OSD running 0.94.6 still re-encodes the full osdmap using v17 and then compares the CRC of its re-encoded full map with the CRC of the original full map encoded using v21. That's why the CRCs mismatch.
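The mismatch described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `ToyPool`, `encode_pool` and `toy_crc32` are stand-ins invented for illustration, the extra field carried by the newer encoding is an assumption, and a plain CRC-32 is used for simplicity rather than whatever checksum Ceph actually applies. It only shows why re-encoding the same logical data with an older struct version cannot reproduce the checksum of the newer encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for pg_pool_t; the real structure has many more fields.
struct ToyPool {
    std::uint32_t size;
    std::uint32_t pg_num;
    bool extra_flag;  // assumed to exist only in the newer (v21) encoding
};

// Plain bitwise CRC-32 (reflected polynomial 0xEDB88320), for illustration only.
std::uint32_t toy_crc32(const std::vector<std::uint8_t>& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::uint8_t byte : data) {
        crc ^= byte;
        for (int i = 0; i < 8; ++i) {
            // Mask is all ones when the low bit is set, zero otherwise.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Append a 32-bit value in little-endian byte order.
void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
    }
}

// Encode the pool with the given struct version; v21 carries one more field.
std::vector<std::uint8_t> encode_pool(const ToyPool& p, std::uint8_t version) {
    assert(version == 17 || version == 21);  // only the two schemes from the ticket
    std::vector<std::uint8_t> out;
    out.push_back(version);
    put_u32(out, p.size);
    put_u32(out, p.pg_num);
    if (version >= 21) {
        out.push_back(p.extra_flag ? 1 : 0);
    }
    return out;
}

// What an OSD effectively does: re-encode locally with its own version and
// compare against the checksum computed by the sender.
bool crc_matches(const ToyPool& p, std::uint8_t local_version,
                 std::uint32_t expected_crc) {
    return toy_crc32(encode_pool(p, local_version)) == expected_crc;
}
```

With a checksum computed over the v21 bytes, `crc_matches(pool, 21, crc)` holds while `crc_matches(pool, 17, crc)` does not, because the two encodings differ in both the version byte and the length.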

In a large cluster, resending the full map could be a burden on the monitors and saturate the cluster network. Two possible fixes:

  • we do have the machinery to re-encode the osdmap for old clients, but we need to do this explicitly, i.e.
    1. add CEPH_FEATURE_RESERVED (the non-existent feature bit) to the feature bits
    2. encode the MOSDMap message in OSDMonitor::send_incremental() before sending it down to the messenger, which will then just put the pre-encoded incremental maps and full maps into the payload buffer. (downside: larger memory footprint)
  • or, we can add an option to disable the CRC check (or the full map request upon CRC mismatch) on the OSD side, so we can disable it at run time when we see the performance degradation due to this problem. (downside: yet another knob)
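The monitor-side idea (serve the canonical encoding to peers that understand it, and re-encode explicitly for older peers) can be sketched roughly as follows. This is not the Ceph implementation: `ToyMonitor`, `encode_map` and the value of `FEATURE_GMT_HITSET` are hypothetical names and constants chosen for illustration, and the "encoding" is just a tagged string.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical feature bit; the real bit value is not taken from the ticket.
constexpr std::uint64_t FEATURE_GMT_HITSET = 1ull << 0;

// Toy encoder: the scheme depends on the features the encoding targets.
std::string encode_map(std::uint32_t epoch, std::uint64_t features) {
    const char* scheme = (features & FEATURE_GMT_HITSET) ? "v21" : "v17";
    return std::string(scheme) + ":e" + std::to_string(epoch);
}

class ToyMonitor {
public:
    explicit ToyMonitor(std::uint64_t own_features)
        : own_features_(own_features) {}

    // Encode each epoch once with the monitor's own features and cache it.
    void publish(std::uint32_t epoch) {
        canonical_[epoch] = encode_map(epoch, own_features_);
    }

    // Peers with the feature get the cached canonical bytes; older peers get
    // a fresh encoding restricted to the features both sides share.
    std::string payload_for(std::uint32_t epoch, std::uint64_t peer_features) {
        assert(canonical_.count(epoch) == 1);  // epoch must have been published
        if (peer_features & FEATURE_GMT_HITSET) {
            return canonical_.at(epoch);
        }
        ++reencode_count_;  // the extra CPU cost of this approach
        return encode_map(epoch, peer_features & own_features_);
    }

    int reencode_count() const { return reencode_count_; }

private:
    std::uint64_t own_features_;
    std::map<std::uint32_t, std::string> canonical_;
    int reencode_count_ = 0;
};
```

In this sketch every request from a peer without the feature bit triggers another encode, which is where the extra CPU and memory cost of the approach comes from.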

#2 Updated by Kefu Chai 12 months ago

  • Assignee set to Kefu Chai

#3 Updated by Kefu Chai 12 months ago

  • Target version deleted (v0.94.8)

#4 Updated by Ken Dreyer 12 months ago

  • Backport set to jewel, hammer

#5 Updated by Kefu Chai 12 months ago

option 1: re-encode the MOSDMap message if the GMT_HITSET feature bit is missing.

- downsides:

  • larger memory footprint
  • wastes CPU cycles re-encoding the OSDMap again and again for the mon clients without GMT_HITSET.
  • the assert that checks the caller of Incremental::encode() has to be disabled. It is only a precaution, but it would be good to keep it:
      // only a select set of callers should *ever* be encoding new
      // OSDMaps.  others should be passing around the canonical encoded
      // buffers from on high.  select out those callers by passing in an
      // "impossible" feature bit.
      // assert(features & CEPH_FEATURE_RESERVED);
    

option 2: add an option to disable the CRC check (or the full map request upon CRC mismatch) on the OSD side, so we can disable it at run time when we see the performance degradation due to this problem.

- downsides: yet another knob, and it requires the user to apply a hotfix to the OSDs just to upgrade the mons or OSDs, to prevent the old OSDs from asking for full osdmaps.

I think this does not make sense and is not a viable solution, so I guess the fix on the monitor side is better.

#6 Updated by Kefu Chai 12 months ago

so we need three patches to fix this problem:

  1. for this specific issue, we need to re-encode the OSDMap when sending it to a client if the GMT_HITSET feature bit is missing. This fix is not supposed to be merged into master.
  2. add an option on the OSD side so it won't ask for the full map upon CRC mismatch, and also document the suggested steps for upgrading from 0.94.6 to a later 0.94.x release: upgrade the OSD (and other monitor client) side first with this option enabled, then upgrade the monitors.
  3. add an option on the OSD side so it will ask for the latest full map upon boot, so the osdmaps won't diverge over time.
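The two OSD-side options in items 2 and 3 amount to a small decision rule, sketched below. The option names `request_full_map_on_crc_mismatch` and `request_latest_full_map_on_boot` are hypothetical (the ticket does not name the actual settings), and `ToyOsdConfig`, `MapAction` and the functions are invented for illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical knobs; the ticket does not give the real option names.
struct ToyOsdConfig {
    bool request_full_map_on_crc_mismatch = true;
    bool request_latest_full_map_on_boot = false;
};

enum class MapAction { AcceptIncremental, AcceptDespiteMismatch, RequestFullMap };

// Decide what to do after applying an incremental map and re-encoding locally.
MapAction on_incremental(const ToyOsdConfig& cfg, std::uint32_t expected_crc,
                         std::uint32_t local_crc) {
    if (expected_crc == local_crc) {
        return MapAction::AcceptIncremental;
    }
    // On mismatch, the knob decides whether the OSD asks the mon for a full map.
    return cfg.request_full_map_on_crc_mismatch
               ? MapAction::RequestFullMap
               : MapAction::AcceptDespiteMismatch;
}

// On boot, optionally fetch the latest full map so local maps do not diverge.
bool wants_full_map_on_boot(const ToyOsdConfig& cfg) {
    return cfg.request_latest_full_map_on_boot;
}

// Usage example: settings an operator might use during the upgrade window.
void example_usage() {
    ToyOsdConfig during_upgrade;
    during_upgrade.request_full_map_on_crc_mismatch = false;
    during_upgrade.request_latest_full_map_on_boot = true;
    assert(on_incremental(during_upgrade, 1u, 2u) ==
           MapAction::AcceptDespiteMismatch);
    assert(wants_full_map_on_boot(during_upgrade));
}
```

The design trade-off is the one raised in the discussion: suppressing the full map request avoids the traffic storm, but the OSD then keeps a map whose checksum it could not verify, which is why a one-time full map fetch on boot is paired with it.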

#8 Updated by Loic Dachary 12 months ago

  • Copied to Backport #17534: hammer: doc: document the changed upgrade steps for hammer added

#9 Updated by Nathan Cutler 11 months ago

Kefu, https://github.com/ceph/ceph/pull/11284 should be backported to jewel only, correct? I.e. not to hammer.

#10 Updated by Kefu Chai 11 months ago

@Nathan

Sorry for the delay. Right. For clarification:

for jewel for hammer

#11 Updated by Sage Weil 11 months ago

note that d4f5e88f36e5388ae9e062c4bc49ac1c684a3f3c is a prereq for https://github.com/ceph/ceph/pull/11610

#12 Updated by Kefu Chai 11 months ago

  • Related to Bug #17365: mon: forwarded message is encoded with sending client's features added

#13 Updated by Kefu Chai 11 months ago

  • Status changed from Need Review to Pending Backport

#14 Updated by Loic Dachary 11 months ago

  • Copied to Backport #17734: jewel: Upgrading 0.94.6 -> 0.94.9 saturating mon node networking added

#16 Updated by Loic Dachary 11 months ago

  • Status changed from Pending Backport to Resolved
