Backport #18582

Issue with upgrade from 0.94.9 to 10.2.5

Added by Piotr Dalek 5 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Release:
jewel

e0934f13-dc36d3c7.badcrc (5.03 KB) Piotr Dalek, 01/20/2017 12:05 PM

jewel_full_9_dc36d3c7.osdmap (5.05 KB) Piotr Dalek, 01/20/2017 12:05 PM

History

#1 Updated by Piotr Dalek 5 months ago

It looks like Jewel MONs encode osdmaps with pg_pool_t at version 24 even when they're communicating with Hammer OSDs.
I attached the osdmap as encoded by the Jewel mon (jewel_full_9_dc36d3c7.osdmap) and the same map re-encoded by a Hammer OSD (e0934f13-dc36d3c7.badcrc).
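
To see where the two attached encodings diverge, a byte-level comparison is a reasonable first step. The following is a minimal sketch and not part of any Ceph tooling: the helper names `first_difference` and `compare_files` are made up for illustration, and it only reports sizes and the first differing offset without decoding the OSDMap format.

```python
from pathlib import Path


def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if a == b."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One buffer is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_files(path_a, path_b):
    """Read two blobs from disk and summarize how they differ (hypothetical helper)."""
    a = Path(path_a).read_bytes()
    b = Path(path_b).read_bytes()
    return {"len_a": len(a), "len_b": len(b), "first_diff": first_difference(a, b)}
```

With the attachments saved locally, `compare_files("jewel_full_9_dc36d3c7.osdmap", "e0934f13-dc36d3c7.badcrc")` would show the size of each blob and the first offset at which they differ.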

#2 Updated by Alexey Sheplyakov 5 months ago

  • Assignee set to Alexey Sheplyakov

I've found a deterministic (well, almost) way to reproduce the problem and am working on a fix.

#4 Updated by Piotr Dalek 5 months ago

  • Status changed from New to Need Review
  • Assignee changed from Alexey Sheplyakov to Piotr Dalek

Pull request https://github.com/ceph/ceph/pull/13131 fixes the issue completely for us.

#5 Updated by Alexey Sheplyakov 5 months ago

Steps to reproduce:

1. Deploy a small cluster: 3 monitors, 3 OSDs and a client (all running hammer).
2. Start some IO (rados bench, fio, whatever)
3. Restart some OSD (say, osd.0)
4. Wait for PGs to become active+clean
5. Upgrade monitors (one by one) to jewel, wait for monitors to establish a quorum
6. Restart another OSD (say, osd.1)

Result:

All OSDs complain about crc mismatch like this:

2017-01-26 13:13:43.518218 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:43.519653 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:44.534886 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e16 with expected crc
2017-01-26 13:13:45.540765 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.545426 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.549021 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc

#6 Updated by Nathan Cutler 5 months ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)
  • Status changed from Need Review to In Progress
  • Release jewel added

description

During testing we found that an upgrade from 0.94.9 to 10.2.5 hits issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 -> 0.94.9 saturating mon node networking"). Apparently there are a few commits for both hammer and jewel that are supposed to fix this issue for upgrades from 0.94.6 to 0.94.9 (and possibly for others), but we still see it when upgrading to Jewel, and the symptoms are exactly the same: after the MONs are upgraded, each not-yet-upgraded OSD fetches the full OSDMap from the monitors after failing the internal CRC check.
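
The failure mode described here can be illustrated with a toy model. This is only a sketch under loud assumptions: `encode_map` is a made-up stand-in for the real OSDMap encoder, `zlib.crc32` stands in for the checksum Ceph actually uses, and the version number 23 is an arbitrary placeholder for "older than 24", not Hammer's real encoding version.

```python
import zlib


def encode_map(osdmap: dict, encoding_version: int) -> bytes:
    """Hypothetical encoder: the output bytes depend on the encoding version."""
    body = ",".join(f"{key}={value}" for key, value in sorted(osdmap.items()))
    return f"v{encoding_version}|{body}".encode()


def needs_full_map(osdmap: dict, local_version: int, expected_crc: int) -> bool:
    """Re-encode the map locally and compare against the CRC the monitor sent."""
    local_crc = zlib.crc32(encode_map(osdmap, local_version))
    return local_crc != expected_crc


# Same logical map on both sides; only the encoding version differs.
osdmap = {"epoch": 15, "pools": 1}
mon_crc = zlib.crc32(encode_map(osdmap, 24))  # monitor encodes at version 24

mismatch = needs_full_map(osdmap, 23, mon_crc)  # placeholder older version
match = needs_full_map(osdmap, 24, mon_crc)
```

In this model the map contents are identical on both sides, yet the older encoder produces different bytes, so the CRC check fails and the full map has to be fetched for every new epoch.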

#7 Updated by Nathan Cutler 5 months ago

  • Description updated (diff)

#8 Updated by Loic Dachary 4 months ago

  • Status changed from In Progress to Resolved
  • Target version set to v10.2.6
