Backport #18582

closed

Issue with upgrade from 0.94.9 to 10.2.5

Added by Piotr Dalek over 7 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
High
Assignee:
Target version:
Release:
jewel
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Files

e0934f13-dc36d3c7.badcrc (5.03 KB), Piotr Dalek, 01/20/2017 12:05 PM
jewel_full_9_dc36d3c7.osdmap (5.05 KB), Piotr Dalek, 01/20/2017 12:05 PM

Updated by Piotr Dalek over 7 years ago

It looks like Jewel MONs encode osdmaps with pg_pool_t in version 24 even when they're communicating with Hammer OSDs.
I attached the osdmap as encoded by the Jewel mon (jewel_full_9_dc36d3c7.osdmap) and as re-encoded by a Hammer OSD (e0934f13-dc36d3c7.badcrc).
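
For anyone who wants to see where the two attached encodings diverge, here is a minimal sketch that reports the first differing byte offset between two blobs. It is a plain byte comparison and does not decode the osdmap or recompute the CRC; the helper names first_difference and compare_files are hypothetical and not part of Ceph.

```python
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical.

    If one blob is a strict prefix of the other, the offset returned is
    the length of the shorter blob.
    """
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_files(path_a: str, path_b: str) -> Optional[int]:
    """Hypothetical convenience wrapper: compare two files on disk."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return first_difference(fa.read(), fb.read())
```

With both attachments downloaded to the current directory, compare_files("jewel_full_9_dc36d3c7.osdmap", "e0934f13-dc36d3c7.badcrc") would return the offset at which the Hammer re-encoding starts to differ from the Jewel encoding.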

Actions #2

Updated by Alexey Sheplyakov about 7 years ago

  • Assignee set to Alexey Sheplyakov

I've found a deterministic (well, almost) way to reproduce the problem and am working on a fix.

Actions #4

Updated by Piotr Dalek about 7 years ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Alexey Sheplyakov to Piotr Dalek

Pull request https://github.com/ceph/ceph/pull/13131 fixes the issue completely for us.

Actions #5

Updated by Alexey Sheplyakov about 7 years ago

Steps to reproduce:

1. Deploy a small cluster: 3 monitors, 3 OSDs, and a client (all running hammer).
2. Start some IO (rados bench, fio, whatever)
3. Restart some OSD (say, osd.0)
4. Wait for PGs to become active+clean
5. Upgrade monitors (one by one) to jewel, wait for monitors to establish a quorum
6. Restart another OSD (say, osd.1)

Result:

All OSDs complain about a CRC mismatch, like this:

2017-01-26 13:13:43.518218 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:43.519653 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:44.534886 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e16 with expected crc
2017-01-26 13:13:45.540765 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.545426 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.549021 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
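
To check how widespread the warnings are while reproducing this, a small sketch like the following can tally them per osdmap epoch. It assumes only the message format shown in the log lines above; the function name count_crc_failures and the SAMPLE list are hypothetical and just illustrate the idea.

```python
import re
from collections import Counter

# Matches the warning text exactly as it appears in the OSD log excerpt above.
CRC_WARNING = re.compile(r"failed to encode map e(\d+) with expected crc")


def count_crc_failures(lines):
    """Count CRC-mismatch warnings per osdmap epoch in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CRC_WARNING.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# The six log lines quoted in this comment, used as sample input.
SAMPLE = [
    "2017-01-26 13:13:43.518218 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc",
    "2017-01-26 13:13:43.519653 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc",
    "2017-01-26 13:13:44.534886 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e16 with expected crc",
    "2017-01-26 13:13:45.540765 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc",
    "2017-01-26 13:13:45.545426 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc",
    "2017-01-26 13:13:45.549021 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc",
]

if __name__ == "__main__":
    for epoch, hits in sorted(count_crc_failures(SAMPLE).items()):
        print(f"e{epoch}: {hits}")
```

In practice the lines would come from an OSD log file opened for reading rather than from SAMPLE.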

Actions #6

Updated by Nathan Cutler about 7 years ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)
  • Status changed from Fix Under Review to In Progress
  • Release set to jewel

description

During our testing we found that when upgrading from 0.94.9 to 10.2.5 we hit issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 -> 0.94.9 saturating mon node networking"). Apparently, there are a few commits for both hammer and jewel that are supposed to fix this issue for upgrades from 0.94.6 to 0.94.9 (and possibly for others), but we're still seeing it when upgrading to Jewel, and the symptoms are exactly the same: after the MONs are upgraded, each not-yet-upgraded OSD fetches the full OSDMap from the monitors after failing the internal CRC check.

Actions #7

Updated by Nathan Cutler about 7 years ago

  • Description updated (diff)

Actions #8

Updated by Loïc Dachary about 7 years ago

  • Status changed from In Progress to Resolved
  • Target version set to v10.2.6