Backport #18582: Issue with upgrade from 0.94.9 to 10.2.5 - Ceph

Backport #18582

closed

Issue with upgrade from 0.94.9 to 10.2.5

Added by Piotr Dalek over 7 years ago. Updated about 7 years ago.

Status:

Resolved

Priority:

High

Assignee:

Piotr Dalek

Target version:

v10.2.6

Release:

jewel

Description

https://github.com/ceph/ceph/pull/13131

Files

e0934f13-dc36d3c7.badcrc (5.03 KB), Piotr Dalek, 01/20/2017 12:05 PM
jewel_full_9_dc36d3c7.osdmap (5.05 KB), Piotr Dalek, 01/20/2017 12:05 PM

Updated by Piotr Dalek over 7 years ago

File e0934f13-dc36d3c7.badcrc added
File jewel_full_9_dc36d3c7.osdmap added

It looks like Jewel MONs encode osdmaps with pg_pool_t at version 24 even when they're communicating with Hammer OSDs.
I've attached the osdmap as encoded by the Jewel mon (jewel_full_9_dc36d3c7.osdmap) and the same map as re-encoded by a Hammer OSD (e0934f13-dc36d3c7.badcrc).
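
For anyone who wants to compare two encoded blobs like the attachments byte by byte, here is a minimal Python sketch. The helper names (first_difference, summarize) are made up for illustration, the sample bytes are toy data rather than real osdmap contents, and zlib.crc32 is only a stand-in checksum (it is an assumption that it differs from what Ceph computes internally).

```python
import zlib
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One blob is a prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def summarize(a: bytes, b: bytes) -> dict:
    """Collect lengths, first differing offset, and a checksum comparison."""
    return {
        "len_a": len(a),
        "len_b": len(b),
        "first_diff": first_difference(a, b),
        # Stand-in checksum only; not claimed to match Ceph's own CRC.
        "same_checksum": zlib.crc32(a) == zlib.crc32(b),
    }


# Toy stand-ins: the leading byte plays the role of a struct version.
# 24 comes from the comment above; 21 is an arbitrary lower number,
# not the real Hammer value.
jewel_blob = bytes([24]) + b"pool-data" + b"extra-field"
hammer_blob = bytes([21]) + b"pool-data"

print(summarize(jewel_blob, hammer_blob))
```

To run it on the real attachments, read each file with pathlib.Path(name).read_bytes() and pass the two results to summarize().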

Updated by Alexey Sheplyakov about 7 years ago

Assignee set to Alexey Sheplyakov

I've found a deterministic (well, almost) way to reproduce the problem and am working on a fix.

Updated by Alexey Sheplyakov about 7 years ago

https://github.com/ceph/ceph/pull/13127

Updated by Piotr Dalek about 7 years ago

Status changed from New to Fix Under Review
Assignee changed from Alexey Sheplyakov to Piotr Dalek

Pull request https://github.com/ceph/ceph/pull/13131 fixes the issue completely for us.

Updated by Alexey Sheplyakov about 7 years ago

Steps to reproduce:

1. Deploy a small cluster: 3 monitors, 3 OSDs, and a client (all running hammer).
2. Start some IO (rados bench, fio, whatever).
3. Restart some OSD (say, osd.0).
4. Wait for PGs to become active+clean.
5. Upgrade the monitors (one by one) to jewel and wait for them to establish a quorum.
6. Restart another OSD (say, osd.1).

Result:

All OSDs complain about a CRC mismatch, like this:

2017-01-26 13:13:43.518218 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:43.519653 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:44.534886 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e16 with expected crc
2017-01-26 13:13:45.540765 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.545426 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.549021 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
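
A quick way to see which epochs are affected is to count these warnings per map epoch. The sketch below assumes the log lines look exactly like the ones pasted above; the function name and the regular expression are my own and not part of any Ceph tool.

```python
import re
from collections import Counter

# Matches the warning text shown in the log excerpt above.
CRC_WARNING = re.compile(r"failed to encode map e(\d+) with expected crc")


def count_crc_failures(lines):
    """Count 'failed to encode map' warnings per osdmap epoch."""
    counts = Counter()
    for line in lines:
        match = CRC_WARNING.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


SAMPLE_LOG = """\
2017-01-26 13:13:43.518218 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:43.519653 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:44.534886 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e16 with expected crc
2017-01-26 13:13:45.540765 7f98e8b5f700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.545426 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.549021 7f98e735c700  0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
"""

for epoch, hits in sorted(count_crc_failures(SAMPLE_LOG.splitlines()).items()):
    print(f"e{epoch}: {hits} warning(s)")
```

On a real cluster, the same function can be fed the lines of an OSD log file instead of SAMPLE_LOG.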

Updated by Nathan Cutler about 7 years ago

Tracker changed from Bug to Backport
Description updated (diff)
Status changed from Fix Under Review to In Progress
Release set to jewel

Description

During our testing we found that an upgrade from 0.94.9 to 10.2.5 hits issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 -> 0.94.9 saturating mon node networking"). Apparently there are a few commits for both hammer and jewel that are supposed to fix this issue for upgrades from 0.94.6 to 0.94.9 (and possibly for others), but we still see it when upgrading to Jewel, and the symptoms are exactly the same: after the MONs are upgraded, each not-yet-upgraded OSD fetches the full OSDMap from the monitors after failing the internal CRC check.
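
To illustrate the symptom, here is a toy Python model of the check-and-fallback behaviour described above. It is a sketch under loud assumptions: encode_map and osd_handle_incremental are hypothetical names, the text encoding is invented, zlib.crc32 stands in for the real checksum, and the "older" struct version is a placeholder. It is not the actual OSD code.

```python
import zlib


def encode_map(epoch: int, pools: dict, struct_version: int) -> bytes:
    """Hypothetical encoder: the struct version is part of the encoded bytes."""
    body = ",".join(f"{name}={size}" for name, size in sorted(pools.items()))
    return f"v{struct_version}|e{epoch}|{body}".encode()


def osd_handle_incremental(epoch: int, pools: dict,
                           expected_crc: int, osd_struct_version: int) -> str:
    """Re-encode the map locally and compare against the monitor's CRC."""
    local = encode_map(epoch, pools, osd_struct_version)
    if zlib.crc32(local) == expected_crc:
        return "incremental-ok"
    # Mismatch: the OSD cannot trust its copy and asks for the full map.
    return "need-full-map"


MON_VERSION = 24    # pg_pool_t version mentioned earlier in this thread
OLDER_VERSION = 21  # placeholder for what an older OSD encodes, not a real value

pools = {"rbd": 3}

# The monitor encodes with its own (newer) version and ships that CRC.
mon_crc = zlib.crc32(encode_map(15, pools, MON_VERSION))
print(osd_handle_incremental(15, pools, mon_crc, OLDER_VERSION))

# If the monitor encoded for the OSD's version, the CRCs would agree.
compat_crc = zlib.crc32(encode_map(15, pools, OLDER_VERSION))
print(osd_handle_incremental(15, pools, compat_crc, OLDER_VERSION))
```

The point of the model is only that identical map contents can still produce different bytes, and hence different CRCs, when the two sides encode with different struct versions; every such mismatch then costs a full-map transfer.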
