Backport #18582
closed
Issue with upgrade from 0.94.9 to 10.2.5
Added by Piotr Dalek over 7 years ago. Updated about 7 years ago.
Files
e0934f13-dc36d3c7.badcrc (5.03 KB) | Piotr Dalek, 01/20/2017 12:05 PM
jewel_full_9_dc36d3c7.osdmap (5.05 KB) | Piotr Dalek, 01/20/2017 12:05 PM
Updated by Piotr Dalek over 7 years ago
- File e0934f13-dc36d3c7.badcrc added
- File jewel_full_9_dc36d3c7.osdmap added
It looks like Jewel MONs encode osdmaps with pg_pool_t at version 24 even when they are communicating with Hammer OSDs.
I attached the osdmap as encoded by a Jewel MON (jewel_full_9_dc36d3c7.osdmap) and the same map re-encoded by a Hammer OSD (e0934f13-dc36d3c7.badcrc).
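A quick way to see where the two attached encodings diverge is a byte-level comparison. The following is a minimal sketch, assuming both attachments have been downloaded into the working directory; `first_difference` is a hypothetical helper written for this note, not part of any Ceph tooling.

```python
from pathlib import Path
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One blob is a strict prefix of the other: they differ where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


if __name__ == "__main__":
    # File names are the attachments on this ticket; adjust paths as needed.
    jewel = Path("jewel_full_9_dc36d3c7.osdmap")
    hammer = Path("e0934f13-dc36d3c7.badcrc")
    if jewel.exists() and hammer.exists():
        print(first_difference(jewel.read_bytes(), hammer.read_bytes()))
```

The reported offset only shows where the byte streams first differ; mapping it to a specific field still requires decoding the map with the matching Ceph version.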
Updated by Alexey Sheplyakov over 7 years ago
- Assignee set to Alexey Sheplyakov
I've found a deterministic (well, almost) way to reproduce the problem and am working on a fix.
Updated by Alexey Sheplyakov over 7 years ago
Updated by Piotr Dalek over 7 years ago
- Status changed from New to Fix Under Review
- Assignee changed from Alexey Sheplyakov to Piotr Dalek
Pull request https://github.com/ceph/ceph/pull/13131 fixes the issue completely for us.
Updated by Alexey Sheplyakov over 7 years ago
Steps to reproduce:
1. Deploy a small cluster: 3 monitors, 3 OSDs, and a client (all running hammer).
2. Start some IO (rados bench, fio, whatever).
3. Restart one OSD (say, osd.0).
4. Wait for PGs to become active+clean.
5. Upgrade the monitors (one by one) to jewel and wait for them to establish a quorum.
6. Restart another OSD (say, osd.1).
Result:
All OSDs complain about crc mismatch like this:
2017-01-26 13:13:43.518218 7f98e8b5f700 0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:43.519653 7f98e8b5f700 0 log_channel(cluster) log [WRN] : failed to encode map e15 with expected crc
2017-01-26 13:13:44.534886 7f98e8b5f700 0 log_channel(cluster) log [WRN] : failed to encode map e16 with expected crc
2017-01-26 13:13:45.540765 7f98e8b5f700 0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.545426 7f98e735c700 0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
2017-01-26 13:13:45.549021 7f98e735c700 0 log_channel(cluster) log [WRN] : failed to encode map e17 with expected crc
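To gauge how widespread the mismatch is while reproducing, it can help to count these warnings per map epoch. A minimal sketch, assuming the OSD log text is available as a string and that the message wording matches the lines quoted above; `count_crc_warnings` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the warning text quoted in this ticket and captures the map epoch.
PATTERN = re.compile(r"failed to encode map e(\d+) with expected crc")


def count_crc_warnings(log_text: str) -> Counter:
    """Count CRC-mismatch warnings per osdmap epoch found in log_text."""
    return Counter(int(m.group(1)) for m in PATTERN.finditer(log_text))


if __name__ == "__main__":
    sample = (
        "2017-01-26 13:13:43.518218 7f98e8b5f700 0 log_channel(cluster) log [WRN] : "
        "failed to encode map e15 with expected crc\n"
        "2017-01-26 13:13:44.534886 7f98e8b5f700 0 log_channel(cluster) log [WRN] : "
        "failed to encode map e16 with expected crc\n"
    )
    for epoch, hits in sorted(count_crc_warnings(sample).items()):
        print(f"e{epoch}: {hits}")
```

Because the pattern is searched anywhere in the text, the helper works whether the log entries are one per line or run together.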
Updated by Nathan Cutler over 7 years ago
- Tracker changed from Bug to Backport
- Description updated (diff)
- Status changed from Fix Under Review to In Progress
- Release set to jewel
Description
During our testing we found that when upgrading from 0.94.9 to 10.2.5 we hit issue http://tracker.ceph.com/issues/17386 ("Upgrading 0.94.6 -> 0.94.9 saturating mon node networking"). Apparently there are a few commits for both hammer and jewel that are supposed to fix this issue for upgrades from 0.94.6 to 0.94.9 (and possibly for others), but we still see it when upgrading to Jewel, and the symptoms are exactly the same: after the MONs are upgraded, each not-yet-upgraded OSD fetches the full OSDMap from the monitors after failing the internal CRC check.
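The failure mode described here can be illustrated with a toy model: if the monitor and the OSD serialize the same logical map with different struct versions, the bytes differ, the checksums differ, and the OSD falls back to requesting a full map. The sketch below is purely illustrative. The byte layout is invented, `encode_pool` and `needs_full_map` are hypothetical names, the lower version number is a placeholder, and `zlib.crc32` stands in for whatever checksum Ceph actually applies.

```python
import struct
import zlib


def encode_pool(struct_v: int, pool_size: int) -> bytes:
    """Toy encoding: 1-byte struct version + 4-byte little-endian payload.

    This is NOT Ceph's wire format; it only shows that the version byte
    is part of the encoded bytes and therefore part of the checksum.
    """
    return struct.pack("<BI", struct_v, pool_size)


def needs_full_map(mon_bytes: bytes, osd_bytes: bytes) -> bool:
    """Return True when the re-encoded bytes do not match the monitor's checksum."""
    return zlib.crc32(mon_bytes) != zlib.crc32(osd_bytes)


if __name__ == "__main__":
    mon = encode_pool(24, 3)   # version 24, as reported for Jewel MONs in this ticket
    osd = encode_pool(23, 3)   # placeholder for an older encoder; the value is illustrative
    print(needs_full_map(mon, osd))
```

In this model the payload is identical on both sides, yet the checksum comparison fails solely because of the version byte, which mirrors the symptom of every not-yet-upgraded OSD pulling full maps.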
Updated by Loïc Dachary about 7 years ago
- Status changed from In Progress to Resolved
- Target version set to v10.2.6