Backport #19508

Upgrading from 0.94.6 to 10.2.6 can overload monitors (failed to encode map with expected crc)

Added by Alexey Sheplyakov 6 months ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Target version:
Release: jewel

History

#2 Updated by Alexey Sheplyakov 6 months ago

  • Status changed from New to In Progress
  • Priority changed from Normal to High

Setting priority to High since this disrupts upgrades from Hammer.

#3 Updated by Alexey Sheplyakov 6 months ago

2017-04-07 06:04:45.777268 7f78f9fff700  2 osd.0 24 got incremental 25 but failed to encode full with correct crc; requesting
2017-04-07 06:04:45.777273 7f78f9fff700  0 log_channel(cluster) log [WRN] : failed to encode map e25 with expected crc
2017-04-07 06:04:45.777276 7f78f9fff700 20 osd.0 24 my encoded map was:
00000000  08 07 f1 0f 00 00 03 01  14 0a 00 00 38 25 08 5d  |............8%.]|
00000010  c8 7d 48 64 93 36 14 0d  e0 ee 97 8f 19 00 00 00  |.}Hd.6..........|
00000020  84 3e e6 58 ca ee 11 16  fd 2b e7 58 9d 68 01 2e  |.>.X.....+.X.h..|
00000030  06 00 00 00 00 00 00 00  00 00 00 00 15 05 d2 00  |................|
00000040  00 00 01 03 00 02 40 00  00 00 40 00 00 00 00 00  |......@...@.....|
00000050  00 00 00 00 00 00 12 00  00 00 03 00 00 00 00 00  |................|
00000060  00 00 12 00 00 00 00 00  00 00 01 00 00 00 01 00  |................|
00000070  00 00 00 00 00 00 03 00  00 00 00 00 00 00 00 00  |................|
00000080  00 00 00 00 00 00 01 00  00 00 00 00 00 00 00 00  |................|
00000090  00 00 02 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000a0  00 00 00 00 00 00 00 ff  ff ff ff ff ff ff ff 00  |................|
000000b0  ff ff ff ff ff ff ff ff  ff ff ff ff ff ff ff ff  |................|

The bytes at

00000030  06 00 00 00 00 00 00 00  00 00 00 00 15 05 d2 00

are the start of the serialized OSDMap.pools (see https://github.com/ceph/ceph/blob/jewel/src/osd/OSDMap.cc#L1903-L1910).

The bytes 15 05 (hex; decimal 21 and 5) are the current and compat versions of the pg_pool_t encoding (https://github.com/ceph/ceph/blob/jewel/src/osd/osd_types.cc#L1503).
So the Jewel OSD picks the v21 encoding for pg_pool_t, while Hammer (0.94.6) OSDs use v17, hence the CRC mismatch.
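The version bytes can be checked by hand with a few lines of Python. This is only a sketch: it assumes the usual Ceph versioned-encoding header (one byte struct_v, one byte struct_compat, then a little-endian 32-bit length), and the offset 12 is simply the position of `15 05` within the 0x30 line of the dump above. The variable names are illustrative, not taken from the Ceph source.

```python
import struct

# Bytes 0x30..0x4f copied from the hex dump above.
dump = bytes.fromhex(
    "06 00 00 00 00 00 00 00 00 00 00 00 15 05 d2 00 "
    "00 00 01 03 00 02 40 00 00 00 40 00 00 00 00 00"
)

# Assumption: the pg_pool_t header starts at 0x3c, i.e. 12 bytes into this slice,
# and is laid out as <u8 struct_v><u8 struct_compat><le32 struct_len>.
HEADER_OFFSET = 0x3C - 0x30

struct_v, struct_compat, struct_len = struct.unpack_from("<BBI", dump, HEADER_OFFSET)

print(f"pg_pool_t struct_v={struct_v} struct_compat={struct_compat} len={struct_len}")
```

Under those assumptions, struct_v comes out as 21, matching the v21 encoding mentioned above, whereas a Hammer OSD would write 17 in the same position.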

#4 Updated by Alexey Sheplyakov 6 months ago

It looks like the issue has already been reported to the ceph-users list: http://www.spinics.net/lists/ceph-users/msg34843.html

#5 Updated by Alexey Sheplyakov 6 months ago

  • Assignee set to Alexey Sheplyakov

#6 Updated by Nathan Cutler 5 months ago

  • Tracker changed from Bug to Backport
  • Description updated (diff)

description

Steps to reproduce:

  1. Deploy a test cluster using Ceph 0.94.6: 3 OSDs, 3 monitors
  2. Generate a test load (create an rbd image, run fio -ioengine=rbd)
  3. Perform the upgrade:
    a. ceph osd set noout
    b. Pick an OSD node, shut down one OSD daemon, upgrade the ceph packages, restart the OSD
    c. Wait until all placement groups are active+clean

Result: after the upgraded OSD starts, it requests the OSD map, fails to re-encode the full map with the expected CRC after receiving an incremental map,
and requests the complete map instead:

2017-04-06 07:19:15.489229 7f0e7a4e0800  0 set uid:gid to 64045:64045 (ceph:ceph)
2017-04-06 07:19:15.489261 7f0e7a4e0800  0 ceph version 10.2.6-1~u14.04+1 (8a5b25e3b370b6abf610579a315471958813e33e), process ceph-osd, pid 9126

[skipped]

2017-04-06 07:19:32.642988 7f0e7a4e0800  0 osd.1 22 using 0 op queue with priority op cut off at 64.
2017-04-06 07:19:32.643627 7f0e7a4e0800 -1 osd.1 22 log_to_monitors {default=true}
2017-04-06 07:19:32.770925 7f0e7a4e0800  0 osd.1 22 done with init, starting boot process
2017-04-06 07:19:33.749922 7f0e547ff700  0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc
2017-04-06 07:19:33.750052 7f0e547ff700  0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc
2017-04-06 07:19:34.756327 7f0e52ffc700  0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.759619 7f0e52ffc700  0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.761147 7f0e547ff700  0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.761200 7f0e547ff700  0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc

In a cluster with many (>~ 100) OSDs, sending that many complete maps can easily overload the monitors.
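To get a rough idea of how often an upgraded OSD hits this, the warnings in its log can be counted per map epoch. The helper below is a hypothetical sketch (the function name and the idea of counting per epoch are not part of Ceph); it only relies on the warning text shown in the log excerpt above.

```python
import re
from collections import Counter

# Matches the cluster-log warning shown above and captures the map epoch.
CRC_WARNING = re.compile(r"failed to encode map e(\d+) with expected crc")

def count_crc_failures(lines):
    """Return a Counter mapping OSD map epoch -> number of CRC warnings."""
    counts = Counter()
    for line in lines:
        match = CRC_WARNING.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Example input: lines taken from the OSD log excerpt above.
sample = [
    "2017-04-06 07:19:32.770925 7f0e7a4e0800  0 osd.1 22 done with init, starting boot process",
    "2017-04-06 07:19:33.749922 7f0e547ff700  0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc",
    "2017-04-06 07:19:33.750052 7f0e547ff700  0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc",
    "2017-04-06 07:19:34.756327 7f0e52ffc700  0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc",
]

failures = count_crc_failures(sample)
for epoch, n in sorted(failures.items()):
    print(f"epoch {epoch}: {n} warning(s)")
```

In practice the same function could be fed an open OSD log file instead of the sample list; the log file location depends on the deployment.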

#7 Updated by Nathan Cutler 5 months ago

  • Release jewel added

#8 Updated by Nathan Cutler 5 months ago

  • Priority changed from High to Normal

#9 Updated by Nathan Cutler 3 months ago

  • Status changed from In Progress to Resolved
  • Target version set to v10.2.8
