Backport #19508
closed
Upgrading from 0.94.6 to 10.2.6 can overload monitors (failed to encode map with expected crc)
Added by Alexey Sheplyakov about 7 years ago. Updated almost 7 years ago.
Updated by Alexey Sheplyakov about 7 years ago
- Status changed from New to In Progress
- Priority changed from Normal to High
Setting priority to High since this disrupts upgrades from Hammer
Updated by Alexey Sheplyakov about 7 years ago
2017-04-07 06:04:45.777268 7f78f9fff700 2 osd.0 24 got incremental 25 but failed to encode full with correct crc; requesting
2017-04-07 06:04:45.777273 7f78f9fff700 0 log_channel(cluster) log [WRN] : failed to encode map e25 with expected crc
2017-04-07 06:04:45.777276 7f78f9fff700 20 osd.0 24 my encoded map was:
00000000 08 07 f1 0f 00 00 03 01 14 0a 00 00 38 25 08 5d |............8%.]|
00000010 c8 7d 48 64 93 36 14 0d e0 ee 97 8f 19 00 00 00 |.}Hd.6..........|
00000020 84 3e e6 58 ca ee 11 16 fd 2b e7 58 9d 68 01 2e |.>.X.....+.X.h..|
00000030 06 00 00 00 00 00 00 00 00 00 00 00 15 05 d2 00 |................|
00000040 00 00 01 03 00 02 40 00 00 00 40 00 00 00 00 00 |......@...@.....|
00000050 00 00 00 00 00 00 12 00 00 00 03 00 00 00 00 00 |................|
00000060 00 00 12 00 00 00 00 00 00 00 01 00 00 00 01 00 |................|
00000070 00 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 |................|
00000080 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 |................|
00000090 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
000000a0 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 00 |................|
000000b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
The bytes at
00000030 06 00 00 00 00 00 00 00 00 00 00 00 15 05 d2 00
mark the start of the serialized OSDMap.pools (see https://github.com/ceph/ceph/blob/jewel/src/osd/OSDMap.cc#L1903-L1910).
15 05 (decimal 21 and 5) are the current and compat versions of the pg_pool_t encoding (https://github.com/ceph/ceph/blob/jewel/src/osd/osd_types.cc#L1503).
Thus the Jewel OSD picks the v21 encoding for pg_pool_t, while Hammer (0.94.6) OSDs use v17, hence the CRC mismatch.
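To make the byte arithmetic above easy to check, here is a minimal Python sketch that pulls the two version bytes out of the quoted hex dump row. The helper name parse_hexdump_row and the constant HEADER_AT are hypothetical, and the header position (absolute offset 0x3c) is taken from where 15 05 appears in the row, not from the Ceph sources.

```python
# Row copied from the hex dump above: an offset followed by 16 byte values.
ROW = "00000030 06 00 00 00 00 00 00 00 00 00 00 00 15 05 d2 00"


def parse_hexdump_row(row):
    """Return (offset, data) for one 'offset b0 b1 ... b15' hex dump row."""
    fields = row.split()
    offset = int(fields[0], 16)
    data = bytes(int(b, 16) for b in fields[1:17])
    return offset, data


offset, data = parse_hexdump_row(ROW)

# Assumption: the pg_pool_t encoding header starts at absolute offset 0x3c,
# which is where the bytes "15 05" appear in this row.
HEADER_AT = 0x3C
struct_v = data[HEADER_AT - offset]           # current encoding version
struct_compat = data[HEADER_AT - offset + 1]  # oldest compatible version

print(f"pg_pool_t struct_v={struct_v} struct_compat={struct_compat}")
```

Running this on the same row of a map encoded by a Hammer OSD should show 17 instead of 21 for struct_v, if the analysis above is right.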
Updated by Alexey Sheplyakov about 7 years ago
It looks like the issue has already been reported on the ceph-users list: http://www.spinics.net/lists/ceph-users/msg34843.html
Updated by Alexey Sheplyakov about 7 years ago
- Assignee set to Alexey Sheplyakov
Updated by Nathan Cutler about 7 years ago
- Tracker changed from Bug to Backport
- Description updated (diff)
Description
Steps to reproduce:
- Deploy a test cluster using Ceph 0.94.6: 3 OSDs, 3 monitors
- Make a test load (create a rbd image, run fio -ioengine=rbd)
- Perform the upgrade:
a. ceph osd set noout
b. Pick an OSD node, shut down one of its OSD daemons, upgrade the ceph packages, restart the OSD
c. wait until all placement groups are active+clean
Result: after the upgraded OSD starts, it requests the OSD map, fails to encode the map with the expected CRC after receiving an incremental map,
and requests the complete map:
2017-04-06 07:19:15.489229 7f0e7a4e0800 0 set uid:gid to 64045:64045 (ceph:ceph)
2017-04-06 07:19:15.489261 7f0e7a4e0800 0 ceph version 10.2.6-1~u14.04+1 (8a5b25e3b370b6abf610579a315471958813e33e), process ceph-osd, pid 9126
[skipped]
2017-04-06 07:19:32.642988 7f0e7a4e0800 0 osd.1 22 using 0 op queue with priority op cut off at 64.
2017-04-06 07:19:32.643627 7f0e7a4e0800 -1 osd.1 22 log_to_monitors {default=true}
2017-04-06 07:19:32.770925 7f0e7a4e0800 0 osd.1 22 done with init, starting boot process
2017-04-06 07:19:33.749922 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc
2017-04-06 07:19:33.750052 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc
2017-04-06 07:19:34.756327 7f0e52ffc700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.759619 7f0e52ffc700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.761147 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.761200 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
In a cluster with many (roughly 100 or more) OSDs, sending that many complete maps can easily overload the monitors.
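To gauge how often an upgraded OSD hits this, one can count the CRC warnings per map epoch in its log. The following Python sketch does that for a few of the lines quoted above. The function name count_crc_warnings is hypothetical, and treating each warning as one full-map request is an assumption based on the "requesting" message in the debug log, not something verified against the OSD code.

```python
import re
from collections import Counter

# Sample lines copied from the osd.1 log quoted above.
LOG = """\
2017-04-06 07:19:33.749922 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc
2017-04-06 07:19:33.750052 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e23 with expected crc
2017-04-06 07:19:34.756327 7f0e52ffc700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.759619 7f0e52ffc700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.761147 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
2017-04-06 07:19:34.761200 7f0e547ff700 0 log_channel(cluster) log [WRN] : failed to encode map e26 with expected crc
"""

CRC_WARNING = re.compile(r"failed to encode map e(\d+) with expected crc")


def count_crc_warnings(lines):
    """Count CRC-mismatch warnings per OSD map epoch."""
    counts = Counter()
    for line in lines:
        for match in CRC_WARNING.finditer(line):
            counts[int(match.group(1))] += 1
    return counts


counts = count_crc_warnings(LOG.splitlines())
for epoch, n in sorted(counts.items()):
    print(f"epoch {epoch}: {n} warning(s)")
```

To scan a real log, pass an open file object instead of LOG.splitlines(); the path of the OSD log depends on the deployment.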
Updated by Nathan Cutler almost 7 years ago
- Status changed from In Progress to Resolved
- Target version set to v10.2.8