Backport #13288
hammer: stuck recovering, unfound hit set due to removing it while !active
Related issues
History
#1 Updated by Loïc Dachary over 8 years ago
- Assignee deleted (Samuel Just)
#2 Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
- Status changed from New to In Progress
- Assignee set to Loïc Dachary
#3 Updated by Kefu Chai over 8 years ago
- Description updated (diff)
#4 Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
#5 Updated by Loïc Dachary over 8 years ago
- Status changed from In Progress to Resolved
- Target version set to v0.94.6
#6 Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
- Status changed from Resolved to In Progress
- Assignee deleted (Loïc Dachary)
- Target version deleted (v0.94.6)
https://github.com/ceph/ceph/pull/5825 was reverted by https://github.com/ceph/ceph/pull/6644 because of a regression http://tracker.ceph.com/issues/13812
#7 Updated by Loïc Dachary over 8 years ago
- Related to Bug #13812: pgs stuck creating in upgrade test hammer->hammer.?->jewel added
#8 Updated by Loïc Dachary over 8 years ago
- Status changed from In Progress to New
#9 Updated by Loïc Dachary about 8 years ago
- Assignee set to Loïc Dachary
#10 Updated by Nathan Cutler about 8 years ago
- Subject changed from stuck recovering, unfound hit set due to removing it while !active to hammer: stuck recovering, unfound hit set due to removing it while !active
#11 Updated by Loïc Dachary about 8 years ago
- Assignee changed from Loïc Dachary to Kefu Chai
@Kefu, assigning to you as I'm not familiar with the original patch
#12 Updated by Kefu Chai about 8 years ago
pushed to wip-12848-hammer.
running
teuthology-suite --priority 10 --suite upgrade --suite-branch jewel --email tchaikov@gmail.com --ceph wip-12848-hammer --machine-type smithi,mira --filter 'upgrade/hammer/older/{0-cluster/start.yaml 1-install/v0.94.yaml 2-workload/blogbench.yaml 3-upgrade-sequence/upgrade-osd-mon-mds.yaml 4-final/{monthrash.yaml osdthrash.yaml testrados.yaml} distros/ubuntu_14.04'
and watching http://pulpito.ceph.com/kchai-2016-03-02_00:13:18-upgrade-wip-12848-hammer---basic-multi/ to find out what's going wrong.
#13 Updated by Kefu Chai about 8 years ago
- Status changed from New to In Progress
#14 Updated by Kefu Chai about 8 years ago
- mon.c: v0.94 (e61c4f093f88e44961d157f65091733580cea79a)
- osd.1:
  - before upgrade: v0.94 (e61c4f093f88e44961d157f65091733580cea79a)
  - after upgrade: v0.94.6-52-ge640a61 (e640a61ea4c5d6700380cb503f08ebe6950848bc). this version constantly has problems decoding the incremental osdmap.
- osd.1 wants osdmap #14
2016-03-02 08:21:18.709293 7f99180a2700 1 -- 172.21.5.140:6816/17048 <== mon.2 172.21.3.108:6790/0 92 ==== osd_map(14..14 src has 1..14) v3 ==== 238+0+0 (4188977541 0 0) 0x57a5600 con 0x544f600
2016-03-02 08:21:18.709339 7f99180a2700 3 osd.1 13 handle_osd_map epochs [14,14], i have 13, src has [1,14]
2016-03-02 08:21:18.709343 7f99180a2700 10 osd.1 13 handle_osd_map got inc map for epoch 14
2016-03-02 08:21:18.709487 7f99180a2700 2 osd.1 13 got incremental 14 but failed to encode full with correct crc; requesting
2016-03-02 08:21:18.709530 7f99180a2700 1 -- 172.21.5.140:6816/17048 --> 172.21.3.108:6790/0 -- mon_get_osdmap(full 14-14) v1 -- ?+0 0x5a79600 con 0x544f600
- mon.c gets the mon_get_osdmap(full 14-14) and returns full 0..0; this is a known bug, see #12410
2016-03-02 08:21:18.710399 7fb358210700 1 -- 172.21.3.108:6790/0 <== osd.1 172.21.5.140:6816/17048 68 ==== mon_get_osdmap(full 14-14) v1 ==== 34+0+0 (4126344047 0 0) 0x3a28a00 con 0x37e15a0
2016-03-02 08:21:18.710680 7fb358210700 10 mon.c@2(peon).osd e14 preprocess_get_osdmap mon_get_osdmap(full 14-14) v1
2016-03-02 08:21:18.710709 7fb358210700 1 -- 172.21.3.108:6790/0 --> 172.21.5.140:6816/17048 -- osd_map(0..0 src has 1..14) v3 -- ?+0 0x380ba80 con 0x37e15a0
- osd.1 is disappointed at receiving full 0..0 osdmap
2016-03-02 08:21:18.709801 7f99180a2700 1 -- 172.21.5.140:6816/17048 <== mon.2 172.21.3.108:6790/0 94 ==== osd_map(0..0 src has 1..14) v3 ==== 32+0+0 (3775169053 0 0) 0x5861840 con 0x544f600
2016-03-02 08:21:18.709863 7f99180a2700 3 osd.1 13 handle_osd_map epochs [0,0], i have 13, src has [1,14]
2016-03-02 08:21:18.709867 7f99180a2700 10 osd.1 13 no new maps here, dropping
probably that's why the CRC decode error prevents the osd from moving ahead: it is waiting for an updated osdmap, but the monitor cannot help it. in this changeset, pg_pool_t has three more fields (two of them are padding fields, added to stay compatible with pg_pool_t version >= 19), so the re-encoded osdmap's CRC is different from the one calculated by mon.c at v0.94. so, i don't think we support the combination of
- 1-install/v0.94.yaml, 3-upgrade-sequence/upgrade-osd-mon-mds
- 1-install/v0.94.1.yaml, 3-upgrade-sequence/upgrade-osd-mon-mds
with a ceph build containing change(s) that could lead to "failed to encode full with correct crc" errors.
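the failure mode above (a newer encoder emitting extra pg_pool_t fields, so the locally re-encoded full map no longer matches the CRC recorded by the v0.94 monitor) can be sketched like this. names and the toy encoding are invented for illustration; this is not Ceph code:

```python
# Hypothetical simulation of the CRC mismatch described above: the
# upgraded daemon's encoder appends extra fields, so even for the same
# logical map its CRC differs from the old encoder's CRC.
import zlib


def encode_pool(pool, new_format):
    # Old encoder writes two fields; the upgraded encoder appends two
    # padding fields (stand-ins for the pg_pool_t additions).
    data = b"%d:%d" % (pool["size"], pool["pg_num"])
    if new_format:
        data += b":pad:pad"
    return data


pool = {"size": 3, "pg_num": 8}
old_crc = zlib.crc32(encode_pool(pool, new_format=False))
new_crc = zlib.crc32(encode_pool(pool, new_format=True))

# Same logical map, different wire encoding -> different CRC, which is
# what triggers "failed to encode full with correct crc; requesting".
print(old_crc != new_crc)
```

the osd then asks the monitor for a full map, but a v0.94 monitor hit by #12410 answers with full 0..0, so the osd stays stuck on epoch 13.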
#15 Updated by Kefu Chai about 8 years ago
- Status changed from In Progress to Fix Under Review
#16 Updated by Kefu Chai about 8 years ago
or, we can encode the osdmap (i.e. pg_pool_t) in the same old way as before the GMT feature bit was introduced.
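a minimal sketch of that idea, with hypothetical names and a toy wire format (the real logic lives in Ceph's C++ pg_pool_t encoder): gate the new fields on the peer's feature bits, so peers without the GMT hit-set feature receive the exact pre-GMT encoding and their CRC checks keep passing.

```python
# Hypothetical feature-gated encoding (not Ceph code): new fields are
# only emitted to peers that advertise the corresponding feature bit.
FEATURE_HITSET_GMT = 1 << 0  # stand-in for the real feature bit


def encode_pg_pool(pool, peer_features):
    # Base fields present in both old and new encodings.
    data = b"%d:%d" % (pool["size"], pool["pg_num"])
    if peer_features & FEATURE_HITSET_GMT:
        # Only peers that understand the GMT hit-set field get it.
        data += b":gmt=%d" % pool["use_gmt_hitset"]
    return data


pool = {"size": 3, "pg_num": 8, "use_gmt_hitset": 1}
legacy = encode_pg_pool(pool, peer_features=0)
modern = encode_pg_pool(pool, peer_features=FEATURE_HITSET_GMT)
print(legacy)  # b'3:8' -- byte-identical to the pre-GMT encoding
print(modern)  # b'3:8:gmt=1'
```

with this approach an old monitor re-encoding the map computes the same CRC as before the change, avoiding the "failed to encode full with correct crc" loop during mixed-version upgrades.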
#17 Updated by Kefu Chai almost 8 years ago
- Status changed from Fix Under Review to Resolved
#18 Updated by Nathan Cutler almost 8 years ago
- Related to Backport #12848: ReplicatedPG::hit_set_trim osd/ReplicatedPG.cc: 11006: FAILED assert(obc) added
#19 Updated by Loïc Dachary over 7 years ago
- Description updated (diff)
- Target version set to v0.94.7