Backport #13288

hammer: stuck recovering, unfound hit set due to removing it while !active

Added by Nathan Cutler almost 5 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Normal
Assignee:
Target version:
Release: hammer
Crash signature:


Related issues

Related to Ceph - Bug #13812: pgs stuck creating in upgrade test hammer->hammer.?->jewel Resolved 11/16/2015
Related to Ceph - Backport #12848: ReplicatedPG::hit_set_trim osd/ReplicatedPG.cc: 11006: FAILED assert(obc) Resolved
Copied from Ceph - Bug #13192: stuck recovering, unfound hit set due to removing it while !active Resolved 09/21/2015

History

#1 Updated by Loic Dachary almost 5 years ago

  • Assignee deleted (Samuel Just)

#2 Updated by Loic Dachary almost 5 years ago

  • Description updated (diff)
  • Status changed from New to In Progress
  • Assignee set to Loic Dachary

#3 Updated by Kefu Chai almost 5 years ago

  • Description updated (diff)

#4 Updated by Loic Dachary almost 5 years ago

  • Description updated (diff)

#5 Updated by Loic Dachary almost 5 years ago

  • Status changed from In Progress to Resolved
  • Target version set to v0.94.6

#6 Updated by Loic Dachary over 4 years ago

  • Description updated (diff)
  • Status changed from Resolved to In Progress
  • Assignee deleted (Loic Dachary)
  • Target version deleted (v0.94.6)

#7 Updated by Loic Dachary over 4 years ago

  • Related to Bug #13812: pgs stuck creating in upgrade test hammer->hammer.?->jewel added

#8 Updated by Loic Dachary over 4 years ago

  • Status changed from In Progress to New

#9 Updated by Loic Dachary over 4 years ago

  • Assignee set to Loic Dachary

#10 Updated by Nathan Cutler over 4 years ago

  • Subject changed from stuck recovering, unfound hit set due to removing it while !active to hammer: stuck recovering, unfound hit set due to removing it while !active

#11 Updated by Loic Dachary over 4 years ago

  • Assignee changed from Loic Dachary to Kefu Chai

@Kefu, assigning to you as I'm not familiar with the original patch

#12 Updated by Kefu Chai over 4 years ago

pushed to wip-12848-hammer.

running

teuthology-suite --priority 10 --suite upgrade --suite-branch jewel --email --ceph wip-12848-hammer --machine-type smithi,mira --filter 'upgrade/hammer/older/{0-cluster/start.yaml 1-install/v0.94.yaml 2-workload/blogbench.yaml 3-upgrade-sequence/upgrade-osd-mon-mds.yaml 4-final/{monthrash.yaml osdthrash.yaml testrados.yaml} distros/ubuntu_14.04'

at http://pulpito.ceph.com/kchai-2016-03-02_00:13:18-upgrade-wip-12848-hammer---basic-multi/ to find out what's going wrong.

#13 Updated by Kefu Chai over 4 years ago

  • Status changed from New to In Progress

#14 Updated by Kefu Chai over 4 years ago

  • mon.c v0.94, (e61c4f093f88e44961d157f65091733580cea79a)
  • osd.1
    • before upgrade v0.94, (e61c4f093f88e44961d157f65091733580cea79a)
    • after upgrade v0.94.6-52-ge640a61 (e640a61ea4c5d6700380cb503f08ebe6950848bc). This version constantly has problems decoding the inc osdmap.
  1. osd.1 wants osdmap epoch 14
    2016-03-02 08:21:18.709293 7f99180a2700  1 -- 172.21.5.140:6816/17048 <== mon.2 172.21.3.108:6790/0 92 ==== osd_map(14..14 src has 1..14) v3 ==== 238+0+0 (4188977541 0 0) 0x57a5600 con 0x544f600
    2016-03-02 08:21:18.709339 7f99180a2700  3 osd.1 13 handle_osd_map epochs [14,14], i have 13, src has [1,14]
    2016-03-02 08:21:18.709343 7f99180a2700 10 osd.1 13 handle_osd_map  got inc map for epoch 14
    2016-03-02 08:21:18.709487 7f99180a2700  2 osd.1 13 got incremental 14 but failed to encode full with correct crc; requesting
    2016-03-02 08:21:18.709530 7f99180a2700  1 -- 172.21.5.140:6816/17048 --> 172.21.3.108:6790/0 -- mon_get_osdmap(full 14-14) v1 -- ?+0 0x5a79600 con 0x544f600
    
  2. mon.c receives mon_get_osdmap(full 14-14) and returns osd_map(0..0). This is a known bug, see #12410
    2016-03-02 08:21:18.710399 7fb358210700  1 -- 172.21.3.108:6790/0 <== osd.1 172.21.5.140:6816/17048 68 ==== mon_get_osdmap(full 14-14) v1 ==== 34+0+0 (4126344047 0 0) 0x3a28a00 con 0x37e15a0
    2016-03-02 08:21:18.710680 7fb358210700 10 mon.c@2(peon).osd e14 preprocess_get_osdmap mon_get_osdmap(full 14-14) v1
    2016-03-02 08:21:18.710709 7fb358210700  1 -- 172.21.3.108:6790/0 --> 172.21.5.140:6816/17048 -- osd_map(0..0 src has 1..14) v3 -- ?+0 0x380ba80 con 0x37e15a0
    
  3. osd.1 is disappointed at receiving full 0..0 osdmap
    2016-03-02 08:21:18.709801 7f99180a2700  1 -- 172.21.5.140:6816/17048 <== mon.2 172.21.3.108:6790/0 94 ==== osd_map(0..0 src has 1..14) v3 ==== 32+0+0 (3775169053 0 0) 0x5861840 con 0x544f600
    2016-03-02 08:21:18.709863 7f99180a2700  3 osd.1 13 handle_osd_map epochs [0,0], i have 13, src has [1,14]
    2016-03-02 08:21:18.709867 7f99180a2700 10 osd.1 13  no new maps here, dropping
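
The three steps above can be sketched as a toy model. This is not Ceph code: `osd_step` and its parameters are hypothetical names, and the only behaviour modelled is what the log lines show (a CRC mismatch triggers a full-map request, the buggy monitor answers with 0..0, and the OSD drops a reply that carries no newer epoch).

```python
def osd_step(have_epoch, crc_ok, mon_is_buggy):
    """Return the epoch a toy OSD holds after one round with the monitor."""
    # Step 1: an incremental for the next epoch arrives. If the re-encoded
    # full map has the expected CRC, the OSD simply advances.
    wanted = have_epoch + 1
    if crc_ok:
        return wanted

    # Step 2: on a CRC mismatch the OSD asks for the full map. The buggy
    # monitor (assumption: modelled on the 0..0 reply in the log) returns
    # an empty range instead of the requested epoch.
    first, last = (0, 0) if mon_is_buggy else (wanted, wanted)

    # Step 3: a reply with no epoch newer than ours is dropped
    # ("no new maps here, dropping"), so the OSD stays where it was.
    if last <= have_epoch:
        return have_epoch
    return last


stuck = osd_step(13, crc_ok=False, mon_is_buggy=True)      # stays at 13
advanced = osd_step(13, crc_ok=False, mon_is_buggy=False)  # reaches 14
```

With both the CRC mismatch and the monitor bug present, every round ends at epoch 13, which matches the stuck OSD described here.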
    

That is probably why the decode CRC error prevents the OSD from moving ahead: it is waiting for an updated osdmap, but the monitor cannot help it. In this changeset, pg_pool_t has three more fields (two of them are padding fields, to be compatible with pg_pool_t version >= 19), so the re-encoded osdmap's CRC is different from the one calculated by mon.c v0.94. So I don't think we support the combination of

  • 1-install/v0.94.yaml, 3-upgrade-sequence/upgrade-osd-mon-mds
  • 1-install/v0.94.1.yaml, 3-upgrade-sequence/upgrade-osd-mon-mds

with a ceph build that includes change(s) that could lead to "failed to encode full with correct crc" errors.
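
A minimal sketch of why extra fields break the CRC check, assuming a made-up flat layout. `encode_pool` and its fields are hypothetical and do not reflect the real pg_pool_t encoding, and `zlib.crc32` stands in for whatever checksum the osdmap actually uses.

```python
import struct
import zlib


def encode_pool(size, pg_num, extra_fields=()):
    """Encode a toy pool record: two little-endian u32s plus optional extras."""
    blob = struct.pack("<II", size, pg_num)
    for value in extra_fields:
        # Each extra field (including zero-valued padding) adds 4 bytes.
        blob += struct.pack("<I", value)
    return blob


# Layout as the old monitor would encode it.
old_blob = encode_pool(3, 64)
# Layout after a change that appends three fields, two of them padding.
new_blob = encode_pool(3, 64, extra_fields=(0, 0, 0))

old_crc = zlib.crc32(old_blob)
new_crc = zlib.crc32(new_blob)
crc_matches = old_crc == new_crc
```

The logical content is the same in both records, but the byte streams differ, so a CRC computed by one side cannot be reproduced by the other.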

#16 Updated by Kefu Chai over 4 years ago

Or, we can encode the osdmap (i.e. pg_pool_t) in the same way as before the GMT feature bit was introduced.
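
A rough sketch of that idea, under stated assumptions: `FEATURE_GMT`, `encode_pool` and the record layout are all hypothetical and only illustrate gating the new fields on a feature bit, not the actual patch.

```python
import struct
import zlib

FEATURE_GMT = 1 << 0  # hypothetical feature bit, value chosen for illustration


def encode_pool(size, pg_num, use_gmt, features):
    """Encode a toy pool record, adding the new field only for capable peers."""
    blob = struct.pack("<II", size, pg_num)
    if features & FEATURE_GMT:
        # New layout: append the extra field only when the feature is present.
        blob += struct.pack("<B", int(use_gmt))
    return blob


# What an old daemon would have produced for the same pool.
legacy_blob = struct.pack("<II", 3, 64)

# Encoding without the feature bit reproduces the legacy bytes and CRC.
compat_blob = encode_pool(3, 64, use_gmt=True, features=0)
compat_crc_matches = zlib.crc32(compat_blob) == zlib.crc32(legacy_blob)

# Encoding with the feature bit uses the longer, new layout.
new_blob = encode_pool(3, 64, use_gmt=True, features=FEATURE_GMT)
```

The trade-off is that the new field is not carried at all until the feature bit is available, in exchange for CRCs that old monitors can still reproduce.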

#17 Updated by Kefu Chai over 4 years ago

  • Status changed from Fix Under Review to Resolved

#18 Updated by Nathan Cutler over 4 years ago

  • Related to Backport #12848: ReplicatedPG::hit_set_trim osd/ReplicatedPG.cc: 11006: FAILED assert(obc) added

#19 Updated by Loic Dachary almost 4 years ago

  • Description updated (diff)
  • Target version set to v0.94.7
