Project

General

Profile

Bug #38219

rebuild-mondb hangs

Added by Sage Weil over 1 year ago. Updated 14 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous, mimic, nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

http://pulpito.ceph.com/sage-2019-02-06_23:33:50-rados-master-distro-basic-smithi/

2019-02-06T23:50:53.045 INFO:tasks.ceph.mon.a.smithi097.stderr:/build/ceph-14.0.1-3328-g44ff972/src/mon/OSDMonitor.cc: In function 'virtual void OSDMonitor::update_from_paxos(bool*)' thread 7f47cb9d9700 time 2019-02-06 23:50:53.044652
2019-02-06T23:50:53.045 INFO:tasks.ceph.mon.a.smithi097.stderr:/build/ceph-14.0.1-3328-g44ff972/src/mon/OSDMonitor.cc: 384: FAILED ceph_assert(version > osdmap.epoch)
2019-02-06T23:50:53.046 INFO:tasks.ceph.mon.a.smithi097.stderr: ceph version 14.0.1-3328-g44ff972 (44ff9720582cae805c727d210b665786cb7519a0) nautilus (dev)
2019-02-06T23:50:53.047 INFO:tasks.ceph.mon.a.smithi097.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f47d870259a]
2019-02-06T23:50:53.047 INFO:tasks.ceph.mon.a.smithi097.stderr: 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x7f47d8702775]
2019-02-06T23:50:53.047 INFO:tasks.ceph.mon.a.smithi097.stderr: 3: (OSDMonitor::update_from_paxos(bool*)+0x3858) [0x56220e1fcfb8]
2019-02-06T23:50:53.047 INFO:tasks.ceph.mon.a.smithi097.stderr: 4: (PaxosService::refresh(bool*)+0x25a) [0x56220e17c8aa]
2019-02-06T23:50:53.047 INFO:tasks.ceph.mon.a.smithi097.stderr: 5: (Monitor::refresh_from_paxos(bool*)+0x10c) [0x56220e00516c]
2019-02-06T23:50:53.047 INFO:tasks.ceph.mon.a.smithi097.stderr: 6: (Paxos::do_refresh()+0x4f) [0x56220e16b65f]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 7: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0x8c0) [0x56220e1761c0]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 8: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x243) [0x56220e176f43]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 9: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x15c5) [0x56220e036c85]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 10: (Monitor::_ms_dispatch(Message*)+0x4d2) [0x56220e0372d2]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 11: (Monitor::ms_dispatch(Message*)+0x23) [0x56220e065263]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 12: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x26) [0x56220e061716]
2019-02-06T23:50:53.048 INFO:tasks.ceph.mon.a.smithi097.stderr: 13: (DispatchQueue::entry()+0x1569) [0x7f47d8912089]

/a/sage-2019-02-06_23:33:50-rados-master-distro-basic-smithi/3557785

History

#1 Updated by Sage Weil over 1 year ago

/a/sage-2019-03-03_23:01:07-rados-wip-sage3-testing-2019-03-03-1043-distro-basic-smithi/3664297

#2 Updated by Neha Ojha over 1 year ago

rados:singleton/{all/rebuild-mondb.yaml msgr-failures/many.yaml msgr/async.yaml objectstore/bluestore-bitmap.yaml rados.yaml supported-random-distro$/{ubuntu_latest.yaml}}

Failed in 2 out 10 runs. http://pulpito.ceph.com/nojha-2019-03-06_01:06:50-rados:singleton-master-distro-basic-smithi/

#3 Updated by Kefu Chai over 1 year ago

  • Assignee set to Kefu Chai

#4 Updated by Sage Weil over 1 year ago

/a/sage-2019-03-26_03:52:56-rados-wip-sage-testing-2019-03-25-1934-distro-basic-smithi/3774206

#5 Updated by Kefu Chai over 1 year ago

i am using following script to reproduce this issue locally, so far no luck

MDS=0 MGR=1 OSD=3 MON=3 ../src/vstart.sh -n --without-dashboard

sleep 5
monmap_fn=/tmp/monmap
bin/ceph mon getmap -o $monmap_fn
bin/init-ceph stop mon
for name in a b c; do
    if [ "$name" == "a" ]; then
        rm -rf dev/mon.$name/store.db
    else
        rm -rf dev/mon.$name
    fi
done

mon_store_dir=dev/mon.a
for osd in `seq 0 2`; do
    bin/init-ceph stop osd.$osd
    bin/ceph-objectstore-tool --no-mon-config \
                              --op update-mon-db \
                              --mon-store-path $mon_store_dir \
                              --data-path dev/osd$osd
done

key_fn=keyring
bin/ceph-authtool $key_fn -n 'mon.' --cap mon 'allow *'
bin/ceph-authtool $key_fn -n 'client.admin' \
                  '--cap' 'mon' 'allow *' \
                  '--cap' 'osd' 'allow *' \
                  '--cap' 'mds' 'allow *' \
                  '--cap' 'mgr' 'allow *'
bin/ceph-monstore-tool $mon_store_dir \
                       rebuild -- \
                       --keyring $key_fn \
                       --monmap $monmap_fn

for name in a b c; do
    bin/ceph-mon --cluster ceph --mkfs \
             -i $name \
             --keyring $key_fn \
             --monmap $monmap_fn
    bin/init-ceph start mon.$name
    sleep 5
done

for osd in `seq 0 2`; do
    bin/init-ceph start osd.$osd
done

for name in x; do
    bin/init-ceph stop mgr.$name
    key_fn=dev/mgr.$name/keyring
    bin/ceph -i $key_fn auth add mgr.$name \
         mon 'allow profile mgr' \
         mds 'allow *' \
         osd 'allow *'
    bin/init-ceph start mgr.$name
done

will rerun the test in sepia with more verbose logging messages,

#8 Updated by Kefu Chai over 1 year ago

### on osd host

+ sudo adjust-ulimits ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --no-mon-config --op update-mon-db --mon-store-path /home/ubuntu/cephtest/mon-store
2019-04-04T09:10:09.073 INFO:teuthology.orchestra.run.smithi116.stdout:adding auth for osd.0
2019-04-04T09:10:09.141 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#1
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#1
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 1
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 1
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#2
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#2
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 2
2019-04-04T09:10:09.142 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 2
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#3
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#3
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 3
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 3
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#4
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#4
2019-04-04T09:10:09.143 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 4
2019-04-04T09:10:09.144 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 4
2019-04-04T09:10:09.144 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#5
2019-04-04T09:10:09.144 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#5
2019-04-04T09:10:09.144 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 5
2019-04-04T09:10:09.144 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 5
2019-04-04T09:10:09.145 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#6
2019-04-04T09:10:09.145 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#6
2019-04-04T09:10:09.145 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 6
2019-04-04T09:10:09.145 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 6
2019-04-04T09:10:09.145 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#7
2019-04-04T09:10:09.146 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#7
2019-04-04T09:10:09.146 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 7
2019-04-04T09:10:09.146 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 7
2019-04-04T09:10:09.146 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#8
2019-04-04T09:10:09.147 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#8
2019-04-04T09:10:09.147 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 8
2019-04-04T09:10:09.147 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 8
2019-04-04T09:10:09.179 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#9
2019-04-04T09:10:09.179 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#9
2019-04-04T09:10:09.179 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 9
2019-04-04T09:10:09.179 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 9
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#10
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#10
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 10
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 10
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#11
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#11
2019-04-04T09:10:09.180 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 11
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 11
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#12
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#12
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 12
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 12
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#13
2019-04-04T09:10:09.181 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#13
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 13
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 13
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#14
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#14
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 14
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 14
2019-04-04T09:10:09.182 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/inc#15
2019-04-04T09:10:09.183 INFO:teuthology.orchestra.run.smithi116.stdout:adding osdmap/full#15
2019-04-04T09:10:09.183 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/last_committed = 15
2019-04-04T09:10:09.183 INFO:teuthology.orchestra.run.smithi116.stdout:set osdmap/full_latest = 15
2019-04-04T09:10:09.189 INFO:teuthology.orchestra.run.smithi116.stdout:osd.0   : 0 osdmaps trimmed, 15 osdmaps added.

### on the monitor's host
# ceph-monstore-tool --debug-rocksdb=0 /var/lib/ceph/mon/ceph-a/ dump-keys
2019-04-04 09:43:31.531 7fcada309280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:43:31.531 7fcada309280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:43:31.531 7fcada309280  0  set rocksdb option write_buffer_size = 33554432
2019-04-04 09:43:31.531 7fcada309280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:43:31.531 7fcada309280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:43:31.531 7fcada309280  0  set rocksdb option write_buffer_size = 33554432
auth / 1
auth / 2
auth / 3
auth / 4
auth / first_committed
auth / format_version
auth / last_committed
config / first_committed
config / last_committed
health / auth
health / first_committed
health / health
health / last_committed
health / leader
health / mdsmap
health / mgr
health / mgrstat
health / osdmap
health / quorum
logm / 1
logm / first_committed
logm / full_0
logm / full_latest
logm / last_committed
mdsmap / 1
mdsmap / first_committed
mdsmap / last_committed
mgr / 1
mgr / first_committed
mgr / last_committed
mgr_command_desc /
mgr_command_descs /
mgrstat / 1
mgrstat / first_committed
mgrstat / last_committed
mkfs / monmap
mon_config_key / config-history/1/
monitor / cluster_uuid
monitor / election_epoch
monitor / election_writeable_test
monitor / feature_set
monitor / magic
monitor_store / last_metadata
osd_pg_creating / creating
osdmap / 1
osdmap / 10
osdmap / 11
osdmap / 12
osdmap / 13
osdmap / 14
osdmap / 15
osdmap / 2
osdmap / 3
osdmap / 4
osdmap / 5
osdmap / 6
osdmap / 7
osdmap / 8
osdmap / 9
osdmap / first_committed
osdmap / full_1
osdmap / full_10
osdmap / full_11
osdmap / full_12
osdmap / full_13
osdmap / full_14
osdmap / full_15
osdmap / full_2
osdmap / full_3
osdmap / full_4
osdmap / full_5
osdmap / full_6
osdmap / full_7
osdmap / full_8
osdmap / full_9
osdmap / full_latest
osdmap / last_committed
paxos / 1
paxos / 2
paxos / first_committed
paxos / last_committed
paxos / last_pn
paxos / pending_pn
paxos / pending_v

# ceph-monstore-tool --debug-rocksdb=0 /var/lib/ceph/mon/ceph-a/ show-versions -- --map-type osdmap
2019-04-04 09:43:09.084 7f004491d280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:43:09.084 7f004491d280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:43:09.084 7f004491d280  0  set rocksdb option write_buffer_size = 33554432
2019-04-04 09:43:09.084 7f004491d280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:43:09.084 7f004491d280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:43:09.084 7f004491d280  0  set rocksdb option write_buffer_size = 33554432
first committed:    1
last  committed:    1

# ceph-monstore-tool --debug-rocksdb=0 /var/lib/ceph/mon/ceph-a/ show-versions -- --map-type paxos
2019-04-04 09:44:20.202 7f5eb4f23280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:44:20.202 7f5eb4f23280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:44:20.202 7f5eb4f23280  0  set rocksdb option write_buffer_size = 33554432
2019-04-04 09:44:20.202 7f5eb4f23280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:44:20.202 7f5eb4f23280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:44:20.202 7f5eb4f23280  0  set rocksdb option write_buffer_size = 33554432
first committed:    1
last  committed:    2

# ceph-monstore-tool --debug-rocksdb=0 /var/lib/ceph/mon/ceph-a/ show-versions -- --map-type auth
2019-04-04 09:44:51.982 7feb443cd280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:44:51.982 7feb443cd280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:44:51.982 7feb443cd280  0  set rocksdb option write_buffer_size = 33554432
2019-04-04 09:44:51.982 7feb443cd280  0  set rocksdb option compression = kNoCompression
2019-04-04 09:44:51.982 7feb443cd280  0  set rocksdb option level_compaction_dynamic_level_bytes = true
2019-04-04 09:44:51.982 7feb443cd280  0  set rocksdb option write_buffer_size = 33554432
first committed:    1
last  committed:    1

#9 Updated by Kefu Chai over 1 year ago

  • Status changed from 12 to Fix Under Review
  • Backport set to luminous, mimic, nautilus
  • Pull request ID set to 27465

#10 Updated by huang jun over 1 year ago

@Kefu Chai i don't understand why the osdmap/first_committed and osdmap/last_committed = 1?
it already set osdmap/last_committed=15 i think.

#11 Updated by Greg Farnum about 1 year ago

  • Status changed from Fix Under Review to In Progress
  • Priority changed from Urgent to Normal

Demoting as if you're running this you already need manual intervention anyway.

#12 Updated by Christoffer Anselm 12 months ago

Just a quick note as this might be relevant for the decision whether or not to integrate this PR:

Running mimic 13.2.6 we had a total loss of all mon data. For recovery we attempted steps similar to those stated in https://tracker.ceph.com/issues/38219#note-5

We had to rebuild the ceph-13.2.6 RPMs with this issue's PR and inject them into a ceph/ceph docker container to prevent the recovered mon from crashing once other mons got added to the cluster.

This PR saved our day. Please consider merging it for the next release if it does not introduce new regressions as it seems necessary for successful recovery in total loss of mon data scenarios.

#13 Updated by Neha Ojha 7 months ago

Seen in nautilus: /a/yuriw-2020-02-15_16:49:25-rados-nautilus-distro-basic-smithi/4767419/

#14 Updated by Deepika Upadhyay 14 days ago

/a/yuriw-2020-09-16_01:27:14-rados-wip-yuri3-testing-2020-09-16-0014-nautilus-distro-basic-smithi/5437537/teuthology.log

Also available in: Atom PDF