Bug #61909

mds/fsmap: `ceph fs fail` causes mon crash

Added by yite gu 8 months ago. Updated 7 months ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph health was OK before running `ceph fs fail <fs_name>`:

# ceph -s
  cluster:
    id:     81ba98a0-72d4-48f5-a3e0-23c91b1e3196
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,a,e (age 13m)
    mgr: a(active, since 15h)
    mds: vees-root-cephfs:1 {0=vees-root-cephfs-a=up:active} 1 up:standby-replay 1 up:active
    osd: 24 osds: 24 up (since 8d), 24 in (since 8d)

  task status:
    scrub status:
        mds.vees-root-cephfs-a: idle
        mds.vees-root-cephfs-b: idle
        mds.vees-root-cephfs-c: idle

  data:
    pools:   4 pools, 193 pgs
    objects: 445.37k objects, 1.6 TiB
    usage:   4.0 TiB used, 164 TiB / 168 TiB avail
    pgs:     193 active+clean

# ceph fs dump
dumped fsmap epoch 44
e44
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 2

Filesystem 'vees-root-cephfs' (2)
fs_name    vees-root-cephfs
epoch    44
flags    32
created    2023-07-05T08:00:53.406008+0000
modified    2023-07-05T09:34:26.062907+0000
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    1099511627776
min_compat_client    0 (unknown)
last_failure    0
last_failure_osd_epoch    3781
compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds    1
in    0
up    {0=20804286}
failed
damaged
stopped
data_pools    [6]
metadata_pool    5
inline_data    disabled
balancer
standby_count_wanted    1
[mds.vees-root-cephfs-b{0:20804282} state up:active seq 844 join_fscid=2 addr [v2:10.6.4.26:6864/2481973924,v1:10.6.4.26:6865/2481973924]]
[mds.vees-root-cephfs-a{0:20804286} state up:active seq 844 join_fscid=2 addr [v2:10.6.4.28:6832/712889410,v1:10.6.4.28:6833/712889410]]
[mds.vees-root-cephfs-c{0:20831362} state up:standby-replay seq 3 join_fscid=2 addr [v2:10.6.4.27:6864/2324823098,v1:10.6.4.27:6865/2324823098]]

Executing `ceph fs fail <fs_name>` crashed the mon with the following assertion failure:
    -1> 2023-07-06T02:34:45.388+0000 7f1e79735700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/mds/FSMap.cc: In function 'void FSMap::erase(mds_gid_t, epoch_t)' thread 7f1e79735700 time 2023-07-06T02:34:45.386825+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/mds/FSMap.cc: 989: FAILED ceph_assert(fs->mds_map.up.at(info.rank) == info.global_id)

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f1e8729ab8c]
 2: (()+0x279da6) [0x7f1e8729ada6]
 3: (()+0x77f95a) [0x7f1e877a095a]
 4: (MDSMonitor::fail_mds_gid(FSMap&, mds_gid_t)+0x1df) [0x558a5e046f9f]
 5: (FailHandler::handle(Monitor*, FSMap&, boost::intrusive_ptr<MonOpRequest>, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > >, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, boost::variant<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool, long, double, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >, std::vector<long, std::allocator<long> >, std::vector<double, std::allocator<double> > > > > > const&, std::__cxx11::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&)+0x381) [0x558a5e063c11]
 6: (MDSMonitor::prepare_command(boost::intrusive_ptr<MonOpRequest>)+0x283) [0x558a5e04c6d3]
 7: (MDSMonitor::prepare_update(boost::intrusive_ptr<MonOpRequest>)+0x153) [0x558a5e04d223]
 8: (PaxosService::dispatch(boost::intrusive_ptr<MonOpRequest>)+0xa6d) [0x558a5df7de3d]
 9: (Monitor::handle_command(boost::intrusive_ptr<MonOpRequest>)+0x5d11) [0x558a5de80ae1]
 10: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x888) [0x558a5de81c68]
 11: (Monitor::_ms_dispatch(Message*)+0x69d) [0x558a5de82fdd]
 12: (Monitor::handle_forward(boost::intrusive_ptr<MonOpRequest>)+0x941) [0x558a5de84ca1]
 13: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x138b) [0x558a5de8276b]
 14: (Monitor::_ms_dispatch(Message*)+0x69d) [0x558a5de82fdd]
 15: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x558a5deb1b8c]
 16: (DispatchQueue::entry()+0x126a) [0x7f1e874b92aa]
 17: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f1e8755b901]
 18: (()+0x82de) [0x7f1e84d912de]
 19: (clone()+0x43) [0x7f1e83ac8e83]
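
For context, the failing assert, `ceph_assert(fs->mds_map.up.at(info.rank) == info.global_id)`, requires the fsmap's `up` table (rank → gid) to agree with the rank and gid recorded in the daemon's own entry. In the dump above, `up` maps rank 0 to gid 20804286 (vees-root-cephfs-a), while vees-root-cephfs-b{0:20804282} also reports itself up:active on rank 0, so removing b during `fs fail` cannot satisfy the check. The two pieces of state being compared can be pulled straight from the dump; the snippet below is only an illustrative cross-check of the dump output, not a Ceph tool:

# The authoritative rank -> gid table kept in the fsmap ...
ceph fs dump | grep -E '^up[[:space:]]'
# ... versus the {rank:gid} that each up:active daemon entry claims for itself.
ceph fs dump | grep 'state up:active'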

History

#1 Updated by yite gu 8 months ago

Supplementary key logs:

    -5> 2023-07-06T02:34:45.386+0000 7f1e79735700  0 log_channel(audit) log [INF] : from='client.21065428 ' entity='client.admin' cmd=[{"prefix": "fs fail", "fs_name": "vees-root-cephfs"}]: dispatch
    -4> 2023-07-06T02:34:45.386+0000 7f1e79735700  5 mon.b@0(leader).paxos(paxos active c 34877455..34878174) is_readable = 1 - now=2023-07-06T02:34:45.386778+0000 lease_expire=2023-07-06T02:34:49.279095+0000 has v0 lc 34878174
    -3> 2023-07-06T02:34:45.386+0000 7f1e79735700  7 mon.b@0(leader).mds e44 prepare_update mon_command({"prefix": "fs fail", "fs_name": "vees-root-cephfs"} v 0) v1
    -2> 2023-07-06T02:34:45.386+0000 7f1e79735700  1 mon.b@0(leader).mds e44 fail_mds_gid 20804282 mds.vees-root-cephfs-b role 0

#2 Updated by Dhairya Parmar 8 months ago

I see the scrub status is IDLE:

  task status:
    scrub status:
        mds.vees-root-cephfs-a: idle
        mds.vees-root-cephfs-b: idle
        mds.vees-root-cephfs-c: idle

How long has it been like this? This is likely the reason: when you try to fail the fs, all ranks are failed, but since the MDSs cannot be failed at the moment because some scrub is under way, it fails at `ceph_assert(mon.osdmon()->is_writeable());`

#3 Updated by yite gu 8 months ago

Dhairya Parmar wrote:

> I see the scrub status is IDLE:
> [...]
>
> How long has it been like this? This is likely the reason: when you try to fail the fs, all ranks are failed, but since the MDSs cannot be failed at the moment because some scrub is under way, it fails at `ceph_assert(mon.osdmon()->is_writeable());`

The MDS service was newly added to this cluster, and the fs has not served any client data yet. Why is fs->mds_map.up.at(info.rank) != info.global_id? Is this a bug, or did the operator do something wrong?

#4 Updated by yite gu 8 months ago

The active MDS is vees-root-cephfs-a, whose gid is 20804286, but the gid of vees-root-cephfs-b was used.

#5 Updated by Dhairya Parmar 8 months ago

yite gu wrote:

> Dhairya Parmar wrote:
> [...]
>
> The MDS service was newly added to this cluster, and the fs has not served any client data yet. Why is fs->mds_map.up.at(info.rank) != info.global_id? Is this a bug, or did the operator do something wrong?

I could not find fs->mds_map.up.at(info.rank) != info.global_id; are you pointing to fs->mds_map.up.at(info.rank) == info.global_id in FSMap::erase()?

#6 Updated by yite gu 8 months ago

Dhairya Parmar wrote:

> yite gu wrote:
> [...]
>
> I could not find fs->mds_map.up.at(info.rank) != info.global_id; are you pointing to fs->mds_map.up.at(info.rank) == info.global_id in FSMap::erase()?

Yes, that is the assert point.

#7 Updated by yite gu 8 months ago

[mds.vees-root-cephfs-b{0:20804282} state up:active seq 844 join_fscid=2 addr [v2:10.6.4.26:6864/2481973924,v1:10.6.4.26:6865/2481973924]]
[mds.vees-root-cephfs-a{0:20804286} state up:active seq 844 join_fscid=2 addr [v2:10.6.4.28:6832/712889410,v1:10.6.4.28:6833/712889410]]

I think this may be the problem: both a and b are active at the same time (both claiming rank 0), and both are attached to fscid 2.
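
A quick way to spot that condition in any `ceph fs dump` output is sketched below; the sed expressions are just an ad-hoc parse of the `[mds.<name>{<rank>:<gid>} ...]` lines shown above, so treat this as illustrative rather than authoritative:

# Print rank/gid/name for every up:active daemon entry.
ceph fs dump \
  | grep 'state up:active' \
  | sed -E 's/^\[mds\.([^{]+)\{([0-9]+):([0-9]+)\}.*/rank=\2 gid=\3 name=\1/'

# Any rank printed here is claimed by more than one active daemon, which the
# single-valued up table cannot represent (rank 0 in this report: 20804282 and 20804286).
ceph fs dump \
  | grep 'state up:active' \
  | sed -E 's/^\[mds\.[^{]+\{([0-9]+):[0-9]+\}.*/\1/' \
  | sort | uniq -d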

#8 Updated by yite gu 8 months ago

A multi-active layout should look like the one below:
max_mds 2
up {0=11123,1=11087}
[mds.vees-root-cephfs-c{0:11123} state up:active seq 3 join_fscid=2 addr [v2:10.133.17.43:6809/224983201,v1:10.133.17.43:6813/224983201]]
[mds.vees-root-cephfs-a{1:11087} state up:active seq 1102 join_fscid=2 addr [v2:10.133.17.81:6834/3469607936,v1:10.133.17.81:6835/3469607936]]

# ceph fs dump
dumped fsmap epoch 21
e21
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 2

Filesystem 'vees-root-cephfs' (2)
fs_name    vees-root-cephfs
epoch    21
flags    32
created    2023-07-06T13:34:48.954228+0000
modified    2023-07-06T14:48:02.380019+0000
tableserver    0
root    0
session_timeout    60
session_autoclose    300
max_file_size    1099511627776
min_compat_client    0 (unknown)
last_failure    0
last_failure_osd_epoch    0
compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds    2
in    0,1
up    {0=11123,1=11087}
failed
damaged
stopped
data_pools    [6]
metadata_pool    5
inline_data    disabled
balancer
standby_count_wanted    1
[mds.vees-root-cephfs-c{0:11123} state up:active seq 3 join_fscid=2 addr [v2:10.133.17.43:6809/224983201,v1:10.133.17.43:6813/224983201]]
[mds.vees-root-cephfs-b{0:11106} state up:standby-replay seq 3 join_fscid=2 addr [v2:10.133.17.78:6832/3539299445,v1:10.133.17.78:6833/3539299445]]
[mds.vees-root-cephfs-a{1:11087} state up:active seq 1102 join_fscid=2 addr [v2:10.133.17.81:6834/3469607936,v1:10.133.17.81:6835/3469607936]]

#9 Updated by Patrick Donnelly 8 months ago

yite gu wrote:

> A multi-active layout should look like the one below:
> max_mds 2
> up {0=11123,1=11087}
> [mds.vees-root-cephfs-c{0:11123} state up:active seq 3 join_fscid=2 addr [v2:10.133.17.43:6809/224983201,v1:10.133.17.43:6813/224983201]]
> [mds.vees-root-cephfs-a{1:11087} state up:active seq 1102 join_fscid=2 addr [v2:10.133.17.81:6834/3469607936,v1:10.133.17.81:6835/3469607936]]
> [...]

Yes, the scary thing about this bug is that you had two rank-0 MDSs at the same time! Any idea how that happened? Any logs?

#10 Updated by yite gu 8 months ago

Patrick Donnelly wrote:

> yite gu wrote:
> [...]
>
> Yes, the scary thing about this bug is that you had two rank-0 MDSs at the same time! Any idea how that happened? Any logs?

I found the operator who runs this cluster, but he himself does not know what he did that could have caused this.

#11 Updated by yite gu 8 months ago

Is there any way to recover from this?

#12 Updated by Patrick Donnelly 8 months ago

yite gu wrote:

> Is there any way to recover from this?

I would reset the MDSMap:

https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#mds-map-reset

You may want to wipe the MDS journals (at least rank 0) because of the strong possibility of a corrupt journal. You can back up and test recovery, though.
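
For reference, the linked procedure would look roughly like the sketch below for this filesystem (vees-root-cephfs, rank 0). This is only an outline; follow the exact steps and flags in the disaster-recovery-experts documentation for your release:

# Back up the rank-0 journal before modifying anything.
cephfs-journal-tool --rank=vees-root-cephfs:0 journal export backup.bin

# Optionally salvage what can be recovered from the journal, then wipe it.
cephfs-journal-tool --rank=vees-root-cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=vees-root-cephfs:0 journal reset

# Reset the MDS map so the filesystem comes back with a single rank 0.
ceph fs reset vees-root-cephfs --yes-i-really-mean-it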

#13 Updated by yite gu 8 months ago

Patrick Donnelly wrote:

> yite gu wrote:
> [...]
>
> I would reset the MDSMap:
>
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#mds-map-reset
>
> You may want to wipe the MDS journals (at least rank 0) because of the strong possibility of a corrupt journal. You can back up and test recovery, though.

The mon crashed again:

debug      0> 2023-07-07T13:31:39.489+0000 7f732fc6e700 -1 *** Caught signal (Aborted) **
 in thread 7f732fc6e700 thread_name:ms_dispatch

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (()+0x12dd0) [0x7f733b2d4dd0]
 2: (gsignal()+0x10f) [0x7f7339f3d70f]
 3: (abort()+0x127) [0x7f7339f27b25]
 4: (()+0x9006b) [0x7f733a8f206b]
 5: (()+0x9650c) [0x7f733a8f850c]
 6: (()+0x96567) [0x7f733a8f8567]
 7: (()+0x967c8) [0x7f733a8f87c8]
 8: (()+0x92015) [0x7f733a8f4015]
 9: (FSMap::print_summary(ceph::Formatter*, std::ostream*) const+0x15bc) [0x7f733dce2c1c]
 10: (MDSMonitor::on_active()+0xc1) [0x55865c3484c1]
 11: (PaxosService::_active()+0x1f5) [0x55865c287d75]
 12: (Context::complete(int)+0xd) [0x55865c1918dd]
 13: (void finish_contexts<std::__cxx11::list<Context*, std::allocator<Context*> > >(ceph::common::CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0xa5) [0x55865c1b9ed5]
 14: (Paxos::finish_round()+0x169) [0x55865c27d779]
 15: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xe29) [0x55865c27eb49]
 16: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x417) [0x55865c27f7e7]
 17: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0x1089) [0x55865c18b469]
 18: (Monitor::_ms_dispatch(Message*)+0x69d) [0x55865c18bfdd]
 19: (Dispatcher::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x5c) [0x55865c1bab8c]
 20: (DispatchQueue::entry()+0x126a) [0x7f733d9f22aa]
 21: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f733da94901]
 22: (()+0x82de) [0x7f733b2ca2de]
 23: (clone()+0x43) [0x7f733a001e83]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Ceph cannot be accessed. How can I recover the mon? Any method is acceptable.

#14 Updated by yite gu 8 months ago

yite gu wrote:

> Patrick Donnelly wrote:
> [...]
>
> The mon crashed again:
> [...]
>
> Ceph cannot be accessed. How can I recover the mon? Any method is acceptable.

I stopped the mon, removed all the mds keys from RocksDB with ceph-kvstore-tool, and the mon started OK. But the fs will never come back.
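
For anyone who ends up in the same corner, the steps above would look roughly like the sketch below. The store path, systemd unit name, and the assumption that the MDSMonitor keys live under the `mdsmap` prefix are illustrative only, so list the keys and copy store.db aside before removing anything, and consider whether this needs to be repeated on each mon given your quorum state:

# Stop the mon and keep a copy of its store before editing it.
systemctl stop ceph-mon@<id>
cp -a /var/lib/ceph/mon/ceph-<id>/store.db /var/lib/ceph/mon/ceph-<id>/store.db.bak

# Inspect the keys first, then drop everything under the assumed mdsmap prefix.
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-<id>/store.db list mdsmap
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-<id>/store.db rm-prefix mdsmap

systemctl start ceph-mon@<id>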

#15 Updated by Patrick Donnelly 8 months ago

yite gu wrote:

> yite gu wrote:
> [...]
>
> I stopped the mon, removed all the mds keys from RocksDB with ceph-kvstore-tool, and the mon started OK. But the fs will never come back.

Did you do this?

#16 Updated by yite gu 8 months ago

Patrick Donnelly wrote:

> yite gu wrote:
> [...]
>
> Did you do this?

Yes. There was really no other way: clients use RBD storage in this cluster, so I was in a hurry to recover Ceph. Otherwise I would get complaints.

#17 Updated by Venky Shankar 7 months ago

  • Status changed from New to Can't reproduce

> Yes. There was really no other way: clients use RBD storage in this cluster, so I was in a hurry to recover Ceph. Otherwise I would get complaints.

It would have been good if we could have looked at the mon logs to see why this happened, but I guess the urgency of the situation prompted the above. Please file a tracker ticket if you have more details.

Closing this for now.
