Bug #63281
src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty())
Description
/a/vshankar-2023-10-19_01:21:59-fs-wip-vshankar-testing-reef-20231018.163145-testing-default-smithi/7431739
2023-10-19T10:26:45.704 INFO:tasks.ceph.mds.b.smithi046.stderr:./src/mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread 7f9ab6c5f640 time 2023-10-19T10:26:45.702298+0000
2023-10-19T10:26:45.704 INFO:tasks.ceph.mds.b.smithi046.stderr:./src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty())
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: ceph version 18.2.0-910-gfd76c602 (fd76c602de5f283b87218395fceeb0217af17b62) reef (stable)
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x121) [0x7f9abe5a17d5]
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: 2: /usr/lib/ceph/libceph-common.so.2(+0x162989) [0x7f9abe5a1989]
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: 3: ceph-mds(+0x18eaa3) [0x558d76dd7aa3]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 4: (MDCache::dispatch_fragment_dir(boost::intrusive_ptr<MDRequestImpl>&)+0x308) [0x558d76edfee8]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 5: (MDCache::fragment_frozen(boost::intrusive_ptr<MDRequestImpl>&, int)+0xe7) [0x558d76ee0977]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 6: (MDCache::fragment_mark_and_complete(boost::intrusive_ptr<MDRequestImpl>&)+0x8be) [0x558d76ee13ee]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 7: (MDCache::merge_dir(CInode*, frag_t)+0x4d0) [0x558d76ee1f30]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 8: ceph-mds(+0x1204ad) [0x558d76d694ad]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 9: (CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x146) [0x7f9abe684c16]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 10: /usr/lib/ceph/libceph-common.so.2(+0x246691) [0x7f9abe685691]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 11: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f9abdf1ab43]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f9abdfaca00]
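The assertion fires in MDLog::get_current_segment(), which requires at least one open log segment. The following is a minimal, self-contained model of that invariant (an illustrative sketch, not Ceph source; LogSegmentModel/MDLogModel are simplified stand-ins) showing why an MDS that has already trimmed its whole log must not re-enter the dirfrag merge path:

```cpp
#include <cassert>
#include <list>

// Simplified stand-in for the accessor that fires: the real
// MDLog::get_current_segment() does ceph_assert(!segments.empty()),
// but an up:stopping MDS runs trim_all(), which can empty the list.
struct LogSegmentModel { int seq; };

struct MDLogModel {
    std::list<LogSegmentModel> segments;

    LogSegmentModel* get_current_segment() {
        assert(!segments.empty());      // models ceph_assert(!segments.empty())
        return &segments.back();
    }

    void trim_all() { segments.clear(); }  // what the stopping rank does
};

// After trim_all(), any caller of get_current_segment() trips the
// assert -- which is what dispatch_fragment_dir did in this crash.
inline bool log_empty_after_trim() {
    MDLogModel log;
    log.segments.push_back({1});
    log.trim_all();
    return log.segments.empty();
}
```

The point of the sketch: the assert itself is a reasonable invariant; the bug is that a fragmentation operation reached the log after shutdown trimming, as the analysis below works out.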
I've seen this before and there is most likely an existing tracker for it.
Updated by Venky Shankar 6 months ago
This is with the reef branch - none of the PRs seem to be responsible for this. PRs in the test branch are:
https://github.com/ceph/ceph/pull/53497 - reef: mds: fix deadlock between unlinking and linkmerge
https://github.com/ceph/ceph/pull/53548 - reef: mds: log message when exiting due to asok command
https://github.com/ceph/ceph/pull/53553 - reef: qa: lengthen shutdown timeout for thrashed MDS
https://github.com/ceph/ceph/pull/53558 - reef: mds: add event for batching getattr/lookup
https://github.com/ceph/ceph/pull/53572 - reef: mgr/volumes: Fix pending_subvolume_deletions in volume info
https://github.com/ceph/ceph/pull/53624 - reef: qa/cephfs: switch to python3 for centos stream 9
https://github.com/ceph/ceph/pull/53661 - reef: ceph: allow xlock state to be LOCK_PREXLOCK when putting it
https://github.com/ceph/ceph/pull/53666 - reef: client: move the Inode to new auth mds session when changing auth cap
https://github.com/ceph/ceph/pull/53825 - reef: qa/cephfs: fix ior project build failure
https://github.com/ceph/ceph/pull/53827 - reef: qa/cephfs: fix build failure for mdtest project
https://github.com/ceph/ceph/pull/53906 - reef: qa: move nfs (mgr/nfs) related tests to fs suite
https://github.com/ceph/ceph/pull/53918 - reef: pybind/mgr/volumes: log mutex locks to help debug deadlocks
https://github.com/ceph/ceph/pull/54035 - reef: mds: adjust pre_segments_size for MDLog when trimming segments for st
https://github.com/ceph/ceph/pull/54037 - reef: qa: assign file system affinity for replaced MDS
Updated by Leonid Usov 6 months ago
- Related to Bug #51278: mds: "FAILED ceph_assert(!segments.empty())" added
Updated by Leonid Usov 6 months ago
There seems to be evidence that the mds in question was stopping:
-173> 2023-10-19T10:26:44.832+0000 7f9ab645e640 5 mds.beacon.b Sending beacon up:stopping seq 118
-143> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 7 mds.1.cache shutdown_pass
-134> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.log trim_all: 2/0/0
We could just add another check for `mds->is_stopping()` in MDCache::dispatch_fragment_dir, but I don't think that's the right place. By then it's already too late: we shouldn't have gotten there at all after deciding to shut down and trim the log. I wonder why the balancer was allowed to queue the merge operation after the shutdown pass:
-111> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 7 mds.1.626 shutdown_pass=false
-110> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() merging [dir 0x60e.000* ~mds1/stray4/ [2,head] auth v=26894 cv=26894/26894 state=1073741825|complete f(v20 m2023-10-19T10:13:30.448411+0000) n(v33 rc2023-10-19T10:13:30.448411+0000) hs=0+0,ss=0+0 0x558d7f7b2900]
-109> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() all sibs under 001* [0x558d83f87b00] should merge
-108> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() all sibs under 01* [0x558d83f87f80,0x558d83f88400] should merge
-107> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() all sibs under 1* [0x558d83f88880,0x558d83f88d00,0x558d83f89180,0x558d83f89600] should merge
Apparently, that was called by a timer event:
void MDBalancer::queue_merge(CDir *dir) {
    ...
    mds->timer.add_event_after(bal_fragment_interval,
                               new LambdaContext(std::move(callback)));
}
We could avoid scheduling the merge if `mds->is_stopping()`, but maybe we should just cancel (or run?) all timers before calling the shutdown pass on the cache?
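A minimal standalone sketch of the guarded-callback idea (names like is_stopping and queue_merge mirror the MDS code, but this is an illustrative model, not the Ceph implementation; MiniMDS and its timer queue are invented for the example):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Model: a timer queue whose merge callback bails out if the rank has
// entered stopping by the time the event fires, mirroring the proposed
// guard inside the LambdaContext queued by MDBalancer::queue_merge.
struct MiniMDS {
    bool stopping = false;
    int merges_dispatched = 0;
    std::vector<std::function<void()>> timer_events;

    bool is_stopping() const { return stopping; }

    void queue_merge() {
        timer_events.push_back([this] {
            if (is_stopping())      // the proposed check: the state may have
                return;             // changed between queueing and firing
            ++merges_dispatched;    // would call merge_dir(...) in the MDS
        });
    }

    void fire_timers() {
        for (auto& ev : timer_events) ev();
        timer_events.clear();
    }
};
```

The key property the check exploits: the stopping decision happens between event queueing and event firing, so testing at queue time alone is not enough; the callback itself must re-test.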
Updated by Patrick Donnelly 6 months ago
Leonid Usov wrote:
There seems to be evidence that the mds in question was stopping:
[...]
Right, mds_thrash is adjusting max_mds causing MDS to stop.
We could just add another test for "mds->is_stopping()" in
MDCache::dispatch_fragment_dir
but I don't think that's the right place. It's already too late: we shouldn't have gotten there after we've decided to shut down and trim the log. I wonder why the balancer was allowed to queue the merging operation after the shutdown pass: [...]
Apparently, that was called by a timer event:
[...]
We could avoid scheduling the merge if `mds->is_stopping()`, but maybe we should just cancel (or run?) all timers before calling the shutdown pass on the cache?
I think adding a check in the callbacks for queue_split/queue_merge is appropriate.
Updated by Leonid Usov 6 months ago
- Status changed from New to In Progress
- Pull request ID set to 54178
Updated by Leonid Usov 6 months ago
- Status changed from In Progress to Fix Under Review
Updated by Rishabh Dave 6 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 6 months ago
- Copied to Backport #63479: reef: src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty()) added
Updated by Backport Bot 6 months ago
- Copied to Backport #63480: quincy: src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty()) added