Bug #63281
src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty())
Description
/a/vshankar-2023-10-19_01:21:59-fs-wip-vshankar-testing-reef-20231018.163145-testing-default-smithi/7431739
2023-10-19T10:26:45.704 INFO:tasks.ceph.mds.b.smithi046.stderr:./src/mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread 7f9ab6c5f640 time 2023-10-19T10:26:45.702298+0000
2023-10-19T10:26:45.704 INFO:tasks.ceph.mds.b.smithi046.stderr:./src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty())
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: ceph version 18.2.0-910-gfd76c602 (fd76c602de5f283b87218395fceeb0217af17b62) reef (stable)
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x121) [0x7f9abe5a17d5]
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: 2: /usr/lib/ceph/libceph-common.so.2(+0x162989) [0x7f9abe5a1989]
2023-10-19T10:26:45.705 INFO:tasks.ceph.mds.b.smithi046.stderr: 3: ceph-mds(+0x18eaa3) [0x558d76dd7aa3]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 4: (MDCache::dispatch_fragment_dir(boost::intrusive_ptr<MDRequestImpl>&)+0x308) [0x558d76edfee8]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 5: (MDCache::fragment_frozen(boost::intrusive_ptr<MDRequestImpl>&, int)+0xe7) [0x558d76ee0977]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 6: (MDCache::fragment_mark_and_complete(boost::intrusive_ptr<MDRequestImpl>&)+0x8be) [0x558d76ee13ee]
2023-10-19T10:26:45.706 INFO:tasks.ceph.mds.b.smithi046.stderr: 7: (MDCache::merge_dir(CInode*, frag_t)+0x4d0) [0x558d76ee1f30]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 8: ceph-mds(+0x1204ad) [0x558d76d694ad]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 9: (CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x146) [0x7f9abe684c16]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 10: /usr/lib/ceph/libceph-common.so.2(+0x246691) [0x7f9abe685691]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 11: /lib/x86_64-linux-gnu/libc.so.6(+0x94b43) [0x7f9abdf1ab43]
2023-10-19T10:26:45.707 INFO:tasks.ceph.mds.b.smithi046.stderr: 12: /lib/x86_64-linux-gnu/libc.so.6(+0x126a00) [0x7f9abdfaca00]
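The assertion fires in MDLog::get_current_segment(), which requires at least one open log segment. The following is a minimal, self-contained model of that invariant (an illustrative sketch, not Ceph source; LogSegmentModel/MDLogModel are simplified stand-ins) showing why an MDS that has already trimmed its whole log must not re-enter the dirfrag merge path:

```cpp
#include <cassert>
#include <list>

// Simplified stand-in for the accessor that fires: the real
// MDLog::get_current_segment() does ceph_assert(!segments.empty()),
// but an up:stopping MDS runs trim_all(), which can empty the list.
struct LogSegmentModel { int seq; };

struct MDLogModel {
    std::list<LogSegmentModel> segments;

    LogSegmentModel* get_current_segment() {
        assert(!segments.empty());      // models ceph_assert(!segments.empty())
        return &segments.back();
    }

    void trim_all() { segments.clear(); }  // what the stopping rank does
};

// After trim_all(), any caller of get_current_segment() trips the
// assert -- which is what dispatch_fragment_dir did in this crash.
inline bool log_empty_after_trim() {
    MDLogModel log;
    log.segments.push_back({1});
    log.trim_all();
    return log.segments.empty();
}
```

The point of the sketch: the assert itself is a reasonable invariant; the bug is that a fragmentation operation reached the log after shutdown trimming, as the analysis below works out.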
I've seen this before and there is most likely an existing tracker for it.
Updated by Venky Shankar 6 months ago
This is with the reef branch - none of the PRs seem to be responsible for this. PRs in the test branch are:
https://github.com/ceph/ceph/pull/53497 - reef: mds: fix deadlock between unlinking and linkmerge
https://github.com/ceph/ceph/pull/53548 - reef: mds: log message when exiting due to asok command
https://github.com/ceph/ceph/pull/53553 - reef: qa: lengthen shutdown timeout for thrashed MDS
https://github.com/ceph/ceph/pull/53558 - reef: mds: add event for batching getattr/lookup
https://github.com/ceph/ceph/pull/53572 - reef: mgr/volumes: Fix pending_subvolume_deletions in volume info
https://github.com/ceph/ceph/pull/53624 - reef: qa/cephfs: switch to python3 for centos stream 9
https://github.com/ceph/ceph/pull/53661 - reef: ceph: allow xlock state to be LOCK_PREXLOCK when putting it
https://github.com/ceph/ceph/pull/53666 - reef: client: move the Inode to new auth mds session when changing auth cap
https://github.com/ceph/ceph/pull/53825 - reef: qa/cephfs: fix ior project build failure
https://github.com/ceph/ceph/pull/53827 - reef: qa/cephfs: fix build failure for mdtest project
https://github.com/ceph/ceph/pull/53906 - reef: qa: move nfs (mgr/nfs) related tests to fs suite
https://github.com/ceph/ceph/pull/53918 - reef: pybind/mgr/volumes: log mutex locks to help debug deadlocks
https://github.com/ceph/ceph/pull/54035 - reef: mds: adjust pre_segments_size for MDLog when trimming segments for st
https://github.com/ceph/ceph/pull/54037 - reef: qa: assign file system affinity for replaced MDS
Updated by Leonid Usov 6 months ago
- Related to Bug #51278: mds: "FAILED ceph_assert(!segments.empty())" added
Updated by Leonid Usov 6 months ago
There seems to be evidence that the mds in question was stopping:
-173> 2023-10-19T10:26:44.832+0000 7f9ab645e640 5 mds.beacon.b Sending beacon up:stopping seq 118
-143> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 7 mds.1.cache shutdown_pass
-134> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.log trim_all: 2/0/0
We could just add another check for `mds->is_stopping()` in MDCache::dispatch_fragment_dir, but I don't think that's the right place. By then it's already too late: we shouldn't have gotten there at all after deciding to shut down and trim the log. I wonder why the balancer was allowed to queue the merge operation after the shutdown pass:
-111> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 7 mds.1.626 shutdown_pass=false
-110> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() merging [dir 0x60e.000* ~mds1/stray4/ [2,head] auth v=26894 cv=26894/26894 state=1073741825|complete f(v20 m2023-10-19T10:13:30.448411+0000) n(v33 rc2023-10-19T10:13:30.448411+0000) hs=0+0,ss=0+0 0x558d7f7b2900]
-109> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() all sibs under 001* [0x558d83f87b00] should merge
-108> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() all sibs under 01* [0x558d83f87f80,0x558d83f88400] should merge
-107> 2023-10-19T10:26:45.696+0000 7f9ab6c5f640 10 mds.1.bal operator() all sibs under 1* [0x558d83f88880,0x558d83f88d00,0x558d83f89180,0x558d83f89600] should merge
Apparently, that was called by a timer event:
void MDBalancer::queue_merge(CDir *dir) {
    ...
    mds->timer.add_event_after(bal_fragment_interval,
                               new LambdaContext(std::move(callback)));
}
We could avoid scheduling the merge if `mds->is_stopping()`, but maybe we should just cancel (or run?) all timers before calling the shutdown pass on the cache?
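A minimal standalone sketch of the guarded-callback idea (names like is_stopping and queue_merge mirror the MDS code, but this is an illustrative model, not the Ceph implementation; MiniMDS and its timer queue are invented for the example):

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Model: a timer queue whose merge callback bails out if the rank has
// entered stopping by the time the event fires, mirroring the proposed
// guard inside the LambdaContext queued by MDBalancer::queue_merge.
struct MiniMDS {
    bool stopping = false;
    int merges_dispatched = 0;
    std::vector<std::function<void()>> timer_events;

    bool is_stopping() const { return stopping; }

    void queue_merge() {
        timer_events.push_back([this] {
            if (is_stopping())      // the proposed check: the state may have
                return;             // changed between queueing and firing
            ++merges_dispatched;    // would call merge_dir(...) in the MDS
        });
    }

    void fire_timers() {
        for (auto& ev : timer_events) ev();
        timer_events.clear();
    }
};
```

The key property the check exploits: the stopping decision happens between event queueing and event firing, so testing at queue time alone is not enough; the callback itself must re-test.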
Updated by Patrick Donnelly 6 months ago
Leonid Usov wrote:
There seems to be evidence that the mds in question was stopping:
[...]
Right, mds_thrash is adjusting max_mds causing MDS to stop.
We could just add another test for "mds->is_stopping()" in
MDCache::dispatch_fragment_dir
but I don't think that's the right place. It's already too late: we shouldn't have gotten there after we've decided to shut down and trim the log. I wonder why the balancer was allowed to queue the merging operation after the shutdown pass: [...]
Apparently, that was called by a timer event:
[...]
We could avoid scheduling the merge if `mds->is_stopping()`, but maybe we should just cancel (or run?) all timers before calling the shutdown pass on the cache?
I think adding a check in the callbacks for queue_split/queue_merge is appropriate.
Updated by Leonid Usov 6 months ago
- Status changed from New to In Progress
- Pull request ID set to 54178
Updated by Leonid Usov 6 months ago
- Status changed from In Progress to Fix Under Review
Updated by Rishabh Dave 6 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 6 months ago
- Copied to Backport #63479: reef: src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty()) added
Updated by Backport Bot 6 months ago
- Copied to Backport #63480: quincy: src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty()) added