Bug #51278

mds: "FAILED ceph_assert(!segments.empty())"

Added by Patrick Donnelly almost 3 years ago. Updated 6 months ago.

Status: Triaged
Priority: High
Assignee:
Category: Correctness/Safety
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: quincy,pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-06-17T00:38:52.017 INFO:tasks.ceph.mds.f.smithi175.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-5307-g06fb5cf0/rpm/el8/BUILD/ceph-17.0.0-5307-g06fb5cf0/src/mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread 7f196e925700 time 2021-06-17T00:38:52.017144+0000
2021-06-17T00:38:52.018 INFO:tasks.ceph.mds.f.smithi175.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-5307-g06fb5cf0/rpm/el8/BUILD/ceph-17.0.0-5307-g06fb5cf0/src/mds/MDLog.h: 99: FAILED ceph_assert(!segments.empty())
2021-06-17T00:38:52.019 INFO:tasks.ceph.mds.f.smithi175.stderr: ceph version 17.0.0-5307-g06fb5cf0 (06fb5cf0031e099ece537a86a27543dc4010ce0c) quincy (dev)
2021-06-17T00:38:52.019 INFO:tasks.ceph.mds.f.smithi175.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f1978b7e00c]
2021-06-17T00:38:52.020 INFO:tasks.ceph.mds.f.smithi175.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x27d214) [0x7f1978b7e214]
2021-06-17T00:38:52.020 INFO:tasks.ceph.mds.f.smithi175.stderr: 3: ceph-mds(+0x271483) [0x5562dbfa2483]
2021-06-17T00:38:52.020 INFO:tasks.ceph.mds.f.smithi175.stderr: 4: (MDCache::dispatch_fragment_dir(boost::intrusive_ptr<MDRequestImpl>&)+0x4c4) [0x5562dc02c094]
2021-06-17T00:38:52.020 INFO:tasks.ceph.mds.f.smithi175.stderr: 5: (MDCache::fragment_frozen(boost::intrusive_ptr<MDRequestImpl>&, int)+0x2cc) [0x5562dc02ce1c]
2021-06-17T00:38:52.020 INFO:tasks.ceph.mds.f.smithi175.stderr: 6: (MDCache::fragment_mark_and_complete(boost::intrusive_ptr<MDRequestImpl>&)+0xa2c) [0x5562dc03d65c]
2021-06-17T00:38:52.021 INFO:tasks.ceph.mds.f.smithi175.stderr: 7: (MDCache::merge_dir(CInode*, frag_t)+0x5a3) [0x5562dc03e483]
2021-06-17T00:38:52.021 INFO:tasks.ceph.mds.f.smithi175.stderr: 8: ceph-mds(+0x3ed2a2) [0x5562dc11e2a2]
2021-06-17T00:38:52.022 INFO:tasks.ceph.mds.f.smithi175.stderr: 9: (Context::complete(int)+0xd) [0x5562dbeec07d]
2021-06-17T00:38:52.022 INFO:tasks.ceph.mds.f.smithi175.stderr: 10: (SafeTimer::timer_thread()+0x1c0) [0x7f1978c830b0]
2021-06-17T00:38:52.022 INFO:tasks.ceph.mds.f.smithi175.stderr: 11: (SafeTimerThread::entry()+0x11) [0x7f1978c85c51]
2021-06-17T00:38:52.022 INFO:tasks.ceph.mds.f.smithi175.stderr: 12: (Thread::_entry_func(void*)+0xd) [0x7f1978c74d2d]
2021-06-17T00:38:52.023 INFO:tasks.ceph.mds.f.smithi175.stderr: 13: /lib64/libpthread.so.0(+0x814a) [0x7f197791914a]
2021-06-17T00:38:52.023 INFO:tasks.ceph.mds.f.smithi175.stderr: 14: clone()
2021-06-17T00:38:52.023 INFO:tasks.ceph.mds.f.smithi175.stderr:*** Caught signal (Aborted) **

From: /ceph/teuthology-archive/pdonnell-2021-06-16_21:26:55-fs-wip-pdonnell-testing-20210616.191804-distro-basic-smithi/6175747/teuthology.log
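For context, the assert fires inside the MDLog accessor named in the message above. Below is a minimal sketch of the failing check; only the function name and the assert condition come from the log, while the segment map layout and the rbegin() lookup are assumptions for illustration, not the exact upstream source:

    #include <cassert>
    #include <cstdint>
    #include <map>

    // Stand-in for the real LogSegment; only the sequence number matters here.
    struct LogSegment { uint64_t seq = 0; };

    // Sketch of the accessor named in the assert message: the MDS journal keeps
    // segments keyed by sequence number, and get_current_segment() assumes at
    // least one segment is open. An empty map trips the assert at MDLog.h:99.
    struct MDLogSketch {
      std::map<uint64_t, LogSegment*> segments;   // seq -> segment

      LogSegment* get_current_segment() {
        assert(!segments.empty());                // FAILED ceph_assert(!segments.empty())
        return segments.rbegin()->second;         // newest segment == current segment
      }
    };

    int main() {
      MDLogSketch log;                            // no open segments, as in the crash
      // log.get_current_segment();               // would abort here, mirroring the MDS
      LogSegment seg{1};
      log.segments[seg.seq] = &seg;
      return log.get_current_segment() == &seg ? 0 : 1;
    }

In the trace above the caller is MDCache::dispatch_fragment_dir(), reached from merge_dir() on the SafeTimer thread, so the dirfrag merge apparently ran at a point where the journal had no open segments.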


Related issues (4): 2 open, 2 closed

Related to CephFS - Bug #50821: qa: untar_snap_rm failure during mds thrashing (New, Xiubo Li)

Related to CephFS - Bug #53753: mds: crash (assert hit) when merging dirfrags (Duplicate)

Related to CephFS - Bug #63281: src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty()) (Pending Backport, Leonid Usov)

Has duplicate CephFS - Bug #57204: MDLog.h: 99: FAILED ceph_assert(!segments.empty()) (Duplicate, Kotresh Hiremath Ravishankar)

Actions #1

Updated by Patrick Donnelly almost 3 years ago

  • Related to Bug #50821: qa: untar_snap_rm failure during mds thrashing added
Actions #2

Updated by Patrick Donnelly almost 3 years ago

  • Backport set to pacific,octopus
Actions #3

Updated by Patrick Donnelly almost 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
Actions #4

Updated by Venky Shankar over 2 years ago

Actions #5

Updated by Venky Shankar over 2 years ago

  • Related to Bug #53753: mds: crash (assert hit) when merging dirfrags added
Actions #6

Updated by Venky Shankar almost 2 years ago

Latest occurrence with similar backtrace - https://pulpito.ceph.com/vshankar-2022-06-03_10:03:27-fs-wip-vshankar-testing1-20220603-134300-testing-default-smithi/6862102/

    -1> 2022-06-03T11:49:41.369+0000 7f3ed4ff3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-12791-g44457532/rpm/el8/BUILD/ceph-17.0.0-12791-g44457532/src/mds/MDLog.h: In function 'uint64_t MDLog::get_last_segment_seq() const' thread 7f3ed4ff3700 time 2022-06-03T11:49:41.370159+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-12791-g44457532/rpm/el8/BUILD/ceph-17.0.0-12791-g44457532/src/mds/MDLog.h: 247: FAILED ceph_assert(!segments.empty())

 ceph version 17.0.0-12791-g44457532 (44457532553f59daa44f901930401f299f7ef2a5) quincy (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f3ee68c1b14]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2c1d35) [0x7f3ee68c1d35]
 3: (MDLog::trim_all()+0x94d) [0x55caacd13b7d]
 4: (MDCache::shutdown_pass()+0xa6d) [0x55caacb3a1bd]
 5: (MDSRankDispatcher::tick()+0x2d0) [0x55caac9f0ca0]
 6: (Context::complete(int)+0xd) [0x55caac9cd2fd]
 7: (CommonSafeTimer<ceph::fair_mutex>::timer_thread()+0x18b) [0x7f3ee69d967b]
 8: (CommonSafeTimerThread<ceph::fair_mutex>::entry()+0x11) [0x7f3ee69db681]
 9: /lib64/libpthread.so.0(+0x817a) [0x7f3ee500817a]
 10: clone()

Patrick, did you root cause this?
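The accessor in this second trace is MDLog::get_last_segment_seq(), reached from MDLog::trim_all() during MDCache::shutdown_pass(). A minimal sketch of that relationship, under the same assumed segment-map layout as above (illustration only, not the upstream source):

    #include <cassert>
    #include <cstdint>
    #include <map>

    struct LogSegment { uint64_t seq = 0; };

    // Sketch only: the real trim_all() does far more work. The point is that it
    // asks for the last segment sequence up front, and that accessor carries the
    // same !segments.empty() assertion (MDLog.h:247 in this build).
    struct MDLogSketch {
      std::map<uint64_t, LogSegment*> segments;   // seq -> segment

      uint64_t get_last_segment_seq() const {
        assert(!segments.empty());                // FAILED ceph_assert(!segments.empty())
        return segments.rbegin()->first;
      }

      void trim_all() {
        uint64_t last_seq = get_last_segment_seq();  // aborts here if the log is empty
        (void)last_seq;
        // ... expire/trim every segment up to last_seq ...
      }
    };

    int main() {
      MDLogSketch log;
      // log.trim_all();   // would abort: shutdown_pass() -> trim_all(), as in the trace
      return 0;
    }

Either way the trigger looks the same: a caller assumes the journal has at least one segment while the segment map is empty.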

Actions #7

Updated by Venky Shankar almost 2 years ago

  • Target version changed from v17.0.0 to v18.0.0
  • Backport changed from pacific,octopus to quincy,pacific
  • Severity changed from 3 - minor to 2 - major
Actions #8

Updated by Venky Shankar almost 2 years ago

  • Category set to Correctness/Safety
  • Assignee changed from Patrick Donnelly to Venky Shankar
Actions #9

Updated by Patrick Donnelly over 1 year ago

  • Has duplicate Bug #57204: MDLog.h: 99: FAILED ceph_assert(!segments.empty()) added
Actions #10

Updated by Stephen Cuppett over 1 year ago

Venky Shankar wrote:

Latest occurrence with similar backtrace - https://pulpito.ceph.com/vshankar-2022-06-03_10:03:27-fs-wip-vshankar-testing1-20220603-134300-testing-default-smithi/6862102/

[...]

Patrick, did you root cause this?

If it helps, I had a similar failure on Rook 1.10.3 with Ceph 17.2.3:

[scuppett@nzxt ~]$ kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph status
  cluster:
    id:     5564c7f9-ff56-4641-8d2e-99978fcfd257
    health: HEALTH_WARN
            1 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,c,d (age 4d)
    mgr: a(active, since 4d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up (since 29m), 5 in (since 4d)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 303.69k objects, 44 GiB
    usage:   137 GiB used, 503 GiB / 640 GiB avail
    pgs:     65 active+clean

  io:
    client:   20 KiB/s rd, 18 KiB/s wr, 3 op/s rd, 5 op/s wr

[scuppett@nzxt ~]$ kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph crash ls
ID                                                                ENTITY      NEW  
2022-10-26T19:51:15.385675Z_d6cd885c-2141-4562-b1a2-2c88f26d690d  mds.myfs-b   *   
[scuppett@nzxt ~]$ kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph crash info 2022-10-26T19:51:15.385675Z_d6cd885c-2141-4562-b1a2-2c88f26d690d
{
    "assert_condition": "!segments.empty()",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/arm64/AVAILABLE_ARCH/arm64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.h",
    "assert_func": "LogSegment* MDLog::get_current_segment()",
    "assert_line": 99,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/arm64/AVAILABLE_ARCH/arm64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.h: In function 'LogSegment* MDLog::get_current_segment()' thread ffff9970b000 time 2022-10-26T19:51:15.383031+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/arm64/AVAILABLE_ARCH/arm64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.h: 99: FAILED ceph_assert(!segments.empty())\n",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "__kernel_rt_sigreturn()",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1b4) [0xffffa00202cc]",
        "(ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0xffffa002046c]",
        "(Server::journal_close_session(Session*, int, Context*)+0x6cc) [0xaaaac0c2108c]",
        "(Server::kill_session(Session*, Context*)+0x218) [0xaaaac0c21748]",
        "(Server::apply_blocklist()+0xf8) [0xaaaac0c21a00]",
        "(MDSRank::apply_blocklist(std::set<entity_addr_t, std::less<entity_addr_t>, std::allocator<entity_addr_t> > const&, unsigned int)+0x44) [0xaaaac0be3954]",
        "(MDSRankDispatcher::handle_osd_map()+0xd0) [0xaaaac0be3c70]",
        "(MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x348) [0xaaaac0bcd6b8]",
        "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0xd4) [0xaaaac0bce0d4]",
        "(DispatchQueue::entry()+0x11f8) [0xffffa0242188]",
        "(DispatchQueue::DispatchThread::entry()+0x10) [0xffffa02eba70]",
        "/lib64/libpthread.so.0(+0x78b8) [0xffff9f9068b8]",
        "/lib64/libc.so.6(+0x23a7c) [0xffff9f492a7c]" 
    ],
    "ceph_version": "17.2.3",
    "crash_id": "2022-10-26T19:51:15.385675Z_d6cd885c-2141-4562-b1a2-2c88f26d690d",
    "entity_name": "mds.myfs-b",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "46495e15d9854b9436e341bda663a9d0a5b88afbd4c94dad07c353304c72cdc8",
    "timestamp": "2022-10-26T19:51:15.385675Z",
    "utsname_hostname": "rook-ceph-mds-myfs-b-8578b5f859-cq9zz",
    "utsname_machine": "aarch64",
    "utsname_release": "5.4.209-116.367.amzn2.aarch64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Aug 31 00:10:02 UTC 2022" 
}
Actions #11

Updated by Patrick Donnelly 7 months ago

  • Target version deleted (v18.0.0)
Actions #12

Updated by Leonid Usov 6 months ago

  • Related to Bug #63281: src/mds/MDLog.h: 100: FAILED ceph_assert(!segments.empty()) added
Actions #13

Updated by Leonid Usov 6 months ago

Venky, the instances you added in the comments show the same assert but different call stacks. I think they are all different issues, while the original call stack in this ticket is the same as the one in the related Bug #63281. I'll address the issue there, and you can decide whether you would like to mark this ticket as a duplicate of that one.
