Bug #60986 (open): crash: void MDCache::rejoin_send_rejoins(): assert(auth >= 0)

Added by Telemetry Bot 12 months ago. Updated about 20 hours ago.

Status: New
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source: Telemetry
Tags: -
Backport: -
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(FS): -
Labels (FS): -
Pull request ID: -
Crash signature (v1): 5069956bbe0e3828fa162226b569d9ebfaea1b69544f0f4d5b3cf057977b8640


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=71d482317bedfc17674af4f5780c6eb1b2f0f2f213b1aa9ad21545c7c3ce8231

Assert condition: auth >= 0
Assert function: void MDCache::rejoin_send_rejoins()

Sanitized backtrace:

    MDCache::rejoin_send_rejoins()
    MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)
    MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)
    MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)
    MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)
    DispatchQueue::entry()
    DispatchQueue::DispatchThread::entry()

Crash dump sample:
{
    "assert_condition": "auth >= 0",
    "assert_file": "mds/MDCache.cc",
    "assert_func": "void MDCache::rejoin_send_rejoins()",
    "assert_line": 4084,
    "assert_msg": "mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7fad76ed2700 time 2023-04-25T04:30:36.005858-0400\nmds/MDCache.cc: 4084: FAILED ceph_assert(auth >= 0)",
    "assert_thread_name": "ms_dispatch",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12d80) [0x7fad7f7e5d80]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fad807e947b]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x2695e7) [0x7fad807e95e7]",
        "(MDCache::rejoin_send_rejoins()+0x2013) [0x55fd558e2eb3]",
        "(MDSRankDispatcher::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&, MDSMap const&)+0x1b72) [0x55fd5577d2a2]",
        "(MDSDaemon::handle_mds_map(boost::intrusive_ptr<MMDSMap const> const&)+0xd66) [0x55fd5574e9b6]",
        "(MDSDaemon::handle_core_message(boost::intrusive_ptr<Message const> const&)+0x367) [0x55fd55751f77]",
        "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x177) [0x55fd557526f7]",
        "(Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x478) [0x7fad80a64668]",
        "(DispatchQueue::entry()+0x50f) [0x7fad80a61aaf]",
        "(DispatchQueue::DispatchThread::entry()+0x11) [0x7fad80b28ef1]",
        "/lib64/libpthread.so.0(+0x82de) [0x7fad7f7db2de]",
        "clone()" 
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-04-25T08:30:36.008366Z_31305a85-41e6-4558-9503-9dd700dfb575",
    "entity_name": "mds.ea712f0f7730e25884ca687e6f7d3d73e1e91998",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8 (Core)",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "5069956bbe0e3828fa162226b569d9ebfaea1b69544f0f4d5b3cf057977b8640",
    "timestamp": "2023-04-25T08:30:36.008366Z",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-80.el8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Tue Jun 4 09:19:46 UTC 2019" 
}
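For anyone hitting the same assert on their own cluster, the crash metadata shown above can usually be retrieved locally as well. A minimal sketch, assuming the cluster's crash module is enabled (the crash ID is the one from this telemetry report; substitute your own):

    # list crashes recorded by the crash module
    ceph crash ls
    # print the full metadata (assert message, backtrace, versions) for a single crash
    ceph crash info 2023-04-25T08:30:36.008366Z_31305a85-41e6-4558-9503-9dd700dfb575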


Related issues: 1 (1 open, 0 closed)

Related to CephFS - Bug #54765: crash: void MDCache::rejoin_send_rejoins(): assert(auth >= 0) (New)

Actions #1

Updated by Telemetry Bot 12 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.5 added
Actions #2

Updated by Milind Changire 11 months ago

  • Related to Bug #54765: crash: void MDCache::rejoin_send_rejoins(): assert(auth >= 0) added
Actions #3

Updated by Milind Changire 11 months ago

  • Assignee set to Milind Changire
Actions #4

Updated by Xiubo Li 4 days ago · Edited

A new report from the ceph-users mailing list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GOAZLA6NQHYFM3OHQ6ALWU6VQENCB5OS/

Hi,

A 17.2.7 cluster with two filesystems suddenly has non-working MDSs:

# ceph -s
  cluster:
    id:     f54eea86-265a-11eb-a5d0-457857ba5742
    health: HEALTH_ERR
            22 failed cephadm daemon(s)
            2 filesystems are degraded
            1 mds daemon damaged
            insufficient standby MDS daemons available

  services:
    mon: 5 daemons, quorum ceph00,ceph03,ceph04,ceph01,ceph02 (age 4h)
    mgr: ceph03.odfupq(active, since 4h), standbys: ppc721.vsincn, ceph00.lvbddp, ceph02.zhyxjg, ceph06.eifppc
    mds: 4/5 daemons up
    osd: 145 osds: 145 up (since 20h), 145 in (since 2d)
    rgw: 12 daemons active (4 hosts, 1 zones)

  data:
    volumes: 0/2 healthy, 2 recovering; 1 damaged
    pools:   15 pools, 4897 pgs
    objects: 195.64M objects, 195 TiB
    usage:   617 TiB used, 527 TiB / 1.1 PiB avail
    pgs:     4892 active+clean
             5    active+clean+scrubbing+deep

  io:
    client:   2.5 MiB/s rd, 20 MiB/s wr, 665 op/s rd, 938 op/s wr

# ceph fs status
ABC - 4 clients
===========
RANK   STATE              MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0     failed
 1    resolve  ABC.ceph04.lzlkdu               0      3      1      0
 2    resolve  ABC.ppc721.rzfmyi               0      3      1      0
 3    resolve  ABC.ceph04.jiepaw             249    252     13      0
          POOL             TYPE     USED  AVAIL
cephfs.ABC.meta  metadata  33.0G   104T
cephfs.ABC.data    data     390T   104T
DEF - 154 clients
===========
RANK      STATE                 MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
 0    rejoin(laggy)  DEF.ceph06.etthum            30.9k  30.8k  5084      0
          POOL             TYPE     USED  AVAIL
cephfs.DEF.meta  metadata   190G   104T
cephfs.DEF.data    data     118T   104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)

The first filesystem will not get an MDS for rank 0; we already tried to set max_mds to 1, but to no avail.

The second filesystem's MDS shows "replay" for a while and then
it crashes in the rejoin phase with:

  -92> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map i am now mds.0.501522
   -91> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 handle_mds_map state change up:reconnect --> up:rejoin
   -90> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_start
   -89> 2024-05-06T16:07:15.514+0000 7f1927e9d700  1 mds.0.501522 rejoin_joint_start
   -88> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfece err -22/0
   -87> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
   -86> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
   -85> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -84> 2024-05-06T16:07:15.514+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -83> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
   -82> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -81> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ebd err -22/-22
   -80> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ecd err -22/-22
   -79> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9ea err -22/-22
   -78> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
   -77> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9c3 err -22/-22
   -76> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc978 err -22/-22
   -75> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
   -74> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
   -73> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -72> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -71> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7e err -22/-22
   -70> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
   -69> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
   -68> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671eb5 err -22/0
   -67> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -66> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ebd err -22/-22
   -65> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000671ecd err -22/-22
   -64> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc9ea err -22/-22
   -63> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc978 err -22/-22
   -62> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x3000069373a err -22/-22
   -61> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012dc5d8 err -22/-22
   -60> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a32e8 err -22/-22
   -59> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x30000696952 err -22/-22
   -58> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -57> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005bfed3 err -22/0
   -56> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
   -55> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -54> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7e err -22/-22
   -53> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -52> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012c3a0e err -22/-22
   -51> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -50> 2024-05-06T16:07:15.518+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
   -49> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x20001dc7a7f err -22/-22
   -48> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
   -47> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
   -46> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
   -45> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
   -44> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac2 err -22/-22
   -43> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -42> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -41> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc99d err -22/-22
   -40> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -39> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012c3a0e err -22/-22
   -38> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -37> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x40000d63bff err -22/-22
   -36> 2024-05-06T16:07:15.522+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -35> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc95b err -22/-22
   -34> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012be364 err -22/-22
   -33> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac2 err -22/-22
   -32> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -31> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x10007185ac4 err -22/-22
   -30> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -29> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -28> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -27> 2024-05-06T16:07:15.526+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
   -26> 2024-05-06T16:07:15.530+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -25> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -24> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -23> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -22> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
   -21> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a5fc4 err -22/-22
   -20> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a634d err -22/-22
   -19> 2024-05-06T16:07:15.534+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a63bf err -22/-22
   -18> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc94c err -22/0
   -17> 2024-05-06T16:07:15.542+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bc980 err -22/-22
   -16> 2024-05-06T16:07:15.546+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58cf err -22/-22
   -15> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58db err -22/-22
   -14> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a58f4 err -22/-22
   -13> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a5fc4 err -22/-22
   -12> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a634d err -22/-22
   -11> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x400000a63bf err -22/-22
   -10> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bfd9c err -22/-22
    -9> 2024-05-06T16:07:15.550+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x200012bfb78 err -22/-22
    -8> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
    -7> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
    -6> 2024-05-06T16:07:15.554+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
    -5> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b0274 err -22/0
    -4> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b2e32 err -22/-22
    -3> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x300005b388d err -22/-22
    -2> 2024-05-06T16:07:15.562+0000 7f1921e91700  0 mds.0.cache  failed to open ino 0x40000d5a226 err -22/-22
    -1> 2024-05-06T16:07:15.634+0000 7f1921e91700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7f1921e91700 time 2024-05-06T16:07:15.635683+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f1930ad94a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f1930ad9669]
 3: (MDCache::rejoin_send_rejoins()+0x216b) [0x5614ac8747eb]
 4: (MDCache::process_imported_caps()+0x1993) [0x5614ac872353]
 5: (Context::complete(int)+0xd) [0x5614ac6e182d]
 6: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 7: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
 8: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x156) [0x5614aca765a6]
 9: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 10: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
 11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x138) [0x5614ac867168]
 12: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v15_2_0::list&, int)+0x290) [0x5614ac87ff90]
 13: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
 14: (MDSIOContextBase::complete(int)+0x534) [0x5614aca426e4]
 15: (Finisher::finisher_thread_entry()+0x18d) [0x7f1930b7884d]
 16: /lib64/libpthread.so.0(+0x81ca) [0x7f192fac81ca]
 17: clone()

How do we solve this issue? 
Actions #5

Updated by Xiubo Li 4 days ago

Another one https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NDWFYV5XFDCUW5EBRWXEDQFGVFL5HAIV/:

Hi all,

we have a serious problem with CephFS. A few days ago, the CephFS file systems became inaccessible, with the message MDS_DAMAGE: 1 mds daemon damaged

The cephfs-journal-tool tells us: "Overall journal integrity: OK" 

The usual attempts with redeploy were unfortunately not successful.

After many attempts to achieve something with the orchestrator, we set the MDS to "failed" and triggered the creation of new MDSs with "ceph fs reset".

But this MDS crashes:
ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()'
ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)

(The full trace is attached).

What can we do now? We are grateful for any help!

Actions #6

Updated by Robert Sander 4 days ago

Xiubo Li wrote in #note-4:

A new report from the ceph-user mail list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GOAZLA6NQHYFM3OHQ6ALWU6VQENCB5OS/

We have now collected debug logs. The following actions were taken:

ceph fs set ksz-cephfs2 max_mds 1
ceph fs set vol-bierinf max_mds 1
ceph fs set ksz-cephfs2 standby_count_wanted 1
ceph fs set vol-bierinf standby_count_wanted 1

# for _md in $(ceph orch ps | grep mds | cut -d ' ' -f 1)
> do
> ceph orch daemon stop $_md
> done
...

# ceph fs status
vol-bierinf - 0 clients
===========
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
  0    failed
  1    failed
  2    failed
  3    failed
           POOL             TYPE     USED  AVAIL
cephfs.vol-bierinf.meta  metadata  33.0G   104T
cephfs.vol-bierinf.data    data     390T   104T
ksz-cephfs2 - 0 clients
===========
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
  0    failed
           POOL             TYPE     USED  AVAIL
cephfs.ksz-cephfs2.meta  metadata   190G   104T
cephfs.ksz-cephfs2.data    data     118T   104T

# ceph fs dump
e501532
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: -1

Filesystem 'vol-bierinf' (10)
fs_name vol-bierinf
epoch 501532
flags 12 joinable allow_snaps allow_multimds_snaps
created 2021-05-04T12:21:48.898727+0000
modified 2024-05-07T06:37:23.924755+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 8796093022208
required_client_features {}
last_failure 0
last_failure_osd_epoch 279880
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0,1,2,3
up {}
failed 1,2,3
damaged 0
stopped 4,5,6,7
data_pools [26]
metadata_pool 25
inline_data disabled
balancer
standby_count_wanted 1

Filesystem 'ksz-cephfs2' (11)
fs_name ksz-cephfs2
epoch 501530
flags 12 joinable allow_snaps allow_multimds_snaps
created 2024-05-05T19:00:35.635663+0000
modified 2024-05-07T06:36:09.307179+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 279575
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=8532994}
failed
damaged
stopped 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
data_pools [37]
metadata_pool 36
inline_data disabled
balancer
standby_count_wanted 1
[mds.vol-bierinf.ppc721.meydsp{0:8532994} state up:rejoin seq 4 laggy since
2024-05-06T16:07:32.761409+0000 join_fscid=10 addr
[v2:10.149.12.250:6802/2957952954,v1:10.149.12.250:6803/2957952954] compat
{c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 501532

ceph config set global debug_mds 25
ceph config set global debug_ms 1

ceph orch daemon start mds.vol-bierinf.ceph04.jiepaw
ceph orch daemon start mds.vol-bierinf.ceph04.lzlkdu # first in ksz-cephfs2 (crash), then in vol-bierinf
ceph orch daemon start mds.vol-bierinf.ppc721.dfkfzy
ceph orch daemon start mds.vol-bierinf.ppc721.rzfmyi

ceph fs status
vol-bierinf - 4 clients
===========
RANK   STATE              MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
  0     failed
  1    resolve  vol-bierinf.ceph04.jiepaw               0      3      1      0
  2    resolve  vol-bierinf.ceph04.lzlkdu               0      3      1      0
  3    resolve  vol-bierinf.ppc721.dfkfzy             249    252     13      0
           POOL             TYPE     USED  AVAIL
cephfs.vol-bierinf.meta  metadata  33.0G   104T
cephfs.vol-bierinf.data    data     390T   104T
ksz-cephfs2 - 154 clients
===========
RANK      STATE                 MDS             ACTIVITY   DNS    INOS   DIRS   CAPS
  0    rejoin(laggy)  vol-bierinf.ppc721.rzfmyi             143    102     14      0
           POOL             TYPE     USED  AVAIL
cephfs.ksz-cephfs2.meta  metadata   190G   104T
cephfs.ksz-cephfs2.data    data     118T   104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy
(stable)

ceph config rm global debug_mds
ceph config rm global debug_ms

ceph fs dump
e501610
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: -1
Filesystem 'vol-bierinf' (10)
fs_name vol-bierinf
epoch 501589
flags 12 joinable allow_snaps allow_multimds_snaps
created 2021-05-04T12:21:48.898727+0000
modified 2024-05-07T07:17:51.056080+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 8796093022208
required_client_features {}
last_failure 0
last_failure_osd_epoch 279909
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0,1,2,3
up {1=8559832,2=8559872,3=8559902}
failed
damaged 0
stopped 4,5,6,7
data_pools [26]
metadata_pool 25
inline_data disabled
balancer
standby_count_wanted 1
[mds.vol-bierinf.ceph04.jiepaw{1:8559832} state up:resolve seq 2 join_fscid=10 addr
[v2:10.149.12.14:6800/3253575127,v1:10.149.12.14:6801/3253575127] compat
{c=[1],r=[1],i=[7ff]}]
[mds.vol-bierinf.ceph04.lzlkdu{2:8559872} state up:resolve seq 2 join_fscid=10 addr
[v2:10.149.12.14:6802/4188570865,v1:10.149.12.14:6803/4188570865] compat
{c=[1],r=[1],i=[7ff]}]
[mds.vol-bierinf.ppc721.dfkfzy{3:8559902} state up:resolve seq 2 join_fscid=10 addr
[v2:10.149.12.250:6800/5841986,v1:10.149.12.250:6801/5841986] compat
{c=[1],r=[1],i=[7ff]}]
Filesystem 'ksz-cephfs2' (11)
fs_name ksz-cephfs2
epoch 501610
flags 12 joinable allow_snaps allow_multimds_snaps
created 2024-05-05T19:00:35.635663+0000
modified 2024-05-07T07:21:25.482889+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 279922
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default
file layouts on dirs,4=dir inode in separate object,5=mds uses versioned
encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file
layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=8536329}
failed
damaged
stopped 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
data_pools [37]
metadata_pool 36
inline_data disabled
balancer
standby_count_wanted 1
[mds.vol-bierinf.ppc721.rzfmyi{0:8536329} state up:rejoin seq 4 laggy since
2024-05-07T07:21:25.482828+0000 join_fscid=10 addr
[v2:10.149.12.250:6802/3452773200,v1:10.149.12.250:6803/3452773200] compat
{c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 501610
Actions #7

Updated by Robert Sander 3 days ago

While trying to export the journal, the following error shows up:

# cephfs-journal-tool --rank=ksz-cephfs2:0 journal export /tmp/ksz-cephfs2.rank0.journal.export.bin
journal is 98761711816021~506984147
2024-05-07T16:59:38.160+0200 7f412db87040 -1 Error 22 ((22) Invalid argument) seeking to 0x59d2c0c00d55
Error ((22) Invalid argument)
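For reference, the journal can also be checked without writing an export file; the "Overall journal integrity: OK" line quoted in #note-5 is the output of the inspect subcommand. A minimal sketch using the same rank as above:

    # report journal integrity for the damaged rank (read-only)
    cephfs-journal-tool --rank=ksz-cephfs2:0 journal inspect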
Actions #8

Updated by Xiubo Li 3 days ago

Robert Sander wrote in #note-6:

Xiubo Li wrote in #note-4:

A new report from the ceph-user mail list: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GOAZLA6NQHYFM3OHQ6ALWU6VQENCB5OS/

We now collected debug logs. The following actions have been made:

[...]

Since this tracker is assigned to Milind, I will leave it to him. Thanks!

Actions #9

Updated by Robert Sander 3 days ago

Can we please bump the severity a few levels?

There is loss of production as the MDSs are currently not running. Even after upgrading to 18.2.2, the error happens just after starting.

We will collect debug logs and add them here.

Actions #10

Updated by Robert Sander 3 days ago

These are the actions tried so far:

# cephfs-journal-tool --rank=ksz-cephfs2:0 event recover_dentries summary
Events by type:
   COMMITTED: 45
   EXPORT: 1930
   IMPORTFINISH: 2151
   IMPORTSTART: 2153
   OPEN: 3788
   PEERUPDATE: 118
   SESSION: 198
   SESSIONS: 1
   SUBTREEMAP: 128
   UPDATE: 43121
Errors: 0
# took about 10-15 minutes

### JOURNAL TRUNCATION
# cephfs-journal-tool --rank=ksz-cephfs2:0 journal reset
old journal was 98761711816021~506984147
new journal start will be 98762219323392 (523224 bytes past old end)
writing journal head
writing EResetJournal entry
done

### MDS TABLE WIPES
# cephfs-table-tool --rank=ksz-cephfs2:0 reset session
Error (2024-05-08T13:25:25.066+0200 7f7184801000 -1 main: Bad rank selection: --rank=ksz-cephfs2:0' (2) No such file or directory)

# cephfs-table-tool --rank=ksz-cephfs2:0 reset snap
Error (2024-05-08T13:52:27.464+0200 7f622ab2f000 -1 main: Bad rank selection: --rank=ksz-cephfs2:0' (2) No such file or directory)

# this is different

# cephfs-table-tool --rank=ksz-cephfs2:0 reset inode
Error (2024-05-08T13:53:44.254+0200 7f2c30c85000 -1 main: Bad rank selection: --rank=ksz-cephfs2:0' (2) No such file or directory)

# ceph fs reset ksz-cephfs2 --yes-i-really-mean-it
Error EINVAL: all MDS daemons must be inactive before resetting filesystem: set the cluster_down flag and use `ceph mds fail` to make this so
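As a sketch of what the EINVAL above is asking for (an assumption based only on the error message, not verified on this cluster): the filesystem has to be marked down and its ranks failed before `ceph fs reset` is accepted, e.g.:

    # mark the filesystem down (sets joinable=false) and fail its MDS ranks
    ceph fs fail ksz-cephfs2
    # the reset should now be accepted
    ceph fs reset ksz-cephfs2 --yes-i-really-mean-it
    # afterwards the filesystem may need to be made joinable again
    ceph fs set ksz-cephfs2 joinable true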

Actions #11

Updated by Robert Sander about 20 hours ago

The cluster is now serving the two CephFS filesystems again after running these commands for both of them:

cephfs-journal-tool event recover_dentries summary

cephfs-journal-tool journal reset

ceph fs reset
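Spelled out with the rank syntax used in #note-10 (a sketch, one filesystem shown; run per rank, with the corresponding MDS daemons stopped):

    # write dentries recoverable from the journal back into the metadata store
    cephfs-journal-tool --rank=ksz-cephfs2:0 event recover_dentries summary
    # reset (truncate) the journal for that rank
    cephfs-journal-tool --rank=ksz-cephfs2:0 journal reset
    # reset the filesystem map
    ceph fs reset ksz-cephfs2 --yes-i-really-mean-it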