Bug #64058 (open)

qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh)

Added by Milind Changire 4 months ago. Updated about 2 months ago.

Status: Pending Backport
Priority: High
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags: backport_processed
Backport: squid,reef,quincy
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, qa-suite
Labels (FS): crash, qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2024-01-10T20:43:52.620+0000 7f8e9c4e3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f8e9c4e3700 time 2024-01-10T20:43:52.620029+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: 941: ceph_abort_msg("abort() called")

 ceph version 17.2.7-278-gd6f81946 (d6f81946113429824c706976328252a37cd7285f) quincy (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd7) [0x7f8eaacfcb07]
 2: (MDSRank::abort(std::basic_string_view<char, std::char_traits<char> >)+0x7d) [0x55ce71c89ebd]
 3: (CDentry::check_corruption(bool)+0x740) [0x55ce71f011b0]
 4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*, CInode*, unsigned char)+0x47) [0x55ce71d5d4a7]
 5: (EMetaBlob::add_dir_context(CDir*, int)+0x17b) [0x55ce72067b4b]
 6: (MDCache::create_subtree_map()+0xc41) [0x55ce71dcfbe1]
 7: (MDLog::_journal_segment_subtree_map(MDSContext*)+0x4d) [0x55ce71ff679d]
 8: (MDLog::start_submit_entry(LogEvent*, MDSLogContextBase*)+0xc1) [0x55ce71d543c1]
 9: (MDCache::_fragment_old_purged(dirfrag_t, int, boost::intrusive_ptr<MDRequestImpl> const&)+0x33a) [0x55ce71dc3c7a]
 10: (MDSContext::complete(int)+0x5f) [0x55ce71fdf49f]
 11: (MDSIOContextBase::complete(int)+0x534) [0x55ce71fdfc34]
 12: (Finisher::finisher_thread_entry()+0x18d) [0x7f8eaad9b8bd]
 13: /lib64/libpthread.so.0(+0x81cf) [0x7f8ea9cea1cf]
 14: clone()

archive_path: /home/teuthworker/archive/yuriw-2024-01-10_16:13:48-fs-wip-yuri6-testing-2024-01-05-0744-quincy-distro-default-smithi/7511909


Related issues 3 (1 open, 2 closed)

Copied to CephFS - Backport #64920: squid: qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh) (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #64921: quincy: qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh) (In Progress, Patrick Donnelly)
Copied to CephFS - Backport #64922: reef: qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh) (Resolved, Patrick Donnelly)

Actions #1

Updated by Patrick Donnelly 4 months ago

  • Status changed from New to Triaged
  • Assignee set to Patrick Donnelly
  • Target version set to v19.0.0
  • Backport set to reef,quincy
  • Component(FS) MDS added
  • Labels (FS) crash added

This is a good find. The relevant part of the log:

2024-01-10T20:43:52.618+0000 7f8ea24ef700 10 MDSContext::complete: 12C_MDS_VoidFn
2024-01-10T20:43:52.618+0000 7f8ea24ef700  1 mds.0.280 resolve_done
2024-01-10T20:43:52.618+0000 7f8ea24ef700  3 mds.0.280 request_state up:reconnect
2024-01-10T20:43:52.618+0000 7f8ea24ef700  5 mds.beacon.e set_want_state: up:resolve -> up:reconnect
2024-01-10T20:43:52.618+0000 7f8ea24ef700  5 mds.beacon.e Sending beacon up:reconnect seq 40
2024-01-10T20:43:52.618+0000 7f8ea24ef700  1 -- [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] --> [v2:172.21.15.47:3300/0,v1:172.21.15.47:6789/0] -- mdsbeacon(8022/e up:reconnect seq=40 v281) v8 -- 0x55ce75078b00 con 0x55ce7507ac00
2024-01-10T20:43:52.618+0000 7f8ea24ef700 10 mds.0.snapclient sync
2024-01-10T20:43:52.618+0000 7f8ea24ef700 10 mds.0.snapclient refresh want 1
2024-01-10T20:43:52.618+0000 7f8ea24ef700  1 -- [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] send_to--> mds [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] -- mds_table_request(snaptable query 1 9 bytes) v1 -- ?+0 0x55ce7520e480
2024-01-10T20:43:52.618+0000 7f8ea24ef700  1 -- [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] --> [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] -- mds_table_request(snaptable query 1 9 bytes) v1 -- 0x55ce7520e480 con 0x55ce7507a000
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 MDSContext::complete: 25C_IO_MDC_FragmentPurgeOld
2024-01-10T20:43:52.618+0000 7f8ea24ef700  1 -- [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] <== mds.0 v2:172.21.15.47:6832/3562607475 0 ==== mds_table_request(snaptable query 1 9 bytes) v1 ==== 0+0+0 (unknown 0 0 0) 0x55ce7520e480 con 0x55ce7507a000
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 mds.0.cache fragment_old_purged 0x10000003d88
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 mds.0.log submit_entry also starting new segment: last = 64925/241326736, event seq = 65806
2024-01-10T20:43:52.618+0000 7f8e9c4e3700  7 mds.0.log _prepare_new_segment seq 65807
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 mds.0.cache advance_stray to index 1 fragmenting index -1
2024-01-10T20:43:52.618+0000 7f8e9c4e3700  7 mds.0.log _journal_segment_subtree_map
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 mds.0.cache create_subtree_map 12 subtrees, 11 fullauth
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 mds.0.cache number of subtrees = 12; not printing subtrees
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x1 / [2,head] auth v=826 cv=0/0 dir_auth=0 state=1610612736 f(v0 m2024-01-10T20:03:30.742671+0000 1=0+1) n(v24 rc2024-01-10T20:43:11.194482+0000 b170731713 18232=17060+1172) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 0x55ce7436ed00]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache   subtree bound [dir 0x10000002bea /client.0/tmp/linux-2.6.33/arch/ [2,head] rep@1.0 dir_auth=1 state=134217728 f(v10 m2024-01-10T20:42:28.420395+0000 24=2+22)/f(v10 m2024-01-10T20:42:21.532543+0000 23=2+21) n(v18 rc2024-01-10T20:42:45.387033+0000 b74360890 13407=12530+877)/n(v18 rc2024-01-10T20:42:34.965256+0000 b71425106 12586=11736+850) hs=9+0,ss=0+0 | child=1 subtree=1 dirty=0 0x55ce7947c000]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x100 ~mds0/ [2,head] auth v=21965 cv=0/0 dir_auth=0 state=1610612736 f(v0 10=0+10) n(v68 rc2024-01-10T20:37:59.960150+0000 b5173082 889=524+365) hs=10+0,ss=0+0 dirty=10 | child=1 subtree=1 dirty=1 0x55ce7436f180]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x20000004fb1 /client.0/tmp/linux-2.6.33/arch/m68k/ [2,head] auth v=975 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:40:16.654210+0000 24=4+20) n(v4 rc2024-01-10T20:40:16.657210+0000 b4928581 473=449+24) hs=24+0,ss=0+0 dirty=24 | child=1 subtree=1 dirty=1 0x55ce7550b600]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x20000006479 /client.0/tmp/linux-2.6.33/arch/um/ [2,head] auth v=761 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:42:28.385396+0000 24=15+9) n(v5 rc2024-01-10T20:42:28.419395+0000 b854076 371=344+27) hs=24+0,ss=0+0 dirty=24 | child=1 subtree=1 dirty=1 0x55ce762ba480]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x20000006640 /client.0/tmp/linux-2.6.33/arch/x86/include/ [2,head] auth v=1199 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:42:40.091146+0000 1=0+1) n(v2 rc2024-01-10T20:42:42.666091+0000 b1121623 318=314+4) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 dirty=1 0x55ce7656b180]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x20000006224 /client.0/tmp/linux-2.6.33/arch/sparc/ [2,head] auth v=1177 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:42:21.511543+0000 12=3+9) n(v5 rc2024-01-10T20:42:21.532543+0000 b3403846 596=586+10) hs=12+0,ss=0+0 dirty=12 | child=1 subtree=1 dirty=1 0x55ce76ddba80]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x20000004cbc /client.0/tmp/linux-2.6.33/arch/ia64/ [2,head] auth v=1000 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:40:08.471385+0000 20=5+15) n(v5 rc2024-01-10T20:40:08.485385+0000 b3773572 519=486+33) hs=20+0,ss=0+0 dirty=20 | child=1 subtree=1 dirty=1 0x55ce7734da80]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x200000052d8 /client.0/tmp/linux-2.6.33/arch/mips/ [2,head] auth v=2280 cv=0/0 dir_auth=0 state=1610612737|complete f(v1 m2024-01-10T20:41:27.825690+0000 39=3+36) n(v7 rc2024-01-10T20:41:27.841689+0000 b7399933 1462=1324+138) hs=39+0,ss=0+0 dirty=39 | child=1 subtree=1 dirty=1 0x55ce77624900]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x20000005ed4 /client.0/tmp/linux-2.6.33/arch/sh/ [2,head] auth v=1079 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:42:11.678753+0000 17=4+13) n(v9 rc2024-01-10T20:42:11.681753+0000 b4445763 847=759+88) hs=17+0,ss=0+0 dirty=17 | child=1 subtree=1 dirty=1 0x55ce786aa400]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x2000000577f /client.0/tmp/linux-2.6.33/arch/powerpc/ [2,head] auth v=2209 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:41:55.365101+0000 16=4+12) n(v10 rc2024-01-10T20:41:55.403101+0000 b12428948 1437=1389+48) hs=16+0,ss=0+0 dirty=16 | child=1 subtree=1 dirty=1 0x55ce78963b00]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 15 mds.0.cache  auth subtree [dir 0x10000002d1c /client.0/tmp/linux-2.6.33/arch/arm/ [2,head] auth v=8180 cv=0/0 dir_auth=0 state=1610612737|complete f(v0 m2024-01-10T20:38:51.494038+0000 86=4+82) n(v7 rc2024-01-10T20:38:51.502038+0000 b18201892 3343=3128+215) hs=86+0,ss=0+0 dirty=86 | child=1 subtree=1 dirty=1 0x55ce795c1180]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 20 mds.0.journal EMetaBlob::add_dir_context final:
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 20 mds.0.journal EMetaBlob::add_dir_context final:
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 20 mds.0.journal EMetaBlob::add_dir_context(0x55ce7436ed00) have lump 0x1
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 20 mds.0.journal EMetaBlob::add_dir_context final: 0x55ce74333900,0x55ce7a47ec80,0x55ce77f09b80,0x55ce7a21aa00
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 -1 mds.0.cache.den(0x1 client.0) newly corrupt dentry to be committed: [dentry #0x1/client.0 [2,head] auth (dversion lock) v=825 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x55ce74333900]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 10 mds.0.cache.dir(0x1) go_bad_dentry client.0
2024-01-10T20:43:52.618+0000 7f8e9c4e3700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/client.0 [2,head] auth (dversion lock) v=825 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x55ce74333900]
2024-01-10T20:43:52.618+0000 7f8e9c4e3700  1 -- [v2:172.21.15.47:6832/3562607475,v1:172.21.15.47:6833/3562607475] --> [v2:172.21.15.47:3300/0,v1:172.21.15.47:6789/0] -- log(1 entries from seq 1 at 2024-01-10T20:43:52.619991+0000) v1 -- 0x55ce743468c0 con 0x55ce7507ac00
2024-01-10T20:43:52.620+0000 7f8e9c4e3700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f8e9c4e3700 time 2024-01-10T20:43:52.620029+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: 941: ceph_abort_msg("abort() called")

The issue is that the snapclient is not yet synced but the snapserver is ready, so the current snap version is still 1, which triggers the abort. I will look into a solution soon.

(No legitimate corruption found.)
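
To make the race easier to see, here is a minimal standalone C++ sketch of the failure mode. This is not Ceph source: SnapClientStub, DentryStub and looks_corrupt are hypothetical names, and the check is only a simplification of the kind of validation CDentry::check_corruption performs against the snapclient's cached snap table (an assumption based on the backtrace and the comment above).

// Standalone illustration (NOT Ceph source): models why a healthy dentry can
// look corrupt while the MDS snapclient's cached snap table is still stale.
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the snapclient's cached view of the snap table.
struct SnapClientStub {
    bool synced = false;              // flips to true once the snaptable refresh completes
    std::uint64_t real_last_seq = 7;  // sequence the snapserver actually holds

    // While unsynced, the cached sequence is effectively the initial value (1).
    std::uint64_t last_seq() const { return synced ? real_last_seq : 1; }
};

// Hypothetical stand-in for the dentry fields relevant to the check.
struct DentryStub {
    std::uint64_t first;  // first valid snapid of this dentry version
};

// Simplified sanity check: a dentry whose 'first' is ahead of last_seq()+1
// appears to reference a snapshot that does not exist yet, so it is treated
// as corrupt and the MDS aborts.
bool looks_corrupt(const DentryStub& dn, const SnapClientStub& sc) {
    return dn.first > sc.last_seq() + 1;
}

int main() {
    SnapClientStub snapclient;  // not yet synced (the resolve/reconnect window)
    DentryStub dn{5};           // legitimately created under a later snapshot

    // Prints 1 (flagged corrupt) because the cached snap version is still 1.
    std::cout << "before sync: corrupt? " << looks_corrupt(dn, snapclient) << '\n';

    snapclient.synced = true;
    // Prints 0 once the snapclient has caught up with the snapserver.
    std::cout << "after sync:  corrupt? " << looks_corrupt(dn, snapclient) << '\n';
}

The same dentry passes the check once the cached sequence catches up, which matches the note above that no legitimate corruption was found.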

Actions #2

Updated by Patrick Donnelly 2 months ago

Expanding this a bit:

2024-01-10T20:45:02.520+0000 7f88fbca0700 10 mds.0.cache rollback_uncommitted_fragments: 1 pending
2024-01-10T20:45:02.520+0000 7f88fbca0700 10 mds.0.cache fragment_committed 0x10000003d88
2024-01-10T20:45:02.520+0000 7f88fbca0700 10 mds.0.cache  truncate orphan dirfrag 10000003d88.00000000
2024-01-10T20:45:02.520+0000 7f88fbca0700  1 -- [v2:172.21.15.110:6836/3143488632,v1:172.21.15.110:6839/3143488632] --> [v2:172.21.15.47:6824/41508,v1:172.21.15.47:6825/41508] -- osd_op(unknown.0.286:86 2.13 2:cb9b4589:::10000003d88.00000000:head [truncate 0,omap-clear] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e129) v8 -- 0x55d31e57b000 con 0x55d31e57a000
...
2024-01-10T20:45:02.523+0000 7f88ff4a7700  1 -- [v2:172.21.15.110:6836/3143488632,v1:172.21.15.110:6839/3143488632] <== osd.0 v2:172.21.15.47:6824/41508 9 ==== osd_op_reply(86 10000003d88.00000000 [truncate 0,omap-clear] v129'1254 uv1254 ondisk = 0) v8 ==== 206+0+0 (crc 0 0 0) 0x55d31e532d80 con 0x55d31e57a000
2024-01-10T20:45:02.523+0000 7f88f5c94700 10 MDSIOContextBase::complete: 25C_IO_MDC_FragmentPurgeOld
...
2024-01-10T20:45:02.680+0000 7f88f5c94700 10 MDSContext::complete: 25C_IO_MDC_FragmentPurgeOld
...
2024-01-10T20:45:02.680+0000 7f88f5c94700 -1 mds.0.cache.den(0x1 client.0) newly corrupt dentry to be committed: [dentry #0x1/client.0 [2,head] auth (dversion lock) v=825 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x55d320ed8000]
2024-01-10T20:45:02.680+0000 7f88f5c94700 10 mds.0.cache.dir(0x1) go_bad_dentry client.0
2024-01-10T20:45:02.680+0000 7f88f5c94700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/client.0 [2,head] auth (dversion lock) v=825 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x55d320ed8000]
2024-01-10T20:45:02.680+0000 7f88f5c94700  1 -- [v2:172.21.15.110:6836/3143488632,v1:172.21.15.110:6839/3143488632] --> [v2:172.21.15.110:3300/0,v1:172.21.15.110:6789/0] -- log(1 entries from seq 1 at 2024-01-10T20:45:02.681468+0000) v1 -- 0x55d31d5f08c0 con 0x55d31e325400
2024-01-10T20:45:02.682+0000 7f88f5c94700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7-278-gd6f81946/rpm/el8/BUILD/ceph-17.2.7-278-gd6f81946/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f88f5c94700 time 2024-01-10T20:45:02.681498+0000

From: /teuthology/yuriw-2024-01-10_16:13:48-fs-wip-yuri6-testing-2024-01-05-0744-quincy-distro-default-smithi/7511909/remote/smithi110/log/ceph-mds.d.log.gz

Not sure why I forked this issue to #64290. I will close that.

Actions #3

Updated by Patrick Donnelly 2 months ago

  • Status changed from Triaged to Fix Under Review
  • Backport changed from reef,quincy to squid,reef,quincy
  • Pull request ID set to 55412

Actions #4

Updated by Venky Shankar about 2 months ago

  • Status changed from Fix Under Review to Pending Backport
  • Priority changed from Normal to High

Actions #5

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64920: squid: qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh) added

Actions #6

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64921: quincy: qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh) added

Actions #7

Updated by Backport Bot about 2 months ago

  • Copied to Backport #64922: reef: qa: Command failed (workunit test fs/snaps/untar_snap_rm.sh) added

Actions #8

Updated by Backport Bot about 2 months ago

  • Tags set to backport_processed