Bug #62580


testing: Test failure: test_snapshot_remove (tasks.cephfs.test_strays.TestStrays)

Added by Patrick Donnelly 8 months ago. Updated about 1 month ago.

Status: Pending Backport
Priority: High
Assignee: Xiubo Li
Category: Correctness/Safety
Target version: v19.0.0
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID: 53238
Crash signature (v1):
Crash signature (v2):

Description

2023-08-22T19:23:31.360 INFO:tasks.cephfs.mount:Validating 8388608 bytes from snapdir/.snap/snap1/subdir/file_a
2023-08-22T19:23:31.360 DEBUG:teuthology.orchestra.run.smithi105:> sudo adjust-ulimits daemon-helper kill python3 -c '
2023-08-22T19:23:31.361 DEBUG:teuthology.orchestra.run.smithi105:> import zlib
2023-08-22T19:23:31.361 DEBUG:teuthology.orchestra.run.smithi105:> path = "/home/ubuntu/cephtest/mnt.0/snapdir/.snap/snap1/subdir/file_a" 
2023-08-22T19:23:31.361 DEBUG:teuthology.orchestra.run.smithi105:> with open(path, '"'"'r'"'"') as f:
2023-08-22T19:23:31.361 DEBUG:teuthology.orchestra.run.smithi105:>     bytes = f.read()
2023-08-22T19:23:31.362 DEBUG:teuthology.orchestra.run.smithi105:> if len(bytes) != 8388608:
2023-08-22T19:23:31.362 DEBUG:teuthology.orchestra.run.smithi105:>     raise RuntimeError("Bad length {0} vs. expected {1}".format(
2023-08-22T19:23:31.362 DEBUG:teuthology.orchestra.run.smithi105:>         len(bytes), 8388608
2023-08-22T19:23:31.362 DEBUG:teuthology.orchestra.run.smithi105:>     ))
2023-08-22T19:23:31.363 DEBUG:teuthology.orchestra.run.smithi105:> for i, b in enumerate(bytes):
2023-08-22T19:23:31.363 DEBUG:teuthology.orchestra.run.smithi105:>     val = zlib.crc32(str(i).encode('"'"'utf-8'"'"')) & 7
2023-08-22T19:23:31.363 DEBUG:teuthology.orchestra.run.smithi105:>     if b != chr(val):
2023-08-22T19:23:31.363 DEBUG:teuthology.orchestra.run.smithi105:>         raise RuntimeError("Bad data at offset {0}".format(i))
2023-08-22T19:23:31.364 DEBUG:teuthology.orchestra.run.smithi105:> '
2023-08-22T19:23:31.554 INFO:teuthology.orchestra.run.smithi105.stderr:Traceback (most recent call last):
2023-08-22T19:23:31.554 INFO:teuthology.orchestra.run.smithi105.stderr:  File "<string>", line 13, in <module>
2023-08-22T19:23:31.555 INFO:teuthology.orchestra.run.smithi105.stderr:RuntimeError: Bad data at offset 0
2023-08-22T19:23:31.687 INFO:teuthology.orchestra.run.smithi105.stderr:daemon-helper: command failed with exit status 1

From: /teuthology/yuriw-2023-08-22_14:48:56-fs-pacific-release-distro-default-smithi/7376538/teuthology.log
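
For readability, here is the validation script embedded in the log above with the shell quoting removed (a reconstruction from the log; the path and the 8388608 expected length are taken verbatim, and the builtin-shadowing variable bytes is renamed to data):

 import zlib

 # Path and expected length come from the teuthology log above.
 path = "/home/ubuntu/cephtest/mnt.0/snapdir/.snap/snap1/subdir/file_a"

 with open(path, 'r') as f:
     data = f.read()

 # The snapshot file is expected to be exactly 8 MiB.
 if len(data) != 8388608:
     raise RuntimeError("Bad length {0} vs. expected {1}".format(len(data), 8388608))

 # Every byte at offset i must equal chr(crc32(str(i)) & 7). Failing at
 # offset 0 means the snapshot content itself is wrong, not just short.
 for i, b in enumerate(data):
     val = zlib.crc32(str(i).encode('utf-8')) & 7
     if b != chr(val):
         raise RuntimeError("Bad data at offset {0}".format(i))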

The last time we saw this was in #21483, also a testing-branch bug.


Related issues 3 (2 open, 1 closed)

Related to CephFS - Bug #21483: qa: test_snapshot_remove (kcephfs): RuntimeError: Bad data at offset 0 (Resolved, Zheng Yan, 09/21/2017)

Copied to CephFS - Backport #64075: reef: testing: Test failure: test_snapshot_remove (tasks.cephfs.test_strays.TestStrays) (In Progress, Xiubo Li)
Copied to CephFS - Backport #64076: quincy: testing: Test failure: test_snapshot_remove (tasks.cephfs.test_strays.TestStrays) (In Progress, Xiubo Li)
Actions #1

Updated by Patrick Donnelly 8 months ago

  • Related to Bug #21483: qa: test_snapshot_remove (kcephfs): RuntimeError: Bad data at offset 0 added
Actions #2

Updated by Xiubo Li 8 months ago

  • Assignee set to Xiubo Li

I will work on it.

Actions #3

Updated by Xiubo Li 8 months ago

  • Status changed from New to Duplicate
  • Parent task set to #61781

This appears to be a duplicate of https://tracker.ceph.com/issues/61892, whose fix hasn't been backported to Pacific yet.

Actions #4

Updated by Xiubo Li 8 months ago

  • Parent task changed from #61781 to #61782
Actions #5

Updated by Xiubo Li 8 months ago

  • Parent task changed from #61782 to #61892
Actions #6

Updated by Xiubo Li 8 months ago

  • Status changed from Duplicate to In Progress
  • Parent task deleted (#61892)

Xiubo Li wrote:

This appears to be a duplicate of https://tracker.ceph.com/issues/61892, whose fix hasn't been backported to Pacific yet.

Sorry, it's not exactly the same issue, but a similar one:

The MDS received the truncate request, tried to revoke the Fb caps, and added the request to the waiter list:

 -4488> 2023-08-22T19:23:28.715+0000 7f2739df6700  1 -- [v2:172.21.15.105:6834/2631492389,v1:172.21.15.105:6835/2631492389] <== client.12457 v1:172.21.15.105:0/2930057526 35 ==== client_request(client.12457:16 setattr size=0 #0x10000000002 2023-08-22T19:23:28.715412+0000 caller_uid=1000, caller_gid=1267{6,27,108,1267,10102,}) v6 ==== 240+0+0 (unknown 2218452622 0 0) 0x563133dd4dc0 con 0x563133d23000
 -4487> 2023-08-22T19:23:28.715+0000 7f2739df6700  4 mds.0.server handle_client_request client_request(client.12457:16 setattr size=0 #0x10000000002 2023-08-22T19:23:28.715412+0000 caller_uid=1000, caller_gid=1267{6,27,108,1267,10102,}) v6
...
 -4069> 2023-08-22T19:23:28.727+0000 7f2733dea700 10 mds.0.cache.ino(0x10000000002) auth_pin by 0x563133c1e400 on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=4 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlockdone x=1) (ixattr excl) (iversion lock w=1 last_client=12457) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFscb/pAsxXsxFxwb@7},l=12457 | ptrwaiter=0 request=1 lock=3 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80] now 4
 -4068> 2023-08-22T19:23:28.727+0000 7f2733dea700 15 mds.0.cache.dir(0x10000000001) adjust_nested_auth_pins 1 on [dir 0x10000000001 /snapdir/subdir/ [2,head] auth v=13 cv=0/0 ap=1+9 state=1610874881|complete f(v0 m2023-08-22T19:23:24.259491+0000 1=1+0) n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) hs=1+0,ss=1+0 dirty=2 | child=1 dirty=1 authpin=1 0x563133c5a880] by 0x563133c99b80 count now 1/9
 -4067> 2023-08-22T19:23:28.727+0000 7f2733dea700 10 mds.0.cache.ino(0x10000000002) add_waiter tag 2000000040000000 0x563133da7160 !ambig 1 !frozen 1 !freezing 1
 -4066> 2023-08-22T19:23:28.727+0000 7f2733dea700 15 mds.0.cache.ino(0x10000000002) taking waiter here
 -4065> 2023-08-22T19:23:28.727+0000 7f2733dea700  7 mds.0.locker get_allowed_caps loner client.12457 allowed=pAsxLsXsxFc, xlocker allowed=pAsxLsXsxFc, others allowed=pLsFc on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=4 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap x=1) (ixattr excl) (iversion lock w=1 last_client=12457) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFscb/pAsxXsxFxwb@7},l=12457 | ptrwaiter=0 request=1 lock=3 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4064> 2023-08-22T19:23:28.727+0000 7f2733dea700 20 mds.0.locker  client.12457 pending pAsxLsXsxFscb allowed pAsxLsXsxFc wanted pAsxXsxFxwb
 -4063> 2023-08-22T19:23:28.727+0000 7f2733dea700  7 mds.0.locker    sending MClientCaps to client.12457 seq 8 new pending pAsxLsXsxFc was pAsxLsXsxFscb
 -4062> 2023-08-22T19:23:28.727+0000 7f2733dea700 20 mds.0.cache.ino(0x10000000002) encode_cap_message pfile 1 pauth 0 plink 0 pxattr 0 mtime 2023-08-22T19:23:28.715412+0000 ctime 2023-08-22T19:23:28.715412+0000 change_attr 4
 -4061> 2023-08-22T19:23:28.727+0000 7f2733dea700 10 mds.0.236 send_message_client_counted client.12457 seq 14 client_caps(revoke ino 0x10000000002 1 seq 8 caps=pAsxLsXsxFc dirty=- wanted=pAsxXsxFxwb follows 0 size 0/20971520 ts 2/0 mtime 2023-08-22T19:23:28.715412+0000 ctime 2023-08-22T19:23:28.715412+0000 change_attr 4 tws 1) v11
 -4060> 2023-08-22T19:23:28.727+0000 7f2733dea700  1 -- [v2:172.21.15.105:6834/2631492389,v1:172.21.15.105:6835/2631492389] --> v1:172.21.15.105:0/2930057526 -- client_caps(revoke ino 0x10000000002 1 seq 8 caps=pAsxLsXsxFc dirty=- wanted=pAsxXsxFxwb follows 0 size 0/20971520 ts 2/0 mtime 2023-08-22T19:23:28.715412+0000 ctime 2023-08-22T19:23:28.715412+0000 change_attr 4 tws 1) v11 -- 0x563133df0000 con 0x563133d23000
 -4059> 2023-08-22T19:23:28.727+0000 7f2733dea700 20 mds.0.bal hit_dir 1 pop is 5.04785, frag * size 1 [pop IRD:[C 0.00e+00] IWR:[C 5.05e+00] RDR:[C 0.00e+00] FET:[C 0.00e+00] STR:[C 0.00e+00] *LOAD:10.1]

 -4058> 2023-08-22T19:23:28.727+0000 7f2733dea700  7 mds.0.server reply_client_request 0 ((0) Success) client_request(client.12457:16 setattr size=0 #0x10000000002 2023-08-22T19:23:28.715412+0000 caller_uid=1000, caller_gid=1267{6,27,108,1267,10102,}) v6
...

The relevant code in MDCache::truncate_inode(), line 6433:

 6418 void MDCache::truncate_inode(CInode *in, LogSegment *ls)
 6419 {
 6420   const auto& pi = in->get_projected_inode();
 6421   dout(10) << "truncate_inode " 
 6422            << pi->truncate_from << " -> " << pi->truncate_size
 6423            << " on " << *in  
 6424            << dendl;
 6425  
 6426   ls->truncating_inodes.insert(in);
 6427   in->get(CInode::PIN_TRUNCATING);
 6428   in->auth_pin(this);
 6429  
 6430   if (!in->client_need_snapflush.empty() &&
 6431       (in->get_caps_issued() & CEPH_CAP_FILE_BUFFER)) {
 6432     ceph_assert(in->filelock.is_xlocked());
 6433     in->filelock.set_xlock_snap_sync(new C_MDC_RetryTruncate(this, in, ls));                                                                                                                                                     
 6434     mds->locker->issue_caps(in);
 6435     return;
 6436   }
 6437  
 6438   _truncate_inode(in, ls);
 6439 }

The call to in->filelock.set_xlock_snap_sync() just queues the callback and waits for the ifile lock to become stable:

 56   void set_xlock_snap_sync(MDSContext *c)
 57   {
 58     ceph_assert(get_type() == CEPH_LOCK_IFILE);
 59     ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE);
 60     state = LOCK_XLOCKSNAP;
 61     add_waiter(WAIT_STABLE, c);
 62   }

Then the kclient sent the cap update request while still holding the Fb caps, but the Locker transitioned the ifile lock state from xlocksnap to excl:

 -4029> 2023-08-22T19:23:28.727+0000 7f2739df6700  1 -- [v2:172.21.15.105:6834/2631492389,v1:172.21.15.105:6835/2631492389] <== client.12457 v1:172.21.15.105:0/2930057526 37 ==== client_caps(update ino 0x10000000002 1 seq 8 caps=pAsxLsXsxFcb dirty=- wanted=pAsxXsxFxwb follows 0 size 0/0 mtime 2023-08-22T19:23:28.715412+0000 ctime 2023-08-22T19:23:28.715412+0000 change_attr 4 tws 1 xattrs(v=17143629482702199040 l=0)) v12 ==== 268+0+0 (unknown 2034319550 0 0) 0x563133df0000 con 0x563133d23000
 -4028> 2023-08-22T19:23:28.727+0000 7f2739df6700  7 mds.0.locker handle_client_caps  on 0x10000000002 tid 0 follows 0 op update flags 0x4
 -4027> 2023-08-22T19:23:28.727+0000 7f2739df6700 20 mds.0.236 get_session have 0x563133c8aa00 client.12457 v1:172.21.15.105:0/2930057526 state open
 -4026> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker  head inode [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=3 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFscb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4025> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker  follows 0 retains pAsxLsXsxFcb dirty - on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=3 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4024> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker  revocation in progress, not making any conclusions about null snapflushes
 -4023> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker _do_cap_update dirty - issued pAsxLsXsxFcb wanted pAsxXsxFxwb on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=3 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4022> 2023-08-22T19:23:28.727+0000 7f2739df6700 20 mds.0.locker inode is file
 -4021> 2023-08-22T19:23:28.727+0000 7f2739df6700 20 mds.0.locker client has write caps; m->get_max_size=0; old_max=20971520
 -4020> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker eval 3648 [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=3 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4019> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker eval_gather (ifile xlocksnap) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=3 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4018> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker  next state is lock issued/allows loner cb/cb xlocker /cb other /cb
 -4017> 2023-08-22T19:23:28.727+0000 7f2739df6700  7 mds.0.locker eval_gather finished gather on (ifile xlocksnap) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=3 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80]
 -4016> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.cache.ino(0x10000000002) auth_unpin by 0x563133c99e90 on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile lock) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x563133c99b80] now 2
 -4015> 2023-08-22T19:23:28.727+0000 7f2739df6700 15 mds.0.cache.dir(0x10000000001) adjust_nested_auth_pins -1 on [dir 0x10000000001 /snapdir/subdir/ [2,head] auth v=13 cv=0/0 ap=0+7 state=1610874881|complete f(v0 m2023-08-22T19:23:24.259491+0000 1=1+0) n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) hs=1+0,ss=1+0 dirty=2 | child=1 dirty=1 authpin=0 0x563133c5a880] by 0x563133c99e90 count now 0/7
 -4014> 2023-08-22T19:23:28.727+0000 7f2739df6700  7 mds.0.locker file_eval wanted=xwb loner_wanted=xwb other_wanted=  filelock=(ifile lock) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile lock) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4013> 2023-08-22T19:23:28.727+0000 7f2739df6700  7 mds.0.locker file_eval stable, bump to loner (ifile lock) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile lock) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4012> 2023-08-22T19:23:28.727+0000 7f2739df6700  7 mds.0.locker file_excl (ifile lock) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile lock) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4011> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker simple_eval (iauth excl) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4010> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker simple_eval (ilink sync) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4009> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.locker simple_eval (ixattr excl) on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]

Since excl is a stable state, this woke the waiter and triggered truncation of the objects in RADOS, even though needsnapflush=1 was still set:

 -4008> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 MDSContext::complete: 19C_MDC_RetryTruncate
 -4007> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.cache _truncate_inode 8388608 -> 0 on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4006> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.cache  realm snaprealm(0x10000000000 seq 2 lc 2 cr 2 snaps={2=snap(2 0x10000000000 'snap1' 2023-08-22T19:23:28.639414+0000)} 0x563132ef7200)
 -4005> 2023-08-22T19:23:28.727+0000 7f2739df6700 10 mds.0.cache _truncate_inode  snapc 0x563132ef73a0 on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4004> 2023-08-22T19:23:28.727+0000 7f2739df6700  1 -- [v2:172.21.15.105:6834/2631492389,v1:172.21.15.105:6835/2631492389] --> [v2:172.21.15.105:6808/32922,v1:172.21.15.105:6811/32922] -- osd_op(unknown.0.236:51 42.12 42:4bec3886:::10000000002.00000000:head [trimtrunc 2@0] snapc 2=[2] ondisk+write+known_if_redirected+full_force e290) v8 -- 0x563133d25c00 con 0x563133d25400
 -4003> 2023-08-22T19:23:28.727+0000 7f2739df6700  1 -- [v2:172.21.15.105:6834/2631492389,v1:172.21.15.105:6835/2631492389] --> [v2:172.21.15.190:6808/33098,v1:172.21.15.190:6809/33098] -- osd_op(unknown.0.236:52 42.1d 42:bbf7764d:::10000000002.00000001:head [trimtrunc 2@0] snapc 2=[2] ondisk+write+known_if_redirected+full_force e290) v8 -- 0x563133d24000 con 0x563133d24c00
 -4002> 2023-08-22T19:23:28.727+0000 7f2739df6700  7 mds.0.locker get_allowed_caps loner client.12457 allowed=pAsxLsXsxFsxcrwb, xlocker allowed=pAsxLsXsxFsxcrwb, others allowed=pLs on [inode 0x10000000002 [3,head] /snapdir/subdir/file_a auth v12 ap=2 DIRTYPARENT truncating(8388608 to 0) s=0 n(v0 rc2023-08-22T19:23:28.715412+0000 1=1+0) need_snapflush={2=12457} (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={12457=0-20971520@1} caps={12457=pAsxLsXsxFc/pAsxLsXsxFcb/pAsxXsxFxwb@8},l=12457 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x563133c99b80]
 -4001> 2023-08-22T19:23:28.727+0000 7f2739df6700 20 mds.0.locker  client.12457 pending pAsxLsXsxFc allowed pAsxLsXsxFsxcrwb wanted pAsxXsxFxwb
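
To summarize the race, here is a toy Python model (illustrative only, not MDS code; the names loosely mirror the SimpleLock/CInode excerpts above) showing why the queued C_MDC_RetryTruncate fires as soon as the ifile lock reaches any stable state, with nothing re-checking the pending snapflush:

 # Toy model of the race, not MDS code. A WAIT_STABLE waiter registered by
 # set_xlock_snap_sync() runs as soon as the ifile lock reaches any stable
 # state; nothing re-checks need_snapflush or the still-issued Fb cap.

 STABLE_STATES = {"sync", "lock", "excl"}  # simplified set of stable states

 class ToyFileLock:
     def __init__(self):
         self.state = "xlockdone"
         self.waiters = []                 # callbacks waiting for stability

     def set_xlock_snap_sync(self, callback):
         self.state = "xlocksnap"          # mirrors SimpleLock::set_xlock_snap_sync
         self.waiters.append(callback)

     def set_state(self, new_state):
         self.state = new_state
         if new_state in STABLE_STATES:    # waiter fires on *any* stable state
             waiters, self.waiters = self.waiters, []
             for cb in waiters:
                 cb()

 class ToyInode:
     def __init__(self):
         self.need_snapflush = True        # client.12457 still owes a snapflush
         self.filelock = ToyFileLock()

 inode = ToyInode()

 def retry_truncate():
     # Stands in for C_MDC_RetryTruncate -> MDCache::_truncate_inode(), which
     # sends trimtrunc to the OSDs without re-checking the snapflush state.
     print("truncating objects; need_snapflush =", inode.need_snapflush)

 inode.filelock.set_xlock_snap_sync(retry_truncate)

 # The cap update arrives and the Locker walks xlocksnap -> lock -> excl:
 inode.filelock.set_state("lock")   # prints: truncating objects; need_snapflush = True
 inode.filelock.set_state("excl")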

This leaves the dirty buffer unflushed while the MDS truncates the objects in RADOS. It is a similar issue to https://tracker.ceph.com/issues/61892, but not exactly the same.

We need to find a way to fix this in the MDS.
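
One conceivable direction, sketched against the toy model above (this is only an illustration and is not necessarily what the eventual fix in PR 53238 does), is to re-apply the same snapflush guard from MDCache::truncate_inode() when the retry callback fires:

 # Hypothetical re-check, continuing the toy model above. The real fix lives
 # in the MDS (PR 53238) and may take a different approach entirely.
 def retry_truncate_guarded():
     if inode.need_snapflush:
         # The snapflush still hasn't arrived: defer the truncate again
         # instead of trimming objects the client may still need to flush.
         inode.filelock.set_xlock_snap_sync(retry_truncate_guarded)
         return
     print("snapflush done; safe to truncate objects")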

Actions #7

Updated by Xiubo Li 8 months ago

  • Project changed from Linux kernel client to CephFS
  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 53238
Actions #10

Updated by Milind Changire 6 months ago

2023-10-26T10:02:48.720 INFO:teuthology.orchestra.run.smithi045.stderr:Traceback (most recent call last):
2023-10-26T10:02:48.721 INFO:teuthology.orchestra.run.smithi045.stderr:  File "<string>", line 33, in <module>
2023-10-26T10:02:48.721 INFO:teuthology.orchestra.run.smithi045.stderr:OSError: [Errno 28] No space left on device

reef: https://pulpito.ceph.com/yuriw-2023-10-25_14:39:02-fs-wip-yuri6-testing-2023-10-23-1148-reef-distro-default-smithi/7437188

Actions #14

Updated by Venky Shankar 3 months ago

  • Category set to Correctness/Safety
  • Status changed from Fix Under Review to Pending Backport
  • Target version set to v19.0.0
  • Backport set to quincy,reef
  • Component(FS) MDS added
Actions #15

Updated by Backport Bot 3 months ago

  • Copied to Backport #64075: reef: testing: Test failure: test_snapshot_remove (tasks.cephfs.test_strays.TestStrays) added
Actions #16

Updated by Backport Bot 3 months ago

  • Copied to Backport #64076: quincy: testing: Test failure: test_snapshot_remove (tasks.cephfs.test_strays.TestStrays) added
Actions #17

Updated by Backport Bot 3 months ago

  • Tags set to backport_processed