Bug #64209


snaptest-multiple-capsnaps.sh fails with "got remote process result: 1"

Added by Venky Shankar 3 months ago. Updated 3 months ago.

Status: New
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS, kceph
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/vshankar-2024-01-22_07:03:31-fs-wip-vshankar-testing-20240119.075157-1-testing-default-smithi/7525752

2024-01-22T09:02:52.140 INFO:tasks.workunit.client.0.smithi055.stdout:asdf
2024-01-22T09:02:52.141 INFO:tasks.workunit.client.0.smithi055.stderr:+ stat .snap/2/a
2024-01-22T09:02:52.141 INFO:tasks.workunit.client.0.smithi055.stderr:+ grep 'Size: 5'
2024-01-22T09:02:52.142 DEBUG:teuthology.orchestra.run.smithi105:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_blocked_ops
2024-01-22T09:02:52.143 INFO:tasks.workunit.client.0.smithi055.stdout:  Size: 5                 Blocks: 1          IO Block: 4194304 regular file
2024-01-22T09:02:52.144 INFO:tasks.workunit.client.0.smithi055.stderr:+ stat .snap/2/a
2024-01-22T09:02:52.145 INFO:tasks.workunit.client.0.smithi055.stderr:+ grep -- -rwxrwxrwx
2024-01-22T09:02:52.148 INFO:tasks.workunit.client.0.smithi055.stdout:Access: (0777/-rwxrwxrwx)  Uid: ( 1000/  ubuntu)   Gid: ( 1301/  ubuntu)
2024-01-22T09:02:52.149 INFO:tasks.workunit.client.0.smithi055.stderr:+ grep qwer .snap/3/a
2024-01-22T09:02:52.154 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-01-22T09:02:52.154 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...

A previous instance of this failure was fixed a while back

- https://tracker.ceph.com/issues/59343

... related to flushing the snap buffer from the client when making a snapshot. AFAIR, a similar change went into the testing kernel, so this should be a different bug. Needs RCA.


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #59343: qa: fs/snaps/snaptest-multiple-capsnaps.sh failed (Resolved, Xiubo Li)

Actions #1

Updated by Venky Shankar 3 months ago

  • Related to Bug #59343: qa: fs/snaps/snaptest-multiple-capsnaps.sh failed added
Actions #2

Updated by Venky Shankar 3 months ago

  • Category set to Correctness/Safety
  • Assignee set to Venky Shankar
Actions #3

Updated by Xiubo Li 3 months ago

Venky,

The MDS dropped the flushsnap request for inode 0x1000000a0a4 (the file a) on the floor and never updated the corresponding file size; see "not expecting flushsnap 2b5 from client.4749" below:

2024-01-22T09:02:51.299+0000 7f1adc34d640  1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383754 ==== client_caps(flushsnap ino 0x1000000a0a4 0 seq 0 tid 50889 caps=pAsxLsXsxFsxcrwb dirty=AxFw wanted=- follows 2b4 size 5/0 mtime 2024-01-22T09:02:50.812551+0000 ctime 2024-01-22T09:02:50.859550+0000 change_attr 6 tws 1) v12 ==== 268+0+0 (crc 0 0 0) 0x56519f12d180 con 0x565199b20400
2024-01-22T09:02:51.300+0000 7f1adc34d640  7 mds.0.locker handle_client_caps on 0x1000000a0a4 tid 50889 follows 2b4 op flushsnap flags 0x0
2024-01-22T09:02:51.300+0000 7f1adc34d640 20 mds.0.4 get_session have 0x565199a6c800 client.4749 192.168.0.1:0/2926350419 state open
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.locker  head inode [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10  mds.0.cache.snaprealm(0x1000000a0a3 seq 693 0x56519ab7bb00) get_snaps 2b2,2b3,2b4,2b5 (seq 2b5 cached_seq 2b5)
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.locker   flushsnap follows 2b4 -> snap 2b5
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap follows 2b4 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap found [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.locker  snapped inode [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.300+0000 7f1adc34d640  7 mds.0.locker  not expecting flushsnap 2b5 from client.4749 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.4 send_message_client_counted client.4749 seq 189802 client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50889 caps=- dirty=AxFw wanted=- follows 2b4 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12
2024-01-22T09:02:51.300+0000 7f1adc34d640  1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] --> 192.168.0.1:0/2926350419 -- client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50889 caps=- dirty=AxFw wanted=- follows 2b4 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12 -- 0x56519bfe7c00 con 0x565199b20400

So later, when reading the file a, it just returns a size of 0:

2024-01-22T09:02:51.873+0000 7f1adc34d640  1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383773 ==== client_request(client.4749:260900 lookup #0x1000000a0a3//3/a 2024-01-22T09:02:51.873530+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6 ==== 204+0+0 (crc 0 0 0) 0x5651a9801b00 con 0x565199b20400
2024-01-22T09:02:51.873+0000 7f1adc34d640  4 mds.0.server handle_client_request client_request(client.4749:260900 lookup #0x1000000a0a3//3/a 2024-01-22T09:02:51.873530+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.4 get_session have 0x565199a6c800 client.4749 192.168.0.1:0/2926350419 state open
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.server  oldest_client_tid=260900
2024-01-22T09:02:51.873+0000 7f1adc34d640  7 mds.0.cache request_start request(client.4749:260900 nref=2 cr=0x5651a9801b00)
2024-01-22T09:02:51.873+0000 7f1adc34d640  7 mds.0.server dispatch_client_request client_request(client.4749:260900 lookup #0x1000000a0a3//3/a 2024-01-22T09:02:51.873530+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.server rdlock_path_pin_ref request(client.4749:260900 nref=2 cr=0x5651a9801b00) #0x1000000a0a3//3/a
2024-01-22T09:02:51.873+0000 7f1adc34d640  7 mds.0.cache traverse: opening base ino 0x1000000a0a3 snap head
2024-01-22T09:02:51.873+0000 7f1adc34d640 12 mds.0.cache traverse: path seg depth 0 '' snapid head
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache traverse: snapdir
2024-01-22T09:02:51.873+0000 7f1adc34d640 12 mds.0.cache traverse: path seg depth 1 '3' snapid snapdir
2024-01-22T09:02:51.873+0000 7f1adc34d640 10  mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00) resolve_snapname '3' in [0,head]
2024-01-22T09:02:51.873+0000 7f1adc34d640 15  mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00)  ? snap(2b2 0x1000000a0a3 '1' 2024-01-22T09:02:49.495576+0000)
2024-01-22T09:02:51.873+0000 7f1adc34d640 15  mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00)  ? snap(2b3 0x1000000a0a3 '2' 2024-01-22T09:02:49.544575+0000)
2024-01-22T09:02:51.873+0000 7f1adc34d640 15  mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00)  ? snap(2b4 0x1000000a0a3 '3' 2024-01-22T09:02:50.820550+0000)
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache traverse: snap 3 -> 2b4
2024-01-22T09:02:51.873+0000 7f1adc34d640 12 mds.0.cache traverse: path seg depth 2 'a' snapid 2b4
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.cache.dir(0x1000000a0a3) lookup (a, '2b4')
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.cache.dir(0x1000000a0a3)   hit -> (a,2b5)
2024-01-22T09:02:51.873+0000 7f1adc34d640  7 mds.0.cache found target [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=0 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache path_traverse finish on snapid 2b4
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.server ref is [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=0 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache.ino(0x1000000a0a4) auth_pin by 0x5651a3232000 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=1 0x5651a5493b80] now 1
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.cache.dir(0x1000000a0a3) adjust_nested_auth_pins 1 on [dir 0x1000000a0a3 /client.0/tmp/ [2,head] auth v=27 cv=0/0 ap=0+1 state=1610874881|complete f(v0 m2024-01-22T09:02:49.488576+0000 1=1+0) n(v0 rc2024-01-22T09:02:51.299541+0000 b5 1=1+0) hs=1+1,ss=4+0 dirty=5 | child=1 dirty=1 authpin=0 0x56519a20f600] by 0x5651a5493b80 count now 0/1
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker acquire_locks request(client.4749:260900 nref=2 cr=0x5651a9801b00)
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.locker  must rdlock (iauth sync) [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.locker  must rdlock (ixattr sync) [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker  must authpin [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker  already auth_pinned [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640  7 mds.0.locker rdlock_start  on (iauth sync) on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker  got rdlock on (iauth sync r=1) [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=1 dirty=1 authpin=1 0x5651a5493b80]

It seems that somewhere needsnapflush=1 was not correctly set for the head_in; I am still going through the MDS code and logs.

Actions #4

Updated by Xiubo Li 3 months ago

It seems we are missing this fix in kclient:

https://tracker.ceph.com/issues/40474, mainly this commit https://github.com/ceph/ceph/pull/28685/commits/8628340e28208ad63019a6da61642f48fbde6313.

Just before the flushsnap request arrived there was a caps update request, and since the kclient didn't set the MClientCaps::FLAG_PENDING_CAPSNAP flag, the MDS just cleared needsnapflush=1:

2024-01-22T09:02:51.298+0000 7f1adc34d640  1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383752 ==== client_caps(update ino 0x1000000a0a4 1 seq 11 caps=pAsxLsXsxFc dirty=- wanted=pAsxXsxFxwb follows 0 size 0/0 mtime 2024-01-22T09:02:50.925548+0000 ctime 2024-01-22T09:02:50.925548+0000 change_attr 8 tws 2 xattrs(v=643 l=0)) v12 ==== 268+0+0 (crc 0 0 0) 0x56519f840000 con 0x565199b20400
...
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.locker _do_null_snapflush client.4749 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=4 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) need_snapflush={2b4=4749,2b5=4749} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFc/pAsxXsxFxwb@11},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.locker  doing async NULL snapflush on 2b4 from client.4749
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap follows 2b3 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=4 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) need_snapflush={2b4=4749,2b5=4749} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFc/pAsxXsxFxwb@11},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap found [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.locker _do_snap_update dirty - follows 2b3 snap 2b4 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.298+0000 7f1adc34d640 15 mds.0.cache.ino(0x1000000a0a4) project_inode 0x1000000a0a4

And then the flushsnap request arrived and was dropped on the floor (a simplified model of this sequence follows the log below):

2024-01-22T09:02:51.299+0000 7f1adc34d640  1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383753 ==== client_caps(flushsnap ino 0x1000000a0a4 0 seq 0 tid 50888 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=- follows 2b3 size 5/0 mtime 2024-01-22T09:02:50.812551+0000 ctime 2024-01-22T09:02:50.812551+0000 change_attr 5 tws 1) v12 ==== 268+0+0 (crc 0 0 0) 0x56519bfe7c00 con 0x565199b20400
2024-01-22T09:02:51.299+0000 7f1adc34d640  7 mds.0.locker handle_client_caps on 0x1000000a0a4 tid 50888 follows 2b3 op flushsnap flags 0x0
2024-01-22T09:02:51.299+0000 7f1adc34d640 20 mds.0.4 get_session have 0x565199a6c800 client.4749 192.168.0.1:0/2926350419 state open
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.locker  head inode [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10  mds.0.cache.snaprealm(0x1000000a0a3 seq 693 0x56519ab7bb00) get_snaps 2b2,2b3,2b4,2b5 (seq 2b5 cached_seq 2b5)
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.locker   flushsnap follows 2b3 -> snap 2b4
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap follows 2b3 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap found [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.locker  snapped inode [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.299+0000 7f1adc34d640  7 mds.0.locker  not expecting flushsnap 2b4 from client.4749 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.4 send_message_client_counted client.4749 seq 189801 client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50888 caps=- dirty=Fw wanted=- follows 2b3 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12
2024-01-22T09:02:51.299+0000 7f1adc34d640  1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] --> 192.168.0.1:0/2926350419 -- client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50888 caps=- dirty=Fw wanted=- follows 2b3 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12 -- 0x56519f840000 con 0x565199b20400
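To make the sequence above concrete, here is a minimal userspace model of the need_snapflush bookkeeping as I read it from these logs. This is an illustration only; the names and data structures are simplifications and not the real Locker/MDCache code. The caps update arriving without MClientCaps::FLAG_PENDING_CAPSNAP triggers the null snapflush, which drops the pending entries, so the real flushsnap that follows finds nothing to apply and the snapped inode keeps size 0.

/*
 * Simplified model of the MDS-side need_snapflush bookkeeping described
 * above (illustrative C, not the real Locker code): a caps update without
 * FLAG_PENDING_CAPSNAP triggers a null snapflush that drops the pending
 * snaps, so the real flushsnap that follows is "not expected" and its
 * size is never applied.
 */
#include <stdio.h>

#define MAX_SNAPS 8

struct head_inode {
	unsigned long need_snapflush[MAX_SNAPS]; /* snapids still awaiting a flushsnap */
	int nsnaps;
	unsigned long snap_size;                 /* size recorded for the snapped inode */
};

/* Caps update without FLAG_PENDING_CAPSNAP: drop pending entries, apply nothing. */
static void do_null_snapflush(struct head_inode *in)
{
	in->nsnaps = 0;
}

/* Real flushsnap: apply its payload only if the snap is still expected. */
static void handle_flushsnap(struct head_inode *in, unsigned long snap,
			     unsigned long size)
{
	for (int i = 0; i < in->nsnaps; i++) {
		if (in->need_snapflush[i] == snap) {
			in->snap_size = size;
			return;
		}
	}
	printf("not expecting flushsnap %lx: dropped, size stays %lu\n",
	       snap, in->snap_size);
}

int main(void)
{
	/* need_snapflush={2b4,2b5}, as in the head inode log line above */
	struct head_inode a = { .need_snapflush = { 0x2b4, 0x2b5 }, .nsnaps = 2 };

	do_null_snapflush(&a);          /* caps update arrives without the flag  */
	handle_flushsnap(&a, 0x2b4, 5); /* flushsnap for snap 2b4 is now dropped */
	return 0;
}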
Actions #5

Updated by Xiubo Li 3 months ago

Xiubo Li wrote:

It seems we are missing this fix in kclient:

https://tracker.ceph.com/issues/40474, mainly this commit https://github.com/ceph/ceph/pull/28685/commits/8628340e28208ad63019a6da61642f48fbde6313.

Actually, the kclient already includes this fix, but it seems buggy:

In [1], when capsnap->cap_flush.tid is already set, the client skips setting the CEPH_CLIENT_CAPS_PENDING_CAPSNAP flag.

Meanwhile in [2], while ci->i_ceph_lock is dropped, another thread can run that code and assign ->cap_flush.tid. When the cap update request is then built and sent in [1], it skips setting the CEPH_CLIENT_CAPS_PENDING_CAPSNAP flag.

So on the MDS side, if the capsnap flush request arrives after that cap update request, we hit this issue (a minimal model of the interleaving is sketched after the references below).

[1] https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/caps.c#L1481-L1493
[2] https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/caps.c#L1643-L1726
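To illustrate the suspected interleaving, here is a small, self-contained C sketch. The struct and function names only loosely mirror fs/ceph/caps.c; this is a model of the race under the reading above, not kernel code. Once [2] assigns capsnap->cap_flush.tid and drops ci->i_ceph_lock, any cap update built by [1] in that window sees a non-zero tid and omits CEPH_CLIENT_CAPS_PENDING_CAPSNAP, even though the FLUSHSNAP has not reached the MDS yet.

/*
 * Sequential model of the suspected race between [1] and [2] above
 * (illustrative only, not kernel code).
 */
#include <stdbool.h>
#include <stdio.h>

struct capsnap {
	unsigned long cap_flush_tid; /* 0 until the flush path assigns it */
};

/* Path [1]: the cap update only advertises a pending capsnap if the newest
 * capsnap has no flush tid assigned yet. */
static bool update_flags_pending_capsnap(const struct capsnap *cs)
{
	return cs->cap_flush_tid == 0;
}

/* Path [2], first half: assign the flush tid under i_ceph_lock.  The lock is
 * then dropped and the FLUSHSNAP message is sent some time later. */
static void assign_flush_tid(struct capsnap *cs, unsigned long tid)
{
	cs->cap_flush_tid = tid;
}

int main(void)
{
	struct capsnap cs = { .cap_flush_tid = 0 };

	assign_flush_tid(&cs, 50888);   /* tid assigned, FLUSHSNAP not sent yet   */
	bool pending = update_flags_pending_capsnap(&cs); /* cap update built now */

	printf("cap update carries PENDING_CAPSNAP: %s\n", pending ? "yes" : "no");
	/* Prints "no": the MDS clears need_snapflush on this update, so the
	 * FLUSHSNAP that arrives afterwards is dropped ("not expecting"). */
	return 0;
}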
