Bug #64209
snaptest-multiple-capsnaps.sh fails with "got remote process result: 1"
Description
/a/vshankar-2024-01-22_07:03:31-fs-wip-vshankar-testing-20240119.075157-1-testing-default-smithi/7525752
2024-01-22T09:02:52.140 INFO:tasks.workunit.client.0.smithi055.stdout:asdf
2024-01-22T09:02:52.141 INFO:tasks.workunit.client.0.smithi055.stderr:+ stat .snap/2/a
2024-01-22T09:02:52.141 INFO:tasks.workunit.client.0.smithi055.stderr:+ grep 'Size: 5'
2024-01-22T09:02:52.142 DEBUG:teuthology.orchestra.run.smithi105:> sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 30 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok dump_blocked_ops
2024-01-22T09:02:52.143 INFO:tasks.workunit.client.0.smithi055.stdout: Size: 5 Blocks: 1 IO Block: 4194304 regular file
2024-01-22T09:02:52.144 INFO:tasks.workunit.client.0.smithi055.stderr:+ stat .snap/2/a
2024-01-22T09:02:52.145 INFO:tasks.workunit.client.0.smithi055.stderr:+ grep -- -rwxrwxrwx
2024-01-22T09:02:52.148 INFO:tasks.workunit.client.0.smithi055.stdout:Access: (0777/-rwxrwxrwx) Uid: ( 1000/ ubuntu) Gid: ( 1301/ ubuntu)
2024-01-22T09:02:52.149 INFO:tasks.workunit.client.0.smithi055.stderr:+ grep qwer .snap/3/a
2024-01-22T09:02:52.154 DEBUG:teuthology.orchestra.run:got remote process result: 1
2024-01-22T09:02:52.154 INFO:tasks.workunit:Stopping ['fs/snaps'] on client.0...
A previous instance of this failure was fixed a while back
- https://tracker.ceph.com/issues/59343
... related to flushing the snap buffer from the client when making a snapshot. AFAIR, a similar change went into the testing kernel, so this should be a different bug. Needs RCA.
Updated by Venky Shankar 3 months ago
- Related to Bug #59343: qa: fs/snaps/snaptest-multiple-capsnaps.sh failed added
Updated by Venky Shankar 3 months ago
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
Updated by Xiubo Li 3 months ago
Venky,
For the flushsnap request for inode 0x1000000a0a4, which is the file a, the MDS just dropped it on the floor and never updated the corresponding file size. Please see "not expecting flushsnap 2b5 from client.4749" below:
2024-01-22T09:02:51.299+0000 7f1adc34d640 1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383754 ==== client_caps(flushsnap ino 0x1000000a0a4 0 seq 0 tid 50889 caps=pAsxLsXsxFsxcrwb dirty=AxFw wanted=- follows 2b4 size 5/0 mtime 2024-01-22T09:02:50.812551+0000 ctime 2024-01-22T09:02:50.859550+0000 change_attr 6 tws 1) v12 ==== 268+0+0 (crc 0 0 0) 0x56519f12d180 con 0x565199b20400
2024-01-22T09:02:51.300+0000 7f1adc34d640 7 mds.0.locker handle_client_caps on 0x1000000a0a4 tid 50889 follows 2b4 op flushsnap flags 0x0
2024-01-22T09:02:51.300+0000 7f1adc34d640 20 mds.0.4 get_session have 0x565199a6c800 client.4749 192.168.0.1:0/2926350419 state open
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.locker head inode [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.cache.snaprealm(0x1000000a0a3 seq 693 0x56519ab7bb00) get_snaps 2b2,2b3,2b4,2b5 (seq 2b5 cached_seq 2b5)
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.locker flushsnap follows 2b4 -> snap 2b5
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap follows 2b4 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap found [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.locker snapped inode [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.300+0000 7f1adc34d640 7 mds.0.locker not expecting flushsnap 2b5 from client.4749 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.300+0000 7f1adc34d640 10 mds.0.4 send_message_client_counted client.4749 seq 189802 client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50889 caps=- dirty=AxFw wanted=- follows 2b4 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12
2024-01-22T09:02:51.300+0000 7f1adc34d640 1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] --> 192.168.0.1:0/2926350419 -- client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50889 caps=- dirty=AxFw wanted=- follows 2b4 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12 -- 0x56519bfe7c00 con 0x565199b20400
So later, when reading the file a, it just returns a size of 0:
2024-01-22T09:02:51.873+0000 7f1adc34d640 1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383773 ==== client_request(client.4749:260900 lookup #0x1000000a0a3//3/a 2024-01-22T09:02:51.873530+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6 ==== 204+0+0 (crc 0 0 0) 0x5651a9801b00 con 0x565199b20400
2024-01-22T09:02:51.873+0000 7f1adc34d640 4 mds.0.server handle_client_request client_request(client.4749:260900 lookup #0x1000000a0a3//3/a 2024-01-22T09:02:51.873530+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.4 get_session have 0x565199a6c800 client.4749 192.168.0.1:0/2926350419 state open
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.server oldest_client_tid=260900
2024-01-22T09:02:51.873+0000 7f1adc34d640 7 mds.0.cache request_start request(client.4749:260900 nref=2 cr=0x5651a9801b00)
2024-01-22T09:02:51.873+0000 7f1adc34d640 7 mds.0.server dispatch_client_request client_request(client.4749:260900 lookup #0x1000000a0a3//3/a 2024-01-22T09:02:51.873530+0000 caller_uid=1000, caller_gid=1301{6,36,1000,1301,}) v6
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.server rdlock_path_pin_ref request(client.4749:260900 nref=2 cr=0x5651a9801b00) #0x1000000a0a3//3/a
2024-01-22T09:02:51.873+0000 7f1adc34d640 7 mds.0.cache traverse: opening base ino 0x1000000a0a3 snap head
2024-01-22T09:02:51.873+0000 7f1adc34d640 12 mds.0.cache traverse: path seg depth 0 '' snapid head
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache traverse: snapdir
2024-01-22T09:02:51.873+0000 7f1adc34d640 12 mds.0.cache traverse: path seg depth 1 '3' snapid snapdir
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00) resolve_snapname '3' in [0,head]
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00) ? snap(2b2 0x1000000a0a3 '1' 2024-01-22T09:02:49.495576+0000)
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00) ? snap(2b3 0x1000000a0a3 '2' 2024-01-22T09:02:49.544575+0000)
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.cache.snaprealm(0x1000000a0a3 seq 694 0x56519ab7bb00) ? snap(2b4 0x1000000a0a3 '3' 2024-01-22T09:02:50.820550+0000)
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache traverse: snap 3 -> 2b4
2024-01-22T09:02:51.873+0000 7f1adc34d640 12 mds.0.cache traverse: path seg depth 2 'a' snapid 2b4
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.cache.dir(0x1000000a0a3) lookup (a, '2b4')
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.cache.dir(0x1000000a0a3) hit -> (a,2b5)
2024-01-22T09:02:51.873+0000 7f1adc34d640 7 mds.0.cache found target [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=0 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache path_traverse finish on snapid 2b4
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.server ref is [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=0 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.cache.ino(0x1000000a0a4) auth_pin by 0x5651a3232000 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 dirty=1 authpin=1 0x5651a5493b80] now 1
2024-01-22T09:02:51.873+0000 7f1adc34d640 15 mds.0.cache.dir(0x1000000a0a3) adjust_nested_auth_pins 1 on [dir 0x1000000a0a3 /client.0/tmp/ [2,head] auth v=27 cv=0/0 ap=0+1 state=1610874881|complete f(v0 m2024-01-22T09:02:49.488576+0000 1=1+0) n(v0 rc2024-01-22T09:02:51.299541+0000 b5 1=1+0) hs=1+1,ss=4+0 dirty=5 | child=1 dirty=1 authpin=0 0x56519a20f600] by 0x5651a5493b80 count now 0/1
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker acquire_locks request(client.4749:260900 nref=2 cr=0x5651a9801b00)
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.locker must rdlock (iauth sync) [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 20 mds.0.locker must rdlock (ixattr sync) [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker must authpin [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker already auth_pinned [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 7 mds.0.locker rdlock_start on (iauth sync) on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iversion lock) | ptrwaiter=0 request=1 lock=0 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.873+0000 7f1adc34d640 10 mds.0.locker got rdlock on (iauth sync r=1) [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v20 ap=1 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth sync r=1) (iversion lock) | ptrwaiter=0 request=1 lock=1 dirty=1 authpin=1 0x5651a5493b80]
It seems somewhere the needsnapflush=1 was not set correctly for the head_in; I am still going through the MDS code and logs.
Updated by Xiubo Li 3 months ago
It seems we are missing this fix in kclient:
https://tracker.ceph.com/issues/40474, mainly this commit https://github.com/ceph/ceph/pull/28685/commits/8628340e28208ad63019a6da61642f48fbde6313.
Just before the flushsnap request arrived, there was a caps update request, and since the kclient didn't set the corresponding MClientCaps::FLAG_PENDING_CAPSNAP flag, the MDS just cleared needsnapflush=1:
2024-01-22T09:02:51.298+0000 7f1adc34d640 1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383752 ==== client_caps(update ino 0x1000000a0a4 1 seq 11 caps=pAsxLsXsxFc dirty=- wanted=pAsxXsxFxwb follows 0 size 0/0 mtime 2024-01-22T09:02:50.925548+0000 ctime 2024-01-22T09:02:50.925548+0000 change_attr 8 tws 2 xattrs(v=643 l=0)) v12 ==== 268+0+0 (crc 0 0 0) 0x56519f840000 con 0x565199b20400
...
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.locker _do_null_snapflush client.4749 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=4 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) need_snapflush={2b4=4749,2b5=4749} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFc/pAsxXsxFxwb@11},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.locker doing async NULL snapflush on 2b4 from client.4749
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap follows 2b3 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=4 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) need_snapflush={2b4=4749,2b5=4749} (iauth excl) (ifile xlocksnap) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFc/pAsxXsxFxwb@11},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=1 dirtyparent=1 dirty=1 waiter=1 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap found [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.298+0000 7f1adc34d640 10 mds.0.locker _do_snap_update dirty - follows 2b3 snap 2b4 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.298+0000 7f1adc34d640 15 mds.0.cache.ino(0x1000000a0a4) project_inode 0x1000000a0a4
And then the flushsnap request came and was dropped on the floor:
2024-01-22T09:02:51.299+0000 7f1adc34d640 1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] <== client.4749 192.168.0.1:0/2926350419 383753 ==== client_caps(flushsnap ino 0x1000000a0a4 0 seq 0 tid 50888 caps=pAsxLsXsxFsxcrwb dirty=Fw wanted=- follows 2b3 size 5/0 mtime 2024-01-22T09:02:50.812551+0000 ctime 2024-01-22T09:02:50.812551+0000 change_attr 5 tws 1) v12 ==== 268+0+0 (crc 0 0 0) 0x56519bfe7c00 con 0x565199b20400
2024-01-22T09:02:51.299+0000 7f1adc34d640 7 mds.0.locker handle_client_caps on 0x1000000a0a4 tid 50888 follows 2b3 op flushsnap flags 0x0
2024-01-22T09:02:51.299+0000 7f1adc34d640 20 mds.0.4 get_session have 0x565199a6c800 client.4749 192.168.0.1:0/2926350419 state open
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.locker head inode [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.cache.snaprealm(0x1000000a0a3 seq 693 0x56519ab7bb00) get_snaps 2b2,2b3,2b4,2b5 (seq 2b5 cached_seq 2b5)
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.locker flushsnap follows 2b3 -> snap 2b4
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap follows 2b3 on [inode 0x1000000a0a4 [2b6,head] /client.0/tmp/a auth v16 ap=1 DIRTYPARENT truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:50.925548+0000 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={4749=0-4194304@2b1} caps={4749=pAsxLsXsxFsxcrwb/pAsxXsxFxwb@12},l=4749 | ptrwaiter=0 request=0 lock=0 caps=1 truncating=1 needsnapflush=0 dirtyparent=1 dirty=1 waiter=0 authpin=1 0x5651a6e64b00]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.cache pick_inode_snap found [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.locker snapped inode [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.299+0000 7f1adc34d640 7 mds.0.locker not expecting flushsnap 2b4 from client.4749 on [inode 0x1000000a0a4 [2b4,2b5] /client.0/tmp/a auth v6 pv20 ap=6 truncating(5 to 0) s=0 n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0)->n(v0 rc2024-01-22T09:02:49.590574+0000 1=1+0) (iauth snap->sync w=1) (ilink snap->sync w=1) (ifile snap->sync w=1) (ixattr snap->sync w=1) (iversion lock) cr={4749=0-4194304@2b1} | ptrwaiter=2 request=0 lock=4 dirty=1 authpin=1 0x5651a5493b80]
2024-01-22T09:02:51.299+0000 7f1adc34d640 10 mds.0.4 send_message_client_counted client.4749 seq 189801 client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50888 caps=- dirty=Fw wanted=- follows 2b3 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12
2024-01-22T09:02:51.299+0000 7f1adc34d640 1 -- [v2:172.21.15.105:6836/571244548,v1:172.21.15.105:6839/571244548] --> 192.168.0.1:0/2926350419 -- client_caps(flushsnap_ack ino 0x1000000a0a4 0 seq 0 tid 50888 caps=- dirty=Fw wanted=- follows 2b3 size 0/0 mtime 0.000000 ctime 0.000000 change_attr 0) v12 -- 0x56519f840000 con 0x565199b20400
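The ordering dependency in the logs above can be sketched with a small, self-contained model (the names and the flag value here are illustrative, not the real MDS code): once a cap update arrives without the pending-capsnap flag, the MDS does a null snapflush and clears its "expecting a flushsnap" state, so a flushsnap that arrives afterwards is acked but never applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of the MDS behaviour seen above. */

#define CEPH_CLIENT_CAPS_PENDING_CAPSNAP 0x4  /* value is illustrative */

struct snap_inode {
    bool need_snapflush;  /* MDS still expects a flushsnap for this snap */
    uint64_t size;        /* size recorded for the snapped inode */
};

/* A cap update without the pending-capsnap flag makes the MDS do a
 * "null snapflush" and clear need_snapflush (_do_null_snapflush). */
static void handle_cap_update(struct snap_inode *in, unsigned flags)
{
    if (!(flags & CEPH_CLIENT_CAPS_PENDING_CAPSNAP))
        in->need_snapflush = false;
}

/* A flushsnap is only applied while a snapflush is still expected;
 * otherwise it is acked but dropped ("not expecting flushsnap"). */
static bool handle_flushsnap(struct snap_inode *in, uint64_t size)
{
    if (!in->need_snapflush)
        return false;               /* dropped on the floor */
    in->size = size;
    in->need_snapflush = false;
    return true;
}
```

In this model, if the update carries the flag (or the flushsnap arrives first), the 5-byte size lands on the snapped inode; with the ordering seen in the logs, the update clears need_snapflush first and the snapshot keeps size 0, which is what the failing grep qwer .snap/3/a check then observes.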
Updated by Xiubo Li 3 months ago
Xiubo Li wrote:
It seems we are missing this fix in kclient:
https://tracker.ceph.com/issues/40474, mainly this commit https://github.com/ceph/ceph/pull/28685/commits/8628340e28208ad63019a6da61642f48fbde6313.
Actually the kclient has already included this fix, but it seems buggy:
In [1], when capsnap->cap_flush.tid is already set, the client just skips setting the CEPH_CLIENT_CAPS_PENDING_CAPSNAP flag.
While in [2], when ci->i_ceph_lock is unlocked, another thread could run that code, and in the meantime ->cap_flush.tid gets set. Then, when sending the cap update request in [1], the client will just skip setting the CEPH_CLIENT_CAPS_PENDING_CAPSNAP flag.
And then on the MDS side, whenever the capsnap flush request arrives after that cap update request, we will see this issue.
[1] https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/caps.c#L1481-L1493
[2] https://github.com/ceph/ceph-client/blob/for-linus/fs/ceph/caps.c#L1643-L1726
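A minimal sketch of the client-side race described above, under the assumption that the tid check is the culprit (the structures and flag value are illustrative; the real code is in fs/ceph/caps.c at [1] and [2]): a nonzero cap_flush.tid alone cannot distinguish "flushsnap already handled by the MDS" from "flushsnap merely queued by another thread while i_ceph_lock was dropped", so the flag is skipped too early.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified model of the check in [1]; this is not the
 * real fs/ceph/caps.c implementation. */

#define CEPH_CLIENT_CAPS_PENDING_CAPSNAP 0x4  /* value is illustrative */

struct capsnap {
    uint64_t cap_flush_tid;  /* 0 until a flushsnap has been queued */
};

/* Buggy check: as soon as any thread has assigned a tid, the flag is
 * skipped, even though the MDS may not have seen the flushsnap yet. */
static unsigned cap_update_flags(const struct capsnap *pending)
{
    if (pending && pending->cap_flush_tid == 0)
        return CEPH_CLIENT_CAPS_PENDING_CAPSNAP;
    return 0;
}
```

So in the interleaving from [2], the second thread assigns the tid while the lock is dropped, the cap update built afterwards goes out without the flag, and the MDS clears needsnapflush before the still-in-flight flushsnap arrives.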