Bug #55377: kclient: mds revoke Fwb caps stuck after the kclient tries writebcak once - Linux kernel client - Ceph

Actions

Copy link

Bug #55377

closed

kclient: mds revoke Fwb caps stuck after the kclient tries writebcak once

Added by Xiubo Li about 2 years ago. Updated almost 2 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Xiubo Li

Category:

Target version:

Ceph - v18.0.0

% Done:

Source:

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Crash signature (v1):

Crash signature (v2):

Description

Seen here: https://pulpito.ceph.com/vshankar-2022-04-07_05:07:33-fs-master-testing-default-smithi/6780578/

Its an upgrade test. cephadm disables standby-replay and reduced max_mds to 1 before upgrade (waits for a single active mds by checking mdsmap). The test above has 2 active MDSs, with each MDS configured with a standby-replay MDS daemon. Disabling standby-replay goes fine - MDSs transition to standby. However, after setting `max_mds = 1', one of the active MDS is stuck in `up:stopping' and never transitions to `down:stopped'. The test fails after hitting "max job timeout"

Problematic MDS log: ./remote/smithi176/log/aa4d8d7c-b63b-11ec-8c36-001a4aab830c/ceph-mds.cephfs.smithi176.ttvthc.log.gz

2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc my gid is 24327
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc map says I am mds.1.8 state up:stopping
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc msgr says I am [v2:172.21.15.176:6824/2074966087,v1:172.21.15.176:6825/2074966087]
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 mds.cephfs.smithi176.ttvthc handle_mds_map: handling map as rank 1
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 notify_mdsmap: mds.metrics
2022-04-07T06:41:43.092+0000 7fe6f18dc700 10 notify_mdsmap: mds.metrics: rank0 is mds.cephfs.smithi116.vbikdi
2022-04-07T06:41:43.092+0000 7fe6ed0d3700  7 mds.1.8 mds has 1 queued contexts
2022-04-07T06:41:43.092+0000 7fe6ed0d3700 10 mds.1.8  finish 0x55c7560da940

The MDS seems to be waiting for some event to reach completion (maybe exporting a dir?).

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Xiubo Li about 2 years ago

Copied from Bug #55240: mds: stuck 2 seconds and keeps retrying to find ino from auth MDS added

Actions

Copy link

Updated by Xiubo Li about 2 years ago

More detail please see https://tracker.ceph.com/issues/55240#note-4:

Another issue in this failure:

In mds.1 after it find the inode for #0x1/client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress *, it will retry the *getarrt client request:

2022-04-07T06:34:55.052+0000 7fe6f18dc700  7 mds.1.server dispatch_client_request client_request(client.14520:12001 getattr Fsr #0x1000000615a 2022-04-07T06:34:53.430871+0000 caller_uid=1000, caller_gid=1258{6,36,1000,1258,}) v4
2022-04-07T06:34:55.052+0000 7fe6f18dc700 10 mds.1.server rdlock_path_pin_ref request(client.14520:12001 nref=3 cr=0x55c755025b80) #0x1000000615a
2022-04-07T06:34:55.052+0000 7fe6f18dc700 10 mds.1.locker acquire_locks request(client.14520:12001 nref=3 cr=0x55c755025b80)

But it immediately stuck waiting for rdlock, because it's lock state in replica mds.1 is LOCK:

2022-04-07T06:34:55.052+0000 7fe6f18dc700 10 mds.1.locker requesting rdlock from auth on (ifile lock) on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress rep@0.1 v18 s=0 n(v0 1=1+0) (iauth lock) (ifile lock) (ixattr lock) (iversion lock) cr={14520=0-4194304@1} | request=1 lock=1 0x55c75521cc00]

The source code for the LOCK:

                       // stable     loner  rep state  r     rp   rd   wr   fwr  l    x    caps(any,loner,xlocker,replica)
     [LOCK_LOCK]      = { 0,         false, LOCK_LOCK, AUTH, 0,   REQ, AUTH,0,   0,   0,   CEPH_CAP_GCACHE|CEPH_CAP_GBUFFER,0,0,0 },

The rd is REQ and it needs to do the lock state change from auth mds.0:

2022-04-07T06:34:55.052+0000 7fe6f18dc700  7 mds.1.locker rdlock_start  on (ifile lock) on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress rep@0.1 v18 s=0 n(v0 1=1+0) (iauth lock) (ifile lock) (ixattr lock) (iversion lock) cr={14520=0-4194304@1} | request=1 lock=1 0x55c75521cc00]
2022-04-07T06:34:55.052+0000 7fe6f18dc700 10 mds.1.locker requesting rdlock from auth on (ifile lock) on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress rep@0.1 v18 s=0 n(v0 1=1+0) (iauth lock) (ifile lock) (ixattr lock) (iversion lock) cr={14520=0-4194304@1} | request=1 lock=1 0x55c75521cc00]
2022-04-07T06:34:55.052+0000 7fe6f18dc700  1 -- [v2:172.21.15.176:6824/2074966087,v1:172.21.15.176:6825/2074966087] send_to--> mds [v2:172.21.15.116:6826/4138533272,v1:172.21.15.116:6827/4138533272] -- lock(a=reqrdlock ifile 0x1000000615a.head) v1 -- ?+0 0x55c751f85340
2022-04-07T06:34:55.052+0000 7fe6f18dc700  1 -- [v2:172.21.15.176:6824/2074966087,v1:172.21.15.176:6825/2074966087] --> [v2:172.21.15.116:6826/4138533272,v1:172.21.15.116:6827/4138533272] -- lock(a=reqrdlock ifile 0x1000000615a.head) v1 -- 0x55c751f85340 con 0x55c750115000
2022-04-07T06:34:55.052+0000 7fe6f18dc700  7 mds.1.locker rdlock_start waiting on (ifile lock) on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress rep@0.1 v18 s=0 n(v0 1=1+0) (iauth lock) (ifile lock) (ixattr lock) (iversion lock) cr={14520=0-4194304@1} | request=1 lock=1 0x55c75521cc00]
2022-04-07T06:34:55.052+0000 7fe6f18dc700 10 mds.1.cache.ino(0x1000000615a) add_waiter tag 2000000040000000 0x55c7540bd680 !ambig 1 !frozen 1 !freezing 1
2022-04-07T06:34:55.052+0000 7fe6f18dc700 15 mds.1.cache.ino(0x1000000615a) taking waiter here

And in mds.0, since the mds.0 is the auth MDS and the client.14520 is the loner client, so the client was issued the pAsxLsXsxFsxcrwb caps:

2022-04-07T06:34:55.054+0000 7f73ae424700  1 -- [v2:172.21.15.116:6826/4138533272,v1:172.21.15.116:6827/4138533272] <== mds.1 v2:172.21.15.176:6824/2074966087 17387 ==== lock(a=reqrdlock ifile 0x1000000615a.head) v1 ==== 69+0+0 (unknown 0 0 0) 0x55b263062fc0 con 0x55b2619b7000
2022-04-07T06:34:55.054+0000 7f73ae424700  7 mds.0.locker handle_file_lock a=reqrdlock on (ifile excl) from mds.1 [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress auth{1=1} v18 ap=1 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={14520=0-4194304@1} caps={14520=pAsxLsXsxFsxcrwb/pAsxXsxFsxcrwb@2},l=14520 | request=0 caps=1 dirtyparent=1 replicated=1 dirty=1 authpin=1 0x55b2658e6680]

And when trying to do the filelock state change (ifile excl->sync), it will need to do revoke Fxwb caps from client.14520:

2022-04-07T06:34:55.054+0000 7f73ae424700  7 mds.0.locker simple_sync on (ifile excl) on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress auth{1=1} v18 ap=1 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl) (ifile excl) (ixattr excl) (iversion lock) cr={14520=0-4194304@1} caps={14520=pAsxLsXsxFsxcrwb/pAsxXsxFsxcrwb@2},l=14520 | request=0 caps=1 dirtyparent=1 replicated=1 dirty=1 authpin=1 0x55b2658e6680]
2022-04-07T06:34:55.054+0000 7f73ae424700  7 mds.0.locker issue_caps loner client.14520 allowed=pAsxLsXsxFscr, xlocker allowed=pAsxLsXsxFscr, others allowed=pLs on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress auth{1=1} v18 ap=1 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl) (ifile excl->sync) (ixattr excl) (iversion lock) cr={14520=0-4194304@1} caps={14520=pAsxLsXsxFsxcrwb/pAsxXsxFsxcrwb@2},l=14520 | request=0 caps=1 dirtyparent=1 replicated=1 dirty=1 authpin=1 0x55b2658e6680]
2022-04-07T06:34:55.054+0000 7f73ae424700 10 mds.0.5 send_message_client_counted client.14520 seq 25328 client_caps(revoke ino 0x1000000615a 1 seq 3 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFsxcrwb follows 0 size 0/4194304 ts 1/18446744073709551615 mtime 2022-04-07T06:34:53.392872+0000) v11
2022-04-07T06:34:55.054+0000 7f73ae424700  1 -- [v2:172.21.15.116:6826/4138533272,v1:172.21.15.116:6827/4138533272] --> v1:192.168.0.1:0/3995310513 -- client_caps(revoke ino 0x1000000615a 1 seq 3 caps=pAsxLsXsxFscr dirty=- wanted=pAsxXsxFsxcrwb follows 0 size 0/4194304 ts 1/18446744073709551615 mtime 2022-04-07T06:34:53.392872+0000) v11 -- 0x55b26607c900 con 0x55b2619b9400
2022-04-07T06:34:55.054+0000 7f73ae424700 10 mds.0.cache.ino(0x1000000615a) auth_pin by 0x55b2658e6990 on [inode 0x1000000615a [2,head] /client.0/tmp/fsstress/ltp-full-20091231/testcases/kernel/fs/fsstress/fsstress auth{1=1} v18 ap=2 DIRTYPARENT s=0 n(v0 1=1+0) (iauth excl) (ifile excl->sync) (ixattr excl) (iversion lock) cr={14520=0-4194304@1} caps={14520=pAsxLsXsxFscr/pAsxLsXsxFsxcrwb/pAsxXsxFsxcrwb@3},l=14520 | request=0 caps=1 dirtyparent=1 replicated=1 dirty=1 authpin=1 0x55b2658e6680] now 2

After 35 seconds later the cap update message came, but only the Fx caps is released:

2022-04-07T06:35:30.002+0000 7f73ae424700  1 -- [v2:172.21.15.116:6826/4138533272,v1:172.21.15.116:6827/4138533272] <== client.14520 v1:192.168.0.1:0/3995310513 24236 ==== client_caps(update ino 0x1000000615a 1 seq 3 tid 11103 caps=pAsxLsXsxFscrwb dirty=Fw wanted=pAsxXsxFsxcrwb follows 1 size 51656/0 mtime 2022-04-07T06:34:53.430871+0000 xattrs(v=18446744071603341568 l=0)) v10 ==== 236+0+0 (unknown 1066047011 0 0) 0x55b2643dac00 con 0x55b2619b9400
2022-04-07T06:35:30.002+0000 7f73ae424700  7 mds.0.locker handle_client_caps  on 0x1000000615a tid 11103 follows 1 op update flags 0x0

That means the kclient was still holding the Fwb caps until around 3 hours later:

2022-04-07T09:34:23.874+0000 7f73ae424700  1 -- [v2:172.21.15.116:6826/4138533272,v1:172.21.15.116:6827/4138533272] <== client.14520 v1:192.168.0.1:0/3995310513 40893 ==== client_caps(update ino 0x1000000615a 1 seq 3 caps=pAsLsXsFsc dirty=- wanted=- follows 0 size 51656/0 mtime 2022-04-07T06:34:53.430871+0000 xattrs(v=18446744071603341568 l=0)) v10 ==== 236+0+0 (unknown 2446509421 0 0) 0x55b26413b800 con 0x55b2619b9400

The kernel code seems buggy, when receiving the Fxwb revoke request it will queue a CEPH_I_WORK_WRITEBACK work for the inode, and then try to write back the dirty pages and clear the queue bit of CEPH_I_WORK_WRITEBACK.

But if after dirty pages were wrote back, if the ci->i_wrbuffer_ref is not 0 it won't release the Fwb caps. And after that the kernel won't trigger to do the writeback. That means the mds.0 may wait the kclient to release the Fwb caps for a very long time until someone triggers to do the dirty pages writeback.

Actions

Copy link

Updated by Jeff Layton about 2 years ago

The handling of the i_wrbuffer_ref value in the kernel seems broken by design. We bump that counter whenever a page is dirtied, and decrement the counter when a page is cleaned. When the counter goes to 0, we'll release Fb caps.

mmap has to use the pagecache regardless of what caps we have, so if you just keep dirtying pages via mmap, then you can keep Fb from being returned to the MDS indefinitely. I think we need to rework that counter handling to prevent this, but how to do that is not yet clear.

Actions

Copy link

Updated by Xiubo Li about 2 years ago

Jeff Layton wrote:

The handling of the i_wrbuffer_ref value in the kernel seems broken by design. We bump that counter whenever a page is dirtied, and decrement the counter when a page is cleaned. When the counter goes to 0, we'll release Fb caps.

mmap has to use the pagecache regardless of what caps we have, so if you just keep dirtying pages via mmap, then you can keep Fb from being returned to the MDS indefinitely. I think we need to rework that counter handling to prevent this, but how to do that is not yet clear.

Today I haven't read the code yet. Just one idea in my mind last night:

When the MDS is revoking the Fb caps we can schedule and queue the ceph_inode_work() to writeback all the dirty pages and at the same time drop all the pages from the i_mapping.

We are doing the similar stuff in Linux LIO/TCMU kernel code since years ago.

And then block any new coming write to pagecache in the ceph_filemap_fault(), while waiting we can request the Fwb caps from MDS.

Any idea ?

Actions

Copy link

Updated by Xiubo Li about 2 years ago

This bug is not introduce by the mmap(), it was introduce by the inline_data and buffer write:

1, mount with wsync.
2, create a file with O_RDWR, and the request was sent to mds.0:

   ceph_atomic_open()-->
     ceph_mdsc_do_request(openc)
     finish_open(file, dentry, ceph_open)-->
       ceph_open()-->
         ceph_init_file()-->
           ceph_init_file_info()-->
             ceph_uninline_data()-->
             {
               ...
               if (inline_version == 1 || /* initial version, no data */                                             
                   inline_version == CEPH_INLINE_NONE)                                                               
                     goto out_unlock;  
               ...
             }

The inline_version will be 1, the initial version for the new create file. And here the ci->i_inline_version will keep with 1, it's buggy.

3, buffer write to the file immediately:

   ceph_write_iter()-->
     ceph_get_caps(file, need=CEPH_CAP_FILE_WR, want=CEPH_CAP_FILE_BUFFER, ...);
     generic_perform_write()-->
       a_ops->write_begin()-->
         ceph_write_begin()-->
           netfs_write_begin()-->
             netfs_begin_read()-->
               netfs_rreq_submit_slice()-->
                 netfs_read_from_server()-->
                   rreq->netfs_ops->issue_read()-->
                     ceph_netfs_issue_read()-->
                     {
                       ...
                       if (ci->i_inline_version != CEPH_INLINE_NONE &&
                           ceph_netfs_issue_op_inline(subreq))
                         return;
                       ... 
                     }
     ceph_put_cap_refs(ci, Fwb);

The ceph_netfs_issue_op_inline() will send a getattr(Fsr) request to mds.1.

4, then the mds.1 will request the rd lock for CInode::filelock from the auth mds.0, the mds.0 will do the filelock state transation excl --> sync, but it need to revoke the Fxwb caps back from the clients.