Actions
Bug #53504
closedclient: infinite loop "got ESTALE" after mds recovery
% Done:
0%
Source:
Community (user)
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
After an MDS recovery we inevitably see a few clients hammering the MDSs in a loop, doing getattr on a stale fh.
On the MDS it looks like:
2021-12-06 15:13:43.082 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192925 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=91 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x562c8a8f1800 con 0x5628d5dac000 2021-12-06 15:13:43.083 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x562c23feedc0 con 0x5628d5dac000 2021-12-06 15:13:43.084 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192926 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=92 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x5629545c0f00 con 0x5628d5dac000 2021-12-06 15:13:43.084 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192927 ==== client_request(client.237200480:42904210 getattr pAsLsXsFs #0x100821dfe34 2021-12-06 15:03:24.827889 RETRY=160 caller_uid=0, caller_gid=0{0,}) v4 ==== 144+0+0 (crc 0 0 0) 0x5629545c1b00 con 0x5628d5dac000 2021-12-06 15:13:43.086 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x562b4ddf4840 con 0x5628d5dac000 2021-12-06 15:13:43.086 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192928 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=93 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x563366a3e000 con 0x5628d5dac000 2021-12-06 15:13:43.086 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904210 = -116 (116) Stale file handle) v1 -- 0x5633cd792840 con 0x5628d5dac000 2021-12-06 15:13:43.087 7ff6381ba700 1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x563048a30000 con 0x5628d5dac000
On the client it shows:
2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 choose_target_mds resend_mds specified as mds.1 2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 mds is 1 2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 send_request rebuilding request 42904156 for mds.1 2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 encode_cap_releases enter (req: 0x7fc034282a30, mds: 1) 2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 send_request client_request(unknown.0:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=238 caller_uid=307, caller_gid=307{307,}) v4 to mds.1 2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 awaiting reply|forward|kick on 0x7fc047ffd100 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904156 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904156 from mds.1 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0 2021-12-06 15:07:17.782 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 mds is 1 2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 choose_target_mds resend_mds specified as mds.1 2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 mds is 1 2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 send_request rebuilding request 42904156 for mds.1 2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 encode_cap_releases enter (req: 0x7fc034282a30, mds: 1) 2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 send_request client_request(unknown.0:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=239 caller_uid=307, caller_gid=307{307,}) v4 to mds.1 2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 awaiting reply|forward|kick on 0x7fc047ffd100 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904210 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904210 from mds.2 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0 2021-12-06 15:07:17.782 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1 2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 mds is 1 2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 choose_target_mds resend_mds specified as mds.1 2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 mds is 1 2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 send_request rebuilding request 42904210 for mds.1 2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 encode_cap_releases enter (req: 0x7fc0383228a0, mds: 1) 2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 send_request client_request(unknown.0:42904210 getattr pAsLsXsFs #0x100821dfe34 2021-12-06 15:03:24.827889 RETRY=136 caller_uid=0, caller_gid=0{0,}) v4 to mds.1 2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 awaiting reply|forward|kick on 0x7fc0477fc100 2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904156 2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904156 from mds.1 2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) 2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0 2021-12-06 15:07:17.783 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1 2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 mds is 1
Evicting the client does not fix. (we have client_reconnect_stale true)
Just seen on nautilus client but presumed to still be present in o/p.
Files
Actions