Project

General

Profile

Actions

Bug #53504

closed

client: infinite loop "got ESTALE" after mds recovery

Added by Dan van der Ster over 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Code Hygiene
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After an MDS recovery we inevitably see a few clients hammering the MDSs in a loop, doing getattr on a stale fh.
On the MDS it looks like:

2021-12-06 15:13:43.082 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192925 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=91 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x562c8a8f1800 con 0x5628d5dac000
2021-12-06 15:13:43.083 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x562c23feedc0 con 0x5628d5dac000
2021-12-06 15:13:43.084 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192926 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=92 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x5629545c0f00 con 0x5628d5dac000
2021-12-06 15:13:43.084 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192927 ==== client_request(client.237200480:42904210 getattr pAsLsXsFs #0x100821dfe34 2021-12-06 15:03:24.827889 RETRY=160 caller_uid=0, caller_gid=0{0,}) v4 ==== 144+0+0 (crc 0 0 0) 0x5629545c1b00 con 0x5628d5dac000
2021-12-06 15:13:43.086 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x562b4ddf4840 con 0x5628d5dac000
2021-12-06 15:13:43.086 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192928 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=93 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x563366a3e000 con 0x5628d5dac000
2021-12-06 15:13:43.086 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904210 = -116 (116) Stale file handle) v1 -- 0x5633cd792840 con 0x5628d5dac000
2021-12-06 15:13:43.087 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x563048a30000 con 0x5628d5dac000

On the client it shows:

2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 choose_target_mds resend_mds specified as mds.1
2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 mds is 1
2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 send_request rebuilding request 42904156 for mds.1
2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 encode_cap_releases enter (req: 0x7fc034282a30, mds: 1)
2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 send_request client_request(unknown.0:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=238 caller_uid=307, caller_gid=307{307,}) v4 to mds.1
2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 awaiting reply|forward|kick on 0x7fc047ffd100
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904156
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904156 from mds.1
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800)
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0
2021-12-06 15:07:17.782 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 choose_target_mds resend_mds specified as mds.1
2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 send_request rebuilding request 42904156 for mds.1
2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 encode_cap_releases enter (req: 0x7fc034282a30, mds: 1)
2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 send_request client_request(unknown.0:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=239 caller_uid=307, caller_gid=307{307,}) v4 to mds.1
2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 awaiting reply|forward|kick on 0x7fc047ffd100
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904210
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904210 from mds.2
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800)
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0
2021-12-06 15:07:17.782 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 choose_target_mds resend_mds specified as mds.1
2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 send_request rebuilding request 42904210 for mds.1
2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 encode_cap_releases enter (req: 0x7fc0383228a0, mds: 1)
2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 send_request client_request(unknown.0:42904210 getattr pAsLsXsFs #0x100821dfe34 2021-12-06 15:03:24.827889 RETRY=136 caller_uid=0, caller_gid=0{0,}) v4 to mds.1
2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 awaiting reply|forward|kick on 0x7fc0477fc100
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904156
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904156 from mds.1
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800)
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0
2021-12-06 15:07:17.783 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 mds is 1

Evicting the client does not fix. (we have client_reconnect_stale true)

Just seen on nautilus client but presumed to still be present in o/p.


Files


Related issues 2 (0 open2 closed)

Copied to CephFS - Backport #55934: quincy: client: infinite loop "got ESTALE" after mds recoveryResolvedXiubo LiActions
Copied to CephFS - Backport #55935: pacific: client: infinite loop "got ESTALE" after mds recoveryResolvedXiubo LiActions
Actions

Also available in: Atom PDF