Bug #53504

client: infinite loop "got ESTALE" after mds recovery

Added by Dan van der Ster 12 months ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Code Hygiene
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
quincy, pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After an MDS recovery we inevitably see a few clients hammering the MDSs in a loop, doing getattr on a stale fh.
On the MDS it looks like:

2021-12-06 15:13:43.082 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192925 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=91 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x562c8a8f1800 con 0x5628d5dac000
2021-12-06 15:13:43.083 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x562c23feedc0 con 0x5628d5dac000
2021-12-06 15:13:43.084 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192926 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=92 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x5629545c0f00 con 0x5628d5dac000
2021-12-06 15:13:43.084 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192927 ==== client_request(client.237200480:42904210 getattr pAsLsXsFs #0x100821dfe34 2021-12-06 15:03:24.827889 RETRY=160 caller_uid=0, caller_gid=0{0,}) v4 ==== 144+0+0 (crc 0 0 0) 0x5629545c1b00 con 0x5628d5dac000
2021-12-06 15:13:43.086 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x562b4ddf4840 con 0x5628d5dac000
2021-12-06 15:13:43.086 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] <== client.237200480 188.185.89.63:0/2778507800 192928 ==== client_request(client.237200480:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=93 caller_uid=307, caller_gid=307{307,}) v4 ==== 144+0+0 (crc 0 0 0) 0x563366a3e000 con 0x5628d5dac000
2021-12-06 15:13:43.086 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904210 = -116 (116) Stale file handle) v1 -- 0x5633cd792840 con 0x5628d5dac000
2021-12-06 15:13:43.087 7ff6381ba700  1 -- [v2:10.116.2.210:6800/998364111,v1:10.116.2.210:6801/998364111] --> 188.185.89.63:0/2778507800 -- client_reply(???:42904156 = -116 (116) Stale file handle) v1 -- 0x563048a30000 con 0x5628d5dac000

On the client it shows:

2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 choose_target_mds resend_mds specified as mds.1
2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 mds is 1
2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 send_request rebuilding request 42904156 for mds.1
2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 encode_cap_releases enter (req: 0x7fc034282a30, mds: 1)
2021-12-06 15:07:17.781 7fc047fff700 10 client.237200480 send_request client_request(unknown.0:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=238 caller_uid=307, caller_gid=307{307,}) v4 to mds.1
2021-12-06 15:07:17.781 7fc047fff700 20 client.237200480 awaiting reply|forward|kick on 0x7fc047ffd100
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904156
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904156 from mds.1
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800)
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0
2021-12-06 15:07:17.782 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 choose_target_mds resend_mds specified as mds.1
2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 send_request rebuilding request 42904156 for mds.1
2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 encode_cap_releases enter (req: 0x7fc034282a30, mds: 1)
2021-12-06 15:07:17.782 7fc047fff700 10 client.237200480 send_request client_request(unknown.0:42904156 getattr pAsLsXsFs #0x100821dfe34 2021-12-04 02:45:05.564904 RETRY=239 caller_uid=307, caller_gid=307{307,}) v4 to mds.1
2021-12-06 15:07:17.782 7fc047fff700 20 client.237200480 awaiting reply|forward|kick on 0x7fc047ffd100
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904210
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904210 from mds.2
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800)
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0
2021-12-06 15:07:17.782 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1
2021-12-06 15:07:17.782 7fc0617fa700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 choose_target_mds resend_mds specified as mds.1
2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 mds is 1
2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 send_request rebuilding request 42904210 for mds.1
2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 encode_cap_releases enter (req: 0x7fc0383228a0, mds: 1)
2021-12-06 15:07:17.782 7fc0477fe700 10 client.237200480 send_request client_request(unknown.0:42904210 getattr pAsLsXsFs #0x100821dfe34 2021-12-06 15:03:24.827889 RETRY=136 caller_uid=0, caller_gid=0{0,}) v4 to mds.1
2021-12-06 15:07:17.782 7fc0477fe700 20 client.237200480 awaiting reply|forward|kick on 0x7fc0477fc100
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 handle_client_reply got a reply. Safe:1 tid 42904156
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 got ESTALE on tid 42904156 from mds.1
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 choose_target_mds starting with req->inode 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800)
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 choose_target_mds 0x100821dfe34.head(faked_ino=0 ref=4 ll_ref=3 cap_refs={} open={2=1} mode=100644 size=0/4194304 nlink=1 btime=2021-12-03 01:58:01.844586 mtime=2021-12-03 01:58:01.844586 ctime=2021-12-03 01:58:01.844586 caps=- objectset[0x100821dfe34 ts 0/0 objects 0 dirty_or_tx 0] 0x7fc04c37a800) is_hash=0 hash=0
2021-12-06 15:07:17.783 7fc0617fa700 10 client.237200480 did not get mds through better means, so chose random mds 1
2021-12-06 15:07:17.783 7fc0617fa700 20 client.237200480 mds is 1

Evicting the client does not fix it (we have client_reconnect_stale = true).

So far seen only on a nautilus client, but presumed to still be present in octopus/pacific.
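The client-side loop visible in the logs above can be sketched as follows. This is a simplified stand-in, not the actual Client.cc code; the function names merely mirror the log messages, and the stop condition is hypothetical:

```cpp
#include <cassert>
#include <random>
#include <vector>

// Simplified stand-in for the client's MDS selection: with no caps on the
// inode ("caps=-" in the logs above), the client cannot pick a better
// target and falls back to a random active MDS ("chose random mds N").
int choose_target_mds(const std::vector<int>& active_mds, std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, active_mds.size() - 1);
    return active_mds[pick(rng)];
}

// On ESTALE the client simply resends the same request (RETRY=n+1).
// Nothing in the loop ever changes the outcome, so it spins forever;
// here a cap on the iterations stands in for "forever" so the unbounded
// growth of the RETRY counter can be observed.
int count_retries_until(int max_retries) {
    std::vector<int> active_mds{0, 1, 2, 3};
    std::mt19937 rng(42);
    int retry = 0;
    while (retry < max_retries) {
        int mds = choose_target_mds(active_mds, rng);
        (void)mds;            // every MDS replies -ESTALE (-116)
        bool estale = true;   // the inode is in no MDS's cache
        if (!estale) break;   // never taken: the loop is unbounded
        ++retry;
    }
    return retry;
}
```

With every MDS returning ESTALE, the retry counter always reaches whatever cap is imposed, matching the RETRY=91, 92, 93, … progression in the MDS log.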

flax-client.cvmfs-home-cvams-id.log.gz (842 KB) Dan van der Ster, 12/07/2021 08:16 AM


Related issues

Copied to CephFS - Backport #55934: quincy: client: infinite loop "got ESTALE" after mds recovery Resolved
Copied to CephFS - Backport #55935: pacific: client: infinite loop "got ESTALE" after mds recovery Resolved

History

#1 Updated by Xiubo Li 12 months ago

BTW, what's `max_mds` in your setup? And how many MDSes were up after you upgraded? It seems there is only one.

#2 Updated by Xiubo Li 12 months ago

  • Assignee set to Xiubo Li

#3 Updated by Dan van der Ster 12 months ago

Cluster had max_mds 3 at the time of those logs. It's running 14.2.22 -- we didn't upgrade; the latest recovery was following a network outage.
We have mds balancing disabled and use subtree pinning extensively.

I can share MDS logs in the morning if it helps... We still have several clients in this state.

#4 Updated by Xiubo Li 12 months ago

Dan van der Ster wrote:

Cluster had max_mds 3 at the time of those logs. It's running 14.2.22 -- we didn't upgrade; the latest recovery was following a network outage.
We have mds balancing disabled and use subtree pinning extensively.

I can share MDS logs in the morning if it helps... We still have several clients in this state.

Yes, certainly, that would help a lot.

#5 Updated by Xiubo Li 12 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 44234

#6 Updated by Dan van der Ster 12 months ago

More of the client log is attached (from yesterday).

Do you still need MDS logs? At which debug level?

#7 Updated by Xiubo Li 12 months ago

Dan van der Ster wrote:

More of the client log is attached (from yesterday).

Do you still need MDS logs? At which debug level?

Yes, please. I need to check whether there is any other problem I missed.

#8 Updated by Dan van der Ster 12 months ago

Xiubo Li wrote:

Dan van der Ster wrote:

More of the client log is attached (from yesterday).

Do you still need MDS logs? At which debug level?

Yes, please. I need to check whether there is any other problem I missed.

Hi, sorry for the delay.

I have uploaded a ceph-mds log to https://new.cernbox.cern.ch/#/s/EwBkXiQSiCxKnFe, first with debug_ms=1 for ~10 seconds so you can see all the Stale fh's clearly, then with debug_mds=20 for another 10s.

client.254368052 is an example looping client.

Thank you

#9 Updated by Xiubo Li 12 months ago

Dan van der Ster wrote:

Xiubo Li wrote:

Dan van der Ster wrote:

More of the client log is attached (from yesterday).

Do you still need MDS logs? At which debug level?

Yes, please. I need to check whether there is any other problem I missed.

Hi, sorry for the delay.

I have uploaded a ceph-mds log to https://new.cernbox.cern.ch/#/s/EwBkXiQSiCxKnFe, first with debug_ms=1 for ~10 seconds so you can see all the Stale fh's clearly, then with debug_mds=20 for another 10s.

client.254368052 is an example looping client.

Thank you

Thanks.

Yes, this matches what I have read in the source code: the MDS has already iterated over all the `active` peers, but the client keeps retrying again and again:

 492613 2021-12-08 08:59:37.651 7ff6381ba700  4 mds.0.server handle_client_request client_request(client.254368052:930452 getattr As #0x1007b02db90 2021-12-03 16:07:52.130947 RETRY=149 caller_uid=1001, caller_gid=0{}) v4
 492614 2021-12-08 08:59:37.651 7ff6381ba700 20 mds.0.253042 get_session have 0x562920b6d600 client.254368052 188.184.30.244:0/1340416446 state open
 492615 2021-12-08 08:59:37.651 7ff6381ba700 15 mds.0.server  oldest_client_tid=930452
 492616 2021-12-08 08:59:37.651 7ff6381ba700  7 mds.0.cache request_start request(client.254368052:930452 cr=0x563956acb800)
 492617 2021-12-08 08:59:37.651 7ff6381ba700  7 mds.0.server dispatch_client_request client_request(client.254368052:930452 getattr As #0x1007b02db90 2021-12-03 16:07:52.130947 RETRY=149 caller_uid=1001, caller_gid=0{}) v4
 492618 2021-12-08 08:59:37.651 7ff6381ba700 10 mds.0.server rdlock_path_pin_ref request(client.254368052:930452 cr=0x563956acb800) #0x1007b02db90
 492619 2021-12-08 08:59:37.651 7ff6381ba700  7 mds.0.cache traverse: opening base ino 0x1007b02db90 snap head
 492620 2021-12-08 08:59:37.651 7ff6381ba700 10 mds.0.server FAIL on ESTALE but attempting recovery                                                    
 492621 2021-12-08 08:59:37.651 7ff6381ba700  5 mds.0.cache find_ino_peers 0x1007b02db90 hint -1
 492622 2021-12-08 08:59:37.651 7ff6381ba700 10 mds.0.cache _do_find_ino_peer 1100865397 0x1007b02db90 active 0,1,2,3 all 0,1,2,3 checked
 492623 2021-12-08 08:59:37.651 7ff6381ba700 10 mds.0.cache handle_find_ino findino(201881059 0x1008b780186) v1
 492624 2021-12-08 08:59:37.651 7ff6381ba700 10 mds.0.cache handle_find_ino findino(2006167246 0x1007b02f8c3) v1
 492625 2021-12-08 08:59:37.651 7ff6381ba700 10 mds.0.cache handle_find_ino findino(1100798442 0x1008b77defc) v1
 492626 2021-12-08 08:59:37.651 7ff6381ba700  4 mds.0.server handle_client_request client_request(client.93154170:27608759786 getattr - #0x100093d8dd12021-12-08 08:59:37.648937 caller_uid=0, caller_gid=0{}) v2
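The MDS-side flow in the log above can be sketched as a simplified stand-in (hypothetical names mirroring `find_ino_peers`/`_do_find_ino_peer`, not the actual MDCache.cc code): on a cache miss the MDS asks each other active rank in turn, and once the checked set covers all peers without a hit, the request fails with -ESTALE (-116), which the client answers with yet another retry of the same search.

```cpp
#include <cerrno>
#include <set>

// Simplified stand-in for the peer search visible in the log: ask every
// active rank whether it caches the inode; if none does, fail with
// -ESTALE. ranks_caching_ino is a hypothetical parameter standing in for
// the peers' replies to the findino messages.
int find_ino_among_peers(const std::set<int>& active_ranks,
                         const std::set<int>& ranks_caching_ino) {
    std::set<int> checked;
    for (int rank : active_ranks) {
        if (ranks_caching_ino.count(rank))
            return rank;        // a peer has the inode in its cache
        checked.insert(rank);   // "_do_find_ino_peer ... checked"
    }
    return -ESTALE;             // all peers checked, inode nowhere in cache
}
```

When the inode was dropped from every MDCache during recovery, the search exhausts `active 0,1,2,3` exactly as the log shows, and the only possible outcome is -ESTALE.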

#10 Updated by Patrick Donnelly 11 months ago

  • Target version set to v17.0.0
  • Source set to Community (user)
  • Backport set to pacific,octopus

#11 Updated by Xiubo Li 8 months ago

吴羡 (Wu Xian) reported the same bug in his production use case; he sees clients get stuck like this almost every 2 weeks.

I went through the MDS-side code again. My guess is that the opened inode went missing from the MDCache after recovery: the OpenFileTable is only committed to disk in the tick() thread, so when the MDS crashes there is, in theory, a chance that part of the openfile journal is lost. The replacement MDS then cannot find the inode.

I had him list all the sub-files and sub-directories from the mountpoint, and the bug resolved immediately, because listing from the mountpoint loads all the files back into the MDCache.

I have updated the PR: after checking all the peers, if the lookup still fails the MDS will try to open the ino, load it from disk, and retry the request.
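The fallback described above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the actual code in the PR; all names here are made up for the sketch:

```cpp
// Hypothetical sketch of the lookup order after the fix: cache first,
// then peers, then (new) a disk load, and only then a hard ESTALE.
enum class LookupResult { InCache, FromPeer, FromDisk, Estale };

LookupResult lookup_ino(bool in_local_cache, bool in_peer_cache,
                        bool on_disk) {
    if (in_local_cache) return LookupResult::InCache;
    if (in_peer_cache)  return LookupResult::FromPeer;
    // New fallback: the inode may well survive on disk even though its
    // OpenFileTable entry was lost when the MDS crashed.
    if (on_disk)        return LookupResult::FromDisk;
    return LookupResult::Estale;   // genuinely gone
}
```

The key point is the third branch: before the fix, the "all peers checked" case fell straight through to ESTALE even though the inode was still perfectly recoverable from disk.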

#12 Updated by Xiubo Li 8 months ago

Xiubo Li wrote:

吴羡 (Wu Xian) reported the same bug in his production use case; he sees clients get stuck like this almost every 2 weeks.

I went through the MDS-side code again. My guess is that the opened inode went missing from the MDCache after recovery: the OpenFileTable is only committed to disk in the tick() thread, so when the MDS crashes there is, in theory, a chance that part of the openfile journal is lost. The replacement MDS then cannot find the inode.

I had him list all the sub-files and sub-directories from the mountpoint, and the bug resolved immediately, because listing from the mountpoint loads all the files back into the MDCache.

I have updated the PR: after checking all the peers, if the lookup still fails the MDS will try to open the ino, load it from disk, and retry the request.

I checked the client and MDS code further, and it seems there is another case that could also cause this:

On the libcephfs side, `Client::statx(const char *relpath, ...)` passes the `relpath` through `statx_to_mask()` and will only `want` the `pAsLsXsFs` caps from the MDS when looking up the inodes.

Meanwhile, the MDS Locker code won't journal such an inode in the OpenFileTable; see `set_wanted()` and `is_wanted_notable()`:

static bool is_wanted_notable(int wanted) {
  return wanted & (CEPH_CAP_ANY_WR|CEPH_CAP_FILE_WR|CEPH_CAP_FILE_RD);
}

void Capability::set_wanted(int w) {
  CInode *in = get_inode();
  if (in) {
    if (!is_wanted_notable(_wanted) && is_wanted_notable(w)) {
      in->adjust_num_caps_notable(1);
      if (!is_notable())
        mark_notable();
    } else if (is_wanted_notable(_wanted) && !is_wanted_notable(w)) {
      in->adjust_num_caps_notable(-1);
      maybe_clear_notable();
    }
  }
  _wanted = w;
}

That means the inodes in the path could be lost from the MDCache after the MDS crashes and is replaced. So if the MDS crashes during `Client::statx()-->_getattr()` and the `_getattr()` request is retried after reconnecting, we could in theory hit the same issue.
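To make the "not notable" point concrete, here is a toy rendering of the logic in `is_wanted_notable()` quoted above, using simplified stand-in cap bits (these are NOT Ceph's real `CEPH_CAP_*` values, just illustrative flags):

```cpp
// Simplified stand-in cap bits (not Ceph's real values) to illustrate
// why a statx/getattr that only wants shared caps is never "notable".
constexpr int CAP_PIN          = 1 << 0;  // p
constexpr int CAP_AUTH_SHARED  = 1 << 1;  // As
constexpr int CAP_LINK_SHARED  = 1 << 2;  // Ls
constexpr int CAP_XATTR_SHARED = 1 << 3;  // Xs
constexpr int CAP_FILE_SHARED  = 1 << 4;  // Fs
constexpr int CAP_FILE_RD      = 1 << 5;
constexpr int CAP_FILE_WR      = 1 << 6;
constexpr int CAP_ANY_WR       = CAP_FILE_WR;  // collapsed for the sketch

// Same shape as the MDS code quoted above: only WR or FILE_RD bits make
// a capability notable (and thus journaled in the OpenFileTable).
bool is_wanted_notable(int wanted) {
    return wanted & (CAP_ANY_WR | CAP_FILE_WR | CAP_FILE_RD);
}

// pAsLsXsFs -- the mask a statx/getattr path wants, per the logs above:
constexpr int STATX_WANTED =
    CAP_PIN | CAP_AUTH_SHARED | CAP_LINK_SHARED |
    CAP_XATTR_SHARED | CAP_FILE_SHARED;
```

Since `STATX_WANTED` contains no write or file-read bits, `is_wanted_notable()` returns false for it, the capability is never marked notable, and the inode is skipped by the OpenFileTable journal -- so it can vanish from the MDCache across an MDS crash.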

#13 Updated by Venky Shankar 6 months ago

  • Category set to Code Hygiene
  • Status changed from Fix Under Review to Pending Backport
  • Target version changed from v17.0.0 to v18.0.0
  • Backport changed from pacific,octopus to quincy, pacific

#14 Updated by Backport Bot 6 months ago

  • Copied to Backport #55934: quincy: client: infinite loop "got ESTALE" after mds recovery added

#15 Updated by Backport Bot 6 months ago

  • Copied to Backport #55935: pacific: client: infinite loop "got ESTALE" after mds recovery added

#16 Updated by Xiubo Li 4 months ago

  • Status changed from Pending Backport to Resolved
