Bug #36079

ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs

Added by Ivan Guan 3 months ago. Updated 2 months ago.

Status: Need Review
Priority: High
Assignee:
Category: -
Target version:
Start date:
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport: mimic,luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, MDS
Labels (FS):
Pull request ID:

Description

Version: jewel(ceph-10.2.2)
MDS mode: active/hot-standby
Description:
As we know, the MDS kills a session if the client does not send renewcaps to the MDS within mds_session_autoclose (default 300s). If the client does not receive the handle_mds_map message from the monitor when the active and hot-standby MDS switch occurs, it hangs indefinitely. I dumped the sessions on both the client and the MDS side and found that the client records its session as open, but the MDS side does not have it. After carefully studying the client and MDS logs, I found that the MDS had already killed the session, but the client did not know that because its network was down at the time. When the active/hot-standby MDS switch then happened, the client still did not receive the handle_mds_map message because of its bad network. Thus the client thinks its session is fine, but it cannot send any message to the MDS even after the network is back to normal, because its pipe has been marked down.
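The sequence described above can be sketched as a small toy model. This is only an illustration of the state mismatch, not Ceph code: `ToyMds`, `ToyClient`, `simulate`, and their fields are hypothetical names, and the only value taken from the report is the 300s mds_session_autoclose default.

```python
from dataclasses import dataclass, field

# Seconds; the mds_session_autoclose default cited in the report.
SESSION_AUTOCLOSE = 300


@dataclass
class ToyMds:
    """Hypothetical stand-in for the MDS session table (not Ceph code)."""
    last_renew: dict = field(default_factory=dict)

    def renew(self, client, now):
        # Record the time of the last renewcaps seen from this client.
        self.last_renew[client] = now

    def evict_stale(self, now):
        # Drop every session that has not renewed within the autoclose window.
        stale = [c for c, t in self.last_renew.items()
                 if now - t > SESSION_AUTOCLOSE]
        for c in stale:
            del self.last_renew[c]
        return stale


@dataclass
class ToyClient:
    """Hypothetical stand-in for the client's view of its session."""
    name: str
    session_state: str = "open"
    con_down: bool = False

    def send(self, mds):
        # A lossy connection that was marked down silently drops the message.
        if self.con_down:
            return "dropped"
        if self.name not in mds.last_renew:
            return "rejected"
        return "sent"


def simulate():
    mds = ToyMds()
    client = ToyClient("client.87547")
    mds.renew(client.name, now=0)
    # Network outage: no renewal reaches the MDS within the autoclose window.
    evicted = mds.evict_stale(now=SESSION_AUTOCLOSE + 1)
    client.con_down = True  # the connection to the old MDS is marked down
    # Failover happens, but the client never sees the new MDS map, so it
    # neither resets its session state nor reconnects.
    return evicted, client.session_state, client.send(mds)
```

In this model the MDS no longer holds the session, the client still reports it as open, and every request is dropped, which matches the session dumps and the client log in this report.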

client log:

2018-09-13 15:30:22.414875 7f5b40989700 21 client.87547 tick
2018-09-13 15:30:22.414923 7f5b40989700 20 client.87547 trim_cache size 0 max 16384
2018-09-13 15:30:23.104030 7f5b39577700  3 client.87547 ll_getattr 1.head
2018-09-13 15:30:23.104067 7f5b39577700 10 client.87547 _getattr mask pAsLsXsFs issued=0
2018-09-13 15:30:23.104082 7f5b39577700 15 Dentry   inode.get on 0x7f5b542b7e00 1.head now 4
2018-09-13 15:30:23.104091 7f5b39577700 20 client.87547 choose_target_mds starting with req->inode 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00)
2018-09-13 15:30:23.104123 7f5b39577700 20 client.87547 choose_target_mds 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00) is_hash=0 hash=0
2018-09-13 15:30:23.104138 7f5b39577700 10 client.87547 choose_target_mds from caps on inode 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00)
2018-09-13 15:30:23.104146 7f5b39577700 20 client.87547 mds is 0
2018-09-13 15:30:23.104153 7f5b39577700 10 client.87547 send_request rebuilding request 9 for mds.0mds_name:  request op: 257
2018-09-13 15:30:23.104159 7f5b39577700 20 client.87547 encode_cap_releases enter (req: 0x7f5b5431b9c0, mds: 0)
2018-09-13 15:30:23.104163 7f5b39577700 25 client.87547 encode_cap_releases exit (req: 0x7f5b5431b9c0, mds 0
2018-09-13 15:30:23.104165 7f5b39577700 20 client.87547 send_request set sent_stamp to 2018-09-13 15:30:23.104164
2018-09-13 15:30:23.104169 7f5b39577700 10 client.87547 send_request client_request(unknown.0:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 to mds.0
2018-09-13 15:30:23.104184 7f5b39577700  1 -- 192.168.12.201:0/1506388177 --> 192.168.12.201:6803/14343 -- client_request(client.87547:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 -- ?+0 0x7f5b5431aec0 con 0x7f5b5428ea80
2018-09-13 15:30:23.104213 7f5b39577700  0 -- 192.168.12.201:0/1506388177 submit_message client_request(client.87547:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 remote, 192.168.12.201:6803/14343, failed lossy con, dropping message 0x7f5b5431aec0
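
To check whether a client log shows the same symptom, a small scanner along these lines may help. It is a sketch that assumes the messenger line layout seen in the log above; `DROP_RE` and `find_dropped` are hypothetical names, not part of Ceph.

```python
import re

# Matches messenger lines of the form shown in the client log above:
#   <timestamp> ... submit_message <type>(...) ... remote, <addr>, ... lossy con, dropping message
DROP_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"submit_message (?P<msg>\w+)\(.*"
    r"remote, (?P<peer>[\d.]+:\d+/\d+), .*lossy con, dropping message"
)


def find_dropped(lines):
    """Return (timestamp, message type, peer address) for each dropped message."""
    hits = []
    for line in lines:
        m = DROP_RE.search(line)
        if m:
            hits.append((m.group("ts"), m.group("msg"), m.group("peer")))
    return hits
```

Repeated hits for the same peer address while the client still lists the session as open would point to the situation described in this report.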


Related issues

Related to fs - Bug #36507: client: connection failure during reconnect causes client to hang New

History

#2 Updated by Patrick Donnelly 3 months ago

  • Project changed from Ceph to fs
  • Subject changed from ceph fuse hang because it miss reconnect phase when hot standby mds switch occurs to ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs
  • Description updated (diff)
  • Due date deleted (09/20/2018)
  • Status changed from New to Need Review
  • Assignee set to Ivan Guan
  • Priority changed from Normal to High
  • Start date deleted (09/19/2018)
  • Source set to Community (dev)
  • Backport set to mimic,luminous
  • Component(FS) Client, MDS added

#3 Updated by Patrick Donnelly 2 months ago

  • Related to Bug #36507: client: connection failure during reconnect causes client to hang added

#4 Updated by Patrick Donnelly 2 months ago

#36507 is kinda related.
