Bug #36079
closed
ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs
Description
Version: jewel(ceph-10.2.2)
MDS mode: active/hot-standby
Description:
As we know, the MDS will kill a session if the client doesn't send renewcaps to the MDS within mds_session_autoclose (default 300s). If the client doesn't receive the handle_mds_map message from the monitor when the active and hot-standby MDS switch occurs, it will hang forever. I dumped the sessions on both the client and MDS sides and found that the client records its session as open but the MDS side doesn't have it. After carefully studying the client and MDS logs, I found that the MDS had already killed the session, but the client didn't know that because its network was down at the time. When the active and hot-standby MDS switch then happened, the client still didn't receive the handle_mds_map message due to its bad network. Thus the client thinks its session is OK, but it can't send any message to the MDS even though the network is back to normal, because its pipe has been marked down.
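The sequence described above can be sketched as a toy model. This is not Ceph code: ToyMds, ToyClient, simulate and the recovery step in handle_mds_map are hypothetical names and simplifications used only to illustrate the state mismatch. Only the 300s autoclose default and the client id come from this report.

```python
MDS_SESSION_AUTOCLOSE = 300  # seconds, the default quoted in the report


class ToyMds:
    """Hypothetical stand-in for an MDS session table."""

    def __init__(self):
        self.sessions = {}  # client id -> time of last renewcaps

    def renew(self, client_id, now):
        self.sessions[client_id] = now

    def tick(self, now):
        # Drop sessions that have not renewed within the autoclose window.
        for client_id, last in list(self.sessions.items()):
            if now - last > MDS_SESSION_AUTOCLOSE:
                del self.sessions[client_id]


class ToyClient:
    """Hypothetical stand-in for the client's view of its session."""

    def __init__(self, client_id):
        self.client_id = client_id
        self.session_state = "open"
        self.con_failed = False  # models the lossy connection marked down

    def handle_mds_map(self, mds, now):
        # Assumed recovery path: reset the connection and re-establish
        # the session with the MDS that is now active.
        self.con_failed = False
        mds.renew(self.client_id, now)
        self.session_state = "open"

    def send_request(self, mds):
        if self.con_failed:
            return False  # message dropped on the failed connection
        return self.client_id in mds.sessions


def simulate(client_gets_mds_map):
    active = ToyMds()
    client = ToyClient("client.87547")
    active.renew(client.client_id, now=0)

    # Network outage: no renewcaps reach the MDS, the connection fails,
    # and the MDS evicts the session once the window has passed.
    client.con_failed = True
    active.tick(now=MDS_SESSION_AUTOCLOSE + 1)

    # Failover: the standby takes over with the post-eviction session table.
    standby = ToyMds()
    standby.sessions = dict(active.sessions)

    if client_gets_mds_map:
        client.handle_mds_map(standby, now=MDS_SESSION_AUTOCLOSE + 2)

    delivered = client.send_request(standby)
    return client.session_state, client.client_id in standby.sessions, delivered


print(simulate(client_gets_mds_map=False))
print(simulate(client_gets_mds_map=True))
```

In the first case the client still believes its session is open while the MDS has no record of it and the request is dropped, which matches the hang reported here. In the second case the (assumed) map handling lets the client recover.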
client log:
2018-09-13 15:30:22.414875 7f5b40989700 21 client.87547 tick
2018-09-13 15:30:22.414923 7f5b40989700 20 client.87547 trim_cache size 0 max 16384
2018-09-13 15:30:23.104030 7f5b39577700 3 client.87547 ll_getattr 1.head
2018-09-13 15:30:23.104067 7f5b39577700 10 client.87547 _getattr mask pAsLsXsFs issued=0
2018-09-13 15:30:23.104082 7f5b39577700 15 Dentry inode.get on 0x7f5b542b7e00 1.head now 4
2018-09-13 15:30:23.104091 7f5b39577700 20 client.87547 choose_target_mds starting with req->inode 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00)
2018-09-13 15:30:23.104123 7f5b39577700 20 client.87547 choose_target_mds 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00) is_hash=0 hash=0
2018-09-13 15:30:23.104138 7f5b39577700 10 client.87547 choose_target_mds from caps on inode 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00)
2018-09-13 15:30:23.104146 7f5b39577700 20 client.87547 mds is 0
2018-09-13 15:30:23.104153 7f5b39577700 10 client.87547 send_request rebuilding request 9 for mds.0mds_name: request op: 257
2018-09-13 15:30:23.104159 7f5b39577700 20 client.87547 encode_cap_releases enter (req: 0x7f5b5431b9c0, mds: 0)
2018-09-13 15:30:23.104163 7f5b39577700 25 client.87547 encode_cap_releases exit (req: 0x7f5b5431b9c0, mds 0
2018-09-13 15:30:23.104165 7f5b39577700 20 client.87547 send_request set sent_stamp to 2018-09-13 15:30:23.104164
2018-09-13 15:30:23.104169 7f5b39577700 10 client.87547 send_request client_request(unknown.0:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 to mds.0
2018-09-13 15:30:23.104184 7f5b39577700 1 -- 192.168.12.201:0/1506388177 --> 192.168.12.201:6803/14343 -- client_request(client.87547:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 -- ?+0 0x7f5b5431aec0 con 0x7f5b5428ea80
2018-09-13 15:30:23.104213 7f5b39577700 0 -- 192.168.12.201:0/1506388177 submit_message client_request(client.87547:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 remote, 192.168.12.201:6803/14343, failed lossy con, dropping message 0x7f5b5431aec0
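For anyone checking whether a client log shows the same symptom, a small script can pick out the dropped requests. This is only a sketch: the pattern is derived from the single "failed lossy con, dropping message" line in the log above, the exact message format in other Ceph versions is an assumption, and find_dropped and SAMPLE are illustrative names.

```python
import re

# Pattern based on the one submit_message line quoted in this report.
DROP_RE = re.compile(
    r"submit_message (?P<msg>.+?) remote, (?P<addr>\S+), "
    r"failed lossy con, dropping message"
)


def find_dropped(log_text):
    """Return (remote address, message) pairs for every dropped message."""
    return [(m.group("addr"), m.group("msg")) for m in DROP_RE.finditer(log_text)]


SAMPLE = (
    "2018-09-13 15:30:23.104213 7f5b39577700 0 -- 192.168.12.201:0/1506388177 "
    "submit_message client_request(client.87547:9 getattr pAsLsXsFs #1 "
    "2018-09-13 15:30:23.104088) v3 remote, 192.168.12.201:6803/14343, "
    "failed lossy con, dropping message 0x7f5b5431aec0"
)

for addr, msg in find_dropped(SAMPLE):
    print(addr, msg)
```

Repeated hits for the same MDS address while the client session dump still shows the session as open would be consistent with the hang described in this report.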
Updated by Patrick Donnelly over 5 years ago
- Project changed from Ceph to CephFS
- Subject changed from ceph fuse hang because it miss reconnect phase when hot standby mds switch occurs to ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs
- Description updated (diff)
- Due date deleted (09/20/2018)
- Status changed from New to Fix Under Review
- Assignee set to Ivan Guan
- Priority changed from Normal to High
- Start date deleted (09/19/2018)
- Source set to Community (dev)
- Backport set to mimic,luminous
- Component(FS) Client, MDS added
Updated by Patrick Donnelly over 5 years ago
- Related to Bug #36507: client: connection failure during reconnect causes client to hang added
Updated by Patrick Donnelly over 5 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 5 years ago
- Copied to Backport #37828: mimic: ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs added
Updated by Nathan Cutler over 5 years ago
- Copied to Backport #37829: luminous: ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs added
Updated by Patrick Donnelly over 5 years ago
- Status changed from Pending Backport to Resolved
Updated by Patrick Donnelly about 5 years ago
- Related to Bug #39305: ceph-fuse: client hang because its bad session PipeConnection to mds added