Bug #36079 (closed): ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs

Added by Ivan Guan over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: High
Assignee: Ivan Guan
Category: -
Target version:
% Done: 0%
Source: Community (dev)
Tags:
Backport: mimic, luminous
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, MDS
Labels (FS):
Pull request ID: 24172
Crash signature (v1):
Crash signature (v2):

Description

Version: jewel (ceph-10.2.2)
MDS mode: active/hot-standby
Description:
As we know, the MDS will kill a session if the client doesn't send renewcaps to it within mds_session_autoclose (default 300s). If the client doesn't receive the handle_mds_map message from the monitor when the active/hot-standby MDS switch occurs, it will hang forever. I dumped the sessions on both the client and MDS sides and found that the client records its session as open, while the MDS side no longer has it. After carefully studying the client and MDS logs, I found that the MDS had already killed the session, but the client didn't know because its network was down at the time. When the active/hot-standby MDS switch then happened, the client still didn't receive the handle_mds_map message due to its bad network. Thus the client thinks its session is fine, but even after the network returns to normal it can't send any message to the MDS, because its pipe has been marked down.
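
To make the race concrete, here is a minimal, self-contained sketch of the timeline (illustrative C++ only; SessionState, MdsSession and the step comments are stand-ins for the real Ceph client/MDS structures, not actual Ceph code):

#include <chrono>
#include <iostream>

// Minimal sketch of the failure timeline described above. All names here
// are illustrative, not the real Ceph types.
enum class SessionState { OPEN, KILLED };

struct MdsSession {
  SessionState mds_side = SessionState::OPEN;    // what the MDS believes
  SessionState client_side = SessionState::OPEN; // what the client believes
  bool pipe_marked_down = false;                 // old connection torn down
};

int main() {
  using namespace std::chrono;
  const seconds mds_session_autoclose{300}; // default 300s

  MdsSession s;

  // 1. The client's network breaks, so no renewcaps reaches the MDS.
  seconds since_last_renewcaps{301};

  // 2. The MDS times the stale session out and kills it.
  if (since_last_renewcaps > mds_session_autoclose) {
    s.mds_side = SessionState::KILLED;
    s.pipe_marked_down = true;
  }

  // 3. The active/hot-standby switch happens; the new active MDS offers a
  //    reconnect window via the MDSMap, but the client never receives
  //    handle_mds_map (network still down), so it misses the reconnect phase.
  bool client_saw_mdsmap = false;

  // 4. The network recovers. The client still believes the session is OPEN,
  //    but every request it submits is dropped on the marked-down pipe.
  if (!client_saw_mdsmap && s.client_side == SessionState::OPEN &&
      s.pipe_marked_down) {
    std::cout << "client hangs: requests dropped, no reply ever comes\n";
  }
  return 0;
}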

client log:

2018-09-13 15:30:22.414875 7f5b40989700 21 client.87547 tick
2018-09-13 15:30:22.414923 7f5b40989700 20 client.87547 trim_cache size 0 max 16384
2018-09-13 15:30:23.104030 7f5b39577700  3 client.87547 ll_getattr 1.head
2018-09-13 15:30:23.104067 7f5b39577700 10 client.87547 _getattr mask pAsLsXsFs issued=0
2018-09-13 15:30:23.104082 7f5b39577700 15 Dentry   inode.get on 0x7f5b542b7e00 1.head now 4
2018-09-13 15:30:23.104091 7f5b39577700 20 client.87547 choose_target_mds starting with req->inode 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00)
2018-09-13 15:30:23.104123 7f5b39577700 20 client.87547 choose_target_mds 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00) is_hash=0 hash=0
2018-09-13 15:30:23.104138 7f5b39577700 10 client.87547 choose_target_mds from caps on inode 1.head(faked_ino=0 ref=4 ll_ref=273 cap_refs={} open={} mode=40755 size=0/0 mtime=2018-09-11 20:13:45.682305 caps=-(0=pAsLsXsFs) has_dir_layout 0x7f5b542b7e00)
2018-09-13 15:30:23.104146 7f5b39577700 20 client.87547 mds is 0
2018-09-13 15:30:23.104153 7f5b39577700 10 client.87547 send_request rebuilding request 9 for mds.0mds_name:  request op: 257
2018-09-13 15:30:23.104159 7f5b39577700 20 client.87547 encode_cap_releases enter (req: 0x7f5b5431b9c0, mds: 0)
2018-09-13 15:30:23.104163 7f5b39577700 25 client.87547 encode_cap_releases exit (req: 0x7f5b5431b9c0, mds 0
2018-09-13 15:30:23.104165 7f5b39577700 20 client.87547 send_request set sent_stamp to 2018-09-13 15:30:23.104164
2018-09-13 15:30:23.104169 7f5b39577700 10 client.87547 send_request client_request(unknown.0:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 to mds.0
2018-09-13 15:30:23.104184 7f5b39577700  1 -- 192.168.12.201:0/1506388177 --> 192.168.12.201:6803/14343 -- client_request(client.87547:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 -- ?+0 0x7f5b5431aec0 con 0x7f5b5428ea80
2018-09-13 15:30:23.104213 7f5b39577700  0 -- 192.168.12.201:0/1506388177 submit_message client_request(client.87547:9 getattr pAsLsXsFs #1 2018-09-13 15:30:23.104088) v3 remote, 192.168.12.201:6803/14343, failed lossy con, dropping message 0x7f5b5431aec0
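
The last log line is the key symptom: the client's messenger treats its connection to the MDS as lossy (the log says so explicitly), and once that pipe has failed it silently drops any message submitted to it rather than reconnecting. A simplified illustration of that drop path follows (Pipe and Policy are assumed stand-ins, not the real SimpleMessenger types):

#include <cstdio>

// Simplified illustration of the "failed lossy con, dropping message"
// behaviour seen in the last log line above. Illustrative types only.
struct Pipe { bool failed = true; };    // the old pipe was marked down
struct Policy { bool lossy = true; };   // the client->MDS con is lossy

// On a lossy connection a failed pipe is never implicitly rebuilt: the
// message is dropped and the sender gets no error back, so from the
// application's point of view the request simply hangs forever.
bool submit_message(const char* msg, Pipe& pipe, const Policy& policy) {
  if (pipe.failed && policy.lossy) {
    std::printf("failed lossy con, dropping message %s\n", msg);
    return false; // the caller is never notified of the drop
  }
  return true;
}

int main() {
  Pipe pipe;
  Policy policy;
  submit_message("client_request(getattr ...)", pipe, policy);
  return 0;
}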


Related issues 4 (0 open, 4 closed)

Related to CephFS - Bug #36507: client: connection failure during reconnect causes client to hang (Duplicate) - Patrick Donnelly

Related to CephFS - Bug #39305: ceph-fuse: client hang because its bad session PipeConnection to mds (Resolved) - Ivan Guan

Copied to CephFS - Backport #37828: mimic: ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs (Resolved) - Prashant D

Copied to CephFS - Backport #37829: luminous: ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs (Resolved) - Prashant D
#1

Updated by Patrick Donnelly over 5 years ago

  • Project changed from Ceph to CephFS
  • Subject changed from ceph fuse hang because it miss reconnect phase when hot standby mds switch occurs to ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs
  • Description updated (diff)
  • Due date deleted (09/20/2018)
  • Status changed from New to Fix Under Review
  • Assignee set to Ivan Guan
  • Priority changed from Normal to High
  • Start date deleted (09/19/2018)
  • Source set to Community (dev)
  • Backport set to mimic,luminous
  • Component(FS) Client, MDS added
#2

Updated by Patrick Donnelly over 5 years ago

  • Related to Bug #36507: client: connection failure during reconnect causes client to hang added
#3

Updated by Patrick Donnelly over 5 years ago

#36507 is kinda related.

#4

Updated by Patrick Donnelly over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
#5

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #37828: mimic: ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs added
#6

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #37829: luminous: ceph-fuse: hang because it miss reconnect phase when hot standby mds switch occurs added
#7

Updated by Patrick Donnelly over 5 years ago

  • Pull request ID set to 24172
#8

Updated by Patrick Donnelly about 5 years ago

  • Status changed from Pending Backport to Resolved
#9

Updated by Patrick Donnelly about 5 years ago

  • Related to Bug #39305: ceph-fuse: client hang because its bad session PipeConnection to mds added