Bug #18579

Fuse client has "opening" session to nonexistent MDS rank after MDS cluster shrink

Added by John Spray over 2 years ago. Updated 7 months ago.

Status: Resolved
Priority: High
Category: -
Target version:
Start date: 01/18/2017
Due date:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client
Labels (FS): multimds
Pull request ID:
Description

mds_sessions

    "sessions": [
        {
            "mds": 0,
            "addr": "194.12.182.59:6812\/3223369410",
            "seq": 11847,
            "cap_gen": 0,
            "cap_ttl": "2017-01-18 10:03:38.879099",
            "last_cap_renew_request": "2017-01-18 10:02:38.879099",
            "cap_renew_seq": 20,
            "num_caps": 4782,
            "state": "open" 
        },
        {
            "mds": 1,
            "addr": "194.12.182.59:6813\/197224361",
            "seq": 0,
            "cap_gen": 0,
            "cap_ttl": "0.000000",
            "last_cap_renew_request": "0.000000",
            "cap_renew_seq": 0,
            "num_caps": 0,
            "state": "opening" 
        }
    ],

mds_requests
{
    "request": {
        "tid": 5837,
        "op": "mkdir",
        "path": "#20000000274\/fssnap.d",
        "path2": "",
        "ino": "20000000274",
        "dentry": "fssnap.d",
        "hint_ino": "0",
        "sent_stamp": "2017-01-18 09:59:38.913936",
        "mds": -1,
        "resend_mds": -1,
        "send_to_auth": 0,
        "sent_on_mseq": 0,
        "retry_attempt": 0,
        "got_unsafe": 0,
        "uid": 1000,
        "gid": 1000,
        "oldest_client_tid": 5837,
        "mdsmap_epoch": 0,
        "flags": 0,
        "num_retry": 0,
        "num_fwd": 0,
        "num_releases": 0,
        "abort_rc": 0
    }
}


Related issues

Related to fs - Feature #10792: qa: enable thrasher for MDS cluster size (vary max_mds) Resolved 02/07/2015

History

#1 Updated by John Spray over 2 years ago

  • Related to Feature #10792: qa: enable thrasher for MDS cluster size (vary max_mds) added

#2 Updated by John Spray over 2 years ago

  • Description updated (diff)

Logs in /home/jspray/18579 (should be world readable) on teuthology

#3 Updated by Patrick Donnelly over 2 years ago

  • Assignee set to Patrick Donnelly

I'll take this one.

#4 Updated by Patrick Donnelly over 2 years ago

  • Status changed from New to In Progress

Sorry for the delay updating this.

The issue appears to be that the stopping MDS is still authoritative for the parent directory of the mkdir op, so the remaining MDS forwards the client's request to it:

2017-01-18 09:59:39.322119 7f4f0d9e8700 10 client.4118 send_request client_request(unknown.0:5837 mkdir #20000000274/fssnap.d 2017-01-18 09:59:38.913857 RETRY=1 caller_uid=1000, caller_gid=1000{}) v4 to mds.0
2017-01-18 09:59:39.322184 7f4f0d9e8700 20 client.4118 awaiting reply|forward|kick on 0x7f4f0d9e6b60
2017-01-18 09:59:39.325627 7f4f161f9700 10 client.4118 handle_client_request tid 5837 fwd 1 to mds.1, resending to 1
2017-01-18 09:59:39.325636 7f4f0d9e8700 10 client.4118 choose_target_mds resend_mds specified as mds.1
2017-01-18 09:59:39.325641 7f4f0d9e8700 20 client.4118 mds is 1
2017-01-18 09:59:39.325648 7f4f0d9e8700 10 client.4118 _open_mds_session mds.1
2017-01-18 09:59:39.325926 7f4f0d9e8700 10 client.4118 waiting for session to mds.1 to open
2017-01-18 09:59:39.329650 7f4f161f9700 10 client.4118 ms_handle_connect on 194.12.182.59:6813/197224361

mds.0 is forwarding the request:

2017-01-18 09:59:39.325094 7f8eee9ff700 10 MDSInternalContextBase::complete: 18C_MDS_TryFindInode
2017-01-18 09:59:39.325096 7f8eee9ff700  7 mds.0.server dispatch_client_request client_request(client.4118:5837 mkdir #20000000274/fssnap.d 2017-01-18 09:59:38.913857 RETRY=1 caller_uid=1000, caller_gid=1000{}) v4
2017-01-18 09:59:39.325105 7f8eee9ff700 10 mds.0.server rdlock_path_xlock_dentry request(client.4118:5837 cr=0x560033c2de40) #20000000274/fssnap.d
2017-01-18 09:59:39.325111 7f8eee9ff700 10 mds.0.server traverse_to_auth_dir dirpath #20000000274 dname fssnap.d
2017-01-18 09:59:39.325113 7f8eee9ff700  7 mds.0.cache traverse: opening base ino 20000000274 snap head
2017-01-18 09:59:39.325117 7f8eee9ff700 10 mds.0.cache path_traverse finish on snapid head
2017-01-18 09:59:39.325119 7f8eee9ff700  7 mds.0.server try_open_auth_dirfrag: not open, not inode auth, fw to mds.1

and mds.1 is just sitting on the session request because it is up:stopping:

2017-01-18 09:59:39.328549 7f2e25e4b700  1 -- 194.12.182.59:6813/197224361 >> - conn(0x556bf8600800 :6813 s=STATE_ACCEPTING pgs=0 cs=0 l=0)._process_connection sd=30 -
2017-01-18 09:59:39.329005 7f2e25e4b700 10 cephx: verify_authorizer decrypted service mds secret_id=2
2017-01-18 09:59:39.329109 7f2e25e4b700 10 cephx: verify_authorizer global_id=4118
2017-01-18 09:59:39.329151 7f2e25e4b700 10 cephx: verify_authorizer ok nonce 507ed7ab2eb141f2 reply_bl.length()=36
2017-01-18 09:59:39.329172 7f2e25e4b700 10 mds.client.admin  new session 0x556bf9fee000 for client.4118 194.12.182.59:0/1644907610 con 0x556bf8600800
2017-01-18 09:59:39.329198 7f2e25e4b700 10 mds.client.admin ms_verify_authorizer: parsing auth_cap_str='allow *'
2017-01-18 09:59:39.329388 7f2e25e4b700 10 In get_auth_session_handler for protocol 2
2017-01-18 09:59:39.329565 7f2e23eb1700 10 mds.d ms_handle_accept 194.12.182.59:0/1644907610 con 0x556bf8600800 session 0x556bf9fee000
2017-01-18 09:59:39.330187 7f2e25e4b700 10 _calc_signature seq 1 front_crc_ = 810075477 middle_crc = 0 data_crc = 0 sig = 5199377228323527851
2017-01-18 09:59:39.330298 7f2e23eb1700  1 -- 194.12.182.59:6813/197224361 <== client.4118 194.12.182.59:0/1644907610 1 ==== client_session(request_open) v2 ==== 307+0+0 (810075477 0 0) 0x556bf99e6640 con 0x556bf8600800
2017-01-18 09:59:39.330344 7f2e23eb1700  3 mds.1.server not active yet, waiting

I think the right approach to handle this is to have the stopping mds reject new client sessions. However, some of the client logic suggests this should be allowed. Any opinions?

#5 Updated by Patrick Donnelly over 2 years ago

After discussion during standup, a few things probably need to happen here:

  • Upon receiving a new MDSMap, the client should close down any sessions that are still "opening" with a stopping MDS. It should also not initiate a session with an MDS it knows to be stopping.
  • A stopping MDS should not "sit on" a session request from a client until it becomes active. In this particular case, forwarding the request to an active MDS or simply closing the session may be appropriate.
  • An MDS should not forward a request to an MDS it knows to be stopping. Instead, it could (a) sit on the request until the authority for the directory changes, or (b) tell the client to retry in the future.

#6 Updated by Zheng Yan over 2 years ago

  • Status changed from In Progress to Need Review

#7 Updated by John Spray over 2 years ago

  • Status changed from Need Review to Resolved

#8 Updated by Patrick Donnelly 7 months ago

  • Category deleted (90)
  • Labels (FS) multimds added
