Bug #56116

closed

mds: handle deferred client request core when mds reboot

Added by Mer Xuanyi almost 2 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version:
% Done: 100%
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS, cephfs.pyx
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the MDS reboots, the client sends its `mds_requests` and a `client_reconnect` message to the MDS.

If the MDS does not receive the `client_reconnect` message within `mds_reconnect_timeout`, it kills the client session and moves on to the next phase (reconnect -> rejoin).
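
The timeout rule above can be sketched as a toy model in Python. This is not Ceph code: `ReconnectPhase`, `on_client_reconnect` and `tick` are hypothetical names, `client.4446` is an invented second client, and the 45-second value is only an illustrative stand-in for `mds_reconnect_timeout`. The model also ignores the case where every client reconnects before the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class ReconnectPhase:
    """Toy model of the MDS reconnect phase (hypothetical, not Ceph code)."""
    timeout: float                                   # stands in for mds_reconnect_timeout
    started_at: float = 0.0
    sessions: set = field(default_factory=set)       # sessions known from before the reboot
    reconnected: set = field(default_factory=set)    # clients whose client_reconnect arrived
    state: str = "reconnect"

    def on_client_reconnect(self, client: str) -> None:
        # Only count reconnects that arrive while still in the reconnect phase.
        if self.state == "reconnect" and client in self.sessions:
            self.reconnected.add(client)

    def tick(self, now: float) -> list:
        """Advance the clock; return the sessions killed at the timeout."""
        if self.state != "reconnect" or now - self.started_at < self.timeout:
            return []
        killed = sorted(self.sessions - self.reconnected)
        self.sessions -= set(killed)
        self.state = "rejoin"                        # reconnect -> rejoin
        return killed


phase = ReconnectPhase(timeout=45.0, sessions={"client.4445", "client.4446"})
phase.on_client_reconnect("client.4446")
print(phase.tick(now=10.0))   # before the timeout: nothing is killed
print(phase.tick(now=45.0))   # at the timeout: the silent client's session is killed
print(phase.state)
```

In this sketch the killed client keeps whatever requests it had already sent, which is the starting point for the problem described next.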

The MDS handles the client requests it has already received once its state changes to active.

However, if MDCache is not ready yet, these messages are pushed onto the mdcache->waiting_for_root queue.
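
The deferral described above can be illustrated with a minimal sketch, assuming a queue of retry callbacks that is drained once the cache is ready. `ToyMDCache`, `wait_for_root`, `mark_root_ready` and `dispatch` are hypothetical names for this example and do not reproduce the real MDCache interface.

```python
from collections import deque


class ToyMDCache:
    """Toy stand-in for MDCache (hypothetical, not Ceph code)."""

    def __init__(self):
        self.root_ready = False
        self.waiting_for_root = deque()   # retry callbacks, in arrival order

    def wait_for_root(self, retry):
        self.waiting_for_root.append(retry)

    def mark_root_ready(self):
        self.root_ready = True
        # Drain the queue in arrival order, re-dispatching each deferred message.
        while self.waiting_for_root:
            self.waiting_for_root.popleft()()


handled = []


def dispatch(cache, msg):
    """Handle msg now if the cache is ready, otherwise defer it."""
    if not cache.root_ready:
        cache.wait_for_root(lambda: dispatch(cache, msg))
        return
    handled.append(msg)


cache = ToyMDCache()
dispatch(cache, "client_request(1)")
dispatch(cache, "client_request(2)")
print(handled)              # nothing handled while the cache is not ready
cache.mark_root_ready()
print(handled)              # both requests handled, in arrival order
```

The property that matters for this bug is that a deferred message is replayed later, against whatever session state exists at that later time.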

Meanwhile, the client still has unfinished mds_requests, so it tries to re-establish the session with the MDS even though the MDS has already killed the old session. To do so, it sends request_open to the MDS.

If the MDS handles this client session message before MDCache is ready, the new session is added to the MDS's sessionmap.

Once MDCache is ready, the MDS retries the deferred client requests and crashes, because it mistakes the client's new session for an imported session:

 1: (()+0xf100) [0x7f1033c30100]
 2: (Mutex::lock(bool)+0x9) [0x7f1035e33cf9]
 3: (MDSRank::get_session(boost::intrusive_ptr<Message const> const&)+0x92a) [0x7f103ee8f27a]
 4: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x504) [0x7f103ef0c9c4]
 5: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x7f103ef18162]
 6: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x7f103ee8be8c]
 7: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x7f103ee8e2fa]
 8: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x7f103ee8e942]
 9: (MDSContext::complete(int)+0x74) [0x7f103f0ff4b4]
2022-06-07T03:50:44.372+0800 7fffe6b2c700  5 mds.beacon.a set_want_state: up:replay -> up:reconnect
2022-06-07T03:50:46.524+0800 7fffee33b700  3 mds.0.server not active yet, waiting
2022-06-07T03:50:53.860+0800 7fffecb38700 10 mds.0.server kill_session 0x55555b4e2300
2022-06-07T03:50:53.860+0800 7fffecb38700  5 mds.beacon.a set_want_state: up:reconnect -> up:rejoin
2022-06-07T03:50:54.869+0800 7fffee33b700  5 mds.beacon.a set_want_state: up:rejoin -> up:active
2022-06-07T03:51:19.690+0800 7fffee33b700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:19.690+0800 7fffee33b700  5 mds.0.server waiting for root
2022-06-07T03:51:19.915+0800 7fffee33b700 10 mds.0.sessionmap add_session s=0x55555b57e000 name=client.4445
2022-06-07T03:51:38.962+0800 7fffe832f700 10 mds.0.cache populate_mydir done
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 mds.0.215 get_session replacing connection bootstrap session 0x55555b4e2300 with imported session 0x55555b57e000
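
The sequence in the description and the log above can be replayed in a small Python model. This is a sketch under stated assumptions, not the MDS implementation: `ToyMDS` and its methods are hypothetical, and the condition inside `get_session` is an assumption chosen to reproduce the "replacing connection bootstrap session ... with imported session" log line, not the actual check in `MDSRank::get_session`. The model only shows the misidentification; it does not reproduce the crash itself.

```python
class Session:
    def __init__(self, name):
        self.name = name
        self.state = "open"


class ToyMDS:
    """Toy model of the sequence in this report (hypothetical, not Ceph code)."""

    def __init__(self):
        self.sessionmap = {}      # client name -> Session
        self.cache_ready = False
        self.deferred = []        # (session the request arrived on, request)
        self.log = []

    def add_session(self, name):
        session = Session(name)
        self.sessionmap[name] = session
        return session

    def kill_session(self, session):
        session.state = "closed"
        if self.sessionmap.get(session.name) is session:
            del self.sessionmap[session.name]

    def handle_client_request(self, session, request):
        if not self.cache_ready:
            self.deferred.append((session, request))   # "waiting for root"
            return
        self.log.append(self.get_session(session))

    def get_session(self, msg_session):
        # Assumed rule: a closed session on the message plus a different
        # session under the same name in the map is read as "imported".
        current = self.sessionmap.get(msg_session.name)
        if (msg_session.state == "closed" and current is not None
                and current is not msg_session):
            return "replaced with imported session"
        return "used message session"

    def mark_cache_ready(self):
        self.cache_ready = True
        pending, self.deferred = self.deferred, []
        for session, request in pending:
            self.handle_client_request(session, request)


mds = ToyMDS()
old = mds.add_session("client.4445")            # session from before the reboot
mds.handle_client_request(old, "request A")     # deferred: cache not ready
mds.kill_session(old)                           # reconnect timeout expired
new = mds.add_session("client.4445")            # client's request_open
mds.mark_cache_ready()                          # deferred request is retried
print(mds.log)
```

In the model, the deferred request still refers to the killed session, while the sessionmap now holds the session created by request_open, so the retry takes the "imported session" path even though nothing was imported.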

Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #57110: pacific: mds: handle deferred client request core when mds reboot | Resolved | Konstantin Shalygin
Copied to CephFS - Backport #57111: quincy: mds: handle deferred client request core when mds reboot | Resolved | Konstantin Shalygin