Bug #56116

closed

mds: handle deferred client request core when mds reboot

Added by Mer Xuanyi almost 2 years ago. Updated 3 months ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version:
% Done: 100%
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS, cephfs.pyx
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When the mds reboots, the client sends its pending `mds_requests` and a `client_reconnect` message to the mds.

If the mds does not receive the `client_reconnect` message within `mds_reconnect_timeout`, it kills the client session and moves on to the next phase (reconnect -> rejoin).

The mds then handles the client requests it has already received once its state changes to active.

But if MDCache is not ready yet, these messages are pushed onto the mdcache->waiting_for_root queue.

Meanwhile, the client tries to rebuild the session with the mds even though the mds has already killed the old session (the client still has unfinished mds_requests), so the client sends request_open to the mds.

If the mds handles this client session message before MDCache is ready, the new session is added to the mds sessionmap.

Once MDCache becomes ready, the mds retries the deferred client request and crashes, because it mistakes the new session for an imported session and tries to replace the request's old, already killed session with it:

 1: (()+0xf100) [0x7f1033c30100]
 2: (Mutex::lock(bool)+0x9) [0x7f1035e33cf9]
 3: (MDSRank::get_session(boost::intrusive_ptr<Message const> const&)+0x92a) [0x7f103ee8f27a]
 4: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x504) [0x7f103ef0c9c4]
 5: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x7f103ef18162]
 6: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x7f103ee8be8c]
 7: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x7f103ee8e2fa]
 8: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x7f103ee8e942]
 9: (MDSContext::complete(int)+0x74) [0x7f103f0ff4b4]
2022-06-07T03:50:44.372+0800 7fffe6b2c700  5 mds.beacon.a set_want_state: up:replay -> up:reconnect
2022-06-07T03:50:46.524+0800 7fffee33b700  3 mds.0.server not active yet, waiting
2022-06-07T03:50:53.860+0800 7fffecb38700 10 mds.0.server kill_session 0x55555b4e2300
2022-06-07T03:50:53.860+0800 7fffecb38700  5 mds.beacon.a set_want_state: up:reconnect -> up:rejoin
2022-06-07T03:50:54.869+0800 7fffee33b700  5 mds.beacon.a set_want_state: up:rejoin -> up:active
2022-06-07T03:51:19.690+0800 7fffee33b700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:19.690+0800 7fffee33b700  5 mds.0.server waiting for root
2022-06-07T03:51:19.915+0800 7fffee33b700 10 mds.0.sessionmap add_session s=0x55555b57e000 name=client.4445
2022-06-07T03:51:38.962+0800 7fffe832f700 10 mds.0.cache populate_mydir done
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 mds.0.215 get_session replacing connection bootstrap session 0x55555b4e2300 with imported session 0x55555b57e000
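
The ordering described above can be modeled with a small toy simulation. This is only a sketch under stated assumptions: `ToyMDS`, `Session`, `replay_scenario` and the outcome strings are hypothetical names, not Ceph code, and the final identity check only marks the point where the real mds crashed. It is not meant to describe the actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy stand-in for an mds session (hypothetical, not Ceph's Session class)."""
    client: str
    closed: bool = False


@dataclass
class ToyMDS:
    """Toy model of the states named in the description; all names are illustrative."""
    state: str = "reconnect"
    cache_ready: bool = False
    sessionmap: dict = field(default_factory=dict)
    waiting_for_active: list = field(default_factory=list)
    waiting_for_root: list = field(default_factory=list)

    def add_session(self, client):
        session = Session(client)
        self.sessionmap[client] = session
        return session

    def kill_session(self, client):
        # Remove the session from the map and mark the object as closed.
        self.sessionmap.pop(client).closed = True

    def handle_client_request(self, client, session):
        if self.state != "active":
            self.waiting_for_active.append((client, session))
            return "deferred: not active"
        if not self.cache_ready:
            self.waiting_for_root.append((client, session))
            return "deferred: waiting for root"
        if self.sessionmap.get(client) is not session:
            # The request still refers to the killed session object.
            return "stale session"
        return "handled"

    def retry(self, queue_name):
        # Re-dispatch everything that was parked on the named queue.
        queue = getattr(self, queue_name)
        pending = list(queue)
        queue.clear()
        return [self.handle_client_request(c, s) for c, s in pending]


def replay_scenario():
    mds = ToyMDS()
    old = mds.add_session("client.4445")
    outcomes = [mds.handle_client_request("client.4445", old)]
    mds.kill_session("client.4445")   # reconnect timeout expires
    mds.state = "active"              # reconnect -> rejoin -> active
    outcomes += mds.retry("waiting_for_active")
    mds.add_session("client.4445")    # client re-opens its session
    mds.cache_ready = True            # populate_mydir done
    outcomes += mds.retry("waiting_for_root")
    return outcomes
```

Calling `replay_scenario()` returns `["deferred: not active", "deferred: waiting for root", "stale session"]`: the request is parked twice, and by the time it is finally dispatched the sessionmap holds a different session object for the same client.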

Related issues 2 (0 open, 2 closed)

Copied to CephFS - Backport #57110: pacific: mds: handle deferred client request core when mds reboot (Resolved, Konstantin Shalygin)
Copied to CephFS - Backport #57111: quincy: mds: handle deferred client request core when mds reboot (Resolved, Konstantin Shalygin)
Actions #1

Updated by Venky Shankar almost 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46750
Actions #2

Updated by Venky Shankar almost 2 years ago

  • Backport set to quincy, pacific
Actions #3

Updated by Venky Shankar almost 2 years ago

Hi,

Do you have a specific reproducer for this (in the form of a workload)?

Cheers,
Venky

Actions #4

Updated by Mer Xuanyi almost 2 years ago

Venky Shankar wrote:

Hi,

Do you have a specific reproducer for this (in the form of a workload)?

Cheers,
Venky

Hi, you can reproduce this with the following steps:

0. start mds and ceph-fuse under gdb, with non-stop mode enabled (set non-stop on)
1. set a breakpoint at Server::handle_client_request (#b1) in mds
2. send a client request from the client (e.g. a mkdir)
3. kill mds from gdb without letting it process this client request
4. disable #b1, set breakpoints at Client::early_kick_flushing_caps (#b2) in the client and MDCache::populate_mydir (#b3) in mds
5. reboot mds
6. the client will stop at #b2 while mds is in the reconnect state; wait until mds logs kill_session
7. continue the client
8. mds will stop at #b3 in the rejoin phase; disable #b3 and set a new breakpoint at MDSIOContextBase::complete (#b4)
9. continue mds
10. mds will stop at #b4; do nothing until mds has logged add_session for the new session in the sessionmap
11. disable #b4, continue mds
12. mds will crash after printing "get_session replacing connection bootstrap session ..."
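
To confirm that a run of the steps above hit the same path, one option is to scan the mds log for the messages from the excerpt in the description, in the same order. The following is a minimal sketch: the fragments are copied from that log excerpt, while `SIGNATURE` and `matches_signature` are hypothetical helper names, and the exact wording may differ between Ceph versions and debug levels.

```python
# Message fragments copied from the log excerpt in this report, in order.
SIGNATURE = [
    "kill_session",
    "waiting for root",
    "add_session",
    "populate_mydir done",
    "get_session replacing connection bootstrap session",
]


def matches_signature(log_lines):
    """Return True if every fragment appears in order (other lines may intervene)."""
    idx = 0
    for line in log_lines:
        if idx < len(SIGNATURE) and SIGNATURE[idx] in line:
            idx += 1
    return idx == len(SIGNATURE)
```

For example, `matches_signature(open(path))` could be run over an mds log file, where `path` is wherever the log is stored on the test machine.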

Settings:
client_reconnect_stale: true
mds_session_blocklist_on_evict: false
mds_session_blacklist_on_timeout: false
mds_reconnect_timeout: 5
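
If it helps, the settings above can be rendered as a ceph.conf-style fragment. This is a sketch only: the option names and values are the ones listed above, but placing `client_reconnect_stale` under `[client]` and the rest under `[mds]` is an assumption, and `SETTINGS` and `render_conf` are hypothetical names.

```python
import configparser
import io

# Option names and values as listed in this comment; the section
# assignment ([client] vs [mds]) is an assumption.
SETTINGS = {
    "client": {"client_reconnect_stale": "true"},
    "mds": {
        "mds_session_blocklist_on_evict": "false",
        "mds_session_blacklist_on_timeout": "false",
        "mds_reconnect_timeout": "5",
    },
}


def render_conf(settings):
    """Render the nested dict as INI text using the standard library."""
    parser = configparser.ConfigParser()
    parser.read_dict(settings)
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()
```

`print(render_conf(SETTINGS))` emits one `[client]` and one `[mds]` section with `key = value` lines that can be pasted into a test cluster's configuration.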

Actions #5

Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #6

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57110: pacific: mds: handle deferred client request core when mds reboot added
Actions #7

Updated by Backport Bot over 1 year ago

  • Copied to Backport #57111: quincy: mds: handle deferred client request core when mds reboot added
Actions #8

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #9

Updated by Konstantin Shalygin 3 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100