Bug #56116

mds: handle deferred client request core when mds reboot

Added by Mer Xuanyi 6 months ago. Updated 4 months ago.

Status: Pending Backport
Priority: Normal
Assignee: -
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS, cephfs.pyx
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When an mds restarts, the client sends its pending `mds_requests` and a `client_reconnect` message to the mds.

If the mds does not receive the `client_reconnect` message within `mds_reconnect_timeout`, it kills the client session and moves to the next phase (reconnect -> rejoin).

The mds handles the client requests it has already received once its state changes to active.

However, if the MDCache is not ready yet, these messages are pushed onto the mdcache->waiting_for_root queue.

Meanwhile, the client tries to rebuild its session with the mds even though the mds has already killed the old session (the client still has unfinished mds_requests), so the client sends request_open to the mds.

If the mds handles this client session message before the MDCache is ready, the new session is added to the mds' sessionmap.

Once the MDCache becomes ready, the mds retries the deferred client request and crashes, because it mistakes the client's new session for an imported session:

 1: (()+0xf100) [0x7f1033c30100]
 2: (Mutex::lock(bool)+0x9) [0x7f1035e33cf9]
 3: (MDSRank::get_session(boost::intrusive_ptr<Message const> const&)+0x92a) [0x7f103ee8f27a]
 4: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x504) [0x7f103ef0c9c4]
 5: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x7f103ef18162]
 6: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x7f103ee8be8c]
 7: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x7f103ee8e2fa]
 8: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x7f103ee8e942]
 9: (MDSContext::complete(int)+0x74) [0x7f103f0ff4b4]
2022-06-07T03:50:44.372+0800 7fffe6b2c700  5 mds.beacon.a set_want_state: up:replay -> up:reconnect
2022-06-07T03:50:46.524+0800 7fffee33b700  3 mds.0.server not active yet, waiting
2022-06-07T03:50:53.860+0800 7fffecb38700 10 mds.0.server kill_session 0x55555b4e2300
2022-06-07T03:50:53.860+0800 7fffecb38700  5 mds.beacon.a set_want_state: up:reconnect -> up:rejoin
2022-06-07T03:50:54.869+0800 7fffee33b700  5 mds.beacon.a set_want_state: up:rejoin -> up:active
2022-06-07T03:51:19.690+0800 7fffee33b700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:19.690+0800 7fffee33b700  5 mds.0.server waiting for root
2022-06-07T03:51:19.915+0800 7fffee33b700 10 mds.0.sessionmap add_session s=0x55555b57e000 name=client.4445
2022-06-07T03:51:38.962+0800 7fffe832f700 10 mds.0.cache populate_mydir done
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 mds.0.215 get_session replacing connection bootstrap session 0x55555b4e2300 with imported session 0x55555b57e000
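
The ordering described above (kill_session, request deferred while waiting for root, add_session for the same client name, then retry of the deferred request) can be illustrated with a small toy model. This is only a sketch under stated assumptions: `Session`, `ToyMds`, and `replay_scenario` are hypothetical names invented for illustration, the log strings are not real mds output, and none of this is Ceph code or the actual `MDSRank::get_session` logic. It only shows how a deferred request that still refers to a killed session can meet a different session registered under the same name.

```python
from collections import deque


class Session:
    """Hypothetical stand-in for a client session (not Ceph's Session class)."""

    def __init__(self, name):
        self.name = name
        self.closed = False


class ToyMds:
    """Toy model of the ordering only; all names here are assumptions."""

    def __init__(self):
        self.sessionmap = {}             # client name -> Session
        self.waiting_for_root = deque()  # deferred client requests
        self.cache_ready = False
        self.log = []

    def add_session(self, name):
        session = Session(name)
        self.sessionmap[name] = session
        return session

    def kill_session(self, session):
        # Models the reconnect timeout expiring for this client.
        session.closed = True
        self.sessionmap.pop(session.name, None)
        self.log.append("kill_session")

    def handle_client_request(self, conn_session):
        # conn_session is the session still attached to the request's connection.
        if not self.cache_ready:
            self.waiting_for_root.append(conn_session)
            self.log.append("waiting for root")
            return
        current = self.sessionmap.get(conn_session.name)
        if conn_session.closed and current is not None and current is not conn_session:
            # The killed session and the freshly opened one share a name, so the
            # new one looks like an "imported" session for the old connection.
            self.log.append("stale session paired with new session")
        else:
            self.log.append("request handled")

    def mark_cache_ready(self):
        self.cache_ready = True
        while self.waiting_for_root:
            self.handle_client_request(self.waiting_for_root.popleft())


def replay_scenario():
    mds = ToyMds()
    old = mds.add_session("client.4445")
    mds.kill_session(old)            # no client_reconnect within the timeout
    mds.handle_client_request(old)   # deferred: cache not ready yet
    mds.add_session("client.4445")   # client's request_open handled early
    mds.mark_cache_ready()           # deferred request is retried
    return mds.log
```

Calling `replay_scenario()` returns the sequence of events the toy model recorded; the last entry marks the point where the retried request sees a closed session on its connection and a different session under the same client name in the sessionmap.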

Related issues

Copied to CephFS - Backport #57110: pacific: mds: handle deferred client request core when mds reboot New
Copied to CephFS - Backport #57111: quincy: mds: handle deferred client request core when mds reboot New

History

#1 Updated by Venky Shankar 6 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46750

#2 Updated by Venky Shankar 6 months ago

  • Backport set to quincy, pacific

#3 Updated by Venky Shankar 6 months ago

Hi,

Do you have a specific reproducer for this (in the form of a workload)?

Cheers,
Venky

#4 Updated by Mer Xuanyi 6 months ago

Venky Shankar wrote:

Hi,

Do you have a specific reproducer for this (in the form of a workload)?

Cheers,
Venky

Hi, you can reproduce this with these steps:

0. Start the mds and ceph-fuse under gdb, with non-stop mode on.
1. Set a breakpoint at Server::handle_client_request (#b1) in the mds.
2. Send a client_request from the client (for example, a mkdir).
3. Kill the mds from gdb without processing this client_request.
4. Disable #b1; set breakpoints at Client::early_kick_flushing_caps (#b2) in the client and MDCache::populate_mydir (#b3) in the mds.
5. Restart the mds.
6. The client will stop at #b2 while the mds is in the reconnect state; wait until kill_session.
7. Continue the client.
8. The mds will stop at #b3 in the rejoin phase; disable #b3 and set a new breakpoint at MDSIOContextBase::complete (#b4).
9. Continue the mds.
10. The mds will stop at #b4; do nothing until the mds has done add_session into the sessionmap.
11. Disable #b4 and continue the mds.
12. The mds will crash after printing "get_session replacing connection bootstrap session ...".

Settings:
client_reconnect_stale: true
mds_session_blocklist_on_evict: false
mds_session_blacklist_on_timeout: false
mds_reconnect_timeout: 5

#5 Updated by Venky Shankar 4 months ago

  • Status changed from Fix Under Review to Pending Backport

#6 Updated by Backport Bot 4 months ago

  • Copied to Backport #57110: pacific: mds: handle deferred client request core when mds reboot added

#7 Updated by Backport Bot 4 months ago

  • Copied to Backport #57111: quincy: mds: handle deferred client request core when mds reboot added

#8 Updated by Backport Bot 4 months ago

  • Tags set to backport_processed
