Bug #56116: mds: handle deferred client request core when mds reboot - CephFS - Ceph

Bug #56116

closed

mds: handle deferred client request core when mds reboot

Added by Mer Xuanyi almost 2 years ago. Updated 3 months ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

Correctness/Safety

Target version:

Ceph - v18.0.0

% Done:

100%

Source:

Tags:

backport_processed

Backport:

quincy, pacific

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

Ceph - v18.0.0

ceph-qa-suite:

Component(FS):

MDS, cephfs.pyx

Labels (FS):

Pull request ID:

46750

Crash signature (v1):

Crash signature (v2):

Description

When the MDS reboots, the client sends `mds_requests` and `client_reconnect` to the MDS.

If the MDS does not receive the `client_reconnect` message within `mds_reconnect_timeout`, it kills the client session and moves on to the next phase (reconnect -> rejoin).

The MDS handles the client requests it has already received once its state changes to active.

But if MDCache is not ready yet, these messages are pushed onto the mdcache->waiting_for_root queue.

Meanwhile, the client tries to rebuild the session with the MDS even though the MDS has already killed the old session (the client still has unfinished mds_requests), so the client sends request_open to the MDS.

If the MDS handles this client session message before MDCache is ready, the new session is added to the MDS sessionmap.

Once MDCache becomes ready, the MDS retries the queued client requests and crashes, because get_session treats the client's new session in the sessionmap as an imported session and tries to replace the old, already killed session attached to the request with it:

 1: (()+0xf100) [0x7f1033c30100]
 2: (Mutex::lock(bool)+0x9) [0x7f1035e33cf9]
 3: (MDSRank::get_session(boost::intrusive_ptr<Message const> const&)+0x92a) [0x7f103ee8f27a]
 4: (Server::handle_client_request(boost::intrusive_ptr<MClientRequest const> const&)+0x504) [0x7f103ef0c9c4]
 5: (Server::dispatch(boost::intrusive_ptr<Message const> const&)+0x122) [0x7f103ef18162]
 6: (MDSRank::handle_deferrable_message(boost::intrusive_ptr<Message const> const&)+0x6dc) [0x7f103ee8be8c]
 7: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7fa) [0x7f103ee8e2fa]
 8: (MDSRank::retry_dispatch(boost::intrusive_ptr<Message const> const&)+0x12) [0x7f103ee8e942]
 9: (MDSContext::complete(int)+0x74) [0x7f103f0ff4b4]

2022-06-07T03:50:44.372+0800 7fffe6b2c700  5 mds.beacon.a set_want_state: up:replay -> up:reconnect
2022-06-07T03:50:46.524+0800 7fffee33b700  3 mds.0.server not active yet, waiting
2022-06-07T03:50:53.860+0800 7fffecb38700 10 mds.0.server kill_session 0x55555b4e2300
2022-06-07T03:50:53.860+0800 7fffecb38700  5 mds.beacon.a set_want_state: up:reconnect -> up:rejoin
2022-06-07T03:50:54.869+0800 7fffee33b700  5 mds.beacon.a set_want_state: up:rejoin -> up:active
2022-06-07T03:51:19.690+0800 7fffee33b700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:19.690+0800 7fffee33b700  5 mds.0.server waiting for root
2022-06-07T03:51:19.915+0800 7fffee33b700 10 mds.0.sessionmap add_session s=0x55555b57e000 name=client.4445
2022-06-07T03:51:38.962+0800 7fffe832f700 10 mds.0.cache populate_mydir done
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 MDSContext::complete: 18C_MDS_RetryMessage
2022-06-07T03:51:38.962+0800 7fffe9b32700 10 mds.0.215 get_session replacing connection bootstrap session 0x55555b4e2300 with imported session 0x55555b57e000
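
The ordering above can be sketched with a small toy model. This is only an illustration of the race as described in this report, not MDS code: `ToyMDS`, `Session`, `Request`, `run_scenario` and the result strings are all hypothetical names, and the model reports a "session_mismatch" at the point where the real MDS logs "replacing connection bootstrap session ... with imported session ..." and then crashes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Session:
    name: str
    closed: bool = False


@dataclass
class Request:
    op: str
    session: Session  # session the request was bound to when it arrived


@dataclass
class ToyMDS:
    """Hypothetical stand-in for the MDS; models only the queues and the sessionmap."""
    sessionmap: Dict[str, Session] = field(default_factory=dict)
    waiting_for_active: List[Request] = field(default_factory=list)
    waiting_for_root: List[Request] = field(default_factory=list)
    active: bool = False
    root_ready: bool = False

    def add_session(self, name: str) -> Session:
        session = Session(name)
        self.sessionmap[name] = session
        return session

    def kill_session(self, name: str) -> None:
        self.sessionmap.pop(name).closed = True

    def handle_client_request(self, req: Request) -> str:
        if not self.active:
            self.waiting_for_active.append(req)   # "not active yet, waiting"
            return "wait_active"
        if not self.root_ready:
            self.waiting_for_root.append(req)     # "waiting for root"
            return "wait_root"
        current = self.sessionmap.get(req.session.name)
        if current is not req.session:
            # The deferred request still points at the killed session while
            # the sessionmap holds a different one for the same client.
            return "session_mismatch"
        return "handled"

    def retry(self, queue: List[Request]) -> List[str]:
        pending = list(queue)
        queue.clear()
        return [self.handle_client_request(r) for r in pending]


def run_scenario(kill_and_reopen: bool = True) -> List[str]:
    mds = ToyMDS()
    old = mds.add_session("client.4445")
    req = Request("mkdir", old)
    events = [mds.handle_client_request(req)]      # arrives during reconnect
    if kill_and_reopen:
        mds.kill_session("client.4445")            # reconnect timeout expired
    mds.active = True
    events += mds.retry(mds.waiting_for_active)    # deferred again, root not ready
    if kill_and_reopen:
        mds.add_session("client.4445")             # client sent request_open
    mds.root_ready = True                          # populate_mydir done
    events += mds.retry(mds.waiting_for_root)
    return events


if __name__ == "__main__":
    print(run_scenario())
    print(run_scenario(kill_and_reopen=False))
```

In this sketch the request is only handled normally when the session is neither killed nor reopened while the request sits in a wait queue; otherwise the retried request and the sessionmap disagree about which session belongs to client.4445.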

Related issues 2 (0 open — 2 closed)

Updated by Venky Shankar almost 2 years ago

Status changed from New to Fix Under Review
Pull request ID set to 46750

Updated by Venky Shankar almost 2 years ago

Backport set to quincy, pacific

Updated by Venky Shankar almost 2 years ago

Hi,

Do you have a specific reproducer for this (in the form of a workload)?

Cheers,
Venky

Updated by Mer Xuanyi almost 2 years ago

Venky Shankar wrote:

Hi,

Do you have a specific reproducer for this (in the form of a workload)?

Cheers,
Venky

Hi, you can reproduce this with the following steps:

0. Start the mds and ceph-fuse under gdb, with non-stop mode on.
1. Set a breakpoint at Server::handle_client_request (#b1) in the mds.
2. Send a client_request from the client (e.g. a mkdir).
3. Kill the mds from gdb without processing this client_request.
4. Disable #b1; set breakpoints at Client::early_kick_flushing_caps (#b2) in the client and MDCache::populate_mydir (#b3) in the mds.
5. Restart the mds.
6. The client will block at #b2 while the mds is in the reconnect state; wait until kill_session.
7. Continue the client.
8. The mds will block at #b3 in the rejoin phase; disable #b3 and set a new breakpoint at MDSIOContextBase::complete (#b4).
9. Continue the mds.
10. The mds will block at #b4; do nothing until the mds has added the new session to the sessionmap (add_session).
11. Disable #b4 and continue the mds.
12. The mds will crash after printing "get_session replacing connection bootstrap session ...".

Settings:
client_reconnect_stale: true
mds_session_blocklist_on_evict: false
mds_session_blacklist_on_timeout: false
mds_reconnect_timeout: 5
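
As a convenience, the settings above could be turned into `ceph config set` commands with a short script. This is a sketch under assumptions: the split into a `client` and an `mds` target is guessed from the option prefixes and is not stated in the report, `SETTINGS` and `config_commands` are hypothetical names, and the option names are copied verbatim from the list above (including the mixed blocklist/blacklist spelling), so they may need adjusting for a given release.

```python
from typing import Dict, List

# Values copied from the report; the client/mds grouping is an assumption.
SETTINGS: Dict[str, Dict[str, str]] = {
    "client": {
        "client_reconnect_stale": "true",
    },
    "mds": {
        "mds_session_blocklist_on_evict": "false",
        "mds_session_blacklist_on_timeout": "false",
        "mds_reconnect_timeout": "5",
    },
}


def config_commands(settings: Dict[str, Dict[str, str]]) -> List[str]:
    """Render one `ceph config set <who> <option> <value>` line per setting."""
    return [
        f"ceph config set {who} {key} {value}"
        for who, options in settings.items()
        for key, value in options.items()
    ]


if __name__ == "__main__":
    for line in config_commands(SETTINGS):
        print(line)
```

The script only prints the commands; it does not run them, so the output can be reviewed before applying it to a test cluster.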
