Bug #61790

cephfs client to mds comms remain silent after reconnect

Added by Venky Shankar 11 months ago. Updated 11 months ago.

Status: New
Priority: High
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): Client, MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

https://pulpito.ceph.com/vshankar-2023-06-20_10:07:44-fs-wip-vshankar-testing-20230620.052303-testing-default-smithi/7308920/

(fuse) client log
./remote/smithi196/log/ceph-client.1.126898.log.gz

MDS logs
./remote/smithi104/log/ceph-mds.a.log.gz
./remote/smithi104/log/ceph-mds.c.log.gz

mds.a was rank 0 and the client had an open session with it. After failover (mds.c took over rank 0), the client initiated a reconnect

2023-06-20T15:01:42.253+0000 7f5f0b7fe700 10 client.13603 send_reconnect to mds.0
2023-06-20T15:01:42.253+0000 7f5f0b7fe700 20 client.13603 trim_cache_for_reconnect mds.0
2023-06-20T15:01:42.253+0000 7f5f0b7fe700 20 client.13603 trim_cache_for_reconnect mds.0 trimmed 0 dentries
2023-06-20T15:01:42.253+0000 7f5f0b7fe700 10 client.13603 connect_mds_targets for mds.0
2023-06-20T15:01:42.253+0000 7f5f0b7fe700 10 client.13603  caps on 0x1.head pAsLsXsFs wants -
2023-06-20T15:01:42.253+0000 7f5f0b7fe700 10 client.13603     path #0x1
2023-06-20T15:01:42.253+0000 7f5f0b7fe700 10 client.13603  snaprealm snaprealm(0x1 nref=1 c=0 seq=1 parent=0x0 my_snaps=[] cached_snapc=1=[] last_modified=2023-06-20T15:01:22.975593+0000 change_attr=0)

which mds.c handled

2023-06-20T15:01:42.254+0000 7f31a81f6700 10 mds.c ms_handle_accept 192.168.0.1:0/2884131241 con 0x55ba9d698400 session 0x55ba9d650f00
2023-06-20T15:01:42.254+0000 7f31a81f6700 10 mds.c  session connection 0 -> 0x55ba9d698400
2023-06-20T15:01:42.254+0000 7f31ab1fc700  1 --2- [v2:172.21.15.104:6836/3326079188,v1:172.21.15.104:6837/3326079188] >> 192.168.0.1:0/2884131241 conn(0x55ba9d698400 0x55ba9d66c680 crc :-1 s=READY pgs=8 cs=0 l=0 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).ready entity=client.13603 client_cookie=ae23e7da360852d4 server_cookie=69552c3b8842effd in_seq=0 out_seq=0
2023-06-20T15:01:42.254+0000 7f31a81f6700  1 -- [v2:172.21.15.104:6836/3326079188,v1:172.21.15.104:6837/3326079188] <== client.13603 192.168.0.1:0/2884131241 1 ==== client_reconnect(1 caps 1 realms ) v5 ==== 0+0+101 (crc 0 0 0) 0x55ba9d593d40 con 0x55ba9d698400
2023-06-20T15:01:42.254+0000 7f31a81f6700  7 mds.0.server handle_client_reconnect client.13603
2023-06-20T15:01:42.254+0000 7f31a81f6700 20 mds.0.222 get_session have 0x55ba9d650f00 client.13603 192.168.0.1:0/2884131241 state open
2023-06-20T15:01:42.254+0000 7f31a81f6700 10 mds.0.server  reconnect_start 3290.305375s delay 0.002000
2023-06-20T15:01:42.254+0000 7f31a81f6700 10 add_session: mds.metrics: session=0x55ba9d650f00, client=client.13603 192.168.0.1:0/2884131241
2023-06-20T15:01:42.254+0000 7f31a81f6700 20 add_session: mds.metrics: metrics=[update_type=0, metrics={cap_hit_metric={hits=0, misses=0}, read_latency={latency=0.000000, avg_latency=0.000000, sq_sum=0, count=0}, write_latency={latency=0.000000, avg_latency=0.000000, sq_sum=0, count=0}, metadata_latency={latency=0.000000, avg_latency=0.000000, sq_sum=0, count=0}, dentry_lease={hits=0, misses=0}, opened_files_metric={opened_files=0, total_inodes=0}, pinned_icaps_metric={pinned_icaps=0, total_inodes=0}, opened_inodes_metric={opened_inodes=0, total_inodes=0}, read_io_sizes_metric={total_ops=0, total_size=0}, write_io_sizes_metric={total_ops=0, total_size=0}}]
2023-06-20T15:01:42.254+0000 7f31a81f6700 10 mds.0.222 send_message_client client.13603 192.168.0.1:0/2884131241 client_session(open) v5
2023-06-20T15:01:42.254+0000 7f31a81f6700  1 -- [v2:172.21.15.104:6836/3326079188,v1:172.21.15.104:6837/3326079188] --> 192.168.0.1:0/2884131241 -- client_session(open) v5 -- 0x55ba9c7b2a80 con 0x55ba9d698400
2023-06-20T15:01:42.254+0000 7f31a81f6700  0 log_channel(cluster) log [DBG] : reconnect by client.13603 192.168.0.1:0/2884131241 after 0.00199998
2023-06-20T15:01:42.254+0000 7f31a81f6700 15 mds.0.server open snaprealm (w inode) on [inode 0x1 [...2,head] / auth v2 snaprealm=0x55ba9d6946c0 f() n(v0 rc2023-06-20T15:01:36.801827+0000 1=0+1)/n(v0 rc2023-06-20T15:01:22.975593+0000 1=0+1) (iversion lock) caps={15828=pLsXsFs/-@0} | dirfrag=1 caps=1 dirty=1 0x55ba9d66c100]
2023-06-20T15:01:42.254+0000 7f31a81f6700 15 mds.0.server open cap realm 0x1 on [inode 0x1 [...2,head] / auth v2 snaprealm=0x55ba9d6946c0 f() n(v0 rc2023-06-20T15:01:36.801827+0000 1=0+1)/n(v0 rc2023-06-20T15:01:22.975593+0000 1=0+1) (iversion lock) caps={15828=pLsXsFs/-@0} | dirfrag=1 caps=1 dirty=1 0x55ba9d66c100]
2023-06-20T15:01:42.254+0000 7f31a81f6700  7 mds.0.server reconnect_gather_finish.  failed on 0 clients
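For reference, here is a small compilable sketch of the exchange the two log excerpts above show: the client packs the caps and snap realms it holds into a reconnect message for the new rank 0, and the MDS replies by re-opening the session (the client_session(open) line). The types and functions below are illustrative only, not the actual Client/Server code paths.

// Illustrative model of the reconnect handshake logged above; the names
// (CapReconnect, ReconnectMsg, ...) are hypothetical, not Ceph's API.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// One entry per inode the client holds caps on (cf. "caps on 0x1.head pAsLsXsFs").
struct CapReconnect {
    uint64_t ino;
    std::string held;   // caps the client still holds, e.g. "pAsLsXsFs"
    std::string path;   // path hint, e.g. "#0x1"
};

// Corresponds to the client_reconnect(1 caps 1 realms) v5 message in the log.
struct ReconnectMsg {
    std::vector<CapReconnect> caps;
    int num_realms = 0;
};

// Client side: build and send the reconnect to the MDS now holding rank 0.
ReconnectMsg send_reconnect() {
    ReconnectMsg m;
    m.caps.push_back({0x1, "pAsLsXsFs", "#0x1"});
    m.num_realms = 1;
    std::cout << "client: send_reconnect to mds.0 (" << m.caps.size()
              << " caps, " << m.num_realms << " realms)\n";
    return m;
}

// MDS side: accept the reconnect on the already-open session and reply with a
// session-open message, as in the handle_client_reconnect lines above.
void handle_reconnect(const ReconnectMsg& m) {
    std::cout << "mds: handle_client_reconnect, " << m.caps.size()
              << " caps; replying client_session(open)\n";
}

int main() {
    handle_reconnect(send_reconnect());
}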

mds.c enters up:rejoin and handles the client session (and transitions to up:active)

2023-06-20T15:01:43.265+0000 7f31a21ea700 10 MDSIOContextBase::complete: 26C_MDC_RejoinSessionsOpened
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 MDSContext::complete: 26C_MDC_RejoinSessionsOpened
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 mds.0.cache rejoin_open_sessions_finish
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 mds.0.server finish_force_open_sessions on 2 clients, initial v 2
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 mds.0.server force_open_sessions skipping already-open client.13603 192.168.0.1:0/2884131241
2023-06-20T15:01:43.265+0000 7f31a21ea700 20 mds.0.sessionmap mark_dirty s=0x55ba9d650f00 name=client.13603 v=2
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 mds.0.server force_open_sessions skipping already-open client.15828 192.168.0.1:0/2117120829
2023-06-20T15:01:43.265+0000 7f31a21ea700 20 mds.0.sessionmap mark_dirty s=0x55ba9c81ed00 name=client.15828 v=3
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 mds.0.server finish_force_open_sessions: final v 4
2023-06-20T15:01:43.265+0000 7f31a21ea700 10 mds.0.cache rejoin_gather_finish
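For orientation, these are the states the replacement daemon walks through while taking over the rank; the rejoin_open_sessions_finish / finish_force_open_sessions lines above run during up:rejoin, after which it goes active. A reference listing only, not MDS code.

// Rank-takeover sequence for a standby MDS (single-rank case); illustrative.
#include <iostream>
#include <string>
#include <vector>

int main() {
    const std::vector<std::string> takeover = {
        "up:replay",     // replay the failed rank's journal
        "up:reconnect",  // accept client_reconnect messages (seen above)
        "up:rejoin",     // rebuild cache state and re-open client sessions
        "up:active",     // resume serving the rank
    };
    for (const auto& state : takeover)
        std::cout << state << '\n';
}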

Client::tick() sends request_renewcaps

2023-06-20T15:01:59.411+0000 7f5f09ffb700 20 client.13603 tick
2023-06-20T15:01:59.411+0000 7f5f09ffb700 10 client.13603 renew_caps()
2023-06-20T15:01:59.411+0000 7f5f09ffb700 15 client.13603 renew_caps requesting from mds.0
2023-06-20T15:01:59.411+0000 7f5f09ffb700 10 client.13603 renew_caps mds.0
2023-06-20T15:01:59.411+0000 7f5f09ffb700  1 -- 192.168.0.1:0/2884131241 --> [v2:172.21.15.104:6836/3326079188,v1:172.21.15.104:6837/3326079188] -- client_session(request_renewcaps seq 3) v5 -- 0x7f5efc0064b0 con 0x7f5f0400a370

But neither this message nor the client's metrics update message is received by the MDS. The MDS auto-closes the session after the timeout (300s). I did not dig much further. FWIW, this could be something related to the messenger layer. Requires investigation.
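To make the failure mode concrete, here is a compilable sketch of the lease mechanism involved: the client's periodic tick keeps sending renew-caps messages over its cached connection to rank 0, and if none of them reach the MDS, the session is eventually auto-closed once it has been idle longer than mds_session_autoclose (300s by default). The connection model and intervals below are illustrative, not the actual Client::tick()/renew_caps() implementation.

// Illustrative model of the session lease. con.delivers = false stands in for
// the post-reconnect silence suspected to live in the messenger layer.
#include <iostream>

struct Connection {
    bool delivers = true;                 // false: messages are sent but never arrive
    bool send(const char* msg) {
        if (delivers) std::cout << "mds received: " << msg << '\n';
        return delivers;                  // whether the MDS actually got it
    }
};

int main() {
    Connection con;
    con.delivers = false;                 // the behaviour seen in this tracker

    const int tick_interval = 20;         // client tick period in seconds (illustrative)
    const int session_autoclose = 300;    // mds_session_autoclose default, seconds

    int idle = 0;                         // seconds since the MDS last heard a renewal
    for (int elapsed = 0; elapsed <= 400; elapsed += tick_interval) {
        if (con.send("client_session(request_renewcaps)"))
            idle = 0;                     // renewal arrived, lease refreshed
        else
            idle += tick_interval;        // nothing arrived this interval

        if (idle > session_autoclose) {
            std::cout << "mds: no renewal for >" << session_autoclose
                      << "s, auto-closing the session\n";
            break;
        }
    }
}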

Actions #1

Updated by Venky Shankar 11 months ago

  • Priority changed from Normal to High
Actions #2

Updated by Venky Shankar 11 months ago

  • Assignee set to Venky Shankar