Bug #23272 (open)

Switch port down: cephfs kernel client loses its session and stays blocked, not recovering until the port comes back up

Added by Yong Wang about 6 years ago. Updated about 6 years ago.

Status: New
Priority: Normal
Assignee: -
Category: fs/ceph
Target version:
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Crash signature (v1):
Crash signature (v2):

Description

=========================
Switch port down: cephfs kernel client loses its session and stays blocked, not recovering until the port comes back up
=========================

ceph version 12.2.2
3 nodes
mds standby-active-standby

=========================
2018-03-07 14:39:29.349374 7fc850c1e700 1 mds.0.server reconnect gave up on client.114396 10.0.30.25:0/2251752544
2018-03-07 14:39:29.349504 7fc850c1e700 0 log_channel(cluster) log [WRN] : evicting unresponsive client ceph_node1: (114396), after waiting 45 seconds during MDS startup

ceph daemon mds.ceph_node1 session ls
[
  {
    "id": 114399,
    "num_leases": 0,
    "num_caps": 69,
    "state": "open",
    "replay_requests": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "inst": "client.114399 10.0.30.26:0/495832568",
    "client_metadata": {
      "entity_id": "",
      "hostname": "ceph_node2",
      "kernel_version": "3.10.0-514.el7.x86_64"
    }
  },
  {
    "id": 114380,
    "num_leases": 0,
    "num_caps": 69,
    "state": "open",
    "replay_requests": 0,
    "completed_requests": 0,
    "reconnecting": false,
    "inst": "client.114380 10.0.30.27:0/1147346533",
    "client_metadata": {
      "entity_id": "",
      "hostname": "ceph_node3",
      "kernel_version": "3.10.0-514.el7.x86_64"
    }
  }
]

Mar 7 14:42:54 ceph_node1 kernel: ceph: mds0 reconnect start
Mar 7 14:44:29 ceph_node1 kernel: ceph: error -22 preparing reconnect for mds0
Mar 7 15:02:09 ceph_node1 kernel: ceph: mds0 reconnect start
Mar 7 15:02:09 ceph_node1 kernel: ceph: error -22 preparing reconnect for mds0

[446473.401003] ceph: error -22 preparing reconnect for mds0
[446473.402086] libceph: mon2 10.0.30.27:6789 session established
[447531.038944] libceph: mds0 10.0.30.26:6800 socket closed (con state OPEN)
[447533.144485] ceph: mds0 reconnect start
[447533.144557] ceph: error -22 preparing reconnect for mds0
[447542.796376] ceph: mds0 recovery completed

fs/ceph/mds_client.c:3031 fail_nopagelist
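
For context, fail_nopagelist is the error exit of send_mds_reconnect(), the function that prints the "error %d preparing reconnect" message above. A rough sketch of the shape of that path, paraphrased from upstream fs/ceph/mds_client.c of roughly that era (not copied from the exact 3.10.0-514.el7 source, so names and line numbers are approximate):

static void send_mds_reconnect(struct ceph_mds_client *mdsc,
                               struct ceph_mds_session *session)
{
        struct ceph_msg *reply;
        struct ceph_pagelist *pagelist;
        int mds = session->s_mds;
        int err = -ENOMEM;

        pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
        if (!pagelist)
                goto fail_nopagelist;   /* would print -ENOMEM (-12), not -22 */
        ceph_pagelist_init(pagelist);

        reply = ceph_msg_new(CEPH_MSG_CLIENT_RECONNECT, 0, GFP_NOFS, false);
        if (!reply)
                goto fail_nomsg;

        /* ... walk the caps and build the reconnect message; any failure
         * in these later steps sets err and jumps to the labels below ... */

        return;

fail:
        ceph_msg_put(reply);
fail_nomsg:
        ceph_pagelist_release(pagelist);
fail_nopagelist:
        pr_err("error %d preparing reconnect for mds%d\n", err, mds);
        return;
}

Since the labels fall through to the same pr_err(), the -22 (-EINVAL) presumably comes from one of the later cap-walking / message-building steps rather than from a failed allocation.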

free -g
              total        used        free      shared  buff/cache   available
Mem:             62          17          44           0           0          44
Swap:            29           1          28

mount |grep 10.0
10.0.30.25:6789,10.0.30.26:6789,10.0.30.27:6789:/ on /infinityfs1 type ceph (rw,relatime,acl)

mds_client.c check_new_map

uname -r
3.10.0-514.el7.x86_64
echo 'file fs/ceph/mds_client.c line 3051 +p' >/sys/kernel/debug/dynamic_debug/control

After stopping the mds and making the mds active on another node, the client reconnects OK.

<4>[521401.347515] ceph: dropping dirty+flushing Fw state for ffff8806749f8340 1099511628786
<4>[521401.422872] ceph: dropping dirty+flushing Fw state for ffff8808462424a0 1099511628796
<4>[521401.492587] ceph: dropping dirty+flushing Fw state for ffff881051c288d0 1099511628799
<4>[521401.564168] ceph: dropping dirty+flushing Fw state for ffff880f5b348e60 1099511628797
<4>[521401.644903] ceph: dropping dirty+flushing Fw state for ffff88098c6408d0 1099511628798
<4>[521401.713670] ceph: dropping dirty+flushing Fw state for ffff880fab0fdc40 1099511628801
<4>[521401.782892] ceph: dropping dirty+flushing Fw state for ffff880fab0fe760 1099511628803
<4>[521401.852433] ceph: dropping dirty+flushing Fw state for ffff88000d6d93f0 1099511628800
<4>[521401.920445] ceph: dropping dirty+flushing Fw state for ffff880fab0fe1d0 1099511628802
<4>[521401.968322] ceph: dropping dirty+flushing Fw state for ffff880c7bf908d0 1099511628804
<7>[521410.563957] ceph: check_new_map new 72 old 71
<7>[521411.624240] ceph: check_new_map new 73 old 72
<4>[521411.625006] libceph: mds0 10.0.30.26:6800 socket closed (con state NEGOTIATING)
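
The check_new_map lines in the dmesg output above are the dout() at the top of check_new_map(), presumably what the dynamic_debug echo earlier turns on. Roughly, again paraphrased from upstream fs/ceph/mds_client.c rather than the exact RHEL source, the part of that function that decides when to retry send_mds_reconnect() after an mdsmap change looks like:

static void check_new_map(struct ceph_mds_client *mdsc,
                          struct ceph_mdsmap *newmap,
                          struct ceph_mdsmap *oldmap)
{
        struct ceph_mds_session *s;
        int i;

        dout("check_new_map new %u old %u\n",
             newmap->m_epoch, oldmap->m_epoch);

        for (i = 0; i < oldmap->m_max_mds && i < mdsc->max_sessions; i++) {
                s = mdsc->sessions[i];
                if (!s)
                        continue;

                /* ... compare the rank's old and new mdsmap states ... */

                /* if our session was cut off (RESTARTING) and the rank has
                 * come back far enough to accept reconnects, resend caps */
                if (s->s_state == CEPH_MDS_SESSION_RESTARTING &&
                    ceph_mdsmap_get_state(newmap, i) >= CEPH_MDS_STATE_RECONNECT) {
                        mutex_unlock(&mdsc->mutex);
                        send_mds_reconnect(mdsc, s);
                        mutex_lock(&mdsc->mutex);
                }
        }
}

So each new mdsmap epoch (new 72 old 71, new 73 old 72 above) re-enters this path and retries the reconnect.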

#1

Updated by Yong Wang about 6 years ago

It seems like the client does not support a connect retry feature.
If the network service is down for a long time, how do we keep I/O working normally on this node?

#2

Updated by Zheng Yan about 6 years ago

Mar 7 14:42:54 ceph_node1 kernel: ceph: mds0 reconnect start
Mar 7 14:44:29 ceph_node1 kernel: ceph: error -22 preparing reconnect for mds0

No idea how send_mds_reconnect() ends up with error code -EINVAL.
