Bug #55850

open

libceph socket closed

Added by Grant Peltier almost 2 years ago. Updated over 1 year ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Crash signature (v1):
Crash signature (v2):

Description

We are having this issue on all of our FUSE clients: it looks like Ceph is closing its socket to our mds0 at random. There is not much information in the Ceph logs about this, and I am unable to find a root cause.

ceph version 14.2.10-392-gb3a13b81cb (b3a13b81cb4dfddec1cd59e7bab1e3e9984c8dd8) nautilus (stable)

dmesg logs

[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect start
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect start
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect start
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect success
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect success
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect success
[Tue May 31 20:16:06 2022] ceph: mds0 recovery completed
[Tue May 31 20:16:06 2022] ceph: mds0 recovery completed
[Tue May 31 20:16:06 2022] ceph: mds0 recovery completed
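
The kernel client logs very little about these events by default. As a hedged aside (assuming the client kernels were built with CONFIG_DYNAMIC_DEBUG and have debugfs mounted), dynamic debug could be enabled for the ceph and libceph modules on an affected client to capture more detail around the next occurrence; the extra messages show up in the kernel log:

# very verbose; turn off again with "-p" once the event has been captured
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
dmesg -w    # or journalctl -k -f, to watch the kernel log live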

Ceph config file

# DeepSea default configuration. Changes in this file will be overwritten on
# package update. Include custom configuration fragments in
# /srv/salt/ceph/configuration/files/ceph.conf.d/[global,osd,mon,mgr,mds,client].conf
[global]
fsid = b799274f-d309-4616-8320-a05dd147c602
mon_initial_members = ceph-node4, ceph-node1, ceph-node2, ceph-node3
mon_host = 10.50.1.250, 10.50.1.248, 10.50.1.249, 10.50.1.247
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.50.1.0/24
cluster_network = 10.50.4.0/24

# enable old ceph health format in the json output. This fixes the
# ceph_exporter. This option will only stay until the prometheus plugin takes
# over
mon_health_preluminous_compat = true
mon health preluminous compat warning = false

rbd default features = 3

[mon]
mgr initial modules = dashboard

[mds]
mds_cache_memory_limit = 8589934592

Thank you!

Actions #1

Updated by Venky Shankar almost 2 years ago

  • Project changed from CephFS to Linux kernel client
  • Assignee set to Jeff Layton
Actions #2

Updated by Jeff Layton almost 2 years ago

This bug is confusing, as these messages are from the kernel ceph client, but you mention using FUSE clients. I'll assume for now that you're just mistaken about the kernel/FUSE mount thing. In any case, these messages:

[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)

...do not indicate that the client is closing the socket, but rather that the peer (mds0) closed the socket, or that there was some other spurious networking-level hiccup. The messages below indicate that the MDS's socket nonce changed. A new nonce is generally only generated when the daemon starts, so this probably means that something changed in the cluster:

[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address

You may want to see if the MDS is crashing or had some other issue.
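
As a hedged sketch of how to check that (the systemd unit name and deployment layout here are assumptions), one could look at the MDS daemon's start time and the cluster crash list on the MDS host:

# on the MDS host (10.50.1.248): did ceph-mds restart around 20:15?
systemctl status ceph-mds@<mds-name>
journalctl -u ceph-mds@<mds-name> --since "2022-05-31 20:00"
# on any node with an admin keyring: any recorded daemon crashes?
ceph crash ls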

Actions #3

Updated by Grant Peltier almost 2 years ago

Sorry, yes, I am using the kernel client here. We had a dual FUSE/kernel mounting setup, so I got a little confused, but now the clients only have kernel mounts. We are currently investigating the MDS networking issues. There were no MDS crashes reported in the logs.

Actions #4

Updated by Jeff Layton almost 2 years ago

Are there any intervening firewalls, etc., between the clients and the MDSs?
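
As a hedged aside, one way to check would be to inspect the firewall state on both the client and the MDS host and to verify that the MDS messenger port is reachable from the client (the exact tooling depends on the distribution):

# on the client and on the MDS host: is a firewall active?
iptables -L -n            # or: firewall-cmd --list-all, nft list ruleset
# from the client: can the MDS port be reached?
nc -vz 10.50.1.248 6800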

Actions #5

Updated by Jeff Layton almost 2 years ago

  • Assignee changed from Jeff Layton to Xiubo Li
Actions #6

Updated by Xiubo Li over 1 year ago

  • Status changed from New to Need More Info

Jeff Layton wrote:

Are there any intervening firewalls, etc between the clients and the MDSs?

@Grant,

Any reply to this?

And could you upload the MDS-side logs if possible?
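
If the default logs are too sparse, a hedged sketch for gathering more detail (assuming the standard log location and that runtime config changes are allowed) would be to raise the MDS debug levels temporarily and then collect /var/log/ceph/ceph-mds.<name>.log from the active MDS host:

# raise MDS debug levels temporarily (logs can grow quickly)
ceph config set mds debug_mds 10
ceph config set mds debug_ms 1
# ... wait for the socket-closed event to recur, collect the log, then revert
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0/0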

Thanks!
