Bug #55850

open

libceph socket closed

Added by Grant Peltier almost 2 years ago. Updated over 1 year ago.

Status: Need More Info
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 4 - irritation
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Crash signature (v1):
Crash signature (v2):

Description

We are having this issue on all of our FUSE clients: it looks like Ceph is closing its socket to our mds0 at random. There is not much information in the Ceph logs about this, and I am unable to find a root cause.

ceph version 14.2.10-392-gb3a13b81cb (b3a13b81cb4dfddec1cd59e7bab1e3e9984c8dd8) nautilus (stable)

dmesg logs

[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect start
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect start
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect start
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect success
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect success
[Tue May 31 20:15:45 2022] ceph: mds0 reconnect success
[Tue May 31 20:16:06 2022] ceph: mds0 recovery completed
[Tue May 31 20:16:06 2022] ceph: mds0 recovery completed
[Tue May 31 20:16:06 2022] ceph: mds0 recovery completed
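
The kernel client logs very little about these events by default. As a hedged aside (assuming the client kernels were built with CONFIG_DYNAMIC_DEBUG and have debugfs mounted), dynamic debug could be enabled for the ceph and libceph modules on an affected client to capture more detail around the next occurrence; the extra messages show up in the kernel log:

# very verbose; turn off again with "-p" once the event has been captured
echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
dmesg -w    # or journalctl -k -f, to watch the kernel log live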

Ceph config file

# DeepSea default configuration. Changes in this file will be overwritten on
# package update. Include custom configuration fragments in
# /srv/salt/ceph/configuration/files/ceph.conf.d/[global,osd,mon,mgr,mds,client].conf
[global]
fsid = b799274f-d309-4616-8320-a05dd147c602
mon_initial_members = ceph-node4, ceph-node1, ceph-node2, ceph-node3
mon_host = 10.50.1.250, 10.50.1.248, 10.50.1.249, 10.50.1.247
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.50.1.0/24
cluster_network = 10.50.4.0/24

# enable old ceph health format in the json output. This fixes the
# ceph_exporter. This option will only stay until the prometheus plugin takes
# over
mon_health_preluminous_compat = true
mon health preluminous compat warning = false

rbd default features = 3

[mon]
mgr initial modules = dashboard

[mds]
mds_cache_memory_limit = 8589934592

Thank you!

Actions #1

Updated by Venky Shankar almost 2 years ago

  • Project changed from CephFS to Linux kernel client
  • Assignee set to Jeff Layton
Actions #2

Updated by Jeff Layton almost 2 years ago

This bug is confusing, as these messages are from the kernel ceph client, but you mention using FUSE clients. I'll assume for now that you're just mistaken about the kernel/FUSE mount thing. In any case, these messages:

[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)
[Tue May 31 20:15:37 2022] libceph: mds0 (1)10.50.1.248:6800 socket closed (con state OPEN)

...do not indicate that the client is closing the socket, but rather that the peer (mds0) closed the socket, or that there was some other spurious networking-level hiccup. The messages below indicate that the MDS's socket nonce changed. A new nonce is generally only generated when the daemon starts, so this probably means that something changed in the cluster:

[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address
[Tue May 31 20:15:38 2022] libceph: wrong peer, want (1)10.50.1.248:6800/-1186287667, got (1)10.50.1.248:6800/-1535805463
[Tue May 31 20:15:38 2022] libceph: mds0 (1)10.50.1.248:6800 wrong peer at address

You may want to see if the MDS is crashing or had some other issue.
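
As a hedged sketch of how to check that (the systemd unit name and deployment layout here are assumptions), one could look at the MDS daemon's start time and the cluster crash list on the MDS host:

# on the MDS host (10.50.1.248): did ceph-mds restart around 20:15?
systemctl status ceph-mds@<mds-name>
journalctl -u ceph-mds@<mds-name> --since "2022-05-31 20:00"
# on any node with an admin keyring: any recorded daemon crashes?
ceph crash ls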

Actions #3

Updated by Grant Peltier almost 2 years ago

Sorry, yes, I am using the kernel client here. We had a dual FUSE/kernel mounting setup, so I got a little confused, but now the clients only have kernel mounts. We are currently investigating the MDS networking issues. There were no MDS crashes reported in the logs.

Actions #4

Updated by Jeff Layton almost 2 years ago

Are there any intervening firewalls, etc., between the clients and the MDSs?
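
As a hedged aside, one way to check would be to inspect the firewall state on both the client and the MDS host and to verify that the MDS messenger port is reachable from the client (the exact tooling depends on the distribution):

# on the client and on the MDS host: is a firewall active?
iptables -L -n            # or: firewall-cmd --list-all, nft list ruleset
# from the client: can the MDS port be reached?
nc -vz 10.50.1.248 6800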

Actions #5

Updated by Jeff Layton almost 2 years ago

  • Assignee changed from Jeff Layton to Xiubo Li
Actions #6

Updated by Xiubo Li over 1 year ago

  • Status changed from New to Need More Info

Jeff Layton wrote:

Are there any intervening firewalls, etc between the clients and the MDSs?

@Grant,

Any reply to this?

And could you upload the MDS-side logs if possible?
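
If the default logs are too sparse, a hedged sketch for gathering more detail (assuming the standard log location and that runtime config changes are allowed) would be to raise the MDS debug levels temporarily and then collect /var/log/ceph/ceph-mds.<name>.log from the active MDS host:

# raise MDS debug levels temporarily (logs can grow quickly)
ceph config set mds debug_mds 10
ceph config set mds debug_ms 1
# ... wait for the socket-closed event to recur, collect the log, then revert
ceph config set mds debug_mds 1/5
ceph config set mds debug_ms 0/0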

Thanks!
