Bug #49584


Ceph OSD, MDS, and MGR daemons do not bind _only_ to the specified address when configured to do so, resulting in a degraded cluster state

Added by Stefan Kooman about 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Documentation (https://docs.ceph.com/en/octopus/rados/configuration/network-config-ref/#ceph-daemons) states the following:

The MGR, OSD, and MDS daemons will bind to any available address and do not require any special configuration. However, it is possible to specify a specific IP address for them to bind to with the public addr (and/or, in the case of OSD daemons, the cluster addr) configuration option. For example,

[osd.0]
public addr = {host-public-ip-address}
cluster addr = {host-cluster-ip-address}

However, this does not appear to be entirely true. When ms_bind_ipv4=true and ms_bind_ipv6=true, the daemon binds to the specified address (I tested with a public addr), but also to the wildcard '0.0.0.0' IPv4 address when no specific IPv4 address is configured.
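A minimal ceph.conf fragment that reproduces the observed behaviour (the daemon name and IPv6 address are the ones from this report; treat them as placeholders for your own cluster):

```ini
[global]
# Both address families enabled: under this configuration the daemon
# also binds the IPv4 wildcard 0.0.0.0, even with a public_addr set
ms_bind_ipv4 = true
ms_bind_ipv6 = true

[mds.stefanmds1]
# Only an IPv6 public address is specified; no IPv4 address is configured
public_addr = 2001:7b8:642:0:1337::87
```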

ceph fs dump:

--- snip ---
Standby daemons:

[mds.stefanmds2{-1:76745848} state up:standby seq 1 addr [v2:[2001:7b8:642:0:1337::88]:6800/27999541,v1:[2001:7b8:642:0:1337::88]:6801/27999541,v2:0.0.0.0:6802/27999541,v1:0.0.0.0:6803/27999541]]
--- snap ---
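The extra wildcard binds are visible in the addrvec strings above. A small helper can flag them when scanning `ceph fs dump` output; this is only a sketch, `wildcard_binds` is a hypothetical helper, and the addrvec format is assumed from the dumps shown in this report:

```python
import re

def wildcard_binds(addrvec: str) -> list[str]:
    """Return the entries in a Ceph addrvec string that are wildcard binds.

    An addrvec (as printed by `ceph fs dump`) looks like:
    [v2:ADDR:PORT/NONCE,v1:ADDR:PORT/NONCE,...]
    """
    entries = addrvec.strip("[]").split(",")
    # Match an IPv4 0.0.0.0 or IPv6 [::] wildcard between the ':' separators
    return [e for e in entries if re.search(r":(?:0\.0\.0\.0|\[::\]):", e)]

dump_line = ("[v2:[2001:7b8:642:0:1337::88]:6800/27999541,"
             "v1:[2001:7b8:642:0:1337::88]:6801/27999541,"
             "v2:0.0.0.0:6802/27999541,v1:0.0.0.0:6803/27999541]")
print(wildcard_binds(dump_line))
# → ['v2:0.0.0.0:6802/27999541', 'v1:0.0.0.0:6803/27999541']
```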

(mgmt)root@stefanmds1:~$ ceph daemon mds.stefanmds1 config get public_addr
{
    "public_addr": "-"
}

root@stefanmds1:~$ ceph config set mds.stefanmds1 public_addr 2001:7b8:642:0:1337::87

(mgmt)root@stefanmds1:~$ systemctl restart ceph-mds.target

(mgmt)root@stefanmds1:~$ ceph daemon mds.stefanmds1 config get public_addr
{
    "public_addr": "v2:[2001:7b8:642:0:1337::87]:0/0"
}

ceph fs dump

--- snip ---
[mds.stefanmds1{-1:80340454} state up:standby seq 2 addr [v2:[2001:7b8:642:0:1337::87]:6800/2355734993,v1:[2001:7b8:642:0:1337::87]:6801/2355734993,v2:0.0.0.0:6800/2355734993,v1:0.0.0.0:6801/2355734993]]
--- snap ---

However, shortly after I configured this for mds1, it became laggy and was replaced by mds2. Things only went back to normal after I reverted the change. I then set a public_addr for both mds1 and mds2 and restarted both daemons. That results in a degraded filesystem: mds: cephfs:1/1 {0=stefanmds1=up:reconnect} 1 up:standby and an MDS using a lot of CPU while logging (only) this at high speed:
2021-03-03 16:21:09.957 7fe863606700 1 mds.stefanmds1 parse_caps: cannot decode auth caps buffer of length 0

Actions #1

Updated by Stefan Kooman about 3 years ago

After removing the specific public_addr and restarting the MDSes, the situation returns to normal and the cluster recovers:

mds: cephfs:1 {0=stefanmds1=up:active} 1 up:standby

If ms_bind_ipv4=false is set, then the above issues do not occur when IPv6 public_addrs are set.
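The workaround above can be sketched as a ceph.conf fragment (daemon name and address taken from this report; this reflects my observation, not a verified fix):

```ini
[global]
# Do not bind IPv4 at all, which avoids the extra 0.0.0.0 wildcard bind
ms_bind_ipv4 = false
ms_bind_ipv6 = true

[mds.stefanmds1]
public_addr = 2001:7b8:642:0:1337::87
```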

Actions #2

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS