Bug #49584


Ceph OSD, MDS, and MGR daemons do not bind _only_ to the specified address when configured to do so, resulting in a degraded cluster state

Added by Stefan Kooman about 3 years ago. Updated almost 3 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Documentation (https://docs.ceph.com/en/octopus/rados/configuration/network-config-ref/#ceph-daemons) states the following:

The MGR, OSD, and MDS daemons will bind to any available address and do not require any special configuration. However, it is possible to specify a specific IP address for them to bind to with the public addr (and/or, in the case of OSD daemons, the cluster addr) configuration option. For example,

[osd.0]
public addr = {host-public-ip-address}
cluster addr = {host-cluster-ip-address}

However, this does not appear to be entirely true. When ms_bind_ipv4=true and ms_bind_ipv6=true, the daemon binds to the specified address (I tested with a public addr), but also to the wildcard '0.0.0.0' IPv4 address when no specific IPv4 address is configured.
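A minimal ceph.conf fragment that reproduces the observed behaviour (the daemon name and IPv6 address are the ones from this report; treat them as placeholders for your own cluster):

```ini
[global]
# Both address families enabled: under this configuration the daemon
# also binds the IPv4 wildcard 0.0.0.0, even with a public_addr set
ms_bind_ipv4 = true
ms_bind_ipv6 = true

[mds.stefanmds1]
# Only an IPv6 public address is specified; no IPv4 address is configured
public_addr = 2001:7b8:642:0:1337::87
```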

ceph fs dump:

--- snip ---
Standby daemons:

[mds.stefanmds2{-1:76745848} state up:standby seq 1 addr [v2:[2001:7b8:642:0:1337::88]:6800/27999541,v1:[2001:7b8:642:0:1337::88]:6801/27999541,v2:0.0.0.0:6802/27999541,v1:0.0.0.0:6803/27999541]]
--- snap ---
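The extra wildcard binds are visible in the addrvec strings above. A small helper can flag them when scanning `ceph fs dump` output; this is only a sketch, `wildcard_binds` is a hypothetical helper, and the addrvec format is assumed from the dumps shown in this report:

```python
import re

def wildcard_binds(addrvec: str) -> list[str]:
    """Return the entries in a Ceph addrvec string that are wildcard binds.

    An addrvec (as printed by `ceph fs dump`) looks like:
    [v2:ADDR:PORT/NONCE,v1:ADDR:PORT/NONCE,...]
    """
    entries = addrvec.strip("[]").split(",")
    # Match an IPv4 0.0.0.0 or IPv6 [::] wildcard between the ':' separators
    return [e for e in entries if re.search(r":(?:0\.0\.0\.0|\[::\]):", e)]

dump_line = ("[v2:[2001:7b8:642:0:1337::88]:6800/27999541,"
             "v1:[2001:7b8:642:0:1337::88]:6801/27999541,"
             "v2:0.0.0.0:6802/27999541,v1:0.0.0.0:6803/27999541]")
print(wildcard_binds(dump_line))
# → ['v2:0.0.0.0:6802/27999541', 'v1:0.0.0.0:6803/27999541']
```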

(mgmt)root@stefanmds1:~$ ceph daemon mds.stefanmds1 config get public_addr
{
    "public_addr": "-"
}

root@stefanmds1:~$ ceph config set mds.stefanmds1 public_addr 2001:7b8:642:0:1337::87

(mgmt)root@stefanmds1:~$ systemctl restart ceph-mds.target

(mgmt)root@stefanmds1:~$ ceph daemon mds.stefanmds1 config get public_addr
{
    "public_addr": "v2:[2001:7b8:642:0:1337::87]:0/0"
}

ceph fs dump

--- snip ---
[mds.stefanmds1{-1:80340454} state up:standby seq 2 addr [v2:[2001:7b8:642:0:1337::87]:6800/2355734993,v1:[2001:7b8:642:0:1337::87]:6801/2355734993,v2:0.0.0.0:6800/2355734993,v1:0.0.0.0:6801/2355734993]]
--- snap ---

However, shortly after I configured this for mds1, it became laggy and was replaced by mds2. Things only went back to normal after I reverted the change. I then set a public_addr for both mds1 and mds2 and restarted both daemons. That results in a degraded filesystem: mds: cephfs:1/1 {0=stefanmds1=up:reconnect} 1 up:standby and an MDS using a lot of CPU while logging (only) this at high speed:
2021-03-03 16:21:09.957 7fe863606700 1 mds.stefanmds1 parse_caps: cannot decode auth caps buffer of length 0

Actions #1

Updated by Stefan Kooman about 3 years ago

After removing the specific public_addr and restarting the MDSes, the situation returns to normal and the cluster recovers:

mds: cephfs:1 {0=stefanmds1=up:active} 1 up:standby

If ms_bind_ipv4=false is set, then the above issues do not occur when IPv6 public_addrs are set.
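The workaround above can be sketched as a ceph.conf fragment (daemon name and address taken from this report; this reflects my observation, not a verified fix):

```ini
[global]
# Do not bind IPv4 at all, which avoids the extra 0.0.0.0 wildcard bind
ms_bind_ipv4 = false
ms_bind_ipv6 = true

[mds.stefanmds1]
public_addr = 2001:7b8:642:0:1337::87
```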

Actions #2

Updated by Sage Weil almost 3 years ago

  • Project changed from Ceph to RADOS