Bug #52761
OSDs announcing incorrect front_addr after upgrade to 16.2.6
Status: Open
Description
Ceph cluster configured with a public and cluster network:
ceph config dump | grep network
global  advanced  cluster_network  10.114.0.0/16  *
mon     advanced  public_network   10.113.0.0/16  *
Upgraded from 16.2.4 to 16.2.6 and all nodes rebooted after the upgrade.
Investigating an issue with clients not being able to connect I found that the problem is that clients are directed to the cluster_network address for some OSDs.
Looking at the osd metadata I see in most OSDs the front addresses are correctly configured through the 10.113 public network, like this one:
osd.0
"back_addr": "[v2:10.114.29.10:6813/2947358317,v1:10.114.29.10:6819/2947358317]",
"front_addr": "[v2:10.113.29.10:6801/2947358317,v1:10.113.29.10:6807/2947358317]",
"hb_back_addr": "[v2:10.114.29.10:6837/2947358317,v1:10.114.29.10:6843/2947358317]",
"hb_front_addr": "[v2:10.113.29.10:6825/2947358317,v1:10.113.29.10:6832/2947358317]",
There are also many OSDs where the configuration is incorrect, and the error shows up in different ways.
For example, in some OSDs only the front_addr is wrong while the hb_front_addr is fine:
osd.26
"back_addr": "[v2:10.114.29.5:6866/4155549673,v1:10.114.29.5:6867/4155549673]",
"front_addr": "[v2:10.114.29.5:6864/4155549673,v1:10.114.29.5:6865/4155549673]",
"hb_back_addr": "[v2:10.114.29.5:6870/4155549673,v1:10.114.29.5:6871/4155549673]",
"hb_front_addr": "[v2:10.113.29.5:6868/4155549673,v1:10.113.29.5:6869/4155549673]",
In others it is the hb_front_addr that is wrong:
osd.34
"back_addr": "[v2:10.114.29.6:6802/3934363792,v1:10.114.29.6:6803/3934363792]",
"front_addr": "[v2:10.113.29.6:6800/3934363792,v1:10.113.29.6:6801/3934363792]",
"hb_back_addr": "[v2:10.114.29.6:6806/3934363792,v1:10.114.29.6:6807/3934363792]",
"hb_front_addr": "[v2:10.114.29.6:6804/3934363792,v1:10.114.29.6:6805/3934363792]",
And in others both are wrong:
osd.32
"back_addr": "[v2:10.114.29.10:6814/2403531529,v1:10.114.29.10:6820/2403531529]",
"front_addr": "[v2:10.114.29.10:6802/2403531529,v1:10.114.29.10:6808/2403531529]",
"hb_back_addr": "[v2:10.114.29.10:6836/2403531529,v1:10.114.29.10:6841/2403531529]",
"hb_front_addr": "[v2:10.114.29.10:6826/2403531529,v1:10.114.29.10:6830/2403531529]",
This happens only for the front address assignment: the back_addr of every OSD is correctly on the cluster network (10.114).
In the same node there can be OSDs that have the right configuration and OSDs that are announcing wrong front addresses.
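As an illustration of the check being done here, the following is a minimal Python sketch (not part of the report; the public network and the osd.26 sample metadata are copied from the description) that flags the front-side fields whose addresses fall outside the public network:

```python
# Minimal sketch (not from the report): flag OSD metadata fields whose
# addresses are not on the public network. The network and the osd.26
# sample below are taken from the bug description.
import ipaddress
import re

PUBLIC_NET = ipaddress.ip_network("10.113.0.0/16")

def addrs_in(addrvec):
    """Extract the IPs from an addrvec string like
    '[v2:10.114.29.5:6866/4155549673,v1:10.114.29.5:6867/4155549673]'."""
    return [ipaddress.ip_address(ip) for ip in re.findall(r"v[12]:([\d.]+):", addrvec)]

def misconfigured_fields(metadata):
    """Return the front-side fields whose addresses are NOT on the public network."""
    return [field for field in ("front_addr", "hb_front_addr")
            if any(ip not in PUBLIC_NET for ip in addrs_in(metadata[field]))]

# osd.26 from the report: front_addr is on the cluster network (10.114),
# while hb_front_addr is correct.
osd26 = {
    "front_addr": "[v2:10.114.29.5:6864/4155549673,v1:10.114.29.5:6865/4155549673]",
    "hb_front_addr": "[v2:10.113.29.5:6868/4155549673,v1:10.113.29.5:6869/4155549673]",
}
print(misconfigured_fields(osd26))  # -> ['front_addr']
```

In practice the metadata dict would come from `ceph osd metadata <id>`, which prints JSON containing these fields.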
Updated by Javier Cacheiro over 2 years ago
Some statistics as of now:
- 51 cases with an error in the front_addr or hb_front_addr configuration.
- 333 cases where the configuration is correct.
Updated by Javier Cacheiro over 2 years ago
Restarting the daemons seems to restore the correct configuration, but it is unclear why this did not happen when all the nodes were rebooted after the upgrade.
Updated by Javier Cacheiro over 2 years ago
In some cases it takes several daemon restarts before the configuration comes up correctly.
I don't know if the wrong config is something that happens randomly, with a lower probability, on each daemon start.
Updated by Javier Cacheiro over 2 years ago
Upgraded from v16.2.6 to v16.2.6-20210927 to apply the remoto bug fix.
After the upgrade (the nodes were not rebooted, but the daemons were restarted by the upgrade), 6 OSDs are still announcing an incorrect front_addr and hb_front_addr (in this case all of these OSDs announce both front_addr and hb_front_addr on the cluster_network).
Updated by Javier Cacheiro over 2 years ago
I kept restarting the incorrectly configured OSD daemons until they got the right front_addr; in some cases it took several restarts.
Now all osds are announcing the correct front_addr and hb_front_addr.
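For reference, the restart-until-correct workaround described above could be automated roughly as follows. This is a hypothetical sketch, not from the report: it assumes a cephadm/orchestrator deployment where `ceph orch daemon restart osd.N` is available, and the restart command should be adapted to your deployment.

```python
# Hypothetical sketch of the restart-until-correct workaround; the restart
# command assumes a cephadm/orchestrator deployment and may need adapting.
import ipaddress
import json
import subprocess
import time

PUBLIC_NET = ipaddress.ip_network("10.113.0.0/16")  # from `ceph config dump`

def parse_front_ip(front_addr):
    """Extract the first IP from an addrvec string like
    '[v2:10.113.29.10:6801/2947358317,v1:...]'."""
    return ipaddress.ip_address(front_addr.split(":")[1])

def front_ip(osd_id):
    """Read the front_addr IP that an OSD currently announces."""
    out = subprocess.check_output(["ceph", "osd", "metadata", str(osd_id)])
    return parse_front_ip(json.loads(out)["front_addr"])

def restart_until_public(osd_id, max_tries=5):
    """Restart an OSD until its front_addr lands on the public network."""
    for _ in range(max_tries):
        if front_ip(osd_id) in PUBLIC_NET:
            return True
        subprocess.run(["ceph", "orch", "daemon", "restart", f"osd.{osd_id}"],
                       check=True)
        time.sleep(30)  # let the daemon come back up and re-announce
    return front_ip(osd_id) in PUBLIC_NET
```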
Updated by Neha Ojha over 2 years ago
The docs suggest setting public_network in the global section, not just for the mons: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#public-network. Can you give this a try and see if it helps?
Updated by Javier Cacheiro over 2 years ago
Yes, I tried that, but it does not change the behavior:
ceph config set global public_network 10.113.0.0/16
and then ran the daemon reconfig.
Same behavior.
As a further comment, the config with the setting only for the mon section comes directly from cephadm, from when I bootstrapped the cluster with:
cephadm bootstrap --mon-ip 10.113.26.1 --cluster-network 10.114.0.0/16
It ran with no issues until the upgrade.