Bug #52761
OSDs announcing incorrect front_addr after upgrade to 16.2.6
Status: Open
Description
Ceph cluster configured with a public and cluster network:
ceph config dump | grep network
global  advanced  cluster_network  10.114.0.0/16  *
mon     advanced  public_network   10.113.0.0/16  *
Upgraded from 16.2.4 to 16.2.6 and all nodes rebooted after the upgrade.
Investigating an issue with clients not being able to connect I found that the problem is that clients are directed to the cluster_network address for some OSDs.
Looking at the osd metadata I see in most OSDs the front addresses are correctly configured through the 10.113 public network, like this one:
osd.0
"back_addr": "[v2:10.114.29.10:6813/2947358317,v1:10.114.29.10:6819/2947358317]",
"front_addr": "[v2:10.113.29.10:6801/2947358317,v1:10.113.29.10:6807/2947358317]",
"hb_back_addr": "[v2:10.114.29.10:6837/2947358317,v1:10.114.29.10:6843/2947358317]",
"hb_front_addr": "[v2:10.113.29.10:6825/2947358317,v1:10.113.29.10:6832/2947358317]",
There are also many OSDs where the configuration is incorrect, and the error shows up in different ways.
For example, in some OSDs only the front_addr is wrong while the hb_front_addr is fine:
osd.26
"back_addr": "[v2:10.114.29.5:6866/4155549673,v1:10.114.29.5:6867/4155549673]",
"front_addr": "[v2:10.114.29.5:6864/4155549673,v1:10.114.29.5:6865/4155549673]",
"hb_back_addr": "[v2:10.114.29.5:6870/4155549673,v1:10.114.29.5:6871/4155549673]",
"hb_front_addr": "[v2:10.113.29.5:6868/4155549673,v1:10.113.29.5:6869/4155549673]",
In others it is the hb_front_addr that is wrong:
osd.34
"back_addr": "[v2:10.114.29.6:6802/3934363792,v1:10.114.29.6:6803/3934363792]",
"front_addr": "[v2:10.113.29.6:6800/3934363792,v1:10.113.29.6:6801/3934363792]",
"hb_back_addr": "[v2:10.114.29.6:6806/3934363792,v1:10.114.29.6:6807/3934363792]",
"hb_front_addr": "[v2:10.114.29.6:6804/3934363792,v1:10.114.29.6:6805/3934363792]",
And in others both are wrong:
osd.32
"back_addr": "[v2:10.114.29.10:6814/2403531529,v1:10.114.29.10:6820/2403531529]",
"front_addr": "[v2:10.114.29.10:6802/2403531529,v1:10.114.29.10:6808/2403531529]",
"hb_back_addr": "[v2:10.114.29.10:6836/2403531529,v1:10.114.29.10:6841/2403531529]",
"hb_front_addr": "[v2:10.114.29.10:6826/2403531529,v1:10.114.29.10:6830/2403531529]",
This happens only for the front address assignment: the back_addr of every OSD is correctly on the cluster network (10.114).
In the same node there can be OSDs that have the right configuration and OSDs that are announcing wrong front addresses.
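As an illustration of the check being done here, the following is a minimal Python sketch (not part of the report; the public network and the osd.26 sample metadata are copied from the description) that flags the front-side fields whose addresses fall outside the public network:

```python
# Minimal sketch (not from the report): flag OSD metadata fields whose
# addresses are not on the public network. The network and the osd.26
# sample below are taken from the bug description.
import ipaddress
import re

PUBLIC_NET = ipaddress.ip_network("10.113.0.0/16")

def addrs_in(addrvec):
    """Extract the IPs from an addrvec string like
    '[v2:10.114.29.5:6866/4155549673,v1:10.114.29.5:6867/4155549673]'."""
    return [ipaddress.ip_address(ip) for ip in re.findall(r"v[12]:([\d.]+):", addrvec)]

def misconfigured_fields(metadata):
    """Return the front-side fields whose addresses are NOT on the public network."""
    return [field for field in ("front_addr", "hb_front_addr")
            if any(ip not in PUBLIC_NET for ip in addrs_in(metadata[field]))]

# osd.26 from the report: front_addr is on the cluster network (10.114),
# while hb_front_addr is correct.
osd26 = {
    "front_addr": "[v2:10.114.29.5:6864/4155549673,v1:10.114.29.5:6865/4155549673]",
    "hb_front_addr": "[v2:10.113.29.5:6868/4155549673,v1:10.113.29.5:6869/4155549673]",
}
print(misconfigured_fields(osd26))  # -> ['front_addr']
```

In practice the metadata dict would come from `ceph osd metadata <id>`, which prints JSON containing these fields.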
Updated by Javier Cacheiro over 2 years ago
Some statistics as of now:
- 51 cases with an error in the front_addr or hb_front_addr configuration.
- 333 cases where the configuration is correct.
Updated by Javier Cacheiro over 2 years ago
Restarting the daemons seems to restore the correct configuration, but it is unclear why this did not happen when all the nodes were rebooted after the upgrade.
Updated by Javier Cacheiro over 2 years ago
In some cases it takes several daemon restarts before the configuration comes up correctly.
I don't know if the wrong config is something that happens randomly, with a lower probability, on each daemon start.
Updated by Javier Cacheiro over 2 years ago
Upgraded from v16.2.6 to v16.2.6-20210927 to apply the remoto bug fix.
After the upgrade (the nodes were not rebooted, but the daemons were restarted by the upgrade), 6 OSDs are still announcing an incorrect front_addr and hb_front_addr (in this case all of these OSDs announce both front_addr and hb_front_addr on the cluster_network).
Updated by Javier Cacheiro over 2 years ago
I kept restarting the incorrectly configured OSD daemons until they got the right front_addr; in some cases it took several restarts.
Now all osds are announcing the correct front_addr and hb_front_addr.
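For reference, the restart-until-correct workaround described above could be automated roughly as follows. This is a hypothetical sketch, not from the report: it assumes a cephadm/orchestrator deployment where `ceph orch daemon restart osd.N` is available, and the restart command should be adapted to your deployment.

```python
# Hypothetical sketch of the restart-until-correct workaround; the restart
# command assumes a cephadm/orchestrator deployment and may need adapting.
import ipaddress
import json
import subprocess
import time

PUBLIC_NET = ipaddress.ip_network("10.113.0.0/16")  # from `ceph config dump`

def parse_front_ip(front_addr):
    """Extract the first IP from an addrvec string like
    '[v2:10.113.29.10:6801/2947358317,v1:...]'."""
    return ipaddress.ip_address(front_addr.split(":")[1])

def front_ip(osd_id):
    """Read the front_addr IP that an OSD currently announces."""
    out = subprocess.check_output(["ceph", "osd", "metadata", str(osd_id)])
    return parse_front_ip(json.loads(out)["front_addr"])

def restart_until_public(osd_id, max_tries=5):
    """Restart an OSD until its front_addr lands on the public network."""
    for _ in range(max_tries):
        if front_ip(osd_id) in PUBLIC_NET:
            return True
        subprocess.run(["ceph", "orch", "daemon", "restart", f"osd.{osd_id}"],
                       check=True)
        time.sleep(30)  # let the daemon come back up and re-announce
    return front_ip(osd_id) in PUBLIC_NET
```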
Updated by Neha Ojha over 2 years ago
The docs suggest setting public_network in the global section, not just for the mons: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#public-network. Can you give this a try and see if it helps?
Updated by Javier Cacheiro over 2 years ago
Yes, I tried that, but it does not change the behavior:
ceph config set global public_network 10.113.0.0/16
and then ran the daemon reconfig.
Same behavior.
As a further comment, the config with the setting only for the mon section comes directly from cephadm, from when I bootstrapped the cluster with:
cephadm bootstrap --mon-ip 10.113.26.1 --cluster-network 10.114.0.0/16
It ran with no issues until the upgrade.