Bug #52761
OSDs announcing incorrect front_addr after upgrade to 16.2.6
Description
Ceph cluster configured with a public and cluster network:
ceph config dump|grep network
global advanced cluster_network 10.114.0.0/16 *
mon advanced public_network 10.113.0.0/16 *
The cluster was upgraded from 16.2.4 to 16.2.6, and all nodes were rebooted after the upgrade.
While investigating clients that could not connect, I found that they are being directed to the cluster_network address of some OSDs.
The OSD metadata shows that on most OSDs the front addresses are correctly on the 10.113 public network, as in this one:
osd.0
"back_addr": "[v2:10.114.29.10:6813/2947358317,v1:10.114.29.10:6819/2947358317]",
"front_addr": "[v2:10.113.29.10:6801/2947358317,v1:10.113.29.10:6807/2947358317]",
"hb_back_addr": "[v2:10.114.29.10:6837/2947358317,v1:10.114.29.10:6843/2947358317]",
"hb_front_addr": "[v2:10.113.29.10:6825/2947358317,v1:10.113.29.10:6832/2947358317]",
However, there are also many OSDs where the addresses are incorrect, and the error takes different forms.
For example, in some OSDs only the front_addr is wrong, while the hb_front_addr is fine:
osd.26
"back_addr": "[v2:10.114.29.5:6866/4155549673,v1:10.114.29.5:6867/4155549673]",
"front_addr": "[v2:10.114.29.5:6864/4155549673,v1:10.114.29.5:6865/4155549673]",
"hb_back_addr": "[v2:10.114.29.5:6870/4155549673,v1:10.114.29.5:6871/4155549673]",
"hb_front_addr": "[v2:10.113.29.5:6868/4155549673,v1:10.113.29.5:6869/4155549673]",
In others only the hb_front_addr is wrong:
osd.34
"back_addr": "[v2:10.114.29.6:6802/3934363792,v1:10.114.29.6:6803/3934363792]",
"front_addr": "[v2:10.113.29.6:6800/3934363792,v1:10.113.29.6:6801/3934363792]",
"hb_back_addr": "[v2:10.114.29.6:6806/3934363792,v1:10.114.29.6:6807/3934363792]",
"hb_front_addr": "[v2:10.114.29.6:6804/3934363792,v1:10.114.29.6:6805/3934363792]",
And in others both are wrong:
osd.32
"back_addr": "[v2:10.114.29.10:6814/2403531529,v1:10.114.29.10:6820/2403531529]",
"front_addr": "[v2:10.114.29.10:6802/2403531529,v1:10.114.29.10:6808/2403531529]",
"hb_back_addr": "[v2:10.114.29.10:6836/2403531529,v1:10.114.29.10:6841/2403531529]",
"hb_front_addr": "[v2:10.114.29.10:6826/2403531529,v1:10.114.29.10:6830/2403531529]",
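The three failure modes above can be told apart mechanically. The following is a minimal Python sketch, not part of Ceph, that checks the front_addr and hb_front_addr strings of one OSD against the public network. The function names (addrs_in, wrong_front_fields) are hypothetical, and it assumes IPv4 addresses in the addrvec format shown in the metadata above.

```python
import ipaddress
import re

# public_network as reported by "ceph config dump" in this report
PUBLIC_NET = ipaddress.ip_network("10.113.0.0/16")


def addrs_in(addrvec):
    """Extract the IPv4 addresses from an addrvec string such as
    "[v2:10.113.29.10:6801/2947358317,v1:10.113.29.10:6807/2947358317]"."""
    found = re.findall(r"v[12]:(\d+\.\d+\.\d+\.\d+):", addrvec)
    return [ipaddress.ip_address(ip) for ip in found]


def wrong_front_fields(meta, public_net=PUBLIC_NET):
    """Return the front-side fields with an address outside public_net.

    A field with no parsable address is treated as fine (assumption)."""
    return [
        field
        for field in ("front_addr", "hb_front_addr")
        if any(ip not in public_net for ip in addrs_in(meta[field]))
    ]
```

Applied to the metadata quoted above, this would return an empty list for osd.0, only front_addr for osd.26, only hb_front_addr for osd.34, and both fields for osd.32.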
This affects only the front address assignment; the back_addr of every OSD is on the cluster network (10.114).
A single node can host both OSDs with the right configuration and OSDs that are announcing wrong front addresses.
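To see which nodes carry such a mix, the OSD metadata can be grouped by host. This is a rough Python sketch under some assumptions: the input is the JSON list printed by `ceph osd metadata`, already parsed, with `id` and `hostname` fields alongside the address fields, and all addresses are IPv4. The helper names (front_is_public, mixed_hosts) are hypothetical.

```python
import ipaddress
import re
from collections import defaultdict


def front_is_public(meta, public_net):
    """True if every front_addr and hb_front_addr address lies in public_net."""
    ips = re.findall(
        r"v[12]:(\d+\.\d+\.\d+\.\d+):",
        meta["front_addr"] + meta["hb_front_addr"],
    )
    return all(ipaddress.ip_address(ip) in public_net for ip in ips)


def mixed_hosts(metadata, public_network="10.113.0.0/16"):
    """Map each host that has both correct and incorrect OSDs
    to the sorted ids of its incorrect OSDs."""
    net = ipaddress.ip_network(public_network)
    good, bad = defaultdict(list), defaultdict(list)
    for meta in metadata:
        bucket = good if front_is_public(meta, net) else bad
        bucket[meta["hostname"]].append(meta["id"])
    # Keep only hosts that also have at least one correctly configured OSD.
    return {host: sorted(ids) for host, ids in bad.items() if host in good}
```

For instance, osd.0 and osd.32 above both use addresses ending in .29.10, so if they report the same hostname, that host would appear in the result with osd 32 listed as incorrect.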