Bug #46845
Status: Closed
Newly orchestrated OSD fails with 'unable to find any IPv4 address in networks '2001:db8:11d::/120'' with ms_bind_ipv6=true
Description
I just started deploying 60 OSDs to my new 15.2.4 Octopus IPv6 cephadm cluster. I applied the spec for the OSDs and the orchestrator started creating them. Unfortunately, all 60 OSDs crashed at startup with the following message: 'unable to find any IPv4 address in networks '2001:db8:11d::/120''
ms_bind_ipv6 is set to true.
```
-- The job identifier is 14258.
Aug 06 09:21:01 node3.example.net bash[64671]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-22
Aug 06 09:21:01 node3.example.net bash[64671]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-24e819f4-9089-48ae-b817-014a29addf23/osd-data-0ccc10ee-018d-43e8-8350-6ea1dd67102e --path /var/lib/ceph/osd/ceph-22 --no-mon-config
Aug 06 09:21:01 node3.example.net bash[64671]: Running command: /usr/bin/ln -snf /dev/ceph-24e819f4-9089-48ae-b817-014a29addf23/osd-data-0ccc10ee-018d-43e8-8350-6ea1dd67102e /var/lib/ceph/osd/ceph-22/block
Aug 06 09:21:01 node3.example.net bash[64671]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-22/block
Aug 06 09:21:01 node3.example.net bash[64671]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--24e819f4--9089--48ae--b817--014a29addf23-osd--data--0ccc10ee--018d--43e8--8350--6ea1dd67102e
Aug 06 09:21:01 node3.example.net bash[64671]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-22
Aug 06 09:21:01 node3.example.net bash[64671]: --> ceph-volume lvm activate successful for osd ID: 22
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.465+0000 7fee3e813f40 0 set uid:gid to 167:167 (ceph:ceph)
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.465+0000 7fee3e813f40 0 ceph version 15.2.4 (7447c15c6ff58d7fce91843b705a268a1917325c) octopus (stable), process ceph-osd, pid 1
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.465+0000 7fee3e813f40 0 pidfile_write: ignore empty --pid-file
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev create path /var/lib/ceph/osd/ceph-22/block type kernel
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev(0x562f2f600000 /var/lib/ceph/osd/ceph-22/block) open path /var/lib/ceph/osd/ceph-22/block
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev(0x562f2f600000 /var/lib/ceph/osd/ceph-22/block) open size 1000203091968 (0xe8e0c00000, 932 GiB) block_size 4096 (4 KiB) non-rotational discard supported
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bluestore(/var/lib/ceph/osd/ceph-22) _set_cache_sizes cache_size 3221225472 meta 0.4 kv 0.4 data 0.2
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev create path /var/lib/ceph/osd/ceph-22/block type kernel
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev(0x562f2f600700 /var/lib/ceph/osd/ceph-22/block) open path /var/lib/ceph/osd/ceph-22/block
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev(0x562f2f600700 /var/lib/ceph/osd/ceph-22/block) open size 1000203091968 (0xe8e0c00000, 932 GiB) block_size 4096 (4 KiB) non-rotational discard supported
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-22/block size 932 GiB
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.469+0000 7fee3e813f40 1 bdev(0x562f2f600700 /var/lib/ceph/osd/ceph-22/block) close
Aug 06 09:21:01 node3.example.net bash[64907]: debug 2020-08-06T07:21:01.773+0000 7fee3e813f40 1 bdev(0x562f2f600000 /var/lib/ceph/osd/ceph-22/block) close
Aug 06 09:21:02 node3.example.net bash[64907]: debug 2020-08-06T07:21:02.037+0000 7fee3e813f40 1 objectstore numa_node 0
Aug 06 09:21:02 node3.example.net bash[64907]: debug 2020-08-06T07:21:02.037+0000 7fee3e813f40 0 starting osd.22 osd_data /var/lib/ceph/osd/ceph-22 /var/lib/ceph/osd/ceph-22/journal
Aug 06 09:21:02 node3.example.net bash[64907]: debug 2020-08-06T07:21:02.037+0000 7fee3e813f40 -1 unable to find any IPv4 address in networks '2001:db8:11d::/120' interfaces ''
Aug 06 09:21:02 node3.example.net bash[64907]: debug 2020-08-06T07:21:02.037+0000 7fee3e813f40 -1 unable to find any IPv4 address in networks '2001:db8:11d::/120' interfaces ''
Aug 06 09:21:02 node3.example.net bash[64907]: debug 2020-08-06T07:21:02.037+0000 7fee3e813f40 -1 Failed to pick public address.
Aug 06 09:21:02 node3.example.net systemd[1]: ceph-d77f7c4a-d656-11ea-95cb-531234b0f844@osd.22.service: Main process exited, code=exited, status=1/FAILURE
```
I double-checked that ms_bind_ipv6 was set to true, and it is.
While searching for ms_bind I noticed that ms_bind_ipv4 also exists, and that it too was set to true (its default). When I set it to false, the OSDs can boot. Switching ms_bind_ipv4 back to the default (true), the OSDs again fail to start.
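The failure mode described above can be reproduced outside Ceph: a lookup that filters local addresses by family will never find an IPv4 address inside an IPv6 CIDR. The sketch below is my own simplification of that idea (the function name and structure are hypothetical, not Ceph's actual pick_address code), using addresses from the report:

```python
import ipaddress

def find_addr_in_networks(networks, candidate_addrs, want_family):
    # Hypothetical simplification: for each configured network, look for
    # a local address of the requested IP family that falls inside it.
    for net_str in networks:
        net = ipaddress.ip_network(net_str)
        for addr_str in candidate_addrs:
            addr = ipaddress.ip_address(addr_str)
            if addr.version == want_family and addr in net:
                return addr
    return None

# The host only has an IPv6 address inside the public network
# (the ::22 host address is a made-up example).
local_addrs = ["2001:db8:11d::22"]
public_networks = ["2001:db8:11d::/120"]

# ms_bind_ipv6=true: the IPv6 lookup succeeds.
print(find_addr_in_networks(public_networks, local_addrs, 6))  # → 2001:db8:11d::22

# ms_bind_ipv4=true (the default): the IPv4 lookup finds nothing,
# which the daemon treats as fatal ("Failed to pick public address").
print(find_addr_in_networks(public_networks, local_addrs, 4))  # → None
```

This matches the observed behaviour: disabling ms_bind_ipv4 skips the doomed IPv4 lookup entirely, so the OSD can start.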
ms_bind_ipv4 set to false (for OSD only):
```
Aug 06 09:55:22 node3.example.net bash[66959]: debug 2020-08-06T07:55:22.013+0000 7f54b86daf40 0 starting osd.22 osd_data /var/lib/ceph/osd/ceph-22 /var/lib/ceph/osd/ceph-22/journal
Aug 06 09:55:22 node3.example.net bash[66959]: debug 2020-08-06T07:55:22.033+0000 7f54b86daf40 0 load: jerasure load: lrc load: isa
Aug 06 09:55:22 node3.example.net bash[66959]: debug 2020-08-06T07:55:22.033+0000 7f54b86daf40 1 bdev create path /var/lib/ceph/osd/ceph-22/block type kernel
Aug 06 09:55:22 node3.example.net bash[66959]: debug 2020-08-06T07:55:22.033+0000 7f54b86daf40 1 bdev(0x55c143d6a000 /var/lib/ceph/osd/ceph-22/block) open path /var/lib/ceph/osd/ceph-22/block
...snip...
Aug 06 09:55:25 node3.example.net bash[66959]: debug 2020-08-06T07:55:25.628+0000 7f54a21c7700 1 osd.22 88 state: booting -> active
```
With ms_bind_ipv4 back at the default value (true), it fails to start again:
```
Aug 06 10:10:43 node3.example.net bash[70455]: debug 2020-08-06T08:10:43.617+0000 7f78b53d3f40 0 starting osd.22 osd_data /var/lib/ceph/osd/ceph-22 /var/lib/ceph/osd/ceph-22/journal
Aug 06 10:10:43 node3.example.net bash[70455]: debug 2020-08-06T08:10:43.617+0000 7f78b53d3f40 -1 unable to find any IPv4 address in networks '2001:db8:11d::/120' interfaces ''
```
To be sure this was the only thing in the way, I tried it two more times. I can confirm that with ms_bind_ipv4 set to false my OSDs can boot, and with ms_bind_ipv4 at its default (true) they fail to boot.
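For anyone hitting the same error, the workaround described above can be applied cluster-wide with the `ceph config` CLI. The commands below are a sketch assuming a cephadm-managed Octopus cluster; `osd.22` is just the daemon from the logs above:

```shell
# Disable IPv4 binding for OSDs only (the setting the reporter toggled).
ceph config set osd ms_bind_ipv4 false

# Verify the change took effect.
ceph config get osd ms_bind_ipv4

# Restart a failed daemon so it picks up the new setting
# (or let systemd's restart logic retry it).
ceph orch daemon restart osd.22
```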
If you need any more information, I'd be happy to supply it.
Updated by Matthew Oliver over 3 years ago
I think this is a duplicate of https://tracker.ceph.com/issues/39711
The workaround there was also to disable `ms_bind_ipv4`, as it's enabled by default and setting `ms_bind_ipv6` doesn't disable it.
Your tracker bug has more details than the other one though, which is nice. I'm no expert in this area of the codebase yet, but I'll use it to attempt to track down the issue so we don't need a workaround; the network in the error clearly isn't IPv4, so this may just be a failure in how the address is picked. I'll play with this tomorrow (it's late here in Oz).
Updated by Matthew Oliver over 3 years ago
I've managed to recreate the issue in a vstart env. It happens when I enable IPv6 and set the `public network` to an IPv6 network while `ms_bind_ipv4` stays at its default of true. Now I can debug!
Hopefully have a solution/PR soon :)
Updated by Daniël Vos over 3 years ago
Matthew Oliver wrote:
I've managed to recreate the issue in a vstart env. It happens when I use ipv6 but set the `public network` to an ipv6 network. Now I can debug!
Hopefully have a solution/PR soon :)
That's great! My `public network` and `cluster network` each have their own /120. I `cephadm bootstrap`ped my cluster with a ceph.conf that contained three settings: the public/cluster networks and `ms bind ipv6 = true`.
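For reference, a ceph.conf along these lines reproduces the situation. This is a reconstruction from the description above, not the reporter's actual file; the public network comes from the error message, while the cluster network /120 is a placeholder:

```ini
[global]
public network = 2001:db8:11d::/120
# hypothetical -- the report only says the cluster network has its own /120
cluster network = 2001:db8:11e::/120
ms bind ipv6 = true
# The missing piece: without this line, the default ms_bind_ipv4=true makes
# every OSD search the IPv6-only networks for an IPv4 address and abort.
ms bind ipv4 = false
```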
Updated by Matthew Oliver over 3 years ago
- Status changed from New to In Progress
- Assignee set to Matthew Oliver
Cool, I've tracked down what's happening and will push the first version of a patch on Monday. I think that if we get an IP from the network, we shouldn't stop the OSD from starting; if there isn't a network for every address family, we should warn and continue.
So it'll be a PR containing a bit of code change plus documentation making clear how single-stack (IPv4 or IPv6) and dual-stack setups need to be configured. That's the plan, anyway.
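The "warn and continue" behaviour proposed above can be sketched like this. This is my own illustration of the idea under discussion, not the actual patch (the function name and structure are hypothetical; the real change lands in Ceph's C++ address-picking code):

```python
import ipaddress

def pick_addresses(networks, local_addrs, bind_ipv4=True, bind_ipv6=False):
    # Proposed behaviour: a missing address family is a warning, not a
    # fatal error; only finding no address at all aborts startup.
    nets = [ipaddress.ip_network(n) for n in networks]
    addrs = [ipaddress.ip_address(a) for a in local_addrs]
    picked = []
    for family, wanted in ((4, bind_ipv4), (6, bind_ipv6)):
        if not wanted:
            continue
        match = next((a for a in addrs
                      if a.version == family and any(a in n for n in nets)),
                     None)
        if match is not None:
            picked.append(match)
        else:
            # Old behaviour would bail out here when family == 4.
            print(f"warning: no IPv{family} address in networks {networks}; continuing")
    if not picked:
        raise RuntimeError("failed to pick any public address")
    return picked

# Dual bind on an IPv6-only network: warns about IPv4, but the OSD
# would still get its IPv6 address and start.
print(pick_addresses(["2001:db8:11d::/120"], ["2001:db8:11d::22"],
                     bind_ipv4=True, bind_ipv6=True))
```

With this logic, the reporter's configuration (both bind options true, IPv6-only networks) would log a warning instead of failing with "Failed to pick public address".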
Right now, back to my weekend :)
Updated by Neha Ojha over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 36536
Updated by Kefu Chai over 3 years ago
- Has duplicate Bug #39711: "unable to find any IPv4 address in networks <ipv6-network>" after upgrade to nautilus on osd and mds added
Updated by Kefu Chai over 3 years ago
- Status changed from Fix Under Review to Resolved
Updated by Daniel Pivonka over 2 years ago
- Related to Bug #52867: pick_address.cc prints: unable to find any IPv4 address in networks 'fd00:fd00:fd00:3000::/64' interfaces added