Bug #65186
OSDs unreachable in upgrade test
Description
/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715
2024-03-22T07:19:18.215315+0000 mon.a (mon.0) 10 : cluster 0 Standby manager daemon x restarted
2024-03-22T07:19:18.215450+0000 mon.a (mon.0) 11 : cluster 0 Standby manager daemon x started
2024-03-22T07:19:18.215315+0000 mon.a (mon.0) 10 : cluster 0 Standby manager daemon x restarted
2024-03-22T07:19:18.215450+0000 mon.a (mon.0) 11 : cluster 0 Standby manager daemon x started
2024-03-22T07:19:18.277027+0000 mon.a (mon.0) 12 : cluster 0 mgrmap e33: y(active, since 63s), standbys: x
2024-03-22T07:19:18.414028+0000 mon.a (mon.0) 13 : cluster 1 Active manager daemon y restarted
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
2024-03-22T07:19:18.414953+0000 mon.a (mon.0) 15 : cluster 1 Activating manager daemon y
2024-03-22T07:19:18.427127+0000 mon.a (mon.0) 16 : cluster 0 osdmap e81: 8 total, 8 up, 8 in
2024-03-22T07:19:18.277027+0000 mon.a (mon.0) 12 : cluster 0 mgrmap e33: y(active, since 63s), standbys: x
2024-03-22T07:19:18.427673+0000 mon.a (mon.0) 17 : cluster 0 mgrmap e34: y(active, starting, since 0.0129348s), standbys: x
2024-03-22T07:19:18.414028+0000 mon.a (mon.0) 13 : cluster 1 Active manager daemon y restarted
2024-03-22T07:19:18.433869+0000 osd.4 (osd.4) 3 : cluster 3 failed to encode map e81 with expected crc
2024-03-22T07:19:18.435418+0000 osd.2 (osd.2) 3 : cluster 3 failed to encode map e81 with expected crc
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
2024-03-22T07:19:18.443967+0000 osd.4 (osd.4) 4 : cluster 3 failed to encode map e81 with expected crc
Likely connected to https://tracker.ceph.com/issues/63389.
Updated by Laura Flores about 1 month ago
- Related to Bug #63389: Failed to encode map X with expected CRC added
Updated by Laura Flores about 1 month ago
Possibly a dupe of the related tracker (crc encoding issues)
Updated by Laura Flores about 1 month ago
/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7615991
Updated by Radoslaw Zarzynski about 1 month ago
- Assignee set to Ronen Friedman
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
Hmm, unreachable OSDs. This could in turn be explained by overload caused by the exchange of full maps.
We can rerun with squid backport of the CRC fix inside (https://github.com/ceph/ceph/pull/56553).
Updated by Radoslaw Zarzynski about 1 month ago
- Assignee deleted (Ronen Friedman)
Updated by Laura Flores 29 days ago
/a/teuthology-2024-03-29_02:08:11-upgrade-squid-distro-default-smithi/7629092
Updated by Laura Flores 29 days ago
/a/teuthology-2024-03-29_02:08:11-upgrade-squid-distro-default-smithi/7629109
Updated by Laura Flores 26 days ago
Laura Flores wrote:
/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715
[...]Likely connected to https://tracker.ceph.com/issues/63389.
There are also some messages that say "osd.X's public address is not in subnet".
/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715/ceph-mon.a.log.gz
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : Health detail: HEALTH_ERR 8 osds(s) are not reachable
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 70 at 2024-03-22T07:20:00.000198+0000) v1 -- 0x55a9b6f1aa80 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : [ERR] OSD_UNREACHABLE: 8 osds(s) are not reachable
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 71 at 2024-03-22T07:20:00.000264+0000) v1 -- 0x55a9b5c65180 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.0's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 72 at 2024-03-22T07:20:00.000316+0000) v1 -- 0x55a9b6eaaa80 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.1's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 73 at 2024-03-22T07:20:00.000355+0000) v1 -- 0x55a9b6eaafc0 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.2's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8ab0c700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] <== mon.0 v2:172.21.15.87:3300/0 0 ==== log(1 entries from seq 70 at 2024-03-22T07:20:00.000198+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x55a9b6f1aa80 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 74 at 2024-03-22T07:20:00.000397+0000) v1 -- 0x55a9b6eaa540 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.3's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 75 at 2024-03-22T07:20:00.000465+0000) v1 -- 0x55a9b71b81c0 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.4's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 76 at 2024-03-22T07:20:00.000506+0000) v1 -- 0x55a9b6eaac40 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.5's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 77 at 2024-03-22T07:20:00.000537+0000) v1 -- 0x55a9b6eab180 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.6's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 78 at 2024-03-22T07:20:00.000571+0000) v1 -- 0x55a9b6eaa8c0 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : osd.7's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700 1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 79 at 2024-03-22T07:20:00.000631+0000) v1 -- 0x55a9b71b8a80 con 0x55a9b5c90c00
Updated by Nitzan Mordechai 25 days ago
The mon log shows (mon.b applying cluster log entries from mon.a):
2024-03-22T07:20:00.553+0000 7f1de02eb700 10 mon.b@1(peon).log v747 update_from_paxos latest full 714
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000198+0000 mon.a (mon.0) 70 : cluster [ERR] Health detail: HEALTH_ERR 8 osds(s) are not reachable
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000264+0000 mon.a (mon.0) 71 : cluster [ERR] [ERR] OSD_UNREACHABLE: 8 osds(s) are not reachable
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000316+0000 mon.a (mon.0) 72 : cluster [ERR] osd.0's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000355+0000 mon.a (mon.0) 73 : cluster [ERR] osd.1's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000397+0000 mon.a (mon.0) 74 : cluster [ERR] osd.2's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000465+0000 mon.a (mon.0) 75 : cluster [ERR] osd.3's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000506+0000 mon.a (mon.0) 76 : cluster [ERR] osd.4's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000537+0000 mon.a (mon.0) 77 : cluster [ERR] osd.5's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000571+0000 mon.a (mon.0) 78 : cluster [ERR] osd.6's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700 7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000631+0000 mon.a (mon.0) 79 : cluster [ERR] osd.7's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
osd boot message (osd.0):
2024-03-22T07:23:10.432+0000 7f95a3527700 1 -- [v2:172.21.15.87:6802/395849778,v1:172.21.15.87:6803/395849778] --> [v2:172.21.15.87:3301/0,v1:172.21.15.87:6790/0] -- osd_boot(osd.0 booted 0 features 4540701547738038271 v81) v7 -- 0x56246e746380 con 0x56246d00d400
The IP 172.21.15.87 is actually in the second subnet in the list (172.21.0.0/20). The function that checks this:
bool is_addr_in_subnet(
  CephContext *cct,
  const std::string &networks,
  const std::string &addr)
{
  const auto nets = get_str_list(networks);
  ceph_assert(!nets.empty());
  const auto &net = nets.front();
  struct ifaddrs ifa;
  unsigned ipv = CEPH_PICK_ADDRESS_IPV4;
  struct sockaddr_in public_addr;
  ifa.ifa_next = nullptr;
  ifa.ifa_addr = (struct sockaddr*)&public_addr;
  public_addr.sin_family = AF_INET;
  inet_pton(AF_INET, addr.c_str(), &public_addr.sin_addr);
  return matches_with_net(cct, ifa, net, ipv);
}
It looks like we receive a list of nets but check only the first one. We need to loop over all the nets before deciding whether the address is in the subnet.
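To illustrate the idea of looping over every configured network, here is a minimal standalone sketch. It is not the change from the pull request and does not use Ceph's `get_str_list` or `matches_with_net`. The helpers `split_nets`, `ipv4_in_cidr`, and `addr_in_any_subnet` are hypothetical names introduced only for this example, and it handles IPv4 only.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated list such as "a/32,b/20".
static std::vector<std::string> split_nets(const std::string &networks) {
  std::vector<std::string> out;
  std::stringstream ss(networks);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) out.push_back(item);
  }
  return out;
}

// Hypothetical helper: true if IPv4 address `addr` lies inside CIDR `net`.
// A net without "/prefix" is treated as a /32 (assumption for this sketch).
static bool ipv4_in_cidr(const std::string &addr, const std::string &net) {
  const auto slash = net.find('/');
  const std::string base = net.substr(0, slash);
  int prefix = 32;
  if (slash != std::string::npos) {
    try {
      prefix = std::stoi(net.substr(slash + 1));
    } catch (...) {
      return false;  // malformed prefix length
    }
  }
  if (prefix < 0 || prefix > 32) return false;
  in_addr a{}, n{};
  if (inet_pton(AF_INET, addr.c_str(), &a) != 1) return false;
  if (inet_pton(AF_INET, base.c_str(), &n) != 1) return false;
  // Build the netmask in host byte order; prefix 0 matches everything.
  const uint32_t mask =
      prefix == 0 ? 0u : (~uint32_t{0} << (32 - prefix));
  return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}

// Check the address against every network, not just the first one.
bool addr_in_any_subnet(const std::string &networks, const std::string &addr) {
  for (const auto &net : split_nets(networks)) {
    if (ipv4_in_cidr(addr, net)) return true;
  }
  return false;
}
```

With the network list from the log above, 172.21.15.87 fails the first entry (172.21.15.254/32) but matches the second (172.21.0.0/20), so a check that stops at the first entry would report the address as outside the subnet while the loop reports it as inside.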
Updated by Nitzan Mordechai 25 days ago
- Status changed from New to Fix Under Review
- Pull request ID set to 56640
Updated by Radoslaw Zarzynski 24 days ago
- Priority changed from High to Normal
Lowering the priority as it's not a real regression – it's a problem with the recently introduced warning.
Updated by Laura Flores 19 days ago
- Related to deleted (Bug #63389: Failed to encode map X with expected CRC)
Updated by Matan Breizman 10 days ago
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659312
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659457