Bug #65186

open

OSDs unreachable in upgrade test

Added by Laura Flores about 1 month ago. Updated 5 days ago.

Status: Fix Under Review
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: squid
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715

2024-03-22T07:19:18.215315+0000 mon.a (mon.0) 10 : cluster 0 Standby manager daemon x restarted
2024-03-22T07:19:18.215450+0000 mon.a (mon.0) 11 : cluster 0 Standby manager daemon x started
2024-03-22T07:19:18.215315+0000 mon.a (mon.0) 10 : cluster 0 Standby manager daemon x restarted
2024-03-22T07:19:18.215450+0000 mon.a (mon.0) 11 : cluster 0 Standby manager daemon x started
2024-03-22T07:19:18.277027+0000 mon.a (mon.0) 12 : cluster 0 mgrmap e33: y(active, since 63s), standbys: x
2024-03-22T07:19:18.414028+0000 mon.a (mon.0) 13 : cluster 1 Active manager daemon y restarted
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
2024-03-22T07:19:18.414953+0000 mon.a (mon.0) 15 : cluster 1 Activating manager daemon y
2024-03-22T07:19:18.427127+0000 mon.a (mon.0) 16 : cluster 0 osdmap e81: 8 total, 8 up, 8 in
2024-03-22T07:19:18.277027+0000 mon.a (mon.0) 12 : cluster 0 mgrmap e33: y(active, since 63s), standbys: x
2024-03-22T07:19:18.427673+0000 mon.a (mon.0) 17 : cluster 0 mgrmap e34: y(active, starting, since 0.0129348s), standbys: x
2024-03-22T07:19:18.414028+0000 mon.a (mon.0) 13 : cluster 1 Active manager daemon y restarted
2024-03-22T07:19:18.433869+0000 osd.4 (osd.4) 3 : cluster 3 failed to encode map e81 with expected crc
2024-03-22T07:19:18.435418+0000 osd.2 (osd.2) 3 : cluster 3 failed to encode map e81 with expected crc
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)
2024-03-22T07:19:18.443967+0000 osd.4 (osd.4) 4 : cluster 3 failed to encode map e81 with expected crc

Likely connected to https://tracker.ceph.com/issues/63389.

Actions #1

Updated by Laura Flores about 1 month ago

  • Related to Bug #63389: Failed to encode map X with expected CRC added
Actions #2

Updated by Laura Flores about 1 month ago

Possibly a dupe of the related tracker (crc encoding issues)

Actions #3

Updated by Laura Flores about 1 month ago

/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7615991

Actions #4

Updated by Radoslaw Zarzynski about 1 month ago

  • Assignee set to Ronen Friedman
2024-03-22T07:19:18.414630+0000 mon.a (mon.0) 14 : cluster 4 Health check failed: 8 osds(s) are not reachable (OSD_UNREACHABLE)

Hmm, unreachable OSDs. This, in turn, could potentially be explained by overload caused by the exchange of full maps.
We can rerun with the squid backport of the CRC fix included (https://github.com/ceph/ceph/pull/56553).

Actions #5

Updated by Radoslaw Zarzynski about 1 month ago

  • Assignee deleted (Ronen Friedman)
Actions #6

Updated by Laura Flores 29 days ago

/a/teuthology-2024-03-29_02:08:11-upgrade-squid-distro-default-smithi/7629092

Actions #7

Updated by Laura Flores 29 days ago

/a/teuthology-2024-03-29_02:08:11-upgrade-squid-distro-default-smithi/7629109

Actions #8

Updated by Laura Flores 26 days ago

Laura Flores wrote:

/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715
[...]

Likely connected to https://tracker.ceph.com/issues/63389.

There are some messages that say "osd.X's public address is not in subnet":
/a/teuthology-2024-03-22_02:08:13-upgrade-squid-distro-default-smithi/7616011/remote/smithi087/log/a8e8c570-e819-11ee-95cd-87774f69a715/ceph-mon.a.log.gz

2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : Health detail: HEALTH_ERR 8 osds(s) are not reachable
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 70 at 2024-03-22T07:20:00.000198+0000) v1 -- 0x55a9b6f1aa80 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] : [ERR] OSD_UNREACHABLE: 8 osds(s) are not reachable
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 71 at 2024-03-22T07:20:00.000264+0000) v1 -- 0x55a9b5c65180 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.0's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 72 at 2024-03-22T07:20:00.000316+0000) v1 -- 0x55a9b6eaaa80 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.1's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 73 at 2024-03-22T07:20:00.000355+0000) v1 -- 0x55a9b6eaafc0 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.2's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8ab0c700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] <== mon.0 v2:172.21.15.87:3300/0 0 ==== log(1 entries from seq 70 at 2024-03-22T07:20:00.000198+0000) v1 ==== 0+0+0 (unknown 0 0 0) 0x55a9b6f1aa80 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 74 at 2024-03-22T07:20:00.000397+0000) v1 -- 0x55a9b6eaa540 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.3's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 75 at 2024-03-22T07:20:00.000465+0000) v1 -- 0x55a9b71b81c0 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.4's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 76 at 2024-03-22T07:20:00.000506+0000) v1 -- 0x55a9b6eaac40 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.5's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 77 at 2024-03-22T07:20:00.000537+0000) v1 -- 0x55a9b6eab180 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.6's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 78 at 2024-03-22T07:20:00.000571+0000) v1 -- 0x55a9b6eaa8c0 con 0x55a9b5c90c00
2024-03-22T07:19:59.997+0000 7fbb8d311700 -1 log_channel(cluster) log [ERR] :     osd.7's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:19:59.997+0000 7fbb8d311700  1 -- [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] --> [v2:172.21.15.87:3300/0,v1:172.21.15.87:6789/0] -- log(1 entries from seq 79 at 2024-03-22T07:20:00.000631+0000) v1 -- 0x55a9b71b8a80 con 0x55a9b5c90c00

Actions #9

Updated by Neha Ojha 26 days ago

  • Assignee set to Nitzan Mordechai
  • Priority changed from Normal to High
Actions #10

Updated by Nitzan Mordechai 25 days ago

The mon.a log shows:

2024-03-22T07:20:00.553+0000 7f1de02eb700 10 mon.b@1(peon).log v747 update_from_paxos latest full 714
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000198+0000 mon.a (mon.0) 70 : cluster [ERR] Health detail: HEALTH_ERR 8 osds(s) are not reachable
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000264+0000 mon.a (mon.0) 71 : cluster [ERR] [ERR] OSD_UNREACHABLE: 8 osds(s) are not reachable
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000316+0000 mon.a (mon.0) 72 : cluster [ERR]     osd.0's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000355+0000 mon.a (mon.0) 73 : cluster [ERR]     osd.1's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000397+0000 mon.a (mon.0) 74 : cluster [ERR]     osd.2's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000465+0000 mon.a (mon.0) 75 : cluster [ERR]     osd.3's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000506+0000 mon.a (mon.0) 76 : cluster [ERR]     osd.4's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000537+0000 mon.a (mon.0) 77 : cluster [ERR]     osd.5's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000571+0000 mon.a (mon.0) 78 : cluster [ERR]     osd.6's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet
2024-03-22T07:20:00.553+0000 7f1de02eb700  7 mon.b@1(peon).log v747 update_from_paxos applying incremental log 747 2024-03-22T07:20:00.000631+0000 mon.a (mon.0) 79 : cluster [ERR]     osd.7's public address is not in '172.21.15.254/32,172.21.0.0/20,172.21.0.1/32,172.21.0.2/32' subnet

osd boot message (osd.0):

2024-03-22T07:23:10.432+0000 7f95a3527700  1 -- [v2:172.21.15.87:6802/395849778,v1:172.21.15.87:6803/395849778] --> [v2:172.21.15.87:3301/0,v1:172.21.15.87:6790/0] -- osd_boot(osd.0 booted 0 features 4540701547738038271 v81) v7 -- 0x56246e746380 con 0x56246d00d400

The IP 172.21.15.87 is actually in the second subnet in the list (172.21.0.0/20). The function that performs the check:

bool is_addr_in_subnet(
  CephContext *cct,
  const std::string &networks,
  const std::string &addr)
{
  const auto nets = get_str_list(networks);
  ceph_assert(!nets.empty());
  const auto &net = nets.front();
  struct ifaddrs ifa;
  unsigned ipv = CEPH_PICK_ADDRESS_IPV4;
  struct sockaddr_in public_addr;

  ifa.ifa_next = nullptr;
  ifa.ifa_addr = (struct sockaddr*)&public_addr;
  public_addr.sin_family = AF_INET;
  inet_pton(AF_INET, addr.c_str(), &public_addr.sin_addr);

  return matches_with_net(cct, ifa, net, ipv);
}

It looks like the function receives a list of networks but checks only the first one. We need to loop over all of the networks before deciding whether the address is in a subnet or not.
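To illustrate the difference between checking only the first network and checking every network, here is a minimal standalone sketch. It is not the Ceph code and not the actual fix in the pull request: `split_networks`, `ipv4_in_cidr`, `in_first_subnet_only` and `in_any_subnet` are hypothetical helpers written for this example, it handles IPv4 only, and it assumes a POSIX system for `inet_pton`.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a comma-separated list such as
// "172.21.15.254/32,172.21.0.0/20" into its entries.
std::vector<std::string> split_networks(const std::string &networks) {
  std::vector<std::string> out;
  std::stringstream ss(networks);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) out.push_back(item);
  }
  return out;
}

// Hypothetical helper: true if IPv4 address `addr` lies inside CIDR block `net`.
bool ipv4_in_cidr(const std::string &addr, const std::string &net) {
  const auto slash = net.find('/');
  if (slash == std::string::npos) return false;
  int prefix = 0;
  try {
    prefix = std::stoi(net.substr(slash + 1));
  } catch (...) {
    return false;  // prefix length is not a number
  }
  if (prefix < 0 || prefix > 32) return false;
  in_addr a{}, n{};
  if (inet_pton(AF_INET, addr.c_str(), &a) != 1) return false;
  if (inet_pton(AF_INET, net.substr(0, slash).c_str(), &n) != 1) return false;
  // Build the netmask in host byte order; a /0 prefix matches everything.
  const uint32_t mask = prefix == 0 ? 0 : ~uint32_t{0} << (32 - prefix);
  return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}

// Mirrors the behaviour described above: only the first network is consulted.
bool in_first_subnet_only(const std::string &networks, const std::string &addr) {
  const auto nets = split_networks(networks);
  return !nets.empty() && ipv4_in_cidr(addr, nets.front());
}

// Proposed behaviour: the address is accepted if any listed network contains it.
bool in_any_subnet(const std::string &networks, const std::string &addr) {
  for (const auto &net : split_networks(networks)) {
    if (ipv4_in_cidr(addr, net)) return true;
  }
  return false;
}
```

With the network list from the logs, `in_first_subnet_only` rejects 172.21.15.87 because the first entry is 172.21.15.254/32, while `in_any_subnet` accepts it through 172.21.0.0/20.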

Actions #11

Updated by Nitzan Mordechai 25 days ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 56640
Actions #12

Updated by Radoslaw Zarzynski 24 days ago

  • Priority changed from High to Normal

Lowering the priority as it's not a real regression – it's a problem with the recently introduced warning.

Actions #13

Updated by Laura Flores 19 days ago

  • Related to deleted (Bug #63389: Failed to encode map X with expected CRC)
Actions #14

Updated by Matan Breizman 10 days ago

/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659312
/a/yuriw-2024-04-16_23:25:35-rados-wip-yuriw-testing-20240416.150233-distro-default-smithi/7659457

Actions #15

Updated by Radoslaw Zarzynski 5 days ago

In QA. Pinged.
