Bug #48033

mon: after unrelated crash: handle_auth_request failed to assign global_id; probing, no known network/clock problem; then just started working after 6 hours.

Added by Peter Gervai over 3 years ago. Updated over 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94) nautilus (stable)

I have tried to search for this extensively and found far too many hits for seemingly unrelated problems, so I am pouring some logs here in the hope that they may help. Feel free to close, merge, or do whatever you please, since I cannot tell you what the problem was or why it fixed itself.

So the server got stuck on something (very possibly not related to ceph), the load went up, and fencing hit the red button and rebooted the server. It then came back up. From then on there were some curious effects, all related to this one server and possibly connected to each other.

0) Clocks are in sync (NTP); network, MTU and the like all seem to be in good working order; the largest RTT between members is about 4.5 ms. All the other 4 cluster members are happy, one of them on the same rack switch as the problem one.

1) Lots of lines like ceph-osd3196: 2020-10-28 00:20:40.607 7f7b55179700 -1 osd.28 35153 heartbeat_check: no reply from 10.0.0.102:6837 osd.14 ever on either front or back, first ping sent 2020-10-28 00:20:15.141014 (oldest deadline 2020-10-28 00:20:35.141014)
mentioning basically every OSD on the server, complaining about other OSDs. Despite that, they all worked fine, except for no more than a dozen stalled probes, all created at startup.

2) The mon was stuck probing, out of quorum. I tried restarting it and looking at various stats and logs, but nothing showed up at the default log level, so I first turned on debug mon 20, then debug ms 1 (and still saw nothing, but here you are; IPs are normalised to 10.10.10.x). The commands used are roughly the ones sketched after the log excerpt below. The log starts after the reboot and ends when the mon regained quorum 6 hours later.
[log attached]

To make it searchable, here are a few lines:

2020-10-28 00:38:20.817 7fc3d040a700 1 mon.bean@3(probing) e7 handle_auth_request failed to assign global_id
2020-10-28 00:38:21.449 7fc3d040a700 1 mon.bean@3(probing) e7 handle_auth_request failed to assign global_id
2020-10-28 00:38:22.637 7fc3d040a700 1 mon.bean@3(probing) e7 handle_auth_request failed to assign global_id
...
2020-10-28 01:51:31.672 7f3e3fc35700 1 -- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 legacy=0x55f037c9d800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 31
2020-10-28 01:51:31.672 7f3e3fc35700 1 -- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 legacy=0x55f037c9d800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-10-28 01:51:31.672 7f3e3fc35700 1 --1- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 0x55f037c9d800 :6789 s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
2020-10-28 01:51:31.672 7f3e3fc35700 1 --1- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 0x55f037c9d800 :6789 s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing
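
For reference, the checks and debug switches mentioned above were roughly along these lines; the mon id is bean, and these are the stock Nautilus CLI / admin-socket invocations, reconstructed rather than copy-pasted:

# confirm the mon really is probing and outside quorum
ceph daemon mon.bean mon_status
ceph quorum_status

# any heartbeat or clock-skew warnings show up here
ceph health detail
ceph time-sync-status

# raise debug levels on the stuck mon at runtime via its local admin socket
ceph daemon mon.bean config set debug_mon 20
ceph daemon mon.bean config set debug_ms 1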

ceph-mon.log.gz - ceph-mon.log (285 KB) Peter Gervai, 10/28/2020 09:48 AM

History

#1 Updated by Neha Ojha over 3 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (Monitor)

#2 Updated by Neha Ojha over 3 years ago

  • Status changed from New to Need More Info
  • Affected Versions v14.2.11 added

https://tracker.ceph.com/issues/47654#note-7 and https://tracker.ceph.com/issues/47654#note-8 may help to understand what those logs mean. I don't think they are the source of the problem. Might be useful to capture logs from all the monitors with debugging enabled (debug mon=20, debug ms=1, debug paxos=20), to see what's preventing this mon from joining quorum.
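
For the record, one way to flip those on across the monitors would be roughly the following (runtime injection for mons reachable through the cluster, or the Nautilus centralized config; the mon stuck probing can also be set locally through its admin socket as in the description above):

ceph tell mon.* injectargs '--debug_mon 20 --debug_ms 1 --debug_paxos 20'

# or persistently via the centralized config
ceph config set mon debug_mon 20
ceph config set mon debug_ms 1
ceph config set mon debug_paxos 20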

#3 Updated by Peter Gervai over 3 years ago

It seems this can be closed: there is something deeply cursed on one of the machines in that cluster, and some really bad network mojo appears to be at work outside ceph. It probably was indeed a lossy channel, though we cannot pinpoint it yet. My guess is that Ceph can't do anything about it. Sorry for the noise.

#4 Updated by Neha Ojha over 3 years ago

  • Status changed from Need More Info to Closed
