Bug #48033


mon: after unrelated crash: handle_auth_request failed to assign global_id; probing, no known network/clock problem; then just started working after 6 hours.

Added by Peter Gervai over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ceph version 14.2.11 (21626754f4563baadc6ba5d50b9cbc48a5730a94) nautilus (stable)

I have tried to search extensively for this and found far too many hits for seemingly unrelated problems, so I am pouring some logs here in the hope they may help. However, feel free to close, merge, or do whatever you please, since I cannot tell you what the problem was or why it fixed itself.

So the server got stuck on something (very possibly not related to Ceph), the load went up, fencing hit the red button and rebooted the server. Then it came back up. From then on there were some curious effects, all related to this one server and possibly connected.

0) Clocks are in sync (NTP); network, MTU and everything else seem to be in good working order; the largest RTT between members is about 4.5 ms. All the other 4 cluster members are happy, one of them on the same rack switch as the problem node.
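
For reference, the checks behind point 0) were roughly the following (reconstructed for illustration, not pasted from shell history; the peer address 10.10.10.111 is just an example):

# NTP peer status and offsets on the affected node
ntpq -p
# ask the monitors whether they see any clock skew
ceph time-sync-status
# round-trip time to a peer node (largest seen was about 4.5 ms)
ping -c 10 10.10.10.111
# path-MTU check: 1472-byte payload + 28-byte header = 1500; adjust the size for jumbo frames
ping -M do -s 1472 -c 3 10.10.10.111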

1) Lots of log lines like
ceph-osd3196: 2020-10-28 00:20:40.607 7f7b55179700 -1 osd.28 35153 heartbeat_check: no reply from 10.0.0.102:6837 osd.14 ever on either front or back, first ping sent 2020-10-28 00:20:15.141014 (oldest deadline 2020-10-28 00:20:35.141014)
mentioning basically every OSD on the server, each complaining about other OSDs. Despite that, they all worked fine, except for no more than a dozen stalled probes, all created at startup.
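
By "they all worked fine" I mean the cluster-level view showed the OSDs up and in despite the heartbeat noise; something like the following is how that was verified (illustrative commands, not a transcript):

# overall cluster state and the handful of stalled probes
ceph -s
ceph health detail
# confirm the OSDs being complained about are actually up and in
ceph osd tree
ceph osd stat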

2) mon stuck probing, out of quorum. I tried restarting and looking at various stats and logs, but nothing was visible at the default log level. So I first tried debug_mon 20, then debug_ms 1 (and saw nothing, but here you are; IPs are normalised to 10.10.10.x); the commands are sketched below. The log starts right after the reboot and ends when the mon regained quorum 6 hours later.
[log attached]
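
For completeness, raising the debug levels on the stuck monitor was done roughly like this (mon.bean is the out-of-quorum monitor seen in the log below; since it has no quorum, the local admin socket on its host is used rather than ceph tell):

# at runtime, via the local admin socket
ceph daemon mon.bean config set debug_mon 20
ceph daemon mon.bean config set debug_ms 1
# or persistently in /etc/ceph/ceph.conf under [mon], followed by a mon restart:
#   debug mon = 20
#   debug ms = 1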

To make it searchable, here are a few lines:

2020-10-28 00:38:20.817 7fc3d040a700 1 mon.bean@3(probing) e7 handle_auth_request failed to assign global_id
2020-10-28 00:38:21.449 7fc3d040a700 1 mon.bean@3(probing) e7 handle_auth_request failed to assign global_id
2020-10-28 00:38:22.637 7fc3d040a700 1 mon.bean@3(probing) e7 handle_auth_request failed to assign global_id
...
2020-10-28 01:51:31.672 7f3e3fc35700 1 -- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 legacy=0x55f037c9d800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 31
2020-10-28 01:51:31.672 7f3e3fc35700 1 -- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 legacy=0x55f037c9d800 unknown :6789 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2020-10-28 01:51:31.672 7f3e3fc35700 1 --1- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 0x55f037c9d800 :6789 s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
2020-10-28 01:51:31.672 7f3e3fc35700 1 --1- [v2:10.10.10.112:3300/0,v1:10.10.10.112:6789/0] >> v1:10.10.10.111:0/1918451541 conn(0x55f037d32900 0x55f037c9d800 :6789 s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing


Files

ceph-mon.log.gz (285 KB), ceph-mon.log, Peter Gervai, 10/28/2020 09:48 AM