Bug #42641

Starting MGR fails: handle_connect_reply_2 connect got BADAUTHORIZER

Added by Thomas Schneider about 3 years ago. Updated about 3 years ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
% Done:

0%

Source:
other
Tags:
MGR, BADAUTHORIZER
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After stopping all MGR services (on 4 nodes) I get this error when I try to start the MGR again on any single node:
10.97.206.94:0/2146864016 >> v1:10.97.206.93:6822/2734472 conn(0x55d9e2e17180 0x55d9e3465000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER

Unfortunately this occurs regularly and I don't understand its root cause.

History

#1 Updated by Brad Hubbard about 3 years ago

  • Project changed from Ceph to mgr

#2 Updated by Sage Weil about 3 years ago

  • Status changed from New to Need More Info

Hi, are you still seeing this problem? Can you reproduce it on a recent release?

#3 Updated by Thomas Schneider about 3 years ago

Hi,
I have installed the following package versions from Sage's deb repo:
ceph-base/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-common/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-fuse/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mds/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mgr-dashboard/stable,now 14.2.4-1-gd592e56-1bionic all [installed]
ceph-mgr-diskprediction-cloud/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-diskprediction-local/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-rook/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-ssh/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mon/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-osd/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
libcephfs1/oldstable,now 10.2.11-2 amd64 [installed]
libcephfs2/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
python-ceph-argparse/stable,now 14.2.4-1-gd592e56-1bionic all [installed]
python-cephfs/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]

Is there a more recent version available?

In order to stabilize the cluster I have taken several measures:
1. setting options: noout nobackfill norecover norebalance nodown
2. stopping all OSDs
3. stopping all MGRs and MONs
4. setting in ceph.conf: cephx_require_signatures = false cephx_cluster_require_signatures = false cephx_sign_messages = false
5. starting all OSDs
6. starting all MGRs and MONs
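
The sequence above could be scripted roughly as follows. This is only a sketch: it assumes a systemd-managed deployment (the `ceph-*.target` units), that the steps are run on the appropriate hosts, and that ceph.conf is edited by hand in between.

```shell
# 1. Set the flags that pause rebalancing, recovery, and down-marking
for flag in noout nobackfill norecover norebalance nodown; do
    ceph osd set "$flag"
done

# 2. Stop all OSDs (on each OSD host)
systemctl stop ceph-osd.target

# 3. Stop all MGRs and MONs (on each MGR/MON host)
systemctl stop ceph-mgr.target ceph-mon.target

# 4. Temporarily relax cephx signing in /etc/ceph/ceph.conf ([global] section):
#      cephx_require_signatures = false
#      cephx_cluster_require_signatures = false
#      cephx_sign_messages = false

# 5. Start all OSDs again
systemctl start ceph-osd.target

# 6. Start all MGRs and MONs again
systemctl start ceph-mgr.target ceph-mon.target
```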

With these steps the cluster recovered to a state with some slow requests and a few stuck requests, but without the error in the MGR log.
Then I unset the options noout nobackfill norecover norebalance nodown again and deleted the cephx settings from ceph.conf.
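
The rollback of those measures could look like this (again only a sketch; the cephx lines in /etc/ceph/ceph.conf are removed by hand):

```shell
# Unset the flags that paused rebalancing, recovery, and down-marking
for flag in noout nobackfill norecover norebalance nodown; do
    ceph osd unset "$flag"
done

# Then delete the three cephx_* = false lines added earlier
# from the [global] section of /etc/ceph/ceph.conf.
```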

Unfortunately the cluster has still not fully recovered, but the error message no longer appears in the MGR log.
