Bug #42641
Starting MGR fails: handle_connect_reply_2 connect got BADAUTHORIZER
Description
After stopping all MGR services (on 4 nodes) I get this error when I try to start the MGR again on any single node:
10.97.206.94:0/2146864016 >> v1:10.97.206.93:6822/2734472 conn(0x55d9e2e17180 0x55d9e3465000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
Unfortunately this occurs regularly and I don't understand what the root cause is.
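As background: BADAUTHORIZER during the CONNECTING_SEND_CONNECT_MSG phase generally indicates a cephx authentication failure, most often a mismatch between the daemon's local keyring and the key stored in the monitors' auth database, or clock skew between nodes (cephx tickets are time-sensitive). A hedged diagnostic sketch follows; the entity name `mgr.myhost` and the keyring path are typical defaults, not taken from this report, so substitute your own daemon id:

```shell
# Compare the key the cluster knows for the mgr with the key stored locally.
# "mgr.myhost" and the path below are illustrative defaults; adjust for
# your deployment.
ceph auth get mgr.myhost
cat /var/lib/ceph/mgr/ceph-myhost/keyring

# cephx is sensitive to clock skew; check time sync on each node and
# ask the monitors for their view of skew.
timedatectl status
ceph time-sync-status
```

If the two keys differ, reimporting the local keyring with `ceph auth get` output (or regenerating it) is the usual fix; if skew is reported, correcting NTP/chrony on the affected node typically resolves the rejections.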
Updated by Sage Weil over 4 years ago
- Status changed from New to Need More Info
Hi, are you still seeing this problem? Can you reproduce it on a recent release?
Updated by Thomas Schneider over 4 years ago
Hi,
I have installed these package versions from Sage's deb repo:
ceph-base/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-common/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-fuse/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mds/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mgr-dashboard/stable,now 14.2.4-1-gd592e56-1bionic all [installed]
ceph-mgr-diskprediction-cloud/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-diskprediction-local/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-rook/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-ssh/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mon/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-osd/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
libcephfs1/oldstable,now 10.2.11-2 amd64 [installed]
libcephfs2/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
python-ceph-argparse/stable,now 14.2.4-1-gd592e56-1bionic all [installed]
python-cephfs/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
Is there a more recent version available?
In order to stabilize the cluster I have taken several measures:
1. setting options: noout nobackfill norecover norebalance nodown
2. stopping all OSDs
3. stopping all MGRs and MONs
4. setting in ceph.conf: cephx_require_signatures = false cephx_cluster_require_signatures = false cephx_sign_messages = false
5. starting all OSDs
6. starting all MGRs and MONs
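The sequence above could look roughly like the following; this is a sketch assuming a systemd-managed deployment with the standard ceph targets, run on each node as appropriate:

```shell
# 1. Set cluster flags to prevent data movement and flapping during restart.
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set norebalance
ceph osd set nodown

# 2./3. Stop the daemons (per node).
systemctl stop ceph-osd.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mon.target

# 4. Add to the [global] section of /etc/ceph/ceph.conf on each node:
#      cephx_require_signatures = false
#      cephx_cluster_require_signatures = false
#      cephx_sign_messages = false

# 5./6. Start the daemons again (per node).
systemctl start ceph-mon.target
systemctl start ceph-mgr.target
systemctl start ceph-osd.target

# Afterwards, clear the flags again.
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nodown
```

Note that disabling message signing only relaxes cephx signature checks; it does not change the keys themselves, so a genuine keyring mismatch would still need to be fixed separately.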
With this, the cluster recovered to a state with some slow requests and a few stuck requests, but without the error in the MGR log.
Then I unset the options noout nobackfill norecover norebalance nodown again and deleted the cephx settings from ceph.conf.
Unfortunately the cluster has still not fully recovered, but the error message is no longer recorded in the MGR log.