Bug #42641
Starting MGR fails: handle_connect_reply_2 connect got BADAUTHORIZER
Description
After stopping all MGR services (on 4 nodes) I get this error when I try to start the MGR again on any single node:
10.97.206.94:0/2146864016 >> v1:10.97.206.93:6822/2734472 conn(0x55d9e2e17180 0x55d9e3465000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER
Unfortunately this occurs regularly and I don't understand what the root cause is.
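As background: BADAUTHORIZER during the CONNECTING_SEND_CONNECT_MSG phase generally indicates a cephx authentication failure, most often a mismatch between the daemon's local keyring and the key stored in the monitors' auth database, or clock skew between nodes (cephx tickets are time-sensitive). A hedged diagnostic sketch follows; the entity name `mgr.myhost` and the keyring path are typical defaults, not taken from this report, so substitute your own daemon id:

```shell
# Compare the key the cluster knows for the mgr with the key stored locally.
# "mgr.myhost" and the path below are illustrative defaults; adjust for
# your deployment.
ceph auth get mgr.myhost
cat /var/lib/ceph/mgr/ceph-myhost/keyring

# cephx is sensitive to clock skew; check time sync on each node and
# ask the monitors for their view of skew.
timedatectl status
ceph time-sync-status
```

If the two keys differ, reimporting the local keyring with `ceph auth get` output (or regenerating it) is the usual fix; if skew is reported, correcting NTP/chrony on the affected node typically resolves the rejections.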
Updated by Sage Weil over 4 years ago
- Status changed from New to Need More Info
Hi, are you still seeing this problem? Can you reproduce it on a recent release?
Updated by Thomas Schneider over 4 years ago
Hi,
I have installed these package versions from Sage's deb repo:
ceph-base/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-common/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-fuse/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mds/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mgr-dashboard/stable,now 14.2.4-1-gd592e56-1bionic all [installed]
ceph-mgr-diskprediction-cloud/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-diskprediction-local/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-rook/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr-ssh/stable,now 14.2.4-1-gd592e56-1bionic all [installed,automatic]
ceph-mgr/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-mon/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph-osd/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
ceph/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
libcephfs1/oldstable,now 10.2.11-2 amd64 [installed]
libcephfs2/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
python-ceph-argparse/stable,now 14.2.4-1-gd592e56-1bionic all [installed]
python-cephfs/stable,now 14.2.4-1-gd592e56-1bionic amd64 [installed]
Is there a more recent version available?
In order to stabilize the cluster I have taken several measures:
1. setting options: noout nobackfill norecover norebalance nodown
2. stopping all OSDs
3. stopping all MGRs and MONs
4. setting in ceph.conf: cephx_require_signatures = false cephx_cluster_require_signatures = false cephx_sign_messages = false
5. starting all OSDs
6. starting all MGRs and MONs
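The sequence above could look roughly like the following; this is a sketch assuming a systemd-managed deployment with the standard ceph targets, run on each node as appropriate:

```shell
# 1. Set cluster flags to prevent data movement and flapping during restart.
ceph osd set noout
ceph osd set nobackfill
ceph osd set norecover
ceph osd set norebalance
ceph osd set nodown

# 2./3. Stop the daemons (per node).
systemctl stop ceph-osd.target
systemctl stop ceph-mgr.target
systemctl stop ceph-mon.target

# 4. Add to the [global] section of /etc/ceph/ceph.conf on each node:
#      cephx_require_signatures = false
#      cephx_cluster_require_signatures = false
#      cephx_sign_messages = false

# 5./6. Start the daemons again (per node).
systemctl start ceph-mon.target
systemctl start ceph-mgr.target
systemctl start ceph-osd.target

# Afterwards, clear the flags again.
ceph osd unset noout
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset norebalance
ceph osd unset nodown
```

Note that disabling message signing only relaxes cephx signature checks; it does not change the keys themselves, so a genuine keyring mismatch would still need to be fixed separately.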
With this, the cluster recovered to a state with some slow requests and a few stuck requests, but without the error in the MGR log.
Then I unset the options noout nobackfill norecover norebalance nodown again and deleted the cephx settings from ceph.conf.
Unfortunately the cluster has still not fully recovered, but the error message is no longer recorded in the MGR log.