Bug #52760: Monitor unable to rejoin the cluster
Description
Our cluster has three monitors.
After a restart one of our monitors failed to join the cluster with:
Sep 24 07:52:47 mon2.example.com ceph-mon[1348665]: 2021-09-24T07:52:47.973+0200 7fc9f766a700 -1 mon.mon2@-1(???) e14 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
Looking at the sessions from the cluster leader mon0, it still holds a session to the monitor that is failing to join:
$ for mon in $(ceph -s --format=json|jq '.quorum_names[]' -r);do ceph tell mon.$mon sessions|jq '.[]|select(.con_type == "mon")';done
{
  "name": "mon.0",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "none",
    "addr": "(unrecognized address family 0)",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.2",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.1",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.0",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.1",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "none",
    "addr": "(unrecognized address family 0)",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.2",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
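Note that the stale entries in the dump above are the ones whose socket_addr has type "none" ("unrecognized address family 0"). As a rough illustration only (not part of the original report), a small script can pick those entries out of the `ceph tell mon.<id> sessions` output; the sample data below is abbreviated from the dump above:

```python
import json

def stale_mon_sessions(sessions_json: str):
    """Return the names of mon-to-mon sessions whose socket address
    was never resolved (socket_addr type "none")."""
    sessions = json.loads(sessions_json)
    return [
        s["name"]
        for s in sessions
        if s.get("con_type") == "mon"
        and s.get("socket_addr", {}).get("type") == "none"
    ]

# Two abbreviated entries mirroring the dump above (sample data only).
sample = json.dumps([
    {"name": "mon.0", "con_type": "mon",
     "socket_addr": {"type": "none",
                     "addr": "(unrecognized address family 0)", "nonce": 0}},
    {"name": "mon.2", "con_type": "mon",
     "socket_addr": {"type": "v2", "addr": "[ip-add]:3300", "nonce": 0}},
])

print(stale_mon_sessions(sample))  # only the mon.0 session is stale
```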
ceph -s shows:

  health: HEALTH_WARN
          1/3 mons down, quorum mon0,mon1

  services:
    mon: 3 daemons, quorum mon0,mon1 (age 5d), out of quorum: mon2
All monitors run ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable).
The workaround is to restart the other two monitors that still have quorum.
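For reference, the restart amounts to bouncing the two in-quorum mon daemons. A sketch, assuming systemd-managed monitors with the usual ceph-mon@<id> unit naming (adjust to your deployment):

```shell
# Restart the two monitors that still hold quorum (mon0 and mon1 here).
# Unit names are an assumption; ceph-mon@<id> is the common systemd layout.
systemctl restart ceph-mon@mon0
systemctl restart ceph-mon@mon1

# Afterwards, check that mon2 rejoined the quorum:
ceph -s
```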
Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Can you share mon logs from all the monitors with debug_mon=20 and debug_ms=1?
Updated by Ruben Kerkhof over 2 years ago
Neha Ojha wrote:
Can you share mon logs from all the monitors with debug_mon=20 and debug_ms=1?
I will once this happens again. We've seen this at some other customers' clusters too.
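For the record, the requested debug levels can be raised ahead of time so the logs are captured when the problem reproduces. One way, assuming an Octopus cluster where the centralized config store is available:

```shell
# Persist the debug levels in the config store (survives restarts,
# revert with "ceph config rm" afterwards):
ceph config set mon debug_mon 20/20
ceph config set mon debug_ms 1/1

# Or inject into the running mon daemons only (lost on restart):
ceph tell mon.\* injectargs '--debug-mon 20 --debug-ms 1'
```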