Bug #52760

Monitor unable to rejoin the cluster

Added by Ruben Kerkhof over 2 years ago. Updated over 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our cluster has three monitors.

After a restart, one of our monitors failed to rejoin the cluster with:
Sep 24 07:52:47 mon2.example.com ceph-mon[1348665]: 2021-09-24T07:52:47.973+0200 7fc9f766a700 -1 mon.mon2@-1(???) e14 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
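
For completeness: while mon2 is out of quorum it cannot be reached via ceph tell, but it can still be queried over its local admin socket. A minimal sketch (assuming the daemon is named mon.mon2 and uses the default admin socket path):

$ # on the mon2 host: the daemon's own view of the monmap and election state
$ sudo ceph daemon mon.mon2 mon_status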

Looking at the sessions on the cluster leader mon0, we see it still has a session for the monitor that is failing to rejoin:

$ for mon in $(ceph -s --format=json | jq '.quorum_names[]' -r); do ceph tell mon.$mon sessions | jq '.[] | select(.con_type == "mon")'; done
{
  "name": "mon.0",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "none",
    "addr": "(unrecognized address family 0)",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.2",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.1",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.0",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.1",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "none",
    "addr": "(unrecognized address family 0)",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.2",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}

ceph -s shows:

  health: HEALTH_WARN
          1/3 mons down, quorum mon0,mon1

  services:
    mon: 3 daemons, quorum mon0,mon1 (age 5d), out of quorum: mon2

All monitors run ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable).
The workaround is to restart the other two monitors that are still in quorum.
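
In practice the restart looks roughly like this (a sketch, assuming systemd-managed monitors whose instance names match the mon names, restarted one at a time so quorum is never lost):

$ # on the mon0 host
$ sudo systemctl restart ceph-mon@mon0
$ # wait for mon0 to rejoin quorum before touching mon1
$ ceph quorum_status --format=json | jq -r '.quorum_names'
$ # on the mon1 host
$ sudo systemctl restart ceph-mon@mon1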

Actions #1

Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Can you share mon logs from all the monitors with debug_mon=20 and debug_ms=1?
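
For reference, one way to raise those levels at runtime without restarting the daemons (a sketch assuming admin-socket access on each monitor host; these runtime settings do not persist across restarts):

$ # repeat on each mon host, substituting that host's mon name for $ID
$ sudo ceph daemon mon.$ID config set debug_mon 20
$ sudo ceph daemon mon.$ID config set debug_ms 1
$ # logs accumulate in the default location, /var/log/ceph/ceph-mon.$ID.log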

Actions #3

Updated by Ruben Kerkhof over 2 years ago

Neha Ojha wrote:

Can you share mon logs from all the monitors with debug_mon=20 and debug_ms=1?

I will once this happens again. We have seen this at some other customers' clusters too.
