Bug #52760

Monitor unable to rejoin the cluster

Added by Ruben Kerkhof over 2 years ago. Updated over 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our cluster has three monitors.

After a restart, one of our monitors failed to rejoin the cluster with:
Sep 24 07:52:47 mon2.example.com ceph-mon[1348665]: 2021-09-24T07:52:47.973+0200 7fc9f766a700 -1 mon.mon2@-1(???) e14 handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
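
For completeness: while mon2 is out of quorum it cannot be reached via ceph tell, but it can still be queried over its local admin socket. A minimal sketch (assuming the daemon is named mon.mon2 and uses the default admin socket path):

$ # on the mon2 host: the daemon's own view of the monmap and election state
$ sudo ceph daemon mon.mon2 mon_status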

Looking at the sessions on the cluster leader mon0, we see it still has a session for the monitor that is failing to rejoin:

$ for mon in $(ceph -s --format=json | jq '.quorum_names[]' -r); do ceph tell mon.$mon sessions | jq '.[] | select(.con_type == "mon")'; done
{
  "name": "mon.0",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "none",
    "addr": "(unrecognized address family 0)",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.2",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.1",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.0",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.1",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "none",
    "addr": "(unrecognized address family 0)",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}
{
  "name": "mon.2",
  "entity_name": "",
  "addrs": {
    "addrvec": [
      {
        "type": "v2",
        "addr": "[ip-add]:3300",
        "nonce": 0
      },
      {
        "type": "v1",
        "addr": "[ip-add]:6789",
        "nonce": 0
      }
    ]
  },
  "socket_addr": {
    "type": "v2",
    "addr": "[ip-add]:3300",
    "nonce": 0
  },
  "con_type": "mon",
  "con_features": 4540138292840890400,
  "con_features_hex": "3f01cfb8ffedffff",
  "con_features_release": "luminous",
  "open": true,
  "caps": {
    "text": "allow *"
  },
  "authenticated": true,
  "global_id": 0,
  "global_id_status": "none",
  "osd_epoch": 0,
  "remote_host": ""
}

ceph -s shows:

  health: HEALTH_WARN
          1/3 mons down, quorum mon0,mon1

  services:
    mon: 3 daemons, quorum mon0,mon1 (age 5d), out of quorum: mon2

All monitors run ceph version 15.2.12 (ce065eabfa5ce81323b009786bdf5bb03127cbe1) octopus (stable).
The workaround is to restart the other two monitors that are still in quorum.
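
In practice the restart looks roughly like this (a sketch, assuming systemd-managed monitors whose instance names match the mon names, restarted one at a time so quorum is never lost):

$ # on the mon0 host
$ sudo systemctl restart ceph-mon@mon0
$ # wait for mon0 to rejoin quorum before touching mon1
$ ceph quorum_status --format=json | jq -r '.quorum_names'
$ # on the mon1 host
$ sudo systemctl restart ceph-mon@mon1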

Actions #1

Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to RADOS
Actions #2

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Can you share mon logs from all the monitors with debug_mon=20 and debug_ms=1?
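
For reference, one way to raise those levels at runtime without restarting the daemons (a sketch assuming admin-socket access on each monitor host; these runtime settings do not persist across restarts):

$ # repeat on each mon host, substituting that host's mon name for $ID
$ sudo ceph daemon mon.$ID config set debug_mon 20
$ sudo ceph daemon mon.$ID config set debug_ms 1
$ # logs accumulate in the default location, /var/log/ceph/ceph-mon.$ID.log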

Actions #3

Updated by Ruben Kerkhof over 2 years ago

Neha Ojha wrote:

Can you share mon logs from all the monitors with debug_mon=20 and debug_ms=1?

I will once this happens again. We have seen this at some other customers' clusters too.
