Bug #53330
Ceph client requests a connection with an old, invalid key
Description
We have a production ceph cluster with 3 mons and 516 osds.
Ceph version: 14.2.8
CPU: Intel(R) Xeon(R) Gold 5218
MEM: 187 GB
NIC: 10 G
Node mon-2 went down for some reason at 2021-11-17 04:02:00. The rest of the cluster worked well until the ceph-mon daemon on mon-2 was restarted at 2021-11-17 09:05:03.
mon-3 called a new election due to lease_timeout at 09:05:16.
mon-1 became leader at 2021-11-17 09:07:16.
Then many OSDs were marked down by the mon due to heartbeat timeouts with other OSDs.
After we restarted all OSDs, the cluster returned to a healthy state.
We can see a lot of verify_authorizer failure records in the OSD logs on different nodes:
[root@*****-ceph-1 ceph]# zcat /var/log/ceph/ceph-osd.*.log-20211118.gz | grep "verify_authorizer could not get service secret for service osd secret_id=9082" | wc -l
20057946
[root@*****-ceph-10 ~]# zcat /var/log/ceph/ceph-osd.*.log-20211118.gz | grep "verify_authorizer could not get service secret for service osd secret_id=9082" | wc -l
64982126
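To check whether every failure refers to the same secret_id, the grep counts above can be broken down per id. The following is a minimal sketch, not part of the original report: the function name `count_failures_by_secret_id` is hypothetical, and it assumes the log lines contain the exact message text used in the grep pattern above.

```python
import glob
import gzip
import re
from collections import Counter

# Message text copied from the grep pattern in this report; the id is captured.
PATTERN = re.compile(
    r"verify_authorizer could not get service secret for service osd secret_id=(\d+)"
)


def count_failures_by_secret_id(log_glob):
    """Count verify_authorizer failures per secret_id in gzipped OSD logs.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for path in sorted(glob.glob(log_glob)):
        # Read as text; replace undecodable bytes instead of failing.
        with gzip.open(path, "rt", errors="replace") as fh:
            for line in fh:
                match = PATTERN.search(line)
                if match:
                    counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Same file pattern as the zcat commands above.
    result = count_failures_by_secret_id("/var/log/ceph/ceph-osd.*.log-20211118.gz")
    for secret_id, n in sorted(result.items()):
        print(secret_id, n)
```

If only secret_id 9082 shows up, that would support the idea that clients kept presenting one stale key rather than a range of expired ones.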
The secret_id increases by 1 every hour. Judging by the secret_id at the current time, secret_id 9082 seems to be the id that was valid before mon-2 went down at 2021-11-17 04:02:00.
A similar problem was reported six years ago.
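The arithmetic behind that estimate can be sketched as follows. This is only an illustration under two assumptions: the secret_id advances by exactly 1 per hour, as observed in this report, and 9082 was the current id at the moment mon-2 went down. The function name `estimate_secret_id` is hypothetical, and because the exact rotation boundary is unknown, the result may be off by one.

```python
from datetime import datetime, timedelta

# Assumption from this report: secret_id increases by 1 every hour.
ROTATION_PERIOD = timedelta(hours=1)


def estimate_secret_id(ref_id, ref_time, target_time, period=ROTATION_PERIOD):
    """Estimate the secret_id in effect at target_time.

    ref_id and ref_time are one known (id, time) observation.
    Hypothetical helper for illustration only.
    """
    elapsed = target_time - ref_time
    # timedelta // timedelta yields an int (floor division).
    return ref_id + elapsed // period


# Assumed reference: 9082 was valid when mon-2 went down.
mon2_down = datetime(2021, 11, 17, 4, 2, 0)
mon2_restart = datetime(2021, 11, 17, 9, 5, 3)

print(estimate_secret_id(9082, mon2_down, mon2_restart))  # prints 9087
```

Under these assumptions the id in use when mon-2 was restarted would be about 5 ahead of 9082, which fits the observation that 9082 was already stale by then.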