Bug #50775

open

mds and osd unable to obtain rotating service keys

Added by wenge song almost 3 years ago. Updated over 2 years ago.

Status: Fix Under Review
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 30%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Version: 15.2.0

Error message:

2021-05-04T05:51:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:54.718+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:54.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:54:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:54:24.719+0800 7f105b2737c0 -1 mds.c ERROR: failed to refresh rotating keys, maximum retry time reached.
2021-05-04T05:54:24.719+0800 7f105b2737c0 1 mds.c suicide!

The same problem occurs with an OSD on the same node as the MDS:

2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out
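
The retry cadence in these excerpts can be checked mechanically. The following is a minimal sketch, assuming the line layout shown above (timestamp, thread id, log level, message); the helper names `parse_ts` and `summarize` are hypothetical and not part of Ceph.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above: timestamp, thread id, level, message.
LINE_RE = re.compile(r"^(\S+)\s+\S+\s+-?\d+\s+(.*)$")

# The OSD excerpt from this report, used as sample input.
SAMPLE = """\
2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out
""".splitlines()


def parse_ts(ts):
    """Parse a timestamp such as 2021-05-03T10:41:45.024+0800."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")


def summarize(lines):
    """Return (retry messages, monclient timeouts, seconds from first to last line)."""
    retries = 0
    timeouts = 0
    stamps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not follow the assumed layout
        stamps.append(parse_ts(m.group(1)))
        msg = m.group(2)
        if "unable to obtain rotating service keys" in msg:
            retries += 1
        elif "monclient: wait_auth_rotating timed out" in msg:
            timeouts += 1
    elapsed = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return retries, timeouts, elapsed


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running the same function over the full attached mds and osd logs would show how long each daemon kept retrying before giving up.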

Steps to reproduce:

1. Set auth_service_ticket_ttl to 120 s to speed up rotation of the secrets.

2. Repeatedly restart a mon on one node and repeatedly restart an mds on another node. Neither node hosts the leader (primary) mon.

3. When the leader mon happens to be propagating the updated secret to the other mons at the moment one mon restarts, the mons have to re-elect a leader, and the following occurs:

2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
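
The reproduction steps above could be scripted roughly as follows. This is only a sketch and not the attached bugshell script. It assumes the option is changed with `ceph config set` (the report does not say how it was set), that the daemons run as `ceph-mon@<id>` and `ceph-mds@<id>` systemd units, and that the loop is started separately on each of the two nodes. The helper names and the daemon ids `b` and `c` are hypothetical.

```python
import subprocess
import time


def set_ttl_cmd(ttl_seconds=120):
    """Command that shortens the service ticket TTL (assumed to be set via `ceph config set`)."""
    return ["ceph", "config", "set", "mon", "auth_service_ticket_ttl", str(ttl_seconds)]


def restart_cmd(daemon_type, daemon_id):
    """Command that restarts one daemon, assuming ceph-<type>@<id> systemd units."""
    return ["systemctl", "restart", f"ceph-{daemon_type}@{daemon_id}"]


def restart_loop(daemon_type, daemon_id, count, interval,
                 run=subprocess.check_call, sleep=time.sleep):
    """Restart the given daemon `count` times, pausing `interval` seconds after each restart."""
    cmd = restart_cmd(daemon_type, daemon_id)
    for _ in range(count):
        run(cmd)
        sleep(interval)
    return cmd


if __name__ == "__main__":
    # Dry run: only print the commands instead of executing them.
    print(" ".join(set_ttl_cmd()))
    restart_loop("mon", "b", count=3, interval=0, run=lambda c: print(" ".join(c)))
    restart_loop("mds", "c", count=3, interval=0, run=lambda c: print(" ".join(c)))
```

The `run` and `sleep` parameters are injectable so the loop can be exercised without touching a real cluster. Hitting the race still depends on timing, so many iterations may be needed.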

Root cause:

1. When trigger_propose does not go through (the log above shows "trigger_propose not active, will propose later"), the leader mon's secret version differs from that of the other mons: the leader has the new secret version, while the other mons still have the old one.

2. When an mds or osd then refreshes its secrets from a non-leader mon, it receives the older rotating secret, which triggers the bug.

3. The next time the leader mon updates the secret and trigger_propose succeeds, the other mons receive the new secret version, and the mds and osd pick it up as well. The cluster then returns to normal.
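
The three points above can be illustrated with a toy model. This is not Ceph code: the `Mon` and `Cluster` classes are hypothetical, a single integer stands in for the rotating secret version, and "current" is simplified to mean "equal to the leader's version".

```python
class Mon:
    """A monitor holding one rotating secret version (toy stand-in)."""

    def __init__(self, name, secret_version):
        self.name = name
        self.secret_version = secret_version


class Cluster:
    """One leader plus peons; the first name given is the leader."""

    def __init__(self, names, secret_version=1):
        self.leader = Mon(names[0], secret_version)
        self.peons = [Mon(n, secret_version) for n in names[1:]]

    def rotate(self, propose_succeeds):
        """Leader rotates its secret; peons only catch up if the proposal goes through."""
        self.leader.secret_version += 1
        if propose_succeeds:
            for peon in self.peons:
                peon.secret_version = self.leader.secret_version


def daemon_gets_current_keys(cluster, mon):
    """True if a daemon fetching from `mon` would see the leader's secret version."""
    return mon.secret_version == cluster.leader.secret_version


if __name__ == "__main__":
    cluster = Cluster(["a", "b", "c"])

    # Point 1: rotation happens on the leader, but the proposal does not go through.
    cluster.rotate(propose_succeeds=False)
    # Point 2: a daemon asking a peon gets the old version.
    print([daemon_gets_current_keys(cluster, p) for p in cluster.peons])

    # Point 3: the next rotation succeeds and every mon converges again.
    cluster.rotate(propose_succeeds=True)
    print([daemon_gets_current_keys(cluster, p) for p in cluster.peons])
```

In this model the stale window lasts from the rotation whose proposal did not go through until the next successful one, which matches the observation that the cluster recovers by itself.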

I will submit a patch for this bug.


Files

ceph-mon.a.log.tar.gz (708 KB) wenge song, 05/24/2021 11:15 AM
ceph-mds.b.log.tar.gz (481 KB) wenge song, 05/24/2021 11:17 AM
ceph-mon.b.log.tar.gz (461 KB) wenge song, 05/24/2021 11:53 AM
bugshell (376 Bytes) wenge song, 05/24/2021 11:53 AM
ceph-mds.c.log.tar.gz (488 KB) wenge song, 05/26/2021 06:46 AM
ceph-mds.b.log.tar.gz (638 KB) wenge song, 05/26/2021 06:46 AM
ceph-mon.b.log.tar.gz (758 KB) wenge song, 05/26/2021 06:46 AM
ceph-mon.a.log.tar.gz (713 KB) wenge song, 05/26/2021 06:46 AM
ceph-mon.c.log.tar.gz (878 KB) wenge song, 05/26/2021 06:46 AM
1622700668(1).jpg (106 KB) wenge song, 06/03/2021 06:12 AM
ceph-mds.b.tar.gz (38.7 KB) wenge song, 06/04/2021 06:42 AM
ceph-mds.c.tar.gz (34 KB) wenge song, 06/04/2021 06:42 AM
ceph-mon.a.tar.gz (660 KB) (this is the mon leader) wenge song, 06/04/2021 06:42 AM
ceph-mon.b.tar.gz (320 KB) wenge song, 06/04/2021 06:42 AM
ceph-mon.c.tar.gz (553 KB) wenge song, 06/04/2021 06:42 AM

Related issues 1 (0 open, 1 closed)

Related to CephFS - Bug #50390: mds: monclient: wait_auth_rotating timed out after 30 (Resolved, Ilya Dryomov)
