
Bug #50390

mds: monclient: wait_auth_rotating timed out after 30

Added by Patrick Donnelly about 1 month ago. Updated 4 days ago.

Status: Fix Under Review
Priority: High
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Symptom:

2021-04-15T15:54:59.586 INFO:teuthology.orchestra.run.smithi001.stdout:2021-04-15T15:41:41.849793+0000 mon.a (mon.0) 3465 : cluster [WRN] Replacing daemon mds.b as rank 0 with standby daemon mds.a

From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047521/teuthology.log

The MDS was restarted between tests. Its prior incarnation was up:standby in the MDSMap, but after the restart it got stuck on:

2021-04-15T15:41:41.824+0000 7fd1952a9780  0 monclient: wait_auth_rotating timed out after 30
2021-04-15T15:41:41.824+0000 7fd1952a9780 -1 mds.b unable to obtain rotating service keys; retrying

From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047521/remote/smithi110/log/ceph-mds.b.log.gz

So it never got to send its up:boot message to the mons at startup, which would replace its old instance (in up:standby). This causes the warning because the mons had assigned that old instance to be up:creating but never heard back from it.
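For context: at startup the MDS blocks in its init path until the monclient holds a full, current set of rotating service keys, and only after that does it beacon up:boot. Below is a minimal, self-contained C++ sketch of that retry loop; the stub stands in for Ceph's real MonClient::wait_auth_rotating(), and the retry cap is hypothetical (the real limit differs).

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for MonClient::wait_auth_rotating(): returns 0
// once a current set of rotating service keys is held, or a negative
// errno if none arrives within `timeout` seconds. Here it always times
// out, simulating mons that never push fresh keys.
static int wait_auth_rotating_stub(double timeout) {
    std::this_thread::sleep_for(std::chrono::duration<double>(timeout));
    return -110; // -ETIMEDOUT
}

int main() {
    // Mirrors the loop behind the ceph-mds.b.log lines above: each pass
    // logs the 30 s timeout and retries; up:boot is never reached.
    for (int attempt = 0; ; ++attempt) {
        if (wait_auth_rotating_stub(30.0) == 0)
            break; // keys obtained; the daemon would now send up:boot
        std::cerr << "monclient: wait_auth_rotating timed out after 30\n"
                     "mds.b unable to obtain rotating service keys; retrying\n";
        if (attempt >= 9)
            return 1; // hypothetical cap; comment #7 shows a suicide path
    }
    return 0;
}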


Related issues

Related to CephFS - Bug #50775: mds and osd unable to obtain rotating service keys New
Duplicated by CephFS - Bug #50755: mds restart but unable to obtain rotating service keys Duplicate

History

#1 Updated by Patrick Donnelly about 1 month ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 40880

#2 Updated by zizhou sun 6 days ago

I have the same issue on ceph v15.2.0:
2021-05-04T05:49:24.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:49:24.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:49:54.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:49:54.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:50:24.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:50:24.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:50:54.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:50:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:51:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:51:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying

#3 Updated by wenge song 6 days ago

I had the same problem after repeatedly restarting the MDS.

#4 Updated by Patrick Donnelly 5 days ago

  • Duplicated by Bug #50755: mds restart but unable to obtain rotating service keys added

#5 Updated by wenge song 5 days ago

Does pull request 40880 fix this? I only see the parameter "rotating keys bootstrap timeout: 15" added. It doesn't seem to solve my problem.

#6 Updated by wenge song 5 days ago

It seems that paxos not being active causes the mon leader not to push the new keys to the other mons; see the sketch after this log excerpt.

2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: _rotate_secret adding mgr id 416 AQC/fptg1hEUJxAAJs6bVTyxhl6CI10+v3BnkA== expires 2021-05-12T16:07:43.655624+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 cephx keyserver: _check_rotating_secrets added 4
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: _dump_rotating_secrets
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 414 key AQCvcJtgzxXfERAAm6Kr7gVRpcb9Xhwa+peZyw== expires 2021-05-12T15:07:43.299826+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 415 key AQA1dJtg5IbaDRAAP0KHFIjqHYpazz9hywIfLQ== expires 2021-05-12T15:37:43.299826+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 416 key AQC/fptgh4sSJxAA2pE+jPzIMQBDShOYQRDasg== expires 2021-05-12T16:07:43.655519+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 414 key AQCvcJtgDQfgERAAVer26WywnBZUDKjqH3yrgg== expires 2021-05-12T15:07:43.299892+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 415 key AQA1dJtgD3LbDRAA5Pfjneis7vdnwMwrtR8iXw== expires 2021-05-12T15:37:43.299892+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 416 key AQC/fptgOrwTJxAA53nDKfPTfdlNJSo9Kge3rA== expires 2021-05-12T16:07:43.655603+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 414 key AQCvcJtgUa/fERAARtv3UkRO2S+LgsS6+0K/6Q== expires 2021-05-12T15:07:43.299870+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 415 key AQA1dJtg0hrbDRAAr7QF0d/7OC4X2hPB1ju3lw== expires 2021-05-12T15:37:43.299870+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 416 key AQC/fptgrDwTJxAAP+DedLl1yoLTaj0N0f0jgw== expires 2021-05-12T16:07:43.655570+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 414 key AQCvcJtgrFngERAAukIVWLe+rFbb8hrsmPD8IA== expires 2021-05-12T15:07:43.299913+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 415 key AQA1dJtgcsLbDRAABwvvqvIRu4ThgzdU/Au4WQ== expires 2021-05-12T15:37:43.299913+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 416 key AQC/fptg1hEUJxAAJs6bVTyxhl6CI10+v3BnkA== expires 2021-05-12T16:07:43.655624+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 45 key AQD1bZtgYx+0NRAAoUd1BcBBhbV3ZcbzJrb8Kg== expires 2021-05-12T15:06:05.900988+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 46 key AQD9dJtg+yAKIhAAJqlYAQ5pc+PA61Kb/aSr2A== expires 2021-05-12T15:41:05.900988+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 47 key AQBefptgIvozGhAAz85naguy99UtIiRbVIAIGQ== expires 2021-05-12T16:16:06.439605+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
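The last line is the key: Ceph's Paxos defers proposals while a round or election is in flight. A simplified model of that gate (this MiniPaxos is illustrative, not Ceph's actual Paxos class):

#include <functional>
#include <iostream>
#include <queue>

// Simplified model of the gate behind "trigger_propose not active,
// will propose later". While paxos is not active, new transactions
// (like the freshly rotated keys) are queued instead of replicated.
enum class PaxosState { Active, Updating, Recovering };

struct MiniPaxos {
    PaxosState state = PaxosState::Updating;
    std::queue<std::function<void()>> pending;

    bool trigger_propose(std::function<void()> proposal) {
        if (state == PaxosState::Active) {
            proposal(); // commit and replicate to the peon mons
            return true;
        }
        // Not active: the leader keeps the new key version locally;
        // the peons continue serving the old one until the next
        // successful proposal pushes the update out.
        pending.push(std::move(proposal));
        std::cout << "trigger_propose not active, will propose later\n";
        return false;
    }
};

int main() {
    MiniPaxos p;
    p.trigger_propose([] { std::cout << "auth v5688 committed\n"; });
    // Once paxos becomes active again, the queued proposal is flushed
    // and the peons (and their MDS/OSD clients) finally see new keys.
    p.state = PaxosState::Active;
    while (!p.pending.empty()) {
        p.pending.front()();
        p.pending.pop();
    }
    return 0;
}

Until the queued proposal is flushed, only the leader knows the new key version, which matches the version mismatch described in the next comment.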

#7 Updated by wenge song 4 days ago

Error message:

2021-05-04T05:51:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:54.718+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:54.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:54:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:54:24.719+0800 7f105b2737c0 -1 mds.c ERROR: failed to refresh rotating keys, maximum retry time reached.
2021-05-04T05:54:24.719+0800 7f105b2737c0 1 mds.c suicide!

The same problem occurs with an OSD on the same node as the MDS:

2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out

Conditions to reproduce the bug:

1. Lower auth_service_ticket_ttl to 120 s to speed up the rotating-secret updates.

2. Repeatedly restart a mon on one node and repeatedly restart an MDS on another node; neither node is the mon leader.

3. When the mon leader happens to be pushing the updated secrets to the other mons just as one mon restarts, a leader re-election is needed and the following occurs:

2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later

Why the bug occurs:

1. When trigger_propose fails, the mon leader's secret version differs from the other mons': the leader has the new secret version while the others still have only the old one.

2. When an MDS or OSD then refreshes its secrets from a non-leader mon, it gets the older rotating-secret version, which is what triggers the bug (see the sketch below).

3. The next time the mon leader updates the secrets, trigger_propose succeeds and the other mons receive the new version; the MDS and OSD pick it up as well, and the cluster returns to normal.
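A rough model of the freshness check that keeps failing in step 2, loosely based on cephx's rotating-secret window (names and numbers here are illustrative): the client waits as long as the newest key it was handed is already at or past its expiry, which is exactly what a stale set from a lagging mon looks like.

#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative model of a cephx rotating-secret window: a few
// time-limited keys (previous / current / next).
struct RotatingKey {
    uint64_t id;
    double expires; // absolute time, arbitrary units for the demo
};

struct RotatingWindow {
    std::vector<RotatingKey> keys;
    // The client keeps waiting (and wait_auth_rotating keeps timing
    // out) while the window is short or its newest key is stale.
    bool need_new_secrets(double now, size_t want = 3) const {
        return keys.size() < want || keys.back().expires <= now;
    }
};

int main() {
    const double now = 1000.0;
    // Stale set served by a peon that missed the leader's proposal:
    RotatingWindow stale{{{414, 900.0}, {415, 980.0}, {416, 1000.0}}};
    // Fresh set after the next successful paxos proposal lands:
    RotatingWindow fresh{{{415, 1600.0}, {416, 2200.0}, {417, 2800.0}}};
    std::cout << std::boolalpha
              << "stale needs new secrets: " << stale.need_new_secrets(now) << "\n"  // true
              << "fresh needs new secrets: " << fresh.need_new_secrets(now) << "\n"; // false
    return 0;
}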

#8 Updated by Patrick Donnelly 4 days ago

  • Related to Bug #50775: mds and osd unable to obtain rotating service keys added
