Bug #50390 (closed)

mds: monclient: wait_auth_rotating timed out after 30

Added by Patrick Donnelly about 3 years ago. Updated almost 3 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
nautilus,octopus,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Symptom:

2021-04-15T15:54:59.586 INFO:teuthology.orchestra.run.smithi001.stdout:2021-04-15T15:41:41.849793+0000 mon.a (mon.0) 3465 : cluster [WRN] Replacing daemon mds.b as rank 0 with standby daemon mds.a

From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047521/teuthology.log

The MDS was restarted between tests. Its prior incarnation was up:standby in the MDSMap, but after the restart it got stuck on:

2021-04-15T15:41:41.824+0000 7fd1952a9780  0 monclient: wait_auth_rotating timed out after 30
2021-04-15T15:41:41.824+0000 7fd1952a9780 -1 mds.b unable to obtain rotating service keys; retrying

From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047521/remote/smithi110/log/ceph-mds.b.log.gz

So it never got to send its up:boot message to the mons at startup, which would have replaced its old instance (in up:standby). This causes the warning because the mons had assigned that old instance to be up:creating but never heard back from it.
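
For illustration only, here is a rough Python sketch (not the actual C++ MDS/MonClient code) of the startup ordering described above: the up:boot beacon is sent only after wait_auth_rotating succeeds, so as long as rotating keys cannot be obtained the monitors keep seeing the stale incarnation. The 30-second timeout matches the log; the retry cap and function names are hypothetical.

ROTATING_KEY_TIMEOUT = 30  # seconds; matches "timed out after 30" in the log

def wait_auth_rotating(have_keys, timeout=ROTATING_KEY_TIMEOUT):
    # Toy stand-in for MonClient::wait_auth_rotating(): the real call blocks for
    # up to `timeout` seconds waiting for rotating service keys from the mons.
    return have_keys()

def mds_startup(have_keys, max_attempts=4):
    # The up:boot beacon is gated on obtaining rotating keys, so while this loop
    # spins the mons still see the previous incarnation (up:standby/up:creating).
    for _ in range(max_attempts):
        if wait_auth_rotating(have_keys):
            print("sending up:boot beacon")  # this replaces the old MDSMap entry
            return "booted"
        print(f"monclient: wait_auth_rotating timed out after {ROTATING_KEY_TIMEOUT}")
        print("unable to obtain rotating service keys; retrying")
    return "stuck"  # never boots; the mons keep waiting on the stale instance

print(mds_startup(have_keys=lambda: False))  # -> retries, then "stuck"
print(mds_startup(have_keys=lambda: True))   # -> "booted"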


Related issues 5 (1 open, 4 closed)

Related to RADOS - Bug #50775: mds and osd unable to obtain rotating service keys (Fix Under Review)
Has duplicate CephFS - Bug #50755: mds restart but unable to obtain rotating service keys (Duplicate)
Copied to CephFS - Backport #50897: nautilus: mds: monclient: wait_auth_rotating timed out after 30 (Resolved, Ilya Dryomov)
Copied to CephFS - Backport #50898: octopus: mds: monclient: wait_auth_rotating timed out after 30 (Resolved, Ilya Dryomov)
Copied to CephFS - Backport #50899: pacific: mds: monclient: wait_auth_rotating timed out after 30 (Resolved, Ilya Dryomov)
Actions #1

Updated by Patrick Donnelly about 3 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 40880
Actions #2

Updated by zizhou sun almost 3 years ago

I have the same issue (target version ceph-v15.2.0):
2021-05-04T05:49:24.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:49:24.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:49:54.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:49:54.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:50:24.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:50:24.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:50:54.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:50:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:51:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:51:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying

Actions #3

Updated by wenge song almost 3 years ago

I had the same problem; it reproduces when the mds is restarted repeatedly.

Actions #4

Updated by Patrick Donnelly almost 3 years ago

  • Has duplicate Bug #50755: mds restart but unable to obtain rotating service keys added
Actions #5

Updated by wenge song almost 3 years ago

Does pull request 40880 fix this? I only see the parameter "rotating keys bootstrap timeout: 15" added; it doesn't seem to solve my problem.

Actions #6

Updated by wenge song almost 3 years ago

It seems that Paxos not being active causes the mon leader not to push the new rotating secrets to the other mons:

2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: _rotate_secret adding mgr id 416 AQC/fptg1hEUJxAAJs6bVTyxhl6CI10+v3BnkA== expires 2021-05-12T16:07:43.655624+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 cephx keyserver: _check_rotating_secrets added 4
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: _dump_rotating_secrets
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 414 key AQCvcJtgzxXfERAAm6Kr7gVRpcb9Xhwa+peZyw== expires 2021-05-12T15:07:43.299826+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 415 key AQA1dJtg5IbaDRAAP0KHFIjqHYpazz9hywIfLQ== expires 2021-05-12T15:37:43.299826+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 416 key AQC/fptgh4sSJxAA2pE+jPzIMQBDShOYQRDasg== expires 2021-05-12T16:07:43.655519+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 414 key AQCvcJtgDQfgERAAVer26WywnBZUDKjqH3yrgg== expires 2021-05-12T15:07:43.299892+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 415 key AQA1dJtgD3LbDRAA5Pfjneis7vdnwMwrtR8iXw== expires 2021-05-12T15:37:43.299892+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 416 key AQC/fptgOrwTJxAA53nDKfPTfdlNJSo9Kge3rA== expires 2021-05-12T16:07:43.655603+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 414 key AQCvcJtgUa/fERAARtv3UkRO2S+LgsS6+0K/6Q== expires 2021-05-12T15:07:43.299870+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 415 key AQA1dJtg0hrbDRAAr7QF0d/7OC4X2hPB1ju3lw== expires 2021-05-12T15:37:43.299870+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 416 key AQC/fptgrDwTJxAAP+DedLl1yoLTaj0N0f0jgw== expires 2021-05-12T16:07:43.655570+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 414 key AQCvcJtgrFngERAAukIVWLe+rFbb8hrsmPD8IA== expires 2021-05-12T15:07:43.299913+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 415 key AQA1dJtgcsLbDRAABwvvqvIRu4ThgzdU/Au4WQ== expires 2021-05-12T15:37:43.299913+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 416 key AQC/fptg1hEUJxAAJs6bVTyxhl6CI10+v3BnkA== expires 2021-05-12T16:07:43.655624+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 45 key AQD1bZtgYx+0NRAAoUd1BcBBhbV3ZcbzJrb8Kg== expires 2021-05-12T15:06:05.900988+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 46 key AQD9dJtg+yAKIhAAJqlYAQ5pc+PA61Kb/aSr2A== expires 2021-05-12T15:41:05.900988+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 47 key AQBefptgIvozGhAAz85naguy99UtIiRbVIAIGQ== expires 2021-05-12T16:16:06.439605+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later

Actions #7

Updated by wenge song almost 3 years ago

error message:

2021-05-04T05:51:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:54.718+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:54.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:54:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:54:24.719+0800 7f105b2737c0 -1 mds.c ERROR: failed to refresh rotating keys, maximum retry time reached.
2021-05-04T05:54:24.719+0800 7f105b2737c0 1 mds.c suicide!

The same problem occurs with an OSD on the same node as the MDS:

2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out

How to reproduce the bug:

1. Change auth_service_ticket_ttl to 120s to speed up the secret rotation.

2. Restart the mon repeatedly on one node and the mds repeatedly on another node; neither node hosts the leader mon.

3. When the leader mon happens to be pushing the updated secrets to the other mons and one mon restarts at the same moment, the mons have to re-elect a leader, and the following occurs:

2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later

Why the bug occurs:

1. When trigger_propose fails, the leader mon's secret version differs from the other mons': the leader holds the new secret version while the other mons still hold the old one.

2. When an mds or osd then refreshes its secrets from a non-leader mon, it receives the older version of the rotating secrets, which is what triggers the bug.

3. The next time the leader mon rotates the secrets, trigger_propose succeeds, the other mons receive the new secret version, the mds and osd pick it up as well, and the cluster returns to normal.
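
To make the version skew in steps 1 and 2 concrete, here is a small hypothetical Python model (not Ceph code; the names and the validity-window check are invented for illustration): the leader has already rotated to the new secret version, while the peons, whose update was deferred by the failed propose, still serve the old one, so a daemon refreshing from a peon keeps getting keys too close to expiry and wait_auth_rotating keeps timing out.

LEADER_SECRETS = {"version": 416, "valid_for": 3600}  # new rotation, leader only
PEON_SECRETS   = {"version": 415, "valid_for": 40}    # stale: propose was deferred

def rotating_keys_usable(secrets, needed_window=60):
    # A daemon only treats its rotating keys as usable while they remain valid
    # for some margin; keys close to expiry leave wait_auth_rotating timing out
    # and produce the "unable to obtain rotating service keys; retrying" loop.
    return secrets["valid_for"] > needed_window

print(rotating_keys_usable(LEADER_SECRETS))  # True  -> a daemon talking to the leader boots
print(rotating_keys_usable(PEON_SECRETS))    # False -> retries until the next successful
                                             # propose pushes version 416 to the peons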

Actions #8

Updated by Patrick Donnelly almost 3 years ago

  • Related to Bug #50775: mds and osd unable to obtain rotating service keys added
Actions #9

Updated by Ilya Dryomov almost 3 years ago

  • Assignee changed from Patrick Donnelly to Ilya Dryomov
Actions #10

Updated by Ilya Dryomov almost 3 years ago

  • Backport changed from pacific to nautilus,octopus,pacific
  • Pull request ID changed from 40880 to 41368
Actions #11

Updated by Kefu Chai almost 3 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #12

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #50897: nautilus: mds: monclient: wait_auth_rotating timed out after 30 added
Actions #13

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #50898: octopus: mds: monclient: wait_auth_rotating timed out after 30 added
Actions #14

Updated by Backport Bot almost 3 years ago

  • Copied to Backport #50899: pacific: mds: monclient: wait_auth_rotating timed out after 30 added
Actions #15

Updated by Loïc Dachary almost 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
