Bug #50390
mds: monclient: wait_auth_rotating timed out after 30
Status: Closed
Description
Symptom:
2021-04-15T15:54:59.586 INFO:teuthology.orchestra.run.smithi001.stdout:2021-04-15T15:41:41.849793+0000 mon.a (mon.0) 3465 : cluster [WRN] Replacing daemon mds.b as rank 0 with standby daemon mds.a
From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047521/teuthology.log
MDS was restarted between tests. The MDS's prior incarnation was up:standby in the MDSMap but it got stuck after restart on:
2021-04-15T15:41:41.824+0000 7fd1952a9780 0 monclient: wait_auth_rotating timed out after 30
2021-04-15T15:41:41.824+0000 7fd1952a9780 -1 mds.b unable to obtain rotating service keys; retrying
From: /ceph/teuthology-archive/pdonnell-2021-04-15_01:35:57-fs-wip-pdonnell-testing-20210414.230315-distro-basic-smithi/6047521/remote/smithi110/log/ceph-mds.b.log.gz
So it never got to send its up:boot message to the mons at startup, which would replace its old instance (in up:standby). This causes the warning because the mons had assigned that old instance to be up:creating but never heard back from it.
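The retry behavior visible in these logs can be sketched as follows. This is a minimal illustration only; the function name, timeout, and retry limit are assumptions, not Ceph's actual implementation:

```python
# A sketch of the retry loop the MDS log suggests: each attempt to obtain
# rotating service keys waits up to `timeout` seconds, and the daemon gives
# up after `max_attempts` failed waits. Names and numbers are illustrative.
def wait_for_rotating_keys(have_keys, timeout=30, max_attempts=4):
    """Return True once rotating keys arrive, False after max_attempts waits."""
    for _ in range(max_attempts):
        if have_keys():
            return True                 # mons delivered fresh rotating keys
        # the real monclient blocks on a condition for `timeout` seconds;
        # here a failed wait is simply logged
        print(f"monclient: wait_auth_rotating timed out after {timeout}")
        print("unable to obtain rotating service keys; retrying")
    return False                        # caller logs the error and exits

# A mon that never delivers keys reproduces the stuck-at-startup symptom:
stuck = not wait_for_rotating_keys(lambda: False)
```

Because the daemon never gets past this loop, it never sends up:boot, which is why the stale standby instance lingers in the MDSMap.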
Updated by Patrick Donnelly about 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 40880
Updated by zizhou sun almost 3 years ago
I have the same issue on ceph v15.2.0:
2021-05-04T05:49:24.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:49:24.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:49:54.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:49:54.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:50:24.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:50:24.717+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:50:54.717+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:50:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:51:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:51:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
Updated by wenge song almost 3 years ago
I had the same problem when repeatedly restarting the MDS.
Updated by Patrick Donnelly almost 3 years ago
- Has duplicate Bug #50755: mds restart but unable to obtain rotating service keys added
Updated by wenge song almost 3 years ago
Does pull request 40880 fix this? I only see the parameter "rotating keys bootstrap timeout: 15" added; it doesn't seem to solve my problem.
Updated by wenge song almost 3 years ago
It seems that Paxos not being active caused the mon leader not to push the new secrets to the other mons:
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: _rotate_secret adding mgr id 416 AQC/fptg1hEUJxAAJs6bVTyxhl6CI10+v3BnkA== expires 2021-05-12T16:07:43.655624+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 cephx keyserver: _check_rotating_secrets added 4
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: _dump_rotating_secrets
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 414 key AQCvcJtgzxXfERAAm6Kr7gVRpcb9Xhwa+peZyw== expires 2021-05-12T15:07:43.299826+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 415 key AQA1dJtg5IbaDRAAP0KHFIjqHYpazz9hywIfLQ== expires 2021-05-12T15:37:43.299826+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mon id 416 key AQC/fptgh4sSJxAA2pE+jPzIMQBDShOYQRDasg== expires 2021-05-12T16:07:43.655519+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 414 key AQCvcJtgDQfgERAAVer26WywnBZUDKjqH3yrgg== expires 2021-05-12T15:07:43.299892+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 415 key AQA1dJtgD3LbDRAA5Pfjneis7vdnwMwrtR8iXw== expires 2021-05-12T15:37:43.299892+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mds id 416 key AQC/fptgOrwTJxAA53nDKfPTfdlNJSo9Kge3rA== expires 2021-05-12T16:07:43.655603+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 414 key AQCvcJtgUa/fERAARtv3UkRO2S+LgsS6+0K/6Q== expires 2021-05-12T15:07:43.299870+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 415 key AQA1dJtg0hrbDRAAr7QF0d/7OC4X2hPB1ju3lw== expires 2021-05-12T15:37:43.299870+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service osd id 416 key AQC/fptgrDwTJxAAP+DedLl1yoLTaj0N0f0jgw== expires 2021-05-12T16:07:43.655570+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 414 key AQCvcJtgrFngERAAukIVWLe+rFbb8hrsmPD8IA== expires 2021-05-12T15:07:43.299913+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 415 key AQA1dJtgcsLbDRAABwvvqvIRu4ThgzdU/Au4WQ== expires 2021-05-12T15:37:43.299913+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service mgr id 416 key AQC/fptg1hEUJxAAJs6bVTyxhl6CI10+v3BnkA== expires 2021-05-12T16:07:43.655624+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 45 key AQD1bZtgYx+0NRAAoUd1BcBBhbV3ZcbzJrb8Kg== expires 2021-05-12T15:06:05.900988+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 46 key AQD9dJtg+yAKIhAAJqlYAQ5pc+PA61Kb/aSr2A== expires 2021-05-12T15:41:05.900988+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 30 cephx keyserver: service auth id 47 key AQBefptgIvozGhAAz85naguy99UtIiRbVIAIGQ== expires 2021-05-12T16:16:06.439605+0800
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
Updated by wenge song almost 3 years ago
Error message:
2021-05-04T05:51:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:54.718+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:54.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:54:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:54:24.719+0800 7f105b2737c0 -1 mds.c ERROR: failed to refresh rotating keys, maximum retry time reached.
2021-05-04T05:54:24.719+0800 7f105b2737c0 1 mds.c suicide!
The same problem occurs with an OSD on the same node as the MDS:
2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out
Conditions to reproduce the bug:
1. Set auth_service_ticket_ttl to 120 s to speed up the secret rotation.
2. Restart a mon repeatedly on one node and an MDS repeatedly on another node; neither node hosts the mon leader.
3. When the leader happens to be pushing updated secrets to the other mons just as one mon restarts, the mons must re-elect a leader, and the following occurs:
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
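For step 1 of the reproduction, a ceph.conf fragment along these lines shortens the ticket TTL (auth_service_ticket_ttl is a real Ceph option; the 120 s value comes from the steps above; intended for test clusters only):

```ini
# Test-only ceph.conf fragment: rotate service keys every few minutes
# instead of hourly, widening the window for the Paxos race described here.
[global]
auth_service_ticket_ttl = 120
```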
Root cause:
1. When trigger_propose fails, the leader's secret version diverges from the other mons': the leader has the new secrets, but the other mons still have only the old version.
2. When an MDS or OSD fetches rotating secrets from a non-leader mon, it receives the stale version, which triggers the bug.
3. The next time the leader rotates the secrets, trigger_propose succeeds, the other mons receive the new version, the MDS and OSD refresh their keys, and the cluster returns to normal.
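The version skew described above can be modeled with a toy simulation. This is an assumption-laden simplification for illustration, not Ceph code:

```python
# Toy model of the race: the leader bumps its pending auth version during
# rotation, but until Paxos commits the proposal, peons still serve the old
# rotating secrets to clients. All names here are illustrative.
class Mon:
    def __init__(self):
        self.committed_version = 1       # what clients can read from this mon

class Leader(Mon):
    def __init__(self, peons):
        super().__init__()
        self.peons = peons
        self.pending_version = 1

    def rotate_secrets(self, paxos_active):
        self.pending_version += 1        # check_rotate: new secrets prepared
        if paxos_active:                 # trigger_propose succeeds
            self.commit()
        # else: "not active, will propose later" -- peons keep the old version

    def commit(self):
        self.committed_version = self.pending_version
        for p in self.peons:
            p.committed_version = self.pending_version

peons = [Mon(), Mon()]
leader = Leader(peons)
leader.rotate_secrets(paxos_active=False)   # election in progress: propose deferred
stale = peons[0].committed_version < leader.pending_version  # clients on a peon see old keys
leader.rotate_secrets(paxos_active=True)    # next rotation proposes; cluster recovers
recovered = peons[0].committed_version == leader.pending_version
```

This matches the observed behavior: daemons querying a non-leader mon time out in wait_auth_rotating until the next rotation commits.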
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #50775: mds and osd unable to obtain rotating service keys added
Updated by Ilya Dryomov almost 3 years ago
- Assignee changed from Patrick Donnelly to Ilya Dryomov
Updated by Ilya Dryomov almost 3 years ago
- Backport changed from pacific to nautilus,octopus,pacific
- Pull request ID changed from 40880 to 41368
Updated by Kefu Chai almost 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot almost 3 years ago
- Copied to Backport #50897: nautilus: mds: monclient: wait_auth_rotating timed out after 30 added
Updated by Backport Bot almost 3 years ago
- Copied to Backport #50898: octopus: mds: monclient: wait_auth_rotating timed out after 30 added
Updated by Backport Bot almost 3 years ago
- Copied to Backport #50899: pacific: mds: monclient: wait_auth_rotating timed out after 30 added
Updated by Loïc Dachary almost 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".