Bug #50775

mds and osd unable to obtain rotating service keys

Added by wenge song almost 3 years ago. Updated over 2 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

30%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Version: 15.2.0

Error messages:

2021-05-04T05:51:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:54.718+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:54.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:54:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:54:24.719+0800 7f105b2737c0 -1 mds.c ERROR: failed to refresh rotating keys, maximum retry time reached.
2021-05-04T05:54:24.719+0800 7f105b2737c0 1 mds.c suicide!

The same problem occurs with OSD on the same node as MDS:

2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out
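
For context, these messages come from the daemons' start-up path, which waits for rotating keys with a bounded retry before giving up. Below is a minimal, self-contained sketch of that pattern; the identifiers (wait_for_rotating_keys, ROTATING_TIMEOUT, MAX_TRIES) are illustrative stand-ins, not the actual Ceph code:

// Minimal sketch of the bounded "wait for rotating keys, then retry" pattern
// that produces the messages above. All names are hypothetical.
#include <chrono>
#include <iostream>

constexpr auto ROTATING_TIMEOUT = std::chrono::seconds(30); // "timed out after 30"
constexpr int MAX_TRIES = 4;                                // give up after a few attempts

// Stand-in for the monclient call that blocks (up to the timeout) until the
// monitors deliver fresh rotating service keys.
bool wait_for_rotating_keys(std::chrono::seconds /*timeout*/) {
  return false; // simulate the keys never arriving (a peon mon keeps serving stale keys)
}

int main() {
  for (int attempt = 1; attempt <= MAX_TRIES; ++attempt) {
    if (wait_for_rotating_keys(ROTATING_TIMEOUT)) {
      std::cout << "obtained rotating service keys, continuing startup\n";
      return 0;
    }
    std::cerr << "unable to obtain rotating service keys; retrying\n";
  }
  std::cerr << "ERROR: failed to refresh rotating keys, maximum retry time reached.\n";
  return 1; // the real mds aborts here ("suicide!"); the osd fails init similarly
}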

Steps to reproduce:

1. Set auth_service_ticket_ttl to 120 s to speed up the rotating key renewal.

2. Repeatedly restart the mon on one node and the mds on another node; neither node hosts the leader mon.

3. When the leader mon happens to be propagating the new secret to the other mons just as one mon restarts, a new leader election is needed and the following occurs:

2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later

Root cause:

1. When trigger_propose fails, the leader mon's secret version diverges from the other mons': the leader holds the new secret version while the peons still hold the old one.

2. When an mds or osd refreshes its secrets from a peon mon rather than the leader, it receives the older rotating secret, which triggers the bug.

3. On the next key renewal the leader's trigger_propose succeeds, the peons receive the new secret version, the mds and osd pick it up as well, and the cluster returns to normal.
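
To make the suspected race easier to follow, here is a toy model of the version skew described above. The names and logic are hypothetical simplifications of the real AuthMonitor/Paxos interaction, written only to illustrate the reporter's analysis:

#include <cstdint>
#include <iostream>

// One monitor's view of the rotating-key version it can serve to clients.
struct Mon {
  uint64_t served_keys_version = 5687;
};

int main() {
  Mon leader, peon;
  const uint64_t pending_keys_version = 5688; // leader rotated keys, encoded v 5688

  // An election/restart leaves paxos in the "updating" state, so the attempt to
  // propose the new keys is deferred: "trigger_propose not active, will propose later".
  bool paxos_active = false;
  bool proposed = paxos_active;

  if (proposed) {
    leader.served_keys_version = pending_keys_version;
    peon.served_keys_version = pending_keys_version;
  } else {
    // Suspected bug: the leader already serves the new keys locally,
    // while the peons keep serving the old version until the next renewal.
    leader.served_keys_version = pending_keys_version;
  }

  // An mds/osd that refreshes from the peon now gets stale keys and keeps
  // logging "wait_auth_rotating timed out" until the versions converge again.
  std::cout << "leader serves v" << leader.served_keys_version
            << ", peon serves v" << peon.served_keys_version << "\n";
  return 0;
}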

I will submit a patch for this bug.


Files

ceph-mon.a.log.tar.gz (708 KB) wenge song, 05/24/2021 11:15 AM
ceph-mds.b.log.tar.gz (481 KB) wenge song, 05/24/2021 11:17 AM
ceph-mon.b.log.tar.gz (461 KB) wenge song, 05/24/2021 11:53 AM
bugshell (376 Bytes) wenge song, 05/24/2021 11:53 AM
ceph-mds.c.log.tar.gz (488 KB) wenge song, 05/26/2021 06:46 AM
ceph-mds.b.log.tar.gz (638 KB) wenge song, 05/26/2021 06:46 AM
ceph-mon.b.log.tar.gz (758 KB) wenge song, 05/26/2021 06:46 AM
ceph-mon.a.log.tar.gz (713 KB) wenge song, 05/26/2021 06:46 AM
ceph-mon.c.log.tar.gz (878 KB) wenge song, 05/26/2021 06:46 AM
1622700668(1).jpg (106 KB) wenge song, 06/03/2021 06:12 AM
ceph-mds.b.tar.gz (38.7 KB) wenge song, 06/04/2021 06:42 AM
ceph-mds.c.tar.gz (34 KB) wenge song, 06/04/2021 06:42 AM
ceph-mon.a.tar.gz (660 KB) this is mon leader; wenge song, 06/04/2021 06:42 AM
ceph-mon.b.tar.gz (320 KB) wenge song, 06/04/2021 06:42 AM
ceph-mon.c.tar.gz (553 KB) wenge song, 06/04/2021 06:42 AM

Related issues 1 (0 open, 1 closed)

Related to CephFS - Bug #50390: mds: monclient: wait_auth_rotating timed out after 30 (Resolved) - Ilya Dryomov

Actions #1

Updated by wenge song almost 3 years ago

I will reproduce, fix, and verify this bug, then send the bugfix for code review.

Actions #2

Updated by Patrick Donnelly almost 3 years ago

  • Related to Bug #50390: mds: monclient: wait_auth_rotating timed out after 30 added
Actions #3

Updated by Ilya Dryomov almost 3 years ago

Hi Song,

Could you please confirm the ceph version with the output of "ceph-mds --version"?

Actions #4

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

Hi Song,

Could you please confirm the ceph version with the output of "ceph-mds --version"?

Hi Ilya,
Sorry, because we've made some changes, I can't get the version the way you suggested. Is the following method acceptable?

[root@ceph1 ~]# rpm -qa | grep ceph-mds
ceph-mds-15.2.0-1.0.2.3.0_2104292130.el8.x86_64

Actions #5

Updated by Ilya Dryomov almost 3 years ago

So you are running the original octopus release with custom patches added in?

Actions #6

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

So you are running the original octopus release with custom patches added in?

Just a small change; I think the bug has nothing to do with it. As I described, I think the mon leader failed to sync the rotating secret keys, but I only saw "trigger_propose not active, will propose later", and I don't know whether there will be another propose, or when it will happen.

Actions #7

Updated by Ilya Dryomov almost 3 years ago

But it is the original octopus release with no substantial backports, correct?

In particular, can you confirm that you don't have any of

https://tracker.ceph.com/issues/50452
https://github.com/ceph/ceph/compare/cbc7f727842f93130fcfbbad64de0959ee4a02d3...f3a4166379b12d4a7bba667fe761e5b660552db1

changes included in your build?

Actions #8

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

But it is the original octopus release with no substantial backports, correct?

In particular, can you confirm that you don't have any of

https://tracker.ceph.com/issues/50452
https://github.com/ceph/ceph/compare/cbc7f727842f93130fcfbbad64de0959ee4a02d3...f3a4166379b12d4a7bba667fe761e5b660552db1

changes included in your build?

Indeed, none of those changes are in our build. I have looked at them; do you think this bug is fixed by the changes above, or is there a definite patch that can solve this problem?

Actions #9

Updated by Ilya Dryomov almost 3 years ago

No, it's the other way around -- there is some suspicion that https://tracker.ceph.com/issues/50390 is caused by these changes because it appeared right after they got merged.

The fact that you don't have them in your build means that either your bug is different from https://tracker.ceph.com/issues/50390 or that these changes have nothing to do with failures to obtain rotating service keys that you and Patrick are seeing.

Actions #10

Updated by Patrick Donnelly almost 3 years ago

  • Project changed from CephFS to Ceph
Actions #11

Updated by Ilya Dryomov almost 3 years ago

I posted https://github.com/ceph/ceph/pull/41368, please take a look. It's probably not going to solve an edge case of restarting multiple daemons repeatedly, but it's definitely a step towards that.

Actions #12

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

I posted https://github.com/ceph/ceph/pull/41368, please take a look. It's probably not going to solve an edge case of restarting multiple daemons repeatedly, but it's definitely a step towards that.

Hi Ilya,
I've looked at this commit. It really can't solve this bug, but thank you all the same.

Actions #13

Updated by wenge song almost 3 years ago

wenge song wrote:

Ilya Dryomov wrote:

I posted https://github.com/ceph/ceph/pull/41368, please take a look. It's probably not going to solve an edge case of restarting multiple daemons repeatedly, but it's definitely a step towards that.

Hi Ilya,
I've looked at this commit. It really can't solve this bug, but thank you all the same.

Hi Ilya,
I posted https://github.com/ceph/ceph/pull/41438, please review. With this change I can no longer reproduce the bug.

Actions #14

Updated by Kefu Chai almost 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 41438
Actions #15

Updated by Ilya Dryomov almost 3 years ago

Can you share the full monitor logs? Specifically, I'm interested in the log where the following excerpt came from

2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later

but also all peon monitor logs as well.

Actions #16

Updated by Ilya Dryomov almost 3 years ago

Just to be clear, are you saying that if the proposal with the new keys doesn't get sent because trigger_propose() returns false, the peon monitors don't get new keys until the next renewal (i.e. 120 seconds later in your case)?

Because ordinarily trigger_propose() would simply delay the proposal, not discard it entirely. Are you seeing the proposal with the new keys being discarded?
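
To illustrate the distinction being asked about, here is a rough sketch of a delay-not-discard propose path. It mirrors the wording of the log messages, but it is not the actual Paxos/PaxosService implementation and all names are simplified stand-ins:

#include <functional>
#include <iostream>
#include <vector>

struct ToyPaxos {
  bool active = false;                                  // e.g. still "updating" after an election
  std::vector<std::function<void()>> pending_finishers; // queued proposals

  void queue_pending_finisher(std::function<void()> f) {
    pending_finishers.push_back(std::move(f));
  }

  bool trigger_propose() {
    if (!active) {
      std::cout << "trigger_propose not active, will propose later\n";
      return false;                // the proposal is deferred, not dropped
    }
    for (auto& f : pending_finishers) f();
    pending_finishers.clear();
    return true;
  }

  void become_active() {           // once the current round finishes
    active = true;
    trigger_propose();             // the deferred proposal goes out now
  }
};

int main() {
  ToyPaxos p;
  p.queue_pending_finisher([] { std::cout << "proposing new rotating keys (v 5688)\n"; });
  p.trigger_propose();             // returns false while paxos is busy
  p.become_active();               // later, the queued proposal is actually sent
  return 0;
}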

Actions #17

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

Can you share the full monitor logs? Specifically, I'm interested in the log where the following excerpt came from

[...]

but also all peon monitor logs as well.

Thanks for your review and reply. I need to reproduce this bug again to collect more logs.

Actions #18

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

Just to be clear, are you saying that if the proposal with the new keys doesn't get sent because trigger_propose() returns false, the peon monitors don't get new keys until the next renewal (i.e. 120 seconds later in your case)?

Because ordinarily trigger_propose() would simply delay the proposal, not discard it entirely. Are you seeing the proposal with the new keys being discarded?

You are right, I didn't see the proposal with the new keys being discarded. But when the new keys rotate and trigger_propose returns false, the mds and osd are unable to get the new keys from the peon monitors. I think the mon leader did not push the new secret to the peon mons in time, or the push failed for some other reason, so I modified the code to try pushing the new secret more times. Maybe I got it wrong, so this needs more in-depth discussion.
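
Purely for illustration, a retry-style push of the rotating keys could look like the sketch below. This is an invented example of the general idea described here, not code from PR 41438, and all names are hypothetical:

#include <iostream>

// Hypothetical stand-in for "push the freshly rotated keys to the peon mons";
// pretend the first push is lost (e.g. lands mid-election) and the second succeeds.
bool push_rotating_keys_to_peons() {
  static int calls = 0;
  return ++calls >= 2;
}

int main() {
  const int max_push_attempts = 3; // illustrative retry budget
  for (int attempt = 1; attempt <= max_push_attempts; ++attempt) {
    if (push_rotating_keys_to_peons()) {
      std::cout << "peons acknowledged the new rotating keys on attempt " << attempt << "\n";
      return 0;
    }
    std::cout << "push attempt " << attempt << " did not take effect; retrying\n";
  }
  std::cout << "giving up; peons will only catch up at the next key renewal\n";
  return 1;
}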

Actions #19

Updated by wenge song almost 3 years ago

mon leader log

Actions #20

Updated by wenge song almost 3 years ago

mds.b was unable to obtain rotating service keys; this is the mds.b log.

Actions #21

Updated by wenge song almost 3 years ago

wenge song wrote:

mds.b was unable to obtain rotating service keys; this is the mds.b log.

2021-05-24T18:48:22.934+0800 7f4853f19700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2021-05-24T18:47:52.934877+0800)

Actions #22

Updated by wenge song almost 3 years ago

bugshell is my test case; mon.b is a peon monitor.

Actions #23

Updated by Ilya Dryomov almost 3 years ago

The logs that you provided are weird. Some log messages that should be there are not there. For example, I don't see a single update_from_paxos log message, but they should be there given that mon.b was restarted multiple times and other log messages at "debug mon = 10" level are there.

And some log messages seem to have had their log level adjusted. For example, _rotate_secret and _check_rotating_secrets log messages are not 0 upstream.

Please remove all modifications to the logging infrastructure and reproduce with "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" on all daemons. Don't selectively enable only certain log messages -- I'd like to see all of them.

Actions #24

Updated by Ilya Dryomov almost 3 years ago

Out of curiosity, how many iterations of bugshell does it take to reproduce? I might try it on the weekend, but it would be a lot easier if you provide a set of verbose logs of course.

Actions #25

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

Out of curiosity, how many iterations of bugshell does it take to reproduce? I might try it on the weekend, but it would be a lot easier if you provide a set of verbose logs of course.

The number of bugshell iterations isn't fixed; last time I reproduced it after seventy-six iterations. Maybe you can try it. As a reminder, you need to run bugshell on two Ceph nodes. I will reproduce it again to provide more detailed logs in my spare time.

Actions #26

Updated by wenge song almost 3 years ago

I reproduced with "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" on all daemons; mon.a was the leader the whole time. mds.b and mds.c were unable to get the new secret. If you need more logs or another reproduction, just tell me.

Actions #27

Updated by Ilya Dryomov almost 3 years ago

These logs are still weird. Now there are plenty of update_from_paxos log messages but virtually no paxosservice log messages. This can't be how it actually ran...

What modifications did you apply to the logging infrastructure? Are you somehow enabling logging on a per file basis?

Please remove all modifications to the logging infrastructure so that all log messages which correspond to "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" make it to the log file.

Actions #28

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

These logs are still weird. Now there are plenty of update_from_paxos log messages but virtually no paxosservice log messages. This can't be how it actually ran...

What modifications did you apply to the logging infrastructure? Are you somehow enabling logging on a per file basis?

Please remove all modifications to the logging infrastructure so that all log messages which correspond to "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" make it to the log file.

Hi,
If we want to see the log messages associated with paxosservice, maybe we need to add "debug_paxos = 30" as well. Should I do this?

Actions #29

Updated by Ilya Dryomov almost 3 years ago

Yes, "debug paxos = 30" would definitely help! Sorry, I missed it because the previous set of logs that you shared had them (perhaps not at 30 but still) so I just assumed they would stay enabled.

Basically the ask is a set of logs that is as verbose as possible so that anyone who looks at it doesn't end up scratching their head wondering whether a particular log line is missing because it didn't get executed or because it simply wasn't enabled.

Actions #30

Updated by wenge song almost 3 years ago

Ilya Dryomov wrote:

Yes, "debug paxos = 30" would definitely help! Sorry, I missed it because the previous set of logs that you shared had them (perhaps not at 30 but still) so I just assumed they would stay enabled.

Basically the ask is a set of logs that is as verbose as possible so that anyone who looks at it doesn't end up scratching their head wondering whether a particular log line is missing because it didn't get executed or because it simply wasn't enabled.

Please check whether these logs are what you want to see.

Actions #31

Updated by Neha Ojha almost 3 years ago

  • Project changed from Ceph to RADOS
Actions #32

Updated by Jerry Pu over 2 years ago

We ran into this issue, too. Our environment is a multi-host cluster (v15.2.9). Sometimes we could observe "unable to obtain rotating service keys" errors on the mds and osds because these daemons seemed to keep getting an older version of the rotating keys that had already expired; they were unable to get the correct version. Only at the next key renewal (within an hour) did they finally get the new, correct version of the rotating keys and boot successfully.
