Bug #50775
mds and osd unable to obtain rotating service keys
Description
Ceph version: 15.2.0
Error message:
2021-05-04T05:51:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:52:54.718+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:52:54.718+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:24.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:53:54.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:53:54.719+0800 7f105b2737c0 -1 mds.c unable to obtain rotating service keys; retrying
2021-05-04T05:54:24.719+0800 7f105b2737c0 0 monclient: wait_auth_rotating timed out after 30
2021-05-04T05:54:24.719+0800 7f105b2737c0 -1 mds.c ERROR: failed to refresh rotating keys, maximum retry time reached.
2021-05-04T05:54:24.719+0800 7f105b2737c0 1 mds.c suicide!
The same problem occurs with an OSD on the same node as the MDS:
2021-05-03T10:41:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:15.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 0 monclient: wait_auth_rotating timed out after 30
2021-05-03T10:42:45.024+0800 7f7f8e790ec0 -1 osd.25 11233 unable to obtain rotating service keys; retrying
2021-05-03T10:42:45.025+0800 7f7f8e790ec0 -1 osd.25 11233 init wait_auth_rotating timed out
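The retry pattern visible in the logs above (wait up to 30 seconds per attempt, log a timeout, retry, and finally give up once a total retry budget is exhausted, at which point the MDS commits suicide) can be sketched as follows. This is an illustrative model only; the function and parameter names are not Ceph's actual identifiers, and the timeouts are configurable here purely for demonstration:

```python
import time

def obtain_rotating_keys(have_valid_keys, attempt_timeout=30.0,
                         max_retry_time=150.0, now=time.monotonic):
    """Model of the daemon startup loop: wait for fresh rotating keys,
    retrying until a total deadline, then give up entirely."""
    start = now()
    while True:
        deadline = now() + attempt_timeout
        while now() < deadline:
            if have_valid_keys():
                return True            # keys obtained, boot continues
            time.sleep(0.01)
        # one attempt timed out: "wait_auth_rotating timed out after 30"
        if now() - start >= max_retry_time:
            return False               # "maximum retry time reached"
        # otherwise: "unable to obtain rotating service keys; retrying"
```

In the logs above, each inner timeout produces a "wait_auth_rotating timed out" line, and the outer budget expiring produces the fatal "maximum retry time reached" line followed by suicide.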
Steps to reproduce:
1. Set auth_service_ticket_ttl to 120s to speed up key rotation.
2. Repeatedly restart a mon on one node and repeatedly restart an mds on another node; neither node hosts the leader mon.
3. When the leader mon happens to be pushing the updated keys to the other mons just as one mon restarts, the mons need to re-elect a leader, and the following occurs:
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
Root cause:
1. When trigger_propose() fails, the leader mon's key version diverges from the other mons': the leader has the new key version, but the peons still have the old one.
2. When an mds or osd refreshes its keys from a peon mon rather than the leader, it receives the stale older version, which triggers the bug.
3. The next time the leader rotates the keys, trigger_propose() succeeds, the peon mons pick up the new version, the mds and osd follow, and the cluster returns to normal.
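The sequence above can be illustrated with a toy model (all class and attribute names are illustrative; this is not Ceph code): the leader bumps its key version, but while paxos is not active the proposal that would distribute the new keys is deferred, so any daemon asking a peon gets the stale version until the next successful rotation:

```python
class ToyMon:
    def __init__(self):
        self.key_version = 1

class ToyCluster:
    """Toy model of the race: the leader rotates keys, but the paxos
    proposal that distributes them to the peons may be deferred."""
    def __init__(self):
        self.leader = ToyMon()
        self.peons = [ToyMon(), ToyMon()]
        self.paxos_active = True       # False while an election is in progress

    def rotate_keys(self):
        self.leader.key_version += 1   # leader encodes the new keys
        if self.paxos_active:          # trigger_propose succeeds
            for p in self.peons:
                p.key_version = self.leader.key_version
            return True
        return False                   # "not active, will propose later"

cluster = ToyCluster()
cluster.paxos_active = False           # a mon restarted; election in progress
cluster.rotate_keys()                  # step 1: rotation, proposal deferred
stale = cluster.peons[0].key_version   # step 2: a daemon asks a peon, gets old keys
cluster.paxos_active = True
cluster.rotate_keys()                  # step 3: next rotation commits normally
fresh = cluster.peons[0].key_version   # peons now serve the current version
```

In this model, `stale` lags behind the leader's version after step 1, and only after step 3 do the peons catch up, which matches the window during which the mds and osd time out.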
I will submit a patch for this bug.
Files
Updated by wenge song almost 3 years ago
I will reproduce, fix and verify this bug, then submit the fix for code review.
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #50390: mds: monclient: wait_auth_rotating timed out after 30 added
Updated by Ilya Dryomov almost 3 years ago
Hi Song,
Could you please confirm the ceph version with the output of "ceph-mds --version"?
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
Hi Song,
Could you please confirm the ceph version with the output of "ceph-mds --version"?
Hi Ilya,
Sorry, because we've made local changes, I can't get the version the way you suggested. Is the following acceptable instead?
[root@ceph1 ~]# rpm -qa | grep ceph-mds
ceph-mds-15.2.0-1.0.2.3.0_2104292130.el8.x86_64
Updated by Ilya Dryomov almost 3 years ago
So you are running the original octopus release with custom patches added in?
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
So you are running the original octopus release with custom patches added in?
Just a small change; I think the bug has nothing to do with it. As I described, I believe the mon leader failed to sync the rotating keys, but I only saw "trigger_propose not active, will propose later". I don't know whether, or when, the proposal will be retried.
Updated by Ilya Dryomov almost 3 years ago
But it is the original octopus release with no substantial backports, correct?
In particular, can you confirm that you don't have any of
https://tracker.ceph.com/issues/50452
https://github.com/ceph/ceph/compare/cbc7f727842f93130fcfbbad64de0959ee4a02d3...f3a4166379b12d4a7bba667fe761e5b660552db1
changes included in your build?
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
But it is the original octopus release with no substantial backports, correct?
In particular, can you confirm that you don't have any of
https://tracker.ceph.com/issues/50452
https://github.com/ceph/ceph/compare/cbc7f727842f93130fcfbbad64de0959ee4a02d3...f3a4166379b12d4a7bba667fe761e5b660552db1
changes included in your build?
Indeed, none of the changes above are in our build. I looked at these changes; do you think the bug is fixed by them, or is there a definite patch that solves this problem?
Updated by Ilya Dryomov almost 3 years ago
No, it's the other way around -- there is some suspicion that https://tracker.ceph.com/issues/50390 is caused by these changes because it appeared right after they got merged.
The fact that you don't have them in your build means that either your bug is different from https://tracker.ceph.com/issues/50390 or that these changes have nothing to do with failures to obtain rotating service keys that you and Patrick are seeing.
Updated by Patrick Donnelly almost 3 years ago
- Project changed from CephFS to Ceph
Updated by Ilya Dryomov almost 3 years ago
I posted https://github.com/ceph/ceph/pull/41368, please take a look. It's probably not going to solve an edge case of restarting multiple daemons repeatedly, but it's definitely a step towards that.
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
I posted https://github.com/ceph/ceph/pull/41368, please take a look. It's probably not going to solve an edge case of restarting multiple daemons repeatedly, but it's definitely a step towards that.
Hi Ilya,
I've looked at this commit. It doesn't solve this bug, but thank you all the same.
Updated by wenge song almost 3 years ago
wenge song wrote:
Ilya Dryomov wrote:
I posted https://github.com/ceph/ceph/pull/41368, please take a look. It's probably not going to solve an edge case of restarting multiple daemons repeatedly, but it's definitely a step towards that.
Hi Ilya,
I've looked at this commit. It doesn't solve this bug, but thank you all the same.
Hi Ilya,
I posted https://github.com/ceph/ceph/pull/41438, please review. I changed the code and can no longer reproduce this bug.
Updated by Kefu Chai almost 3 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 41438
Updated by Ilya Dryomov almost 3 years ago
Can you share the full monitor logs? Specifically, I'm interested in the log where the following excerpt came from
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 check_rotate updated rotating
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).paxosservice(auth 5508..5687) propose_pending
2021-05-12T15:07:43.654+0800 7f5f761f9700 10 mon.a@0(leader).auth v5687 encode_pending v 5688
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader) e3 log_health updated 0 previous 0
2021-05-12T15:07:43.669+0800 7f5f761f9700 5 mon.a@0(leader).paxos(paxos updating c 409133..409838) queue_pending_finisher 0x557067a8a690
2021-05-12T15:07:43.669+0800 7f5f761f9700 10 mon.a@0(leader).paxos(paxos updating c 409133..409838) trigger_propose not active, will propose later
but also all peon monitor logs as well.
Updated by Ilya Dryomov almost 3 years ago
Just to be clear, are you saying that if the proposal with the new keys doesn't get sent because trigger_propose() returns false, the peon monitors don't get new keys until the next renewal (i.e. 120 seconds later in your case)?
Because ordinarily trigger_propose() would simply delay the proposal, not discard it entirely. Are you seeing the proposal with the new keys being discarded?
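The delay-then-flush behavior described here (a proposal made while paxos is not active is queued and flushed once paxos becomes active again, rather than being dropped) can be sketched as follows. The names are illustrative and do not match the actual PaxosService API:

```python
class ToyPaxos:
    """Sketch of deferred proposals: trigger_propose() while not active
    queues the pending value; it is proposed once paxos goes active."""
    def __init__(self):
        self.active = False
        self.pending = None
        self.committed = []

    def trigger_propose(self, value):
        self.pending = value
        if not self.active:
            return False               # "not active, will propose later"
        self._propose()
        return True

    def _propose(self):
        self.committed.append(self.pending)
        self.pending = None

    def set_active(self):
        self.active = True
        if self.pending is not None:   # flush the deferred proposal
            self._propose()

paxos = ToyPaxos()
paxos.trigger_propose("new rotating keys")   # deferred, not discarded
paxos.set_active()                           # proposal now goes through
```

Under this model, the question in the comment above amounts to asking whether the deferred proposal really is flushed promptly once paxos becomes active, or whether it can sit unproposed until the next key rotation.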
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
Can you share the full monitor logs? Specifically, I'm interested in the log where the following excerpt came from
[...]
but also all peon monitor logs as well.
Thanks for your review and reply. I need to reproduce this bug again to collect more logs.
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
Just to be clear, are you saying that if the proposal with the new keys doesn't get sent because trigger_propose() returns false, the peon monitors don't get new keys until the next renewal (i.e. 120 seconds later in your case)?
Because ordinarily trigger_propose() would simply delay the proposal, not discard it entirely. Are you seeing the proposal with the new keys being discarded?
You are right, I didn't see the proposal with the new keys being discarded. But when the new keys rotated and trigger_propose() returned false, the mds and osd were unable to get the new keys from a peon monitor. I think the leader mon either did not push the new keys to the peon mons in time or the push failed for some other reason, so I modified the code to retry pushing the new keys more times. Maybe I got it wrong, so this needs more in-depth discussion.
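The idea described in this comment (retrying the key push a bounded number of times instead of waiting for the next rotation) could look roughly like this. This is a sketch of the idea only, with hypothetical names, not the actual change in the posted pull request:

```python
def push_rotating_keys(send_to_peons, max_attempts=3):
    """Retry pushing the new rotating keys to the peons a few times,
    rather than giving up after a single failed attempt."""
    for attempt in range(1, max_attempts + 1):
        if send_to_peons():
            return attempt             # succeeded on this attempt
    return None                        # all attempts failed

# Example: the first push fails (paxos not active), the second succeeds.
results = iter([False, True])
attempt = push_rotating_keys(lambda: next(results))
```

In the example, the push succeeds on the second attempt, so the peons would receive the new keys without waiting for the next scheduled rotation.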
Updated by wenge song almost 3 years ago
- File ceph-mon.a.log.tar.gz ceph-mon.a.log.tar.gz added
mon leader log
Updated by wenge song almost 3 years ago
- File ceph-mds.b.log.tar.gz ceph-mds.b.log.tar.gz added
mds.b was unable to obtain rotating service keys; this is the mds.b log.
Updated by wenge song almost 3 years ago
wenge song wrote:
mds.b was unable to obtain rotating service keys; this is the mds.b log.
2021-05-24T18:48:22.934+0800 7f4853f19700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2021-05-24T18:47:52.934877+0800)
Updated by wenge song almost 3 years ago
- File ceph-mon.b.log.tar.gz ceph-mon.b.log.tar.gz added
- File bugshell bugshell added
bugshell is my test case; mon.b is a peon monitor.
Updated by Ilya Dryomov almost 3 years ago
The logs that you provided are weird. Some log messages that should be there are not there. For example, I don't see a single update_from_paxos log message, but they should be there given that mon.b was restarted multiple times and other log messages at "debug mon = 10" level are there.
And some log messages seem to have had their log level adjusted. For example, _rotate_secret and _check_rotating_secrets log messages are not 0 upstream.
Please remove all modifications to the logging infrastructure and reproduce with "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" on all daemons. Don't selectively enable only certain log messages -- I'd like to see all of them.
Updated by Ilya Dryomov almost 3 years ago
Out of curiosity, how many iterations of bugshell does it take to reproduce? I might try it on the weekend, but it would be a lot easier if you provide a set of verbose logs of course.
Updated by wenge song almost 3 years ago
Ilya Dryomov wrote:
Out of curiosity, how many iterations of bugshell does it take to reproduce? I might try it on the weekend, but it would be a lot easier if you provide a set of verbose logs of course.
It's not a fixed number of bugshell iterations; last time I reproduced it after seventy-six iterations. Maybe you can try it. As a reminder, you need to run bugshell on two Ceph nodes. I will reproduce it again to provide more detailed logs in my spare time.
Updated by wenge song almost 3 years ago
- File ceph-mds.b.log.tar.gz ceph-mds.b.log.tar.gz added
- File ceph-mds.c.log.tar.gz ceph-mds.c.log.tar.gz added
- File ceph-mon.a.log.tar.gz ceph-mon.a.log.tar.gz added
- File ceph-mon.b.log.tar.gz ceph-mon.b.log.tar.gz added
- File ceph-mon.c.log.tar.gz ceph-mon.c.log.tar.gz added
I reproduced with "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" on all daemons; mon.a was the leader the whole time. mds.b and mds.c were unable to get the new keys. If you need more logs or another reproduction, just tell me.
Updated by Ilya Dryomov almost 3 years ago
These logs are still weird. Now there is plenty of update_from_paxos log messages but virtually no paxosservice log messages. This can't be how it actually ran...
What modifications did you apply to the logging infrastructure? Are you somehow enabling logging on a per file basis?
Please remove all modifications to the logging infrastructure so that all log messages which correspond to "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" make it to the log file.
Updated by wenge song almost 3 years ago
- File 1622700668(1).jpg 1622700668(1).jpg added
Ilya Dryomov wrote:
These logs are still weird. Now there is plenty of update_from_paxos log messages but virtually no paxosservice log messages. This can't be how it actually ran...
What modifications did you apply to the logging infrastructure? Are you somehow enabling logging on a per file basis?
Please remove all modifications to the logging infrastructure so that all log messages which correspond to "debug mon = 30", "debug monc = 30", "debug auth = 30" and "debug ms = 1" make it to the log file.
Hi,
If we want to see the log messages associated with paxosservice, maybe we need to add "debug paxos = 30". Should I do this?
Updated by Ilya Dryomov almost 3 years ago
Yes, "debug paxos = 30" would definitely help! Sorry, I missed it because the previous set of logs that you shared had them (perhaps not at 30 but still) so I just assumed they would stay enabled.
Basically the ask is a set of logs that is as verbose as possible so that anyone who looks at it doesn't end up scratching their head wondering whether a particular log line is missing because it didn't get executed or because it simply wasn't enabled.
Updated by wenge song almost 3 years ago
- File ceph-mds.b.tar.gz ceph-mds.b.tar.gz added
- File ceph-mds.c.tar.gz ceph-mds.c.tar.gz added
- File ceph-mon.a.tar.gz ceph-mon.a.tar.gz added
- File ceph-mon.b.tar.gz ceph-mon.b.tar.gz added
- File ceph-mon.c.tar.gz ceph-mon.c.tar.gz added
Ilya Dryomov wrote:
Yes, "debug paxos = 30" would definitely help! Sorry, I missed it because the previous set of logs that you shared had them (perhaps not at 30 but still) so I just assumed they would stay enabled.
Basically the ask is a set of logs that is as verbose as possible so that anyone who looks at it doesn't end up scratching their head wondering whether a particular log line is missing because it didn't get executed or because it simply wasn't enabled.
Please check whether these logs are what you want to see.
Updated by Jerry Pu over 2 years ago
We ran into this issue, too. Our environment is a multi-host cluster (v15.2.9). Sometimes we observe "unable to obtain rotating service keys" errors on the mds and osds because these daemons keep receiving an older version of the rotating keys that has already expired, and they are unable to get the correct version. Only at the next key renewal (within an hour) do they finally receive the new, correct rotating keys and boot successfully.