Bug #17170
mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync
% Done: 100%
Description
1. Cluster info:
a> 3 servers
b> one mon on each server
2. Reproduction scenario (it happened only once):
I changed the server's time when I found it was wrong.
3. OSD errors (on some OSDs, not all):
After the time change I restarted the OSDs and found the following errors:
2016-08-26 10:13:14.556691 7fbac89de700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2016-08-26 09:13:14.556670)
2016-08-26 10:13:14.557430 7fbad09ee700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2016-08-26 09:13:14.557419)
2016-08-26 10:13:15.766918 7fbad8c2d700 10 journal commit_start max_applied_seq 55383725, open_ops 0
2016-08-26 10:13:15.766942 7fbad8c2d700 10 journal commit_start blocked, all open_ops have completed
2016-08-26 10:13:15.766945 7fbad8c2d700 10 journal commit_start nothing to do
2016-08-26 10:13:15.766949 7fbad8c2d700 10 journal commit_start
2016-08-26 10:13:20.767043 7fbad8c2d700 10 journal commit_start max_applied_seq 55383725, open_ops 0
2016-08-26 10:13:20.767085 7fbad8c2d700 10 journal commit_start blocked, all open_ops have completed
2016-08-26 10:13:20.767087 7fbad8c2d700 10 journal commit_start nothing to do
2016-08-26 10:13:20.767091 7fbad8c2d700 10 journal commit_start
2016-08-26 10:13:21.553471 7fbae466b880 0 monclient: wait_auth_rotating timed out after 30
2016-08-26 10:13:21.553499 7fbae466b880 -1 osd.2 9877 unable to obtain rotating service keys; retrying
2016-08-26 10:13:21.559833 7fbad8c2d700 10 journal commit_start max_applied_seq 55383725, open_ops 0
2016-08-26 10:13:21.559853 7fbad8c2d700 10 journal commit_start blocked, all open_ops have completed
2016-08-26 10:13:21.559865 7fbad8c2d700 10 journal commit_start nothing to do
2016-08-26 10:13:21.559869 7fbad8c2d700 10 journal commit_start
2016-08-26 10:13:21.560085 7fbae466b880 10 journal journal_stop
2016-08-26 10:13:21.560142 7fbae466b880 1 journal close /var/lib/ceph/osd/ceph-2/journal
2016-08-26 10:13:21.560156 7fbad842c700 20 journal write_thread_entry woke up
2016-08-26 10:13:21.560165 7fbad842c700 10 journal write_thread_entry finish
2016-08-26 10:13:21.560223 7fbae466b880 15 journal do_write writing 16200609792~0 + header
2016-08-26 10:13:21.560367 7fbae466b880 20 journal do_write latency 0.000139
2016-08-26 10:13:21.560372 7fbae466b880 20 journal do_write queueing finishers through seq 0
2016-08-26 10:13:21.560376 7fbae466b880 20 journal write_header_sync finish
2016-08-26 10:13:21.560391 7fbad7c2b700 10 journal write_finish_thread_entry exit
2016-08-26 10:13:21.624241 7fbae466b880 -1 ** ERROR: osd init failed: (110) Connection timed out
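For context: the "expired way too early" line above comes from the monclient's periodic check of its rotating service keys, which compares each key's expiry stamp against the local clock. A minimal sketch of that bound check follows; it is illustrative only (names and the exact comparison are simplified, not the real MonClient::_check_auth_rotating source), with ttl standing in for the auth_service_ticket_ttl option:

#include <chrono>
#include <iostream>

using Clock = std::chrono::system_clock;

// Simplified stand-in for a rotating service key: all we model is the
// expiry stamp assigned by the monitor when the key was issued.
struct RotatingKey {
  Clock::time_point expires;
};

// Sketch of the freshness check: a key whose expiry lies more than one
// full TTL in the past cannot be explained by normal rotation, so the
// local clock is probably skewed relative to the monitors.
bool rotating_key_usable(const RotatingKey& newest, std::chrono::seconds ttl) {
  const auto now = Clock::now();
  const auto issued_at_lower_bound = now - ttl;
  if (newest.expires < issued_at_lower_bound) {
    std::cerr << "possible clock skew, rotating keys expired way too early\n";
    return false;  // caller keeps retrying; OSD init eventually fails with (110)
  }
  return newest.expires > now;  // a merely stale key just triggers a renewal
}

int main() {
  RotatingKey k{Clock::now() + std::chrono::minutes(30)};  // healthy key
  std::cout << rotating_key_usable(k, std::chrono::seconds(3600)) << "\n";  // 1
}

If the host's clock runs ahead of the monitors' by more than the TTL, every key the monitors hand out immediately fails this bound, which is why the OSD loops on "unable to obtain rotating service keys" until init times out.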
Related issues
History
#1 Updated by Sage Weil over 6 years ago
- Status changed from New to Can't reproduce
#2 Updated by xie xingguo over 5 years ago
- Status changed from Can't reproduce to New
We hit this issue again in Luminous.
#3 Updated by tangwenjun tang over 5 years ago
- File osd.log.txt View added
When we rebooted one host, some OSDs took a long time to start.
One OSD only succeeded to start after restarting itself several times.
The whole process took about 25 minutes.
[root@ceph181 tecs]# cat /var/log/ceph/ceph-osd.14.log | grep "osd.14 178 init wait_auth_rotating timed out"
2018-03-23 14:54:26.747948 7fc09cb6ad00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:00:36.104752 7fbcbf0ced00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:07:31.182964 7fc9369b8d00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:13:25.098623 7f0ecd4b9d00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:19:20.558378 7fed8f21ad00 -1 osd.14 178 init wait_auth_rotating timed out
#4 Updated by Marcin Gibula over 5 years ago
I've seen this on our cluster (Luminous, BlueStore-based), but was unable to reproduce it...
Restarting the primary mon did help, however.
#5 Updated by Xiaoxi Chen over 5 years ago
- Affected Versions v10.2.7 added
- Affected Versions deleted (v0.94.7)
We hit this today on the Jewel release (10.2.7); all OSDs connected to one of the monitors in the quorum had this issue.
Restarting that monitor worked around it.
The behavior was exactly the same as described in the main thread: the expiry time of the rotating key is exactly 1 hour in the past, e.g.:
2016-08-26 10:13:14.556691 7fbac89de700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2016-08-26 09:13:14.556670)
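(Editorial note: the "exactly 1 hour" gap is consistent with the default auth_service_ticket_ttl of 3600 seconds; the "(before ...)" timestamp in the message is simply the local time minus one TTL. A tiny illustration, assuming that default:)

#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  const seconds ttl{3600};  // assumed default auth_service_ticket_ttl
  const auto now = system_clock::now();
  const auto issued_at_lower_bound = now - ttl;  // the "(before ...)" stamp
  std::cout << duration_cast<minutes>(now - issued_at_lower_bound).count()
            << " minutes\n";  // prints 60: exactly 1 hour in the past
}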
#6 Updated by Xiaoxi Chen over 5 years ago
- Affected Versions v0.94.7 added
- Affected Versions deleted (v10.2.7)
#7 Updated by Xiaoxi Chen over 5 years ago
- Affected Versions v10.2.7 added
- Affected Versions deleted (v0.94.7)
#8 Updated by Peter Gervai over 5 years ago
I see a somewhat different effect on v12.2.5, but it may be related.
I have similar logs:
Jun 22 14:03:03 bowie ceph-mgr[3641]: 2018-06-22 14:03:03.502623 7f45bc15f700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-06-22 13:03:03.502598)
The effect: I added a bunch of OSDs and a node crashed; after it came back up, all OSDs started just fine, but everything related to mgr+osd was stuck forever (like 'ceph osd status' or half of the 'ceph osd *' commands). Apart from that, everything looked normal.
As above, restarting the leader mgr fixed the problem and returned the system to normal.
#9 Updated by Yuval Lifshitz over 3 years ago
Issue still seen in the Pacific dev version:
ceph version 16.0.0-2831-gdc63be5 (dc63be574b5eff5848fb48b7a21253689fcbefb2) pacific (dev)
2020-07-14T08:24:15.754+0000 7fa899dbaf40 0 monclient: wait_auth_rotating timed out after 30
2020-07-14T08:24:15.754+0000 7fa899dbaf40 -1 osd.7 118 unable to obtain rotating service keys; retrying
2020-07-14T08:24:15.754+0000 7fa899dbaf40 -1 osd.7 118 init wait_auth_rotating timed out
Restarting the OSD, the mgr, or the entire machine didn't help.
Could be related to clock skew, but setting an NTP server didn't fix the issue:
# ceph status
  cluster:
    id:     32572cfe-e63c-4b39-a073-dda65252d291
    health: HEALTH_WARN
            clock skew detected on mon.e25-h35-740xd
            3 osds down
            1 host (7 osds) down
            Reduced data availability: 193 pgs inactive
            Degraded data redundancy: 61190/91785 objects degraded (66.667%), 96 pgs degraded, 193 pgs undersized
            9453 slow ops, oldest one blocked for 1717 sec, mon.e25-h35-740xd has slow ops

  services:
    mon: 2 daemons, quorum e25-h33-740xd,e25-h35-740xd (age 28m)
    mgr: e25-h33-740xd(active, since 28m)
    osd: 14 osds: 7 up (since 28m), 10 in (since 10h)

  data:
    pools:   7 pools, 193 pgs
    objects: 30.59k objects, 29 GiB
    usage:   5.9 TiB used, 11 TiB / 17 TiB avail
    pgs:     100.000% pgs not active
             61190/91785 objects degraded (66.667%)
             97 undersized+peered
             96 undersized+degraded+peered
also:
timedatectl
               Local time: Tue 2020-07-14 09:35:32 UTC
           Universal time: Tue 2020-07-14 09:35:32 UTC
                 RTC time: Tue 2020-07-14 09:35:32
                Time zone: UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no
#10 Updated by Yuval Lifshitz over 3 years ago
The issue was fixed after setting a correct NTP server on the machines.
I followed the instructions here: https://access.redhat.com/solutions/3073261
#11 Updated by liqun zhang over 1 year ago
- File rotating_service_keys.png View added
This issue occurs in 15.2.13; the OSD log is attached as rotating_service_keys.png.
#12 Updated by Greg Farnum about 1 year ago
- Project changed from Ceph to RADOS
- Subject changed from mon/monclient: unable to obtain rotating service keys when osd init to mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync
- Category set to Security
- Status changed from New to Closed
This report can technically have other causes, but in practice it's always because the OSDs are too far out of clock sync with the monitors and, as a result, can't find a cephx key within the right time bounds. We should update the error message to suggest that fix.
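A sketch of the kind of message change this suggests is below; the wording is hypothetical (the actual change is tracked in the pull request referenced in update #14):

#include <iostream>

// Hypothetical wording only: on a rotating-key timeout during OSD init,
// point the operator at clock synchronization instead of emitting just a
// generic retry/timeout message.
void report_rotating_key_timeout(int osd_id) {
  std::cerr << "osd." << osd_id
            << " unable to obtain rotating service keys; retrying\n";
  std::cerr << "osd." << osd_id
            << " init: possible clock skew against the monitors; "
               "check time synchronization (chrony/ntpd) on this host\n";
}

int main() { report_rotating_key_timeout(7); }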
#13 Updated by Greg Farnum about 1 year ago
- Status changed from Closed to New
#14 Updated by Greg Farnum about 1 year ago
- Status changed from New to Fix Under Review
- Assignee set to Greg Farnum
- Pull request ID set to 48318
#15 Updated by Yuri Weinstein 12 months ago
#16 Updated by Radoslaw Zarzynski 11 months ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to pacific,quincy
#17 Updated by Backport Bot 11 months ago
- Copied to Backport #58333: pacific: mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync added
#18 Updated by Backport Bot 11 months ago
- Copied to Backport #58334: quincy: mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync added
#19 Updated by Backport Bot 11 months ago
- Tags set to backport_processed
#20 Updated by Konstantin Shalygin 1 day ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100