Project

General

Profile

Bug #17170

mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync

Added by huanwen ren over 7 years ago. Updated 4 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Security
Target version:
-
% Done:

100%

Source:
other
Tags:
backport_processed
Backport:
pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

1.cluster info:
a> have 3 servers
b> have a mon each server

2.Constructing scenarios(only one times)
I changed the time of server, when I found wrong time

3.osd error(some osds, not all)
I restart osd after the modification time found the following error osd:

2016-08-26 10:13:14.556691 7fbac89de700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2016-08-26 09:13:14.556670)
2016-08-26 10:13:14.557430 7fbad09ee700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2016-08-26 09:13:14.557419)
2016-08-26 10:13:15.766918 7fbad8c2d700 10 journal commit_start max_applied_seq 55383725, open_ops 0
2016-08-26 10:13:15.766942 7fbad8c2d700 10 journal commit_start blocked, all open_ops have completed
2016-08-26 10:13:15.766945 7fbad8c2d700 10 journal commit_start nothing to do
2016-08-26 10:13:15.766949 7fbad8c2d700 10 journal commit_start
2016-08-26 10:13:20.767043 7fbad8c2d700 10 journal commit_start max_applied_seq 55383725, open_ops 0
2016-08-26 10:13:20.767085 7fbad8c2d700 10 journal commit_start blocked, all open_ops have completed
2016-08-26 10:13:20.767087 7fbad8c2d700 10 journal commit_start nothing to do
2016-08-26 10:13:20.767091 7fbad8c2d700 10 journal commit_start
2016-08-26 10:13:21.553471 7fbae466b880  0 monclient: wait_auth_rotating timed out after 30
2016-08-26 10:13:21.553499 7fbae466b880 -1 osd.2 9877 unable to obtain rotating service keys; retrying
2016-08-26 10:13:21.559833 7fbad8c2d700 10 journal commit_start max_applied_seq 55383725, open_ops 0
2016-08-26 10:13:21.559853 7fbad8c2d700 10 journal commit_start blocked, all open_ops have completed
2016-08-26 10:13:21.559865 7fbad8c2d700 10 journal commit_start nothing to do
2016-08-26 10:13:21.559869 7fbad8c2d700 10 journal commit_start
2016-08-26 10:13:21.560085 7fbae466b880 10 journal journal_stop
2016-08-26 10:13:21.560142 7fbae466b880  1 journal close /var/lib/ceph/osd/ceph-2/journal
2016-08-26 10:13:21.560156 7fbad842c700 20 journal write_thread_entry woke up
2016-08-26 10:13:21.560165 7fbad842c700 10 journal write_thread_entry finish
2016-08-26 10:13:21.560223 7fbae466b880 15 journal do_write writing 16200609792~0 + header
2016-08-26 10:13:21.560367 7fbae466b880 20 journal do_write latency 0.000139
2016-08-26 10:13:21.560372 7fbae466b880 20 journal do_write queueing finishers through seq 0
2016-08-26 10:13:21.560376 7fbae466b880 20 journal write_header_sync finish
2016-08-26 10:13:21.560391 7fbad7c2b700 10 journal write_finish_thread_entry exit
2016-08-26 10:13:21.624241 7fbae466b880 -1 ^[[0;31m ** ERROR: osd init failed: (110) Connection timed out^[[0m

ceph-osd.2_back.log View (41.9 KB) huanwen ren, 08/30/2016 09:18 AM

osd.log.txt View (329 KB) tangwenjun tang, 03/29/2018 01:19 AM

rotating_service_keys.png View (141 KB) liqun zhang, 03/25/2022 02:35 AM


Related issues

Copied to RADOS - Backport #58333: pacific: mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync Rejected
Copied to RADOS - Backport #58334: quincy: mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync Resolved

History

#1 Updated by Sage Weil almost 7 years ago

  • Status changed from New to Can't reproduce

#2 Updated by xie xingguo about 6 years ago

  • Status changed from Can't reproduce to New

We hit this issue again in Luminous.

#3 Updated by tangwenjun tang about 6 years ago

when we reboot one host, some osd take a long time to start.
and one osd succeed to start finally after several times restart by itself.
it cost about 25 minutes in the whole times

[root@ceph181 tecs]# cat /var/log/ceph/ceph-osd.14.log | grep "osd.14 178 init wait_auth_rotating timed out"
2018-03-23 14:54:26.747948 7fc09cb6ad00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:00:36.104752 7fbcbf0ced00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:07:31.182964 7fc9369b8d00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:13:25.098623 7f0ecd4b9d00 -1 osd.14 178 init wait_auth_rotating timed out
2018-03-23 15:19:20.558378 7fed8f21ad00 -1 osd.14 178 init wait_auth_rotating timed out

#4 Updated by Marcin Gibula about 6 years ago

I've seen this on our cluster (luminous, bluestore based), but was unable to reproduce it...
Restarting primary mon did help however.

#5 Updated by Xiaoxi Chen almost 6 years ago

  • Affected Versions v10.2.7 added
  • Affected Versions deleted (v0.94.7)

We hit this today on Jewel release (10.2.7), all OSDs connected to one of the monitor in the quorum having this issue.
Restarting that monitor walk around it.

The behavior was exactly the same as what described in main thread, the expired time of the rotating key is exactly 1 hour in the past, like
2016-08-26 10:13:14.556691 7fbac89de700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2016-08-26 09:13:14.556670)

#6 Updated by Xiaoxi Chen almost 6 years ago

  • Affected Versions v0.94.7 added
  • Affected Versions deleted (v10.2.7)

#7 Updated by Xiaoxi Chen almost 6 years ago

  • Affected Versions v10.2.7 added
  • Affected Versions deleted (v0.94.7)

#8 Updated by Peter Gervai almost 6 years ago

I have a bit different effect on v12.2.5, but may be related:

I have similar logs:

Jun 22 14:03:03 bowie ceph-mgr[3641]: 2018-06-22 14:03:03.502623 7f45bc15f700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-06-22 13:03:03.502598)

The effect is that I have added a bunch of OSDs and crashed a node, and after it came up all OSDs started just fine, but everything related to mgr+osd is stuck forever (like 'ceph osd status' or half of 'ceph osd *' commands). But everything looks normal apart from that.

Similar as above: restarting the leader mgr fixes the problem and puts back the system to normal.

#9 Updated by Yuval Lifshitz over 3 years ago

issue still seen in pacific dev version:

ceph version 16.0.0-2831-gdc63be5 (dc63be574b5eff5848fb48b7a21253689fcbefb2) pacific (dev)
2020-07-14T08:24:15.754+0000 7fa899dbaf40  0 monclient: wait_auth_rotating timed out after 30
2020-07-14T08:24:15.754+0000 7fa899dbaf40 -1 osd.7 118 unable to obtain rotating service keys; retrying
2020-07-14T08:24:15.754+0000 7fa899dbaf40 -1 osd.7 118 init wait_auth_rotating timed out

restart of osd, mgw or the entire machine didn't help.

could be related to clock skew, but setting NTP server didn't fix the issue:

# ceph status                                                                                                                                                                                                    [12/95]
  cluster:                               
    id:     32572cfe-e63c-4b39-a073-dda65252d291     
    health: HEALTH_WARN
            clock skew detected on mon.e25-h35-740xd
            3 osds down      
            1 host (7 osds) down   
            Reduced data availability: 193 pgs inactive
            Degraded data redundancy: 61190/91785 objects degraded (66.667%), 96 pgs degraded, 193 pgs undersized
            9453 slow ops, oldest one blocked for 1717 sec, mon.e25-h35-740xd has slow ops
 services:                                                                                                                                                                                                                             [1/95]
    mon: 2 daemons, quorum e25-h33-740xd,e25-h35-740xd (age 28m)
    mgr: e25-h33-740xd(active, since 28m)
    osd: 14 osds: 7 up (since 28m), 10 in (since 10h)

  data:
    pools:   7 pools, 193 pgs
    objects: 30.59k objects, 29 GiB
    usage:   5.9 TiB used, 11 TiB / 17 TiB avail
    pgs:     100.000% pgs not active
             61190/91785 objects degraded (66.667%)
             97 undersized+peered
             96 undersized+degraded+peered

also:

timedatectl
               Local time: Tue 2020-07-14 09:35:32 UTC
           Universal time: Tue 2020-07-14 09:35:32 UTC
                 RTC time: Tue 2020-07-14 09:35:32
                Time zone: UTC (UTC, +0000)
System clock synchronized: no
              NTP service: active
          RTC in local TZ: no

#10 Updated by Yuval Lifshitz over 3 years ago

issue fixed after setting correct NTP server on the machines.
followed the instructions here: https://access.redhat.com/solutions/3073261

#11 Updated by liqun zhang about 2 years ago

this issue occurs in 15.2.13. osd log as rotating_service_keys.png.

#12 Updated by Greg Farnum over 1 year ago

  • Project changed from Ceph to RADOS
  • Subject changed from mon/monclient: unable to obtain rotating service keys when osd init to mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync
  • Category set to Security
  • Status changed from New to Closed

This report can technically have other causes, but it's just always because the OSDs are too far out of clock sync with the monitors, and can't find a cephx key that's in the right time bounds as a result. We should update the error message to suggest that fix.

#13 Updated by Greg Farnum over 1 year ago

  • Status changed from Closed to New

#14 Updated by Greg Farnum over 1 year ago

  • Status changed from New to Fix Under Review
  • Assignee set to Greg Farnum
  • Pull request ID set to 48318

#16 Updated by Radoslaw Zarzynski over 1 year ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport set to pacific,quincy

#17 Updated by Backport Bot over 1 year ago

  • Copied to Backport #58333: pacific: mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync added

#18 Updated by Backport Bot over 1 year ago

  • Copied to Backport #58334: quincy: mon/monclient: update "unable to obtain rotating service keys when osd init" to suggest clock sync added

#19 Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed

#20 Updated by Konstantin Shalygin 4 months ago

  • Status changed from Pending Backport to Resolved
  • % Done changed from 0 to 100

Also available in: Atom PDF