Project

General

Profile

Bug #23460

mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early

Added by Wido den Hollander over 1 year ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
ceph-mgr
Target version:
-
Start date:
03/26/2018
Due date:
% Done:

0%

Source:
Tags:
cephx,mgr,monclient
Backport:
luminous, mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

On Luminous v12.2.2 and v12.2.4 clusters running either CentOS or Ubuntu I've seen many Manager going offline with these messages in their logs:

2018-03-26 05:32:50.976826 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:32:50.976818)
2018-03-26 05:33:00.977091 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:00.977083)
2018-03-26 05:33:10.977244 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:10.977238)
2018-03-26 05:33:20.977398 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:20.977392)
2018-03-26 05:33:30.977584 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:30.977577)
Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311 7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-23 8:18:22.451287)

The first thing to check is if the time is correct on these systems, but it is, they are all in sync thanks to NTP.

No clock drift warnings or such, the Mgr just goes down with these messages.

gdb.txt View (69 KB) Wido den Hollander, 04/05/2018 06:27 AM


Related issues

Related to mgr - Feature #23574: Add a HeartbeatMap to ceph-mgr (die on deadlocks) New 04/06/2018
Duplicated by mgr - Bug #36266: mgr: deadlock in ClusterState Duplicate 09/30/2018
Copied to mgr - Backport #38318: luminous: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early Resolved
Copied to mgr - Backport #38319: mimic: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early Resolved

History

#1 Updated by John Spray over 1 year ago

This one's still a mystery to me :-/

Only initial thought is that we're in daylight-savings territory right now, could be a differing TZ between mon and mgr nodes?

#2 Updated by Wido den Hollander over 1 year ago

I don't think the TZ is related. In all the cases I've seen the Mon and Mgr were running on the same node.

I saw it happen again on this system, but it also happened before the daylight saving changes. Seen many cases in the last few months where this occured.

#3 Updated by Wido den Hollander over 1 year ago

I found another mgr in a cluster which has been down now for 4 days.

[root@srv-zmb03-05 (GN3) ceph]# systemctl status ceph-mgr@srv-zmb03-05
* ceph-mgr@srv-zmb03-05.service - Ceph cluster manager daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-03-19 07:21:03 CET; 1 weeks 0 days ago
 Main PID: 1468563 (ceph-mgr)
   CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@srv-zmb03-05.service
           `-1468563 /usr/bin/ceph-mgr -f --cluster ceph --id srv-zmb03-05 --setuser ceph --setgroup ceph

Mar 22 00:26:07 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:07.421833 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:07.421826)
Mar 22 00:26:17 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:17.421996 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:17.421990)
Mar 22 00:26:27 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:27.422161 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:27.422154)
Mar 22 00:26:37 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:37.422325 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:37.422314)
Mar 22 00:26:47 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:47.422530 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:47.422521)
Mar 22 03:19:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 03:19:01.978135 7f6c04c56700 -1 received  signal: Hangup from  PID: 2219601 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw  UID: 0
Mar 23 03:47:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-23 03:47:01.450114 7f6c04c56700 -1 received  signal: Hangup from  PID: 2412024 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw  UID: 0
Mar 24 03:44:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-24 03:44:01.851657 7f6c04c56700 -1 received  signal: Hangup from  PID: 2600400 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw  UID: 0
Mar 25 03:09:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-25 03:09:01.902803 7f6c04c56700 -1 received  signal: Hangup from  PID: 2776777 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw  UID: 0
Mar 26 03:11:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-26 03:11:01.589827 7f6c04c56700 -1 received  signal: Hangup from  PID: 2965956 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw  UID: 0
Hint: Some lines were ellipsized, use -l to show in full.
[root@srv-zmb03-05 (GN3) ceph]#

When trying to talk to the mgr over the admin socket it fails:

[root@srv-zmb03-05 (GN3) ceph]# timeout 10 ceph daemon mgr.srv-zmb03-05 status
[root@srv-zmb03-05 (GN3) ceph]#

After 10 seconds I got no response from the mgr daemon.

This is a 12.2.2 manager daemon, but I saw the same on a 12.2.4 manager.

The daemon is still running:

[root@srv-zmb03-05 (GN3) ceph]# ps aux|grep mgr
ceph     1468563 38.2  2.5 1834408 839804 ?      Ssl  Mar19 4018:20 /usr/bin/ceph-mgr -f --cluster ceph --id srv-zmb03-05 --setuser ceph --setgroup ceph
root     3063753  0.0  0.0   9048   664 pts/0    S+   15:35   0:00 grep --color=auto mgr
[root@srv-zmb03-05 (GN3) ceph]# 

It's however not logging anything nor responding. Restarting it takes a long time:

[root@srv-zmb03-05 (GN3) ceph]# time systemctl restart ceph-mgr.target

real    1m30.302s
user    0m0.001s
sys     0m0.006s
[root@srv-zmb03-05 (GN3) ceph]#

So the manager stops to function after they keys seem to expire.

But couldn't it be that the Manager hangs, it's "internal clock" somewhere get's stuck and the keys then expire because they don't rotate?

I have one manager running with "debug_mgr = 20" now and I'll leave it with that for some days.

#4 Updated by Wido den Hollander over 1 year ago

I just found another mgr in this cluster which is "dead", but is eating 300% CPU on 3 cores:

top - 15:51:19 up 14 days,  2:52,  1 user,  load average: 2.81, 2.52, 2.44
Tasks: 151 total,   2 running, 149 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  6.4 us,  6.2 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  4.9 us,  5.4 sy,  0.0 ni, 89.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  2.6 us,  2.1 sy,  0.0 ni, 95.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 58.8 us, 41.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 61.0 us, 39.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 10.9 us, 11.9 sy,  0.0 ni, 77.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32906716 total,   242976 free,  5663384 used, 27000356 buff/cache
KiB Swap:  1046524 total,  1046524 free,        0 used. 26284872 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                     
1441330 ceph      20   0 1645996 743944  14124 S 247.3  2.3   3588:40 ceph-mgr                                                                                                                                    
    863 ceph      20   0 7406156 6.126g 2.018g S   0.8 19.5   1825:13 ceph-mon 

The cluster in this case has 2200 OSDs, but I've also seen this happen on systems with 250 OSDs.

#5 Updated by Nathan Cutler over 1 year ago

  • Priority changed from Normal to High

Raising priority since this affects a stable release (Luminous)

#6 Updated by Wido den Hollander over 1 year ago

So I set the logging of the active manager to:

- debug_mgr: 20
- debug_auth: 5

After running for over 24 hours I noticed that the mgr was still running, but not being reported as active nor standby.

Through the admin socket the Manager doesn't respond anymore.

Looking through the logs I found this:

2018-03-27 18:16:41.433416 7f9b83bf7700  4 mgr.server handle_report from 0x55f16eb22000 osd,265
2018-03-27 18:16:41.433424 7f9b83bf7700 20 mgr.server handle_report updating existing DaemonState for osd,265
2018-03-27 18:16:41.433426 7f9b83bf7700 20 mgr update loading 0 new types, 0 old types, had 137 types, got 734 bytes of data
2018-03-27 18:16:41.433528 7f9b83bf7700  4 mgr.server handle_report from 0x55f16dd30800 osd,154
2018-03-27 18:16:41.433533 7f9b83bf7700 20 mgr.server handle_report updating existing DaemonState for osd,154
2018-03-27 18:16:41.433535 7f9b83bf7700 20 mgr update loading 0 new types, 0 old types, had 137 types, got 734 bytes of data
2018-03-27 18:16:41.435972 7f9b96411700 10 mgr tick tick
2018-03-27 18:16:41.435977 7f9b96411700  1 mgr send_beacon active
2018-03-27 18:16:41.436355 7f9b96411700 10 mgr send_beacon sending beacon as gid 248981849
2018-03-27 18:16:41.437240 7f9b96411700 10 mgr tick 
2018-03-27 18:16:41.437242 7f9b96411700 10 mgr update_delta_stats  v14827
2018-03-27 18:16:41.457254 7f9b83bf7700  4 mgr.server handle_report from 0x55f16d97d800 osd,647
2018-03-27 18:16:41.457266 7f9b83bf7700 20 mgr.server handle_report updating existing DaemonState for osd,647
2018-03-27 18:16:41.457279 7f9b83bf7700 20 mgr update loading 0 new types, 0 old types, had 137 types, got 734 bytes of data
2018-03-27 18:16:41.457512 7f9b823f4700 20 mgr[influx] Sending data to Influx host: dashboard.XXXXXX.nl
2018-03-27 18:16:41.486638 7f9b9c623700 10 mgr.server ms_verify_authorizer  session 0x55f16fbf9260 client.openstack has caps allow r 'allow r'
2018-03-27 18:16:41.490921 7f9b9b621700 10 mgr.server ms_verify_authorizer  session 0x55f170af5930 client.openstack has caps allow r 'allow r'
2018-03-27 18:16:41.545419 7f9b9b621700 10 mgr.server ms_verify_authorizer  session 0x55f170af4750 client.openstack has caps allow r 'allow r'
2018-03-27 18:16:41.550949 7f9b9be22700 10 mgr.server ms_verify_authorizer  session 0x55f170af6ff0 client.openstack has caps allow r 'allow r'

After this the logs keep just repeating this line and then there is silence for 2 hours:

2018-03-27 18:16:47.698073 7f9b9c623700 10 mgr.server ms_verify_authorizer  session 0x55f16c0a3110 client.openstack has caps allow r 'allow r'
..
..
2018-03-27 18:17:15.682785 7f9b9c623700 10 mgr.server ms_verify_authorizer  session 0x55f1647708a0 client.openstack has caps allow r 'allow r'
2018-03-27 18:17:15.708895 7f9b9be22700 10 mgr.server ms_verify_authorizer  session 0x55f17247f400 client.openstack has caps allow r 'allow r'
2018-03-27 18:17:15.725502 7f9b9be22700 10 mgr.server ms_verify_authorizer  session 0x55f17247dd40 client.openstack has caps allow r 'allow r'
2018-03-27 18:17:15.766729 7f9b9c623700 10 mgr.server ms_verify_authorizer  session 0x55f164771cf0 client.openstack has caps allow r 'allow r'

About 90 minutes later this is what I see:

2018-03-27 18:17:15.766729 7f9b9c623700 10 mgr.server ms_verify_authorizer  session 0x55f164771cf0 client.openstack has caps allow r 'allow r'
2018-03-27 20:03:58.146343 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 19:03:58.146335)
2018-03-27 20:04:08.146520 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 19:04:08.146514)
2018-03-27 20:04:18.146667 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 19:04:18.146661)

This goes on for a lot of time and then total silence until the logrotate comes along. No new lines in the logfile which follows:

2018-03-27 21:51:58.244816 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 20:51:58.244810)
2018-03-27 21:52:08.244962 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 20:52:08.244957)
2018-03-28 03:49:01.198054 7f9b95c10700 -1 received  signal: Hangup from  PID: 3449235 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0

Any other loglevels we might want to enable here? Or anything else I can test?

#7 Updated by wei jin over 1 year ago

I ran into the same issue from osd, which leads to slow request. (Luminous 12.2.4)
The primary osd's ops were blocked due to waiting for subop reply, and we found following logs from subop osd:

2018-04-04 12:16:31.196846 7f0db7c5e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-04-04 11:16:31.196844)
2018-04-04 12:16:41.196980 7f0db7c5e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-04-04 11:16:41.196976)
......

and then

2018-04-04 12:37:35.114034 7f0ddec8a700  0 auth: could not find secret_id=374
2018-04-04 12:37:35.114038 7f0ddec8a700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=374
2018-04-04 12:37:35.114040 7f0ddec8a700  0 -- 10.20.63.207:6820/364615 >> 10.20.64.90:6840/399128406 conn(0x5566e865b000 :6820 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_c
onnect_msg: got bad authorizer
......

#8 Updated by wei jin over 1 year ago

And I also found one cpu core, which was allocated to this osd by cgroup, had 100% us time.
And the osd is still alive, seems ops related to it are blocked forever.

#9 Updated by John Spray over 1 year ago

Wido: so I guess the monclient messages are more of a symptom of something else getting badly stuck -- I suspect that if you can find a system in this state and attach gdb+"thread apply all bt" it might give us the culprit.

#10 Updated by Wido den Hollander over 1 year ago

I've attached to a hanging mgr process (eating 40% CPU) and the output of GDB is attached.

Things to know about this system:

- Ubuntu 16.04
- Python 2.7.12
- Ceph 12.2.4
- Running with IPv6

I've also seen this happen on a daily basis on a CentOS 7 system running 12.2.4 and also using IPv6 for its communication.

Don't know if this helps, but this is what I could find so far.

#11 Updated by John Spray over 1 year ago

  • Related to Feature #23574: Add a HeartbeatMap to ceph-mgr (die on deadlocks) added

#12 Updated by John Spray over 1 year ago

Thanks -- looks like we have a deadlock bug in ceph-mgr itself.

This thread has taken the (shared) Objecter lock and is waiting on the (exclusive) ClusterState lock:

Thread 32 (Thread 0x7fe51ec96700 (LWP 1614293)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fe53fe7edbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x561e37cef698) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000561e2c7a9659 in Mutex::Lock (this=this@entry=0x561e37cef670, no_lockdep=no_lockdep@entry=false) at /build/ceph-12.2.4/src/common/Mutex.cc:107
#3  0x0000561e2c65311a in Mutex::Locker::Locker (m=..., this=<synthetic pointer>) at /build/ceph-12.2.4/src/common/Mutex.h:115
#4  ClusterState::with_pgmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&)>::<lambda(const PGMap&)> > (cb=<optimized out>, this=0x561e37cef498)
    at /build/ceph-12.2.4/src/mgr/ClusterState.h:104
#5  ActivePyModules::<lambda(const OSDMap&)>::operator() (osd_map=..., __closure=<optimized out>) at /build/ceph-12.2.4/src/mgr/ActivePyModules.cc:290
#6  Objecter::with_osdmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&)> > (cb=<optimized out>, this=<optimized out>) at /build/ceph-12.2.4/src/osdc/Objecter.h:2045
#7  ClusterState::with_osdmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&)> > (this=<optimized out>) at /build/ceph-12.2.4/src/mgr/ClusterState.h:129
#8  ActivePyModules::get_python (this=0x561e37e93600, what="df") at /build/ceph-12.2.4/src/mgr/ActivePyModules.cc:291
#9  0x0000561e2c673988 in ceph_state_get (self=0x7fe5275df550, args=<optimized out>) at /build/ceph-12.2.4/src/mgr/BaseMgrModule.cc:325

This thread has taken the (exclusive) ClusterState lock and is waiting on the (shared) Objecter lock:

Thread 17 (Thread 0x7fe535776700 (LWP 1601474)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000561e2c639aa6 in boost::condition_variable::wait (m=..., this=0x7ffddbe293d8) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/condition_variable.hpp:81
#2  boost::shared_mutex::lock_shared (this=0x7ffddbe293a8) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/shared_mutex.hpp:190
#3  boost::shared_lock<boost::shared_mutex>::lock (this=this@entry=0x7fe535774300) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/lock_types.hpp:645
#4  0x0000561e2c633c01 in boost::shared_lock<boost::shared_mutex>::shared_lock (m_=..., this=0x7fe535774300) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/lock_types.hpp:520
#5  Objecter::with_osdmap<DaemonServer::send_report()::<lambda(const PGMap&)>::<lambda(const OSDMap&)> > (cb=<optimized out>, this=0x7ffddbe292c8) at /build/ceph-12.2.4/src/osdc/Objecter.h:2044
#6  ClusterState::with_osdmap<DaemonServer::send_report()::<lambda(const PGMap&)>::<lambda(const OSDMap&)> > (this=<optimized out>) at /build/ceph-12.2.4/src/mgr/ClusterState.h:129
#7  DaemonServer::<lambda(const PGMap&)>::operator() (pg_map=..., __closure=<optimized out>) at /build/ceph-12.2.4/src/mgr/DaemonServer.cc:1467
#8  ClusterState::with_pgmap<DaemonServer::send_report()::<lambda(const PGMap&)> > (cb=<optimized out>, this=0x561e37cef498) at /build/ceph-12.2.4/src/mgr/ClusterState.h:105
#9  DaemonServer::send_report (this=this@entry=0x561e37cf0228) at /build/ceph-12.2.4/src/mgr/DaemonServer.cc:1468

...and this thread is trying to (exclusive) acquire Objecter::lock:

Thread 11 (Thread 0x7fe538983700 (LWP 1601468)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000561e2c6c9565 in boost::condition_variable::wait (m=..., this=0x7ffddbe29430) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/condition_variable.hpp:81
#2  boost::shared_mutex::lock (this=0x7ffddbe293a8) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/shared_mutex.hpp:293
#3  0x0000561e2c6bf864 in ceph::shunique_lock<boost::shared_mutex>::lock (this=0x7fe538980f70) at /build/ceph-12.2.4/src/common/shunique_lock.h:157
#4  ceph::shunique_lock<boost::shared_mutex>::shunique_lock (m=..., this=0x7fe538980f70) at /build/ceph-12.2.4/src/common/shunique_lock.h:65
#5  Objecter::handle_osd_map (this=this@entry=0x7ffddbe292c8, m=m@entry=0x561e3b33c500) at /build/ceph-12.2.4/src/osdc/Objecter.cc:1136

I don't know for sure that the shared lock implementation here is prioritising writers over reads, but if it is then thread 11 would be preventing thread 17 acquiring the Objecter lock, thus leading to a deadlock even though threads 32 and 11 are both only trying to get shared access to the Objecter lock.

This was not caught by our lockdep testing because Objecter.h was at some point switched to using vanilla C++11 locking classes.

#13 Updated by John Spray over 1 year ago

  • Backport set to luminous

#14 Updated by John Spray about 1 year ago

  • Subject changed from mgr: _check_auth_rotating possible clock skew, rotating keys expired way too early to mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early

#15 Updated by Dan van der Ster about 1 year ago

Seems we're hitting this here too... standby mgr's are deadlocking. We only noticed by chance -- ceph doesn't HEALTH_WARN if there are no standby mgr's?

Here is a deadlock during startup when getting config for the dashboard:

2018-07-17 10:07:56.766351 7f3fb97a1680  0 set uid:gid to 167:167 (ceph:ceph)
2018-07-17 10:07:56.766376 7f3fb97a1680  0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (
stable), process (unknown), pid 3437919
2018-07-17 10:07:56.767621 7f3fb97a1680  0 pidfile_write: ignore empty --pid-file
2018-07-17 10:07:56.794538 7f3fb97a1680  1 mgr send_beacon standby
2018-07-17 10:07:56.795409 7f3fb97a1680  4 mgr init Complete.
2018-07-17 10:07:56.795419 7f3fb08f7700  4 mgr ms_dispatch standby mgrmap(e 259) v1
2018-07-17 10:07:56.795431 7f3fb08f7700  4 mgr handle_mgr_map received map epoch 259
2018-07-17 10:07:56.795435 7f3fb08f7700  4 mgr handle_mgr_map active in map: 0 active is 166088634
2018-07-17 10:07:56.805970 7f3fb08f7700  1 mgr init Loading python module 'dashboard'
2018-07-17 10:07:56.915631 7f3fb08f7700  4 mgr load Standby mode available in module 'dashboard'
2018-07-17 10:07:56.915646 7f3fb08f7700  1 mgr init Loading python module 'prometheus'
2018-07-17 10:07:56.997055 7f3fb08f7700  4 mgr load Standby mode available in module 'prometheus'
2018-07-17 10:07:56.997066 7f3fb08f7700  1 mgr init Loading python module 'restful'
2018-07-17 10:07:57.156095 7f3fb08f7700  4 mgr load Standby mode not provided by module 'restful'
2018-07-17 10:07:57.156119 7f3fb08f7700  1 mgr init Loading python module 'status'
2018-07-17 10:07:57.191849 7f3fb08f7700  4 mgr load Standby mode not provided by module 'status'
2018-07-17 10:07:57.191866 7f3fb08f7700  4 mgr standby_start Starting modules in standby mode
2018-07-17 10:07:57.191888 7f3fb08f7700  4 mgr standby_start starting module dashboard
2018-07-17 10:07:57.192025 7f3fb08f7700  1 mgr load Constructed class from module: dashboard
2018-07-17 10:07:57.192034 7f3fb08f7700  4 mgr start_one Starting thread for dashboard
2018-07-17 10:07:57.192063 7f3fb08f7700  4 mgr standby_start starting module prometheus
2018-07-17 10:07:57.192088 7f3f936d6700  4 mgr entry Entering thread for dashboard
2018-07-17 10:07:57.192178 7f3f936d6700  4 mgr get_config get_configkey: mgr/dashboard/p05517715y01595/server_addr

... stuck there forever. Eventually they start logging _check_auth_rotating possible clock skew.

We disabled the dashboard module and they start now.

#16 Updated by Jared Baker 11 months ago

Running Luminous 12.2.7-1xenial and am encountering this issue with ceph-mgr as well. I have 3 controllers running ceph-mgr and the standbys will sometimes disappear.

Sep 4 13:33:54 localhost ceph-mgr1196670: 2018-09-04 13:33:54.178490 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:33:54.178485)
Sep 4 13:34:04 localhost ceph-mgr1196670: 2018-09-04 13:34:04.178751 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:34:04.178744)
Sep 4 13:34:14 localhost ceph-mgr1196670: 2018-09-04 13:34:14.179076 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:34:14.179061)
Sep 4 13:34:24 localhost ceph-mgr1196670: 2018-09-04 13:34:24.179409 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:34:24.179387)

Some amount of restarting the ceph-mgr services across the controllers will allow me to return to a normal state of 1 active, 2 standby daemons.

Thank you

#17 Updated by John Spray 10 months ago

  • Duplicated by Bug #36266: mgr: deadlock in ClusterState added

#18 Updated by Bryan Stillwell 9 months ago

We've also run into this on 12.2.8 using 3 mgrs:

2018-10-24 13:39:09.487997 7f46f9395700  4 mgr standby_start starting module dashboard
2018-10-24 13:39:09.488087 7f46f9395700 20 mgr Gil Switched to new thread state 0x562bd7c9f0c0
2018-10-24 13:39:09.488091 7f46dcaac700 10 mgr entry listing keys
2018-10-24 13:39:09.488279 7f46f9395700  1 mgr load Constructed class from module: dashboard
2018-10-24 13:39:09.488287 7f46f9395700 20 mgr ~Gil Destroying new thread state 0x562bd7c9f0c0
2018-10-24 13:39:09.488294 7f46f9395700  4 mgr start_one Starting thread for dashboard
2018-10-24 13:39:09.488341 7f46f9395700  4 mgr standby_start starting module prometheus
2018-10-24 13:39:09.488350 7f46f9395700 20 mgr Gil Switched to new thread state 0x562bd7c9f0c0
2018-10-24 13:39:09.488378 7f46dc2ab700  4 mgr entry Entering thread for dashboard
2018-10-24 13:39:09.488497 7f46dc2ab700 20 mgr Gil Switched to new thread state 0x562bd7c9f220
2018-10-24 13:39:09.488565 7f46dc2ab700  4 mgr get_config get_configkey: mgr/dashboard/a2plcephmon002/server_addr

#19 Updated by Maxime Guyot 8 months ago

I ran into this specific issue as well. Is there a way to detect the deadlocked MGR with the Prometheus module for instance?

#20 Updated by Roberto Valverde 8 months ago

Maxime Guyot wrote:

I ran into this specific issue as well. Is there a way to detect the deadlocked MGR with the Prometheus module for instance?

Hi Maxime,

Yes, with Prometheus the mgr target will appear as down while in this state.

Cheers,
Roberto

#21 Updated by Hector Martin 7 months ago

I think this is fixed by this commit:

https://github.com/ceph/ceph/commit/70e56269b1a459d6b0e7b23bc3037602963ea5c2

And subsequent commits like this one might fix similar issues:

https://github.com/ceph/ceph/commit/61aa7e2e0230bff49e757c32932d1db4bcad5f67

Any chance to get at least the first one backported for v12.2.11?

#22 Updated by Hector Martin 5 months ago

Pinging this again; other people are hitting this issue and #36266 in Luminous and probably Mimic (recent conversation in #ceph). Any chance we can get those commits backported to those branches?

#23 Updated by Lenz Grimmer 5 months ago

  • Status changed from New to Pending Backport
  • Backport changed from luminous to luminous, mimic

Hector Martin wrote:

Pinging this again; other people are hitting this issue and #36266 in Luminous and probably Mimic (recent conversation in #ceph). Any chance we can get those commits backported to those branches?

Thanks for the hint. The related PR is https://github.com/ceph/ceph/pull/23482 - setting the "Pending Backport" flag.

#24 Updated by Nathan Cutler 5 months ago

  • Pull request ID set to 23482

#25 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #38318: luminous: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added

#26 Updated by Nathan Cutler 5 months ago

  • Copied to Backport #38319: mimic: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added

#27 Updated by Nathan Cutler 4 months ago

  • Status changed from Pending Backport to Resolved

#28 Updated by Stephen Bird 3 months ago

I'm seeing these deadlocks in Nautilus. Did these commits make it into 14.2?

#29 Updated by Nathan Cutler 3 months ago

Stephen Bird wrote:

I'm seeing these deadlocks in Nautilus. Did these commits make it into 14.2?

Yes, but please double-check for yourself.

For any commit, the github interface will tell you which versions it is in. For example: https://github.com/ceph/ceph/commit/70e56269b1a459d6b0e7b23bc3037602963ea5c2 shows this commit first entered the codestream in the v14.0.1 release.

#30 Updated by Andreas de Pretis 2 months ago

We are experiencing this with 13.2.5 on Ubuntu 16.04.

Also available in: Atom PDF