Bug #23460
mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early
Status: Closed
Description
On Luminous v12.2.2 and v12.2.4 clusters running either CentOS or Ubuntu I've seen many Managers go offline with these messages in their logs:
2018-03-26 05:32:50.976826 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:32:50.976818)
2018-03-26 05:33:00.977091 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:00.977083)
2018-03-26 05:33:10.977244 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:10.977238)
2018-03-26 05:33:20.977398 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:20.977392)
2018-03-26 05:33:30.977584 7f596859d700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-26 04:33:30.977577)
Mar 23 09:18:22 mon01 ceph-mgr[2324150]: 2018-03-23 09:18:22.451311 7fb9e8ac7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-23 8:18:22.451287)
The first thing to check is whether the time is correct on these systems, but it is: they are all in sync thanks to NTP.
There are no clock drift warnings or anything similar; the Mgr just goes down with these messages.
Updated by John Spray about 6 years ago
This one's still a mystery to me :-/
Only initial thought is that we're in daylight-savings territory right now, could be a differing TZ between mon and mgr nodes?
Updated by Wido den Hollander about 6 years ago
I don't think the TZ is related. In all the cases I've seen the Mon and Mgr were running on the same node.
I saw it happen again on this system, but it also happened before the daylight saving change. I've seen many cases in the last few months where this occurred.
Updated by Wido den Hollander about 6 years ago
I found another mgr in a cluster which has been down now for 4 days.
[root@srv-zmb03-05 (GN3) ceph]# systemctl status ceph-mgr@srv-zmb03-05
* ceph-mgr@srv-zmb03-05.service - Ceph cluster manager daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2018-03-19 07:21:03 CET; 1 weeks 0 days ago
 Main PID: 1468563 (ceph-mgr)
   CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@srv-zmb03-05.service
           `-1468563 /usr/bin/ceph-mgr -f --cluster ceph --id srv-zmb03-05 --setuser ceph --setgroup ceph

Mar 22 00:26:07 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:07.421833 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:07.421826)
Mar 22 00:26:17 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:17.421996 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:17.421990)
Mar 22 00:26:27 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:27.422161 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:27.422154)
Mar 22 00:26:37 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:37.422325 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:37.422314)
Mar 22 00:26:47 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 00:26:47.422530 7f6c0745b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early ...6:47.422521)
Mar 22 03:19:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-22 03:19:01.978135 7f6c04c56700 -1 received signal: Hangup from PID: 2219601 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw UID: 0
Mar 23 03:47:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-23 03:47:01.450114 7f6c04c56700 -1 received signal: Hangup from PID: 2412024 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw UID: 0
Mar 24 03:44:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-24 03:44:01.851657 7f6c04c56700 -1 received signal: Hangup from PID: 2600400 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw UID: 0
Mar 25 03:09:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-25 03:09:01.902803 7f6c04c56700 -1 received signal: Hangup from PID: 2776777 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw UID: 0
Mar 26 03:11:01 srv-zmb03-05.zm1.XXX.nl ceph-mgr[1468563]: 2018-03-26 03:11:01.589827 7f6c04c56700 -1 received signal: Hangup from PID: 2965956 task name: killall -q -1 ceph-mon ceph-mgr ce...osgw UID: 0
Hint: Some lines were ellipsized, use -l to show in full.
[root@srv-zmb03-05 (GN3) ceph]#
When trying to talk to the mgr over the admin socket it fails:
[root@srv-zmb03-05 (GN3) ceph]# timeout 10 ceph daemon mgr.srv-zmb03-05 status
[root@srv-zmb03-05 (GN3) ceph]#
After 10 seconds I got no response from the mgr daemon.
This is a 12.2.2 manager daemon, but I saw the same on a 12.2.4 manager.
The daemon is still running:
[root@srv-zmb03-05 (GN3) ceph]# ps aux|grep mgr
ceph     1468563 38.2  2.5 1834408 839804 ?   Ssl  Mar19 4018:20 /usr/bin/ceph-mgr -f --cluster ceph --id srv-zmb03-05 --setuser ceph --setgroup ceph
root     3063753  0.0  0.0    9048    664 pts/0  S+  15:35   0:00 grep --color=auto mgr
[root@srv-zmb03-05 (GN3) ceph]#
However, it's not logging anything or responding. Restarting it takes a long time:
[root@srv-zmb03-05 (GN3) ceph]# time systemctl restart ceph-mgr.target

real    1m30.302s
user    0m0.001s
sys     0m0.006s
[root@srv-zmb03-05 (GN3) ceph]#
So the manager stops functioning after the keys seem to expire.
But couldn't it be that the Manager hangs, its internal clock gets stuck somewhere, and the keys then expire because they aren't rotated?
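For illustration, that hypothesis can be sketched in a few lines of Python. This is a simplified stand-in, not Ceph's actual monclient C++ code, and the tolerance value is made up; the point is that the "clock skew" message fires whenever the newest rotating key expired well before "now", which is equally consistent with a daemon whose key-refresh thread is stuck.

```python
import time

# TOLERANCE is an illustrative placeholder for the grace period the
# real _check_auth_rotating check allows, not the actual Ceph value.
TOLERANCE = 30.0  # seconds

def keys_expired_way_too_early(newest_key_expiry: float, now: float) -> bool:
    """True when the newest rotating key expired before now - TOLERANCE.

    This condition fires both for genuine clock skew AND for a daemon
    that simply stopped refreshing its rotating keys because a thread
    is deadlocked: either way the key is long expired."""
    return newest_key_expiry < now - TOLERANCE

now = time.time()
print(keys_expired_way_too_early(now - 3600, now))  # True: key expired an hour ago
print(keys_expired_way_too_early(now + 600, now))   # False: key still valid
```

So the log message alone cannot distinguish a bad clock from a hung daemon.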
I have one manager running with "debug_mgr = 20" now and I'll leave it with that for some days.
Updated by Wido den Hollander about 6 years ago
I just found another mgr in this cluster which is "dead", but is eating 300% CPU on 3 cores:
top - 15:51:19 up 14 days, 2:52, 1 user, load average: 2.81, 2.52, 2.44
Tasks: 151 total, 2 running, 149 sleeping, 0 stopped, 0 zombie
%Cpu0 :  0.0 us,  0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 :  6.4 us,  6.2 sy, 0.0 ni, 87.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 :  4.9 us,  5.4 sy, 0.0 ni, 89.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 :  2.6 us,  2.1 sy, 0.0 ni, 95.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 58.8 us, 41.2 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 61.0 us, 39.0 sy, 0.0 ni,  0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 10.9 us, 11.9 sy, 0.0 ni, 77.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 :  0.0 us,  0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32906716 total, 242976 free, 5663384 used, 27000356 buff/cache
KiB Swap: 1046524 total, 1046524 free, 0 used. 26284872 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1441330 ceph      20   0 1645996 743944  14124 S 247.3  2.3   3588:40 ceph-mgr
    863 ceph      20   0 7406156 6.126g 2.018g S   0.8 19.5   1825:13 ceph-mon
The cluster in this case has 2200 OSDs, but I've also seen this happen on systems with 250 OSDs.
Updated by Nathan Cutler about 6 years ago
- Priority changed from Normal to High
Raising priority since this affects a stable release (Luminous)
Updated by Wido den Hollander about 6 years ago
So I set the logging of the active manager to:
- debug_mgr: 20
- debug_auth: 5
After running for over 24 hours I noticed that the mgr was still running, but it was not being reported as active nor standby.
The Manager no longer responds on the admin socket.
Looking through the logs I found this:
2018-03-27 18:16:41.433416 7f9b83bf7700 4 mgr.server handle_report from 0x55f16eb22000 osd,265
2018-03-27 18:16:41.433424 7f9b83bf7700 20 mgr.server handle_report updating existing DaemonState for osd,265
2018-03-27 18:16:41.433426 7f9b83bf7700 20 mgr update loading 0 new types, 0 old types, had 137 types, got 734 bytes of data
2018-03-27 18:16:41.433528 7f9b83bf7700 4 mgr.server handle_report from 0x55f16dd30800 osd,154
2018-03-27 18:16:41.433533 7f9b83bf7700 20 mgr.server handle_report updating existing DaemonState for osd,154
2018-03-27 18:16:41.433535 7f9b83bf7700 20 mgr update loading 0 new types, 0 old types, had 137 types, got 734 bytes of data
2018-03-27 18:16:41.435972 7f9b96411700 10 mgr tick tick
2018-03-27 18:16:41.435977 7f9b96411700 1 mgr send_beacon active
2018-03-27 18:16:41.436355 7f9b96411700 10 mgr send_beacon sending beacon as gid 248981849
2018-03-27 18:16:41.437240 7f9b96411700 10 mgr tick
2018-03-27 18:16:41.437242 7f9b96411700 10 mgr update_delta_stats v14827
2018-03-27 18:16:41.457254 7f9b83bf7700 4 mgr.server handle_report from 0x55f16d97d800 osd,647
2018-03-27 18:16:41.457266 7f9b83bf7700 20 mgr.server handle_report updating existing DaemonState for osd,647
2018-03-27 18:16:41.457279 7f9b83bf7700 20 mgr update loading 0 new types, 0 old types, had 137 types, got 734 bytes of data
2018-03-27 18:16:41.457512 7f9b823f4700 20 mgr[influx] Sending data to Influx host: dashboard.XXXXXX.nl
2018-03-27 18:16:41.486638 7f9b9c623700 10 mgr.server ms_verify_authorizer session 0x55f16fbf9260 client.openstack has caps allow r 'allow r'
2018-03-27 18:16:41.490921 7f9b9b621700 10 mgr.server ms_verify_authorizer session 0x55f170af5930 client.openstack has caps allow r 'allow r'
2018-03-27 18:16:41.545419 7f9b9b621700 10 mgr.server ms_verify_authorizer session 0x55f170af4750 client.openstack has caps allow r 'allow r'
2018-03-27 18:16:41.550949 7f9b9be22700 10 mgr.server ms_verify_authorizer session 0x55f170af6ff0 client.openstack has caps allow r 'allow r'
After this the logs just keep repeating this line, and then there is silence for two hours:
2018-03-27 18:16:47.698073 7f9b9c623700 10 mgr.server ms_verify_authorizer session 0x55f16c0a3110 client.openstack has caps allow r 'allow r'
..
..
2018-03-27 18:17:15.682785 7f9b9c623700 10 mgr.server ms_verify_authorizer session 0x55f1647708a0 client.openstack has caps allow r 'allow r'
2018-03-27 18:17:15.708895 7f9b9be22700 10 mgr.server ms_verify_authorizer session 0x55f17247f400 client.openstack has caps allow r 'allow r'
2018-03-27 18:17:15.725502 7f9b9be22700 10 mgr.server ms_verify_authorizer session 0x55f17247dd40 client.openstack has caps allow r 'allow r'
2018-03-27 18:17:15.766729 7f9b9c623700 10 mgr.server ms_verify_authorizer session 0x55f164771cf0 client.openstack has caps allow r 'allow r'
About 90 minutes later this is what I see:
2018-03-27 18:17:15.766729 7f9b9c623700 10 mgr.server ms_verify_authorizer session 0x55f164771cf0 client.openstack has caps allow r 'allow r'
2018-03-27 20:03:58.146343 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 19:03:58.146335)
2018-03-27 20:04:08.146520 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 19:04:08.146514)
2018-03-27 20:04:18.146667 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 19:04:18.146661)
This goes on for a long time, then total silence until logrotate comes along; no new lines appear in the logfile after the following:
2018-03-27 21:51:58.244816 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 20:51:58.244810)
2018-03-27 21:52:08.244962 7f9b98415700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-03-27 20:52:08.244957)
2018-03-28 03:49:01.198054 7f9b95c10700 -1 received signal: Hangup from PID: 3449235 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw UID: 0
Any other loglevels we might want to enable here? Or anything else I can test?
Updated by wei jin about 6 years ago
I ran into the same issue on an OSD, where it leads to slow requests (Luminous 12.2.4).
The primary OSD's ops were blocked waiting for a subop reply, and we found the following logs on the subop OSD:
2018-04-04 12:16:31.196846 7f0db7c5e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-04-04 11:16:31.196844)
2018-04-04 12:16:41.196980 7f0db7c5e700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-04-04 11:16:41.196976)
......
and then
2018-04-04 12:37:35.114034 7f0ddec8a700 0 auth: could not find secret_id=374
2018-04-04 12:37:35.114038 7f0ddec8a700 0 cephx: verify_authorizer could not get service secret for service osd secret_id=374
2018-04-04 12:37:35.114040 7f0ddec8a700 0 -- 10.20.63.207:6820/364615 >> 10.20.64.90:6840/399128406 conn(0x5566e865b000 :6820 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: got bad authorizer
......
Updated by wei jin about 6 years ago
I also found that one CPU core, which was allocated to this OSD via cgroups, had 100% user time.
The OSD is still alive, but ops directed at it seem to be blocked forever.
Updated by John Spray about 6 years ago
Wido: so I guess the monclient messages are more of a symptom of something else getting badly stuck -- I suspect that if you can find a system in this state and attach gdb+"thread apply all bt" it might give us the culprit.
Updated by Wido den Hollander about 6 years ago
I've attached to a hanging mgr process (eating 40% CPU) and the output of GDB is attached.
Things to know about this system:
- Ubuntu 16.04
- Python 2.7.12
- Ceph 12.2.4
- Running with IPv6
I've also seen this happen on a daily basis on a CentOS 7 system running 12.2.4 and also using IPv6 for its communication.
Don't know if this helps, but this is what I could find so far.
Updated by John Spray about 6 years ago
- Related to Feature #23574: Add a HeartbeatMap to ceph-mgr (die on deadlocks) added
Updated by John Spray about 6 years ago
Thanks -- looks like we have a deadlock bug in ceph-mgr itself.
This thread has taken the (shared) Objecter lock and is waiting on the (exclusive) ClusterState lock:
Thread 32 (Thread 0x7fe51ec96700 (LWP 1614293)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fe53fe7edbd in __GI___pthread_mutex_lock (mutex=mutex@entry=0x561e37cef698) at ../nptl/pthread_mutex_lock.c:80
#2  0x0000561e2c7a9659 in Mutex::Lock (this=this@entry=0x561e37cef670, no_lockdep=no_lockdep@entry=false) at /build/ceph-12.2.4/src/common/Mutex.cc:107
#3  0x0000561e2c65311a in Mutex::Locker::Locker (m=..., this=<synthetic pointer>) at /build/ceph-12.2.4/src/common/Mutex.h:115
#4  ClusterState::with_pgmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&)>::<lambda(const PGMap&)> > (cb=<optimized out>, this=0x561e37cef498) at /build/ceph-12.2.4/src/mgr/ClusterState.h:104
#5  ActivePyModules::<lambda(const OSDMap&)>::operator() (osd_map=..., __closure=<optimized out>) at /build/ceph-12.2.4/src/mgr/ActivePyModules.cc:290
#6  Objecter::with_osdmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&)> > (cb=<optimized out>, this=<optimized out>) at /build/ceph-12.2.4/src/osdc/Objecter.h:2045
#7  ClusterState::with_osdmap<ActivePyModules::get_python(const string&)::<lambda(const OSDMap&)> > (this=<optimized out>) at /build/ceph-12.2.4/src/mgr/ClusterState.h:129
#8  ActivePyModules::get_python (this=0x561e37e93600, what="df") at /build/ceph-12.2.4/src/mgr/ActivePyModules.cc:291
#9  0x0000561e2c673988 in ceph_state_get (self=0x7fe5275df550, args=<optimized out>) at /build/ceph-12.2.4/src/mgr/BaseMgrModule.cc:325
This thread has taken the (exclusive) ClusterState lock and is waiting on the (shared) Objecter lock:
Thread 17 (Thread 0x7fe535776700 (LWP 1601474)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000561e2c639aa6 in boost::condition_variable::wait (m=..., this=0x7ffddbe293d8) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/condition_variable.hpp:81
#2  boost::shared_mutex::lock_shared (this=0x7ffddbe293a8) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/shared_mutex.hpp:190
#3  boost::shared_lock<boost::shared_mutex>::lock (this=this@entry=0x7fe535774300) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/lock_types.hpp:645
#4  0x0000561e2c633c01 in boost::shared_lock<boost::shared_mutex>::shared_lock (m_=..., this=0x7fe535774300) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/lock_types.hpp:520
#5  Objecter::with_osdmap<DaemonServer::send_report()::<lambda(const PGMap&)>::<lambda(const OSDMap&)> > (cb=<optimized out>, this=0x7ffddbe292c8) at /build/ceph-12.2.4/src/osdc/Objecter.h:2044
#6  ClusterState::with_osdmap<DaemonServer::send_report()::<lambda(const PGMap&)>::<lambda(const OSDMap&)> > (this=<optimized out>) at /build/ceph-12.2.4/src/mgr/ClusterState.h:129
#7  DaemonServer::<lambda(const PGMap&)>::operator() (pg_map=..., __closure=<optimized out>) at /build/ceph-12.2.4/src/mgr/DaemonServer.cc:1467
#8  ClusterState::with_pgmap<DaemonServer::send_report()::<lambda(const PGMap&)> > (cb=<optimized out>, this=0x561e37cef498) at /build/ceph-12.2.4/src/mgr/ClusterState.h:105
#9  DaemonServer::send_report (this=this@entry=0x561e37cf0228) at /build/ceph-12.2.4/src/mgr/DaemonServer.cc:1468
...and this thread is trying to (exclusive) acquire Objecter::lock:
Thread 11 (Thread 0x7fe538983700 (LWP 1601468)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000561e2c6c9565 in boost::condition_variable::wait (m=..., this=0x7ffddbe29430) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/condition_variable.hpp:81
#2  boost::shared_mutex::lock (this=0x7ffddbe293a8) at /build/ceph-12.2.4/obj-x86_64-linux-gnu/boost/include/boost/thread/pthread/shared_mutex.hpp:293
#3  0x0000561e2c6bf864 in ceph::shunique_lock<boost::shared_mutex>::lock (this=0x7fe538980f70) at /build/ceph-12.2.4/src/common/shunique_lock.h:157
#4  ceph::shunique_lock<boost::shared_mutex>::shunique_lock (m=..., this=0x7fe538980f70) at /build/ceph-12.2.4/src/common/shunique_lock.h:65
#5  Objecter::handle_osd_map (this=this@entry=0x7ffddbe292c8, m=m@entry=0x561e3b33c500) at /build/ceph-12.2.4/src/osdc/Objecter.cc:1136
I don't know for sure that the shared lock implementation here prioritises writers over readers, but if it does, then thread 11 would be preventing thread 17 from acquiring the Objecter lock, leading to a deadlock even though threads 32 and 17 are both only trying to get shared access to the Objecter lock.
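The lock-ordering cycle described above can be reproduced with a toy writer-priority read/write lock. The sketch below is in Python and is an illustrative stand-in, not boost::shared_mutex or Ceph code; it only shows the policy John describes: once a writer is queued, new shared acquirers block, so two threads that each only need shared access can still deadlock through a waiting writer.

```python
import threading
import time

class WriterPriorityRWLock:
    """Toy writer-priority read/write lock (illustrative only):
    new readers wait whenever a writer holds the lock OR is queued."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer_active = False
        self._writers_waiting = 0

    def acquire_read(self, timeout=None):
        with self._cond:
            got = self._cond.wait_for(
                lambda: not self._writer_active and self._writers_waiting == 0,
                timeout)
            if got:
                self._readers += 1
            return got

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._writers_waiting += 1
            self._cond.wait_for(
                lambda: not self._writer_active and self._readers == 0)
            self._writers_waiting -= 1
            self._writer_active = True

    def release_write(self):
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()

lock = WriterPriorityRWLock()

# "Thread 32": holds the lock shared and won't release it until it
# also gets another lock (elided here).
assert lock.acquire_read()

# "Thread 11": queues for the lock exclusively.
writer = threading.Thread(
    target=lambda: (lock.acquire_write(), lock.release_write()))
writer.start()
time.sleep(0.2)  # let the writer start waiting

# "Thread 17": a second *shared* acquire now blocks behind the queued
# writer, even though no writer actually holds the lock.
blocked = not lock.acquire_read(timeout=0.5)
print("second shared acquire blocked:", blocked)  # True

lock.release_read()  # break the cycle so the demo terminates
writer.join()
```

In the real mgr, "thread 32" never releases its shared hold because it is itself waiting on the ClusterState lock held by "thread 17", so the cycle never breaks.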
This was not caught by our lockdep testing because Objecter.h was at some point switched to using vanilla C++11 locking classes.
Updated by John Spray almost 6 years ago
- Subject changed from mgr: _check_auth_rotating possible clock skew, rotating keys expired way too early to mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early
Updated by Dan van der Ster almost 6 years ago
Seems we're hitting this here too... standby mgrs are deadlocking. We only noticed by chance -- shouldn't Ceph raise a HEALTH_WARN when there are no standby mgrs?
Here is a deadlock during startup when getting config for the dashboard:
2018-07-17 10:07:56.766351 7f3fb97a1680 0 set uid:gid to 167:167 (ceph:ceph)
2018-07-17 10:07:56.766376 7f3fb97a1680 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 3437919
2018-07-17 10:07:56.767621 7f3fb97a1680 0 pidfile_write: ignore empty --pid-file
2018-07-17 10:07:56.794538 7f3fb97a1680 1 mgr send_beacon standby
2018-07-17 10:07:56.795409 7f3fb97a1680 4 mgr init Complete.
2018-07-17 10:07:56.795419 7f3fb08f7700 4 mgr ms_dispatch standby mgrmap(e 259) v1
2018-07-17 10:07:56.795431 7f3fb08f7700 4 mgr handle_mgr_map received map epoch 259
2018-07-17 10:07:56.795435 7f3fb08f7700 4 mgr handle_mgr_map active in map: 0 active is 166088634
2018-07-17 10:07:56.805970 7f3fb08f7700 1 mgr init Loading python module 'dashboard'
2018-07-17 10:07:56.915631 7f3fb08f7700 4 mgr load Standby mode available in module 'dashboard'
2018-07-17 10:07:56.915646 7f3fb08f7700 1 mgr init Loading python module 'prometheus'
2018-07-17 10:07:56.997055 7f3fb08f7700 4 mgr load Standby mode available in module 'prometheus'
2018-07-17 10:07:56.997066 7f3fb08f7700 1 mgr init Loading python module 'restful'
2018-07-17 10:07:57.156095 7f3fb08f7700 4 mgr load Standby mode not provided by module 'restful'
2018-07-17 10:07:57.156119 7f3fb08f7700 1 mgr init Loading python module 'status'
2018-07-17 10:07:57.191849 7f3fb08f7700 4 mgr load Standby mode not provided by module 'status'
2018-07-17 10:07:57.191866 7f3fb08f7700 4 mgr standby_start Starting modules in standby mode
2018-07-17 10:07:57.191888 7f3fb08f7700 4 mgr standby_start starting module dashboard
2018-07-17 10:07:57.192025 7f3fb08f7700 1 mgr load Constructed class from module: dashboard
2018-07-17 10:07:57.192034 7f3fb08f7700 4 mgr start_one Starting thread for dashboard
2018-07-17 10:07:57.192063 7f3fb08f7700 4 mgr standby_start starting module prometheus
2018-07-17 10:07:57.192088 7f3f936d6700 4 mgr entry Entering thread for dashboard
2018-07-17 10:07:57.192178 7f3f936d6700 4 mgr get_config get_configkey: mgr/dashboard/p05517715y01595/server_addr
... stuck there forever. Eventually they start logging _check_auth_rotating possible clock skew.
We disabled the dashboard module and they start now.
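On the monitoring question above: since the cluster itself may not warn about missing standbys, an external check is one option. Below is a rough sketch; the field names `active_name` and `standbys` are assumptions based on the JSON that `ceph mgr dump` emits on Luminous, so verify them on your release before relying on this.

```python
def mgr_health(mgrmap: dict) -> str:
    """Classify the mgrmap dict from `ceph mgr dump -f json`.

    Assumes 'active_name' and 'standbys' fields (verify on your
    release); a deadlocked standby eventually drops out of the map."""
    if not mgrmap.get("active_name"):
        return "CRITICAL: no active mgr"
    if not mgrmap.get("standbys"):
        return "WARNING: no standby mgrs"
    return "OK"

# Live usage would be something like:
#   import json, subprocess
#   mgrmap = json.loads(subprocess.check_output(
#       ["ceph", "mgr", "dump", "-f", "json"]))
#   print(mgr_health(mgrmap))

# Demo with a sample map that has an active mgr but no standbys:
print(mgr_health({"active_name": "mon01", "standbys": []}))  # WARNING: no standby mgrs
```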
Updated by Jared Baker over 5 years ago
Running Luminous 12.2.7-1xenial and am encountering this issue with ceph-mgr as well. I have 3 controllers running ceph-mgr and the standbys will sometimes disappear.
Sep 4 13:33:54 localhost ceph-mgr[1196670]: 2018-09-04 13:33:54.178490 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:33:54.178485)
Sep 4 13:34:04 localhost ceph-mgr[1196670]: 2018-09-04 13:34:04.178751 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:34:04.178744)
Sep 4 13:34:14 localhost ceph-mgr[1196670]: 2018-09-04 13:34:14.179076 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:34:14.179061)
Sep 4 13:34:24 localhost ceph-mgr[1196670]: 2018-09-04 13:34:24.179409 7f70e2a06700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2018-09-04 12:34:24.179387)
Some amount of restarting the ceph-mgr services across the controllers will allow me to return to a normal state of 1 active, 2 standby daemons.
Thank you
Updated by John Spray over 5 years ago
- Has duplicate Bug #36266: mgr: deadlock in ClusterState added
Updated by Bryan Stillwell over 5 years ago
We've also run into this on 12.2.8 using 3 mgrs:
2018-10-24 13:39:09.487997 7f46f9395700 4 mgr standby_start starting module dashboard
2018-10-24 13:39:09.488087 7f46f9395700 20 mgr Gil Switched to new thread state 0x562bd7c9f0c0
2018-10-24 13:39:09.488091 7f46dcaac700 10 mgr entry listing keys
2018-10-24 13:39:09.488279 7f46f9395700 1 mgr load Constructed class from module: dashboard
2018-10-24 13:39:09.488287 7f46f9395700 20 mgr ~Gil Destroying new thread state 0x562bd7c9f0c0
2018-10-24 13:39:09.488294 7f46f9395700 4 mgr start_one Starting thread for dashboard
2018-10-24 13:39:09.488341 7f46f9395700 4 mgr standby_start starting module prometheus
2018-10-24 13:39:09.488350 7f46f9395700 20 mgr Gil Switched to new thread state 0x562bd7c9f0c0
2018-10-24 13:39:09.488378 7f46dc2ab700 4 mgr entry Entering thread for dashboard
2018-10-24 13:39:09.488497 7f46dc2ab700 20 mgr Gil Switched to new thread state 0x562bd7c9f220
2018-10-24 13:39:09.488565 7f46dc2ab700 4 mgr get_config get_configkey: mgr/dashboard/a2plcephmon002/server_addr
Updated by Maxime Guyot over 5 years ago
I ran into this specific issue as well. Is there a way to detect the deadlocked MGR with the Prometheus module for instance?
Updated by Roberto Valverde over 5 years ago
Maxime Guyot wrote:
I ran into this specific issue as well. Is there a way to detect the deadlocked MGR with the Prometheus module for instance?
Hi Maxime,
Yes, with Prometheus the mgr target will appear as down while in this state.
Cheers,
Roberto
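To build on Roberto's answer: a deadlocked mgr stops answering Prometheus scrapes, so its `up` series drops to 0, which can drive an alert. A hypothetical alerting-rule fragment follows; the `job` label value "ceph-mgr" is an assumption and must match whatever your scrape_config calls the mgr prometheus endpoint.

```yaml
groups:
  - name: ceph-mgr
    rules:
      - alert: CephMgrScrapeDown
        # "ceph-mgr" is a placeholder job name; use the job label
        # your scrape_config assigns to the mgr prometheus endpoint.
        expr: up{job="ceph-mgr"} == 0
        for: 5m
        annotations:
          summary: "ceph-mgr not answering Prometheus scrapes (possible deadlock)"
```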
Updated by Hector Martin over 5 years ago
I think this is fixed by this commit:
https://github.com/ceph/ceph/commit/70e56269b1a459d6b0e7b23bc3037602963ea5c2
And subsequent commits like this one might fix similar issues:
https://github.com/ceph/ceph/commit/61aa7e2e0230bff49e757c32932d1db4bcad5f67
Any chance to get at least the first one backported for v12.2.11?
Updated by Hector Martin about 5 years ago
Pinging this again; other people are hitting this issue and #36266 in Luminous and probably Mimic (recent conversation in #ceph). Any chance we can get those commits backported to those branches?
Updated by Lenz Grimmer about 5 years ago
- Status changed from New to Pending Backport
- Backport changed from luminous to luminous, mimic
Hector Martin wrote:
Pinging this again; other people are hitting this issue and #36266 in Luminous and probably Mimic (recent conversation in #ceph). Any chance we can get those commits backported to those branches?
Thanks for the hint. The related PR is https://github.com/ceph/ceph/pull/23482 - setting the "Pending Backport" flag.
Updated by Nathan Cutler about 5 years ago
- Copied to Backport #38318: luminous: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added
Updated by Nathan Cutler about 5 years ago
- Copied to Backport #38319: mimic: mgr deadlock: _check_auth_rotating possible clock skew, rotating keys expired way too early added
Updated by Nathan Cutler about 5 years ago
- Status changed from Pending Backport to Resolved
Updated by Stephen Bird about 5 years ago
I'm seeing these deadlocks in Nautilus. Did these commits make it into 14.2?
Updated by Nathan Cutler about 5 years ago
Stephen Bird wrote:
I'm seeing these deadlocks in Nautilus. Did these commits make it into 14.2?
Yes, but please double-check for yourself.
For any commit, the github interface will tell you which versions it is in. For example: https://github.com/ceph/ceph/commit/70e56269b1a459d6b0e7b23bc3037602963ea5c2 shows this commit first entered the codestream in the v14.0.1 release.
Updated by Andreas de Pretis almost 5 years ago
We are experiencing this with 13.2.5 on Ubuntu 16.04.
Updated by Daniel Poelzleithner over 4 years ago
We had this issue yesterday with ceph version 14.2.1 (9257126ffb439de1652793b3e29f4c0b97a47b47) nautilus (stable) on PVE 5.0.15-1.
It seems that the mon started to desync for some reason (clock skew, maybe, though we run ntpd on all servers). Some OSDs could not connect to the mon anymore; others stayed up.
I was able to resolve the issue by stopping all mon and mgr daemons, then starting one mon, then the rest. After bringing up the mgr the cluster got healthy again.
Updated by hoan nv about 4 years ago
I had this issue with ceph version 14.2.7.
After disabling the prometheus module, the mgr no longer logs this.
Updated by Stefan Kooman over 3 years ago
We had this issue with Ceph 14.2.12. We will disable prometheus and see if this helps.