Project

General

Profile

Bug #39264

Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors

Added by Stephen Bird about 2 years ago. Updated 29 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
prometheus module
Target version:
% Done:

0%

Source:
Tags:
Backport:
octopus, nautilus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Greetings!

I have a bug not unlike BUG #23460, where our ceph-mgrs die and show clock/rotating key errors:

2019-04-11 03:37:01.544 7fc96a2c0700 -1 received  signal: Hangup from <unknown> (PID: 89976) UID: 0
2019-04-11 03:37:01.555 7fc96a2c0700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 89977) UID: 0
2019-04-11 03:37:07.113 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:07.114821)
2019-04-11 03:37:17.113 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:17.115020)
2019-04-11 03:37:27.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:27.115245)
2019-04-11 03:37:37.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:37.115425)
2019-04-11 03:37:47.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:47.115640)
2019-04-11 03:37:57.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:57.115867)
2019-04-11 03:38:07.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:07.116108)
2019-04-11 03:38:17.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:17.116286)
2019-04-11 03:38:27.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:27.116458)
2019-04-11 03:38:37.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:37.116650)
2019-04-11 03:38:47.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:47.116906)
2019-04-11 03:38:57.116 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:57.117134)
2019-04-11 03:39:07.116 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:39:07.117344)
2019-04-11 03:39:17.116 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:39:17.117606)

We have 3 mgr's that usually run, but since 14.2, our mgrs have died off to this one by one.

The time is synced on our mgr nodes and the original bug fix has been committed to 14.2.


Related issues

Related to mgr - Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive Resolved
Related to mgr - Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 Duplicate
Related to RADOS - Bug #43185: ceph -s not showing client activity Resolved
Related to mgr - Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 New
Copied to mgr - Backport #48713: nautilus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors Resolved
Copied to mgr - Backport #48714: octopus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors Resolved

History

#1 Updated by Stephen Bird about 2 years ago

So a quick update, after turning off the devicehealth module, the mgr's seem to be maintaining an active/standby status.

#2 Updated by Soenke Schippmann almost 2 years ago

This bug persists in 14.2.1 and is somewhat annoying. Especially as there seems to be no health warn when there is no standby mgr available.

#3 Updated by Janek Bevendorff over 1 year ago

I am seeing this quite regularly in Nautiluas 14.2.4. We have five MGRs and every now and then a few of them hang themselves with these messages. After a restart, everything's fine again. Most of the time, I have to restart all MGRs, though, even the ones which are not affected, because those tend to hang themselves shortly after restarting the dead ones.

We have the prometheus module enabled, which is being polled every few seconds. With over 1.2k OSDs, this seems to incur quite a bit of load on the MGRs and the metrics response times are about 3 seconds. Perhaps there is a connection.

#4 Updated by hoan nv about 1 year ago

This bug persists in 14.2.7.

My cluster has 5 mgrs, after some hours, mgr has log

2020-02-10 09:08:10.546 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:10.547340)
2020-02-10 09:08:20.546 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:20.547607)
2020-02-10 09:08:30.546 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:30.547909)
2020-02-10 09:08:40.547 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:40.548206)
2020-02-10 09:08:50.547 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:50.548487)
2020-02-10 09:09:00.547 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:09:00.548842)
2020-02-10 09:09:10.548 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:09:10.549157)
2020-02-10 09:09:20.548 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:09:20.549462)

I will try disable prometheus module.

#5 Updated by hoan nv about 1 year ago

After disable prometheus module, mgr don't have this log.

#6 Updated by Lenz Grimmer about 1 year ago

  • Category changed from ceph-mgr to prometheus module
  • Severity changed from 3 - minor to 2 - major

#7 Updated by Janek Bevendorff about 1 year ago

I want to dig this issue up again. With the new 14.2.8 update, the problem has become so bad I had to disable the prometheus module altogether. My MGRs kept hanging up every few hours. It takes only a day or two for all five MGRs to be gone. Reproducibly.

Since nobody knows a solution on the mailing list and I am not seeing a huge amount of fuzz about this problem except for some complaints about the general performance of the prometheus module, I would assume that the effect scales with cluster size. We have 1248 OSDs, five MONs and five MGRs. The /metrics endpoint takes quite a while to collect all the metrics and after a few hours, the MGRs are starting to die and stay dead until I restart them.

Since disabling the prometheus module, I haven't seen any MGR stability issues (so far).

#8 Updated by Jan Fajerski about 1 year ago

A couple of questions:

- Is there anything else in the logs?
- Does anyone maybe have debug logs for this?
- How does resource usage look on the affected mgr nodes?

#10 Updated by Janek Bevendorff about 1 year ago

No, but on the mailing list people were reporting that the problem went away once the free space dropped below a few PiB. We have 9 PiB available and our MGRs are not only slow, but shut down entirely every few hours.

#11 Updated by Jan Fajerski about 1 year ago

Janek Bevendorff wrote:

No, but on the mailing list people were reporting that the problem went away once the free space dropped below a few PiB. We have 9 PiB available and our MGRs are not only slow, but shut down entirely every few hours.

Any chance you could collect some debug logs from those mgr instances?
Do they shut down or are they being killed? Are there maybe coredumps in case they crash?

#12 Updated by Janek Bevendorff about 1 year ago

There is nothing else in the logs and I disabled the Prometheus module in our production cluster, so I don't have the problem right now. But they don't actually die, they just stop responding. The daemon is still alive, but doesn't appear in ceph status and requests to /metrics or the dashboard time out.

#13 Updated by Lenz Grimmer 12 months ago

  • Related to Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

#14 Updated by Lenz Grimmer 12 months ago

  • Related to Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 added

#15 Updated by Lenz Grimmer 12 months ago

  • Related to Bug #43185: ceph -s not showing client activity added

#16 Updated by Lenz Grimmer 12 months ago

  • Related to Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 added

#17 Updated by Sebastian Luna-Valero 7 months ago

This problem hit me yesterday in a Ceph cluster running version 15.2.5.

I got "no active mgr" with "ceph -s" but all mgr containers were reported as up by docker.

I solved the issue by restarting all Ceph containers but this is not ideal for a production system.

#18 Updated by Sebastian Luna-Valero 7 months ago

I have this problem again today.

Although all docker containers are shown as "up", the following commands hang on the MONs:
  • cephadm shell -- ceph orch ps
  • cephadm shell -- ceph orch ls

The mgr nodes are mostly idle so resource usage shouldn't be a problem.

Could you please let me know how to create the debug logs for this?

Best regards,
Sebastian

#19 Updated by Sebastian Luna-Valero 7 months ago

Here I paste what I hope are relevant log messages.

ceph-mon-1:

2020-10-01T17:02:45.522+0000 7f9897ce4700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2020-10-01T17:02:45.522+0000 7f9897ce4700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2020-10-01T17:02:52.247+0000 7f0186d0e200  0 set uid:gid to 167:167 (ceph:ceph)
2020-10-01T17:02:52.247+0000 7f0186d0e200  0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mgr, pid 1
2020-10-01T17:02:52.248+0000 7f0186d0e200  0 pidfile_write: ignore empty --pid-file
2020-10-01T17:02:52.285+0000 7f0186d0e200  1 mgr[py] Loading python module 'alerts'
2020-10-01T17:02:52.356+0000 7f0186d0e200  1 mgr[py] Loading python module 'balancer'
2020-10-01T17:02:52.411+0000 7f0186d0e200  1 mgr[py] Loading python module 'cephadm'
2020-10-01T17:02:52.603+0000 7f0186d0e200  1 mgr[py] Loading python module 'crash'
2020-10-01T17:02:52.658+0000 7f0186d0e200  1 mgr[py] Loading python module 'dashboard'
2020-10-01T17:02:53.112+0000 7f0186d0e200  1 mgr[py] Loading python module 'devicehealth'
2020-10-01T17:02:53.159+0000 7f0186d0e200  1 mgr[py] Loading python module 'diskprediction_local'
2020-10-01T17:02:53.321+0000 7f0186d0e200  1 mgr[py] Loading python module 'influx'
2020-10-01T17:02:53.424+0000 7f0186d0e200  1 mgr[py] Loading python module 'insights'
2020-10-01T17:02:53.471+0000 7f0186d0e200  1 mgr[py] Loading python module 'iostat'
2020-10-01T17:02:53.516+0000 7f0186d0e200  1 mgr[py] Loading python module 'k8sevents'
2020-10-01T17:02:53.888+0000 7f0186d0e200  1 mgr[py] Loading python module 'localpool'
2020-10-01T17:02:53.933+0000 7f0186d0e200  1 mgr[py] Loading python module 'orchestrator'
2020-10-01T17:02:54.112+0000 7f0186d0e200  1 mgr[py] Loading python module 'osd_support'
2020-10-01T17:02:54.158+0000 7f0186d0e200  1 mgr[py] Loading python module 'pg_autoscaler'
2020-10-01T17:02:54.230+0000 7f0186d0e200  1 mgr[py] Loading python module 'progress'
2020-10-01T17:02:54.297+0000 7f0186d0e200  1 mgr[py] Loading python module 'prometheus'
2020-10-01T17:02:54.658+0000 7f0186d0e200  1 mgr[py] Loading python module 'rbd_support'
2020-10-01T17:02:54.740+0000 7f0186d0e200  1 mgr[py] Loading python module 'restful'
2020-10-01T17:02:54.962+0000 7f0186d0e200  1 mgr[py] Loading python module 'rook'
2020-10-01T17:02:55.466+0000 7f0186d0e200  1 mgr[py] Loading python module 'selftest'
2020-10-01T17:02:55.514+0000 7f0186d0e200  1 mgr[py] Loading python module 'status'
2020-10-01T17:02:55.567+0000 7f0186d0e200  1 mgr[py] Loading python module 'telegraf'
2020-10-01T17:02:55.615+0000 7f0186d0e200  1 mgr[py] Loading python module 'telemetry'
2020-10-01T17:02:55.739+0000 7f0186d0e200  1 mgr[py] Loading python module 'test_orchestrator'
2020-10-01T17:02:56.006+0000 7f0186d0e200  1 mgr[py] Loading python module 'volumes'
2020-10-01T17:02:56.161+0000 7f0186d0e200  1 mgr[py] Loading python module 'zabbix'
2020-10-01T17:02:56.235+0000 7f0173f8b700  0 ms_deliver_dispatch: unhandled message 0x564063524420 mon_map magic: 0 v1 from mon.2 v2:172.16.5.117:3300/0
2020-10-01T17:02:56.751+0000 7f0173f8b700  1 mgr handle_mgr_map Activating!
2020-10-01T17:02:56.753+0000 7f0173f8b700  1 mgr handle_mgr_map I am now activating
2020-10-01T17:02:56.840+0000 7f010486b700  0 [balancer DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.840+0000 7f010486b700  1 mgr load Constructed class from module: balancer
2020-10-01T17:02:56.841+0000 7f010486b700  0 [cephadm DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.988+0000 7f010486b700  1 mgr load Constructed class from module: cephadm
2020-10-01T17:02:56.988+0000 7f010486b700  0 [crash DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.988+0000 7f010486b700  1 mgr load Constructed class from module: crash
2020-10-01T17:02:56.992+0000 7f0110073700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2020-10-01T17:02:56.993+0000 7f010486b700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.993+0000 7f010486b700  1 mgr load Constructed class from module: dashboard
2020-10-01T17:02:56.993+0000 7f010486b700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.993+0000 7f010486b700  1 mgr load Constructed class from module: devicehealth
2020-10-01T17:02:56.997+0000 7f010486b700  0 [diskprediction_local DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.997+0000 7f010486b700  1 mgr load Constructed class from module: diskprediction_local
2020-10-01T17:02:56.998+0000 7f010486b700  0 [iostat DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.998+0000 7f010486b700  1 mgr load Constructed class from module: iostat
2020-10-01T17:02:56.998+0000 7f010486b700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.998+0000 7f010486b700  1 mgr load Constructed class from module: orchestrator
2020-10-01T17:02:57.003+0000 7f010486b700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:57.003+0000 7f010486b700  1 mgr load Constructed class from module: pg_autoscaler
2020-10-01T17:02:57.021+0000 7f010486b700  0 [progress DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:57.021+0000 7f010486b700  1 mgr load Constructed class from module: progress
2020-10-01T17:02:57.021+0000 7f010486b700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:57.027+0000 7f010486b700  1 mgr load Constructed class from module: prometheus
2020-10-01T17:02:57.063+0000 7f010486b700  0 [rbd_support DEBUG root] setting log level based on debug_mgr: WARNING (1/5)

ceph-mon-2:

2020-10-01T17:02:40.694+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:40.696200+0000)
2020-10-01T17:02:41.695+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:41.696611+0000)
2020-10-01T17:02:42.695+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:42.696897+0000)
2020-10-01T17:02:43.695+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:43.697231+0000)
2020-10-01T17:02:44.696+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:44.697444+0000)
2020-10-01T17:02:45.521+0000 7fad1df98700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2020-10-01T17:02:45.521+0000 7fad1df98700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2020-10-01T17:02:52.282+0000 7f1a12b52200  0 set uid:gid to 167:167 (ceph:ceph)
2020-10-01T17:02:52.282+0000 7f1a12b52200  0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mgr, pid 1
2020-10-01T17:02:52.283+0000 7f1a12b52200  0 pidfile_write: ignore empty --pid-file
2020-10-01T17:02:52.320+0000 7f1a12b52200  1 mgr[py] Loading python module 'alerts'
2020-10-01T17:02:52.392+0000 7f1a12b52200  1 mgr[py] Loading python module 'balancer'
2020-10-01T17:02:52.446+0000 7f1a12b52200  1 mgr[py] Loading python module 'cephadm'
2020-10-01T17:02:52.639+0000 7f1a12b52200  1 mgr[py] Loading python module 'crash'
2020-10-01T17:02:52.697+0000 7f1a12b52200  1 mgr[py] Loading python module 'dashboard'
2020-10-01T17:02:53.157+0000 7f1a12b52200  1 mgr[py] Loading python module 'devicehealth'
2020-10-01T17:02:53.206+0000 7f1a12b52200  1 mgr[py] Loading python module 'diskprediction_local'
2020-10-01T17:02:53.333+0000 7f1a12b52200  1 mgr[py] Loading python module 'influx'
2020-10-01T17:02:53.432+0000 7f1a12b52200  1 mgr[py] Loading python module 'insights'
2020-10-01T17:02:53.478+0000 7f1a12b52200  1 mgr[py] Loading python module 'iostat'
2020-10-01T17:02:53.521+0000 7f1a12b52200  1 mgr[py] Loading python module 'k8sevents'
2020-10-01T17:02:53.904+0000 7f1a12b52200  1 mgr[py] Loading python module 'localpool'
2020-10-01T17:02:53.950+0000 7f1a12b52200  1 mgr[py] Loading python module 'orchestrator'
2020-10-01T17:02:54.129+0000 7f1a12b52200  1 mgr[py] Loading python module 'osd_support'
2020-10-01T17:02:54.176+0000 7f1a12b52200  1 mgr[py] Loading python module 'pg_autoscaler'
2020-10-01T17:02:54.257+0000 7f1a12b52200  1 mgr[py] Loading python module 'progress'
2020-10-01T17:02:54.330+0000 7f1a12b52200  1 mgr[py] Loading python module 'prometheus'
2020-10-01T17:02:54.706+0000 7f1a12b52200  1 mgr[py] Loading python module 'rbd_support'
2020-10-01T17:02:54.788+0000 7f1a12b52200  1 mgr[py] Loading python module 'restful'
2020-10-01T17:02:55.010+0000 7f1a12b52200  1 mgr[py] Loading python module 'rook'
2020-10-01T17:02:55.517+0000 7f1a12b52200  1 mgr[py] Loading python module 'selftest'
2020-10-01T17:02:55.564+0000 7f1a12b52200  1 mgr[py] Loading python module 'status'
2020-10-01T17:02:55.619+0000 7f1a12b52200  1 mgr[py] Loading python module 'telegraf'
2020-10-01T17:02:55.668+0000 7f1a12b52200  1 mgr[py] Loading python module 'telemetry'
2020-10-01T17:02:55.792+0000 7f1a12b52200  1 mgr[py] Loading python module 'test_orchestrator'
2020-10-01T17:02:56.059+0000 7f1a12b52200  1 mgr[py] Loading python module 'volumes'
2020-10-01T17:02:56.210+0000 7f1a12b52200  1 mgr[py] Loading python module 'zabbix'
2020-10-01T17:02:56.273+0000 7f19ffdcf700  0 ms_deliver_dispatch: unhandled message 0x5650bc356420 mon_map magic: 0 v1 from mon.1 v2:172.16.5.116:3300/0
2020-10-01T17:02:56.755+0000 7f1a005d0700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.755+0000 7f1a005d0700  1 mgr load Constructed class from module: dashboard
2020-10-01T17:02:56.757+0000 7f1a005d0700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.757+0000 7f1a005d0700  1 mgr load Constructed class from module: prometheus
2020-10-01T17:32:57.791+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:47:57.796+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T18:02:57.802+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T18:17:57.807+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T18:32:57.813+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178

ceph-mon-3:

2020-10-01T17:02:40.720+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:40.721579+0000)
2020-10-01T17:02:41.720+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:41.721844+0000)
2020-10-01T17:02:42.721+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:42.722116+0000)
2020-10-01T17:02:43.721+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:43.722378+0000)
2020-10-01T17:02:44.721+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:44.722612+0000)
2020-10-01T17:02:45.585+0000 7f61028fb700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2020-10-01T17:02:45.588+0000 7f61028fb700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2020-10-01T17:02:52.476+0000 7f15487e2200  0 set uid:gid to 167:167 (ceph:ceph)
2020-10-01T17:02:52.476+0000 7f15487e2200  0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mgr, pid 1
2020-10-01T17:02:52.477+0000 7f15487e2200  0 pidfile_write: ignore empty --pid-file
2020-10-01T17:02:52.519+0000 7f15487e2200  1 mgr[py] Loading python module 'alerts'
2020-10-01T17:02:52.621+0000 7f15487e2200  1 mgr[py] Loading python module 'balancer'
2020-10-01T17:02:52.694+0000 7f15487e2200  1 mgr[py] Loading python module 'cephadm'
2020-10-01T17:02:53.039+0000 7f15487e2200  1 mgr[py] Loading python module 'crash'
2020-10-01T17:02:53.114+0000 7f15487e2200  1 mgr[py] Loading python module 'dashboard'
2020-10-01T17:02:54.001+0000 7f15487e2200  1 mgr[py] Loading python module 'devicehealth'
2020-10-01T17:02:54.081+0000 7f15487e2200  1 mgr[py] Loading python module 'diskprediction_local'
2020-10-01T17:02:54.401+0000 7f15487e2200  1 mgr[py] Loading python module 'influx'
2020-10-01T17:02:54.515+0000 7f15487e2200  1 mgr[py] Loading python module 'insights'
2020-10-01T17:02:54.589+0000 7f15487e2200  1 mgr[py] Loading python module 'iostat'
2020-10-01T17:02:54.646+0000 7f15487e2200  1 mgr[py] Loading python module 'k8sevents'
2020-10-01T17:02:55.251+0000 7f15487e2200  1 mgr[py] Loading python module 'localpool'
2020-10-01T17:02:55.306+0000 7f15487e2200  1 mgr[py] Loading python module 'orchestrator'
2020-10-01T17:02:55.518+0000 7f15487e2200  1 mgr[py] Loading python module 'osd_support'
2020-10-01T17:02:55.571+0000 7f15487e2200  1 mgr[py] Loading python module 'pg_autoscaler'
2020-10-01T17:02:55.660+0000 7f15487e2200  1 mgr[py] Loading python module 'progress'
2020-10-01T17:02:55.744+0000 7f15487e2200  1 mgr[py] Loading python module 'prometheus'
2020-10-01T17:02:56.181+0000 7f15487e2200  1 mgr[py] Loading python module 'rbd_support'
2020-10-01T17:02:56.280+0000 7f15487e2200  1 mgr[py] Loading python module 'restful'
2020-10-01T17:02:56.589+0000 7f15487e2200  1 mgr[py] Loading python module 'rook'
2020-10-01T17:02:57.168+0000 7f15487e2200  1 mgr[py] Loading python module 'selftest'
2020-10-01T17:02:57.223+0000 7f15487e2200  1 mgr[py] Loading python module 'status'
2020-10-01T17:02:57.286+0000 7f15487e2200  1 mgr[py] Loading python module 'telegraf'
2020-10-01T17:02:57.343+0000 7f15487e2200  1 mgr[py] Loading python module 'telemetry'
2020-10-01T17:02:57.488+0000 7f15487e2200  1 mgr[py] Loading python module 'test_orchestrator'
2020-10-01T17:02:57.803+0000 7f15487e2200  1 mgr[py] Loading python module 'volumes'
2020-10-01T17:02:58.000+0000 7f15487e2200  1 mgr[py] Loading python module 'zabbix'
2020-10-01T17:02:58.074+0000 7f1535a5f700  0 ms_deliver_dispatch: unhandled message 0x56286fa76420 mon_map magic: 0 v1 from mon.2 v2:172.16.5.117:3300/0
2020-10-01T17:02:58.080+0000 7f1536260700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:58.080+0000 7f1536260700  1 mgr load Constructed class from module: dashboard
2020-10-01T17:02:58.081+0000 7f1536260700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:58.081+0000 7f1536260700  1 mgr load Constructed class from module: prometheus
2020-10-01T17:02:58.085+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:02:59.082+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:18:00.087+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:33:00.092+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:48:00.097+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178

#20 Updated by Robert Groenewald 6 months ago

We've just encountered the same issue on 15.2.5

#21 Updated by Sebastian Luna-Valero 6 months ago

FYI: I upgraded to version 15.2.6 and I see this problem as well.

#22 Updated by David Orman 5 months ago

We can confirm that we see this in 15.2.7, as well. Our test cluster is made up of 21 nodes, 5 mgrs & mons, all nodes with 24 OSDs each (for a total of 504). We've deployed using cephadm and are running the daemons as containers. We started seeing mgr lockups about 12 hours after the last restart, and we have the same log messages presented above. Restarting the mgrs fixes this. Unfortunately, running without metrics isn't a great solution in a production cluster. Let us know what additional information we can provide to assist in correcting this bug; we're happy to oblige.

#23 Updated by Martin Verges 5 months ago

We encountered this issue last night on a 750 OSD cluster. Restarting MON/MGR/OSD didn't help. We found 25 OSDs related to the issue that we set OUT and then DOWN. After that and restarting all other OSDs the issue did go away.

#24 Updated by Jeremi A 5 months ago

I just deployed a brand new cluster with 15.2.8 - got no "active" MGR, it constantly restarts every few seconds.

https://tracker.ceph.com/issues/48689

#25 Updated by David Orman 5 months ago

We've got a PR in to fix this, we validated it resolves the issue in our larger clusters. We could use some help getting this moved forward since it seems to impact a number of users:

https://github.com/ceph/ceph/pull/38677

#26 Updated by Kefu Chai 5 months ago

  • Status changed from New to Fix Under Review
  • Backport set to octopus, nautilus
  • Pull request ID set to 38677

#27 Updated by Jeremi A 5 months ago

David Orman wrote:

We've got a PR in to fix this, we validated it resolves the issue in our larger clusters. We could use some help getting this moved forward since it seems to impact a number of users:

https://github.com/ceph/ceph/pull/38677

Great, currently sitting with a 4.1 PiB cluster "dead in the water" because of this.

#28 Updated by Kefu Chai 4 months ago

  • Status changed from Fix Under Review to Pending Backport

#29 Updated by Backport Bot 4 months ago

  • Copied to Backport #48713: nautilus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added

#30 Updated by Backport Bot 4 months ago

  • Copied to Backport #48714: octopus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added

#31 Updated by Loïc Dachary 29 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF