Bug #39264: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors - mgr - Ceph

Actions

Copy link

Bug #39264

closed

Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors

Added by Stephen Bird about 5 years ago. Updated about 3 years ago.

Status:

Resolved

Priority:

Normal

Assignee:

Category:

prometheus module

Target version:

Ceph - v14.2.0

% Done:

Source:

Tags:

Backport:

octopus, nautilus

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

38677

Crash signature (v1):

Crash signature (v2):

Description

Greetings!

I have a bug not unlike BUG #23460, where our ceph-mgrs die and show clock/rotating key errors:

2019-04-11 03:37:01.544 7fc96a2c0700 -1 received  signal: Hangup from &lt;unknown&gt; (PID: 89976) UID: 0
2019-04-11 03:37:01.555 7fc96a2c0700 -1 received  signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw  (PID: 89977) UID: 0
2019-04-11 03:37:07.113 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:07.114821)
2019-04-11 03:37:17.113 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:17.115020)
2019-04-11 03:37:27.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:27.115245)
2019-04-11 03:37:37.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:37.115425)
2019-04-11 03:37:47.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:47.115640)
2019-04-11 03:37:57.114 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:37:57.115867)
2019-04-11 03:38:07.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:07.116108)
2019-04-11 03:38:17.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:17.116286)
2019-04-11 03:38:27.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:27.116458)
2019-04-11 03:38:37.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:37.116650)
2019-04-11 03:38:47.115 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:47.116906)
2019-04-11 03:38:57.116 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:38:57.117134)
2019-04-11 03:39:07.116 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:39:07.117344)
2019-04-11 03:39:17.116 7fc9682bc700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2019-04-11 02:39:17.117606)

We have 3 mgr's that usually run, but since 14.2, our mgrs have died off to this one by one.

The time is synced on our mgr nodes and the original bug fix has been committed to 14.2.

Related issues 6 (1 open — 5 closed)

Related to mgr - Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive

Resolved

Actions

Related to mgr - Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5

Duplicate

Actions

Related to RADOS - Bug #43185: ceph -s not showing client activity

Resolved

Actions

Related to mgr - Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8

New

Actions

Copied to mgr - Backport #48713: nautilus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors

Resolved

Actions

Copied to mgr - Backport #48714: octopus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors

Resolved

Laura Paduano

Actions

Copy link

Updated by Stephen Bird about 5 years ago

So a quick update, after turning off the devicehealth module, the mgr's seem to be maintaining an active/standby status.

Actions

Copy link

Updated by Soenke Schippmann almost 5 years ago

This bug persists in 14.2.1 and is somewhat annoying. Especially as there seems to be no health warn when there is no standby mgr available.

Actions

Copy link

Updated by Janek Bevendorff over 4 years ago

I am seeing this quite regularly in Nautiluas 14.2.4. We have five MGRs and every now and then a few of them hang themselves with these messages. After a restart, everything's fine again. Most of the time, I have to restart all MGRs, though, even the ones which are not affected, because those tend to hang themselves shortly after restarting the dead ones.

We have the prometheus module enabled, which is being polled every few seconds. With over 1.2k OSDs, this seems to incur quite a bit of load on the MGRs and the metrics response times are about 3 seconds. Perhaps there is a connection.

Actions

Copy link

Updated by hoan nv about 4 years ago

This bug persists in 14.2.7.

My cluster has 5 mgrs, after some hours, mgr has log

2020-02-10 09:08:10.546 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:10.547340)
2020-02-10 09:08:20.546 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:20.547607)
2020-02-10 09:08:30.546 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:30.547909)
2020-02-10 09:08:40.547 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:40.548206)
2020-02-10 09:08:50.547 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:08:50.548487)
2020-02-10 09:09:00.547 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:09:00.548842)
2020-02-10 09:09:10.548 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:09:10.549157)
2020-02-10 09:09:20.548 7f802d778700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-02-10 08:09:20.549462)

I will try disable prometheus module.

Actions

Copy link

Updated by hoan nv about 4 years ago

After disable prometheus module, mgr don't have this log.

Actions

Copy link

Updated by Lenz Grimmer about 4 years ago

Category changed from ceph-mgr to prometheus module
Severity changed from 3 - minor to 2 - major

Actions

Copy link

Updated by Janek Bevendorff about 4 years ago

I want to dig this issue up again. With the new 14.2.8 update, the problem has become so bad I had to disable the prometheus module altogether. My MGRs kept hanging up every few hours. It takes only a day or two for all five MGRs to be gone. Reproducibly.

Since nobody knows a solution on the mailing list and I am not seeing a huge amount of fuzz about this problem except for some complaints about the general performance of the prometheus module, I would assume that the effect scales with cluster size. We have 1248 OSDs, five MONs and five MGRs. The /metrics endpoint takes quite a while to collect all the metrics and after a few hours, the MGRs are starting to die and stay dead until I restart them.

Since disabling the prometheus module, I haven't seen any MGR stability issues (so far).

Actions

Copy link

Updated by Jan Fajerski about 4 years ago

A couple of questions:

- Is there anything else in the logs?
- Does anyone maybe have debug logs for this?
- How does resource usage look on the affected mgr nodes?

Actions

Copy link

Updated by Jan Fajerski about 4 years ago

There is also a number of similar looking issues open that might be related:
https://tracker.ceph.com/issues/43364
https://tracker.ceph.com/issues/43317
https://tracker.ceph.com/issues/43185
https://tracker.ceph.com/issues/44251
https://tracker.ceph.com/issues/44495

Actions

Copy link

#10

Updated by Janek Bevendorff about 4 years ago

No, but on the mailing list people were reporting that the problem went away once the free space dropped below a few PiB. We have 9 PiB available and our MGRs are not only slow, but shut down entirely every few hours.

Actions

Copy link

#11

Updated by Jan Fajerski about 4 years ago

Janek Bevendorff wrote:

No, but on the mailing list people were reporting that the problem went away once the free space dropped below a few PiB. We have 9 PiB available and our MGRs are not only slow, but shut down entirely every few hours.

Any chance you could collect some debug logs from those mgr instances?
Do they shut down or are they being killed? Are there maybe coredumps in case they crash?

Actions

Copy link

#12

Updated by Janek Bevendorff about 4 years ago

There is nothing else in the logs and I disabled the Prometheus module in our production cluster, so I don't have the problem right now. But they don't actually die, they just stop responding. The daemon is still alive, but doesn't appear in ceph status and requests to /metrics or the dashboard time out.

Actions

Copy link

#13

Updated by Lenz Grimmer almost 4 years ago

Related to Bug #43364: ceph-mgr's finisher queue can grow indefinitely, making python modules/commands unresponsive added

Actions

Copy link

#14

Updated by Lenz Grimmer almost 4 years ago

Related to Bug #43317: high CPU usage, ceph-mgr very slow or unresponsive following upgrade from Nautilus v14.2.4 to v14.2.5 added

Actions

Copy link

#15

Updated by Lenz Grimmer almost 4 years ago

Related to Bug #43185: ceph -s not showing client activity added

Actions

Copy link

#16

Updated by Lenz Grimmer almost 4 years ago

Related to Bug #45439: High CPU utilization for large clusters in ceph-mgr in 14.2.8 added

Actions

Copy link

#17

Updated by Sebastian Luna-Valero over 3 years ago

This problem hit me yesterday in a Ceph cluster running version 15.2.5.

I got "no active mgr" with "ceph -s" but all mgr containers were reported as up by docker.

I solved the issue by restarting all Ceph containers but this is not ideal for a production system.

Actions

Copy link

#18

Updated by Sebastian Luna-Valero over 3 years ago

I have this problem again today.

Although all docker containers are shown as "up", the following commands hang on the MONs:

cephadm shell -- ceph orch ps
cephadm shell -- ceph orch ls

The mgr nodes are mostly idle so resource usage shouldn't be a problem.

Could you please let me know how to create the debug logs for this?

Best regards,
Sebastian

Actions

Copy link

#19

Updated by Sebastian Luna-Valero over 3 years ago

Here I paste what I hope are relevant log messages.

ceph-mon-1:

2020-10-01T17:02:45.522+0000 7f9897ce4700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2020-10-01T17:02:45.522+0000 7f9897ce4700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2020-10-01T17:02:52.247+0000 7f0186d0e200  0 set uid:gid to 167:167 (ceph:ceph)
2020-10-01T17:02:52.247+0000 7f0186d0e200  0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mgr, pid 1
2020-10-01T17:02:52.248+0000 7f0186d0e200  0 pidfile_write: ignore empty --pid-file
2020-10-01T17:02:52.285+0000 7f0186d0e200  1 mgr[py] Loading python module 'alerts'
2020-10-01T17:02:52.356+0000 7f0186d0e200  1 mgr[py] Loading python module 'balancer'
2020-10-01T17:02:52.411+0000 7f0186d0e200  1 mgr[py] Loading python module 'cephadm'
2020-10-01T17:02:52.603+0000 7f0186d0e200  1 mgr[py] Loading python module 'crash'
2020-10-01T17:02:52.658+0000 7f0186d0e200  1 mgr[py] Loading python module 'dashboard'
2020-10-01T17:02:53.112+0000 7f0186d0e200  1 mgr[py] Loading python module 'devicehealth'
2020-10-01T17:02:53.159+0000 7f0186d0e200  1 mgr[py] Loading python module 'diskprediction_local'
2020-10-01T17:02:53.321+0000 7f0186d0e200  1 mgr[py] Loading python module 'influx'
2020-10-01T17:02:53.424+0000 7f0186d0e200  1 mgr[py] Loading python module 'insights'
2020-10-01T17:02:53.471+0000 7f0186d0e200  1 mgr[py] Loading python module 'iostat'
2020-10-01T17:02:53.516+0000 7f0186d0e200  1 mgr[py] Loading python module 'k8sevents'
2020-10-01T17:02:53.888+0000 7f0186d0e200  1 mgr[py] Loading python module 'localpool'
2020-10-01T17:02:53.933+0000 7f0186d0e200  1 mgr[py] Loading python module 'orchestrator'
2020-10-01T17:02:54.112+0000 7f0186d0e200  1 mgr[py] Loading python module 'osd_support'
2020-10-01T17:02:54.158+0000 7f0186d0e200  1 mgr[py] Loading python module 'pg_autoscaler'
2020-10-01T17:02:54.230+0000 7f0186d0e200  1 mgr[py] Loading python module 'progress'
2020-10-01T17:02:54.297+0000 7f0186d0e200  1 mgr[py] Loading python module 'prometheus'
2020-10-01T17:02:54.658+0000 7f0186d0e200  1 mgr[py] Loading python module 'rbd_support'
2020-10-01T17:02:54.740+0000 7f0186d0e200  1 mgr[py] Loading python module 'restful'
2020-10-01T17:02:54.962+0000 7f0186d0e200  1 mgr[py] Loading python module 'rook'
2020-10-01T17:02:55.466+0000 7f0186d0e200  1 mgr[py] Loading python module 'selftest'
2020-10-01T17:02:55.514+0000 7f0186d0e200  1 mgr[py] Loading python module 'status'
2020-10-01T17:02:55.567+0000 7f0186d0e200  1 mgr[py] Loading python module 'telegraf'
2020-10-01T17:02:55.615+0000 7f0186d0e200  1 mgr[py] Loading python module 'telemetry'
2020-10-01T17:02:55.739+0000 7f0186d0e200  1 mgr[py] Loading python module 'test_orchestrator'
2020-10-01T17:02:56.006+0000 7f0186d0e200  1 mgr[py] Loading python module 'volumes'
2020-10-01T17:02:56.161+0000 7f0186d0e200  1 mgr[py] Loading python module 'zabbix'
2020-10-01T17:02:56.235+0000 7f0173f8b700  0 ms_deliver_dispatch: unhandled message 0x564063524420 mon_map magic: 0 v1 from mon.2 v2:172.16.5.117:3300/0
2020-10-01T17:02:56.751+0000 7f0173f8b700  1 mgr handle_mgr_map Activating!
2020-10-01T17:02:56.753+0000 7f0173f8b700  1 mgr handle_mgr_map I am now activating
2020-10-01T17:02:56.840+0000 7f010486b700  0 [balancer DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.840+0000 7f010486b700  1 mgr load Constructed class from module: balancer
2020-10-01T17:02:56.841+0000 7f010486b700  0 [cephadm DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.988+0000 7f010486b700  1 mgr load Constructed class from module: cephadm
2020-10-01T17:02:56.988+0000 7f010486b700  0 [crash DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.988+0000 7f010486b700  1 mgr load Constructed class from module: crash
2020-10-01T17:02:56.992+0000 7f0110073700  1 mgr.server send_report Not sending PG status to monitor yet, waiting for OSDs
2020-10-01T17:02:56.993+0000 7f010486b700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.993+0000 7f010486b700  1 mgr load Constructed class from module: dashboard
2020-10-01T17:02:56.993+0000 7f010486b700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.993+0000 7f010486b700  1 mgr load Constructed class from module: devicehealth
2020-10-01T17:02:56.997+0000 7f010486b700  0 [diskprediction_local DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.997+0000 7f010486b700  1 mgr load Constructed class from module: diskprediction_local
2020-10-01T17:02:56.998+0000 7f010486b700  0 [iostat DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.998+0000 7f010486b700  1 mgr load Constructed class from module: iostat
2020-10-01T17:02:56.998+0000 7f010486b700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.998+0000 7f010486b700  1 mgr load Constructed class from module: orchestrator
2020-10-01T17:02:57.003+0000 7f010486b700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:57.003+0000 7f010486b700  1 mgr load Constructed class from module: pg_autoscaler
2020-10-01T17:02:57.021+0000 7f010486b700  0 [progress DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:57.021+0000 7f010486b700  1 mgr load Constructed class from module: progress
2020-10-01T17:02:57.021+0000 7f010486b700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:57.027+0000 7f010486b700  1 mgr load Constructed class from module: prometheus
2020-10-01T17:02:57.063+0000 7f010486b700  0 [rbd_support DEBUG root] setting log level based on debug_mgr: WARNING (1/5)

ceph-mon-2:

2020-10-01T17:02:40.694+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:40.696200+0000)
2020-10-01T17:02:41.695+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:41.696611+0000)
2020-10-01T17:02:42.695+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:42.696897+0000)
2020-10-01T17:02:43.695+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:43.697231+0000)
2020-10-01T17:02:44.696+0000 7fad1bf94700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:44.697444+0000)
2020-10-01T17:02:45.521+0000 7fad1df98700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2020-10-01T17:02:45.521+0000 7fad1df98700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2020-10-01T17:02:52.282+0000 7f1a12b52200  0 set uid:gid to 167:167 (ceph:ceph)
2020-10-01T17:02:52.282+0000 7f1a12b52200  0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mgr, pid 1
2020-10-01T17:02:52.283+0000 7f1a12b52200  0 pidfile_write: ignore empty --pid-file
2020-10-01T17:02:52.320+0000 7f1a12b52200  1 mgr[py] Loading python module 'alerts'
2020-10-01T17:02:52.392+0000 7f1a12b52200  1 mgr[py] Loading python module 'balancer'
2020-10-01T17:02:52.446+0000 7f1a12b52200  1 mgr[py] Loading python module 'cephadm'
2020-10-01T17:02:52.639+0000 7f1a12b52200  1 mgr[py] Loading python module 'crash'
2020-10-01T17:02:52.697+0000 7f1a12b52200  1 mgr[py] Loading python module 'dashboard'
2020-10-01T17:02:53.157+0000 7f1a12b52200  1 mgr[py] Loading python module 'devicehealth'
2020-10-01T17:02:53.206+0000 7f1a12b52200  1 mgr[py] Loading python module 'diskprediction_local'
2020-10-01T17:02:53.333+0000 7f1a12b52200  1 mgr[py] Loading python module 'influx'
2020-10-01T17:02:53.432+0000 7f1a12b52200  1 mgr[py] Loading python module 'insights'
2020-10-01T17:02:53.478+0000 7f1a12b52200  1 mgr[py] Loading python module 'iostat'
2020-10-01T17:02:53.521+0000 7f1a12b52200  1 mgr[py] Loading python module 'k8sevents'
2020-10-01T17:02:53.904+0000 7f1a12b52200  1 mgr[py] Loading python module 'localpool'
2020-10-01T17:02:53.950+0000 7f1a12b52200  1 mgr[py] Loading python module 'orchestrator'
2020-10-01T17:02:54.129+0000 7f1a12b52200  1 mgr[py] Loading python module 'osd_support'
2020-10-01T17:02:54.176+0000 7f1a12b52200  1 mgr[py] Loading python module 'pg_autoscaler'
2020-10-01T17:02:54.257+0000 7f1a12b52200  1 mgr[py] Loading python module 'progress'
2020-10-01T17:02:54.330+0000 7f1a12b52200  1 mgr[py] Loading python module 'prometheus'
2020-10-01T17:02:54.706+0000 7f1a12b52200  1 mgr[py] Loading python module 'rbd_support'
2020-10-01T17:02:54.788+0000 7f1a12b52200  1 mgr[py] Loading python module 'restful'
2020-10-01T17:02:55.010+0000 7f1a12b52200  1 mgr[py] Loading python module 'rook'
2020-10-01T17:02:55.517+0000 7f1a12b52200  1 mgr[py] Loading python module 'selftest'
2020-10-01T17:02:55.564+0000 7f1a12b52200  1 mgr[py] Loading python module 'status'
2020-10-01T17:02:55.619+0000 7f1a12b52200  1 mgr[py] Loading python module 'telegraf'
2020-10-01T17:02:55.668+0000 7f1a12b52200  1 mgr[py] Loading python module 'telemetry'
2020-10-01T17:02:55.792+0000 7f1a12b52200  1 mgr[py] Loading python module 'test_orchestrator'
2020-10-01T17:02:56.059+0000 7f1a12b52200  1 mgr[py] Loading python module 'volumes'
2020-10-01T17:02:56.210+0000 7f1a12b52200  1 mgr[py] Loading python module 'zabbix'
2020-10-01T17:02:56.273+0000 7f19ffdcf700  0 ms_deliver_dispatch: unhandled message 0x5650bc356420 mon_map magic: 0 v1 from mon.1 v2:172.16.5.116:3300/0
2020-10-01T17:02:56.755+0000 7f1a005d0700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.755+0000 7f1a005d0700  1 mgr load Constructed class from module: dashboard
2020-10-01T17:02:56.757+0000 7f1a005d0700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:56.757+0000 7f1a005d0700  1 mgr load Constructed class from module: prometheus
2020-10-01T17:32:57.791+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:47:57.796+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T18:02:57.802+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T18:17:57.807+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T18:32:57.813+0000 7f19ffdcf700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178

ceph-mon-3:

2020-10-01T17:02:40.720+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:40.721579+0000)
2020-10-01T17:02:41.720+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:41.721844+0000)
2020-10-01T17:02:42.721+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:42.722116+0000)
2020-10-01T17:02:43.721+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:43.722378+0000)
2020-10-01T17:02:44.721+0000 7f61008f7700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-10-01T16:02:44.722612+0000)
2020-10-01T17:02:45.585+0000 7f61028fb700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2020-10-01T17:02:45.588+0000 7f61028fb700 -1 mgr handle_mgr_signal  *** Got signal Terminated ***
2020-10-01T17:02:52.476+0000 7f15487e2200  0 set uid:gid to 167:167 (ceph:ceph)
2020-10-01T17:02:52.476+0000 7f15487e2200  0 ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable), process ceph-mgr, pid 1
2020-10-01T17:02:52.477+0000 7f15487e2200  0 pidfile_write: ignore empty --pid-file
2020-10-01T17:02:52.519+0000 7f15487e2200  1 mgr[py] Loading python module 'alerts'
2020-10-01T17:02:52.621+0000 7f15487e2200  1 mgr[py] Loading python module 'balancer'
2020-10-01T17:02:52.694+0000 7f15487e2200  1 mgr[py] Loading python module 'cephadm'
2020-10-01T17:02:53.039+0000 7f15487e2200  1 mgr[py] Loading python module 'crash'
2020-10-01T17:02:53.114+0000 7f15487e2200  1 mgr[py] Loading python module 'dashboard'
2020-10-01T17:02:54.001+0000 7f15487e2200  1 mgr[py] Loading python module 'devicehealth'
2020-10-01T17:02:54.081+0000 7f15487e2200  1 mgr[py] Loading python module 'diskprediction_local'
2020-10-01T17:02:54.401+0000 7f15487e2200  1 mgr[py] Loading python module 'influx'
2020-10-01T17:02:54.515+0000 7f15487e2200  1 mgr[py] Loading python module 'insights'
2020-10-01T17:02:54.589+0000 7f15487e2200  1 mgr[py] Loading python module 'iostat'
2020-10-01T17:02:54.646+0000 7f15487e2200  1 mgr[py] Loading python module 'k8sevents'
2020-10-01T17:02:55.251+0000 7f15487e2200  1 mgr[py] Loading python module 'localpool'
2020-10-01T17:02:55.306+0000 7f15487e2200  1 mgr[py] Loading python module 'orchestrator'
2020-10-01T17:02:55.518+0000 7f15487e2200  1 mgr[py] Loading python module 'osd_support'
2020-10-01T17:02:55.571+0000 7f15487e2200  1 mgr[py] Loading python module 'pg_autoscaler'
2020-10-01T17:02:55.660+0000 7f15487e2200  1 mgr[py] Loading python module 'progress'
2020-10-01T17:02:55.744+0000 7f15487e2200  1 mgr[py] Loading python module 'prometheus'
2020-10-01T17:02:56.181+0000 7f15487e2200  1 mgr[py] Loading python module 'rbd_support'
2020-10-01T17:02:56.280+0000 7f15487e2200  1 mgr[py] Loading python module 'restful'
2020-10-01T17:02:56.589+0000 7f15487e2200  1 mgr[py] Loading python module 'rook'
2020-10-01T17:02:57.168+0000 7f15487e2200  1 mgr[py] Loading python module 'selftest'
2020-10-01T17:02:57.223+0000 7f15487e2200  1 mgr[py] Loading python module 'status'
2020-10-01T17:02:57.286+0000 7f15487e2200  1 mgr[py] Loading python module 'telegraf'
2020-10-01T17:02:57.343+0000 7f15487e2200  1 mgr[py] Loading python module 'telemetry'
2020-10-01T17:02:57.488+0000 7f15487e2200  1 mgr[py] Loading python module 'test_orchestrator'
2020-10-01T17:02:57.803+0000 7f15487e2200  1 mgr[py] Loading python module 'volumes'
2020-10-01T17:02:58.000+0000 7f15487e2200  1 mgr[py] Loading python module 'zabbix'
2020-10-01T17:02:58.074+0000 7f1535a5f700  0 ms_deliver_dispatch: unhandled message 0x56286fa76420 mon_map magic: 0 v1 from mon.2 v2:172.16.5.117:3300/0
2020-10-01T17:02:58.080+0000 7f1536260700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:58.080+0000 7f1536260700  1 mgr load Constructed class from module: dashboard
2020-10-01T17:02:58.081+0000 7f1536260700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
2020-10-01T17:02:58.081+0000 7f1536260700  1 mgr load Constructed class from module: prometheus
2020-10-01T17:02:58.085+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:02:59.082+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:18:00.087+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:33:00.092+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178
2020-10-01T17:48:00.097+0000 7f1535a5f700  0 client.0 ms_handle_reset on v2:172.16.5.115:6802/2370333178

Actions

Copy link

#20

Updated by Robert Groenewald over 3 years ago

We've just encountered the same issue on 15.2.5

Actions

Copy link

#21

Updated by Sebastian Luna-Valero over 3 years ago

FYI: I upgraded to version 15.2.6 and I see this problem as well.

Actions

Copy link

#22

Updated by David Orman over 3 years ago

We can confirm that we see this in 15.2.7, as well. Our test cluster is made up of 21 nodes, 5 mgrs & mons, all nodes with 24 OSDs each (for a total of 504). We've deployed using cephadm and are running the daemons as containers. We started seeing mgr lockups about 12 hours after the last restart, and we have the same log messages presented above. Restarting the mgrs fixes this. Unfortunately, running without metrics isn't a great solution in a production cluster. Let us know what additional information we can provide to assist in correcting this bug; we're happy to oblige.

Actions

Copy link

#23

Updated by Martin Verges over 3 years ago

We encountered this issue last night on a 750 OSD cluster. Restarting MON/MGR/OSD didn't help. We found 25 OSDs related to the issue that we set OUT and then DOWN. After that and restarting all other OSDs the issue did go away.

Actions

Copy link

#24

Updated by Jeremi A over 3 years ago

I just deployed a brand new cluster with 15.2.8 - got no "active" MGR, it constantly restarts every few seconds.

https://tracker.ceph.com/issues/48689

Actions

Copy link

#25

Updated by David Orman over 3 years ago

We've got a PR in to fix this, we validated it resolves the issue in our larger clusters. We could use some help getting this moved forward since it seems to impact a number of users:

https://github.com/ceph/ceph/pull/38677

Actions

Copy link

#26

Updated by Kefu Chai over 3 years ago

Status changed from New to Fix Under Review
Backport set to octopus, nautilus
Pull request ID set to 38677

Actions

Copy link

#27

Updated by Jeremi A over 3 years ago

David Orman wrote:

We've got a PR in to fix this, we validated it resolves the issue in our larger clusters. We could use some help getting this moved forward since it seems to impact a number of users:

https://github.com/ceph/ceph/pull/38677

Great, currently sitting with a 4.1 PiB cluster "dead in the water" because of this.

Actions

Copy link

#28

Updated by Kefu Chai over 3 years ago

Status changed from Fix Under Review to Pending Backport

Actions

Copy link

#29

Updated by Backport Bot over 3 years ago

Copied to Backport #48713: nautilus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added

Actions

Copy link

#30

Updated by Backport Bot over 3 years ago

Copied to Backport #48714: octopus: Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors added

Actions

Copy link

#31

Updated by Loïc Dachary about 3 years ago

Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » mgr

Custom queries

Bug #39264

Ceph-mgr Hangup and _check_auth_rotating possible clock skew, rotating keys expired way too early Errors

Updated by Stephen Bird about 5 years ago

Updated by Soenke Schippmann almost 5 years ago

Updated by Janek Bevendorff over 4 years ago

Updated by hoan nv about 4 years ago

Updated by hoan nv about 4 years ago

Updated by Lenz Grimmer about 4 years ago

Updated by Janek Bevendorff about 4 years ago

Updated by Jan Fajerski about 4 years ago

Updated by Jan Fajerski about 4 years ago

Updated by Janek Bevendorff about 4 years ago

Updated by Jan Fajerski about 4 years ago

Updated by Janek Bevendorff about 4 years ago

Updated by Lenz Grimmer almost 4 years ago

Updated by Lenz Grimmer almost 4 years ago

Updated by Lenz Grimmer almost 4 years ago

Updated by Lenz Grimmer almost 4 years ago

Updated by Sebastian Luna-Valero over 3 years ago

Updated by Sebastian Luna-Valero over 3 years ago

Updated by Sebastian Luna-Valero over 3 years ago

Updated by Robert Groenewald over 3 years ago

Updated by Sebastian Luna-Valero over 3 years ago

Updated by David Orman over 3 years ago

Updated by Martin Verges over 3 years ago

Updated by Jeremi A over 3 years ago

Updated by David Orman over 3 years ago

Updated by Kefu Chai over 3 years ago

Updated by Jeremi A over 3 years ago

Updated by Kefu Chai over 3 years ago

Updated by Backport Bot over 3 years ago

Updated by Backport Bot over 3 years ago

Updated by Loïc Dachary about 3 years ago