Bug #3569
Monitor & OSD failures when an OSD clock is wrong (closed)
Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Description
Hi, I know that ceph lists time-synced servers as a requirement, but I think a sane failure mode, like a message in the logs instead of uncontrollably growing memory usage, would be a good idea. The NTP process died on me tonight on an OSD (for an unknown reason so far ...) and the clock went 3000 s out of sync; the OSD memory just kept growing, and so did the master mon memory. (Which has the nice effect of the master mon being OOM-killed, then one of the backups takes the master role, grows as well, gets killed, and so on and so forth until there is no quorum anymore.)
It happened very reliably at each attempt to restart the OSD and stopped right when I fixed the clock. Just take a working cluster, take an OSD out, let it rebalance, set the clock of one of the OSDs 50 min too fast, and restart that OSD. I had it occur twice with the same clock sync problem (once in a test cluster with just 2 OSDs IIRC, and once in the prod cluster). I don't hit it anymore because I patched the underlying problem that was causing the clock to jump forward 50 min. If you can't reproduce it locally, I can try to reproduce it again on the test cluster tomorrow. My best guess was that the messages carry a timestamp, the OSD refuses to process messages too far in the future, and it just queues them while waiting (but 50 min worth of messages is a lot of memory). But that's really a wild guess :p
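The failure mode the reporter guesses at, and the "sane" alternative they suggest, can be sketched in a few lines. This is purely illustrative, not ceph's actual code: `NaiveReceiver`, `BoundedReceiver`, and the `ALLOWED_SKEW` bound are all hypothetical names invented for this sketch.

```python
from collections import deque

ALLOWED_SKEW = 300  # seconds; hypothetical bound, not an actual ceph default


class NaiveReceiver:
    """Buffers messages stamped in the future until their time arrives.
    With a peer clock 3000 s fast, this queue grows without bound."""

    def __init__(self):
        self.pending = deque()

    def handle(self, msg_ts, now):
        if msg_ts > now:
            self.pending.append(msg_ts)  # memory grows while we wait
            return "queued"
        return "processed"


class BoundedReceiver:
    """The suggested sane failure mode: log and drop messages whose
    timestamp is further in the future than the allowed skew."""

    def __init__(self):
        self.pending = deque()
        self.log = []

    def handle(self, msg_ts, now):
        skew = msg_ts - now
        if skew > ALLOWED_SKEW:
            self.log.append(
                f"clock skew {skew:.0f}s exceeds {ALLOWED_SKEW}s; dropping message"
            )
            return "dropped"
        if skew > 0:
            self.pending.append(msg_ts)
            return "queued"
        return "processed"


now = 1_000_000.0
naive, bounded = NaiveReceiver(), BoundedReceiver()
for i in range(1000):
    naive.handle(now + 3000 + i, now)    # peer clock ~50 min fast
    bounded.handle(now + 3000 + i, now)

print(len(naive.pending))   # queue keeps growing
print(len(bounded.pending)) # nothing buffered
print(bounded.log[0])       # skew logged instead
```

The naive receiver ends up holding every future-stamped message in memory, which matches the observed unbounded growth; the bounded one trades that for a log line per dropped message.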
It's not queuing up messages until their timestamp is reached; my best guess is that it might be trying to get new cephx keys?