Bug #3569
Monitor & OSD failures when an OSD clock is wrong (closed)
Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Description
Hi, I know that ceph lists time-synced servers as a requirement, but I think a sane failure mode, like a message in the logs instead of uncontrollably growing memory usage, would be a good idea. The NTP process died on me tonight on an OSD (for an unknown reason so far ...) and the clock went 3000 s out of sync; the OSD memory just kept growing, and so did the master mon memory. (Which has the nice effect of the master mon being OOM-killed, then one of the backups takes the master role, grows as well, gets killed, and so on and so forth until there is no quorum anymore.)
It happened very reliably at each attempt to restart the OSD and stopped right when I fixed the clock. Just take a working cluster, take an OSD out, let it rebalance, set the clock of one of the OSDs 50 min too fast, and restart that OSD. I had it occur twice with the same clock sync problem (once in a test cluster with just 2 OSDs IIRC, and once in the prod cluster). I don't hit it anymore because I patched the underlying problem that was causing the clock to jump forward 50 min. If you can't reproduce it locally, I can try to reproduce it again on the test cluster tomorrow. My best guess was that the messages carry a timestamp, the OSD refuses to process messages too far in the future, and it just queues them while waiting (but 50 min worth of messages is a lot of memory). But that's really a wild guess :p
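The failure mode the reporter guesses at, and the "sane" alternative they suggest, can be sketched in a few lines. This is purely illustrative, not ceph's actual code: `NaiveReceiver`, `BoundedReceiver`, and the `ALLOWED_SKEW` bound are all hypothetical names invented for this sketch.

```python
from collections import deque

ALLOWED_SKEW = 300  # seconds; hypothetical bound, not an actual ceph default


class NaiveReceiver:
    """Buffers messages stamped in the future until their time arrives.
    With a peer clock 3000 s fast, this queue grows without bound."""

    def __init__(self):
        self.pending = deque()

    def handle(self, msg_ts, now):
        if msg_ts > now:
            self.pending.append(msg_ts)  # memory grows while we wait
            return "queued"
        return "processed"


class BoundedReceiver:
    """The suggested sane failure mode: log and drop messages whose
    timestamp is further in the future than the allowed skew."""

    def __init__(self):
        self.pending = deque()
        self.log = []

    def handle(self, msg_ts, now):
        skew = msg_ts - now
        if skew > ALLOWED_SKEW:
            self.log.append(
                f"clock skew {skew:.0f}s exceeds {ALLOWED_SKEW}s; dropping message"
            )
            return "dropped"
        if skew > 0:
            self.pending.append(msg_ts)
            return "queued"
        return "processed"


now = 1_000_000.0
naive, bounded = NaiveReceiver(), BoundedReceiver()
for i in range(1000):
    naive.handle(now + 3000 + i, now)    # peer clock ~50 min fast
    bounded.handle(now + 3000 + i, now)

print(len(naive.pending))   # queue keeps growing
print(len(bounded.pending)) # nothing buffered
print(bounded.log[0])       # skew logged instead
```

The naive receiver ends up holding every future-stamped message in memory, which matches the observed unbounded growth; the bounded one trades that for a log line per dropped message.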
It's not queuing up messages until their timestamp is reached; my best guess is that it might be trying to get new cephx keys?