Project

General

Profile

Bug #12065

Memory leak and high CPU usage on OSD when system clock is in the future

Added by Paul Emmerich over 3 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Urgent
Category:
-
Target version:
-
Start date:
06/17/2015
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Hi,

we noticed that all OSDs on one of our storage server were leaking memory (up to 20 GB per OSD and day) and had an unusually high CPU usage (about 5 to 10 times that of other servers with the same hardware).
However, the OSDs seemed to work just fine. The log files contained no suspicious entries and the I/O performance was not affected.
Profiling with perf showed that the additional CPU time was spent mostly with memory allocation.

A quick investigation showed that the system was missconfigured: the clock was about 8 hours in the future.
Fixing the system clock also fixed the memory leaks and CPU usage.

This behavior was also reproducible on other systems.
My quick tests suggest that this happens if the clock is more than about 1 or 2 hours in the future when the OSD starts. Setting the clock to the correct time reduces the CPU usage instantly without restarting the OSDs.

The obvious fix is clearly to not missconfigure your systems ;)
Still, I think the OSDs shouldn't leak memory in this scenario. Maybe there should even be a warning for excessive clock skew on OSDs?

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)

Paul


Related issues

Copied to Ceph - Backport #15336: hammer: Memory leak and high CPU usage on OSD when system clock is in the future Resolved

Associated revisions

Revision 918c12c2 (diff)
Added by Alexey Sheplyakov almost 3 years ago

monclient: avoid key renew storm on clock skew

Refreshing rotating keys too often is a symptom of a clock skew, try to
detect it and don't cause extra problems:

  • MonClient::_check_auth_rotating:
    - detect and report premature keys expiration due to a time skew
    - rate limit refreshing the keys to avoid excessive RAM and CPU usage
    (both by OSD in question and monitors which have to process a lot
    of auth messages)
  • MonClient::wait_auth_rotating: wait for valid (not expired) keys
  • OSD::init(): bail out after 10 attempts to obtain the rotating keys

Fixes: #12065

Signed-off-by: Alexey Sheplyakov <>

History

#1 Updated by Jason Dillaman over 3 years ago

  • Project changed from rbd to Ceph

#2 Updated by Samuel Just over 3 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Alexey Sheplyakov almost 3 years ago

The OSD keeps trying to renew the rotating keys which fails due to the clock skew:

2016-03-15 17:57:19.576609 7fc63f84a700 10 cephx client: handle_response ret = 0
2016-03-15 17:57:19.576611 7fc63f84a700 10 cephx client: get_rotating_key
2016-03-15 17:57:19.576634 7fc63f84a700 10 auth: dump_rotating:
2016-03-15 17:57:19.576643 7fc63f84a700 10 auth: id 212 AQDu6OdW7nFQKRAAhcBJi+rg7x6MbVVZnBrXxA== expires 2016-03-15 13:50:22.693130
2016-03-15 17:57:19.576650 7fc63f84a700 10 auth: id 213 AQAPBehWdkA5JxAAXl+TvWF9C4Pafhl1qDe3lg== expires 2016-03-15 14:50:23.658054
2016-03-15 17:57:19.576653 7fc63f84a700 10 auth: id 214 AQAfE+hW25ywNhAA7QtJun0uYvU6eScVrfrKVQ== expires 2016-03-15 15:50:23.917532
2016-03-15 17:57:19.576658 7fc63f84a700 10 cephx: validate_tickets want 37 have 37 need 0
2016-03-15 17:57:19.576659 7fc63f84a700 20 cephx client: need_tickets: want=37 need=0 have=37
2016-03-15 17:57:19.576661 7fc63f84a700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2016-03-15 17:56:49.576660)
2016-03-15 17:57:19.576663 7fc63f84a700 10 cephx client: build_rotating_request
2016-03-15 17:57:19.576664 7fc63f84a700 10 monclient: _send_mon_message to mon.saceph-mon at 10.253.0.20:6789/0
2016-03-15 17:57:19.576855 7fc63f84a700 10 cephx client: handle_response ret = 0
2016-03-15 17:57:19.576856 7fc63f84a700 10 cephx client: get_rotating_key
2016-03-15 17:57:19.576878 7fc63f84a700 10 auth: dump_rotating:
2016-03-15 17:57:19.576886 7fc63f84a700 10 auth: id 212 AQDu6OdW7nFQKRAAhcBJi+rg7x6MbVVZnBrXxA== expires 2016-03-15 13:50:22.693130
2016-03-15 17:57:19.576893 7fc63f84a700 10 auth: id 213 AQAPBehWdkA5JxAAXl+TvWF9C4Pafhl1qDe3lg== expires 2016-03-15 14:50:23.658054
2016-03-15 17:57:19.576896 7fc63f84a700 10 auth: id 214 AQAfE+hW25ywNhAA7QtJun0uYvU6eScVrfrKVQ== expires 2016-03-15 15:50:23.917532
2016-03-15 17:57:19.576901 7fc63f84a700 10 cephx: validate_tickets want 37 have 37 need 0
2016-03-15 17:57:19.576902 7fc63f84a700 20 cephx client: need_tickets: want=37 need=0 have=37
2016-03-15 17:57:19.576903 7fc63f84a700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2016-03-15 17:56:49.576903)
2016-03-15 17:57:19.576905 7fc63f84a700 10 cephx client: build_rotating_request
2016-03-15 17:57:19.576906 7fc63f84a700 10 monclient: _send_mon_message to mon.saceph-mon at 10.253.0.20:6789/0
2016-03-15 17:57:19.577067 7fc63f84a700 10 cephx client: handle_response ret = 0
2016-03-15 17:57:19.577068 7fc63f84a700 10 cephx client: get_rotating_key
2016-03-15 17:57:19.577088 7fc63f84a700 10 auth: dump_rotating:
2016-03-15 17:57:19.577095 7fc63f84a700 10 auth: id 212 AQDu6OdW7nFQKRAAhcBJi+rg7x6MbVVZnBrXxA== expires 2016-03-15 13:50:22.693130
2016-03-15 17:57:19.577101 7fc63f84a700 10 auth: id 213 AQAPBehWdkA5JxAAXl+TvWF9C4Pafhl1qDe3lg== expires 2016-03-15 14:50:23.658054
2016-03-15 17:57:19.577104 7fc63f84a700 10 auth: id 214 AQAfE+hW25ywNhAA7QtJun0uYvU6eScVrfrKVQ== expires 2016-03-15 15:50:23.917532
2016-03-15 17:57:19.577108 7fc63f84a700 10 cephx: validate_tickets want 37 have 37 need 0
2016-03-15 17:57:19.577109 7fc63f84a700 20 cephx client: need_tickets: want=37 need=0 have=37
2016-03-15 17:57:19.577110 7fc63f84a700 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2016-03-15 17:56:49.577110)
2016-03-15 17:57:19.577112 7fc63f84a700 10 cephx client: build_rotating_request
2016-03-15 17:57:19.577113 7fc63f84a700 10 monclient: _send_mon_message to mon.saceph-mon at 10.253.0.20:6789/0

This involves a copious allocation/reclaiming of the MAuth messages (which burns a lot of CPU cycles).

#4 Updated by Alexey Sheplyakov almost 3 years ago

  • Assignee set to Alexey Sheplyakov

#5 Updated by Alexey Sheplyakov almost 3 years ago

The patch https://github.com/ceph/ceph/pull/8258 solves the problem for me

#6 Updated by Sage Weil almost 3 years ago

  • Status changed from New to Testing

#7 Updated by Sage Weil almost 3 years ago

  • Status changed from Testing to Pending Backport
  • Backport set to hammer

#8 Updated by Loic Dachary almost 3 years ago

  • Copied to Backport #15336: hammer: Memory leak and high CPU usage on OSD when system clock is in the future added

#9 Updated by Loic Dachary over 2 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF