
Bug #5700

very high memory usage after update

Added by Corin Langosch over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor

Description

With bobtail a few months ago my osds used around 500 MB after a restart but grew over time, due to memory leaks.

When I upgraded from bobtail to cuttlefish 61.3, the osds started to consume around 1 GB right after restart. They didn't grow much over time, so it seems there are far fewer memory leaks. But still, that's a lot of memory...

Today I upgraded to the latest cuttlefish 61.5 and now my osds consume 1.6 - 2.5 GB right after restart. This is really bad, as it totally changes how many osds I can run on a single machine; I also wasn't able to restart all osds because my machines wouldn't have had enough memory.

I really wonder why the osds need that much memory, and since the documentation says osds should take around 500 MB, I clearly think this is a big bug/regression.

Please let me know how I can help to debug so this issue can be solved asap.

corin.tar.gz (2.43 MB) Corin Langosch, 09/03/2013 08:54 AM

History

#1 Updated by Corin Langosch over 10 years ago

Just a small update: I hoped the memory usage would go down after some hours, but it stays high:

ceph version 0.61.5 (8ee10dc4bb73bdd918873f29c70eedc3c7ef1979)

root     10283  2.1  0.2 761104 134364 ?       Sl   Jul20  17:24 /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf
root     14330  9.0  3.0 3286468 2012940 ?     Ssl  Jul20  71:40 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf
root     25295  8.1  2.8 2999896 1894104 ?     Ssl  Jul20  61:47 /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf
root     25406  8.7  2.9 3217812 1973528 ?     Ssl  Jul20  65:45 /usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf

on another host

root      9783  6.6  6.5 2943520 2146796 ?     Ssl  Jul20  52:37 /usr/bin/ceph-osd -i 12 --pid-file /var/run/ceph/osd.12.pid -c /etc/ceph/ceph.conf
root     11037  7.0  4.2 2649968 1403744 ?     Ssl  Jul20  55:14 /usr/bin/ceph-osd -i 13 --pid-file /var/run/ceph/osd.13.pid -c /etc/ceph/ceph.conf

#2 Updated by Mark Nelson over 10 years ago

Hi,

Could you tell me a couple of things about your cluster?

How many PGs total across all of your pools?

How much replication?

When you start the cluster up and there is high memory usage, is the cluster scrubbing? (ceph -w should tell you).

Are these packages (and if so, what OS?) or did you compile Ceph yourself?

Is tcmalloc enabled?

#3 Updated by Corin Langosch over 10 years ago

Hi Mark,

Here's the output of ceph osd dump:

epoch 3388
fsid 4ac0e21b-6ea2-4ac7-8114-122bd9ba55d6
created 2013-02-17 12:50:11.549322
modified 2013-07-20 20:29:06.110250
flags 

pool 5 'ssd' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 29 owner 0
pool 6 'hdd' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 30 owner 0

max_osd 15
osd.0 up   in  weight 1 up_from 3323 up_thru 3339 down_at 3321 last_clean_interval [3319,3320) 10.0.0.4:6804/22388 10.0.0.4:6805/22388 10.0.0.4:6806/22388 exists,up c33d9eb5-3466-4078-9da8-553768fa98fe
osd.1 up   in  weight 1 up_from 3298 up_thru 3385 down_at 3296 last_clean_interval [3010,3295) 10.0.0.4:6802/22412 10.0.0.4:6809/22412 10.0.0.4:6810/22412 exists,up 6812e61b-d3c7-44b1-ab74-c13e4805be00
osd.2 up   in  weight 1 up_from 3353 up_thru 3384 down_at 3351 last_clean_interval [3302,3352) 10.0.0.4:6800/23716 10.0.0.4:6807/23716 10.0.0.4:6808/23716 exists,up a3747638-0ee8-4839-a7c8-f6936b27b2cb
osd.3 up   in  weight 1 up_from 3327 up_thru 3339 down_at 3325 last_clean_interval [3282,3324) 10.0.0.5:6805/14328 10.0.0.5:6806/14328 10.0.0.5:6807/14328 exists,up 63a1e842-0103-48fd-91fd-8cc6b8a35859
osd.4 up   in  weight 1 up_from 3381 up_thru 3381 down_at 3376 last_clean_interval [3286,3375) 10.0.0.5:6800/25293 10.0.0.5:6801/25293 10.0.0.5:6802/25293 exists,up b92c0c38-4595-4b0e-a98a-78de1874100d
osd.5 up   in  weight 1 up_from 3383 up_thru 3383 down_at 3378 last_clean_interval [3290,3377) 10.0.0.5:6803/25404 10.0.0.5:6804/25404 10.0.0.5:6808/25404 exists,up c9150d50-dfbc-4d55-9951-0caf18629444
osd.6 up   in  weight 1 up_from 3331 up_thru 3339 down_at 3329 last_clean_interval [3270,3328) 10.0.0.6:6800/18318 10.0.0.6:6802/18318 10.0.0.6:6803/18318 exists,up 15e5b405-adbf-46aa-b698-7bc4c69778d3
osd.7 up   in  weight 1 up_from 3368 up_thru 3383 down_at 3358 last_clean_interval [3278,3357) 10.0.0.6:6801/28326 10.0.0.6:6804/28326 10.0.0.6:6805/28326 exists,up 6a9014cf-19b5-48a8-a468-bfd279c1c7b5
osd.8 up   in  weight 1 up_from 3366 up_thru 3383 down_at 3360 last_clean_interval [3274,3359) 10.0.0.6:6806/28434 10.0.0.6:6807/28434 10.0.0.6:6808/28434 exists,up 5a4cea68-5fc8-4320-ba58-5a276dc95511
osd.9 up   in  weight 1 up_from 3335 up_thru 3339 down_at 3333 last_clean_interval [3262,3332) 10.0.0.7:6803/14828 10.0.0.7:6804/14828 10.0.0.7:6805/14828 exists,up a36db462-0862-4064-8200-8ff91dc7316e
osd.10 up   in  weight 1 up_from 3349 up_thru 3383 down_at 3347 last_clean_interval [3307,3346) 10.0.0.7:6800/15947 10.0.0.7:6801/15947 10.0.0.7:6802/15947 exists,up 21915a11-30e1-4612-a35d-d2405bb63617
osd.11 down out weight 0 up_from 1062 up_thru 1367 down_at 1372 last_clean_interval [379,1056) 10.0.0.7:6803/23663 10.0.0.7:6805/23663 10.0.0.7:6806/23663 autoout,exists 576e642d-ed76-43db-9815-14bdb438e533
osd.12 up   in  weight 1 up_from 3339 up_thru 3339 down_at 3337 last_clean_interval [3314,3336) 10.0.0.8:6800/9781 10.0.0.8:6802/9781 10.0.0.8:6806/9781 exists,up fbb71a0c-5d97-4a23-8f74-7e1a130bb60d
osd.13 up   in  weight 1 up_from 3343 up_thru 3383 down_at 3341 last_clean_interval [3258,3340) 10.0.0.8:6801/11035 10.0.0.8:6803/11035 10.0.0.8:6804/11035 exists,up c0acd06a-4bff-4b92-afcb-f10b8b0268c7

Every few minutes some osd is scrubbing but I don't think it has any impact on the memory usage. I restarted almost all osds and they all have quite high memory usage (1.3 GB to 2.3 GB).

Here's the output of ps aux | grep ceph:

10283  2.1  0.2 728996 137620 ?       Sl   Jul20  61:27  /usr/bin/ceph-mon -i a --pid-file /var/run/ceph/mon.a.pid -c /etc/ceph/ceph.conf
12929  1.6  0.2 729636 161432 ?       Sl   Jul20  46:34  /usr/bin/ceph-mon -i b --pid-file /var/run/ceph/mon.b.pid -c /etc/ceph/ceph.conf
 8747  1.6  0.2 633504 160416 ?       Sl   Jul20  46:50  /usr/bin/ceph-mon -i c --pid-file /var/run/ceph/mon.c.pid -c /etc/ceph/ceph.conf
22390 34.8 19.2 3166452 1573844 ?     Ssl  Jul20 982:04  /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf
22414 14.1 16.4 2737916 1344548 ?     Ssl  Jun17 7138:16 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf
23718 15.4 18.6 3608532 1521792 ?     Ssl  Jun17 7766:29 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf
14330  9.0  2.3 2982828 1518052 ?     Ssl  Jul20 255:49  /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf
25295  8.3  1.8 2692020 1242940 ?     Ssl  Jul20 232:51  /usr/bin/ceph-osd -i 4 --pid-file /var/run/ceph/osd.4.pid -c /etc/ceph/ceph.conf
25406  8.4  2.3 3017540 1526996 ?     Ssl  Jul20 235:07  /usr/bin/ceph-osd -i 5 --pid-file /var/run/ceph/osd.5.pid -c /etc/ceph/ceph.conf
18320  6.0  2.1 2976612 1423440 ?     Ssl  Jul20 170:53  /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf
28328  7.4  1.9 2766676 1263956 ?     Ssl  Jul20 206:47  /usr/bin/ceph-osd -i 7 --pid-file /var/run/ceph/osd.7.pid -c /etc/ceph/ceph.conf
28436  7.1  2.0 2699084 1329996 ?     Ssl  Jul20 198:36  /usr/bin/ceph-osd -i 8 --pid-file /var/run/ceph/osd.8.pid -c /etc/ceph/ceph.conf
14830  7.6  3.0 3618872 1998092 ?     Ssl  Jul20 215:25  /usr/bin/ceph-osd -i 9 --pid-file /var/run/ceph/osd.9.pid -c /etc/ceph/ceph.conf
15950  6.8  1.9 2671816 1265128 ?     Ssl  Jul20 192:37  /usr/bin/ceph-osd -i 10 --pid-file /var/run/ceph/osd.10.pid -c /etc/ceph/ceph.conf
 9783  6.3  6.7 3517092 2221328 ?     Ssl  Jul20 177:16  /usr/bin/ceph-osd -i 12 --pid-file /var/run/ceph/osd.12.pid -c /etc/ceph/ceph.conf
11037  6.8  3.8 2623248 1280464 ?     Ssl  Jul20 192:51  /usr/bin/ceph-osd -i 13 --pid-file /var/run/ceph/osd.13.pid -c /etc/ceph/ceph.conf

As you can see I restarted everything, except osd 1 and osd 2.

System is Ubuntu 12.10, the packages are from the official ceph repository:

ii  ceph                                    0.61.5-1quantal              amd64        distributed storage and file system
ii  ceph-common                             0.61.5-1quantal              amd64        common utilities to mount and interact with a ceph storage cluster
ii  ceph-fs-common                          0.61.5-1quantal              amd64        common utilities to mount and interact with a ceph file system
ii  ceph-fuse                               0.61.5-1quantal              amd64        FUSE-based client for the Ceph distributed file system
ii  ceph-mds                                0.61.5-1quantal              amd64        metadata server for the ceph distributed file system
ii  libcephfs1                              0.61.5-1quantal              amd64        Ceph distributed file system client library

I'm not sure if tcmalloc is enabled; I didn't specify anything special. How can I check?
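
(For reference, one way to check this is to look at what the binary links in; this is a sketch and assumes the packaged ceph-osd is dynamically linked against libtcmalloc:)

ldd /usr/bin/ceph-osd | grep -i tcmalloc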

Finally here's my ceph.conf (same on all nodes):

[global]
  auth cluster required = cephx
  auth service required = cephx
  auth client required = cephx
  cephx require signatures = true
  public network = 10.0.0.0/24
  cluster network = 10.0.0.0/24

[client]
  rbd cache = true
  rbd cache size = 33554432
  rbd cache max dirty = 25165824
  rbd cache target dirty = 16777216
  rbd cache max dirty age = 3

[osd]
  osd journal size = 1000
  osd journal dio = true
  osd journal aio = true
  osd journal = /xfs-drive1/$cluster-osd-$id.journal
  osd op threads = 8
  filestore op threads = 16
  filestore max sync interval = 10
  filestore min sync interval = 3

[mon.a]
  host = n103
  mon addr = 10.0.0.5:6789

[mon.b]
  host = n104
  mon addr = 10.0.0.6:6789

[mon.c]
  host = n105
  mon addr = 10.0.0.7:6789

[osd.0]
  host = n102

[osd.1]
  host = n102

[osd.2]
  host = n102

[osd.3]
  host = n103

[osd.4]
  host = n103

[osd.5]
  host = n103

[osd.6]
  host = n104

[osd.7]
  host = n104

[osd.8]
  host = n104

[osd.9]
  host = n105

[osd.10]
  host = n105

[osd.11]
  host = n105

[osd.12]
  host = n106

[osd.13]
  host = n106

[osd.14]
  host = n106

#4 Updated by Mark Nelson over 10 years ago

Hi Corin,

tcmalloc should be enabled if you are using our packages. Would you mind generating a core dump from one of the high-memory OSD processes and packaging it up with the ceph-osd binary?

You can get the core file by running gcore:

gcore [-o filename] pid
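
For illustration, a concrete invocation against one of the large processes above might look like this (a sketch; the osd id, output path, and archive name are assumptions based on this thread, and gcore appends the pid to the output filename):

pid=$(cat /var/run/ceph/osd.12.pid)
gcore -o /tmp/ceph-osd.12 "$pid"                       # writes /tmp/ceph-osd.12.<pid>
tar czf corin.tar.gz /tmp/ceph-osd.12."$pid" /usr/bin/ceph-osd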

I've emailed you with instructions on where to send the file once you've got a core dump.

Thanks!
Mark

#5 Updated by Corin Langosch over 10 years ago

Hi Mark,

I just uploaded the archive. It's called corin.tar.gz.

While taking the core dump (which took only a couple of seconds) the osd memory usage grew from around 1.9 GB to 2.8 GB. Now, around 30 minutes later, it's still almost 2.8 GB, but it seems to be slowly decreasing. Also, all other osds went down a few hundred MB each during the last few days, but they are all still consuming 1.2 GB+.

Let me know if you need anything else.

Corin

#6 Updated by Sage Weil over 10 years ago

  • Assignee set to David Zafman

#7 Updated by Ian Colle over 10 years ago

  • Priority changed from Urgent to High

#8 Updated by Sage Weil over 10 years ago

  • Status changed from New to Can't reproduce

I don't see anything strange from the core. I suspect this is just lots of pgs...

#9 Updated by Corin Langosch over 10 years ago

Hi Sage,

To be honest, I'm a little disappointed by your answer. 8192 isn't a lot of pgs? The docs say 50-100 pgs per osd per pool. So 2 pools with 4096 pgs each just allows for 40 - 80 osds - which isn't really that much?

Why did the memory consumption change that much from bobtail to cuttlefish? I didn't change the pools in any way.

There's also a big discrepancy with the docs, as they state an osd would consume 200 - 500 MB (http://ceph.com/docs/next/install/hardware-recommendations/). I know David is still waiting for a detailed debug log from me, which I'll provide within a short while. But if ceph's memory requirements are really that high, fixing the docs to allow for proper resource planning is really a must.

Corin

#10 Updated by Sage Weil over 10 years ago

Corin Langosch wrote:

> Hi Sage,

> To be honest, I'm a little disappointed by your answer. 8192 isn't a lot of pgs? The docs say 50-100 pgs per osd per pool. So 2 pools with 4096 pgs each just allows for 40 - 80 osds - which isn't really that much?

It's the pgs per osd that matters. But yeah, I'm not happy with my answer either, but I don't have much else to go on. My suspicion is that we will see the memory consumed in the usual places with the per-PG in-memory state. Having the massif results will let us confirm that. Maybe we have more hash_map's in use than before (those can quickly eat RAM) and we just didn't notice.

> Why did the memory consumption change that much from bobtail to cuttlefish? I didn't change the pools in any way.

> There's also a big discrepancy with the docs, as they state an osd would consume 200 - 500 MB (http://ceph.com/docs/next/install/hardware-recommendations/). I know David is still waiting for a detailed debug log from me, which I'll provide within a short while. But if ceph's memory requirements are really that high, fixing the docs to allow for proper resource planning is really a must.

We should probably update that to estimate in terms of PGs per OSD to be a bit more accurate.
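
As a rough illustration of that kind of estimate (an editorial sketch; the ~100 PGs per OSD target is the commonly cited rule of thumb, not a figure from this thread):

# total PGs across all pools ~ osds * 100 / replica_count
echo $(( 14 * 100 / 2 ))    # ~700 PGs for 14 osds at 2x replication, versus 8192 in this cluster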

Either way, if you can provide the massif output that will help tremendously.

Thanks, Corin!

#11 Updated by Corin Langosch over 10 years ago

> It's the pgs per osd that matters. But yeah, I'm not happy with my answer either, but I don't have much else to go on. My suspicion is that we will see the memory consumed in the usual places with the per-PG in-memory state. Having the massif results will let us confirm that. Maybe we have more hash_map's in use than before (those can quickly eat RAM) and we just didn't notice.

Then it'd really be great if one could start with a small number of pgs but grow as more osds are added. Afaik ceph has some support for it, but it's not stable yet? Does expanding cause a lot of data shifting?

To check how the number of pgs affects the osd memory consumption I just added a new pool with "rados mkpool test 4096 4096". Now ceph -w shows 12288 pgs, but I didn't notice any change in the osds' memory usage. Should this have affected the memory usage, or do I need to place some objects into that pool first?

Also, osd memory usage differs quite strongly from osd to osd. They were all started at the same time, but after some weeks one osd for example uses 1.5 GB RAM, while another uses 3 GB RAM. There were no recoveries during the last 2-3 weeks, so memory usage should be almost equal?

I'll do the debugging now.. :)

Corin

#12 Updated by Corin Langosch over 10 years ago

Here we go:

  • restart with valgrind

valgrind --tool=massif /usr/bin/ceph-osd -i 12 --pid-file /var/run/ceph/osd.12.pid -c /etc/ceph/ceph.conf -f

startup (till cluster reports the osd up/in again):
- took around 20 minutes
- always taking 100% cpu
- grew to around 3.5 gb

peering/recovery
- took around 10 minutes, then aborted
- aborted because osd was reported as down by other osds (too slow)

  • restart without valgrind

service ceph start osd

startup (till cluster reports the osd up/in again):
- took only a few seconds
- always taking 100% cpu
- grew to around 2.8 gb

peering/recovery
- took only a few seconds
- grew to around 3.0 gb
- finished successfully

I uploaded my ceph-osd and the massif output to cephdrop. The filename is corin.tar.gz.

md5sum /usr/bin/ceph-osd
f3e762a608b2bedaea1b9baf4066cedf /usr/bin/ceph-osd

md5sum massif.out.14535
6030beeaf2232a8f7e71502583fac6c5 massif.out.14535

md5sum corin.tar.gz
83e1aba34fb700ab9f4a4dbcaf395a47 corin.tar.gz
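
For reference, massif output like the file above is normally turned into a readable allocation report with ms_print, which ships with valgrind (a sketch; the filename matches the one listed):

ms_print massif.out.14535 | less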

To me it looks like a startup problem, as the process grows that much before even joining the cluster?

#13 Updated by Corin Langosch over 10 years ago

Looks like ceph is reading the whole log file (1 GB) into memory and not freeing it again?

#14 Updated by Corin Langosch over 10 years ago

Looks like ceph is reading the whole log file (1 GB) into memory and not freeing it again?

#15 Updated by Corin Langosch over 10 years ago

I just thought recreating the journal would help, but it didn't help at all.

kill osd.12
/usr/bin/ceph-osd -i 12 --flush-journal
/usr/bin/ceph-osd -i 12 --mkjournal
restart osd.12

Memory usage is as high as it was with the old journal :-(

#16 Updated by Sage Weil over 10 years ago

Corin Langosch wrote:

> I just thought recreating the journal would help, but it didn't help at all.

> kill osd.12
> /usr/bin/ceph-osd -i 12 --flush-journal
> /usr/bin/ceph-osd -i 12 --mkjournal
> restart osd.12

> Memory usage is as high as it was with the old journal :-(

It's a different log than the journal.

Try reducing these by a factor of 10:

osd_min_pg_log_entries = 3000
osd_max_pg_log_entries = 10000

So, 300 and 1000. Note that it will not be able to trim until after it peers, and it is controlled by the primary, so all osds will need to restart with that setting. Once it does trim, the heap isn't always freed back to the OS, but after a second restart (of a single osd) I think you will see its memory utilization go down.
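
In ceph.conf terms the suggested reduction would look something like this (a sketch mirroring the [osd] section posted earlier in this thread; the space-separated names are the conf-file spelling of the options above):

[osd]
  osd min pg log entries = 300
  osd max pg log entries = 1000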

The memory utilization is basically num_pgs * num_log_entries, and the log entries only appear after you've done lots of write activity. I think this is why you don't see usage go up when you create a new pool.
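
As a back-of-envelope check of that relationship (an editorial sketch; every figure comes from this thread except the per-entry size, which is an assumption):

pg_copies_per_osd=$(( 2 * 4096 * 2 / 13 ))       # 2 pools x 4096 PGs x 2 replicas over 13 up osds, ~1260
entries_per_osd=$(( pg_copies_per_osd * 3000 ))  # at the cuttlefish default of 3000 log entries per pg
echo "$(( entries_per_osd * 300 / 1024 / 1024 )) MB"   # assuming ~300 bytes per entry, roughly 1 GB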

#17 Updated by Corin Langosch over 10 years ago

Do these settings affect data safety in any way? The cluster is an important production one, so I cannot really play around with it. Probably we could combine it with the upgrade of the cluster to the latest dumpling, when the next point release is out (if it contains all fixes for the reported slowdowns since cuttlefish)?

From the docs osd_min_pg_log_entries is 1000 by default (not 3000?). I couldn't find osd_max_pg_log_entries in the docs. How can I check the values my cluster is currently using?
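
For what it's worth, the running values can usually be read through the OSD admin socket (a sketch; the socket path assumes the default location):

ceph --admin-daemon /var/run/ceph/ceph-osd.12.asok config show | grep pg_log_entries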

But still, I really wonder why ceph needs to read that much information on startup? Can't this be greatly reduced? Why isn't the memory freed as soon as it's not needed anymore - shouldn't this be fixed as well? If we had an osd binary with these fixes I'd be happy to give it a try on a single osd.
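
(Editorial aside: tcmalloc can also be asked to hand unused pages back to the OS; I believe a heap release command was available through the tell interface in this era, but that is an assumption, and it would not shrink what the pg logs themselves hold in memory.)

ceph tell osd.12 heap release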

#18 Updated by Corin Langosch over 10 years ago

BTW, it would be nice if this issue could be re-opened. I cannot do this myself.. :(

#19 Updated by Sage Weil over 10 years ago

  • Status changed from Can't reproduce to 7
  • Assignee changed from David Zafman to Sage Weil

Corin Langosch wrote:

> Do these settings affect data safety in any way? The cluster is an important production one, so I cannot really play around with it. Probably we could combine it with the upgrade of the cluster to the latest dumpling, when the next point release is out (if it contains all fixes for the reported slowdowns since cuttlefish)?

> From the docs osd_min_pg_log_entries is 1000 by default (not 3000?). I couldn't find osd_max_pg_log_entries in the docs. How can I check the values my cluster is currently using?

> But still, I really wonder why ceph needs to read that much information on startup? Can't this be greatly reduced? Why isn't the memory freed as soon as it's not needed anymore - shouldn't this be fixed as well? If we had an osd binary with these fixes I'd be happy to give it a try on a single osd.

They don't affect data safety. They do need to be high enough to cover a reasonable window of activity as the log is used to prevent resent ops (i.e., after a client is temporarily disconnected or the data mapping changes) from being reapplied. Longer logs also expand the window during which an OSD can be down and come back up and rejoin without doing a full backfill/sync on its data.

Can you verify that lowering these values reduces your memory consumption?

#20 Updated by Sage Weil over 10 years ago

Also, massif should have generated a report file that indicates which callers are allocating all of the memory. Can you attach that? Thanks!

#21 Updated by Corin Langosch over 10 years ago

For testing I'd like to wait for the next dumpling release (hopefully with http://tracker.ceph.com/issues/6040 fixed). I'll then restart the whole cluster with the new settings (osd_min_pg_log_entries = 100 and osd_max_pg_log_entries = 1000). Is this ok for you?

There was no report file, only one with extension massif. But I'll double check after upgrading and testing again.

#22 Updated by Sage Weil over 10 years ago

Corin Langosch wrote:

> For testing I'd like to wait for the next dumpling release (hopefully with http://tracker.ceph.com/issues/6040 fixed). I'll then restart the whole cluster with the new settings (osd_min_pg_log_entries = 100 and osd_max_pg_log_entries = 1000). Is this ok for you?

Sounds good. Should be out today or tomorrow.

> There was no report file, only one with extension massif. But I'll double check after upgrading and testing again.

That's the one! Can you attach?

thanks-

#23 Updated by Corin Langosch over 10 years ago

Great! There are no other show stoppers and the upgrade should be smooth, right? :)

I uploaded my binary and the massif file to cephdrop, see http://tracker.ceph.com/issues/5700#note-12. But I still have it here, so I attached it :)

#24 Updated by Sage Weil over 10 years ago

Ah, sorry, I missed that.

And yeah, the massif output confirms that ~80% of the heap is consumed by the pg logs. Reducing those values will help considerably, as will keeping the pg count fixed as your cluster expands over time and PGs spread out over a larger number of OSDs.

FWIW we increased that value from 1000 -> 3000 just before cuttlefish.

#25 Updated by Corin Langosch over 10 years ago

Ah ok, that increase might explain the increased memory usage. We'll know for sure in a few days :)

But anyway, is it really necessary to load all the logs into memory and not free them again? Sorry for bugging you with that, but it just doesn't feel right to me. Especially as from your comment I assume the logs are only needed during recovery or startup (to see what changed), but not during normal operation?

#26 Updated by Corin Langosch over 10 years ago

I just upgraded to dumpling: adjusted ceph.conf, restarted all mons, restarted all osds

After the restart, the osds still consume a lot of memory:

root 19135 0.0 0.0 8112 924 pts/0 S+ 14:30 0:00 grep --color=auto ceph
root 27851 12.8 8.0 3363176 2638292 ? Ssl 13:48 5:17 /usr/bin/ceph-osd -i 12 --pid-file /var/run/ceph/osd.12.pid -c /etc/ceph/ceph.conf
root 30685 18.1 5.7 2711064 1907044 ? Ssl 13:51 6:58 /usr/bin/ceph-osd -i 13 --pid-file /var/run/ceph/osd.13.pid -c /etc/ceph/ceph.conf

I'll let them run 1-2 days to see what happens.

This is my current osd config:

[osd]
osd journal size = 1000
osd journal dio = true
osd journal aio = true
osd journal = /xfs-drive1/$cluster-osd-$id.journal
osd op threads = 8
osd min pg log entries = 1000
osd max pg log entries = 3000
filestore op threads = 16
filestore max sync interval = 10
filestore min sync interval = 3

Please note that I didn't change to min = 100 and max = 500, as I read that it could cause problems (http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/002770.html). Is it really safe to go to min = 100 and max = 500 (it's a kvm production cluster)? Can I change those values without restarting all daemons again?
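
For reference, settings like these can generally be changed on a running daemon with injectargs (a sketch; whether the pg log actually trims without a restart is uncertain, per the earlier note that trimming is driven by the primary):

ceph tell osd.12 injectargs '--osd_min_pg_log_entries 100 --osd_max_pg_log_entries 500'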

#27 Updated by Corin Langosch over 10 years ago

So after 5 days of running the latest dumpling, memory usage is still quite high:

root     17320  0.0  0.0   8112   924 pts/0    S+   21:58   0:00 grep --color=auto ceph
root     27851  7.8  5.1 4020792 1695804 ?     Ssl  Sep12 489:57 /usr/bin/ceph-osd -i 12 --pid-file /var/run/ceph/osd.12.pid -c /etc/ceph/ceph.conf
root     30685 13.5  4.1 2723340 1351056 ?     Ssl  Sep12 848:44 /usr/bin/ceph-osd -i 13 --pid-file /var/run/ceph/osd.13.pid -c /etc/ceph/ceph.conf
root     22984 44.3 19.6 3287164 1604404 ?     Ssl  Sep12 2761:22 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf
root     25798  0.0  0.0   8112   936 pts/1    S+   22:00   0:00 grep --color=auto ceph
root     28596 33.0 13.5 3094128 1108760 ?     Ssl  Sep12 2057:44 /usr/bin/ceph-osd -i 2 --pid-file /var/run/ceph/osd.2.pid -c /etc/ceph/ceph.conf
root     32160 28.8 12.7 2845976 1041860 ?     Ssl  Sep12 1796:40 /usr/bin/ceph-osd -i 1 --pid-file /var/run/ceph/osd.1.pid -c /etc/ceph/ceph.conf

Is there any chance that ceph can be made much less memory hungry?

#28 Updated by Sage Weil over 10 years ago

  • Status changed from 7 to Resolved

After reviewing this again, there are 2 things:

1- the default # of pg log entries increased from bobtail to cuttlefish
2- you have a lot of pgs given your number of osds. This will eventually get better as you expand your cluster over time.

All indications are that there are no leaks or other regressions. Closing this bug!

#29 Updated by Corin Langosch over 10 years ago

Allow me this last question: why does all this log information have to be kept in memory all the time?

#30 Updated by Greg Farnum over 10 years ago

I created #6570 for that, Corin. There are tradeoffs involved and some of them are probably worth making, but it's not a quick fix. :)
