Bug #698
cosd memory usage with large number of pools (Closed)
Description
I reported this on the mailing list a week ago but never filed it here. Still present in 0.24.1.
I've got a 3-node test cluster (3 mons, 3 osds) with about 24,000,000
very small objects across 2400 pools (written directly with librados;
this isn't a Ceph filesystem).
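For scale, the PG count grows with the number of pools, which is likely why memory use tracks pool count. A back-of-the-envelope sketch (the per-pool pg_num default and replication factor below are assumptions for illustration, not values taken from this cluster):

```python
# Rough sketch of why many pools inflate cosd memory: each pool adds
# pg_num placement groups, and every PG replica carries in-memory state
# on the osd that hosts it. PG_PER_POOL and REPLICAS are assumptions.

POOLS = 2400        # pools in the test cluster (from the report)
PG_PER_POOL = 8     # assumed default pg_num per pool
REPLICAS = 2        # assumed replication factor
OSDS = 3            # osds in the test cluster (from the report)

total_pgs = POOLS * PG_PER_POOL                 # in the same ballpark as
                                                # the ~20936 pgs in the log
pg_replicas_per_osd = total_pgs * REPLICAS // OSDS

print(f"total PGs: {total_pgs}")
print(f"PG replicas per osd: {pg_replicas_per_osd}")
```

Under these assumptions each osd carries thousands of PG replicas, so even a modest per-PG memory cost multiplies into gigabytes.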
The cosd processes have steadily grown in RAM usage and have finally
exhausted RAM and are getting killed by the OOM killer (the nodes have
6 GB of RAM and no swap).
When I start them back up, they very quickly grow in RAM usage
again and get killed.
I'm running ceph 0.24 (and 0.24.1) on 64-bit Ubuntu Lucid servers. In case it's
useful, I've only written these objects serially: no reads, no
rewrites, no updates or snapshots.
I haven't touched the pg_nums on this cluster that I recall (it's been
up a couple of weeks but has been used nearly exclusively for writing
this test data).
tcmalloc heap profiling report attached (last profile before cosd was killed).
cosd debug output from around the time of the final ram exhaustion:
2011-01-05 00:17:58.532524 mon e1: 3 mons at {0=10.135.211.78:6789/0,1=10.61.136.222:6789/0,2=10.202.105.222:6789/0}
2011-01-05 00:22:53.325264 osd e10659: 3 osds: 3 up, 3 in
2011-01-05 00:22:53.383272 pg v151295: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
2011-01-05 00:22:53.422433 log 2011-01-05 00:22:53.325027 mon0 10.135.211.78:6789/0 4 : [INF] osd0 10.135.211.78:6801/31836 boot
2011-01-05 00:24:47.301186 pg v151296: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
<cosd crashes here>
2011-01-05 00:25:52.422340 log 2011-01-05 00:25:52.189259 mon0 10.135.211.78:6789/0 5 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:25:57.265635 log 2011-01-05 00:25:57.121870 mon0 10.135.211.78:6789/0 6 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:26:02.341805 osd e10660: 3 osds: 2 up, 3 in
2011-01-05 00:26:02.362526 log 2011-01-05 00:26:02.127627 mon0 10.135.211.78:6789/0 7 : [INF] osd0 10.135.211.78:6801/31836 failed (by osd2 10.61.136.222:6800/915)
2011-01-05 00:26:02.470942 pg v151297: 20936 pgs: 1 creating, 2 peering, 10352 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 476 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 12489924/49420044 degraded (25.273%)
2011-01-05 00:26:12.578266 pg v151298: 20936 pgs: 1 creating, 2 peering, 3393 crashed+peering, 3052 active+clean+degraded, 7053 degraded+peering, 7435 crashed+degraded+peering; 24130 MB data, 266 GB used, 332 GB / 630 GB avail; 20728862/49420044 degraded (41.944%)