Bug #22977
High CPU load caused by operations on onode_map
Status: Closed
Description
I'm investigating performance on a cluster that shows an unusually high CPU load.
The setup is BlueStore OSDs running mostly a 5/3 erasure-coded pool.
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), centos 7, kernel 4.13.7
Each OSD is handling around 60 write ops and 40 read ops per second, all small operations from kernel RBD clients.
All OSDs are running on SSDs that look fine (low queue sizes/wait times, ~20% io load).
CPU load per OSD process was around 90-130% with 2.2 GHz CPU cores.
No background operations like scrubbing or recovery were ongoing.
I've run perf to see what the OSD is doing. I initially expected that the CPU load was mainly due to the EC setup, but no: it was caused by random operations on the onode_map in BlueStore.
More than 50% of the CPU time is spent inside std::_Hashtable chasing pointers in _M_find_node. If I'm interpreting that correctly, it looks like a textbook hash collision problem in that unordered_map.
Restarting the OSD reduces the CPU usage by 50%, but it slowly increases again.
I've uploaded the perf traces before and after restarting the OSD here:
https://static.croit.io/share/bluestore-onode-cpu/perf-before-restart.data
https://static.croit.io/share/bluestore-onode-cpu/perf-after-restart.data
(~4MB each, external links because I can only upload 1MB here.)
After the restart, I waited 5 minutes after recovery completed before taking the second trace.
I unfortunately wasn't able to reproduce this on my test cluster; I probably had the wrong workload here to test it. But I've seen it on two production clusters.
Updated by Paul Emmerich about 6 years ago
- File perf-dump.json perf-dump.json added
Here's a "perf dump" from an OSD suffering from this.
Potentially relevant onode data that looks similar on all OSDs.
"bluestore_onodes": 1933610,
"bluestore_onode_hits": 3083639062,
"bluestore_onode_misses": 540465668,
"bluestore_onode_shard_hits": 1679445540,
"bluestore_onode_shard_misses": 38863659,
"bluestore_onode_reshard": 16236977,
1.9 million onodes seems like too much; it's a 960 GB SSD.
Updated by Igor Fedotov about 6 years ago
Hi Paul.
Could you please share your Ceph settings?
Also please collect mempool statistics on a saturated OSD using 'ceph daemon osd.NNN dump_mempools' command.
Updated by Paul Emmerich about 6 years ago
- File osd-4-mempool.json osd-4-mempool.json added
The cluster is running with default settings except for osd_max_backfills.
Sorry, I completely forgot about the mempool dump, see attachment.
Unfortunately, there's a recovery going on right now and I didn't save the output yesterday before the recovery started. So maybe not 100% accurate.
But I recall similar values from looking at it yesterday.
Process RSS is 5.5G, mempool claims 3.1G (so roughly within the usually quoted "off by up to +50%").
bluestore_onodes in perf dump was at 2289426 on this OSD when I took this mempool dump.
rados df looks like this:
POOL_NAME USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS RD WR_OPS WR
xxxx 7132 846 0 2538 0 0 17 6863185 8183M 21946 1794k
ecpool 8250G 2593966 489191 20751728 6 0 335624 7034626398 176T 5295910555 72387G
xxxx 221G 56606 0 283030 0 0 436 42333060 3751G 477924 303G
xxxx 0 0 0 0 0 0 0 0 0 0 0
xxxx 0 0 0 0 0 0 0 0 0 0 0
total_objects 2651418
total_used 18992G
total_avail 11409G
total_space 30401G
There are only 2.6M objects, almost all of them in the ecpool which runs a 5/3 profile. But there are quite a few snapshots and most of the RBDs are based on the clone of a template.
Updated by Paul Emmerich about 6 years ago
32 OSDs, so ~650k objects/EC chunks per OSD.
Updated by Paul Emmerich about 6 years ago
I've just observed an OSD crashing with the following log.
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: *** Caught signal (Segmentation fault) **
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: in thread 7fc3b2b85700 thread_name:bstore_mempool
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 1: (()+0xa339e1) [0x5600132329e1]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 2: (()+0xf5e0) [0x7fc3bc64b5e0]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 3: (BlueStore::TwoQCache::_trim(unsigned long, unsigned long)+0x324) [0x5600130ff6a4]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 4: (BlueStore::Cache::trim(unsigned long, float, float, float)+0x1d1) [0x5600130d1d71]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 5: (BlueStore::MempoolThread::entry()+0x14d) [0x5600130d84ad]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 6: (()+0x7e25) [0x7fc3bc643e25]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 7: (clone()+0x6d) [0x7fc3bb73734d]
The last log entries were all:
-2> 2018-02-13 09:36:03.956096 7fc3a2b65700 5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 38451'11168173, trimmed: , trimmed_dups: , clear_divergent_priors: 0
It was at 17GB of memory (RSS) which seems unusual.
Updated by Igor Fedotov about 6 years ago
The last backtrace seems similar to the one from
https://tracker.ceph.com/issues/21259#change-107555
and I'm not sure it relates to the high CPU usage this ticket was created about.
Updated by Sage Weil about 6 years ago
- Related to Bug #21259: bluestore: segv in BlueStore::TwoQCache::_trim added
Updated by Sage Weil about 6 years ago
Hi Paul,
I'm having trouble viewing the perf report files (perf crashes on my box and shows nothing on Mark's). Do you mind rendering the perf report as ASCII? Thanks!
Updated by Sage Weil about 6 years ago
- Status changed from New to Need More Info
Updated by Paul Emmerich about 6 years ago
- File perf-before-restart.txt perf-before-restart.txt added
- File perf-after-restart.txt perf-after-restart.txt added
Sure, thanks for looking into it :)
Updated by Adam Kupczyk about 6 years ago
Hi Paul,
I made a development build, based on 12.2.2.
It will dump stats from onode_map, a container we suspect has a broken hash.
Every 1000 BlueStore operations, this build scans the onode map and displays information like:
"ONODE_MAP elements=8510 BUCKET count=15173 BUCKET max=4".
In my testing it has always behaved properly. I would really like to confirm that we have hash collisions.
Obviously it is enough to check this on one node.
The code (based on 12.2.2) is available here:
https://github.com/ceph/ceph-ci/tree/wip-22977-inspect-onode-map
You can fetch and compile, or try OSD from build packages:
https://shaman.ceph.com/builds/ceph/wip-22977-inspect-onode-map/494f106bc425022bfab2a55a34933d5b3cbf422c/
Best regards,
Adam Kupczyk
Updated by Paul Emmerich about 6 years ago
Thank you!
I'll run this on one OSD and report back tomorrow as it takes some time for the problem to appear.
But it already looks like this after 5 minutes:
2018-03-05 10:36:23.943177 7fa027027700 0 bluestore.OnodeSpace(0x5630826cc608 in 0x56307e3bfe20) ONODE_MAP elements=522 BUCKET count=823 BUCKET max=85
2018-03-05 10:36:33.792141 7fa023820700 0 bluestore.OnodeSpace(0x5630826cb108 in 0x56307e3bfd40) ONODE_MAP elements=909 BUCKET count=1741 BUCKET max=112
2018-03-05 10:36:42.732337 7fa026826700 0 bluestore.OnodeSpace(0x5630826caa08 in 0x56307e3bff00) ONODE_MAP elements=1115 BUCKET count=1741 BUCKET max=309
2018-03-05 10:36:54.435833 7fa025824700 0 bluestore.OnodeSpace(0x5630826cd788 in 0x56307e82a0e0) ONODE_MAP elements=771 BUCKET count=823 BUCKET max=113
2018-03-05 10:37:01.833181 7fa028029700 0 bluestore.OnodeSpace(0x5630826d5948 in 0x56307e3bd5e0) ONODE_MAP elements=266 BUCKET count=409 BUCKET max=158
Updated by Daniel Pryor about 6 years ago
Paul Emmerich wrote:
Thank you!
I'll run this on one OSD and report back tomorrow as it takes some time for the problem to appear.
But it already looks like this after 5 minutes:
[...]
I'm having what feels like the same problem. I'll be posting debug logs in the morning.
ceph v12.2.4
High CPU load and 50% io with no client load or recovery. No EC either.
Updated by Daniel Pryor about 6 years ago
Paul Emmerich wrote:
Thank you!
I'll run this on one OSD and report back tomorrow as it takes some time for the problem to appear.
But it already looks like this after 5 minutes:
[...]
Paul, what perf command are you running? I want to see if my results are similar.
Updated by Adam Kupczyk about 6 years ago
Hi Paul,
An attempt to rectify the hash problem is here:
https://shaman.ceph.com/builds/ceph/wip-22977-inspect-onode-map/
Logging info is still there.
Awaiting your input.
Regards,
Adam
Updated by Daniel Pryor about 6 years ago
Adam, can you get me a build for 12.2.4? My results are pretty immediate.
Updated by Paul Emmerich about 6 years ago
Thanks! I've updated one host and will report back tomorrow.
This is what the output looked like after ~20 hours with just the log:
44k elements in one bucket, yikes!
2018-03-06 07:00:35.305617 7fa028029700 0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=61703 BUCKET count=126271 BUCKET max=44462
2018-03-06 07:00:43.580324 7fa024021700 0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=61750 BUCKET count=126271 BUCKET max=44449
2018-03-06 07:00:52.526121 7fa028029700 0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=61859 BUCKET count=126271 BUCKET max=44429
2018-03-06 07:00:59.398315 7fa02201d700 0 bluestore.OnodeSpace(0x5630826d39c8 in 0x56307e82a000) ONODE_MAP elements=11228 BUCKET count=30727 BUCKET max=2269
2018-03-06 07:01:04.995306 7fa028029700 0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=62066 BUCKET count=126271 BUCKET max=44426
Running with commit ce9563a5cfb87db2957eaa225ddbbc5b6438e4cb yields (after 10 minutes):
2018-03-06 07:33:52.138424 7f6089839700 0 bluestore.OnodeSpace(0x55e1fd0da148 in 0x55e1fcb3baa0) ONODE_MAP elements=1100 BUCKET count=1741 BUCKET max=241
2018-03-06 07:34:02.087571 7f608a83b700 0 bluestore.OnodeSpace(0x55e1fd08c988 in 0x55e1fcb3b8e0) ONODE_MAP elements=1984 BUCKET count=3739 BUCKET max=646
2018-03-06 07:34:11.511887 7f608e843700 0 bluestore.OnodeSpace(0x55e1fd08c988 in 0x55e1fcb3b8e0) ONODE_MAP elements=2003 BUCKET count=3739 BUCKET max=646
2018-03-06 07:34:19.958191 7f6088837700 0 bluestore.OnodeSpace(0x55e1fd0db2c8 in 0x55e1fcb3bc60) ONODE_MAP elements=636 BUCKET count=823 BUCKET max=377
2018-03-06 07:34:28.571263 7f608c03e700 0 bluestore.OnodeSpace(0x55e1fd08cd08 in 0x55e1fcb395e0) ONODE_MAP elements=1394 BUCKET count=1741 BUCKET max=446
CPU load is still low, but that hash table is already getting unbalanced.
I'll report back tomorrow with results after a longer run time, but it doesn't look good :(
Daniel: "perf top --pid XXX" is the simplest/quickest way to peek at what a process is doing.
Updated by Adam Kupczyk about 6 years ago
Daniel,
Version for 12.2.4:
https://shaman.ceph.com/builds/ceph/wip-22977-inspect-onode-map-12-2-4/
Updated by Paul Emmerich about 6 years ago
CPU load is ~30% lower than my "control group" now, but it's still pretty bad (as in: > 90% of the CPU time is spent in that hash map)
Number of collisions:
2018-03-07 04:32:27.037115 7f608c83f700 0 bluestore.OnodeSpace(0x55e1fd091cc8 in 0x55e1fcb3bc60) ONODE_MAP elements=44004 BUCKET count=126271 BUCKET max=6122
2018-03-07 04:32:33.702145 7f608c83f700 0 bluestore.OnodeSpace(0x55e1fd091cc8 in 0x55e1fcb3bc60) ONODE_MAP elements=44051 BUCKET count=126271 BUCKET max=6163
2018-03-07 04:32:41.721362 7f608c03e700 0 bluestore.OnodeSpace(0x55e1fd834d88 in 0x55e1fcb395e0) ONODE_MAP elements=22669 BUCKET count=30727 BUCKET max=15375
2018-03-07 04:32:48.724598 7f6090046700 0 bluestore.OnodeSpace(0x55e1fd0dcd08 in 0x55e1fcb395e0) ONODE_MAP elements=11806 BUCKET count=30727 BUCKET max=4784
2018-03-07 04:32:53.536191 7f608a03a700 0 bluestore.OnodeSpace(0x55e1fd08c608 in 0x55e1fcb3b9c0) ONODE_MAP elements=44156 BUCKET count=126271 BUCKET max=9436
2018-03-07 04:32:59.247743 7f608a83b700 0 bluestore.OnodeSpace(0x55e1fd0db648 in 0x55e1fcb3b8e0) ONODE_MAP elements=24499 BUCKET count=30727 BUCKET max=2065
2018-03-07 04:33:04.893470 7f608d841700 0 bluestore.OnodeSpace(0x55e1fd08c448 in 0x55e1fcb3baa0) ONODE_MAP elements=43016 BUCKET count=62233 BUCKET max=2562
Updated by Adam Kupczyk about 6 years ago
Paul,
Have you been using the latest builds, IDs 93123 - 93126?
This is the only build that actually tries to fix the hash problem; the previous ones only print stats.
Updated by Paul Emmerich about 6 years ago
Yes, perf top now shows the new hash_helper struct as key in the table.
Updated by Paul Emmerich about 6 years ago
{"version":"12.2.2-7-gce9563a","release":"luminous","release_type":"stable"}
52.07% tp_osd_tp ceph-osd [.] std::_Hashtable<BlueStore::OnodeSpace::ghobject_hash_helper_t, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<BlueStore::OnodeSpace::ghobject_hash_helper_t>, BlueStore::OnodeSpace::onode_hasher_t, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node
15.55% bstore_mempool ceph-osd [.] std::_Hashtable<BlueStore::OnodeSpace::ghobject_hash_helper_t, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<BlueStore::OnodeSpace::ghobject_hash_helper_t>, BlueStore::OnodeSpace::onode_hasher_t, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node
3.19% tp_osd_tp libc-2.17.so [.] __memcmp_sse4_1
0.92% bstore_mempool libc-2.17.so [.] __memcmp_sse4_1
0.60% tp_osd_tp libtcmalloc.so.4.2.6 [.] operator new
0.44% tp_osd_tp libtcmalloc.so.4.2.6 [.] operator delete
0.41% tp_osd_tp ceph-osd [.] std::operator==<char>
0.34% tp_osd_tp libstdc++.so.6.0.19 [.] std::_Rb_tree_increment
0.30% bstore_kv_sync ceph-osd [.] rocksdb::MemTable::KeyComparator::operator()
Updated by Adam Kupczyk about 6 years ago
Paul,
Just improved hashing. Please test.
https://shaman.ceph.com/builds/ceph/wip-22977-inspect-onode-map/
Updated by Paul Emmerich about 6 years ago
Awesome, that fixed it, thanks :)
2018-03-08 14:03:21.062642 7f8f76817700 0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=12899 BUCKET count=15173 BUCKET max=6
2018-03-08 14:03:38.040037 7f8f76817700 0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=13540 BUCKET count=15173 BUCKET max=6
2018-03-08 14:03:48.194717 7f8f7a81f700 0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=13997 BUCKET count=15173 BUCKET max=6
2018-03-08 14:04:02.148098 7f8f7a81f700 0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=14654 BUCKET count=15173 BUCKET max=6
2018-03-08 14:04:06.502681 7f8f7a81f700 0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=14852 BUCKET count=15173 BUCKET max=6
Updated by Patrick Donnelly almost 6 years ago
- Status changed from Need More Info to Pending Backport
- Target version set to v14.0.0
Updated by Patrick Donnelly almost 6 years ago
- Copied to Backport #24720: mimic: High CPU load caused by operations on onode_map added
Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved