Project

General

Profile

Bug #22977

High CPU load caused by operations on onode_map

Added by Paul Emmerich 10 months ago. Updated 5 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
Start date:
02/09/2018
Due date:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

I'm investigating performance on a cluster that shows an unusually high CPU load.
Setup are Bluestore OSDs running mostly a 5/3 erasure coding pool.

ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), centos 7, kernel 4.13.7

Each OSD is handling around 60 write ops and 40 read ops per second, all small operations from kernel RBD clients.
All OSDs are running on SSDs that look fine (low queue sizes/wait times, ~20% io load).
CPU load per OSD process was around 90-130% with 2.2 GHz CPU cores.
No background operations like scrubbing or recovery were ongoing.

I've run perf to see what the OSD is doing. I've initially expected that the CPU load was mainly due to the EC setup but no: it was caused by random operations on the onode_map in Bluestore.

More than 50% of the CPU time is spent inside std::_Hashtable chasing pointers in _M_find_node. Looks like a textbook hash collision problem in that unordered_map to me if I interpreted that correctly.
Restarting the OSD reduces the CPU usage by 50%, but it's slowly increasing again

I've uploaded the perf traces before and after restarting the OSD here:
https://static.croit.io/share/bluestore-onode-cpu/perf-before-restart.data
https://static.croit.io/share/bluestore-onode-cpu/perf-after-restart.data
(~4MB each, external links because I can only upload 1MB here.)

I waited for 5 minutes after the recovery completed after the restart before taking the second trace.

I unfortunately wasn't able to reproduce this on my test cluster; I probably had the wrong workload here to test it. But I've seen it on two production clusters.

perf-dump.json (28 KB) Paul Emmerich, 02/10/2018 10:02 PM

osd-4-mempool.json (1.6 KB) Paul Emmerich, 02/12/2018 11:18 AM

perf-before-restart.txt View (579 KB) Paul Emmerich, 02/27/2018 05:08 PM

perf-after-restart.txt View (769 KB) Paul Emmerich, 02/27/2018 05:08 PM


Related issues

Related to bluestore - Bug #21259: bluestore: segv in BlueStore::TwoQCache::_trim Resolved 09/06/2017
Copied to bluestore - Backport #24720: mimic: High CPU load caused by operations on onode_map Resolved

History

#1 Updated by Paul Emmerich 10 months ago

Here's a "perf dump" from an OSD suffering from this.

Potentially relevant onode data that looks similar on all OSDs.

"bluestore_onodes": 1933610,
"bluestore_onode_hits": 3083639062,
"bluestore_onode_misses": 540465668,
"bluestore_onode_shard_hits": 1679445540,
"bluestore_onode_shard_misses": 38863659,
"bluestore_onode_reshard": 16236977,

1.9 million onodes seems like too much. it's a 960GB SSD.

#2 Updated by Igor Fedotov 10 months ago

Hi Paul.
Could you please share your Ceph settings?
Also please collect mempool statistics on a saturated OSD using 'ceph daemon osd.NNN dump_mempools' command.

#3 Updated by Paul Emmerich 10 months ago

The cluster is running with default settings except for osd_max_backfills.

Sorry, I completely forgot about the mempool dump, see attachment.
Unfortunately, there's a recovery going on right now and I didn't save the output yesterday before the recovery started. So maybe not 100% accurate.
But I recall similar values from looking at it yesterday.

Process RSS is 5.5G, mempool claims 3.1G (so roughly within the usually quoted "off by up to +50%")

bluestore_onodes in perf dump was at 2289426 on this OSD when I took this mempool dump.

rados df looks like this:

POOL_NAME   USED  OBJECTS CLONES COPIES   MISSING_ON_PRIMARY UNFOUND DEGRADED RD_OPS     RD    WR_OPS     WR     
xxxx 7132 846 0 2538 0 0 17 6863185 8183M 21946 1794k
ecpool 8250G 2593966 489191 20751728 6 0 335624 7034626398 176T 5295910555 72387G
xxxx 221G 56606 0 283030 0 0 436 42333060 3751G 477924 303G
xxxx 0 0 0 0 0 0 0 0 0 0 0
xxxx 0 0 0 0 0 0 0 0 0 0 0
total_objects    2651418
total_used 18992G
total_avail 11409G
total_space 30401G

There are only 2.6M objects, almost all of them in the ecpool which runs a 5/3 profile. But there are quite a few snapshots and most of the RBDs are based on the clone of a template.

#4 Updated by Paul Emmerich 10 months ago

32 OSDs, so ~650k objects/ec chunks per OSDs.

#5 Updated by Paul Emmerich 10 months ago

I've just observed an OSD crashing with the following log.

Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: ** Caught signal (Segmentation fault) *
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: in thread 7fc3b2b85700 thread_name:bstore_mempool
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 1: (()+0xa339e1) [0x5600132329e1]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 2: (()+0xf5e0) [0x7fc3bc64b5e0]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 3: (BlueStore::TwoQCache::_trim(unsigned long, unsigned long)+0x324) [0x5600130ff6a4]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 4: (BlueStore::Cache::trim(unsigned long, float, float, float)+0x1d1) [0x5600130d1d71]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 5: (BlueStore::MempoolThread::entry()+0x14d) [0x5600130d84ad]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 6: (()+0x7e25) [0x7fc3bc643e25]
Feb 13 09:36:03 ceph-XXX-osd-a10 ceph-osd[24106]: 7: (clone()+0x6d) [0x7fc3bb73734d]

Last log entries where all:

-2> 2018-02-13 09:36:03.956096 7fc3a2b65700  5 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 38451'11168173, trimmed: , trimmed_dups: , clear_divergent_priors: 0

It was at 17GB of memory (RSS) which seems unusual.

#6 Updated by Igor Fedotov 10 months ago

The last backtrace seems similar to the one from
https://tracker.ceph.com/issues/21259#change-107555

and not sure this relates to high CPU usage this ticket was created about.

#7 Updated by Sage Weil 10 months ago

  • Related to Bug #21259: bluestore: segv in BlueStore::TwoQCache::_trim added

#8 Updated by Sage Weil 10 months ago

Hi Paul,

I'm having trouble viewing the perf report files (crashes on my box; doesn't shows but shows nothing on Mark's). Do you mind rendering the perf report as ascii? Thanks!

#9 Updated by Sage Weil 10 months ago

  • Status changed from New to Need More Info

#10 Updated by Paul Emmerich 10 months ago

Sure, thanks for looking into it :)

#11 Updated by Adam Kupczyk 10 months ago

Hi Paul,

I made a development build, based on 12.2.2.
It will dump stats from onode_map, a container we suspect to have broken hash.
This build, every 1000 operations on bluestore, scans onode map and display information like:
"ONODE_MAP elements=8510 BUCKET count=15173 BUCKET max=4".

In my testing it has always behaved properly. I would really like to confirm we have hash collision.
Obviously it is enough to check it on one node.

The code (based on 12.2.2) is available here:
https://github.com/ceph/ceph-ci/tree/wip-22977-inspect-onode-map

You can fetch and compile, or try OSD from build packages:
https://shaman.ceph.com/builds/ceph/wip-22977-inspect-onode-map/494f106bc425022bfab2a55a34933d5b3cbf422c/

Best regards,
Adam Kupczyk

#12 Updated by Paul Emmerich 10 months ago

Thank you!

I'll run this on one OSD and report back tomorrow as it takes some time for the problem to appear.

But it already looks like this after 5 minutes:

2018-03-05 10:36:23.943177 7fa027027700  0 bluestore.OnodeSpace(0x5630826cc608 in 0x56307e3bfe20) ONODE_MAP elements=522 BUCKET count=823 BUCKET max=85
2018-03-05 10:36:33.792141 7fa023820700  0 bluestore.OnodeSpace(0x5630826cb108 in 0x56307e3bfd40) ONODE_MAP elements=909 BUCKET count=1741 BUCKET max=112
2018-03-05 10:36:42.732337 7fa026826700  0 bluestore.OnodeSpace(0x5630826caa08 in 0x56307e3bff00) ONODE_MAP elements=1115 BUCKET count=1741 BUCKET max=309
2018-03-05 10:36:54.435833 7fa025824700  0 bluestore.OnodeSpace(0x5630826cd788 in 0x56307e82a0e0) ONODE_MAP elements=771 BUCKET count=823 BUCKET max=113
2018-03-05 10:37:01.833181 7fa028029700  0 bluestore.OnodeSpace(0x5630826d5948 in 0x56307e3bd5e0) ONODE_MAP elements=266 BUCKET count=409 BUCKET max=158

#13 Updated by Daniel Pryor 10 months ago

Paul Emmerich wrote:

Thank you!

I'll run this on one OSD and report back tomorrow as it takes some time for the problem to appear.

But it already looks like this after 5 minutes:

[...]

I'm having what feels like the same problem. Ill be posting debug logs in the morning.

ceph v12.2.4

High CPU load and 50% io with no client load or recovery. No EC either.

#14 Updated by Daniel Pryor 10 months ago

Paul Emmerich wrote:

Thank you!

I'll run this on one OSD and report back tomorrow as it takes some time for the problem to appear.

But it already looks like this after 5 minutes:

[...]

Paul what perf command are you running? I want to see if my results are similar?

#15 Updated by Adam Kupczyk 10 months ago

Hi Paul,

An attempt to rectify hash problem is here:
https://shaman.ceph.com/builds/ceph/wip-22977-inspect-onode-map/

Logging info is still there.
Awaiting your input.
Regards,
Adam

#16 Updated by Daniel Pryor 10 months ago

Adam, can you get me a build for 12.2.4? My results are pretty immediate.

#17 Updated by Paul Emmerich 10 months ago

Thanks! I've updated one host and will report back tomorrow.

This is what the output looked like after ~20 hours with just the log:
44k elements in one bucket, yikes!

2018-03-06 07:00:35.305617 7fa028029700  0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=61703 BUCKET count=126271 BUCKET max=44462
2018-03-06 07:00:43.580324 7fa024021700  0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=61750 BUCKET count=126271 BUCKET max=44449
2018-03-06 07:00:52.526121 7fa028029700  0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=61859 BUCKET count=126271 BUCKET max=44429
2018-03-06 07:00:59.398315 7fa02201d700  0 bluestore.OnodeSpace(0x5630826d39c8 in 0x56307e82a000) ONODE_MAP elements=11228 BUCKET count=30727 BUCKET max=2269
2018-03-06 07:01:04.995306 7fa028029700  0 bluestore.OnodeSpace(0x5630826cb808 in 0x56307e3bd5e0) ONODE_MAP elements=62066 BUCKET count=126271 BUCKET max=44426

Running with commit ce9563a5cfb87db2957eaa225ddbbc5b6438e4cb yields: (after 10 minutes)

2018-03-06 07:33:52.138424 7f6089839700  0 bluestore.OnodeSpace(0x55e1fd0da148 in 0x55e1fcb3baa0) ONODE_MAP elements=1100 BUCKET count=1741 BUCKET max=241
2018-03-06 07:34:02.087571 7f608a83b700  0 bluestore.OnodeSpace(0x55e1fd08c988 in 0x55e1fcb3b8e0) ONODE_MAP elements=1984 BUCKET count=3739 BUCKET max=646
2018-03-06 07:34:11.511887 7f608e843700  0 bluestore.OnodeSpace(0x55e1fd08c988 in 0x55e1fcb3b8e0) ONODE_MAP elements=2003 BUCKET count=3739 BUCKET max=646
2018-03-06 07:34:19.958191 7f6088837700  0 bluestore.OnodeSpace(0x55e1fd0db2c8 in 0x55e1fcb3bc60) ONODE_MAP elements=636 BUCKET count=823 BUCKET max=377
2018-03-06 07:34:28.571263 7f608c03e700  0 bluestore.OnodeSpace(0x55e1fd08cd08 in 0x55e1fcb395e0) ONODE_MAP elements=1394 BUCKET count=1741 BUCKET max=446

CPU load is still low, but that hash table is already the
I'll report back tomorrow with results after a longer run time, but it doesn't look good :(

Daniel: perf top --pid XXX is the simplest/quickest way to peek at what a process is doing, if you see that

#19 Updated by Paul Emmerich 9 months ago

CPU load is ~30% lower than my "control group" now, but it's still pretty bad (as in: > 90% of the CPU time is spent in that hash map)

Number of collisions:

2018-03-07 04:32:27.037115 7f608c83f700  0 bluestore.OnodeSpace(0x55e1fd091cc8 in 0x55e1fcb3bc60) ONODE_MAP elements=44004 BUCKET count=126271 BUCKET max=6122
2018-03-07 04:32:33.702145 7f608c83f700  0 bluestore.OnodeSpace(0x55e1fd091cc8 in 0x55e1fcb3bc60) ONODE_MAP elements=44051 BUCKET count=126271 BUCKET max=6163
2018-03-07 04:32:41.721362 7f608c03e700  0 bluestore.OnodeSpace(0x55e1fd834d88 in 0x55e1fcb395e0) ONODE_MAP elements=22669 BUCKET count=30727 BUCKET max=15375
2018-03-07 04:32:48.724598 7f6090046700  0 bluestore.OnodeSpace(0x55e1fd0dcd08 in 0x55e1fcb395e0) ONODE_MAP elements=11806 BUCKET count=30727 BUCKET max=4784
2018-03-07 04:32:53.536191 7f608a03a700  0 bluestore.OnodeSpace(0x55e1fd08c608 in 0x55e1fcb3b9c0) ONODE_MAP elements=44156 BUCKET count=126271 BUCKET max=9436
2018-03-07 04:32:59.247743 7f608a83b700  0 bluestore.OnodeSpace(0x55e1fd0db648 in 0x55e1fcb3b8e0) ONODE_MAP elements=24499 BUCKET count=30727 BUCKET max=2065
2018-03-07 04:33:04.893470 7f608d841700  0 bluestore.OnodeSpace(0x55e1fd08c448 in 0x55e1fcb3baa0) ONODE_MAP elements=43016 BUCKET count=62233 BUCKET max=2562

#20 Updated by Adam Kupczyk 9 months ago

Paul,
Have you been using latest builds? IDs 93123 - 93126 ?
This is the only build that actually tries to fix hash problem, previous only print stats.

#21 Updated by Paul Emmerich 9 months ago

Yes, perf top now shows the new hash_helper struct as key in the table.

#22 Updated by Paul Emmerich 9 months ago

{"version":"12.2.2-7-gce9563a","release":"luminous","release_type":"stable"}
    52.07%  tp_osd_tp        ceph-osd              [.] std::_Hashtable<BlueStore::OnodeSpace::ghobject_hash_helper_t, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<BlueStore::OnodeSpace::ghobject_hash_helper_t>, BlueStore::OnodeSpace::onode_hasher_t, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node
    15.55%  bstore_mempool   ceph-osd              [.] std::_Hashtable<BlueStore::OnodeSpace::ghobject_hash_helper_t, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> >, mempool::pool_allocator<(mempool::pool_index_t)4, std::pair<BlueStore::OnodeSpace::ghobject_hash_helper_t const, boost::intrusive_ptr<BlueStore::Onode> > >, std::__detail::_Select1st, std::equal_to<BlueStore::OnodeSpace::ghobject_hash_helper_t>, BlueStore::OnodeSpace::onode_hasher_t, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node
     3.19%  tp_osd_tp        libc-2.17.so          [.] __memcmp_sse4_1
     0.92%  bstore_mempool   libc-2.17.so          [.] __memcmp_sse4_1
     0.60%  tp_osd_tp        libtcmalloc.so.4.2.6  [.] operator new
     0.44%  tp_osd_tp        libtcmalloc.so.4.2.6  [.] operator delete
     0.41%  tp_osd_tp        ceph-osd              [.] std::operator==<char>
     0.34%  tp_osd_tp        libstdc++.so.6.0.19   [.] std::_Rb_tree_increment
     0.30%  bstore_kv_sync   ceph-osd              [.] rocksdb::MemTable::KeyComparator::operator()

#23 Updated by Adam Kupczyk 9 months ago

#24 Updated by Paul Emmerich 9 months ago

Awesome, that fixed it, thanks :)

2018-03-08 14:03:21.062642 7f8f76817700  0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=12899 BUCKET count=15173 BUCKET max=6
2018-03-08 14:03:38.040037 7f8f76817700  0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=13540 BUCKET count=15173 BUCKET max=6
2018-03-08 14:03:48.194717 7f8f7a81f700  0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=13997 BUCKET count=15173 BUCKET max=6
2018-03-08 14:04:02.148098 7f8f7a81f700  0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=14654 BUCKET count=15173 BUCKET max=6
2018-03-08 14:04:06.502681 7f8f7a81f700  0 bluestore.OnodeSpace(0x56120f734448 in 0x56120b25db80) ONODE_MAP elements=14852 BUCKET count=15173 BUCKET max=6

#25 Updated by Patrick Donnelly 6 months ago

  • Status changed from Need More Info to Pending Backport
  • Target version set to v14.0.0

#26 Updated by Patrick Donnelly 6 months ago

  • Backport set to mimic

#27 Updated by Patrick Donnelly 6 months ago

  • Copied to Backport #24720: mimic: High CPU load caused by operations on onode_map added

#28 Updated by Nathan Cutler 5 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF