Fix #5232
osd: slow peering due to pg log rewrites
Description
I noticed that since cuttlefish the OSD recovery process has been extremely slow. Client I/O to the recovering OSD also gets stalled, ending up in slow request log messages.
Things I noticed:
- the recovering OSD uses 100-200% CPU until it has recovered
- nearly no disk I/O from that OSD
- under bobtail I had much lower CPU usage and high disk I/O (which makes sense)
- recovering from a simple OSD restart (catching up on objects changed while it was stopped) is awfully slow; even though my cluster was only degraded by 0.1% (24 TB / 3 replicas), it took 10-20 min
- I'm using XFS
This is a major blocker, as recovery is nearly impossible.
Used version: current upstream/cuttlefish.
I've provided Sage with a gdb thread-apply-all backtrace.
Greets
Stefan
Related issues
Associated revisions
Merge remote-tracking branch 'upstream/wip-log-rewrite-sam'
Fixes: #5232
Reviewed-by: Sage Weil <sage@inktank.com>
History
#1 Updated by Sage Weil almost 10 years ago
- File gdb.txt.gz added
- Category set to OSD
this thread?
No symbol table info available.
#1  0x00007fb90be5d1f5 in ?? () from /usr/lib/libleveldb.so.1
No symbol table info available.
#2  0x00007fb90be4781a in ?? () from /usr/lib/libleveldb.so.1
No symbol table info available.
#3  0x00007fb90be47c3d in ?? () from /usr/lib/libleveldb.so.1
No symbol table info available.
#4  0x000000000079a75a in LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string const&) ()
No symbol table info available.
#5  0x0000000000798b1f in LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix (this=0x2b4f02c0, prefix=...) at os/LevelDBStore.cc:141
        it = {<std::tr1::__shared_ptr<KeyValueDB::IteratorImpl, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x36889740, _M_refcount = {_M_pi = 0x31ab5b20}}, <No data fields>}
#6  0x000000000078a9f6 in DBObjectMap::clear_header (this=0x328ab00, header=DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjuction with DW_OP_piece.) at os/DBObjectMap.cc:1158
        keys = {_M_t = {_M_impl = {<std::allocator<std::_Rb_tree_node<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >> = {<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >> = {<No data fields>}, <No data fields>}, _M_key_compare = {<std::binary_function<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>}, <No data fields>}, _M_header = {_M_color = 7894675, _M_parent = 0x7fb904ac73b8, _M_left = 0x7fb90be11fe0, _M_right = 0x0}, _M_node_count = 13107200}}}
#7  0x000000000078af8a in DBObjectMap::_clear (this=0x328ab00, header=DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjuction with DW_OP_piece.) at os/DBObjectMap.cc:597
        parent = {<std::tr1::__shared_ptr<DBObjectMap::_Header, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x3cb0780, _M_refcount = {_M_pi = 0x2e08da00}}, <No data fields>}
        __PRETTY_FUNCTION__ = "int DBObjectMap::_clear(std::tr1::shared_ptr<DBObjectMap::_Header>, std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)"
#8  0x000000000078b63c in DBObjectMap::clear (this=0x328ab00, hoid=<value optimized out>, spos=<value optimized out>) at os/DBObjectMap.cc:583
        t = {<std::tr1::__shared_ptr<KeyValueDB::TransactionImpl, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x2b4f02c0, _M_refcount = {_M_pi = 0x369a6e80}}, <No data fields>}
        header = {<std::tr1::__shared_ptr<DBObjectMap::_Header, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x3cb0780, _M_refcount = {_M_pi = 0x2e08da00}}, <No data fields>}
        __PRETTY_FUNCTION__ = "virtual int DBObjectMap::clear(const hobject_t&, const SequencerPosition*)"
        r = <value optimized out>
#9  0x000000000073a0a6 in FileStore::lfn_unlink (this=0x31be000, cid=DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjuction with DW_OP_piece.) at os/FileStore.cc:350
        path = {<std::tr1::__shared_ptr<CollectionIndex::Path, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x32e05920, _M_refcount = {_M_pi = 0x31ab5ac0}}, <No data fields>}
        exist = 1
        st = {st_dev = 64770, st_ino = 1610612898, st_nlink = 1, st_mode = 33188, st_uid = 0, st_gid = 0, __pad0 = 0, st_rdev = 0, st_size = 4194304, st_blksize = 4096, st_blocks = 8200, st_atim = {tv_sec = 1369771413, tv_nsec = 196604128}, st_mtim = {tv_sec = 1370120522, tv_nsec = 795389833}, st_ctim = {tv_sec = 1370120522, tv_nsec = 795389833}, __unused = {0, 0, 0}}
        index = {<std::tr1::__shared_ptr<CollectionIndex, (__gnu_cxx::_Lock_policy)2>> = {_M_ptr = 0x3b873c0, _M_refcount = {_M_pi = 0x29fa4120}}, <No data fields>}
        r = 0
        __PRETTY_FUNCTION__ = "int FileStore::lfn_unlink(coll_t, const hobject_t&, const SequencerPosition&)"
        __func__ = "lfn_unlink"
#10 0x000000000073a334 in FileStore::_remove (this=0x31be000, cid=DWARF-2 expression error: DW_OP_reg operations must be used either alone or in conjuction with DW_OP_piece.) at os/FileStore.cc:2882
        r = <value optimized out>
#11 0x0000000000765ef1 in FileStore::_do_transaction (this=<value optimized out>, t=<value optimized out>, op_seq=<value optimized out>, trans_num=<value optimized out>) at os/FileStore.cc:2408
        cid = {static META_COLL = {static META_COLL = <same as static member of an already seen type>, str = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x3197258 "meta"}}}, str = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2fa64a38 "4.de0_head"}}}
        oid = {oid = {name = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x2a116c48 "rbd_data.57f6766b8b4567.", '0' <repeats 14 times>, "17"}}}, snap = {val = 18446744073709551614}, hash = 252288480, max = false, pool = 4, nspace = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xc80398 ""}}, key = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0xc80398 ""}}}
        op = <value optimized out>
        r = <value optimized out>
        i = {p = {bl = 0x35163620, ls = 0x35163620, off = 98, p = {_M_node = 0x31be3920}, p_off = 98}, sobject_encoding = false, pool_override = -1, use_pool_override = false, replica = false, _tolerate_collection_add_enoent = false}
        spos = {seq = 174858271, trans = 0, op = 0}
        __PRETTY_FUNCTION__ = "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)"
#12 0x0000000000769af1 in FileStore::_do_transactions (this=0x31be000, tls=..., op_seq=174858271, handle=0x7fb904ac8ce0) at os/FileStore.cc:2151
        r = 201801232
        trans_num = 0
#13 0x0000000000769d74 in FileStore::_do_op (this=0x31be000, osr=0x2a2ff900, handle=...) at os/FileStore.cc:1985
        o = 0x2ae2f7d0
        r = <value optimized out>
#14 0x0000000000847f91 in ThreadPool::worker (this=0x31bea08, wt=<value optimized out>) at common/WorkQueue.cc:119
        tp_handle = {cct = 0x31a2000, hb = 0x42ff450, grace = 60, suicide_grace = 180}
        item = 0x2a2ff900
        wq = 0x31beb90
        ss = <incomplete type>
        hb = 0x42ff450
it's removing an rbd data object, but stuck in the leveldb iterator?
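For context, the rmkeys_by_prefix / seek_to_first frames above amount to walking a leveldb iterator across every key under a prefix and queuing a delete for each one. A minimal stand-alone sketch of that access pattern against the plain leveldb API (an illustration only, not the actual Ceph LevelDBStore code; the database path and key prefix are made up):

#include <string>
#include <leveldb/db.h>
#include <leveldb/write_batch.h>

// Queue a delete for every key that starts with `prefix`.
static void rm_keys_by_prefix(leveldb::DB* db, const std::string& prefix,
                              leveldb::WriteBatch* batch) {
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek(prefix);
       it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    batch->Delete(it->key());  // one tombstone per existing key
  }
  delete it;
}

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::DB* db = nullptr;
  // "/tmp/prefix-demo" and the prefix below are invented for this sketch.
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/prefix-demo", &db);
  if (!s.ok()) return 1;

  leveldb::WriteBatch batch;
  rm_keys_by_prefix(db, "pglog_4.de0_", &batch);
  s = db->Write(leveldb::WriteOptions(), &batch);

  delete db;
  return s.ok() ? 0 : 1;
}

With a very large number of keys under a prefix, and any tombstones left behind by earlier deletes, that walk is mostly CPU and leveldb work rather than disk I/O, which matches the 100-200% CPU / near-zero disk I/O symptom described above.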
#2 Updated by Sage Weil almost 10 years ago
stefan: can you also do
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_ms 1
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_filestore 20
to capture some logging when the osd is eating cpu?
thanks!
#3 Updated by Stefan Priebe almost 10 years ago
- File ceph-osd.23.tar.gz added
Attached you'll find a log with debugging enabled in between, plus two new gdb thread-apply-all backtraces.
#4 Updated by Stefan Priebe almost 10 years ago
Maybe a hint, or just luck: I was able to reduce the effect and the recovery time by lowering osd op threads to 2 (the default) instead of the 4 I normally had.
But there are still slow requests.
#5 Updated by Sage Weil almost 10 years ago
- Status changed from New to 12
This looks to me like a lot of time is being spent in leveldb clearing and rewriting the pglog. This is probably just due to a naive PG::write_log()...?
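To make "naive" concrete (a simplified sketch with invented names, not the actual PG::write_log() code): a full rewrite deletes and re-inserts one key per log entry on every update, while an incremental write only touches the entries that changed.

#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Stand-in for a leveldb-backed transaction: it just records mutations.
struct FakeTxn {
  std::map<std::string, std::string> sets;
  std::vector<std::string> rms;
  void set(const std::string& k, const std::string& v) { sets[k] = v; }
  void rm(const std::string& k) { rms.push_back(k); }
};

struct LogEntry {
  uint64_t version;
  std::string payload;
  std::string key() const { return "pglog_" + std::to_string(version); }
};

// Naive approach: clear every existing key, then re-insert the entire
// in-memory log.  Work per update is proportional to the full log length,
// even if only one entry changed.
void write_log_naive(FakeTxn& t, const std::vector<LogEntry>& log) {
  for (const auto& e : log) t.rm(e.key());
  for (const auto& e : log) t.set(e.key(), e.payload);
}

// Incremental approach: append only entries newer than what is already
// durable and trim the old tail.  Work is proportional to the delta.
void write_log_incremental(FakeTxn& t, const std::vector<LogEntry>& log,
                           uint64_t written_to, uint64_t trim_to) {
  for (const auto& e : log) {
    if (e.version > written_to) t.set(e.key(), e.payload);
    else if (e.version <= trim_to) t.rm(e.key());
  }
}

int main() {
  std::vector<LogEntry> log;
  for (uint64_t v = 1; v <= 3000; ++v) log.push_back({v, "entry"});

  FakeTxn naive, incr;
  write_log_naive(naive, log);
  write_log_incremental(incr, log, /*written_to=*/2999, /*trim_to=*/0);

  std::printf("naive: %zu sets + %zu rms, incremental: %zu sets + %zu rms\n",
              naive.sets.size(), naive.rms.size(),
              incr.sets.size(), incr.rms.size());
  return 0;
}

Running the sketch shows the naive path issuing thousands of key mutations per log write where the incremental path issues one, and every PG on the OSD pays that cost when its log is rewritten during peering.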
#6 Updated by Sage Weil almost 10 years ago
Even so, it seems like a lot of time is spent just in the removal phase... perhaps there is something not quite right with the removal of old keys?
#7 Updated by Sage Weil almost 10 years ago
Stefan, are there any non-default options in your ceph.conf that might affect pg log size? How many PGs do you have per OSD?
#8 Updated by Stefan Priebe almost 10 years ago
No config changes except the minimum right now. Before, I had only changed the osd op thread count.
4096 PGs, 24 OSDs, 6 hosts, replication level 3.
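(For a rough sense of scale, assuming the 4096-PG pool holds most of the data, that works out to about 4096 × 3 / 24 = 512 PG replicas per OSD, each with its own pg log in leveldb.)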
#9 Updated by Stefan Priebe almost 10 years ago
Should I post the config?
#10 Updated by Sage Weil almost 10 years ago
- Subject changed from Osd hangs I/o gets stalled while recovering to osd: slow peering due to pg log rewrites
#11 Updated by Ian Colle almost 10 years ago
- Status changed from 12 to In Progress
- Assignee set to Samuel Just
#12 Updated by Samuel Just almost 10 years ago
- Tracker changed from Bug to Fix
- Target version set to v0.66
#13 Updated by Sage Weil almost 10 years ago
- Story points set to 3.00
#14 Updated by Stefan Priebe almost 10 years ago
Will this one be backported to cuttlefish?
#15 Updated by Samuel Just almost 10 years ago
- Status changed from In Progress to Resolved
#16 Updated by Samuel Just almost 10 years ago
This won't get backported. Some mitigating patches did go into cuttlefish. Also, there is wip_cuttlefish_compact_on_startup. fridudad_ on irc tried it, and it appeared to work (peering went quickly). Not sure how long it will take to compact leveldb on an OSD backing a radosgw workload. It also hasn't been tested much. (You need osd_compact_leveldb_on_mount to actually cause compaction on osd startup with that branch.)
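A minimal sketch of how that branch-specific option might be set in ceph.conf (the option name is taken from the comment above, exists only on that work-in-progress branch, and is presumably a boolean; treat this as an assumption, not a documented setting):

[osd]
    osd_compact_leveldb_on_mount = true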
#17 Updated by Stefan Priebe almost 10 years ago
I'm fridudad ;-) Peering works fine, but recovery does not. In the initial text of this tracker I also mentioned recovery as the main problem, not peering. Should I open a new report?