Fix #5232: osd: slow peering due to pg log rewrites
Status: Closed
Description
I noticed that since cuttlefish the osd recovery process is extremely slow. Client I/O to the recovering osd also stalls, resulting in slow request log messages.
Things I noticed:
- recovering osd uses 100-200% CPU until recovered
- nearly no disk I/O from that osd
- under bobtail I had much lower CPU usage and high disk I/O (which makes sense)
- recovering from a simple osd restart (catching up on objects changed since the stop) is awfully slow; even though my cluster was only degraded by 0.1% (24 TB, 3 replicas), it took 10-20 min
- I'm using xfs
This is a major blocker, as recovery is nearly impossible.
Version used: current upstream/cuttlefish.
I've provided Sage with a full gdb 'thread apply all bt' dump.
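(For anyone reproducing: a dump like that is captured by attaching gdb to the running osd, roughly gdb -p <osd pid> followed by 'thread apply all bt'; <osd pid> is a placeholder.)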
Greets
Stefan
Updated by Sage Weil almost 11 years ago
- File gdb.txt.gz gdb.txt.gz added
- Category set to OSD
this thread?
#1  0x00007fb90be5d1f5 in ?? () from /usr/lib/libleveldb.so.1
#2  0x00007fb90be4781a in ?? () from /usr/lib/libleveldb.so.1
#3  0x00007fb90be47c3d in ?? () from /usr/lib/libleveldb.so.1
#4  0x000000000079a75a in LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string const&) ()
#5  0x0000000000798b1f in LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix (this=0x2b4f02c0, prefix=...)
    at os/LevelDBStore.cc:141
#6  0x000000000078a9f6 in DBObjectMap::clear_header (this=0x328ab00, header=...) at os/DBObjectMap.cc:1158
#7  0x000000000078af8a in DBObjectMap::_clear (this=0x328ab00, header=...) at os/DBObjectMap.cc:597
#8  0x000000000078b63c in DBObjectMap::clear (this=0x328ab00, hoid=<value optimized out>,
    spos=<value optimized out>) at os/DBObjectMap.cc:583
#9  0x000000000073a0a6 in FileStore::lfn_unlink (this=0x31be000, cid=...) at os/FileStore.cc:350
#10 0x000000000073a334 in FileStore::_remove (this=0x31be000, cid=...) at os/FileStore.cc:2882
#11 0x0000000000765ef1 in FileStore::_do_transaction (this=<value optimized out>, t=<value optimized out>,
    op_seq=<value optimized out>, trans_num=<value optimized out>) at os/FileStore.cc:2408
        cid = "4.de0_head"
        oid = "rbd_data.57f6766b8b4567.", '0' <repeats 14 times>, "17"
        spos = {seq = 174858271, trans = 0, op = 0}
#12 0x0000000000769af1 in FileStore::_do_transactions (this=0x31be000, tls=..., op_seq=174858271,
    handle=0x7fb904ac8ce0) at os/FileStore.cc:2151
#13 0x0000000000769d74 in FileStore::_do_op (this=0x31be000, osr=0x2a2ff900, handle=...) at os/FileStore.cc:1985
#14 0x0000000000847f91 in ThreadPool::worker (this=0x31bea08, wt=<value optimized out>) at common/WorkQueue.cc:119
it's removing an rbd data object, but stuck in the leveldb iterator?
Updated by Sage Weil almost 11 years ago
Stefan: can you also do
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_ms 1
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_filestore 20
to capture some logging while the osd is eating CPU?
thanks!
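(To turn the logging back down afterwards, the same admin socket takes the reverse, e.g.
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_ms 0
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_osd 0
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_filestore 0
assuming 0 was the level the osd was running at before.)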
Updated by Stefan Priebe almost 11 years ago
- File ceph-osd.23.tar.gz ceph-osd.23.tar.gz added
Attached you'll find a log with debugging enabled in between, plus two new gdb 'thread apply all bt' traces.
Updated by Stefan Priebe almost 11 years ago
Maybe a hint or just luck: I could reduce the effect and the time to recover by lowering osd op threads to 2 (the default) instead of the 4 I normally run.
But there are still slow requests.
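For reference, that change corresponds to a ceph.conf stanza along these lines (osd op threads is the real option name; 2 is the shipped default):
[osd]
    osd op threads = 2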
Updated by Sage Weil almost 11 years ago
- Status changed from New to 12
This looks to me like a lot of time is being spent in leveldb clearing and rewriting the pglog. This is probably just due to a naive PG::write_log()...?
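For illustration, a minimal self-contained sketch of that naive pattern (not the actual Ceph code; a std::map stands in for the leveldb store, and the key naming is made up): every update clears the pg's whole key range and re-inserts every log entry, so write cost grows with the log length instead of with the size of the change.

// Sketch only: mimics a full-rewrite write_log against a key/value store.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using KVStore = std::map<std::string, std::string>;

// Naive rewrite: O(len) deletes plus O(len) inserts on every update.
void write_log_naive(KVStore& db, const std::string& prefix,
                     const std::vector<std::string>& full_log) {
  // rmkeys_by_prefix: drop every existing key under this pg's prefix.
  auto it = db.lower_bound(prefix);
  while (it != db.end() && it->first.compare(0, prefix.size(), prefix) == 0)
    it = db.erase(it);
  // ...then re-insert the entire log from scratch.
  for (size_t i = 0; i < full_log.size(); ++i)
    db[prefix + std::to_string(i)] = full_log[i];
}

// Incremental append: O(1) per new entry; what you would want instead.
void append_log_entry(KVStore& db, const std::string& prefix,
                      size_t index, const std::string& entry) {
  db[prefix + std::to_string(index)] = entry;
}

int main() {
  KVStore db;
  std::vector<std::string> log;
  for (size_t i = 0; i < 3000; ++i) {  // a plausible pg log length
    log.push_back("log entry");
    write_log_naive(db, "pglog_4.de0_", log);  // rewrites i+1 entries each time
  }
  std::printf("store holds %zu keys after the naive rewrites\n", db.size());
  return 0;
}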
Updated by Sage Weil almost 11 years ago
Even so, it seems like a lot of time is spent just in the removal phase... perhaps there is something not quite right with the removal of old keys?
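For the removal half specifically, the frames above (rmkeys_by_prefix driving seek_to_first) suggest a per-key iterate-and-delete. A simplified sketch of that pattern against the stock leveldb API (illustrative only, not the actual LevelDBStore code; the database path and key prefix are made up): each Delete() merely writes a tombstone, so repeated clears leave the key range littered with tombstones that every subsequent Seek() has to skim past until compaction catches up.

// Sketch only: per-key prefix removal, the pattern visible in the backtrace.
#include <cassert>
#include <string>
#include <leveldb/db.h>
#include <leveldb/write_batch.h>

void rmkeys_by_prefix(leveldb::DB* db, const std::string& prefix) {
  leveldb::WriteBatch batch;
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  // The seek_to_first(prefix) in frame #4 boils down to a Seek like this.
  for (it->Seek(prefix);
       it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    batch.Delete(it->key());  // one tombstone per key; space freed only at compaction
  }
  delete it;
  leveldb::Status s = db->Write(leveldb::WriteOptions(), &batch);
  assert(s.ok());
}

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/pglog-demo", &db);
  assert(s.ok());
  rmkeys_by_prefix(db, "pglog_4.de0_");  // made-up prefix for the demo
  delete db;
  return 0;
}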
Updated by Sage Weil almost 11 years ago
Stefan, are there any non-default options in your ceph.conf that might affect pg log size? How many pgs do you have per osd?
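(Those should be visible with something like
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config show | grep pg_log
config show dumps the running values; the grep assumes the pg-log options share that substring.)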
Updated by Stefan Priebe almost 11 years ago
No config changes beyond the minimum right now. Before, I had only changed the osd op thread count.
4096 pgs, 24 osds, 6 hosts, replication level 3.
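(With replication 3 that works out to 4096 * 3 / 24 = 512 PG copies per OSD, so a restarted osd has several hundred pg logs to peer through.)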
Updated by Sage Weil almost 11 years ago
- Subject changed from "Osd hangs I/o gets stalled while recovering" to "osd: slow peering due to pg log rewrites"
Updated by Ian Colle almost 11 years ago
- Status changed from 12 to In Progress
- Assignee set to Samuel Just
Updated by Samuel Just almost 11 years ago
- Tracker changed from Bug to Fix
- Target version set to v0.66
Updated by Sage Weil almost 11 years ago
- Story points set to 3.00
Updated by Stefan Priebe almost 11 years ago
Is this one missing a backport to cuttlefish?
Updated by Samuel Just almost 11 years ago
- Status changed from In Progress to Resolved
Updated by Samuel Just almost 11 years ago
This won't get backported. Some mitigating patches did go into cuttlefish. Also, there is wip_cuttlefish_compact_on_startup. fridudad_ on IRC tried it, and it appeared to work (peering went quickly). Not sure how long it will take to compact leveldb on an OSD backing a radosgw workload. It also hasn't been tested much. (You need osd_compact_leveldb_on_mount to actually cause compaction on osd startup with that branch.)
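With that branch, the option would presumably be enabled like any other osd setting in ceph.conf (sketch; the option name comes from the note above):
[osd]
    osd compact leveldb on mount = true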
Updated by Stefan Priebe almost 11 years ago
I'm fridudad ;-) Peering works fine, but recovery does not. In the initial text of this tracker I also mentioned recovery as the main problem, not peering. Should I open a new report?