Fix #5232: osd: slow peering due to pg log rewrites
Status: Closed
Description
I noticed that since cuttlefish the osd recovery process is extremely slow. Client I/O to the recovering osd also stalls, resulting in slow request log messages.
Things I noticed:
- recovering osd uses 100-200% CPU until recovered
- nearly no disk I/O from that osd
- under bobtail I had much lower CPU usage and high disk I/O (which makes sense)
- recovering from a simple osd restart (catching up on objects changed since the stop) is awfully slow; even though my cluster was only degraded by 0.1% (24 TB, 3 replicas), it took 10-20 min
- I'm using xfs
This is a major blocker, as recovery is nearly impossible.
Version used: current upstream/cuttlefish.
I've provided Sage with a full gdb 'thread apply all bt' dump.
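(For anyone reproducing: a dump like that is captured by attaching gdb to the running osd, roughly gdb -p <osd pid> followed by 'thread apply all bt'; <osd pid> is a placeholder.)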
Greets
Stefan
Updated by Sage Weil almost 11 years ago
- File gdb.txt.gz gdb.txt.gz added
- Category set to OSD
this thread?
#1  0x00007fb90be5d1f5 in ?? () from /usr/lib/libleveldb.so.1
#2  0x00007fb90be4781a in ?? () from /usr/lib/libleveldb.so.1
#3  0x00007fb90be47c3d in ?? () from /usr/lib/libleveldb.so.1
#4  0x000000000079a75a in LevelDBStore::LevelDBWholeSpaceIteratorImpl::seek_to_first(std::string const&) ()
#5  0x0000000000798b1f in LevelDBStore::LevelDBTransactionImpl::rmkeys_by_prefix (this=0x2b4f02c0, prefix=...)
    at os/LevelDBStore.cc:141
#6  0x000000000078a9f6 in DBObjectMap::clear_header (this=0x328ab00, header=...) at os/DBObjectMap.cc:1158
#7  0x000000000078af8a in DBObjectMap::_clear (this=0x328ab00, header=...) at os/DBObjectMap.cc:597
#8  0x000000000078b63c in DBObjectMap::clear (this=0x328ab00, hoid=<value optimized out>,
    spos=<value optimized out>) at os/DBObjectMap.cc:583
#9  0x000000000073a0a6 in FileStore::lfn_unlink (this=0x31be000, cid=...) at os/FileStore.cc:350
#10 0x000000000073a334 in FileStore::_remove (this=0x31be000, cid=...) at os/FileStore.cc:2882
#11 0x0000000000765ef1 in FileStore::_do_transaction (this=<value optimized out>, t=<value optimized out>,
    op_seq=<value optimized out>, trans_num=<value optimized out>) at os/FileStore.cc:2408
        cid = "4.de0_head"
        oid = "rbd_data.57f6766b8b4567.", '0' <repeats 14 times>, "17"
        spos = {seq = 174858271, trans = 0, op = 0}
#12 0x0000000000769af1 in FileStore::_do_transactions (this=0x31be000, tls=..., op_seq=174858271,
    handle=0x7fb904ac8ce0) at os/FileStore.cc:2151
#13 0x0000000000769d74 in FileStore::_do_op (this=0x31be000, osr=0x2a2ff900, handle=...) at os/FileStore.cc:1985
#14 0x0000000000847f91 in ThreadPool::worker (this=0x31bea08, wt=<value optimized out>) at common/WorkQueue.cc:119
it's removing an rbd data object, but stuck in the leveldb iterator?
Updated by Sage Weil almost 11 years ago
Stefan: can you also do
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_ms 1
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_filestore 20
to capture some logging while the osd is eating CPU?
thanks!
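(To turn the logging back down afterwards, the same admin socket takes the reverse, e.g.
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_ms 0
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_osd 0
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config set debug_filestore 0
assuming 0 was the level the osd was running at before.)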
Updated by Stefan Priebe almost 11 years ago
- File ceph-osd.23.tar.gz ceph-osd.23.tar.gz added
Attached you'll find a log with debugging enabled in between, plus two new gdb 'thread apply all bt' traces.
Updated by Stefan Priebe almost 11 years ago
Maybe a hint or just luck: I could reduce the effect and the time to recover by lowering osd op threads to 2 (the default) instead of the 4 I normally run.
But there are still slow requests.
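For reference, that change corresponds to a ceph.conf stanza along these lines (osd op threads is the real option name; 2 is the shipped default):
[osd]
    osd op threads = 2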
Updated by Sage Weil almost 11 years ago
- Status changed from New to 12
This looks to me like a lot of time is being spent in leveldb clearing and rewriting the pglog. This is probably just due to a naive PG::write_log()...?
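For illustration, a minimal self-contained sketch of that naive pattern (not the actual Ceph code; a std::map stands in for the leveldb store, and the key naming is made up): every update clears the pg's whole key range and re-inserts every log entry, so write cost grows with the log length instead of with the size of the change.

// Sketch only: mimics a full-rewrite write_log against a key/value store.
#include <cstdio>
#include <map>
#include <string>
#include <vector>

using KVStore = std::map<std::string, std::string>;

// Naive rewrite: O(len) deletes plus O(len) inserts on every update.
void write_log_naive(KVStore& db, const std::string& prefix,
                     const std::vector<std::string>& full_log) {
  // rmkeys_by_prefix: drop every existing key under this pg's prefix.
  auto it = db.lower_bound(prefix);
  while (it != db.end() && it->first.compare(0, prefix.size(), prefix) == 0)
    it = db.erase(it);
  // ...then re-insert the entire log from scratch.
  for (size_t i = 0; i < full_log.size(); ++i)
    db[prefix + std::to_string(i)] = full_log[i];
}

// Incremental append: O(1) per new entry; what you would want instead.
void append_log_entry(KVStore& db, const std::string& prefix,
                      size_t index, const std::string& entry) {
  db[prefix + std::to_string(index)] = entry;
}

int main() {
  KVStore db;
  std::vector<std::string> log;
  for (size_t i = 0; i < 3000; ++i) {  // a plausible pg log length
    log.push_back("log entry");
    write_log_naive(db, "pglog_4.de0_", log);  // rewrites i+1 entries each time
  }
  std::printf("store holds %zu keys after the naive rewrites\n", db.size());
  return 0;
}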
Updated by Sage Weil almost 11 years ago
Even so, it seems like a lot of time is spent just in the removal phase... perhaps there is something not quite right with the removal of old keys?
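For the removal half specifically, the frames above (rmkeys_by_prefix driving seek_to_first) suggest a per-key iterate-and-delete. A simplified sketch of that pattern against the stock leveldb API (illustrative only, not the actual LevelDBStore code; the database path and key prefix are made up): each Delete() merely writes a tombstone, so repeated clears leave the key range littered with tombstones that every subsequent Seek() has to skim past until compaction catches up.

// Sketch only: per-key prefix removal, the pattern visible in the backtrace.
#include <cassert>
#include <string>
#include <leveldb/db.h>
#include <leveldb/write_batch.h>

void rmkeys_by_prefix(leveldb::DB* db, const std::string& prefix) {
  leveldb::WriteBatch batch;
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  // The seek_to_first(prefix) in frame #4 boils down to a Seek like this.
  for (it->Seek(prefix);
       it->Valid() && it->key().starts_with(prefix);
       it->Next()) {
    batch.Delete(it->key());  // one tombstone per key; space freed only at compaction
  }
  delete it;
  leveldb::Status s = db->Write(leveldb::WriteOptions(), &batch);
  assert(s.ok());
}

int main() {
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::DB* db = nullptr;
  leveldb::Status s = leveldb::DB::Open(options, "/tmp/pglog-demo", &db);
  assert(s.ok());
  rmkeys_by_prefix(db, "pglog_4.de0_");  // made-up prefix for the demo
  delete db;
  return 0;
}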
Updated by Sage Weil almost 11 years ago
Stefan, are there any non-default options in your ceph.conf that might affect pg log size? How many pgs do you have per osd?
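(Those should be visible with something like
ceph --admin-daemon /var/run/ceph/ceph-osd.NNN.asok config show | grep pg_log
config show dumps the running values; the grep assumes the pg-log options share that substring.)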
Updated by Stefan Priebe almost 11 years ago
No config changes beyond the minimum right now. Before, I had only changed the osd op thread count.
4096 pgs, 24 osds, 6 hosts, replication level 3.
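(With replication 3 that works out to 4096 * 3 / 24 = 512 PG copies per OSD, so a restarted osd has several hundred pg logs to peer through.)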
Updated by Sage Weil almost 11 years ago
- Subject changed from "Osd hangs I/o gets stalled while recovering" to "osd: slow peering due to pg log rewrites"
Updated by Ian Colle almost 11 years ago
- Status changed from 12 to In Progress
- Assignee set to Samuel Just
Updated by Samuel Just almost 11 years ago
- Tracker changed from Bug to Fix
- Target version set to v0.66
Updated by Sage Weil almost 11 years ago
- Story points set to 3.00
Updated by Stefan Priebe almost 11 years ago
Is this one missing a backport to cuttlefish?
Updated by Samuel Just almost 11 years ago
- Status changed from In Progress to Resolved
Updated by Samuel Just almost 11 years ago
This won't get backported. Some mitigating patches did go into cuttlefish. Also, there is wip_cuttlefish_compact_on_startup. fridudad_ on IRC tried it, and it appeared to work (peering went quickly). Not sure how long it will take to compact leveldb on an OSD backing a radosgw workload. It also hasn't been tested much. (You need osd_compact_leveldb_on_mount to actually cause compaction on osd startup with that branch.)
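With that branch, the option would presumably be enabled like any other osd setting in ceph.conf (sketch; the option name comes from the note above):
[osd]
    osd compact leveldb on mount = true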
Updated by Stefan Priebe almost 11 years ago
I'm fridudad ;-) Peering works fine, but recovery does not. In the initial text of this tracker I also mentioned recovery as the main problem, not peering. Should I open a new report?