Bug #19099

valgrind runs time out in wait_for_up_osds after 300s; osd logs show *incredibly* slow startup

Added by Sage Weil about 2 months ago. Updated about 2 months ago.

Target version:
Start date:
Due date:
% Done:


3 - minor
Affected Versions:
Needs Doc:


a taste

2017-02-27 17:18:09.937916 9551500  0 ceph version 12.0.0-829-gc0b1e6e (c0b1e6ef10d3ac9c95ee80c55d5a6ff68f4197b5), process memcheck-amd64-, pid 19406
2017-02-27 17:18:09.956225 9551500  5 object store type is filestore
2017-02-27 17:18:10.149272 9551500  1 -- _finish_bind bind my_inst.addr is
2017-02-27 17:18:10.157672 9551500  1 -- _finish_bind bind my_inst.addr is
2017-02-27 17:18:10.159395 9551500  1 -- _finish_bind bind my_inst.addr is
2017-02-27 17:18:10.161994 9551500  1 -- _finish_bind bind my_inst.addr is
2017-02-27 17:18:11.462646 9551500 10 filestore(/var/lib/ceph/osd/ceph-1) dump_stop
2017-02-27 17:19:01.644980 9551500  0 load: jerasure load: lrc load: isa 
2017-02-27 17:19:01.723810 9551500  5 filestore(/var/lib/ceph/osd/ceph-1) test_mount basedir /var/lib/ceph/osd/ceph-1 journal /var/lib/ceph/osd/ceph-1/journal
2017-02-27 17:19:01.726379 9551500  1 -- start start
2017-02-27 17:19:01.727236 9551500  1 -- - start start
2017-02-27 17:19:01.727645 9551500  1 -- - start start
2017-02-27 17:19:02.306754 9551500 10 open_all_classes
2017-02-27 17:19:02.310274 9551500 10 open_all_classes found rgw
2017-02-27 17:19:02.321478 9551500 10 _get_class adding new class name rgw 0xee7dc38
2017-02-27 17:19:02.323570 9551500 10 _load_class rgw from /usr/lib64/rados-classes/
2017-02-27 17:19:19.430244 155bb700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry woke after 17.448297
2017-02-27 17:19:19.452476 9551500 10 register_class rgw status 3
2017-02-27 17:19:19.453815 9551500 10 register_cxx_method rgw.bucket_init_index flags 3 0x1ae7f090
2017-02-27 17:19:19.456257 9551500 10 register_cxx_method rgw.bucket_set_tag_timeout flags 3 0x1ae89400
2017-02-27 17:19:19.457235 9551500 10 register_cxx_method rgw.bucket_list flags 1 0x1ae898a0


Related issues

Duplicated by Bug #19113: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) Duplicate 03/01/2017
Duplicated by Bug #18732: monitor takes 2 minutes to start Duplicate 01/30/2017


#1 Updated by Nathan Cutler about 2 months ago

  • Duplicated by Bug #19113: slow osd boot with valgrind (reached maximum tries (50) after waiting for 300 seconds) added

#2 Updated by Kefu Chai about 2 months ago

  • Status changed from New to Need Review
  • Assignee set to Kefu Chai
  • Source set to Development

checked dmesg, on the test node. nothing really interesting. i guess it's just slow with valgrind (maybe i am wrong).

#3 Updated by Sage Weil about 2 months ago

Watchig this happen, I observed all 3 osds on a host at 100% CPU, and very very slow progression of log messages through FileStore.

Sadly, valgrind doesn't let you connect to the process w/ gdb while it's running, so i really can't tell what it was doing.. :/

#4 Updated by Sage Weil about 2 months ago

hmm, it is reading the debuginfo file...

lr-x------. 1 root root 64 Mar  2 00:20 12 -> /usr/lib/debug/.dwz/ceph-12.0.0-932.g8d6fb1d.el7.x86_64

pread(12, "\232\230\5\1\324\231\5\2\321\234\5\0<=%-\2\270\3 \1\251\230\5\256\230\5\1\324\231\5\0"..., 8192, 23207936) = 8192
pread(12, "\1\304\2\316\2K\255!\36\274!\36\260!\0(\316\tX\2\264l\1\1\333\2\347\2K\255!9"..., 8192, 23216128) = 8192
pread(12, "2\252\33\0%\201\4T\2\246\20\1js\3\247\33\"\24\0\0\0\0003\32j\2\0\246\21\1\336"..., 8192, 23224320) = 8192
pread(12, "\0\0@\3537-\2\264\247\375\326A\2\221\27\245\31\256\31E\254A6\311\10\0\0\0T\0058-"..., 8192, 23232512) = 8192
pread(12, "\214\1.\1?\31\3\16:\5;\v<\31c\31d\22\1\22\0\0\215\1.\1?\31\3\16:\v"..., 8192, 118603776) = 8192
pread(12, "\23\2\231 *d \0\36\373\232Y\2au\302Y\2$\301p\0\0Gle\10\1\1#^\5"..., 8192, 23248896) = 8192
pread(12, "o\0\0\0\27S\0\"\2\325\27\4\0\5\36\35\365U\2\0b\36k*\1 \237\5n\2\257d"..., 8192, 23265280) = 8192
pread(12, "\31\3\10:\v;\vn\16<\31d\25\1\25\0\0\234\3.\1?\31\3\16:\5;\5L\v\35"..., 8192, 118611968) = 8192
pread(12, "\301\247\35\252\205\32\24G\232Z\2\n\33\0\0\2\301\247\35\4\336\0\0\0\0w\314#\302\0\212\301"..., 8192, 23699456) = 8192
pread(12, "\205\35\1\325\221%\0\17%\353W\2\23Q\245\205\35\256\205\35\1\325\221%\2\332\221%\0\17$\353"..., 8192, 23748608) = 8192
pread(12, "\0\25\277Z=\2G\34\3\1\241\305\35\252\305\35\1\353\226%\2\331\304\35\0\25\277Z=\2G!"..., 8192, 23756800) = 8192
pread(12, "\r\0\0\255\1w\361\37\2\303\1\227\2\216\r\0\0\255\1P\315\37\2\303\1\230\2\216\r\0\0\203"..., 8192, 23764992) = 8192
pread(12, "\20\rg\301S\2\200\1\367\3cL\10\7@\20\r\362_U\2\200\1\370\3\335L\10\7H\20\r"..., 8192, 23781376) = 8192
pread(12, "\37\1\275\306\37\2\350\326%\0a\31\225[\2\1\247\305\37\254\305\37\1\275\306\37\0\334\1\30\225["..., 8192, 23789568) = 8192
pread(12, "\0\30:\325\323W\2\322\r\1\300\211\30 -\331y]\2\322\17\1\24\0\0\0P-J}]\2"..., 8192, 23797760) = 8192
pread(12, "\6\10\356\312\4\7\10\342\313\37\6\10\263\315\37\7\10\263\315\37\6\10\342\313\37\7\10\321\334\32\6\10"..., 8192, 23805952) = 8192
pread(12, "\305!\2\357\224\4\310k\202\2\0167\23\0\1\246\205!\253\205!\1\320\324!\0\nJ\220\343\2\357"..., 8192, 23814144) = 8192
pread(12, "\2\253\310!\0(\32j\2\0\362\1774\306\17\2\253\310!\1\253\305!\264\305!\1\241\310!\2\253"..., 8192, 23822336) = 8192
pread(12, "\3\0\0\232\205\"\237\205\"\1\316\243\"\0\"\201\227V\2\361\f\5\0\370\267\2\261\205\"\266\205\""..., 8192, 23830528) = 8192
pread(12, "\345#\1\232\305\"\237\305\"\1\202\334#\0B\306) \2E\2/\1^\243\37\2\1\263\305\"\274"..., 8192, 23838720) = 8192
pread(12, "\205#\2\320\205#\2\352\204\36\0j\346\305\32\2\361\364\20\226\237\227\1\264\205#\2\320\205#\2\352"..., 8192, 23846912) = 8192

#5 Updated by Sage Weil about 2 months ago

okay, according to tomhughes in #valgrind-dev, this is teh compressed debug info, and because it's compressed valgrind has to read the whole thing and decompress it to get what it needs.

<tomhughes> dwz is shared data - where data is duplicated across several programs the shared bits are extracted into the dwz
<tomhughes> your real problem is that genius (I'm guessing your a Debian victim here) has decided to compress your debug data, meaning that it all has to be read rather than just the required bits
<tomhughes> that probably interacts badly with extracting shared items to a dwz as well
<sage> ah.  this is centos 7.3 actually, but i'll see if we can fix the package not to do that
<tomhughes> because you have to read and decompress all the shared data
<sage> that makes sense, thanks
<tomhughes> redhat toolchains don't normally compress
<sage> we're building the package, so it's probably something in our tool.
<sage> i think i'm on the right track.. thanks tomhughes!
<tomhughes> look for --compress-debug-sections when linking

dwz appeard to be invoked by rpmbuild? from,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=centos7,DIST=centos7,MACHINE_SIZE=huge/1598/consoleText,

+ /usr/lib/rpm/ --strict-build-id -m --run-dwz --dwz-low-mem-die-limit 10000000 --dwz-max-die-limit 110000000 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.0-932-g8d6fb1d/rpm/el7/BUILD/ceph-12.0.0-932-g8d6fb1d
extracting debug info from /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.0-932-g8d6fb1d/rpm/el7/BUILDROOT/ceph-12.0.0-932.g8d6fb1d.el7.x86_64/usr/lib64/ceph/
extracting debug info from /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.0-932-g8d6fb1d/rpm/el7/BUILDROOT/ceph-12.0.0-932.g8d6fb1d.el7.x86_64/usr/lib64/ceph/erasure-code/

but i'm not sure why this is happening. also these failures seem to be new-ish... not sure what changed?

#6 Updated by Kefu Chai about 2 months ago

guess the redhat-rpm-config is updated in our build machine to 9.1.0-30 or up. which includes the fix of
so as suggested by dmick,

%_find_debuginfo_dwz_opts %{nil}

we should just disable dwz in the .spec, right?

#7 Updated by Kefu Chai about 2 months ago

debuginfo sizes collected from and

release size
0.94.0 (hammer) 716586116
9.2.1 (infernalis) 985149724
10.1.0 (jewel) 1283800552
11.0.1 (kraken) 1260967380
12.0.0 (luminous) 896875136
12.0.0 (on f25) 809000000
master (w/o dwz) 1234977864
master (w/ dwz) 908413132

opensuse splits the debuginfo into smaller pieces, see, all the debuginfo packages size 971MB in total. but the reason why the valgrind started to slow down is still a mystery, dwz support on rpm and valgrind was introduced back in 2012. and centos 7 has included them for a while. maybe we can abandon the monolithic approach to package all debuginfo into a single package ?

#8 Updated by Kefu Chai about 2 months ago

  • Duplicated by Bug #18732: monitor takes 2 minutes to start added

#9 Updated by Brad Hubbard about 2 months ago

If someone can give me clear instructions on how to reproduce this (assume I'm dumb) or better yet give me access to an environment where this is happening I believe I should be able to gather more information about the exact nature of the issue, and possibly confirm that it is related to the size of the debuginfo and/or the fact that it is compressed. Remember that my teuthology foo is not great so try to be gentle. This also assumes we consider this a good use of my time.

If you'd like to get more information on the issue yourselves I would suggest a combination of "ps axHo %cpu,stat,pid,tid,pgid,ppid,comm,wchan" to identify the thread that is chewing the CPU (likely the problem thread) and gstack to dump out thread stacks and match the "tid" from ps output to the "LWP" number in gstack output. You may need to do this a few times to be sure you've pin-pointed the problematic code (this is the "poorman's profiler" technique, see Once we identify the function we could use a systemtap to work out how long it is spending in the function calls involved and hopefully zero in on the exact nature of the issue. You'll need the valgrind debuginfo package loaded as well as the systemtap package if you want to use that (run "stap-prep" to get it set up).

#10 Updated by Sage Weil about 2 months ago

teuthology-suite -s rados/verify --filter valgrind --limit 1 -p 10 -c wip-sage-testing

tail teuthology.log to see which machine it picks, and log in while the daemons are starting up.

#11 Updated by Sage Weil about 2 months ago

FWIW debian does a -dbg packages for every subpackage (e.g., ceph-osd-dbg, ceph-test-dbg) and it is so much nicer!! if this is an option, let's do that? (it'll probably mean changing the install task to try either ceph-debuginfo or the various subpackages, since older versions will still have the big -debuginfo).

#12 Updated by Sage Weil about 2 months ago

i guess more importantly, is there an reason we want/need the compressed dwz in the package? maybe we can just turn it off?

#13 Updated by Kefu Chai about 2 months ago

testing at

teuthology-suite --ceph wip-disable-dwz-kefu -s rados/verify --filter valgrind --limit 1 -p 10  --machine-type mira

the change is posted at

#14 Updated by Kefu Chai about 2 months ago

2017-03-02T16:38:35.562 INFO:tasks.ceph:Waiting until ceph cluster ceph is healthy...
2017-03-02T16:38:56.947 DEBUG:teuthology.misc:6 of 6 OSDs are up

it took over 5 minutes to boot and active with dwz, see #19113.

#15 Updated by Sage Weil about 2 months ago

  • Status changed from Need Review to Resolved

#16 Updated by Kefu Chai about 2 months ago

turns out, to split the the single big debuginfo packages into subpackages, is not a trivial task. see

Also available in: Atom PDF