Project

General

Profile

Actions

Bug #15379

closed

ceph mds continiously crashes and going into laggy state (stray purging problems)

Added by Kenneth Waegeman about 8 years ago. Updated almost 8 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We are removing a very large directory from cephFS on infernalis 9.2.1, and after a while, an MDS was in laggy state without failing over to another mds. I checked and all mdss has crashed. It leaves a stacktrace:

   -26> 2016-04-04 23:46:35.743970 7faa127fe700  5 mds.0.log _submit_thread 5134992710184~531 : EUpdate purge_stray [metablob 600, 1 dirs]
   -25> 2016-04-04 23:46:35.743978 7faa143fe700  1 -- 10.141.16.3:6806/2790032 --> 10.141.16.16:6824/1773611 -- osd_op(mds.0.332:140386 100015cef56.00000000 [delete] 4.116a78fc snapc 1=[] ondisk+write+known_if_r
edirected+full_force e12853) v6 -- ?+0 0x7faa3157d700 con 0x7faa29d9c840
   -24> 2016-04-04 23:46:35.744014 7faa143fe700  1 -- 10.141.16.3:6806/2790032 --> 10.141.16.25:6801/2962763 -- osd_op(mds.0.332:140387 100015cef57.00000000 [delete] 4.aed913ee snapc 1=[] ondisk+write+known_if_r
edirected+full_force e12853) v6 -- ?+0 0x7fb273ab99c0 con 0x7faa2994dde0
   -23> 2016-04-04 23:46:35.744019 7faa127fe700  5 mds.0.log _submit_thread 5134992710735~531 : EUpdate purge_stray [metablob 600, 1 dirs]
   -22> 2016-04-04 23:46:35.744044 7faa127fe700  5 mds.0.log _submit_thread 5134992711286~531 : EUpdate purge_stray [metablob 600, 1 dirs]
   -21> 2016-04-04 23:46:35.744046 7faa143fe700  1 -- 10.141.16.3:6806/2790032 --> 10.141.16.25:6801/2962763 -- osd_op(mds.0.332:140388 100015cef58.00000000 [delete] 4.7a4ded53 snapc 1=[] ondisk+write+known_if_r
edirected+full_force e12853) v6 -- ?+0 0x7fb2718f6b00 con 0x7faa2994dde0
   -20> 2016-04-04 23:46:35.744183 7faa19cb3700  1 -- 10.141.16.3:6806/2790032 <== client.834666 10.141.16.11:0/2497359121 2362568 ==== client_request(client.834666:74676965 lookup #10002349485/constructramseygr
aphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516) v2 ==== 185+0+0 (3464717445 0 0) 0x7fb76e9c2ec0 con 0x7faa2420dc80
   -19> 2016-04-04 23:46:35.744198 7faa19cb3700  4 mds.0.server handle_client_request client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 201
6-04-04 23:46:35.730516) v2
   -18> 2016-04-04 23:46:35.744260 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744158, event: throttled, op: client_request(client.834666:74676965 lookup #10002349485/constructramsey
graphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -17> 2016-04-04 23:46:35.744264 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744172, event: all_read, op: client_request(client.834666:74676965 lookup #10002349485/constructramseyg
raphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -16> 2016-04-04 23:46:35.744268 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744195, event: dispatched, op: client_request(client.834666:74676965 lookup #10002349485/constructramse
ygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -15> 2016-04-04 23:46:35.744289 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744289, event: acquired locks, op: client_request(client.834666:74676965 lookup #10002349485/constructr
amseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -14> 2016-04-04 23:46:35.744297 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744297, event: replying, op: client_request(client.834666:74676965 lookup #10002349485/constructramseyg
raphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -13> 2016-04-04 23:46:35.744307 7faa19cb3700  1 -- 10.141.16.3:6806/2790032 --> 10.141.16.11:0/2497359121 -- client_reply(???:74676965 = 0 (0) Success) v1 -- ?+0 0x7faa37b68b00 con 0x7faa2420dc80
   -12> 2016-04-04 23:46:35.744317 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744317, event: finishing request, op: client_request(client.834666:74676965 lookup #10002349485/constru
ctramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -11> 2016-04-04 23:46:35.744331 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744331, event: cleaned up request, op: client_request(client.834666:74676965 lookup #10002349485/constr
uctramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
   -10> 2016-04-04 23:46:35.744336 7faa19cb3700  5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744336, event: done, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraph
s_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
    -9> 2016-04-04 23:46:35.744454 7faa19cb3700  1 -- 10.141.16.3:6806/2790032 <== client.834666 10.141.16.11:0/2497359121 2362569 ==== client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46
:35.730516) v2 ==== 122+0+0 (3349840858 0 0) 0x7faa37b68b00 con 0x7faa2420dc80
    -8> 2016-04-04 23:46:35.744465 7faa19cb3700  4 mds.0.server handle_client_request client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516) v2
    -7> 2016-04-04 23:46:35.744469 7faa19cb3700  5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744437, event: throttled, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:
46:35.730516)
    -6> 2016-04-04 23:46:35.744473 7faa19cb3700  5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744448, event: all_read, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:4
6:35.730516)
    -5> 2016-04-04 23:46:35.744476 7faa19cb3700  5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744463, event: dispatched, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23
:46:35.730516)
    -4> 2016-04-04 23:46:35.744492 7faa19cb3700  5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744492, event: acquired locks, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-0
4 23:46:35.730516)
    -3> 2016-04-04 23:46:35.744517 7faa19cb3700  1 -- 10.141.16.3:6806/2790032 --> 10.141.16.2:6805/1303901 -- osd_op(mds.0.332:140389 1000266eb9f.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent
] 6.acf7ddd ack+read+known_if_redirected+full_force e12853) v6 -- ?+0 0x7fb76e9c2ec0 con 0x7faa2420a420
    -2> 2016-04-04 23:46:35.757602 7faa13bfd700  1 -- 10.141.16.3:6806/2790032 <== osd.3 10.141.16.2:6805/1303901 1945 ==== osd_op_reply(140389 1000266eb9f.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxat
tr (432)] v0'0 uv54613 ondisk = 0) v6 ==== 271+0+597535 (3018918272 0 3680795695) 0x7fb279097c80 con 0x7faa2420a420
    -1> 2016-04-04 23:46:35.758698 7fa9febff700  1 -- 10.141.16.3:6806/2790032 <== osd.96 10.141.16.20:6802/1775683 4771 ==== osd_op_reply(140373 100015cef4c.00000000 [delete] v0'0 uv2249860 ondisk = 0) v6 ==== 
187+0+0 (3654499068 0 0) 0x7faa2a9ac000 con 0x7faa29d9c160
     0> 2016-04-04 23:46:35.767677 7faa143fe700 -1 *** Caught signal (Aborted) **
 in thread 7faa143fe700

 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
 1: (()+0x4b6f42) [0x7faa20b6df42]
 2: (()+0xf100) [0x7faa1f87b100]
 3: (gsignal()+0x37) [0x7faa1e69d5f7]
 4: (abort()+0x148) [0x7faa1e69ece8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7faa1efa19d5]
 6: (()+0x5e946) [0x7faa1ef9f946]
 7: (()+0x5e973) [0x7faa1ef9f973]
 8: (()+0x5eb93) [0x7faa1ef9fb93]
 9: (CDir::_load_dentry(std::string const&, std::string const&, snapid_t, ceph::buffer::list&, int, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const*, bool*, std::list<CInode*, std::allocator<CInode*> >*)+0x2823) [0x7faa20a6f423]
 10: (CDir::_omap_fetched(ceph::buffer::list&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&, std::string const&, int)+0x9f8) [0x7faa20a701e8]
 11: (MDSIOContextBase::complete(int)+0xa4) [0x7faa20ad0a54]
 12: (Finisher::finisher_thread_entry()+0x168) [0x7faa20b90f58]
 13: (()+0x7dc5) [0x7faa1f873dc5]
 14: (clone()+0x6d) [0x7faa1e75e28d]

I tried to restart the MDS but it again crashes every time, with similar stacktrace:

   -27> 2016-04-05 09:58:23.751054 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 105 ==== osd_op_reply(36267 200.0012af3e [write 3618294~1635 [fadvise_dontneed]] v12857'139171 uv
139171 ondisk = 0) v6 ==== 179+0+0 (2335018870 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -26> 2016-04-05 09:58:23.751097 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 106 ==== osd_op_reply(36268 200.0012af3e [write 3619929~990 [fadvise_dontneed]] v12857'139172 uv1
39172 ondisk = 0) v6 ==== 179+0+0 (1251463916 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -25> 2016-04-05 09:58:23.751122 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 107 ==== osd_op_reply(36269 200.0012af3e [write 3620919~1231 [fadvise_dontneed]] v12857'139173 uv
139173 ondisk = 0) v6 ==== 179+0+0 (155261733 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -24> 2016-04-05 09:58:23.751150 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 108 ==== osd_op_reply(36270 200.0012af3e [write 3622150~1234 [fadvise_dontneed]] v12857'139174 uv
139174 ondisk = 0) v6 ==== 179+0+0 (1138588898 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -23> 2016-04-05 09:58:23.751170 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 109 ==== osd_op_reply(36271 200.0012af3e [write 3623384~1228 [fadvise_dontneed]] v12857'139175 uv
139175 ondisk = 0) v6 ==== 179+0+0 (2523301192 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -22> 2016-04-05 09:58:23.751194 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 110 ==== osd_op_reply(36272 200.0012af3e [write 3624612~1230 [fadvise_dontneed]] v12857'139176 uv
139176 ondisk = 0) v6 ==== 179+0+0 (1337395193 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -21> 2016-04-05 09:58:23.751215 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 111 ==== osd_op_reply(36273 200.0012af3e [write 3625842~1232 [fadvise_dontneed]] v12857'139177 uv
139177 ondisk = 0) v6 ==== 179+0+0 (2160167011 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -20> 2016-04-05 09:58:23.751241 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 112 ==== osd_op_reply(36274 200.0012af3e [write 3627074~1232 [fadvise_dontneed]] v12857'139178 uv
139178 ondisk = 0) v6 ==== 179+0+0 (1458940318 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -19> 2016-04-05 09:58:23.751292 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 113 ==== osd_op_reply(36275 200.0012af3e [write 3628306~1230 [fadvise_dontneed]] v12857'139179 uv
139179 ondisk = 0) v6 ==== 179+0+0 (2161543871 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -18> 2016-04-05 09:58:23.751315 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 114 ==== osd_op_reply(36276 200.0012af3e [write 3629536~1230 [fadvise_dontneed]] v12857'139180 uv
139180 ondisk = 0) v6 ==== 179+0+0 (408786516 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -17> 2016-04-05 09:58:23.751387 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 115 ==== osd_op_reply(36277 200.0012af3e [write 3630766~1240 [fadvise_dontneed]] v12857'139181 uv
139181 ondisk = 0) v6 ==== 179+0+0 (155577562 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -16> 2016-04-05 09:58:23.751411 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 116 ==== osd_op_reply(36278 200.0012af3e [write 3632006~4335 [fadvise_dontneed]] v12857'139182 uv
139182 ondisk = 0) v6 ==== 179+0+0 (3957996991 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -15> 2016-04-05 09:58:23.751437 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 117 ==== osd_op_reply(36265 600.00000000 [stat,tmap2omap 1~0,omap-set-header 0~226,omap-set-vals 
0~508380,omap-rm-keys 0~116024] v12857'145527 uv145527 ondisk = 0) v6 ==== 347+0+0 (3496540317 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -14> 2016-04-05 09:58:23.752939 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 118 ==== osd_op_reply(36279 200.0012af3e [write 3636341~1752 [fadvise_dontneed]] v12857'139183 uv
139183 ondisk = 0) v6 ==== 179+0+0 (4072285340 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -13> 2016-04-05 09:58:23.753008 7f1e51dfa700  1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 119 ==== osd_op_reply(36361 200.0012af3e [write 3638093~33611 [fadvise_dontneed]] v12857'139184 u
v139184 ondisk = 0) v6 ==== 179+0+0 (4108456706 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
   -12> 2016-04-05 09:58:24.205130 7f1e56ec3700 10 monclient: tick
   -11> 2016-04-05 09:58:24.205149 7f1e56ec3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2016-04-05 09:57:54.205148)
   -10> 2016-04-05 09:58:24.205159 7f1e56ec3700 10 monclient: renew subs? (now: 2016-04-05 09:58:24.205159; renew after: 2016-04-05 09:59:24.204140) -- no
    -9> 2016-04-05 09:58:25.517199 7f1e54bfe700 10 monclient: _send_mon_message to mon.mds02 at 10.141.16.2:6789/0
    -8> 2016-04-05 09:58:25.517221 7f1e54bfe700  1 -- 10.141.16.3:6800/2852702 --> 10.141.16.2:6789/0 -- mdsbeacon(891178/mds03 up:active seq 64 v11535) v4 -- ?+0 0x7f1e700a6a00 con 0x7f1e638242c0
    -7> 2016-04-05 09:58:29.517263 7f1e54bfe700 10 monclient: _send_mon_message to mon.mds02 at 10.141.16.2:6789/0
    -6> 2016-04-05 09:58:29.517286 7f1e54bfe700  1 -- 10.141.16.3:6800/2852702 --> 10.141.16.2:6789/0 -- mdsbeacon(891178/mds03 up:active seq 65 v11535) v4 -- ?+0 0x7f1e700a6d00 con 0x7f1e638242c0
    -5> 2016-04-05 09:58:33.517341 7f1e54bfe700 10 monclient: _send_mon_message to mon.mds02 at 10.141.16.2:6789/0
    -4> 2016-04-05 09:58:33.517381 7f1e54bfe700  1 -- 10.141.16.3:6800/2852702 --> 10.141.16.2:6789/0 -- mdsbeacon(891178/mds03 up:active seq 66 v11535) v4 -- ?+0 0x7f1e700a7000 con 0x7f1e638242c0
    -3> 2016-04-05 09:58:34.205253 7f1e56ec3700 10 monclient: tick
    -2> 2016-04-05 09:58:34.205288 7f1e56ec3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2016-04-05 09:58:04.205287)
    -1> 2016-04-05 09:58:34.205308 7f1e56ec3700 10 monclient: renew subs? (now: 2016-04-05 09:58:34.205308; renew after: 2016-04-05 09:59:24.204140) -- no
     0> 2016-04-05 09:58:35.480254 7f1e527fd700 -1 *** Caught signal (Aborted) **
 in thread 7f1e527fd700

 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
 1: (()+0x4b6f42) [0x7f1e5f61ff42]
 2: (()+0xf100) [0x7f1e5e32d100]
 3: (gsignal()+0x37) [0x7f1e5d14f5f7]
 4: (abort()+0x148) [0x7f1e5d150ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f1e5da539d5]
 6: (()+0x5e946) [0x7f1e5da51946]
 7: (()+0x5e973) [0x7f1e5da51973]
 8: (()+0x5eb93) [0x7f1e5da51b93]
 9: (CDir::_load_dentry(std::string const&, std::string const&, snapid_t, ceph::buffer::list&, int, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const*, bool*, std::list<CInode*, std::alloc
ator<CInode*> >*)+0x2823) [0x7f1e5f521423]
 10: (CDir::_omap_fetched(ceph::buffer::list&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&, std::string const&, int)+0x9
f8) [0x7f1e5f5221e8]
 11: (MDSIOContextBase::complete(int)+0xa4) [0x7f1e5f582a54]
 12: (Finisher::finisher_thread_entry()+0x168) [0x7f1e5f642f58]
 13: (()+0x7dc5) [0x7f1e5e325dc5]
 14: (clone()+0x6d) [0x7f1e5d21028d]

How can we fix this?


Files

ceph-mds.mds03.log.1.gz (188 KB) ceph-mds.mds03.log.1.gz Kenneth Waegeman, 04/05/2016 11:19 AM
ceph-mds02.tail.tail.gz (219 KB) ceph-mds02.tail.tail.gz Kenneth Waegeman, 04/05/2016 12:55 PM
ceph-mds02.tail.gz (432 KB) ceph-mds02.tail.gz Kenneth Waegeman, 04/05/2016 01:26 PM
Actions

Also available in: Atom PDF