Bug #15379
ceph MDS continuously crashes and goes into a laggy state (stray purging problems)
Status:
Closed
Priority:
Normal
Assignee:
-
Category:
Performance/Resource Usage
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
We were removing a very large directory from CephFS on Infernalis 9.2.1 when, after a while, an MDS went into a laggy state without failing over to another MDS. On checking, all MDSs had crashed, leaving this stack trace:
-26> 2016-04-04 23:46:35.743970 7faa127fe700 5 mds.0.log _submit_thread 5134992710184~531 : EUpdate purge_stray [metablob 600, 1 dirs]
-25> 2016-04-04 23:46:35.743978 7faa143fe700 1 -- 10.141.16.3:6806/2790032 --> 10.141.16.16:6824/1773611 -- osd_op(mds.0.332:140386 100015cef56.00000000 [delete] 4.116a78fc snapc 1=[] ondisk+write+known_if_redirected+full_force e12853) v6 -- ?+0 0x7faa3157d700 con 0x7faa29d9c840
-24> 2016-04-04 23:46:35.744014 7faa143fe700 1 -- 10.141.16.3:6806/2790032 --> 10.141.16.25:6801/2962763 -- osd_op(mds.0.332:140387 100015cef57.00000000 [delete] 4.aed913ee snapc 1=[] ondisk+write+known_if_redirected+full_force e12853) v6 -- ?+0 0x7fb273ab99c0 con 0x7faa2994dde0
-23> 2016-04-04 23:46:35.744019 7faa127fe700 5 mds.0.log _submit_thread 5134992710735~531 : EUpdate purge_stray [metablob 600, 1 dirs]
-22> 2016-04-04 23:46:35.744044 7faa127fe700 5 mds.0.log _submit_thread 5134992711286~531 : EUpdate purge_stray [metablob 600, 1 dirs]
-21> 2016-04-04 23:46:35.744046 7faa143fe700 1 -- 10.141.16.3:6806/2790032 --> 10.141.16.25:6801/2962763 -- osd_op(mds.0.332:140388 100015cef58.00000000 [delete] 4.7a4ded53 snapc 1=[] ondisk+write+known_if_redirected+full_force e12853) v6 -- ?+0 0x7fb2718f6b00 con 0x7faa2994dde0
-20> 2016-04-04 23:46:35.744183 7faa19cb3700 1 -- 10.141.16.3:6806/2790032 <== client.834666 10.141.16.11:0/2497359121 2362568 ==== client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516) v2 ==== 185+0+0 (3464717445 0 0) 0x7fb76e9c2ec0 con 0x7faa2420dc80
-19> 2016-04-04 23:46:35.744198 7faa19cb3700 4 mds.0.server handle_client_request client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516) v2
-18> 2016-04-04 23:46:35.744260 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744158, event: throttled, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-17> 2016-04-04 23:46:35.744264 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744172, event: all_read, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-16> 2016-04-04 23:46:35.744268 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744195, event: dispatched, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-15> 2016-04-04 23:46:35.744289 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744289, event: acquired locks, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-14> 2016-04-04 23:46:35.744297 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744297, event: replying, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-13> 2016-04-04 23:46:35.744307 7faa19cb3700 1 -- 10.141.16.3:6806/2790032 --> 10.141.16.11:0/2497359121 -- client_reply(???:74676965 = 0 (0) Success) v1 -- ?+0 0x7faa37b68b00 con 0x7faa2420dc80
-12> 2016-04-04 23:46:35.744317 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744317, event: finishing request, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-11> 2016-04-04 23:46:35.744331 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744331, event: cleaned up request, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-10> 2016-04-04 23:46:35.744336 7faa19cb3700 5 -- op tracker -- seq: 1571327, time: 2016-04-04 23:46:35.744336, event: done, op: client_request(client.834666:74676965 lookup #10002349485/constructramseygraphs_lower-3_9_32_e108-3n8-mod_500-haunter-pbs 2016-04-04 23:46:35.730516)
-9> 2016-04-04 23:46:35.744454 7faa19cb3700 1 -- 10.141.16.3:6806/2790032 <== client.834666 10.141.16.11:0/2497359121 2362569 ==== client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516) v2 ==== 122+0+0 (3349840858 0 0) 0x7faa37b68b00 con 0x7faa2420dc80
-8> 2016-04-04 23:46:35.744465 7faa19cb3700 4 mds.0.server handle_client_request client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516) v2
-7> 2016-04-04 23:46:35.744469 7faa19cb3700 5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744437, event: throttled, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516)
-6> 2016-04-04 23:46:35.744473 7faa19cb3700 5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744448, event: all_read, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516)
-5> 2016-04-04 23:46:35.744476 7faa19cb3700 5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744463, event: dispatched, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516)
-4> 2016-04-04 23:46:35.744492 7faa19cb3700 5 -- op tracker -- seq: 1571328, time: 2016-04-04 23:46:35.744492, event: acquired locks, op: client_request(client.834666:74676966 readdir #1000266eb9f 2016-04-04 23:46:35.730516)
-3> 2016-04-04 23:46:35.744517 7faa19cb3700 1 -- 10.141.16.3:6806/2790032 --> 10.141.16.2:6805/1303901 -- osd_op(mds.0.332:140389 1000266eb9f.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr parent] 6.acf7ddd ack+read+known_if_redirected+full_force e12853) v6 -- ?+0 0x7fb76e9c2ec0 con 0x7faa2420a420
-2> 2016-04-04 23:46:35.757602 7faa13bfd700 1 -- 10.141.16.3:6806/2790032 <== osd.3 10.141.16.2:6805/1303901 1945 ==== osd_op_reply(140389 1000266eb9f.00000000 [omap-get-header 0~0,omap-get-vals 0~16,getxattr (432)] v0'0 uv54613 ondisk = 0) v6 ==== 271+0+597535 (3018918272 0 3680795695) 0x7fb279097c80 con 0x7faa2420a420
-1> 2016-04-04 23:46:35.758698 7fa9febff700 1 -- 10.141.16.3:6806/2790032 <== osd.96 10.141.16.20:6802/1775683 4771 ==== osd_op_reply(140373 100015cef4c.00000000 [delete] v0'0 uv2249860 ondisk = 0) v6 ==== 187+0+0 (3654499068 0 0) 0x7faa2a9ac000 con 0x7faa29d9c160
0> 2016-04-04 23:46:35.767677 7faa143fe700 -1 *** Caught signal (Aborted) **
in thread 7faa143fe700
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (()+0x4b6f42) [0x7faa20b6df42]
2: (()+0xf100) [0x7faa1f87b100]
3: (gsignal()+0x37) [0x7faa1e69d5f7]
4: (abort()+0x148) [0x7faa1e69ece8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7faa1efa19d5]
6: (()+0x5e946) [0x7faa1ef9f946]
7: (()+0x5e973) [0x7faa1ef9f973]
8: (()+0x5eb93) [0x7faa1ef9fb93]
9: (CDir::_load_dentry(std::string const&, std::string const&, snapid_t, ceph::buffer::list&, int, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const*, bool*, std::list<CInode*, std::allocator<CInode*> >*)+0x2823) [0x7faa20a6f423]
10: (CDir::_omap_fetched(ceph::buffer::list&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&, std::string const&, int)+0x9f8) [0x7faa20a701e8]
11: (MDSIOContextBase::complete(int)+0xa4) [0x7faa20ad0a54]
12: (Finisher::finisher_thread_entry()+0x168) [0x7faa20b90f58]
13: (()+0x7dc5) [0x7faa1f873dc5]
14: (clone()+0x6d) [0x7faa1e75e28d]
I tried restarting the MDS, but it crashes again every time with a similar stack trace:
-27> 2016-04-05 09:58:23.751054 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 105 ==== osd_op_reply(36267 200.0012af3e [write 3618294~1635 [fadvise_dontneed]] v12857'139171 uv139171 ondisk = 0) v6 ==== 179+0+0 (2335018870 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-26> 2016-04-05 09:58:23.751097 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 106 ==== osd_op_reply(36268 200.0012af3e [write 3619929~990 [fadvise_dontneed]] v12857'139172 uv139172 ondisk = 0) v6 ==== 179+0+0 (1251463916 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-25> 2016-04-05 09:58:23.751122 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 107 ==== osd_op_reply(36269 200.0012af3e [write 3620919~1231 [fadvise_dontneed]] v12857'139173 uv139173 ondisk = 0) v6 ==== 179+0+0 (155261733 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-24> 2016-04-05 09:58:23.751150 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 108 ==== osd_op_reply(36270 200.0012af3e [write 3622150~1234 [fadvise_dontneed]] v12857'139174 uv139174 ondisk = 0) v6 ==== 179+0+0 (1138588898 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-23> 2016-04-05 09:58:23.751170 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 109 ==== osd_op_reply(36271 200.0012af3e [write 3623384~1228 [fadvise_dontneed]] v12857'139175 uv139175 ondisk = 0) v6 ==== 179+0+0 (2523301192 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-22> 2016-04-05 09:58:23.751194 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 110 ==== osd_op_reply(36272 200.0012af3e [write 3624612~1230 [fadvise_dontneed]] v12857'139176 uv139176 ondisk = 0) v6 ==== 179+0+0 (1337395193 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-21> 2016-04-05 09:58:23.751215 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 111 ==== osd_op_reply(36273 200.0012af3e [write 3625842~1232 [fadvise_dontneed]] v12857'139177 uv139177 ondisk = 0) v6 ==== 179+0+0 (2160167011 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-20> 2016-04-05 09:58:23.751241 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 112 ==== osd_op_reply(36274 200.0012af3e [write 3627074~1232 [fadvise_dontneed]] v12857'139178 uv139178 ondisk = 0) v6 ==== 179+0+0 (1458940318 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-19> 2016-04-05 09:58:23.751292 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 113 ==== osd_op_reply(36275 200.0012af3e [write 3628306~1230 [fadvise_dontneed]] v12857'139179 uv139179 ondisk = 0) v6 ==== 179+0+0 (2161543871 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-18> 2016-04-05 09:58:23.751315 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 114 ==== osd_op_reply(36276 200.0012af3e [write 3629536~1230 [fadvise_dontneed]] v12857'139180 uv139180 ondisk = 0) v6 ==== 179+0+0 (408786516 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-17> 2016-04-05 09:58:23.751387 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 115 ==== osd_op_reply(36277 200.0012af3e [write 3630766~1240 [fadvise_dontneed]] v12857'139181 uv139181 ondisk = 0) v6 ==== 179+0+0 (155577562 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-16> 2016-04-05 09:58:23.751411 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 116 ==== osd_op_reply(36278 200.0012af3e [write 3632006~4335 [fadvise_dontneed]] v12857'139182 uv139182 ondisk = 0) v6 ==== 179+0+0 (3957996991 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-15> 2016-04-05 09:58:23.751437 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 117 ==== osd_op_reply(36265 600.00000000 [stat,tmap2omap 1~0,omap-set-header 0~226,omap-set-vals 0~508380,omap-rm-keys 0~116024] v12857'145527 uv145527 ondisk = 0) v6 ==== 347+0+0 (3496540317 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-14> 2016-04-05 09:58:23.752939 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 118 ==== osd_op_reply(36279 200.0012af3e [write 3636341~1752 [fadvise_dontneed]] v12857'139183 uv139183 ondisk = 0) v6 ==== 179+0+0 (4072285340 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-13> 2016-04-05 09:58:23.753008 7f1e51dfa700 1 -- 10.141.16.3:6800/2852702 <== osd.5 10.141.16.3:6801/2389244 119 ==== osd_op_reply(36361 200.0012af3e [write 3638093~33611 [fadvise_dontneed]] v12857'139184 uv139184 ondisk = 0) v6 ==== 179+0+0 (4108456706 0 0) 0x7f1f875c3180 con 0x7f1e638249a0
-12> 2016-04-05 09:58:24.205130 7f1e56ec3700 10 monclient: tick
-11> 2016-04-05 09:58:24.205149 7f1e56ec3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2016-04-05 09:57:54.205148)
-10> 2016-04-05 09:58:24.205159 7f1e56ec3700 10 monclient: renew subs? (now: 2016-04-05 09:58:24.205159; renew after: 2016-04-05 09:59:24.204140) -- no
-9> 2016-04-05 09:58:25.517199 7f1e54bfe700 10 monclient: _send_mon_message to mon.mds02 at 10.141.16.2:6789/0
-8> 2016-04-05 09:58:25.517221 7f1e54bfe700 1 -- 10.141.16.3:6800/2852702 --> 10.141.16.2:6789/0 -- mdsbeacon(891178/mds03 up:active seq 64 v11535) v4 -- ?+0 0x7f1e700a6a00 con 0x7f1e638242c0
-7> 2016-04-05 09:58:29.517263 7f1e54bfe700 10 monclient: _send_mon_message to mon.mds02 at 10.141.16.2:6789/0
-6> 2016-04-05 09:58:29.517286 7f1e54bfe700 1 -- 10.141.16.3:6800/2852702 --> 10.141.16.2:6789/0 -- mdsbeacon(891178/mds03 up:active seq 65 v11535) v4 -- ?+0 0x7f1e700a6d00 con 0x7f1e638242c0
-5> 2016-04-05 09:58:33.517341 7f1e54bfe700 10 monclient: _send_mon_message to mon.mds02 at 10.141.16.2:6789/0
-4> 2016-04-05 09:58:33.517381 7f1e54bfe700 1 -- 10.141.16.3:6800/2852702 --> 10.141.16.2:6789/0 -- mdsbeacon(891178/mds03 up:active seq 66 v11535) v4 -- ?+0 0x7f1e700a7000 con 0x7f1e638242c0
-3> 2016-04-05 09:58:34.205253 7f1e56ec3700 10 monclient: tick
-2> 2016-04-05 09:58:34.205288 7f1e56ec3700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2016-04-05 09:58:04.205287)
-1> 2016-04-05 09:58:34.205308 7f1e56ec3700 10 monclient: renew subs? (now: 2016-04-05 09:58:34.205308; renew after: 2016-04-05 09:59:24.204140) -- no
0> 2016-04-05 09:58:35.480254 7f1e527fd700 -1 *** Caught signal (Aborted) **
in thread 7f1e527fd700
ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
1: (()+0x4b6f42) [0x7f1e5f61ff42]
2: (()+0xf100) [0x7f1e5e32d100]
3: (gsignal()+0x37) [0x7f1e5d14f5f7]
4: (abort()+0x148) [0x7f1e5d150ce8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f1e5da539d5]
6: (()+0x5e946) [0x7f1e5da51946]
7: (()+0x5e973) [0x7f1e5da51973]
8: (()+0x5eb93) [0x7f1e5da51b93]
9: (CDir::_load_dentry(std::string const&, std::string const&, snapid_t, ceph::buffer::list&, int, std::set<snapid_t, std::less<snapid_t>, std::allocator<snapid_t> > const*, bool*, std::list<CInode*, std::allocator<CInode*> >*)+0x2823) [0x7f1e5f521423]
10: (CDir::_omap_fetched(ceph::buffer::list&, std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > >&, std::string const&, int)+0x9f8) [0x7f1e5f5221e8]
11: (MDSIOContextBase::complete(int)+0xa4) [0x7f1e5f582a54]
12: (Finisher::finisher_thread_entry()+0x168) [0x7f1e5f642f58]
13: (()+0x7dc5) [0x7f1e5e325dc5]
14: (clone()+0x6d) [0x7f1e5d21028d]
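If it helps, before the next restart attempt we can raise the MDS debug levels so the crash log carries full context around the failing CDir::_load_dentry call; a minimal ceph.conf sketch (standard `debug mds` / `debug ms` options; the levels chosen here are just the usual maximum-verbosity values, not something specific to this bug):

```ini
[mds]
    ; verbose MDS-internal logging (20 is the highest level)
    debug mds = 20
    ; log messenger traffic at low verbosity to correlate OSD replies
    debug ms = 1
```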
How can we fix this?