Bug #8623
MDS crashes (unable to access CephFS) / mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)'
Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: -
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Component(FS): MDS
Labels (FS): -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -
Description
All of a sudden, I found all three MDS servers down and refusing to start (they crash on every attempt):
     0> 2014-06-18 10:58:03.702998 7f36fd13a700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)' thread 7f36fd13a700 time 2014-06-18 10:58:03.699374
mds/MDCache.cc: 6119: FAILED assert(r == 0 || r == -2)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x2ee369) [0x7f3702616369]
 2: (Context::complete(int)+0x9) [0x7f37024ae5a9]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f3702732a2e]
 4: (MDS::handle_core_message(Message*)+0xb3f) [0x7f37024d0b5f]
 5: (MDS::_dispatch(Message*)+0x32) [0x7f37024d0d52]
 6: (MDS::ms_dispatch(Message*)+0xab) [0x7f37024d273b]
 7: (DispatchQueue::entry()+0x58a) [0x7f3702907bfa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f370282404d]
 9: (()+0x80ca) [0x7f3701c7e0ca]
 10: (clone()+0x6d) [0x7f37005f3ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mds.debmain.log
--- end dump of recent events ---

2014-06-18 10:58:03.765613 7f36fd13a700 -1 *** Caught signal (Aborted) **
 in thread 7f36fd13a700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x430e42) [0x7f3702758e42]
 2: (()+0xf8f0) [0x7f3701c858f0]
 3: (gsignal()+0x37) [0x7f3700543407]
 4: (abort()+0x148) [0x7f3700546508]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x175) [0x7f3700e2ed65]
 6: (()+0x5edd6) [0x7f3700e2cdd6]
 7: (()+0x5ee21) [0x7f3700e2ce21]
 8: (()+0x5f039) [0x7f3700e2d039]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1e3) [0x7f370283d4b3]
 10: (()+0x2ee369) [0x7f3702616369]
 11: (Context::complete(int)+0x9) [0x7f37024ae5a9]
 12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f3702732a2e]
 13: (MDS::handle_core_message(Message*)+0xb3f) [0x7f37024d0b5f]
 14: (MDS::_dispatch(Message*)+0x32) [0x7f37024d0d52]
 15: (MDS::ms_dispatch(Message*)+0xab) [0x7f37024d273b]
 16: (DispatchQueue::entry()+0x58a) [0x7f3702907bfa]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f370282404d]
 18: (()+0x80ca) [0x7f3701c7e0ca]
 19: (clone()+0x6d) [0x7f37005f3ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
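For anyone triaging this: the assertion at mds/MDCache.cc:6119 tolerates only r == 0 (success) or r == -2 (-ENOENT) from the OSD, and the restart log below shows the trimtrunc op coming back with -95 (EOPNOTSUPP) instead, which no restart can get past. Below is a minimal C++ sketch of this completion-callback pattern, with hypothetical, simplified types standing in for Ceph's Context and C_MDC_TruncateFinish (it is an illustration, not the actual Ceph source):

// Minimal sketch (hypothetical names, not the real Ceph code) of why an
// unexpected OSD return code aborts the MDS: the truncate-finish context
// accepts only 0 or -ENOENT (-2); anything else fails the assert.
// (Ceph's own assert is always compiled in; plain assert() is close
// enough for illustration.)
#include <cassert>
#include <cerrno>
#include <cstdio>

struct Context {                       // simplified stand-in for Ceph's Context
    virtual ~Context() {}
    virtual void finish(int r) = 0;
    void complete(int r) { finish(r); delete this; }
};

struct C_MDC_TruncateFinishSketch : Context {
    void finish(int r) override {
        // mirrors "FAILED assert(r == 0 || r == -2)" at mds/MDCache.cc:6119
        assert(r == 0 || r == -ENOENT);
        std::printf("truncate finished, r=%d\n", r);
    }
};

int main() {
    // An OSD reply of -EOPNOTSUPP (-95), as seen in the log below,
    // reaches the callback unfiltered and trips the assertion.
    (new C_MDC_TruncateFinishSketch)->complete(-EOPNOTSUPP);  // aborts
}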
   -29> 2014-06-18 11:00:51.333236 7f9f24fc3700  5 mds.0.211 handle_mds_map epoch 1784 from mon.2
   -28> 2014-06-18 11:00:51.333263 7f9f24fc3700  1 mds.0.211 handle_mds_map i am now mds.0.211
   -27> 2014-06-18 11:00:51.333266 7f9f24fc3700  1 mds.0.211 handle_mds_map state change up:rejoin --> up:active
   -26> 2014-06-18 11:00:51.333272 7f9f24fc3700  1 mds.0.211 recovery_done -- successful recovery!
   -25> 2014-06-18 11:00:51.333279 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> mds.0 192.168.0.2:6802/15762 -- mds_table_request(anchortable server_ready) v1 -- ?+0 0x7f9f3ee19400
   -24> 2014-06-18 11:00:51.333293 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> mds.0 192.168.0.2:6802/15762 -- mds_table_request(snaptable server_ready) v1 -- ?+0 0x7f9f3ee19600
   -23> 2014-06-18 11:00:51.333347 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.7:6805/5005 -- osd_op(mds.0.211:169108 100000cfe98.00000000 [trimtrunc 2@0] 14.584bcdaa snapc 1=[] ondisk+write e37247) v4 -- ?+0 0x7f9f2def5d40 con 0x7f9f2d3e3e40
   -22> 2014-06-18 11:00:51.598443 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:6801/25505 -- osd_op(mds.0.211:169109 100.00000000 [omap-get-header 0~0,omap-get-vals 0~16] 7.c5265ab3 ack+read e37247) v4 -- ?+0 0x7f9f30459440 con 0x7f9f2d3e3600
   -21> 2014-06-18 11:00:51.598477 7f9f24fc3700  1 mds.0.211 active_start
   -20> 2014-06-18 11:00:51.599416 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 112) v1 from client.4588693
   -19> 2014-06-18 11:00:51.599429 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.6:0/2036 -- client_session(renewcaps seq 112) v1 -- ?+0 0x7f9f3b557c00 con 0x7f9f2f3c4580
   -18> 2014-06-18 11:00:51.599443 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 144964) v1 from client.3400732
   -17> 2014-06-18 11:00:51.599447 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.250:0/15907 -- client_session(renewcaps seq 144964) v1 -- ?+0 0x7f9f2d3eea80 con 0x7f9f2d3e3b80
   -16> 2014-06-18 11:00:51.599455 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 49344) v1 from client.4205712
   -15> 2014-06-18 11:00:51.599457 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.2:0/10573 -- client_session(renewcaps seq 49344) v1 -- ?+0 0x7f9f2d3ee1c0 con 0x7f9f2f3c4000
   -14> 2014-06-18 11:00:51.599464 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 12413) v1 from client.4475969
   -13> 2014-06-18 11:00:51.599466 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:0/24100 -- client_session(renewcaps seq 12413) v1 -- ?+0 0x7f9f2d3ef500 con 0x7f9f2f3c4160
   -12> 2014-06-18 11:00:51.599473 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 12414) v1 from client.4475969
   -11> 2014-06-18 11:00:51.599475 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:0/24100 -- client_session(renewcaps seq 12414) v1 -- ?+0 0x7f9f2d3ef340 con 0x7f9f2f3c4160
   -10> 2014-06-18 11:00:51.599480 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 49345) v1 from client.4205712
    -9> 2014-06-18 11:00:51.599482 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.2:0/10573 -- client_session(renewcaps seq 49345) v1 -- ?+0 0x7f9f2efd0700 con 0x7f9f2f3c4000
    -8> 2014-06-18 11:00:51.599524 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 113) v1 from client.4588693
    -7> 2014-06-18 11:00:51.599530 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.6:0/2036 -- client_session(renewcaps seq 113) v1 -- ?+0 0x7f9f4a862000 con 0x7f9f2f3c4580
    -6> 2014-06-18 11:00:51.599544 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 144965) v1 from client.3400732
    -5> 2014-06-18 11:00:51.599548 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.250:0/15907 -- client_session(renewcaps seq 144965) v1 -- ?+0 0x7f9f68138fc0 con 0x7f9f2d3e3b80
    -4> 2014-06-18 11:00:51.599558 7f9f24fc3700  1 mds.0.211 cluster recovered.
    -3> 2014-06-18 11:00:51.599568 7f9f24fc3700  5 mds.0.bal rebalance done
    -2> 2014-06-18 11:00:51.599587 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 <== mds.0 192.168.0.2:6802/15762 0 ==== mds_table_request(anchortable server_ready) v1 ==== 0+0+0 (0 0 0) 0x7f9f3ee19400 con 0x7f9f2d3e22c0
    -1> 2014-06-18 11:00:51.599608 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 <== osd.9 192.168.0.7:6805/5005 8371 ==== osd_op_reply(169108 100000cfe98.00000000 [trimtrunc 2@0] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 ==== 187+0+0 (3832707418 0 0) 0x7f9f4a1d6a00 con 0x7f9f2d3e3e40
     0> 2014-06-18 11:00:51.601135 7f9f24fc3700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)' thread 7f9f24fc3700 time 2014-06-18 11:00:51.599632
mds/MDCache.cc: 6119: FAILED assert(r == 0 || r == -2)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x2ee369) [0x7f9f2a49f369]
 2: (Context::complete(int)+0x9) [0x7f9f2a3375a9]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f9f2a5bba2e]
 4: (MDS::handle_core_message(Message*)+0xb3f) [0x7f9f2a359b5f]
 5: (MDS::_dispatch(Message*)+0x32) [0x7f9f2a359d52]
 6: (MDS::ms_dispatch(Message*)+0xab) [0x7f9f2a35b73b]
 7: (DispatchQueue::entry()+0x58a) [0x7f9f2a790bfa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9f2a6ad04d]
 9: (()+0x80ca) [0x7f9f29b070ca]
 10: (clone()+0x6d) [0x7f9f2847cffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 10000
  max_new 1000
  log_file /var/log/ceph/ceph-mds.debmain.log
--- end dump of recent events ---

2014-06-18 11:00:51.663899 7f9f24fc3700 -1 *** Caught signal (Aborted) **
 in thread 7f9f24fc3700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x430e42) [0x7f9f2a5e1e42]
 ...
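The decisive entry is the -1> line above: the OSD answers the trimtrunc op with -95, and since the same op is re-issued as the MDS goes active after each restart, every attempt gets the same reply and hits the same assert. Ceph reports these as negative errno values; a tiny sketch (purely illustrative) that decodes the raw code:

// Decode the raw OSD return code from the osd_op_reply above.
// -95 is EOPNOTSUPP ("Operation not supported") on Linux.
#include <cstdio>
#include <cstring>

int main() {
    int r = -95;  // raw return code from the osd_op_reply in the log
    // strerror() takes the positive errno, so negate the reply code.
    std::printf("r=%d -> %s\n", r, std::strerror(-r));
    return 0;
}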
The scary thing is that I can't access the file system any more, because the MDS servers crash as soon as they start. Please advise.