Bug #8623 (closed)

MDS crashes (unable to access CephFS) / mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)'

Added by Dmitry Smirnov almost 10 years ago. Updated almost 8 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All of a sudden I found all three MDS servers down; they crash on every attempt to start:

     0> 2014-06-18 10:58:03.702998 7f36fd13a700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)' thread 7f36fd13a700 time 2014-06-18 10:58:03.699374
mds/MDCache.cc: 6119: FAILED assert(r == 0 || r == -2)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x2ee369) [0x7f3702616369]
 2: (Context::complete(int)+0x9) [0x7f37024ae5a9]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f3702732a2e]
 4: (MDS::handle_core_message(Message*)+0xb3f) [0x7f37024d0b5f]
 5: (MDS::_dispatch(Message*)+0x32) [0x7f37024d0d52]
 6: (MDS::ms_dispatch(Message*)+0xab) [0x7f37024d273b]
 7: (DispatchQueue::entry()+0x58a) [0x7f3702907bfa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f370282404d]
 9: (()+0x80ca) [0x7f3701c7e0ca]
 10: (clone()+0x6d) [0x7f37005f3ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.debmain.log
--- end dump of recent events ---
2014-06-18 10:58:03.765613 7f36fd13a700 -1 *** Caught signal (Aborted) **
 in thread 7f36fd13a700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x430e42) [0x7f3702758e42]
 2: (()+0xf8f0) [0x7f3701c858f0]
 3: (gsignal()+0x37) [0x7f3700543407]
 4: (abort()+0x148) [0x7f3700546508]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x175) [0x7f3700e2ed65]
 6: (()+0x5edd6) [0x7f3700e2cdd6]
 7: (()+0x5ee21) [0x7f3700e2ce21]
 8: (()+0x5f039) [0x7f3700e2d039]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1e3) [0x7f370283d4b3]
 10: (()+0x2ee369) [0x7f3702616369]
 11: (Context::complete(int)+0x9) [0x7f37024ae5a9]
 12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f3702732a2e]
 13: (MDS::handle_core_message(Message*)+0xb3f) [0x7f37024d0b5f]
 14: (MDS::_dispatch(Message*)+0x32) [0x7f37024d0d52]
 15: (MDS::ms_dispatch(Message*)+0xab) [0x7f37024d273b]
 16: (DispatchQueue::entry()+0x58a) [0x7f3702907bfa]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f370282404d]
 18: (()+0x80ca) [0x7f3701c7e0ca]
 19: (clone()+0x6d) [0x7f37005f3ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
   -29> 2014-06-18 11:00:51.333236 7f9f24fc3700  5 mds.0.211 handle_mds_map epoch 1784 from mon.2
   -28> 2014-06-18 11:00:51.333263 7f9f24fc3700  1 mds.0.211 handle_mds_map i am now mds.0.211
   -27> 2014-06-18 11:00:51.333266 7f9f24fc3700  1 mds.0.211 handle_mds_map state change up:rejoin --> up:active
   -26> 2014-06-18 11:00:51.333272 7f9f24fc3700  1 mds.0.211 recovery_done -- successful recovery!
   -25> 2014-06-18 11:00:51.333279 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> mds.0 192.168.0.2:6802/15762 -- mds_table_request(anchortable server_ready) v1 -- ?+0 0x7f9f3ee19400
   -24> 2014-06-18 11:00:51.333293 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> mds.0 192.168.0.2:6802/15762 -- mds_table_request(snaptable server_ready) v1 -- ?+0 0x7f9f3ee19600
   -23> 2014-06-18 11:00:51.333347 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.7:6805/5005 -- osd_op(mds.0.211:169108 100000cfe98.00000000 [trimtrunc 2@0] 14.584bcdaa snapc 1=[] ondisk+write e37247) v4 -- ?+0 0x7f9f2def5d40 con 0x7f9f2d3e3e40
   -22> 2014-06-18 11:00:51.598443 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:6801/25505 -- osd_op(mds.0.211:169109 100.00000000 [omap-get-header 0~0,omap-get-vals 0~16] 7.c5265ab3 ack+read e37247) v4 -- ?+0 0x7f9f30459440 con 0x7f9f2d3e3600
   -21> 2014-06-18 11:00:51.598477 7f9f24fc3700  1 mds.0.211 active_start
   -20> 2014-06-18 11:00:51.599416 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 112) v1 from client.4588693
   -19> 2014-06-18 11:00:51.599429 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.6:0/2036 -- client_session(renewcaps seq 112) v1 -- ?+0 0x7f9f3b557c00 con 0x7f9f2f3c4580
   -18> 2014-06-18 11:00:51.599443 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 144964) v1 from client.3400732
   -17> 2014-06-18 11:00:51.599447 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.250:0/15907 -- client_session(renewcaps seq 144964) v1 -- ?+0 0x7f9f2d3eea80 con 0x7f9f2d3e3b80
   -16> 2014-06-18 11:00:51.599455 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 49344) v1 from client.4205712
   -15> 2014-06-18 11:00:51.599457 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.2:0/10573 -- client_session(renewcaps seq 49344) v1 -- ?+0 0x7f9f2d3ee1c0 con 0x7f9f2f3c4000
   -14> 2014-06-18 11:00:51.599464 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 12413) v1 from client.4475969
   -13> 2014-06-18 11:00:51.599466 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:0/24100 -- client_session(renewcaps seq 12413) v1 -- ?+0 0x7f9f2d3ef500 con 0x7f9f2f3c4160
   -12> 2014-06-18 11:00:51.599473 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 12414) v1 from client.4475969
   -11> 2014-06-18 11:00:51.599475 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:0/24100 -- client_session(renewcaps seq 12414) v1 -- ?+0 0x7f9f2d3ef340 con 0x7f9f2f3c4160
   -10> 2014-06-18 11:00:51.599480 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 49345) v1 from client.4205712
    -9> 2014-06-18 11:00:51.599482 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.2:0/10573 -- client_session(renewcaps seq 49345) v1 -- ?+0 0x7f9f2efd0700 con 0x7f9f2f3c4000
    -8> 2014-06-18 11:00:51.599524 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 113) v1 from client.4588693
    -7> 2014-06-18 11:00:51.599530 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.6:0/2036 -- client_session(renewcaps seq 113) v1 -- ?+0 0x7f9f4a862000 con 0x7f9f2f3c4580
    -6> 2014-06-18 11:00:51.599544 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 144965) v1 from client.3400732
    -5> 2014-06-18 11:00:51.599548 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.250:0/15907 -- client_session(renewcaps seq 144965) v1 -- ?+0 0x7f9f68138fc0 con 0x7f9f2d3e3b80
    -4> 2014-06-18 11:00:51.599558 7f9f24fc3700  1 mds.0.211 cluster recovered.
    -3> 2014-06-18 11:00:51.599568 7f9f24fc3700  5 mds.0.bal rebalance done
    -2> 2014-06-18 11:00:51.599587 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 <== mds.0 192.168.0.2:6802/15762 0 ==== mds_table_request(anchortable server_ready) v1 ==== 0+0+0 (0 0 0) 0x7f9f3ee19400 con 0x7f9f2d3e22c0
    -1> 2014-06-18 11:00:51.599608 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 <== osd.9 192.168.0.7:6805/5005 8371 ==== osd_op_reply(169108 100000cfe98.00000000 [trimtrunc 2@0] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 ==== 187+0+0 (3832707418 0 0) 0x7f9f4a1d6a00 con 0x7f9f2d3e3e40
     0> 2014-06-18 11:00:51.601135 7f9f24fc3700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)' thread 7f9f24fc3700 time 2014-06-18 11:00:51.599632
mds/MDCache.cc: 6119: FAILED assert(r == 0 || r == -2)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x2ee369) [0x7f9f2a49f369]
 2: (Context::complete(int)+0x9) [0x7f9f2a3375a9]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f9f2a5bba2e]
 4: (MDS::handle_core_message(Message*)+0xb3f) [0x7f9f2a359b5f]
 5: (MDS::_dispatch(Message*)+0x32) [0x7f9f2a359d52]
 6: (MDS::ms_dispatch(Message*)+0xab) [0x7f9f2a35b73b]
 7: (DispatchQueue::entry()+0x58a) [0x7f9f2a790bfa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9f2a6ad04d]
 9: (()+0x80ca) [0x7f9f29b070ca]
 10: (clone()+0x6d) [0x7f9f2847cffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.debmain.log
--- end dump of recent events ---
2014-06-18 11:00:51.663899 7f9f24fc3700 -1 *** Caught signal (Aborted) **
 in thread 7f9f24fc3700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x430e42) [0x7f9f2a5e1e42]
...

The scary thing is that I can't access the file system any more, because the MDS servers crash as soon as they start. Please advise.
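
For context on why the assert fires: the last events above show osd.9 answering the MDS's trimtrunc op with -95 ((95) Operation not supported), i.e. -EOPNOTSUPP, while the completion callback at mds/MDCache.cc:6119 only tolerates 0 and -2 (-ENOENT). Below is a minimal, self-contained C++ sketch of that pattern; only the name C_MDC_TruncateFinish comes from the backtrace, and the Context base class is reduced to the bare minimum, so this is an illustration of the failure mode, not the actual Ceph source:

#include <cassert>
#include <cerrno>

// Simplified stand-in for Ceph's Context callback interface.
struct Context {
  virtual ~Context() {}
  virtual void finish(int r) = 0;
  void complete(int r) { finish(r); delete this; }  // as in the backtrace
};

// Sketch of the truncate-completion context named in the backtrace.
struct C_MDC_TruncateFinish : public Context {
  void finish(int r) override {
    // Only success (0) or "object already gone" (-ENOENT == -2) are
    // expected. Any other OSD error, such as -EOPNOTSUPP (-95) for a
    // trimtrunc the pool cannot perform, trips the assert and aborts
    // the daemon -- the "FAILED assert(r == 0 || r == -2)" above.
    assert(r == 0 || r == -2);
    // ... the real code would finish truncation bookkeeping here ...
  }
};

int main() {
  Context *c = new C_MDC_TruncateFinish;
  c->complete(-EOPNOTSUPP);  // reproduces the assert/abort path
  return 0;
}

Because the MDS re-issues the pending trimtrunc right after recovery (see the -23> line above, sent immediately on reaching up:active), the same -95 reply presumably comes back on every restart, which would explain why all three MDS daemons die as soon as they start. That is also consistent with the related Bug #8624 below: at the time, an erasure-coded pool used as a CephFS data pool could not service ops like trimtrunc.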


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #8624: monitor: disallow specifying an EC pool as a data or metadata pool (Resolved, Joao Eduardo Luis, 06/17/2014)
