Bug #8623 (closed)

MDS crashes (unable to access CephFS) / mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)'

Added by Dmitry Smirnov almost 10 years ago. Updated almost 8 years ago.

Status: Won't Fix
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

All of a sudden I found all three MDS servers down; they crash on every attempt to start:

     0> 2014-06-18 10:58:03.702998 7f36fd13a700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)' thread 7f36fd13a700 time 2014-06-18 10:58:03.699374
mds/MDCache.cc: 6119: FAILED assert(r == 0 || r == -2)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x2ee369) [0x7f3702616369]
 2: (Context::complete(int)+0x9) [0x7f37024ae5a9]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f3702732a2e]
 4: (MDS::handle_core_message(Message*)+0xb3f) [0x7f37024d0b5f]
 5: (MDS::_dispatch(Message*)+0x32) [0x7f37024d0d52]
 6: (MDS::ms_dispatch(Message*)+0xab) [0x7f37024d273b]
 7: (DispatchQueue::entry()+0x58a) [0x7f3702907bfa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f370282404d]
 9: (()+0x80ca) [0x7f3701c7e0ca]
 10: (clone()+0x6d) [0x7f37005f3ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.debmain.log
--- end dump of recent events ---
2014-06-18 10:58:03.765613 7f36fd13a700 -1 *** Caught signal (Aborted) **
 in thread 7f36fd13a700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x430e42) [0x7f3702758e42]
 2: (()+0xf8f0) [0x7f3701c858f0]
 3: (gsignal()+0x37) [0x7f3700543407]
 4: (abort()+0x148) [0x7f3700546508]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x175) [0x7f3700e2ed65]
 6: (()+0x5edd6) [0x7f3700e2cdd6]
 7: (()+0x5ee21) [0x7f3700e2ce21]
 8: (()+0x5f039) [0x7f3700e2d039]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1e3) [0x7f370283d4b3]
 10: (()+0x2ee369) [0x7f3702616369]
 11: (Context::complete(int)+0x9) [0x7f37024ae5a9]
 12: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f3702732a2e]
 13: (MDS::handle_core_message(Message*)+0xb3f) [0x7f37024d0b5f]
 14: (MDS::_dispatch(Message*)+0x32) [0x7f37024d0d52]
 15: (MDS::ms_dispatch(Message*)+0xab) [0x7f37024d273b]
 16: (DispatchQueue::entry()+0x58a) [0x7f3702907bfa]
 17: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f370282404d]
 18: (()+0x80ca) [0x7f3701c7e0ca]
 19: (clone()+0x6d) [0x7f37005f3ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
   -29> 2014-06-18 11:00:51.333236 7f9f24fc3700  5 mds.0.211 handle_mds_map epoch 1784 from mon.2
   -28> 2014-06-18 11:00:51.333263 7f9f24fc3700  1 mds.0.211 handle_mds_map i am now mds.0.211
   -27> 2014-06-18 11:00:51.333266 7f9f24fc3700  1 mds.0.211 handle_mds_map state change up:rejoin --> up:active
   -26> 2014-06-18 11:00:51.333272 7f9f24fc3700  1 mds.0.211 recovery_done -- successful recovery!
   -25> 2014-06-18 11:00:51.333279 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> mds.0 192.168.0.2:6802/15762 -- mds_table_request(anchortable server_ready) v1 -- ?+0 0x7f9f3ee19400
   -24> 2014-06-18 11:00:51.333293 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> mds.0 192.168.0.2:6802/15762 -- mds_table_request(snaptable server_ready) v1 -- ?+0 0x7f9f3ee19600
   -23> 2014-06-18 11:00:51.333347 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.7:6805/5005 -- osd_op(mds.0.211:169108 100000cfe98.00000000 [trimtrunc 2@0] 14.584bcdaa snapc 1=[] ondisk+write e37247) v4 -- ?+0 0x7f9f2def5d40 con 0x7f9f2d3e3e40
   -22> 2014-06-18 11:00:51.598443 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:6801/25505 -- osd_op(mds.0.211:169109 100.00000000 [omap-get-header 0~0,omap-get-vals 0~16] 7.c5265ab3 ack+read e37247) v4 -- ?+0 0x7f9f30459440 con 0x7f9f2d3e3600
   -21> 2014-06-18 11:00:51.598477 7f9f24fc3700  1 mds.0.211 active_start
   -20> 2014-06-18 11:00:51.599416 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 112) v1 from client.4588693
   -19> 2014-06-18 11:00:51.599429 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.6:0/2036 -- client_session(renewcaps seq 112) v1 -- ?+0 0x7f9f3b557c00 con 0x7f9f2f3c4580
   -18> 2014-06-18 11:00:51.599443 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 144964) v1 from client.3400732
   -17> 2014-06-18 11:00:51.599447 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.250:0/15907 -- client_session(renewcaps seq 144964) v1 -- ?+0 0x7f9f2d3eea80 con 0x7f9f2d3e3b80
   -16> 2014-06-18 11:00:51.599455 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 49344) v1 from client.4205712
   -15> 2014-06-18 11:00:51.599457 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.2:0/10573 -- client_session(renewcaps seq 49344) v1 -- ?+0 0x7f9f2d3ee1c0 con 0x7f9f2f3c4000
   -14> 2014-06-18 11:00:51.599464 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 12413) v1 from client.4475969
   -13> 2014-06-18 11:00:51.599466 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:0/24100 -- client_session(renewcaps seq 12413) v1 -- ?+0 0x7f9f2d3ef500 con 0x7f9f2f3c4160
   -12> 2014-06-18 11:00:51.599473 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 12414) v1 from client.4475969
   -11> 2014-06-18 11:00:51.599475 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.204:0/24100 -- client_session(renewcaps seq 12414) v1 -- ?+0 0x7f9f2d3ef340 con 0x7f9f2f3c4160
   -10> 2014-06-18 11:00:51.599480 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 49345) v1 from client.4205712
    -9> 2014-06-18 11:00:51.599482 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.2:0/10573 -- client_session(renewcaps seq 49345) v1 -- ?+0 0x7f9f2efd0700 con 0x7f9f2f3c4000
    -8> 2014-06-18 11:00:51.599524 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 113) v1 from client.4588693
    -7> 2014-06-18 11:00:51.599530 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.6:0/2036 -- client_session(renewcaps seq 113) v1 -- ?+0 0x7f9f4a862000 con 0x7f9f2f3c4580
    -6> 2014-06-18 11:00:51.599544 7f9f24fc3700  3 mds.0.server handle_client_session client_session(request_renewcaps seq 144965) v1 from client.3400732
    -5> 2014-06-18 11:00:51.599548 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 --> 192.168.0.250:0/15907 -- client_session(renewcaps seq 144965) v1 -- ?+0 0x7f9f68138fc0 con 0x7f9f2d3e3b80
    -4> 2014-06-18 11:00:51.599558 7f9f24fc3700  1 mds.0.211 cluster recovered.
    -3> 2014-06-18 11:00:51.599568 7f9f24fc3700  5 mds.0.bal rebalance done
    -2> 2014-06-18 11:00:51.599587 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 <== mds.0 192.168.0.2:6802/15762 0 ==== mds_table_request(anchortable server_ready) v1 ==== 0+0+0 (0 0 0) 0x7f9f3ee19400 con 0x7f9f2d3e22c0
    -1> 2014-06-18 11:00:51.599608 7f9f24fc3700  1 -- 192.168.0.2:6802/15762 <== osd.9 192.168.0.7:6805/5005 8371 ==== osd_op_reply(169108 100000cfe98.00000000 [trimtrunc 2@0] v0'0 uv0 ondisk = -95 ((95) Operation not supported)) v6 ==== 187+0+0 (3832707418 0 0) 0x7f9f4a1d6a00 con 0x7f9f2d3e3e40
     0> 2014-06-18 11:00:51.601135 7f9f24fc3700 -1 mds/MDCache.cc: In function 'virtual void C_MDC_TruncateFinish::finish(int)' thread 7f9f24fc3700 time 2014-06-18 11:00:51.599632
mds/MDCache.cc: 6119: FAILED assert(r == 0 || r == -2)

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x2ee369) [0x7f9f2a49f369]
 2: (Context::complete(int)+0x9) [0x7f9f2a3375a9]
 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xf3e) [0x7f9f2a5bba2e]
 4: (MDS::handle_core_message(Message*)+0xb3f) [0x7f9f2a359b5f]
 5: (MDS::_dispatch(Message*)+0x32) [0x7f9f2a359d52]
 6: (MDS::ms_dispatch(Message*)+0xab) [0x7f9f2a35b73b]
 7: (DispatchQueue::entry()+0x58a) [0x7f9f2a790bfa]
 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f9f2a6ad04d]
 9: (()+0x80ca) [0x7f9f29b070ca]
 10: (clone()+0x6d) [0x7f9f2847cffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.debmain.log
--- end dump of recent events ---
2014-06-18 11:00:51.663899 7f9f24fc3700 -1 *** Caught signal (Aborted) **
 in thread 7f9f24fc3700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x430e42) [0x7f9f2a5e1e42]
...

The scary thing is that I can't access the file system any more, because the MDS servers crash as soon as they start. Please advise.
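
For context on why the assert fires: the last events above show osd.9 answering the MDS's trimtrunc op with -95 ((95) Operation not supported), i.e. -EOPNOTSUPP, while the completion callback at mds/MDCache.cc:6119 only tolerates 0 and -2 (-ENOENT). Below is a minimal, self-contained C++ sketch of that pattern; only the name C_MDC_TruncateFinish comes from the backtrace, and the Context base class is reduced to the bare minimum, so this is an illustration of the failure mode, not the actual Ceph source:

#include <cassert>
#include <cerrno>

// Simplified stand-in for Ceph's Context callback interface.
struct Context {
  virtual ~Context() {}
  virtual void finish(int r) = 0;
  void complete(int r) { finish(r); delete this; }  // as in the backtrace
};

// Sketch of the truncate-completion context named in the backtrace.
struct C_MDC_TruncateFinish : public Context {
  void finish(int r) override {
    // Only success (0) or "object already gone" (-ENOENT == -2) are
    // expected. Any other OSD error, such as -EOPNOTSUPP (-95) for a
    // trimtrunc the pool cannot perform, trips the assert and aborts
    // the daemon -- the "FAILED assert(r == 0 || r == -2)" above.
    assert(r == 0 || r == -2);
    // ... the real code would finish truncation bookkeeping here ...
  }
};

int main() {
  Context *c = new C_MDC_TruncateFinish;
  c->complete(-EOPNOTSUPP);  // reproduces the assert/abort path
  return 0;
}

Because the MDS re-issues the pending trimtrunc right after recovery (see the -23> line above, sent immediately on reaching up:active), the same -95 reply presumably comes back on every restart, which would explain why all three MDS daemons die as soon as they start. That is also consistent with the related Bug #8624 below: at the time, an erasure-coded pool used as a CephFS data pool could not service ops like trimtrunc.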


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #8624: monitor: disallow specifying an EC pool as a data or metadata pool (Resolved, Joao Eduardo Luis, 06/17/2014)
