Bug #596

crash during mds reconnect

Added by Greg Farnum over 13 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 100%

Description

While testing my Journaler changes, I got a cfuse segfault. My steps:
1. vstart with 1 of each daemon
2. mount cfuse
3. copy in the pjd workunit, start running
4. kill -9 the mds while it was untarring
5. restart the mds
6. kill -9 the mds while running tests
7. restart the mds
8. then cfuse crashed
The other daemons seemed to be fine, and restarting the entire system worked and let me mount cfuse again.
Unfortunately I had no logging.

I reproduced this on unstable by just killing and restarting the MDS once the pjd test had started running tests. It seems to be 100% reproducible.

#0  0x000000000046a1fa in Client::encode_dentry_release (this=0x2924700, dn=0x29cfe00, req=0x2dfa000, mds=0, drop=256, unless=512) at client/Client.cc:1105
1105                                          mds, drop, unless, 1);
(gdb) bt
#0  0x000000000046a1fa in Client::encode_dentry_release (this=0x2924700, dn=0x29cfe00, req=0x2dfa000, mds=0, drop=256, unless=512) at client/Client.cc:1105
#1  0x000000000046a4bb in Client::encode_cap_releases (this=0x2924700, req=0x2f05280, m=0x2dfa000, mds=0) at client/Client.cc:1140
#2  0x0000000000472312 in Client::send_request (this=0x2924700, request=0x2f05280, mds=0) at client/Client.cc:1218
#3  0x0000000000472976 in Client::resend_unsafe_requests (this=0x2924700, mds_num=0) at client/Client.cc:1596
#4  0x00000000004804e1 in Client::send_reconnect (this=<value optimized out>, mds=0) at client/Client.cc:1566
#5  0x0000000000493bdc in Client::handle_mds_map (this=0x2924700, m=<value optimized out>) at client/Client.cc:1494
#6  0x000000000049b65b in Client::ms_dispatch (this=0x2924700, m=0x2d22600) at client/Client.cc:1410
#7  0x000000000044c479 in Messenger::ms_deliver_dispatch (this=0x2934000) at msg/Messenger.h:97
#8  SimpleMessenger::dispatch_entry (this=0x2934000) at msg/SimpleMessenger.cc:332
#9  0x0000000000444c2c in SimpleMessenger::DispatchThread::entry (this=0x2934488) at msg/SimpleMessenger.h:529
#10 0x000000000045853a in Thread::_entry_func (arg=0x2924700) at ./common/Thread.h:39
#11 0x00007faf195e173a in start_thread (arg=<value optimized out>) at pthread_create.c:300
#12 0x00007faf1834e69d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#13 0x0000000000000000 in ?? ()
(gdb) p *dn
$1 = {
  <LRUObject> = {
    lru_next = 0x27, 
    lru_prev = 0x4c, 
    lru_pinned = false, 
    lru = 0x657473662d646a70, 
    lru_list = 0x30383030322d7473
  }, 
  members of Dentry: 
  name = {
    static npos = 18446744073709551615, 
    _M_dataplus = {
      <std::allocator<char>> = {
        <__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
      members of std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Alloc_hider: 
      _M_p = 0x747365742f363138 <Address 0x747365742f363138 out of bounds>
    }
  }, 
  dir = 0x61636e7572742f73, 
  inode = 0x742e36302f6574, 
  ref = 0, 
  offset = 2, 
  lease_mds = -1, 
  lease_ttl = {
    tv = {
      tv_sec = 0, 
      tv_nsec = 0
    }
  }, 
  lease_gen = 0, 
  lease_seq = 0, 
  cap_shared_gen = 1
}
(gdb) p *(dn->dir)
Cannot access memory at address 0x61636e7572742f73
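
The pointer-sized fields in the dump above look like ASCII text rather than addresses. A minimal sketch that decodes them, assuming a little-endian x86_64 client (the helper names decode_le_ascii and decode_dump are made up for illustration and are not part of Ceph):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Interpret a 64-bit value as up to eight little-endian bytes of text,
// stopping at the first NUL byte.
std::string decode_le_ascii(std::uint64_t word) {
  std::string out;
  for (int i = 0; i < 8; ++i) {
    char c = static_cast<char>((word >> (8 * i)) & 0xff);
    if (c == '\0') break;
    out.push_back(c);
  }
  return out;
}

// Decode the suspicious fields of *dn in the order gdb printed them.
std::string decode_dump() {
  const std::vector<std::uint64_t> words = {
    0x657473662d646a70ULL,  // lru
    0x30383030322d7473ULL,  // lru_list
    0x747365742f363138ULL,  // name._M_dataplus._M_p
    0x61636e7572742f73ULL,  // dir
    0x00742e36302f6574ULL,  // inode
  };
  std::string path;
  for (std::uint64_t w : words) path += decode_le_ascii(w);
  return path;  // "pjd-fstest-20080816/tests/truncate/06.t"
}
```

The decoded string is 39 characters long, which matches lru_next = 0x27. That suggests the memory behind dn had already been freed and reused for a string holding a pjd test path, so dn was probably a stale pointer by the time encode_dentry_release dereferenced it.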

History

#1 Updated by Sage Weil over 13 years ago

  • Assignee set to Greg Farnum
  • Target version set to v0.23.1

encode_cap_releases can only be called once, the very first time we send the request. The backtrace shows it being reached again through resend_unsafe_requests, so at some level this is already off in the weeds. That needs to be fixed.

As for the actual crash, that may be partly why we're getting a bad pointer, but there is possibly a more specific bug as well...

#2 Updated by Greg Farnum over 13 years ago

  • Status changed from New to Resolved
  • % Done changed from 0 to 100

Well, that seems to fix it. I added a releases vector to the MetaRequest so it will only encode the releases once, and the steps outlined above no longer break stuff.
Pushed in commit:f7170f95f084a6f91729c3543a214792be571fc1 to the testing branch.
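
For readers without the commit at hand, the encode-once idea can be sketched roughly as follows. This is not the code from the commit: Release, MetaRequestSketch, MessageSketch and build_message are hypothetical stand-ins, and the real MetaRequest and message types look different.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one encoded cap/dentry release record.
struct Release {
  std::uint64_t ino;
  int drop;
};

// Hypothetical stand-in for the client's per-request state.
struct MetaRequestSketch {
  std::vector<Release> pending;   // state consumed when releases are encoded
  std::vector<Release> releases;  // cached result, filled on the first send only
  bool releases_encoded = false;
};

// Hypothetical stand-in for the outgoing request message.
struct MessageSketch {
  std::vector<Release> releases;
};

// Build a message for a first send or a resend. The releases are encoded
// from live state only once; every later send reuses the cached vector
// instead of walking dentries/inodes that may no longer exist.
MessageSketch build_message(MetaRequestSketch& req) {
  if (!req.releases_encoded) {
    req.releases = std::move(req.pending);
    req.pending.clear();
    req.releases_encoded = true;
  }
  MessageSketch m;
  m.releases = req.releases;
  return m;
}
```

With this shape, a resend after an MDS restart copies the cached releases and never touches the dentry again, which is the property the fix needed.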

#3 Updated by John Spray over 7 years ago

  • Project changed from Ceph to CephFS
  • Category deleted (11)
  • Target version deleted (v0.23.1)

Bulk updating project=ceph category=ceph-fuse issues to move to fs project so that we can remove the ceph-fuse category from the ceph project
