Project

General

Profile

Bug #1228

mds bug: testceph hangs forever in Client::unmount

Added by Colin McCabe almost 13 years ago. Updated almost 13 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Sometime between 0.29 and 0.28.1, testceph broke.

I tried to bisect it, but there were other problems in testceph that cropped up and were later fixed in this period, so bisecting is very difficult.

The line where it's hanging is this one in client/Client.cc:

  ...
  //if (0) {// hack
  while (lru.lru_get_size() > 0 ||
         !inode_map.empty()) {
    ldout(cct, 2) << "cache still has " << lru.lru_get_size()
            << "+" << inode_map.size() << " items" 
            << ", waiting (for caps to release?)" 
            << dendl;
    dump_cache();
==> mount_cond.Wait(client_lock);
  }
  assert(lru.lru_get_size() == 0);
  assert(inode_map.empty());
  //}
  ...

To reproduce the bug, run this command:

CEPH_NUM_OSD=1 CEPH_NUM_MON=1 CEPH_NUM_MDS=1 ./vstart.sh -d -n && ./testceph

It will hang forever in Client::unmount.

If you run with --debug-client=20, you'll see this:

2011-06-24 15:40:30.661648 7f1c9e5e2720 client4101 telldir(0x7f1c94000c30) = 1152921504606846976
ceph_telldir: offset: from return code:1152921504606846976
2011-06-24 15:40:30.661657 7f1c9e5e2720 client4101 unmounting
2011-06-24 15:40:30.661663 7f1c9e5e2720 client4101 waiting on 1 requests
2011-06-24 15:40:30.661733 7f1c9a80a700 client4101 handle_client_reply got a reply. Safe:1 tid 31
2011-06-24 15:40:30.661750 7f1c9a80a700 client4101 insert_trace from 2011-06-24 15:40:30.619911 mds0 is_target=0 is_dentry=0
2011-06-24 15:40:30.661756 7f1c9a80a700 client4101 insert_trace -- no trace
2011-06-24 15:40:30.661773 7f1c9a80a700 client4101 unmounting: trim pass, size was 3+4
2011-06-24 15:40:30.661779 7f1c9a80a700 client4101 unmounting: trim pass, size still 3+4
2011-06-24 15:40:30.661792 7f1c9a80a700 client4101 dump_inode: inode 1 ref 3 dir 0xdc7150
2011-06-24 15:40:30.661797 7f1c9a80a700 client4101   dir size 2
2011-06-24 15:40:30.661808 7f1c9a80a700 client4101     dn readdir_test ref 1
2011-06-24 15:40:30.661813 7f1c9a80a700 client4101 dump_inode: inode 10000000005 ref 3 dir 0x7f1c94001120
2011-06-24 15:40:30.661817 7f1c9a80a700 client4101   dir size 1
2011-06-24 15:40:30.661821 7f1c9a80a700 client4101     dn opened_file_1 ref 0
2011-06-24 15:40:30.661826 7f1c9a80a700 client4101 dump_inode: inode 10000000006 ref 2 dir 0
2011-06-24 15:40:30.661833 7f1c9a80a700 client4101     dn barfile ref 0
2011-06-24 15:40:30.661838 7f1c9a80a700 client4101 dump_inode: inode 10000000004 ref 1 dir 0
2011-06-24 15:40:30.661871 7f1c9a80a700 client4101 dump_cache: inode 10000000005.head ref 3 dir 0x7f1c94001120 10000000005.head( cap_refs={} open={} ref=3 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780 parent=0xdc5800)
2011-06-24 15:40:30.661879 7f1c9a80a700 client4101   dir size 1
2011-06-24 15:40:30.661893 7f1c9a80a700 client4101 dump_cache: inode 10000000004.head ref 1 dir 0 10000000004.head( cap_refs={} open={1=0} ref=1 caps=pAsxLsXsFscr mode=100666 mtime=2011-06-24 15:40:30.560345 parent=0xdc5000)
2011-06-24 15:40:30.661905 7f1c9a80a700 client4101 dump_cache: inode 1.head ref 3 dir 0xdc7150 1.head( cap_refs={} open={} ref=3 caps=pAsLsXs mode=40755 mtime=2011-06-24 15:40:30.560345)
2011-06-24 15:40:30.661910 7f1c9a80a700 client4101   dir size 2
2011-06-24 15:40:30.661924 7f1c9a80a700 client4101 dump_cache: inode 10000000006.head ref 2 dir 0 10000000006.head( cap_refs={} open={1=0} ref=2 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780 parent=0x7f1c940017b0)
2011-06-24 15:40:30.662019 7f1c9e5e2720 client4101 _ll_drop_pins
2011-06-24 15:40:30.662038 7f1c9e5e2720 client4101 _ll_put 0xdc5220 1 1 -> 0
2011-06-24 15:40:30.662049 7f1c9e5e2720 client4101 trim_cache unlinking dn barfile in dir 1
2011-06-24 15:40:30.662070 7f1c9e5e2720 client4101 remove_cap mds0 on 10000000004.head( cap_refs={} open={1=0} ref=0 caps=pAsxLsXsFscr mode=100666 mtime=2011-06-24 15:40:30.560345)
2011-06-24 15:40:30.662086 7f1c9e5e2720 client4101 remove_cap last one, closing snaprealm 0xdc5140
2011-06-24 15:40:30.662113 7f1c9e5e2720 client4101 trim_cache unlinking dn opened_file_1 in dir 10000000005
2011-06-24 15:40:30.662128 7f1c9e5e2720 client4101  clearing I_COMPLETE on 10000000005.head( cap_refs={} open={} ref=3 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780 parent=0xdc5800)
2011-06-24 15:40:30.662145 7f1c9e5e2720 client4101 trim_cache unlinking dn readdir_test in dir 1
2011-06-24 15:40:30.662164 7f1c9e5e2720 client4101 remove_cap mds0 on 1.head( cap_refs={} open={} ref=0 caps=pAsLsXs mode=40755 mtime=2011-06-24 15:40:30.560345)
2011-06-24 15:40:30.662174 7f1c9e5e2720 client4101 remove_cap last one, closing snaprealm 0xdc5140
2011-06-24 15:40:30.662203 7f1c9e5e2720 client4101 _flush 10000000005.head( cap_refs={1024=0} open={} ref=2 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780)
2011-06-24 15:40:30.662209 7f1c9e5e2720 client4101  nothing to flush
2011-06-24 15:40:30.662223 7f1c9e5e2720 client4101 _flush 10000000006.head( cap_refs={1024=0} open={1=0} ref=2 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780)
2011-06-24 15:40:30.662228 7f1c9e5e2720 client4101  nothing to flush
2011-06-24 15:40:30.662235 7f1c9e5e2720 client4101 flush_caps
2011-06-24 15:40:30.662252 7f1c9e5e2720 client4101 check_caps on 10000000006.head( cap_refs={1024=0} open={1=0} ref=1 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780) wanted - used - is_delayed=1
2011-06-24 15:40:30.662263 7f1c9e5e2720 client4101  cap mds0 issued pAsxLsXsxFscr implemented pAsxLsXsxFscr revoking -
2011-06-24 15:40:30.662269 7f1c9e5e2720 client4101  flushing -
2011-06-24 15:40:30.662285 7f1c9e5e2720 client4101 send_cap 10000000006.head( cap_refs={1024=0} open={1=0} ref=1 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780) mds0 seq 2 used - want - flush - retain p held pAsxLsXsxFscr revoking - dropping AsxLsXsxFscr
2011-06-24 15:40:30.662449 7f1c9e5e2720 client4101 auth cap, setting max_size = 0
2011-06-24 15:40:30.662498 7f1c9e5e2720 client4101 check_caps on 10000000005.head( cap_refs={1024=0} open={} ref=1 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780) wanted - used - is_delayed=1
2011-06-24 15:40:30.662519 7f1c9e5e2720 client4101  cap mds0 issued pAsLsXs implemented pAsLsXs revoking -
2011-06-24 15:40:30.662531 7f1c9e5e2720 client4101  flushing -
2011-06-24 15:40:30.662558 7f1c9e5e2720 client4101 send_cap 10000000005.head( cap_refs={1024=0} open={} ref=1 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780) mds0 seq 10 used - want - flush - retain p held pAsLsXs revoking - dropping AsLsXs
2011-06-24 15:40:30.662574 7f1c9e5e2720 client4101 auth cap, setting max_size = 0
2011-06-24 15:40:30.662604 7f1c9e5e2720 client4101 wait_sync_caps want 0 (last is 0, 0 total flushing)
2011-06-24 15:40:30.662619 7f1c9e5e2720 client4101 cache still has 0+2 items, waiting (for caps to release?)     <====== CAPS ARE NEVER RELEASED?
2011-06-24 15:40:30.662643 7f1c9e5e2720 client4101 dump_cache: inode 10000000005.head ref 1 dir 0 10000000005.head( cap_refs={1024=0} open={} ref=1 caps=p mode=40777 mtime=2011-06-24 15:40:30.623780)
2011-06-24 15:40:30.662670 7f1c9e5e2720 client4101 dump_cache: inode 10000000006.head ref 1 dir 0 10000000006.head( cap_refs={1024=0} open={1=0} ref=1 caps=p mode=100666 mtime=2011-06-24 15:40:30.623780)

So in conclusion, it seems like some kind of MDS bug with caps never being released on an inode or something.

Associated revisions

Revision 19614bb8 (diff)
Added by Greg Farnum over 12 years ago

testceph: actually call closedir on opened directory.

Otherwise testceph hangs because it's still got things open!
Fixes #1228

Signed-off-by: Greg Farnum <>

History

#1 Updated by Sage Weil almost 13 years ago

  • Assignee set to Sage Weil
  • Target version set to v0.31

#2 Updated by Greg Farnum almost 13 years ago

  • Assignee changed from Sage Weil to Greg Farnum

steal

#3 Updated by Greg Farnum almost 13 years ago

  • Status changed from New to Resolved

Well, this was a testceph problem. Fixed by 19614bb868124fcb51308f62f7c0cfa13a31038c so it actually closes the dir it opens.

Also available in: Atom PDF