Actions
Bug #1228
closedmds bug: testceph hangs forever in Client::unmount
% Done:
0%
Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
Sometime between 0.29 and 0.28.1, testceph broke.
I tried to bisect it, but there were other problems in testceph that cropped up and were later fixed in this period, so bisecting is very difficult.
The line where it's hanging is this one in client/Client.cc:
... //if (0) {// hack while (lru.lru_get_size() > 0 || !inode_map.empty()) { ldout(cct, 2) << "cache still has " << lru.lru_get_size() << "+" << inode_map.size() << " items" << ", waiting (for caps to release?)" << dendl; dump_cache(); ==> mount_cond.Wait(client_lock); } assert(lru.lru_get_size() == 0); assert(inode_map.empty()); //} ...
To reproduce the bug, run this command:
CEPH_NUM_OSD=1 CEPH_NUM_MON=1 CEPH_NUM_MDS=1 ./vstart.sh -d -n && ./testceph
It will hang forever in Client::unmount.
If you run with --debug-client=20, you'll see this:
2011-06-24 15:40:30.661648 7f1c9e5e2720 client4101 telldir(0x7f1c94000c30) = 1152921504606846976 ceph_telldir: offset: from return code:1152921504606846976 2011-06-24 15:40:30.661657 7f1c9e5e2720 client4101 unmounting 2011-06-24 15:40:30.661663 7f1c9e5e2720 client4101 waiting on 1 requests 2011-06-24 15:40:30.661733 7f1c9a80a700 client4101 handle_client_reply got a reply. Safe:1 tid 31 2011-06-24 15:40:30.661750 7f1c9a80a700 client4101 insert_trace from 2011-06-24 15:40:30.619911 mds0 is_target=0 is_dentry=0 2011-06-24 15:40:30.661756 7f1c9a80a700 client4101 insert_trace -- no trace 2011-06-24 15:40:30.661773 7f1c9a80a700 client4101 unmounting: trim pass, size was 3+4 2011-06-24 15:40:30.661779 7f1c9a80a700 client4101 unmounting: trim pass, size still 3+4 2011-06-24 15:40:30.661792 7f1c9a80a700 client4101 dump_inode: inode 1 ref 3 dir 0xdc7150 2011-06-24 15:40:30.661797 7f1c9a80a700 client4101 dir size 2 2011-06-24 15:40:30.661808 7f1c9a80a700 client4101 dn readdir_test ref 1 2011-06-24 15:40:30.661813 7f1c9a80a700 client4101 dump_inode: inode 10000000005 ref 3 dir 0x7f1c94001120 2011-06-24 15:40:30.661817 7f1c9a80a700 client4101 dir size 1 2011-06-24 15:40:30.661821 7f1c9a80a700 client4101 dn opened_file_1 ref 0 2011-06-24 15:40:30.661826 7f1c9a80a700 client4101 dump_inode: inode 10000000006 ref 2 dir 0 2011-06-24 15:40:30.661833 7f1c9a80a700 client4101 dn barfile ref 0 2011-06-24 15:40:30.661838 7f1c9a80a700 client4101 dump_inode: inode 10000000004 ref 1 dir 0 2011-06-24 15:40:30.661871 7f1c9a80a700 client4101 dump_cache: inode 10000000005.head ref 3 dir 0x7f1c94001120 10000000005.head( cap_refs={} open={} ref=3 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780 parent=0xdc5800) 2011-06-24 15:40:30.661879 7f1c9a80a700 client4101 dir size 1 2011-06-24 15:40:30.661893 7f1c9a80a700 client4101 dump_cache: inode 10000000004.head ref 1 dir 0 10000000004.head( cap_refs={} open={1=0} ref=1 caps=pAsxLsXsFscr mode=100666 mtime=2011-06-24 15:40:30.560345 parent=0xdc5000) 2011-06-24 15:40:30.661905 7f1c9a80a700 client4101 dump_cache: inode 1.head ref 3 dir 0xdc7150 1.head( cap_refs={} open={} ref=3 caps=pAsLsXs mode=40755 mtime=2011-06-24 15:40:30.560345) 2011-06-24 15:40:30.661910 7f1c9a80a700 client4101 dir size 2 2011-06-24 15:40:30.661924 7f1c9a80a700 client4101 dump_cache: inode 10000000006.head ref 2 dir 0 10000000006.head( cap_refs={} open={1=0} ref=2 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780 parent=0x7f1c940017b0) 2011-06-24 15:40:30.662019 7f1c9e5e2720 client4101 _ll_drop_pins 2011-06-24 15:40:30.662038 7f1c9e5e2720 client4101 _ll_put 0xdc5220 1 1 -> 0 2011-06-24 15:40:30.662049 7f1c9e5e2720 client4101 trim_cache unlinking dn barfile in dir 1 2011-06-24 15:40:30.662070 7f1c9e5e2720 client4101 remove_cap mds0 on 10000000004.head( cap_refs={} open={1=0} ref=0 caps=pAsxLsXsFscr mode=100666 mtime=2011-06-24 15:40:30.560345) 2011-06-24 15:40:30.662086 7f1c9e5e2720 client4101 remove_cap last one, closing snaprealm 0xdc5140 2011-06-24 15:40:30.662113 7f1c9e5e2720 client4101 trim_cache unlinking dn opened_file_1 in dir 10000000005 2011-06-24 15:40:30.662128 7f1c9e5e2720 client4101 clearing I_COMPLETE on 10000000005.head( cap_refs={} open={} ref=3 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780 parent=0xdc5800) 2011-06-24 15:40:30.662145 7f1c9e5e2720 client4101 trim_cache unlinking dn readdir_test in dir 1 2011-06-24 15:40:30.662164 7f1c9e5e2720 client4101 remove_cap mds0 on 1.head( cap_refs={} open={} ref=0 caps=pAsLsXs mode=40755 mtime=2011-06-24 15:40:30.560345) 2011-06-24 15:40:30.662174 7f1c9e5e2720 client4101 remove_cap last one, closing snaprealm 0xdc5140 2011-06-24 15:40:30.662203 7f1c9e5e2720 client4101 _flush 10000000005.head( cap_refs={1024=0} open={} ref=2 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780) 2011-06-24 15:40:30.662209 7f1c9e5e2720 client4101 nothing to flush 2011-06-24 15:40:30.662223 7f1c9e5e2720 client4101 _flush 10000000006.head( cap_refs={1024=0} open={1=0} ref=2 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780) 2011-06-24 15:40:30.662228 7f1c9e5e2720 client4101 nothing to flush 2011-06-24 15:40:30.662235 7f1c9e5e2720 client4101 flush_caps 2011-06-24 15:40:30.662252 7f1c9e5e2720 client4101 check_caps on 10000000006.head( cap_refs={1024=0} open={1=0} ref=1 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780) wanted - used - is_delayed=1 2011-06-24 15:40:30.662263 7f1c9e5e2720 client4101 cap mds0 issued pAsxLsXsxFscr implemented pAsxLsXsxFscr revoking - 2011-06-24 15:40:30.662269 7f1c9e5e2720 client4101 flushing - 2011-06-24 15:40:30.662285 7f1c9e5e2720 client4101 send_cap 10000000006.head( cap_refs={1024=0} open={1=0} ref=1 caps=pAsxLsXsxFscr mode=100666 mtime=2011-06-24 15:40:30.623780) mds0 seq 2 used - want - flush - retain p held pAsxLsXsxFscr revoking - dropping AsxLsXsxFscr 2011-06-24 15:40:30.662449 7f1c9e5e2720 client4101 auth cap, setting max_size = 0 2011-06-24 15:40:30.662498 7f1c9e5e2720 client4101 check_caps on 10000000005.head( cap_refs={1024=0} open={} ref=1 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780) wanted - used - is_delayed=1 2011-06-24 15:40:30.662519 7f1c9e5e2720 client4101 cap mds0 issued pAsLsXs implemented pAsLsXs revoking - 2011-06-24 15:40:30.662531 7f1c9e5e2720 client4101 flushing - 2011-06-24 15:40:30.662558 7f1c9e5e2720 client4101 send_cap 10000000005.head( cap_refs={1024=0} open={} ref=1 caps=pAsLsXs mode=40777 mtime=2011-06-24 15:40:30.623780) mds0 seq 10 used - want - flush - retain p held pAsLsXs revoking - dropping AsLsXs 2011-06-24 15:40:30.662574 7f1c9e5e2720 client4101 auth cap, setting max_size = 0 2011-06-24 15:40:30.662604 7f1c9e5e2720 client4101 wait_sync_caps want 0 (last is 0, 0 total flushing) 2011-06-24 15:40:30.662619 7f1c9e5e2720 client4101 cache still has 0+2 items, waiting (for caps to release?) <====== CAPS ARE NEVER RELEASED? 2011-06-24 15:40:30.662643 7f1c9e5e2720 client4101 dump_cache: inode 10000000005.head ref 1 dir 0 10000000005.head( cap_refs={1024=0} open={} ref=1 caps=p mode=40777 mtime=2011-06-24 15:40:30.623780) 2011-06-24 15:40:30.662670 7f1c9e5e2720 client4101 dump_cache: inode 10000000006.head ref 1 dir 0 10000000006.head( cap_refs={1024=0} open={1=0} ref=1 caps=p mode=100666 mtime=2011-06-24 15:40:30.623780)
So in conclusion, it seems like some kind of MDS bug with caps never being released on an inode or something.
Actions