Project

General

Profile

Bug #3271

kernel BUG() on unmount after export/unexport of ceph using kclient

Added by David Zafman over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

plana78:
$ sudo exportfs '*:/tmp/cephtest/mnt.1' -o fsid=1234,rw,no_root_squash

NFS client
$ sudo mount plana78:/mnt/cephtest/mnt.1 /mnt/nfs
$ sudo touch /mnt/nfs/nfs.file
$ sudo umount /mnt/nfs

Plana78 running ceph kclient
$ sudo exportfs -u /mnt/cephtest/mnt.1

Teuthology
Ceph test interactive mode, use ctx to interact with the cluster, press control-D to exit...

^D

INFO:teuthology.task.kclient:Unmounting kernel clients...
( ssh ubuntu@plana78... sudo umount /mnt/cephtest/mnt.1 )

<3>[13121.315683] BUG: Dentry ffff8802212c1740{i=10000000001,n=nfs.file} still in use (1) [unmount of ceph ceph]
[5]kdb> bt
Stack traceback for pid 5102
0xffff88020cf1de20 5102 5101 1 5 R 0xffff88020cf1e280 *umount
ffff880223ba9dc8 0000000000000018 ffffffff81192869 ffff8802231d1e70
000000007f319460 ffff8802231d1800 ffffffffa055fd80 ffff880223ba9e08
ffffffff81194a63 ffff8802231d1000 ffff8802231d1800 ffff880223ba9e38
Call Trace:
[<ffffffff81192869>] ? shrink_dcache_for_umount_subtree+0x1e9/0x1f0
[<ffffffff81194a63>] ? shrink_dcache_for_umount+0x33/0x60
[<ffffffff8117e06c>] ? generic_shutdown_super+0x2c/0xf0
[<ffffffff8117e1c6>] ? kill_anon_super+0x16/0x30
[<ffffffffa053e189>] ? ceph_kill_sb+0x39/0x70 [ceph]
[<ffffffff8117e435>] ? deactivate_locked_super+0x45/0x70
[<ffffffff8117f0de>] ? deactivate_super+0x4e/0x70
[<ffffffff8119b4b6>] ? mntput_no_expire+0x106/0x160
[<ffffffff8119c4ee>] ? sys_umount+0x6e/0x3b0
[<ffffffff8163e1e9>] ? system_call_fastpath+0x16/0x1b

History

#1 Updated by David Zafman over 9 years ago

  • Description updated (diff)

By default our kernels don't drop into kdb. Here is the logging before and after the BUG after executing ceph/src/script/kcon_most.sh

stat("/sbin/umount.ceph", 0x7fffca468e60) = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_BLOCK, ~[TRAP SEGV RTMIN RT_1], NULL, 8) = 0
umount("/tmp/cephtest/mnt.1", 0 <unfinished ...>
++ killed by SIGSEGV ++

Oct 5 17:20:38 plana11 kernel: [ 630.794749] libceph: ===== ffff88020cb01a00 20 from mds0 22=client_session len 28+0 (3104888343 0 0) =====
Oct 5 17:20:38 plana11 kernel: [ 630.853735] ceph: handle_session mds0 renewcaps ffff88020d847000 state open seq 13
Oct 5 17:20:38 plana11 kernel: [ 630.912528] ceph: renewed_caps mds0 ttl now 4295006413, was fresh, now stale
Oct 5 17:20:38 plana11 kernel: [ 630.946767] ceph: mdsc con_put ffff88020d847000 (2)
Oct 5 17:20:38 plana11 kernel: [ 630.978588] ceph: mdsc put_session ffff88020d847000 3 > 2
Oct 5 17:20:42 plana11 kernel: [ 634.762142] ceph: do_getattr inode ffff88021fa88460 mask pAsLsXsFs mode 040755
Oct 5 17:20:42 plana11 kernel: [ 634.825160] ceph: __ceph_caps_issued_mask ffff88021fa88460 cap ffff88021fa99080 issued pAsLsXsFs (mask pAsLsXsFs)
Oct 5 17:20:42 plana11 kernel: [ 634.894368] ceph: __touch_cap ffff88021fa88460 cap ffff88021fa99080 mds0
Oct 5 17:20:44 plana11 kernel: [ 636.071184] ceph: mdsc delayed_work
Oct 5 17:20:44 plana11 kernel: [ 636.104677] ceph: check_delayed_caps
Oct 5 17:20:44 plana11 kernel: [ 636.137481] ceph: lookup_mds_session ffff88020d847000 2
Oct 5 17:20:44 plana11 kernel: [ 636.171870] ceph: mdsc get_session ffff88020d847000 2 -> 3
Oct 5 17:20:44 plana11 kernel: [ 636.206585] ceph: mdsc get_session ffff88020d847000 3 -> 4
Oct 5 17:20:44 plana11 kernel: [ 636.240456] ceph: mdsc con_get ffff88020d847000 ok (4)
Oct 5 17:20:44 plana11 kernel: [ 636.273660] ceph: add_cap_releases ffff88020d847000 mds0 extra 680
Oct 5 17:20:44 plana11 kernel: [ 636.308342] ceph: send_cap_releases mds0
Oct 5 17:20:44 plana11 kernel: [ 636.340639] ceph: mdsc put_session ffff88020d847000 4 -> 3
Oct 5 17:20:44 plana11 kernel: [ 636.374945] ceph: mdsc con_put ffff88020d847000 (2)
Oct 5 17:20:44 plana11 kernel: [ 636.408469] ceph: mdsc put_session ffff88020d847000 3 -> 2
Oct 5 17:20:46 plana11 kernel: [ 638.287324] libceph: monc delayed_work
Oct 5 17:20:46 plana11 kernel: [ 638.319846] libceph: ceph_x_should_authenticate want=39 need=0 have=39
Oct 5 17:20:46 plana11 kernel: [ 638.356101] libceph: ceph_x_is_authenticated want=39 need=0 have=39
Oct 5 17:20:46 plana11 kernel: [ 638.392755] libceph: __send_subscribe sub_sent=0 exp=0 want_osd=0
Oct 5 17:20:46 plana11 kernel: [ 638.429454] libceph: __schedule_delayed after 2000
Oct 5 17:20:50 plana11 kernel: [ 642.060757] ceph: mdsc delayed_work
Oct 5 17:20:50 plana11 kernel: [ 642.094991] ceph: check_delayed_caps
Oct 5 17:20:50 plana11 kernel: [ 642.128609] ceph: lookup_mds_session ffff88020d847000 2
Oct 5 17:20:50 plana11 kernel: [ 642.163584] ceph: mdsc get_session ffff88020d847000 2 -> 3
Oct 5 17:20:50 plana11 kernel: [ 642.198226] ceph: mdsc get_session ffff88020d847000 3 -> 4
Oct 5 17:20:50 plana11 kernel: [ 642.231615] ceph: mdsc con_get ffff88020d847000 ok (4)
Oct 5 17:20:50 plana11 kernel: [ 642.263963] ceph: add_cap_releases ffff88020d847000 mds0 extra 680
Oct 5 17:20:50 plana11 kernel: [ 642.297436] ceph: send_cap_releases mds0
Oct 5 17:20:50 plana11 kernel: [ 642.328459] ceph: mdsc put_session ffff88020d847000 4 -> 3
Oct 5 17:20:50 plana11 kernel: [ 642.361196] ceph: mdsc con_put ffff88020d847000 (2)
Oct 5 17:20:50 plana11 kernel: [ 642.392497] ceph: mdsc put_session ffff88020d847000 3 -> 2
Oct 5 17:20:51 plana11 kernel: [ 642.999131] libceph: osds timeout
Oct 5 17:20:51 plana11 kernel: [ 643.027306] libceph: __remove_old_osds ffff88021233d950
Oct 5 17:20:53 plana11 kernel: [ 645.239420] ceph: kill_sb ffff88020c695800
Oct 5 17:20:53 plana11 kernel: [ 645.267727] ceph: pre_umount
Oct 5 17:20:53 plana11 kernel: [ 645.294606] ceph: drop_leases
Oct 5 17:20:53 plana11 kernel: [ 645.321116] ceph: lookup_mds_session ffff88020d847000 2
Oct 5 17:20:53 plana11 kernel: [ 645.350421] ceph: mdsc get_session ffff88020d847000 2 -> 3
Oct 5 17:20:53 plana11 kernel: [ 645.380611] ceph: mdsc put_session ffff88020d847000 3 -> 2
Oct 5 17:20:53 plana11 kernel: [ 645.410222] ceph: flush_dirty_caps
Oct 5 17:20:53 plana11 kernel: [ 645.437095] ceph: flush_dirty_caps done
Oct 5 17:20:53 plana11 kernel: [ 645.463677] ceph: wait_requests done
Oct 5 17:20:53 plana11 kernel: [ 645.489695] BUG: Dentry ffff88020c293078{i=10000000000,n=nfs.file} still in use (1) [unmount of ceph ceph]
Oct 5 17:20:53 plana11 kernel: [ 645.546541] -----------
[ cut here ]------------
Oct 5 17:20:53 plana11 kernel: [ 645.575040] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/fs/dcache.c:967!
Oct 5 17:20:53 plana11 kernel: [ 645.631391] invalid opcode: 0000 [#1] SMP
Oct 5 17:20:53 plana11 kernel: [ 645.660002] Modules linked in: ceph libceph coretemp kvm_intel kvm nfsd nfs_acl auth_rpcgss exportfs nfs ghash_clmulni_intel fscache aesni_intel lockd ablk_helper cryptd aes_x86_64 aes_generic sunrpc psmouse i7core_edac microcode edac_core joydev hid_generic dcdbas serio_raw lpc_ich mfd_core hed lp parport usbhid hid mptsas mptscsih ixgbe mptbase dca scsi_transport_sas bnx2 mdio btrfs zlib_deflate crc32c_intel libcrc32c
Oct 5 17:20:53 plana11 kernel: [ 645.862090] CPU 3
Oct 5 17:20:53 plana11 kernel: [ 645.864407] Pid: 2090, comm: umount Not tainted 3.6.0-ceph-00071-g6285bc2 #1 Dell Inc. PowerEdge R410/01V648
Oct 5 17:20:54 plana11 kernel: [ 645.961109] RIP: 0010:[<ffffffff81192869>] [<ffffffff81192869>] shrink_dcache_for_umount_subtree+0x1e9/0x1f0
Oct 5 17:20:54 plana11 kernel: [ 646.032471] RSP: 0018:ffff88020c8abdc8 EFLAGS: 00010296
Oct 5 17:20:54 plana11 kernel: [ 646.069305] RAX: 000000000000005e RBX: ffff88020c293078 RCX: 0000000000000006
Oct 5 17:20:54 plana11 kernel: [ 646.108519] RDX: 0000000000000040 RSI: ffff880221e406b0 RDI: 0000000000000246
Oct 5 17:20:54 plana11 kernel: [ 646.147014] RBP: ffff88020c8abde8 R08: 0000000000000002 R09: 0000000000000000
Oct 5 17:20:54 plana11 kernel: [ 646.185353] R10: 0000000000000648 R11: 0000000000000001 R12: ffffffffa091cd80
Oct 5 17:20:54 plana11 kernel: [ 646.223896] R13: ffff88020c51bc20 R14: ffff88020c51bc20 R15: ffff88020c695800
Oct 5 17:20:54 plana11 kernel: [ 646.263069] FS: 00007f9e14248800(0000) GS:ffff880227260000(0000) knlGS:0000000000000000
Oct 5 17:20:54 plana11 kernel: [ 646.335499] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct 5 17:20:54 plana11 kernel: [ 646.375097] CR2: 00007f9e13864120 CR3: 0000000219482000 CR4: 00000000000007e0
Oct 5 17:20:54 plana11 kernel: [ 646.416735] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 5 17:20:54 plana11 kernel: [ 646.458788] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct 5 17:20:54 plana11 kernel: [ 646.500609] Process umount (pid: 2090, threadinfo ffff88020c8aa000, task ffff880221e40000)
Oct 5 17:20:54 plana11 kernel: [ 646.576629] Stack:
Oct 5 17:20:54 plana11 kernel: [ 646.611646] ffff88020c695e70 00000000023700cf ffff88020c695800 ffffffffa091cd80
Oct 5 17:20:54 plana11 kernel: [ 646.685587] ffff88020c8abe08 ffffffff81194a63 ffff88020d847000 ffff88020c695800
Oct 5 17:20:54 plana11 kernel: [ 646.761190] ffff88020c8abe38 ffffffff8117e06c ffff88020c8abe58 0000000000000015
Oct 5 17:20:54 plana11 kernel: [ 646.837149] Call Trace:
Oct 5 17:20:54 plana11 kernel: [ 646.872567] [<ffffffff81194a63>] shrink_dcache_for_umount+0x33/0x60
Oct 5 17:20:54 plana11 kernel: [ 646.912482] [<ffffffff8117e06c>] generic_shutdown_super+0x2c/0xf0
Oct 5 17:20:54 plana11 kernel: [ 646.951593] [<ffffffff8117e1c6>] kill_anon_super+0x16/0x30
Oct 5 17:20:54 plana11 kernel: [ 646.990232] [<ffffffffa08fb189>] ceph_kill_sb+0x39/0x70 [ceph]
Oct 5 17:20:55 plana11 kernel: [ 647.028815] [<ffffffff8117e435>] deactivate_locked_super+0x45/0x70
Oct 5 17:20:55 plana11 kernel: [ 647.067338] [<ffffffff8117f0de>] deactivate_super+0x4e/0x70
Oct 5 17:20:55 plana11 kernel: [ 647.104456] [<ffffffff8119b4b6>] mntput_no_expire+0x106/0x160
Oct 5 17:20:55 plana11 kernel: [ 647.140865] [<ffffffff8119c4ee>] sys_umount+0x6e/0x3b0
Oct 5 17:20:55 plana11 kernel: [ 647.176156] [<ffffffff8163e410>] tracesys+0xdd/0xe2
Oct 5 17:20:55 plana11 kernel: [ 647.210562] Code: 00 00 48 8b 40 28 4c 8b 08 48 8b 43 30 48 85 c0 74 04 48 8b 50 40 48 89 34 24 48 c7 c7 08 22 a0 81 48 89 de 31 c0 e8 a2 7d 49 00 <0f> 0b 0f 0b 0f 1f 00 55 48 89 e5 48 83 ec 50 48 89 5d d8 4c 89
Oct 5 17:20:55 plana11 kernel: [ 647.324377] RIP [<ffffffff81192869>] shrink_dcache_for_umount_subtree+0x1e9/0x1f0
Oct 5 17:20:55 plana11 kernel: [ 647.395026] RSP <ffff88020c8abdc8>
Oct 5 17:20:55 plana11 kernel: [ 647.475850] ---[ end trace 8eef80acef8a604e ]---

#2 Updated by David Zafman over 9 years ago

I've pushed code to Teuthology repo implementing Feature #3245. So now this can be reproduced using the following YAML file:

roles:
- [mon.0, osd.0, osd.1, mds.0]
- [client.0]
- [client.1]

tasks:
- ceph:
- kcon_most:
- kclient: [client.0]
- knfsd: [client.0]
- nfs:
client.1:
server: client.0
options: [rw,hard,intr]
- workunit:
clients:
client.1:
- misc/trivial_sync.sh

INFO:teuthology.task.workunit:Running workunits matching misc/trivial_sync.sh on client.1...
INFO:teuthology.task.workunit:Running workunit misc/trivial_sync.sh...
<type 'exceptions.StopIteration'>
INFO:teuthology.task.nfs:Unmounting nfs clients...
INFO:teuthology.task.knfsd:Unexporting nfs server...
INFO:teuthology.task.kclient:Unmounting kernel clients...
ERROR:teuthology.run_tasks:Manager failed: <contextlib.GeneratorContextManager object at 0x2fa2490>
Traceback (most recent call last):
File "/home/dzafman/teuthology/teuthology/run_tasks.py", line 45, in run_tasks
suppress = manager.__exit__(*exc_info)
File "/usr/lib/python2.7/contextlib.py", line 24, in exit
self.gen.next()
File "/home/dzafman/teuthology/teuthology/task/kclient.py", line 92, in task
mnt,
File "/home/dzafman/teuthology/teuthology/orchestra/remote.py", line 40, in run
r = self._runner(client=self.ssh, **kwargs)
File "/home/dzafman/teuthology/teuthology/orchestra/run.py", line 257, in run
r.exitstatus = _check_status(r.exitstatus)
File "/home/dzafman/teuthology/teuthology/orchestra/run.py", line 253, in _check_status
raise CommandFailedError(command=r.command, exitstatus=status)
CommandFailedError: Command failed with status 139: 'sudo umount /tmp/cephtest/mnt.0'

$ grep BUG /var/log/syslog
Oct 11 10:52:40 plana04 kernel: [149894.376868] BUG: Dentry ffff88021fb37268{i=10000000002,n=foo} still in use (2) [unmount of ceph ceph]
Oct 11 10:52:40 plana04 kernel: [149895.153697] kernel BUG at /srv/autobuild-ceph/gitbuilder.git/build/fs/dcache.c:967!

#3 Updated by Sage Weil over 9 years ago

  • Status changed from New to Resolved

pushed to testing branch

Also available in: Atom PDF