Project

General

Profile

Bug #15845

Kernel crash after unmounting CephFS mountpoint

Added by Burkhard Linke almost 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

We had a server crash several hours after unmounting a CephFS mountpoint.

OS: Ubuntu 14.04 with Xenial LTS backport kernel 4.4.0-18
Ceph: ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)

1. umount /srv/ceph

Error message in terminal: Segmentation fault
Kernel log:

May 11 10:46:52 waas kernel: [2328493.378160] VFS: Busy inodes after unmount of ceph. Self-destruct in 5 seconds. Have a nice day...
May 11 10:46:52 waas kernel: [2328493.378313] FS-Cache: Cookie 'CEPH.fsid' still has children
May 11 10:46:54 waas kernel: [2328493.415074] ------------[ cut here ]------------
May 11 10:46:54 waas kernel: [2328493.451388] kernel BUG at /build/linux-lts-xenial-WD7MTT/linux-lts-xenial-4.4.0/fs/fscache/cookie.c:524!
May 11 10:46:54 waas kernel: [2328493.525716] invalid opcode: 0000 [#1] SMP
May 11 10:46:54 waas kernel: [2328493.562257] Modules linked in: rbd ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ceph libceph cts 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 nf_conntrack libcrc32c ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretem
p kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul joydev aesni_intel rpcsec_gss_krb5 nfsd hpilo auth_rpcgss nfs_acl input_leds serio_raw nfs shpchp sb_edac lockd grace 8250_fintek edac_core mac_hid ioatdma sunrpc acpi_power_meter wmi ipmi_si aes_x86_64 lpc_ich lp l
rw gf128mul glue_helper ipmi_msghandler fscache ablk_helper cryptd parport btrfs xor hid_generic usbhid hid raid6_pq ixgbe dca psmouse vxlan tg3 ip6_udp_tunnel udp_tunnel ptp hpsa pps_core scsi_transport_sas mdio fjes
May 11 10:46:54 waas kernel: [2328493.944655] CPU: 10 PID: 15196 Comm: umount Not tainted 4.4.0-18-generic #34~14.04.1-Ubuntu
May 11 10:46:54 waas kernel: [2328494.022091] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 07/20/2015
May 11 10:46:54 waas kernel: [2328494.060144] task: ffff881fe9ee2940 ti: ffff880413ce0000 task.ti: ffff880413ce0000
May 11 10:46:54 waas kernel: [2328494.136513] RIP: 0010:[<ffffffffc0114d60>] [<ffffffffc0114d60>] __fscache_disable_cookie+0x1a0/0x1e0 [fscache]
May 11 10:46:54 waas kernel: [2328494.216464] RSP: 0018:ffff880413ce3dd0 EFLAGS: 00010296
May 11 10:46:54 waas kernel: [2328494.255946] RAX: 000000000000002f RBX: ffff881fab61c000 RCX: 00000000ffffffff
May 11 10:46:54 waas kernel: [2328494.334802] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
May 11 10:46:54 waas kernel: [2328494.413225] RBP: ffff880413ce3df8 R08: 0000000000000000 R09: ffff881fed375220
May 11 10:46:54 waas kernel: [2328494.491477] R10: 0000000000005000 R11: 00000000000015af R12: ffff881fab61c000
May 11 10:46:54 waas kernel: [2328494.572308] R13: 000000000000001d R14: ffff881fe9ee2940 R15: 00000000016e41b0
May 11 10:46:54 waas kernel: [2328494.651741] FS: 00007f1e81e0f880(0000) GS:ffff881fff480000(0000) knlGS:0000000000000000
May 11 10:46:54 waas kernel: [2328494.735018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 11 10:46:54 waas kernel: [2328494.777006] CR2: 00007f1e81439cf0 CR3: 0000001fd37e8000 CR4: 00000000001406e0
May 11 10:46:54 waas kernel: [2328494.859892] Stack:
May 11 10:46:54 waas kernel: [2328494.899193] ffff881fab61c000 ffff881f1a3eb000 000000000000001d ffff881fe9ee2940
May 11 10:46:54 waas kernel: [2328494.979408] 00000000016e41b0 ffff880413ce3e10 ffffffffc0115559 ffff881f1a3eb000
May 11 10:46:54 waas kernel: [2328495.058058] ffff880413ce3e28 ffffffffc06509f1 ffff881f1a3eb000 ffff880413ce3e40
May 11 10:46:54 waas kernel: [2328495.138089] Call Trace:
May 11 10:46:54 waas kernel: [2328495.176902] [<ffffffffc0115559>] __fscache_relinquish_cookie+0x39/0x140 [fscache]
May 11 10:46:54 waas kernel: [2328495.256600] [<ffffffffc06509f1>] ceph_fscache_unregister_fs+0x31/0x40 [ceph]
May 11 10:46:54 waas kernel: [2328495.335208] [<ffffffffc062a06e>] destroy_fs_client+0x1e/0xb0 [ceph]
May 11 10:46:54 waas kernel: [2328495.376211] [<ffffffffc062a147>] ceph_kill_sb+0x47/0x70 [ceph]
May 11 10:46:54 waas kernel: [2328495.415226] [<ffffffff811ff853>] deactivate_locked_super+0x43/0x70
May 11 10:46:54 waas kernel: [2328495.455126] [<ffffffff811ff8c6>] deactivate_super+0x46/0x60
May 11 10:46:54 waas kernel: [2328495.492709] [<ffffffff8121b93f>] cleanup_mnt+0x3f/0x80
May 11 10:46:54 waas kernel: [2328495.532423] [<ffffffff8121b9d2>] __cleanup_mnt+0x12/0x20
May 11 10:46:54 waas kernel: [2328495.569888] [<ffffffff81099ec7>] task_work_run+0x77/0x90
May 11 10:46:54 waas kernel: [2328495.606028] [<ffffffff81078806>] exit_to_usermode_loop+0x73/0xa2
May 11 10:46:54 waas kernel: [2328495.641044] [<ffffffff81003a5e>] syscall_return_slowpath+0x4e/0x60
May 11 10:46:54 waas kernel: [2328495.676108] [<ffffffff817eea58>] int_ret_from_sys_call+0x25/0x8f
May 11 10:46:54 waas kernel: [2328495.708901] Code: 0f ba 74 24 48 03 72 07 31 d2 e9 66 ff ff ff ba 01 00 00 00 e9 5c ff ff ff 48 8b 77 20 31 c0 48 c7 c7 18 b1 11 c0 e8 9f b8 06 c1 <0f> 0b 48 c7 c7 88 c2 11 c0 31 c0 e8 8f b8 06 c1 48 c7 c7 96 c2
May 11 10:46:54 waas kernel: [2328495.812838] RIP [<ffffffffc0114d60>] __fscache_disable_cookie+0x1a0/0x1e0 [fscache]
May 11 10:46:54 waas kernel: [2328495.878970] RSP <ffff880413ce3dd0>
May 11 10:46:54 waas kernel: [2328495.957741] ---[ end trace d46a6ec086d7f531 ]---

The mount point was used to backup files from CephFS on tape, and held several millions caps according to the MDS session status.

2. Re-mount /srv/ceph
Kernel log:
May 11 10:50:33 waas kernel: [2328714.423366] libceph: client2735548 fsid 49098879-85ac-4c5d-aac0-e1a2658a680b
May 11 10:50:33 waas kernel: [2328714.494295] libceph: mon1 192.168.6.133:6789 session established

After remounting the mount point was idle, with just a dozend capabilities held.

3. Two hours later: complete server freeze, soft lockup messages etc. Had to restart the machine.
Kernel log:
May 11 12:54:08 waas kernel: [2336127.416016] BUG: unable to handle kernel NULL pointer dereference at (null)
May 11 12:54:08 waas kernel: [2336127.475968] IP: [<ffffffffc064dafd>] ceph_mdsc_handle_map+0x6d/0x650 [ceph]
May 11 12:54:08 waas kernel: [2336127.506619] PGD 0
May 11 12:54:08 waas kernel: [2336127.537923] Oops: 0000 [#2] SMP
May 11 12:54:08 waas kernel: [2336127.568024] Modules linked in: rbd ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs ceph libceph cts 8021q garp mrp stp llc openvswitch nf_defrag_ipv6 nf_conntrack libcrc32c ipmi_ssif intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul joydev aesni_intel rpcsec_gss_krb5 nfsd hpilo auth_rpcgss nfs_acl input_leds serio_raw nfs shpchp sb_edac lockd grace 8250_fintek edac_core mac_hid ioatdma sunrpc acpi_power_meter wmi ipmi_si aes_x86_64 lpc_ich lp lrw gf128mul glue_helper ipmi_msghandler fscache ablk_helper cryptd parport btrfs xor hid_generic usbhid hid raid6_pq ixgbe dca psmouse vxlan tg3 ip6_udp_tunnel udp_tunnel ptp hpsa pps_core scsi_transport_sas mdio fjes
May 11 12:54:08 waas kernel: [2336127.898223] CPU: 3 PID: 9696 Comm: kworker/3:0 Tainted: G D 4.4.0-18-generic #34~14.04.1-Ubuntu
May 11 12:54:08 waas kernel: [2336127.971855] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 07/20/2015
May 11 12:54:08 waas kernel: [2336128.009882] Workqueue: ceph-msgr ceph_con_workfn [libceph]
May 11 12:54:08 waas kernel: [2336128.047523] task: ffff880103d644c0 ti: ffff880243a28000 task.ti: ffff880243a28000
May 11 12:54:08 waas kernel: [2336128.120833] RIP: 0010:[<ffffffffc064dafd>] [<ffffffffc064dafd>] ceph_mdsc_handle_map+0x6d/0x650 [ceph]
May 11 12:54:08 waas kernel: [2336128.197396] RSP: 0018:ffff880243a2bc18 EFLAGS: 00010282
May 11 12:54:08 waas kernel: [2336128.235299] RAX: 5d4cac8579880949 RBX: ffff881a55591e94 RCX: ffff881a55591c10
May 11 12:54:08 waas kernel: [2336128.311435] RDX: 0b688a65a2e1c0aa RSI: ffff880243a2bc70 RDI: 0000000000000000
May 11 12:54:08 waas kernel: [2336128.388958] RBP: ffff880243a2bcc0 R08: 000000000001a000 R09: ffffffff816d626a
May 11 12:54:08 waas kernel: [2336128.467017] R10: ffff88077dbde900 R11: ffffea00049ab200 R12: 0000000000000015
May 11 12:54:08 waas kernel: [2336128.546565] R13: ffff881eedcfe440 R14: 0000000000000000 R15: 0000000000000001
May 11 12:54:08 waas kernel: [2336128.625745] FS: 0000000000000000(0000) GS:ffff881fff2c0000(0000) knlGS:0000000000000000
May 11 12:54:08 waas kernel: [2336128.708279] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 11 12:54:08 waas kernel: [2336128.749349] CR2: 0000000000000000 CR3: 0000000001c0c000 CR4: 00000000001406e0
May 11 12:54:08 waas kernel: [2336128.828880] Stack:
May 11 12:54:08 waas kernel: [2336128.868506] 0000000000000000 0000000000000000 0000000000004040 0000000000000000
May 11 12:54:08 waas kernel: [2336128.945246] ffff880243a2bc70 ffffffffc05ddff0 ffff880126ac8800 ffff881eedcfe2d8
May 11 12:54:08 waas kernel: [2336129.022320] 0000000000000001 ffff881eedcfe2b8 ffff881a55591c10 5d4cac8579880949
May 11 12:54:08 waas kernel: [2336129.100669] Call Trace:
May 11 12:54:08 waas kernel: [2336129.138602] [<ffffffffc05ddff0>] ? read_partial.isra.25+0x50/0x70 [libceph]
May 11 12:54:08 waas kernel: [2336129.215472] [<ffffffffc05e0209>] ? read_partial_message+0x419/0x890 [libceph]
May 11 12:54:08 waas kernel: [2336129.293238] [<ffffffffc062a73d>] extra_mon_dispatch+0x1d/0x30 [ceph]
May 11 12:54:08 waas kernel: [2336129.332335] [<ffffffffc05e45ca>] dispatch+0x30a/0x720 [libceph]
May 11 12:54:08 waas kernel: [2336129.370846] [<ffffffffc05e0ac3>] try_read+0x443/0x1120 [libceph]
May 11 12:54:08 waas kernel: [2336129.410031] [<ffffffff81096510>] ? try_to_grab_pending+0xb0/0x150
May 11 12:54:08 waas kernel: [2336129.448015] [<ffffffff810b3175>] ? put_prev_entity+0x35/0x670
May 11 12:54:08 waas kernel: [2336129.485085] [<ffffffff8102c696>] ? __switch_to+0x1d6/0x550
May 11 12:54:08 waas kernel: [2336129.521170] [<ffffffffc05e1852>] ceph_con_workfn+0xb2/0x5d0 [libceph]
May 11 12:54:08 waas kernel: [2336129.556734] [<ffffffff8109598d>] process_one_work+0x14d/0x3f0
May 11 12:54:08 waas kernel: [2336129.591955] [<ffffffff8109610a>] worker_thread+0x11a/0x470
May 11 12:54:08 waas kernel: [2336129.626262] [<ffffffff817eaa49>] ? __schedule+0x359/0x970
May 11 12:54:08 waas kernel: [2336129.659578] [<ffffffff81095ff0>] ? rescuer_thread+0x310/0x310
May 11 12:54:08 waas kernel: [2336129.692627] [<ffffffff8109b842>] kthread+0xd2/0xf0
May 11 12:54:08 waas kernel: [2336129.725096] [<ffffffff8109b770>] ? kthread_park+0x50/0x50
May 11 12:54:08 waas kernel: [2336129.756811] [<ffffffff817eec8f>] ret_from_fork+0x3f/0x70
May 11 12:54:08 waas kernel: [2336129.787102] [<ffffffff8109b770>] ? kthread_park+0x50/0x50
May 11 12:54:08 waas kernel: [2336129.817415] Code: d8 48 29 c8 48 83 f8 17 0f 86 b0 00 00 00 48 8b 01 48 8b 51 08 49 89 fe 48 8d 74 24 58 48 83 c1 10 48 89 4c 24 50 48 89 44 24 58 <48> 8b 07 48 89 54 24 60 48 8b 78 10 e8 f2 d6 f8 ff 85 c0 0f 88
May 11 12:54:08 waas kernel: [2336129.910566] RIP [<ffffffffc064dafd>] ceph_mdsc_handle_map+0x6d/0x650 [ceph]
May 11 12:54:08 waas kernel: [2336129.973138] RSP <ffff880243a2bc18>
May 11 12:54:08 waas kernel: [2336130.002901] CR2: 0000000000000000
May 11 12:54:08 waas kernel: [2336130.077904] ---[ end trace d46a6ec086d7f532 ]---

The only messages in the mon log at the time is a dump of the mds map:

2016-05-11 12:54:05.659616 7f97792f1700 0 mon.ceph-storage-03@0(leader).mds e187276 print_map
epoch 187276
flags 0
created 2014-10-22 19:19:45.462330
modified 2016-05-11 12:54:05.648986
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 186368
last_failure_osd_epoch 352275
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds 1
in 0
up {0=2718471}
failed
stopped
data_pools 7,12
metadata_pool 8
inline_data disabled
2718471: 192.168.6.129:6801/12700 'ceph-storage-01' mds.0.111 up:active seq 253459

History

#1 Updated by Greg Farnum almost 8 years ago

  • Assignee set to Zheng Yan

Given those first two lines...

May 11 10:46:52 waas kernel: [2328493.378160] VFS: Busy inodes after unmount of ceph. Self-destruct in 5 seconds. Have a nice day...
May 11 10:46:52 waas kernel: [2328493.378313] FS-Cache: Cookie 'CEPH.fsid' still has children

...I tend to think it took too long to release all its caps, and something went wrong?

#2 Updated by Burkhard Linke almost 8 years ago

This is the current situation after the crash two days ago and one backup run (ceph daemon mds.X session ls entry for the server):

{
"id": 2755773,
"num_leases": 0,
"num_caps": 2577959,
"state": "open",
"replay_requests": 0,
"reconnecting": false,
"inst": "client.2755773 134.176.31.198:0\/1873999028",
"client_metadata": {
"entity_id": "admin",
"hostname": "waas",
"kernel_version": "4.4.0-22-generic"
}
},

The backup software running on that machine is Tivoli Client Version 7, Release 1, Level 1.0. Since that server is also our main home directory NFS server, I hesitate to try to reproduce the problem.

#3 Updated by Zheng Yan almost 8 years ago

  • Status changed from New to 7

BUG at fs/fscache/cookie.c:524 is caused by inode reference leak, The leak is likely fixed by https://github.com/ceph/ceph-client/commit/1a0c9829aa297ae3e62f835a9043e8fe028520cd

#4 Updated by Loïc Dachary almost 8 years ago

  • Target version deleted (v0.94.7)

#5 Updated by Zheng Yan over 7 years ago

  • Status changed from 7 to Resolved

Also available in: Atom PDF