Bug #18474
oops in __unregister_request
Status: Closed
Description
I left an xfstests run going overnight, and when I came back I saw this oops on the console:
[14231.902682] run fstests generic/205 at 2017-01-09 23:11:24
[14339.010567] ceph: dropping unsafe request 18446621567284867072
[14339.011516] ------------[ cut here ]------------
[14339.012188] kernel BUG at fs/ceph/mds_client.c:576!
[14339.012816] invalid opcode: 0000 [#1] SMP
[14339.013338] Modules linked in: loop ceph(OE) libceph fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 iptable_raw iptable_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables nfsd auth_rpcgss crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nfs_acl lockd ppdev joydev acpi_cpufreq parport_pc tpm_tis pcspkr parport i2c_piix4 tpm_tis_core tpm virtio_balloon qemu_fw_cfg grace sunrpc xfs libcrc32c virtio_blk virtio_net virtio_console qxl drm_kms_helper ata_generic ttm serio_raw crc32c_intel drm pata_acpi virtio_pci virtio_ring virtio floppy
[14339.026549] CPU: 1 PID: 29659 Comm: kworker/1:2 Tainted: G OE 4.9.0+ #13
[14339.027572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[14339.029143] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[14339.029916] task: ffff9094ea818000 task.stack: ffffb500420b4000
[14339.030676] RIP: 0010:[<ffffffffc05a41bd>]  [<ffffffffc05a41bd>] __unregister_request+0x1ad/0x1b0 [ceph]
[14339.032275] RSP: 0018:ffffb500420b7c08  EFLAGS: 00010246
[14339.033134] RAX: 0000000000000000 RBX: ffff9094c012f400 RCX: 0000000000000000
[14339.034282] RDX: 0000000000000000 RSI: ffff9094efb87100 RDI: ffff9094efb87000
[14339.035108] RBP: ffffb500420b7c20 R08: 0000000000000000 R09: 0000000000000000
[14339.036325] R10: 0000000000000c00 R11: 0000000000000479 R12: ffff9094efb87000
[14339.037455] R13: ffff9094c012f408 R14: ffff9094efd00db8 R15: ffff9094c012f738
[14339.038532] FS:  0000000000000000(0000) GS:ffff9094ffc80000(0000) knlGS:0000000000000000
[14339.039827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14339.040674] CR2: 0000555bf06f56b8 CR3: 00000001311a5000 CR4: 00000000000406e0
[14339.041755] Stack:
[14339.042082]  ffff9094efd00800 ffff9094c012f400 ffff9094efb87000 ffffb500420b7c60
[14339.043306]  ffffffffc05a4247 ffff9094efb87008 ffff9094efb87000 0000000000000003
[14339.044514]  ffff9094eee4af00 ffff9094efd00800 ffff9094eeea3420 ffffb500420b7d18
[14339.045718] Call Trace:
[14339.046109]  [<ffffffffc05a4247>] cleanup_session_requests+0x87/0x130 [ceph]
[14339.047199]  [<ffffffffc05a78ee>] dispatch+0x69e/0x16d0 [ceph]
[14339.048059]  [<ffffffffc052920b>] ceph_con_workfn+0x5fb/0x2ef0 [libceph]
[14339.049099]  [<ffffffffad0daa2c>] ? dequeue_task_fair+0x5ec/0x940
[14339.050035]  [<ffffffffad05cf4e>] ? kvm_sched_clock_read+0x1e/0x30
[14339.051029]  [<ffffffffad02fc49>] ? sched_clock+0x9/0x10
[14339.051807]  [<ffffffffad025728>] ? __switch_to+0x2a8/0x5c0
[14339.052652]  [<ffffffffad0bbe54>] process_one_work+0x184/0x430
[14339.053572]  [<ffffffffad0bc14e>] worker_thread+0x4e/0x490
[14339.054364]  [<ffffffffad0bc100>] ? process_one_work+0x430/0x430
[14339.055220]  [<ffffffffad0bc100>] ? process_one_work+0x430/0x430
[14339.056336]  [<ffffffffad0c1b59>] kthread+0xd9/0xf0
[14339.057270]  [<ffffffffad0c1a80>] ? kthread_park+0x60/0x60
[14339.058275]  [<ffffffffad0c1a80>] ? kthread_park+0x60/0x60
[14339.059293]  [<ffffffffad80cb15>] ret_from_fork+0x25/0x30
[14339.060293] Code: 11 00 00 74 e6 48 8b 40 f8 49 89 84 24 f8 00 00 00 e9 96 fe ff ff 4c 8b a3 88 00 00 00 4d 85 e4 0f 85 56 ff ff ff e9 cb fe ff ff <0f> 0b 90 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54 49
[14339.064785] RIP  [<ffffffffc05a41bd>] __unregister_request+0x1ad/0x1b0 [ceph]
[14339.065884]  RSP <ffffb500420b7c08>
[14339.067866] ---[ end trace 8a0bb333836080c5 ]---
...the debugger says:
(gdb) list *(__unregister_request+0x1ad)
0x231ed is in __unregister_request (fs/ceph/mds_client.c:576).
571		put_request_session(req);
572		ceph_unreserve_caps(req->r_mdsc, &req->r_caps_reservation);
573		kfree(req);
574	}
575
576	DEFINE_RB_FUNCS(request, struct ceph_mds_request, r_tid, r_node)
577
578	/*
579	 * lookup session, bump ref if found.
580	 *
Which may be the BUG_ON in DEFINE_RB_INSDEL_FUNCS:
BUG_ON(!RB_EMPTY_NODE(&t->nodefld));
...unfortunately I didn't get a vmcore, so it's hard to know for sure.
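For reference, RB_EMPTY_NODE() doesn't test for an all-zero node: an unlinked rb_node is marked by pointing its parent field at itself. Here's a minimal user-space mock of that convention (the struct layout and macros are simplified stand-ins for include/linux/rbtree.h, not the real kernel code, which also packs the color bit into that field):

```c
#include <assert.h>

/* Simplified stand-in for the kernel's rb_node "empty" convention:
 * a node that belongs to no tree has its parent pointer aimed at
 * itself (see RB_CLEAR_NODE/RB_EMPTY_NODE in include/linux/rbtree.h). */
struct rb_node {
	unsigned long __rb_parent_color;
};

#define RB_CLEAR_NODE(n) ((n)->__rb_parent_color = (unsigned long)(n))
#define RB_EMPTY_NODE(n) ((n)->__rb_parent_color == (unsigned long)(n))

/* A zeroed node (fresh kzalloc, or an object something scribbled
 * zeros over) does NOT read as empty -- its parent field is 0, not a
 * self-pointer -- so BUG_ON(!RB_EMPTY_NODE(...)) fires on it. */
int zeroed_node_looks_linked(void)
{
	struct rb_node n = { 0 };
	return !RB_EMPTY_NODE(&n);
}

int cleared_node_is_empty(void)
{
	struct rb_node n;
	RB_CLEAR_NODE(&n);
	return RB_EMPTY_NODE(&n);
}
```

So a request that gets zeroed or reallocated out from under us while still registered would "look linked" to the rb-tree helpers and trip exactly this BUG_ON.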
Updated by Jeff Layton over 7 years ago
On the next attempt:
$ sudo ./check generic/205
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 cephfs 4.9.0+
MKFS_OPTIONS  -- 192.168.1.149:/scratch
MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 192.168.1.149:/scratch /mnt/scratch

generic/205	 [not run] Reflink not supported by scratch filesystem type: ceph
Not run: generic/205
Passed all 0 tests
...with no crash (and generic/205 was skipped this time, since ceph doesn't support reflink). So perhaps the original oops occurred during the mount or the unmount? The message about a dropped unsafe request may be a hint there.
Updated by Jeff Layton about 7 years ago
Hit a very similar crash in testing today. This time it crashed while doing the list_del_init in cleanup_session_requests(). Almost the same spot, but not exactly. I think we may have a memory scribble.
I'll plan to add some strategic BUG_ONs -- see if we're maybe freeing the reqs while they're still on s_unsafe? I'll also see if I can set up kdump and get a vmcore. That might tell us a bit more about how we get a zeroed out object here.
It might also not hurt to make a dedicated slabcache for ceph_mds_request objects. That way we could do some memory poisoning of just those objects too.
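To illustrate what poisoning buys us here: SLAB_POISON fills freed objects with a known byte (0x6b, POISON_FREE) and verifies the pattern is still intact on the next allocation, so a stale write into a freed request is caught at alloc time instead of silently corrupting a neighbour. A toy user-space allocator showing the idea (illustrative only -- the cache struct and function names are made up, not the kernel allocator):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POISON_FREE 0x6b  /* same byte the kernel uses to poison freed objects */

/* Toy single-slot "cache": free poisons the object, alloc checks it. */
struct obj_cache {
	size_t size;
	void *freed;
};

void *cache_alloc(struct obj_cache *c)
{
	if (c->freed) {
		unsigned char *p = c->freed;
		/* poison check: did anyone scribble on this after free? */
		for (size_t i = 0; i < c->size; i++)
			if (p[i] != POISON_FREE)
				return NULL;  /* caught a use-after-free */
		c->freed = NULL;
		memset(p, 0, c->size);
		return p;
	}
	return calloc(1, c->size);
}

void cache_free(struct obj_cache *c, void *p)
{
	memset(p, POISON_FREE, c->size);
	c->freed = p;
}

/* A write into a freed object breaks the poison pattern and is
 * detected on the next allocation. */
int detects_scribble(void)
{
	struct obj_cache c = { .size = 32, .freed = NULL };
	unsigned char *p = cache_alloc(&c);
	cache_free(&c, p);
	p[0] = 0xff;                     /* use-after-free scribble */
	return cache_alloc(&c) == NULL;  /* poison check catches it */
}

/* Clean free/alloc cycles pass the check and reuse the object. */
int clean_reuse(void)
{
	struct obj_cache c = { .size = 32, .freed = NULL };
	void *p = cache_alloc(&c);
	cache_free(&c, p);
	return cache_alloc(&c) == p;
}
```

With a dedicated kmem_cache for ceph_mds_request, slub_debug poisoning could be enabled for just those objects without slowing down every other slab in the system.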
Updated by Zheng Yan about 7 years ago
If op == CEPH_SESSION_CLOSE, handle_session() unregisters the session. Maybe it's a use-after-free bug (I didn't check carefully).
Updated by Jeff Layton about 7 years ago
Yeah, could be a use-after-free, or maybe a refcounting imbalance in the session handling? I kicked off another xfstests run and it didn't occur, so I'm guessing it may be a race of some sort.
xfstests does do a bunch of mounting and unmounting in quick succession and this oops is occurring in the session teardown code, so maybe it's related to that somehow?
Updated by Jeff Layton about 7 years ago
I added this just before the kfree(req) in ceph_mdsc_release_request:
BUG_ON(!list_empty(&req->r_unsafe_item));
...and kicked off an xfstests run before bedtime last night. This was on the console this morning:
[43351.124014] run fstests generic/205 at 2017-02-06 22:52:49
[43415.014713] ------------[ cut here ]------------
[43415.015243] kernel BUG at fs/ceph/mds_client.c:587!
[43415.015759] invalid opcode: 0000 [#1] SMP
[43415.016190] Modules linked in: ceph(OE) libceph(OE) loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_security ebtable_filter ebtables ip6table_filter ip6_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev joydev acpi_cpufreq i2c_piix4 parport_pc pcspkr nfsd tpm_tis parport virtio_balloon tpm_tis_core qemu_fw_cfg tpm auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c qxl virtio_console drm_kms_helper virtio_blk virtio_net ttm floppy drm ata_generic virtio_pci
[43415.023469]  crc32c_intel virtio_ring serio_raw pata_acpi virtio [last unloaded: libceph]
[43415.024331] CPU: 0 PID: 7618 Comm: umount Tainted: G OE 4.10.0-rc5+ #8
[43415.025125] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[43415.026033] task: ffff9ee239c9d880 task.stack: ffffc275833b8000
[43415.026663] RIP: 0010:ceph_mdsc_release_request+0x169/0x170 [ceph]
[43415.027313] RSP: 0018:ffffc275833bbdc0 EFLAGS: 00010287
[43415.027860] RAX: ffff9ee2335f7b30 RBX: ffff9ee2335f4f30 RCX: 0000000100000000
[43415.028607] RDX: ffff9ee22e30d688 RSI: ffff9ee2335f4f58 RDI: ffff9ee22eb3fc00
[43415.029355] RBP: ffffc275833bbdd0 R08: ffff9ee229ea0a40 R09: 0000000000000001
[43415.030120] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9ee2335f4c00
[43415.030857] R13: ffff9ee229eec010 R14: ffff9ee22eb3fc08 R15: ffff9ee2335f4c00
[43415.031638] FS:  00007fe9e3c86340(0000) GS:ffff9ee23fc00000(0000) knlGS:0000000000000000
[43415.032484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43415.033090] CR2: 00005623e6999c28 CR3: 000000012cd0c000 CR4: 00000000000406f0
[43415.033830] Call Trace:
[43415.034108]  __unregister_request+0xc3/0x200 [ceph]
[43415.034628]  ceph_mdsc_pre_umount+0x131/0x1f0 [ceph]
[43415.035163]  ceph_kill_sb+0x2b/0x70 [ceph]
[43415.035600]  deactivate_locked_super+0x43/0x70
[43415.036074]  deactivate_super+0x5a/0x60
[43415.036481]  cleanup_mnt+0x3f/0x90
[43415.036845]  __cleanup_mnt+0x12/0x20
[43415.037232]  task_work_run+0x80/0xa0
[43415.037611]  exit_to_usermode_loop+0xaa/0xb0
[43415.038069]  syscall_return_slowpath+0xaa/0xb0
[43415.038540]  entry_SYSCALL_64_fastpath+0xa7/0xa9
[43415.039031] RIP: 0033:0x7fe9e2ceffd7
[43415.039414] RSP: 002b:00007ffe6835a1b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[43415.040200] RAX: 0000000000000000 RBX: 000055c9a6dea080 RCX: 00007fe9e2ceffd7
[43415.040940] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055c9a6df2350
[43415.041679] RBP: 000055c9a6df2350 R08: 000055c9a6df2370 R09: 0000000000000014
[43415.042424] R10: 000055c9a6dede40 R11: 0000000000000246 R12: 00007fe9e3a6d184
[43415.043374] R13: 0000000000000000 R14: 000055c9a6dea260 R15: 00007ffe6835a440
[43415.044341] Code: 8b bb 58 fd ff ff 48 8d b3 90 00 00 00 48 83 c3 68 e8 dc 20 ff ff 48 8b 03 48 39 c3 75 0d 4c 89 e7 e8 4c 85 9c e1 5b 41 5c 5d c3 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 49
[43415.046730] RIP: ceph_mdsc_release_request+0x169/0x170 [ceph] RSP: ffffc275833bbdc0
[43415.049061] ---[ end trace e740b673f6130a3f ]---
So it seems likely that we have a ceph_mds_request refcounting imbalance somewhere that is causing reqs to be freed before they have been removed from the s_unsafe list.
In my testing, it always happens in the unmount codepath during the generic/205 test. I'll have a look at that test and see whether we can narrow it down to certain codepaths.
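For what it's worth, the reason the added BUG_ON is a meaningful check: list_del_init() re-initializes the deleted entry to point at itself, so list_empty() on &req->r_unsafe_item answers "is this request still linked into a session's s_unsafe list?". A self-contained user-space sketch of the list semantics involved (simplified stand-ins for include/linux/list.h; the real kernel list_del() also writes LIST_POISON values, omitted here):

```c
#include <assert.h>

/* Minimal mock of the kernel's circular doubly-linked list. */
struct list_head {
	struct list_head *next, *prev;
};

void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* list_del_init() unlinks the entry and then points it back at
 * itself, so a later list_empty() on the entry itself is meaningful. */
void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}

int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Walk a mock r_unsafe_item through the three states the BUG_ON
 * distinguishes; returns them packed as a 3-digit code. */
int unsafe_list_demo(void)
{
	struct list_head s_unsafe, r_unsafe_item;
	int before, linked, after;

	INIT_LIST_HEAD(&s_unsafe);
	INIT_LIST_HEAD(&r_unsafe_item);
	before = list_empty(&r_unsafe_item);  /* unlinked: empty */
	list_add_tail(&r_unsafe_item, &s_unsafe);
	linked = list_empty(&r_unsafe_item);  /* on s_unsafe: NOT empty --
						 freeing now would trip the BUG_ON */
	list_del_init(&r_unsafe_item);
	after = list_empty(&r_unsafe_item);   /* unlinked again: safe to free */
	return before * 100 + linked * 10 + after;
}
```

So the BUG_ON fires exactly when a request is released while it still sits on some session's s_unsafe list, which is what the unmount-time backtrace above shows.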
Updated by Zheng Yan about 7 years ago
I think wait_requests() should remove the request from the unsafe list before calling __unregister_request()
Updated by Jeff Layton about 7 years ago
Zheng Yan wrote:
I think wait_requests() should remove the request from the unsafe list before calling __unregister_request()
Maybe. Would it be better instead to do it in __unregister_request? It's protected by the same lock you need for the stuff in that function, so removing it there would make sense to me.
Is there ever a time you'd want to remove it from the request tree but leave it on the s_unsafe list?
Updated by Zheng Yan about 7 years ago
Jeff Layton wrote:
Is there ever a time you'd want to remove it from the request tree but leave it on the s_unsafe list?
I don't think there is.
Updated by Jeff Layton about 7 years ago
On second thought, I don't really like that since __register_request doesn't put the thing on the list. I'm going to go with your original idea of list_del_init'ing it in wait_requests, and then add a WARN_ON_ONCE() in __unregister_request if the list isn't empty at that point. I ran xfstests on ceph last night with a patch like that and got no panics and no warnings, so I think it's safe enough.
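Roughly, the plan above would look like this (a hypothetical sketch against fs/ceph/mds_client.c, not the actual patch; the real wait_requests() does more per iteration than shown here):

```c
/* Sketch only: take the request off the session's unsafe list in
 * wait_requests() before unregistering it; per the discussion above,
 * both are covered by the lock already held here. */
static void wait_requests(struct ceph_mds_client *mdsc)
{
	struct ceph_mds_request *req;

	mutex_lock(&mdsc->mutex);
	while ((req = __get_oldest_req(mdsc))) {
		list_del_init(&req->r_unsafe_item);
		__unregister_request(mdsc, req);
	}
	mutex_unlock(&mdsc->mutex);
}

/* ...and in __unregister_request(), a tripwire for any path that
 * still leaves a request on s_unsafe at unregister time:
 *
 *	WARN_ON_ONCE(!list_empty(&req->r_unsafe_item));
 */
```

The WARN_ON_ONCE keeps the check from the earlier BUG_ON experiment but downgrades it, so a future regression logs a backtrace instead of killing the box.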