Bug #18474
oops in __unregister_request
Status: Closed
Description
I left an xfstests run going overnight, and when I came back I saw this oops on the console:
[14231.902682] run fstests generic/205 at 2017-01-09 23:11:24
[14339.010567] ceph: dropping unsafe request 18446621567284867072
[14339.011516] ------------[ cut here ]------------
[14339.012188] kernel BUG at fs/ceph/mds_client.c:576!
[14339.012816] invalid opcode: 0000 [#1] SMP
[14339.013338] Modules linked in: loop ceph(OE) libceph fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_raw ip6table_mangle ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 iptable_raw iptable_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables nfsd auth_rpcgss crct10dif_pclmul crc32_pclmul ghash_clmulni_intel nfs_acl lockd ppdev joydev acpi_cpufreq parport_pc tpm_tis pcspkr parport i2c_piix4 tpm_tis_core tpm virtio_balloon qemu_fw_cfg grace sunrpc xfs libcrc32c virtio_blk virtio_net virtio_console qxl drm_kms_helper ata_generic ttm serio_raw crc32c_intel drm pata_acpi virtio_pci virtio_ring virtio floppy
[14339.026549] CPU: 1 PID: 29659 Comm: kworker/1:2 Tainted: G OE 4.9.0+ #13
[14339.027572] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
[14339.029143] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[14339.029916] task: ffff9094ea818000 task.stack: ffffb500420b4000
[14339.030676] RIP: 0010:[<ffffffffc05a41bd>]  [<ffffffffc05a41bd>] __unregister_request+0x1ad/0x1b0 [ceph]
[14339.032275] RSP: 0018:ffffb500420b7c08  EFLAGS: 00010246
[14339.033134] RAX: 0000000000000000 RBX: ffff9094c012f400 RCX: 0000000000000000
[14339.034282] RDX: 0000000000000000 RSI: ffff9094efb87100 RDI: ffff9094efb87000
[14339.035108] RBP: ffffb500420b7c20 R08: 0000000000000000 R09: 0000000000000000
[14339.036325] R10: 0000000000000c00 R11: 0000000000000479 R12: ffff9094efb87000
[14339.037455] R13: ffff9094c012f408 R14: ffff9094efd00db8 R15: ffff9094c012f738
[14339.038532] FS:  0000000000000000(0000) GS:ffff9094ffc80000(0000) knlGS:0000000000000000
[14339.039827] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14339.040674] CR2: 0000555bf06f56b8 CR3: 00000001311a5000 CR4: 00000000000406e0
[14339.041755] Stack:
[14339.042082]  ffff9094efd00800 ffff9094c012f400 ffff9094efb87000 ffffb500420b7c60
[14339.043306]  ffffffffc05a4247 ffff9094efb87008 ffff9094efb87000 0000000000000003
[14339.044514]  ffff9094eee4af00 ffff9094efd00800 ffff9094eeea3420 ffffb500420b7d18
[14339.045718] Call Trace:
[14339.046109]  [<ffffffffc05a4247>] cleanup_session_requests+0x87/0x130 [ceph]
[14339.047199]  [<ffffffffc05a78ee>] dispatch+0x69e/0x16d0 [ceph]
[14339.048059]  [<ffffffffc052920b>] ceph_con_workfn+0x5fb/0x2ef0 [libceph]
[14339.049099]  [<ffffffffad0daa2c>] ? dequeue_task_fair+0x5ec/0x940
[14339.050035]  [<ffffffffad05cf4e>] ? kvm_sched_clock_read+0x1e/0x30
[14339.051029]  [<ffffffffad02fc49>] ? sched_clock+0x9/0x10
[14339.051807]  [<ffffffffad025728>] ? __switch_to+0x2a8/0x5c0
[14339.052652]  [<ffffffffad0bbe54>] process_one_work+0x184/0x430
[14339.053572]  [<ffffffffad0bc14e>] worker_thread+0x4e/0x490
[14339.054364]  [<ffffffffad0bc100>] ? process_one_work+0x430/0x430
[14339.055220]  [<ffffffffad0bc100>] ? process_one_work+0x430/0x430
[14339.056336]  [<ffffffffad0c1b59>] kthread+0xd9/0xf0
[14339.057270]  [<ffffffffad0c1a80>] ? kthread_park+0x60/0x60
[14339.058275]  [<ffffffffad0c1a80>] ? kthread_park+0x60/0x60
[14339.059293]  [<ffffffffad80cb15>] ret_from_fork+0x25/0x30
[14339.060293] Code: 11 00 00 74 e6 48 8b 40 f8 49 89 84 24 f8 00 00 00 e9 96 fe ff ff 4c 8b a3 88 00 00 00 4d 85 e4 0f 85 56 ff ff ff e9 cb fe ff ff <0f> 0b 90 66 66 66 66 90 55 48 89 e5 41 57 41 56 41 55 41 54 49
[14339.064785] RIP  [<ffffffffc05a41bd>] __unregister_request+0x1ad/0x1b0 [ceph]
[14339.065884]  RSP <ffffb500420b7c08>
[14339.067866] ---[ end trace 8a0bb333836080c5 ]---
...the debugger says:
(gdb) list *(__unregister_request+0x1ad)
0x231ed is in __unregister_request (fs/ceph/mds_client.c:576).
571		put_request_session(req);
572		ceph_unreserve_caps(req->r_mdsc, &req->r_caps_reservation);
573		kfree(req);
574	}
575
576	DEFINE_RB_FUNCS(request, struct ceph_mds_request, r_tid, r_node)
577
578	/*
579	 * lookup session, bump ref if found.
580	 *
Which may be the BUG_ON in DEFINE_RB_INSDEL_FUNCS:
BUG_ON(!RB_EMPTY_NODE(&t->nodefld));
...unfortunately I didn't get a vmcore, so it's hard to know for sure.
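For reference, RB_EMPTY_NODE() doesn't test for an all-zero node: an unlinked rb_node is marked by pointing its parent field at itself. Here's a minimal user-space mock of that convention (the struct layout and macros are simplified stand-ins for include/linux/rbtree.h, not the real kernel code, which also packs the color bit into that field):

```c
#include <assert.h>

/* Simplified stand-in for the kernel's rb_node "empty" convention:
 * a node that belongs to no tree has its parent pointer aimed at
 * itself (see RB_CLEAR_NODE/RB_EMPTY_NODE in include/linux/rbtree.h). */
struct rb_node {
	unsigned long __rb_parent_color;
};

#define RB_CLEAR_NODE(n) ((n)->__rb_parent_color = (unsigned long)(n))
#define RB_EMPTY_NODE(n) ((n)->__rb_parent_color == (unsigned long)(n))

/* A zeroed node (fresh kzalloc, or an object something scribbled
 * zeros over) does NOT read as empty -- its parent field is 0, not a
 * self-pointer -- so BUG_ON(!RB_EMPTY_NODE(...)) fires on it. */
int zeroed_node_looks_linked(void)
{
	struct rb_node n = { 0 };
	return !RB_EMPTY_NODE(&n);
}

int cleared_node_is_empty(void)
{
	struct rb_node n;
	RB_CLEAR_NODE(&n);
	return RB_EMPTY_NODE(&n);
}
```

So a request that gets zeroed or reallocated out from under us while still registered would "look linked" to the rb-tree helpers and trip exactly this BUG_ON.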
Updated by Jeff Layton over 7 years ago
On the next attempt:
$ sudo ./check generic/205
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 cephfs 4.9.0+
MKFS_OPTIONS  -- 192.168.1.149:/scratch
MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 192.168.1.149:/scratch /mnt/scratch

generic/205	 [not run] Reflink not supported by scratch filesystem type: ceph
Not run: generic/205
Passed all 0 tests
...with no crash (and generic/205 was skipped this time, since ceph doesn't support reflink). So perhaps the original oops occurred during the mount or the unmount? The message about a dropped unsafe request may be a hint there.
Updated by Jeff Layton about 7 years ago
Hit a very similar crash in testing today. This time it crashed while doing the list_del_init in cleanup_session_requests(). Almost the same spot, but not exactly. I think we may have a memory scribble.
I'll plan to add some strategic BUG_ONs -- see if we're maybe freeing the reqs while they're still on s_unsafe? I'll also see if I can set up kdump and get a vmcore. That might tell us a bit more about how we get a zeroed out object here.
It might also not hurt to make a dedicated slabcache for ceph_mds_request objects. That way we could do some memory poisoning of just those objects too.
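To illustrate what poisoning buys us here: SLAB_POISON fills freed objects with a known byte (0x6b, POISON_FREE) and verifies the pattern is still intact on the next allocation, so a stale write into a freed request is caught at alloc time instead of silently corrupting a neighbour. A toy user-space allocator showing the idea (illustrative only -- the cache struct and function names are made up, not the kernel allocator):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POISON_FREE 0x6b  /* same byte the kernel uses to poison freed objects */

/* Toy single-slot "cache": free poisons the object, alloc checks it. */
struct obj_cache {
	size_t size;
	void *freed;
};

void *cache_alloc(struct obj_cache *c)
{
	if (c->freed) {
		unsigned char *p = c->freed;
		/* poison check: did anyone scribble on this after free? */
		for (size_t i = 0; i < c->size; i++)
			if (p[i] != POISON_FREE)
				return NULL;  /* caught a use-after-free */
		c->freed = NULL;
		memset(p, 0, c->size);
		return p;
	}
	return calloc(1, c->size);
}

void cache_free(struct obj_cache *c, void *p)
{
	memset(p, POISON_FREE, c->size);
	c->freed = p;
}

/* A write into a freed object breaks the poison pattern and is
 * detected on the next allocation. */
int detects_scribble(void)
{
	struct obj_cache c = { .size = 32, .freed = NULL };
	unsigned char *p = cache_alloc(&c);
	cache_free(&c, p);
	p[0] = 0xff;                     /* use-after-free scribble */
	return cache_alloc(&c) == NULL;  /* poison check catches it */
}

/* Clean free/alloc cycles pass the check and reuse the object. */
int clean_reuse(void)
{
	struct obj_cache c = { .size = 32, .freed = NULL };
	void *p = cache_alloc(&c);
	cache_free(&c, p);
	return cache_alloc(&c) == p;
}
```

With a dedicated kmem_cache for ceph_mds_request, slub_debug poisoning could be enabled for just those objects without slowing down every other slab in the system.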
Updated by Zheng Yan about 7 years ago
If op == CEPH_SESSION_CLOSE, handle_session() unregisters the session. Maybe it's a use-after-free bug (I didn't check carefully).
Updated by Jeff Layton about 7 years ago
Yeah, could be a use-after-free, or maybe a refcounting imbalance in the session handling? I kicked off another xfstests run and it didn't occur, so I'm guessing it may be a race of some sort.
xfstests does do a bunch of mounting and unmounting in quick succession and this oops is occurring in the session teardown code, so maybe it's related to that somehow?
Updated by Jeff Layton about 7 years ago
I added this just before the kfree(req) in ceph_mdsc_release_request:
BUG_ON(!list_empty(&req->r_unsafe_item));
...and kicked off an xfstests run before bedtime last night. This was on the console this morning:
[43351.124014] run fstests generic/205 at 2017-02-06 22:52:49
[43415.014713] ------------[ cut here ]------------
[43415.015243] kernel BUG at fs/ceph/mds_client.c:587!
[43415.015759] invalid opcode: 0000 [#1] SMP
[43415.016190] Modules linked in: ceph(OE) libceph(OE) loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_raw ip6table_mangle ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_security iptable_raw iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_security ebtable_filter ebtables ip6table_filter ip6_tables crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev joydev acpi_cpufreq i2c_piix4 parport_pc pcspkr nfsd tpm_tis parport virtio_balloon tpm_tis_core qemu_fw_cfg tpm auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c qxl virtio_console drm_kms_helper virtio_blk virtio_net ttm floppy drm ata_generic virtio_pci
[43415.023469]  crc32c_intel virtio_ring serio_raw pata_acpi virtio [last unloaded: libceph]
[43415.024331] CPU: 0 PID: 7618 Comm: umount Tainted: G OE 4.10.0-rc5+ #8
[43415.025125] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
[43415.026033] task: ffff9ee239c9d880 task.stack: ffffc275833b8000
[43415.026663] RIP: 0010:ceph_mdsc_release_request+0x169/0x170 [ceph]
[43415.027313] RSP: 0018:ffffc275833bbdc0 EFLAGS: 00010287
[43415.027860] RAX: ffff9ee2335f7b30 RBX: ffff9ee2335f4f30 RCX: 0000000100000000
[43415.028607] RDX: ffff9ee22e30d688 RSI: ffff9ee2335f4f58 RDI: ffff9ee22eb3fc00
[43415.029355] RBP: ffffc275833bbdd0 R08: ffff9ee229ea0a40 R09: 0000000000000001
[43415.030120] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9ee2335f4c00
[43415.030857] R13: ffff9ee229eec010 R14: ffff9ee22eb3fc08 R15: ffff9ee2335f4c00
[43415.031638] FS:  00007fe9e3c86340(0000) GS:ffff9ee23fc00000(0000) knlGS:0000000000000000
[43415.032484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[43415.033090] CR2: 00005623e6999c28 CR3: 000000012cd0c000 CR4: 00000000000406f0
[43415.033830] Call Trace:
[43415.034108]  __unregister_request+0xc3/0x200 [ceph]
[43415.034628]  ceph_mdsc_pre_umount+0x131/0x1f0 [ceph]
[43415.035163]  ceph_kill_sb+0x2b/0x70 [ceph]
[43415.035600]  deactivate_locked_super+0x43/0x70
[43415.036074]  deactivate_super+0x5a/0x60
[43415.036481]  cleanup_mnt+0x3f/0x90
[43415.036845]  __cleanup_mnt+0x12/0x20
[43415.037232]  task_work_run+0x80/0xa0
[43415.037611]  exit_to_usermode_loop+0xaa/0xb0
[43415.038069]  syscall_return_slowpath+0xaa/0xb0
[43415.038540]  entry_SYSCALL_64_fastpath+0xa7/0xa9
[43415.039031] RIP: 0033:0x7fe9e2ceffd7
[43415.039414] RSP: 002b:00007ffe6835a1b8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[43415.040200] RAX: 0000000000000000 RBX: 000055c9a6dea080 RCX: 00007fe9e2ceffd7
[43415.040940] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055c9a6df2350
[43415.041679] RBP: 000055c9a6df2350 R08: 000055c9a6df2370 R09: 0000000000000014
[43415.042424] R10: 000055c9a6dede40 R11: 0000000000000246 R12: 00007fe9e3a6d184
[43415.043374] R13: 0000000000000000 R14: 000055c9a6dea260 R15: 00007ffe6835a440
[43415.044341] Code: 8b bb 58 fd ff ff 48 8d b3 90 00 00 00 48 83 c3 68 e8 dc 20 ff ff 48 8b 03 48 39 c3 75 0d 4c 89 e7 e8 4c 85 9c e1 5b 41 5c 5d c3 <0f> 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 55 41 54 49
[43415.046730] RIP: ceph_mdsc_release_request+0x169/0x170 [ceph] RSP: ffffc275833bbdc0
[43415.049061] ---[ end trace e740b673f6130a3f ]---
So it seems likely that we have a ceph_mds_request refcounting imbalance somewhere that is causing reqs to be freed before they have been removed from the s_unsafe list.
In my testing, it always happens in the unmount codepath during the generic/205 test. I'll have a look at that test and see whether we can narrow it down to certain codepaths.
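For what it's worth, the reason the added BUG_ON is a meaningful check: list_del_init() re-initializes the deleted entry to point at itself, so list_empty() on &req->r_unsafe_item answers "is this request still linked into a session's s_unsafe list?". A self-contained user-space sketch of the list semantics involved (simplified stand-ins for include/linux/list.h; the real kernel list_del() also writes LIST_POISON values, omitted here):

```c
#include <assert.h>

/* Minimal mock of the kernel's circular doubly-linked list. */
struct list_head {
	struct list_head *next, *prev;
};

void INIT_LIST_HEAD(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* list_del_init() unlinks the entry and then points it back at
 * itself, so a later list_empty() on the entry itself is meaningful. */
void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}

int list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Walk a mock r_unsafe_item through the three states the BUG_ON
 * distinguishes; returns them packed as a 3-digit code. */
int unsafe_list_demo(void)
{
	struct list_head s_unsafe, r_unsafe_item;
	int before, linked, after;

	INIT_LIST_HEAD(&s_unsafe);
	INIT_LIST_HEAD(&r_unsafe_item);
	before = list_empty(&r_unsafe_item);  /* unlinked: empty */
	list_add_tail(&r_unsafe_item, &s_unsafe);
	linked = list_empty(&r_unsafe_item);  /* on s_unsafe: NOT empty --
						 freeing now would trip the BUG_ON */
	list_del_init(&r_unsafe_item);
	after = list_empty(&r_unsafe_item);   /* unlinked again: safe to free */
	return before * 100 + linked * 10 + after;
}
```

So the BUG_ON fires exactly when a request is released while it still sits on some session's s_unsafe list, which is what the unmount-time backtrace above shows.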
Updated by Zheng Yan about 7 years ago
I think wait_requests() should remove the request from the unsafe list before calling __unregister_request()
Updated by Jeff Layton about 7 years ago
Zheng Yan wrote:
I think wait_requests() should remove the request from the unsafe list before calling __unregister_request()
Maybe. Would it be better instead to do it in __unregister_request? It's protected by the same lock you need for the stuff in that function, so removing it there would make sense to me.
Is there ever a time you'd want to remove it from the request tree but leave it on the s_unsafe list?
Updated by Zheng Yan about 7 years ago
Jeff Layton wrote:
Is there ever a time you'd want to remove it from the request tree but leave it on the s_unsafe list?
I don't think there is.
Updated by Jeff Layton about 7 years ago
On second thought, I don't really like that since __register_request doesn't put the thing on the list. I'm going to go with your original idea of list_del_init'ing it in wait_requests, and then add a WARN_ON_ONCE() in __unregister_request if the list isn't empty at that point. I ran xfstests on ceph last night with a patch like that and got no panics and no warnings, so I think it's safe enough.
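Roughly, the plan above would look like this (a hypothetical sketch against fs/ceph/mds_client.c, not the actual patch; the real wait_requests() does more per iteration than shown here):

```c
/* Sketch only: take the request off the session's unsafe list in
 * wait_requests() before unregistering it; per the discussion above,
 * both are covered by the lock already held here. */
static void wait_requests(struct ceph_mds_client *mdsc)
{
	struct ceph_mds_request *req;

	mutex_lock(&mdsc->mutex);
	while ((req = __get_oldest_req(mdsc))) {
		list_del_init(&req->r_unsafe_item);
		__unregister_request(mdsc, req);
	}
	mutex_unlock(&mdsc->mutex);
}

/* ...and in __unregister_request(), a tripwire for any path that
 * still leaves a request on s_unsafe at unregister time:
 *
 *	WARN_ON_ONCE(!list_empty(&req->r_unsafe_item));
 */
```

The WARN_ON_ONCE keeps the check from the earlier BUG_ON experiment but downgrades it, so a future regression logs a backtrace instead of killing the box.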