Bug #17153
kernel hung task warnings on teuthology.front kernel
Status: Closed
Description
teuthology.front.sepia.ceph.com was recently upgraded to the Ubuntu distro kernel (4.4.0-34-generic). Since then, I'm seeing some warnings in the ring buffer:
[998531.323457] INFO: task nginx:2786 blocked for more than 120 seconds.
[998531.323489]       Not tainted 4.4.0-34-generic #53-Ubuntu
[998531.323506] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[998531.323529] nginx           D ffff880819a4b968     0  2786   2774 0x00000000
[998531.323533]  ffff880819a4b968 ffff88081ec00d80 ffff88081bb6e040 ffff880813a26040
[998531.323536]  ffff880819a4c000 ffff88081f3d6d00 7fffffffffffffff ffff880819a4bae0
[998531.323538]  ffffffff8182a610 ffff880819a4b980 ffffffff81829e15 0000000000000000
[998531.323540] Call Trace:
[998531.323564]  [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[998531.323567]  [<ffffffff81829e15>] schedule+0x35/0x80
[998531.323569]  [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[998531.323573]  [<ffffffff811eb0a9>] ? get_partial_node.isra.61+0x1c9/0x200
[998531.323577]  [<ffffffff8106426e>] ? kvm_clock_get_cycles+0x1e/0x20
[998531.323579]  [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[998531.323581]  [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[998531.323583]  [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[998531.323585]  [<ffffffff8182a3ae>] __wait_on_bit_lock+0x4e/0xb0
[998531.323590]  [<ffffffff8118d43b>] __lock_page+0xbb/0xe0
[998531.323594]  [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[998531.323599]  [<ffffffff8123f981>] __generic_file_splice_read+0x4b1/0x5c0
[998531.323601]  [<ffffffff8123e2f0>] ? page_cache_pipe_buf_release+0x20/0x20
[998531.323605]  [<ffffffff8179dd93>] ? inet_sendpage+0x73/0xd0
[998531.323607]  [<ffffffff8123fe72>] generic_file_splice_read+0x42/0x80
[998531.323609]  [<ffffffff8123e779>] do_splice_to+0x69/0x80
[998531.323611]  [<ffffffff8123e84a>] splice_direct_to_actor+0xba/0x210
[998531.323613]  [<ffffffff8123e200>] ? do_splice_from+0x30/0x30
[998531.323615]  [<ffffffff8123ea38>] do_splice_direct+0x98/0xd0
[998531.323618]  [<ffffffff8120dd2f>] do_sendfile+0x1bf/0x3a0
[998531.323620]  [<ffffffff8120e96e>] SyS_sendfile64+0x5e/0xb0
[998531.323622]  [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[998531.323672] INFO: task teuthology:26992 blocked for more than 120 seconds.
[998531.323693]       Not tainted 4.4.0-34-generic #53-Ubuntu
[998531.323710] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[998531.323732] teuthology      D ffff8802df4379a8     0 26992  25730 0x00000000
[998531.323734]  ffff8802df4379a8 ffff88081549e068 ffff880812fc0000 ffff8805e743d280
[998531.323736]  ffff8802df438000 ffff88081f596d00 7fffffffffffffff ffff8802df437b20
[998531.323738]  ffffffff8182a610 ffff8802df4379c0 ffffffff81829e15 0000000000000000
[998531.323740] Call Trace:
[998531.323743]  [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[998531.323745]  [<ffffffff81829e15>] schedule+0x35/0x80
[998531.323746]  [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[998531.323750]  [<ffffffff81405dd5>] ? find_next_bit+0x15/0x20
[998531.323754]  [<ffffffff813f0fbf>] ? cpumask_next_and+0x2f/0x40
[998531.323756]  [<ffffffff810bcb65>] ? update_sd_lb_stats+0x115/0x520
[998531.323758]  [<ffffffff8106426e>] ? kvm_clock_get_cycles+0x1e/0x20
[998531.323760]  [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[998531.323762]  [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[998531.323764]  [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[998531.323766]  [<ffffffff8182a3ae>] __wait_on_bit_lock+0x4e/0xb0
[998531.323768]  [<ffffffff8118d43b>] __lock_page+0xbb/0xe0
[998531.323771]  [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[998531.323773]  [<ffffffff8118e79d>] pagecache_get_page+0x17d/0x1c0
[998531.323788]  [<ffffffffc039233a>] ? ceph_pool_perm_check+0x5a/0x700 [ceph]
[998531.323791]  [<ffffffff8118e806>] grab_cache_page_write_begin+0x26/0x40
[998531.323797]  [<ffffffffc0391638>] ceph_write_begin+0x48/0xe0 [ceph]
[998531.323799]  [<ffffffff8118db4e>] generic_perform_write+0xce/0x1c0
[998531.323803]  [<ffffffff812282a9>] ? file_update_time+0xc9/0x110
[998531.323809]  [<ffffffffc038c16a>] ceph_write_iter+0xf8a/0x1050 [ceph]
[998531.323812]  [<ffffffff8122dbc4>] ? mntput+0x24/0x40
[998531.323814]  [<ffffffff8121777d>] ? terminate_walk+0xbd/0xd0
[998531.323817]  [<ffffffff8121ce11>] ? filename_lookup+0xf1/0x180
[998531.323819]  [<ffffffff811ebac7>] ? kmem_cache_alloc+0x187/0x1f0
[998531.323821]  [<ffffffff8121c9d6>] ? getname_flags+0x56/0x1f0
[998531.323823]  [<ffffffff8120c97b>] new_sync_write+0x9b/0xe0
[998531.323825]  [<ffffffff8120c9e6>] __vfs_write+0x26/0x40
[998531.323827]  [<ffffffff8120d369>] vfs_write+0xa9/0x1a0
[998531.323828]  [<ffffffff8120e025>] SyS_write+0x55/0xc0
[998531.323830]  [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[998531.323833] INFO: task teuthology:2191 blocked for more than 120 seconds.
[998531.323853]       Not tainted 4.4.0-34-generic #53-Ubuntu
[998531.323869] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[998531.323892] teuthology      D ffff880005c7b9a8     0  2191   6398 0x00000000
[998531.323894]  ffff880005c7b9a8 ffff88081549e068 ffff880620cce040 ffff880035c9d280
[998531.323896]  ffff880005c7c000 ffff88081f5d6d00 7fffffffffffffff ffff880005c7bb20
[998531.323898]  ffffffff8182a610 ffff880005c7b9c0 ffffffff81829e15 0000000000000000
[998531.323900] Call Trace:
[998531.323902]  [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[998531.323904]  [<ffffffff81829e15>] schedule+0x35/0x80
[998531.323906]  [<ffffffff8182cf35>] schedule_timeout+0x1b5/0x270
[998531.323908]  [<ffffffff81749bf4>] ? sch_direct_xmit+0x74/0x220
[998531.323910]  [<ffffffff810ca961>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[998531.323912]  [<ffffffff8106426e>] ? kvm_clock_get_cycles+0x1e/0x20
[998531.323914]  [<ffffffff8182a610>] ? bit_wait+0x60/0x60
[998531.323916]  [<ffffffff81829344>] io_schedule_timeout+0xa4/0x110
[998531.323918]  [<ffffffff8182a62b>] bit_wait_io+0x1b/0x70
[998531.323920]  [<ffffffff8182a3ae>] __wait_on_bit_lock+0x4e/0xb0
[998531.323922]  [<ffffffff8118d43b>] __lock_page+0xbb/0xe0
[998531.323924]  [<ffffffff810c3ce0>] ? autoremove_wake_function+0x40/0x40
[998531.323926]  [<ffffffff8118e79d>] pagecache_get_page+0x17d/0x1c0
[998531.323933]  [<ffffffffc039233a>] ? ceph_pool_perm_check+0x5a/0x700 [ceph]
[998531.323935]  [<ffffffff8118e806>] grab_cache_page_write_begin+0x26/0x40
[998531.323940]  [<ffffffffc0391638>] ceph_write_begin+0x48/0xe0 [ceph]
[998531.323943]  [<ffffffff8118db4e>] generic_perform_write+0xce/0x1c0
[998531.323945]  [<ffffffff812282a9>] ? file_update_time+0xc9/0x110
[998531.323950]  [<ffffffffc038c16a>] ceph_write_iter+0xf8a/0x1050 [ceph]
[998531.323952]  [<ffffffff8122dbc4>] ? mntput+0x24/0x40
[998531.323955]  [<ffffffff8121777d>] ? terminate_walk+0xbd/0xd0
[998531.323957]  [<ffffffff8121ce11>] ? filename_lookup+0xf1/0x180
[998531.323959]  [<ffffffff811ebac7>] ? kmem_cache_alloc+0x187/0x1f0
[998531.323962]  [<ffffffff8121c9d6>] ? getname_flags+0x56/0x1f0
[998531.323964]  [<ffffffff8120c97b>] new_sync_write+0x9b/0xe0
[998531.323966]  [<ffffffff8120c9e6>] __vfs_write+0x26/0x40
[998531.323968]  [<ffffffff8120d369>] vfs_write+0xa9/0x1a0
[998531.323970]  [<ffffffff8120e025>] SyS_write+0x55/0xc0
[998531.323972]  [<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
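For reference, the nginx task above is blocked under a sendfile(2) call (SyS_sendfile64 into the splice machinery, then waiting in __lock_page for a page of the file being read). This is roughly the userspace pattern that enters that code path; a minimal sketch in Python with hypothetical file names, not the actual nginx configuration:

```python
import os

# Hypothetical paths for illustration only; nginx would be serving
# files out of the CephFS-backed archive instead of /tmp.
with open("/tmp/src.bin", "wb") as f:
    f.write(b"hello" * 1000)

src = os.open("/tmp/src.bin", os.O_RDONLY)
dst = os.open("/tmp/dst.bin", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)

offset = 0
remaining = os.fstat(src).st_size
while remaining > 0:
    # os.sendfile() enters the kernel via the sendfile64 syscall, the
    # same entry point in the backtrace; it can block indefinitely in
    # __lock_page if a page of the source file never becomes unlocked.
    sent = os.sendfile(dst, src, offset, remaining)
    offset += sent
    remaining -= sent

os.close(src)
os.close(dst)
```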
I don't have sudo perms on this box, so I can't verify whether the tasks are still hung there, but I see several processes with matching pids that have been running for a very long time.
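As an aside, the 120-second threshold in the warnings comes from the khungtaskd watchdog, and its knobs can be read without root. A quick sketch (standard procfs paths; values vary per host and the files are absent if the kernel lacks CONFIG_DETECT_HUNG_TASK):

```python
# Read the hung-task watchdog settings from procfs (Linux only);
# these control the "blocked for more than 120 seconds" warnings.
results = {}
for knob in ("hung_task_timeout_secs", "hung_task_warnings"):
    path = "/proc/sys/kernel/" + knob
    try:
        with open(path) as f:
            results[knob] = f.read().strip()
    except FileNotFoundError:
        # Kernel built without hung-task detection support.
        results[knob] = None
print(results)
```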
Updated by Jeff Layton over 7 years ago
Most likely we just need to update to a more recent kernel. 4.4.0 is pretty old at this point and I'm sure it's missing a lot of fixes that we'd want in there. Ilya (I think?) had mentioned maintaining a more bleeding-edge kernel for teuthology.front, so that may be the best solution.
Updated by Jeff Layton over 7 years ago
- Subject changed from kernel hung task warnings on teuthology kernel to kernel hung task warnings on teuthology.front kernel
Updated by David Galloway over 7 years ago
Jeff's got sudo access on teuthology.front now. Please let me know when you've got what you need so I can revert the change.
Updated by Jeff Layton over 7 years ago
Ok, dgalloway granted me sudo perms on there for a bit. All of those tasks seem to be sitting in the same spot now:
jlayton@teuthology:~$ sudo cat /proc/6411/stack
[<ffffffff81256220>] ep_poll+0x2c0/0x3d0
[<ffffffff81257508>] SyS_epoll_wait+0xb8/0xd0
[<ffffffff8182def2>] entry_SYSCALL_64_fastpath+0x16/0x71
[<ffffffffffffffff>] 0xffffffffffffffff
jlayton@teuthology:~$ sudo lsof -p 6411
COMMAND   PID  USER        FD  TYPE    DEVICE SIZE/OFF       NODE NAME
teutholog 6411 teuthworker cwd DIR 253,2 4096 2171158542 /home/teuthworker
teutholog 6411 teuthworker rtd DIR 253,2 4096 96 /
teutholog 6411 teuthworker txt REG 253,2 3546104 6442864293 /home/teuthworker/src/teuthology_master/virtualenv/bin/python2 (deleted)
teutholog 6411 teuthworker mem REG 253,2 47648 2147567838 /lib/x86_64-linux-gnu/libnss_nis-2.23.so
teutholog 6411 teuthworker mem REG 253,2 93128 2147567836 /lib/x86_64-linux-gnu/libnsl-2.23.so
teutholog 6411 teuthworker mem REG 253,2 35688 2147567834 /lib/x86_64-linux-gnu/libnss_compat-2.23.so
teutholog 6411 teuthworker mem REG 253,2 101200 2147568339 /lib/x86_64-linux-gnu/libresolv-2.23.so
teutholog 6411 teuthworker mem REG 253,2 27016 2147568349 /lib/x86_64-linux-gnu/libnss_dns-2.23.so
teutholog 6411 teuthworker mem REG 253,2 47600 2147567841 /lib/x86_64-linux-gnu/libnss_files-2.23.so
teutholog 6411 teuthworker DEL REG 253,2 4300124251 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/psutil/_psutil_posix.so
teutholog 6411 teuthworker DEL REG 253,2 4300124250 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/psutil/_psutil_linux.so
teutholog 6411 teuthworker DEL REG 253,2 4298367050 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/simplejson/_speedups.so
teutholog 6411 teuthworker DEL REG 253,2 4300123227 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/.libs_cffi_backend/libffi-72499c49.so.6.0.4
teutholog 6411 teuthworker DEL REG 253,2 2173769662 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/_cffi_backend.so
teutholog 6411 teuthworker DEL REG 253,2 4300123249 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/cryptography/hazmat/bindings/_openssl.so
teutholog 6411 teuthworker mem REG 253,2 18976 2147567913 /lib/x86_64-linux-gnu/libuuid.so.1.3.0
teutholog 6411 teuthworker mem REG 253,2 31104 2147671658 /usr/lib/x86_64-linux-gnu/libffi.so.6.0.4
teutholog 6411 teuthworker mem REG 253,2 140184 6442786126 /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
teutholog 6411 teuthworker mem REG 253,2 54144 6442786099 /usr/lib/python2.7/lib-dynload/_json.x86_64-linux-gnu.so
teutholog 6411 teuthworker mem REG 253,2 125016 2148846251 /usr/lib/x86_64-linux-gnu/libyaml-0.so.2.0.4
teutholog 6411 teuthworker DEL REG 253,2 2173768114 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/_yaml.so
teutholog 6411 teuthworker mem REG 253,2 24736 6442786125 /usr/lib/python2.7/lib-dynload/termios.x86_64-linux-gnu.so
teutholog 6411 teuthworker DEL REG 253,2 15021951 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/gevent/_semaphore.so
teutholog 6411 teuthworker DEL REG 253,2 15021987 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/gevent/corecext.so
teutholog 6411 teuthworker DEL REG 253,2 2173768032 /home/teuthworker/src/teuthology_master/virtualenv/lib/python2.7/site-packages/greenlet.so
teutholog 6411 teuthworker mem REG 253,2 428368 2147568500 /lib/x86_64-linux-gnu/libssl.so.1.0.0
teutholog 6411 teuthworker mem REG 253,2 102504 6442786117 /usr/lib/python2.7/lib-dynload/_ssl.x86_64-linux-gnu.so
teutholog 6411 teuthworker mem REG 253,2 2455456 2147568499 /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
teutholog 6411 teuthworker mem REG 253,2 29384 6442786109 /usr/lib/python2.7/lib-dynload/_hashlib.x86_64-linux-gnu.so
teutholog 6411 teuthworker mem REG 253,2 2981280 4298006600 /usr/lib/locale/locale-archive
teutholog 6411 teuthworker mem REG 253,2 1088952 2147568341 /lib/x86_64-linux-gnu/libm-2.23.so
teutholog 6411 teuthworker mem REG 253,2 104824 2147569012 /lib/x86_64-linux-gnu/libz.so.1.2.8
teutholog 6411 teuthworker mem REG 253,2 10656 2147568345 /lib/x86_64-linux-gnu/libutil-2.23.so
teutholog 6411 teuthworker mem REG 253,2 14608 2147568352 /lib/x86_64-linux-gnu/libdl-2.23.so
teutholog 6411 teuthworker mem REG 253,2 1864888 2147568351 /lib/x86_64-linux-gnu/libc-2.23.so
teutholog 6411 teuthworker mem REG 253,2 138744 2147568350 /lib/x86_64-linux-gnu/libpthread-2.23.so
teutholog 6411 teuthworker mem REG 253,2 162632 2147568347 /lib/x86_64-linux-gnu/ld-2.23.so
teutholog 6411 teuthworker 0r CHR 1,3 0t0 6 /dev/null
teutholog 6411 teuthworker 1u CHR 136,18 0t0 21 /dev/pts/18 (deleted)
teutholog 6411 teuthworker 2u CHR 136,18 0t0 21 /dev/pts/18 (deleted)
teutholog 6411 teuthworker 3w REG 0,0 452855 1099943554918 /ceph/teuthology-archive/worker_logs/worker.vps.6411
teutholog 6411 teuthworker 4u IPv4 26548245 0t0 TCP localhost:44772->localhost:11300 (ESTABLISHED)
teutholog 6411 teuthworker 9u a_inode 0,11 0 7113 [eventpoll]
teutholog 6411 teuthworker 10r FIFO 0,10 0t0 26548229 pipe
teutholog 6411 teuthworker 11r CHR 1,9 0t0 11 /dev/urandom
teutholog 6411 teuthworker 12w FIFO 0,10 0t0 26548229 pipe
jlayton@teuthology:~$ sudo strace -f -p 6411
strace: Process 6411 attached with 2 threads
[pid  6542] futex(0x7fa4b0001af0, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff <unfinished ...>
[pid  6411] clock_gettime(CLOCK_MONOTONIC, {1466159, 850065931}) = 0
[pid  6411] clock_gettime(CLOCK_MONOTONIC, {1466159, 850143464}) = 0
[pid  6411] epoll_wait(9, ^Cstrace: Process 6411 detached
 <detached ...>
strace: Process 6542 detached
So they're all epoll'ing at the moment and seem to be stuck there indefinitely. My guess would be that they're epolling the worker.vps file.
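For context, the ep_poll frame in the /proc/PID/stack dump is simply a blocked epoll_wait with no ready events, matching the [eventpoll] fd and pipe fds in the lsof output. A minimal standalone sketch of that state in Python (not teuthology code):

```python
import os
import select

# A pipe watched by an epoll instance, mirroring the fd setup seen in
# the lsof output above (an [eventpoll] anon inode plus pipe fds).
rfd, wfd = os.pipe()
ep = select.epoll()
ep.register(rfd, select.EPOLLIN)

# With nothing written, epoll_wait() blocks; we use a short timeout so
# the sketch terminates. The stuck workers have no timeout and no
# events, so they sit in ep_poll indefinitely.
events = ep.poll(timeout=0.1)   # -> []
print(events)

# Once the writer side produces data, epoll_wait() returns the fd.
os.write(wfd, b"x")
events = ep.poll(timeout=1)
print(events)                   # -> [(rfd, select.EPOLLIN)]

ep.close()
os.close(rfd)
os.close(wfd)
```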
FWIW, all of the warnings seem to have popped on August 21, so it's possible there was a hiccup in the ceph cluster that was mounted that day. Either way, I don't see any evidence of an ongoing problem at the moment, so maybe this was just the result of a network partition or something similar?
Updated by Jeff Layton about 7 years ago
My guess at this point is that this may be a different manifestation of bug 18130. This is also occurring in the ceph splice code, and that clearly doesn't handle all ITER_* types correctly. Once that's fixed, we can revisit this and see whether it's still a problem.
Updated by Jeff Layton over 6 years ago
I haven't seen this specific lockup in a while, but I have seen this one, which looks unrelated:
[ 6068.578689] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [python:1346]
[ 6068.578733] Modules linked in: binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ipmi_devintf ipmi_msghandler ceph libceph fscache crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ppdev parport_pc parport joydev input_leds i2c_piix4 serio_raw mac_hid qemu_fw_cfg sunrpc autofs4 xfs libcrc32c qxl ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm virtio_net psmouse virtio_blk pata_acpi floppy
[ 6068.578771] CPU: 15 PID: 1346 Comm: python Not tainted 4.10.0-42-generic #46~16.04.1-Ubuntu
[ 6068.578772] Hardware name: Red Hat RHEV Hypervisor, BIOS 1.10.2-3.el7_4.1 04/01/2014
[ 6068.578773] task: ffff9718d1ad1680 task.stack: ffffa89f0c788000
[ 6068.578779] RIP: 0010:smp_call_function_single+0xd6/0x130
[ 6068.578780] RSP: 0018:ffffa89f0c78bc88 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[ 6068.578781] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000830
[ 6068.578782] RDX: 0000000000000001 RSI: 00000000000000fb RDI: 0000000000000830
[ 6068.578782] RBP: ffffa89f0c78bcd0 R08: 0000000000000000 R09: 0000000000008010
[ 6068.578783] R10: ffff9718c201e578 R11: ffffd8ed2414b9f0 R12: ffffffffa0e75f60
[ 6068.578784] R13: ffffa89f0c78bd28 R14: 0000000000000004 R15: ffff9718cb6aa270
[ 6068.578785] FS:  00007f97e7fff700(0000) GS:ffff9718db1c0000(0000) knlGS:0000000000000000
[ 6068.578785] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6068.578786] CR2: 00007f19851e58a8 CR3: 0000000904e0c000 CR4: 00000000000406e0
[ 6068.578790] Call Trace:
[ 6068.578795]  ? do_flush_tlb_all+0x50/0x50
[ 6068.578797]  ? do_flush_tlb_all+0x50/0x50
[ 6068.578798]  smp_call_function_many+0x20c/0x250
[ 6068.578800]  native_flush_tlb_others+0x65/0x130
[ 6068.578801]  flush_tlb_mm_range+0x64/0x150
[ 6068.578803]  copy_process.part.34+0x1b4a/0x1c20
[ 6068.578806]  ? kmem_cache_alloc_trace+0x1a1/0x1c0
[ 6068.578808]  _do_fork+0xe7/0x3f0
[ 6068.578812]  ? __fd_install+0x30/0xd0
[ 6068.578813]  SyS_clone+0x19/0x20
[ 6068.578816]  do_syscall_64+0x5b/0xc0
[ 6068.578820]  entry_SYSCALL64_slow_path+0x25/0x25
[ 6068.578821] RIP: 0033:0x7f97f1d1441a
[ 6068.578822] RSP: 002b:00007f97e7ffe510 EFLAGS: 00000246 ORIG_RAX: 0000000000000038
[ 6068.578823] RAX: ffffffffffffffda RBX: 00007f97e7ffe510 RCX: 00007f97f1d1441a
[ 6068.578824] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001200011
[ 6068.578825] RBP: 00007f97e7ffe560 R08: 0000000000000466 R09: 00007f97e7fff700
[ 6068.578825] R10: 00007f97e7fff9d0 R11: 0000000000000246 R12: 0000000000000542
[ 6068.578826] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000005
[ 6068.578827] Code: 25 28 00 00 00 75 70 48 83 c4 38 5b 41 5c 5d c3 48 8d 75 c8 48 89 d1 89 df 4c 89 e2 e8 14 fe ff ff 8b 55 e0 83 e2 01 74 0a f3 90 <8b> 55 e0 83 e2 01 75 f6 eb c3 8b 05 4a 44 13 01 85 c0 75 85 80
I'll close this bug for now.
Updated by Jeff Layton over 6 years ago
- Status changed from New to Can't reproduce