Bug #8818

closed

IO Hang on raw rbd device - Workqueue: ceph-msgr con_work [libceph]

Added by Greg Wilson almost 10 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

While running dd read and write tests against a raw rbd device for baseline performance measurements, we hit an IO hang. The dd never finishes, and the hang can only be cleared by a reboot. We started testing with firefly on Ubuntu with a 3.15 kernel and have since upgraded to 3.16 in case changes there affected this problem:

  # cat /proc/version
    Linux version 3.16.0-031600rc4-generic (apw@gomeisa) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201407061635 SMP Sun Jul 6 20:36:26 UTC 2014
  # ceph -v
    ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)

The script we are running does a dd write and then a dd read at various block sizes. It hangs consistently, but never on the same dd command or block size, and it has hung on both dd writes and dd reads. When the hang occurs, the following messages appear in the system log:

Jul 11 16:35:37 ks2-p1 kernel: [ 4325.515421] INFO: task kworker/6:1:2739 blocked for more than 120 seconds.
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.523177] Not tainted 3.16.0-031600rc4-generic #201407061635
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.530193] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538849] kworker/6:1 D 0000000000000006 0 2739 2 0x00000000
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538856] Workqueue: ceph-msgr con_work [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538857] ffff880bf8573b58 0000000000000002 ffff880bf8573c08 ffff880bf8573fd8
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538858] 0000000000014440 0000000000014440 ffff8817f96e64c0 ffff880bf9061930
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538860] 0000000000000000 ffff880bfc7567d0 ffff880bfc7567d4 ffff880bf9061930
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538862] Call Trace:
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538864] [<ffffffff8179d2b9>] schedule+0x29/0x70
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538866] [<ffffffff8179d5de>] schedule_preempt_disabled+0xe/0x10
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538868] [<ffffffff8179f3f5>] __mutex_lock_slowpath+0xd5/0x1c0
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538870] [<ffffffff8179f503>] mutex_lock+0x23/0x37
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538875] [<ffffffffc061808d>] get_reply.isra.30+0x3d/0x240 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538886] [<ffffffffc0618322>] alloc_msg+0x92/0xa0 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538894] [<ffffffffc0610d01>] ceph_con_in_msg_alloc+0x71/0x1e0 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538898] [<ffffffffc0611080>] read_partial_message+0x210/0x4e0 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538900] [<ffffffff8166c926>] ? kernel_recvmsg+0x46/0x60
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538904] [<ffffffffc060cfd8>] ? ceph_tcp_recvmsg+0x48/0x60 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538920] [<ffffffffc0611608>] try_read+0x2b8/0x5a0 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538928] [<ffffffffc0611c21>] con_work+0x91/0x290 [libceph]
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538931] [<ffffffff8108e6ff>] process_one_work+0x17f/0x4c0
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538933] [<ffffffff8108f46b>] worker_thread+0x11b/0x3f0
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538935] [<ffffffff8108f350>] ? create_and_start_worker+0x80/0x80
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538937] [<ffffffff81096479>] kthread+0xc9/0xe0
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538940] [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538942] [<ffffffff817a13fc>] ret_from_fork+0x7c/0xb0
Jul 11 16:35:37 ks2-p1 kernel: [ 4325.538944] [<ffffffff810963b0>] ? flush_kthread_worker+0xb0/0xb0
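
For anyone comparing logs against this report, a small helper along these lines can pull the hung-task reports out of a kernel log. This is only a sketch: the function name hung_tasks is made up for illustration, and the pattern assumes the "INFO: task NAME:PID blocked for more than N seconds." wording shown in the excerpt above.

```shell
#!/bin/sh
# Sketch: list tasks flagged by the kernel hung-task detector.
# Reads kernel log text on stdin and prints "task_name pid" per report.
# hung_tasks is a hypothetical helper name, not part of any Ceph tooling.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds\..*/\1 \2/p'
}
```

It could be fed from a saved log (for example `hung_tasks < kernel.log`) or from `dmesg | hung_tasks`; the log file name here is an assumption.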

I have attached a sample dd script and a pared-down version of the kernel.log.
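
The attached script is not reproduced here. As a rough sketch of the kind of loop described above (a dd write followed by a dd read at each block size), something like the following could be used. The device path, the total size per pass, the block-size list, and the helper names blocks_for and run are all assumptions for illustration, not values taken from 10GB_raw_dd.sh. It defaults to a dry run that only prints the dd commands, because a real run overwrites the target device.

```shell
#!/bin/sh
# Sketch only: DEV, TOTAL_BYTES and the block-size list are assumptions,
# not values taken from the attached 10GB_raw_dd.sh.
DEV="${DEV:-/dev/rbd0}"                    # hypothetical mapped rbd device
TOTAL_BYTES="${TOTAL_BYTES:-10737418240}"  # 10 GiB per pass (assumed)
DRY_RUN="${DRY_RUN:-1}"                    # 1 = only print the dd commands

# Number of whole blocks of size $2 that fit in $1 bytes.
blocks_for() {
    echo $(( $1 / $2 ))
}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

for bs in 4096 65536 1048576 4194304; do
    count=$(blocks_for "$TOTAL_BYTES" "$bs")
    # Write pass, then read pass, at the same block size.
    run dd if=/dev/zero of="$DEV" bs="$bs" count="$count"
    run dd if="$DEV" of=/dev/null bs="$bs" count="$count"
done
```

Setting DRY_RUN=0 runs the dd commands for real and destroys any data on the device named by DEV.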


Files

bug_10jul2014.txt.gz (10.9 KB) Greg Wilson, 07/11/2014 09:22 PM
10GB_raw_dd.sh (2.6 KB) Greg Wilson, 07/11/2014 09:22 PM
dump.txt (428 KB) Xavier Trilla, 07/16/2014 11:52 AM
kern-log.tar.gz (115 KB) Greg Wilson, 07/16/2014 04:34 PM
configs.tar.gz (123 KB) Greg Wilson, 07/25/2014 07:52 AM

Related issues 1 (0 open, 1 closed)

Has duplicate Linux kernel client - Bug #8464: krbd: deadlock (Resolved, Ilya Dryomov, 05/29/2014)
