Bug #50

osd timeout reset leaves some ops hanging

Added by Sage Weil about 9 years ago. Updated over 8 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 04/19/2010
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:

Description

I see an osd reset:

[ 7769.416465] ceph: tid 65961 timed out on osd4, will reset osd

followed by stray replies:

[ 7777.377747] ceph: get_reply unknown tid 65796 from osd4
[ 7777.383457] ceph: get_reply unknown tid 65802 from osd4
[ 7777.389138] ceph: get_reply unknown tid 65842 from osd4
[ 7777.394682] ceph: get_reply unknown tid 65845 from osd4
[ 7777.400288] ceph: get_reply unknown tid 65863 from osd4
[ 7779.513523] ceph: get_reply unknown tid 65864 from osd4
[ 7784.539006] ceph: get_reply unknown tid 65882 from osd4
[ 7786.043971] ceph: get_reply unknown tid 65886 from osd4
[ 7786.055170] ceph: get_reply unknown tid 65888 from osd4
[ 7786.068336] ceph: get_reply unknown tid 65890 from osd4
[ 7786.521783] ceph: get_reply unknown tid 65894 from osd4
[ 7787.741176] ceph: get_reply unknown tid 65896 from osd4
[ 7787.813535] ceph: get_reply unknown tid 65902 from osd4
[ 7787.819263] ceph: get_reply unknown tid 65922 from osd4
[ 7787.824780] ceph: get_reply unknown tid 65942 from osd4
[ 7789.564113] ceph: get_reply unknown tid 65946 from osd4
[ 7794.589502] ceph: get_reply unknown tid 65953 from osd4
[ 7799.616798] ceph: get_reply unknown tid 65897 from osd4

and a hung app. The pending (hung) osd ops are:

ceph4:~# cat /sys/kernel/debug/ceph/*/osdc
65479 osd4 0.c0ca 1000017c55c.0000081e write
65583 osd4 0.1448 1000017c55c.00000886 write
65618 osd4 0.54f3 1000017c55c.000008a9 write
65639 osd4 0.fef3 1000017c55c.000008be write
65695 osd4 0.2048 1000017c55c.000008f6 write
65713 osd4 0.5d58 1000017c55c.00000908 write
65812 osd4 0.a2f3 1000017c55c.0000096b write
65820 osd4 0.98a4 1000017c55c.00000973 write
65836 osd4 0.8358 1000017c55c.00000983 write
65930 osd4 0.bde9 1000017c55c.000009e1 write
65931 osd4 0.f85e 1000017c55c.000009e2 write
65936 osd4 0.e05e 1000017c55c.000009e7 write

and the hung task warning for the app:

[ 7966.265301] INFO: task iozone:1984 blocked for more than 120 seconds.
[ 7966.271976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7966.279936] iozone D 0000000000000001 0 1984 1979 0x00000000
[ 7966.286951] ffff88011e249d08 0000000000000046 ffff88011e249c68 ffff88011e249fd8
[ 7966.294563] ffff88011c9aec80 0000000000004000 0000000000004000 00000000001d1f80
[ 7966.302164] 00000000001d1f80 ffff88011e249fd8 ffff88011e249fd8 00000000001d1f80
[ 7966.309756] Call Trace:
[ 7966.312248] [<ffffffff81058061>] ? trace_hardirqs_on_caller+0x113/0x13e
[ 7966.319051] [<ffffffff81058099>] ? trace_hardirqs_on+0xd/0xf
[ 7966.324900] [<ffffffff81425595>] io_schedule+0x38/0x4d
[ 7966.330212] [<ffffffff8107dcb2>] sync_page+0x4c/0x50
[ 7966.335341] [<ffffffff81425a9b>] __wait_on_bit+0x45/0x76
[ 7966.340827] [<ffffffff8107dc66>] ? sync_page+0x0/0x50
[ 7966.346061] [<ffffffff8107de5f>] wait_on_page_bit+0x6f/0x76
[ 7966.351809] [<ffffffff8104a8e8>] ? wake_bit_function+0x0/0x2a
[ 7966.357735] [<ffffffff8108644b>] ? pagevec_lookup_tag+0x22/0x2b
[ 7966.363830] [<ffffffff8107e333>] filemap_fdatawait_range+0x7c/0x179
[ 7966.370279] [<ffffffff8107e504>] filemap_write_and_wait_range+0x41/0x54
[ 7966.377083] [<ffffffff810ce505>] vfs_fsync_range+0x5a/0xa6
[ 7966.382739] [<ffffffff810ce5be>] vfs_fsync+0x18/0x1a
[ 7966.387876] [<ffffffff810ce5f2>] do_fsync+0x32/0x48
[ 7966.392920] [<ffffffff810ce625>] sys_fsync+0xb/0xf
[ 7966.397879] [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
[ 7966.403980] no locks held by iozone/1984.
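
To make it easier to compare the stray replies against the ops still listed in osdc, here is a minimal Python sketch. It is not part of the fix; it only assumes the exact line shapes quoted in this report, and the function names (parse_dmesg, parse_osdc, summarize) are made up for illustration. For the excerpts quoted above, none of the "unknown tid" replies match a tid that is still pending.

```python
import re

# Patterns assume the exact kernel log and osdc line shapes quoted in this report.
TIMEOUT_RE = re.compile(r"ceph: tid (\d+) timed out on osd(\d+)")
STRAY_RE = re.compile(r"ceph: get_reply unknown tid (\d+) from osd(\d+)")
# osdc lines look like: <tid> osd<N> <pgid> <object> <op>
OSDC_RE = re.compile(r"^\s*(\d+)\s+osd(\d+)\s", re.MULTILINE)


def parse_dmesg(text):
    """Return (timed_out, stray) as sets of (tid, osd) pairs found in dmesg text."""
    timed_out = {(int(t), int(o)) for t, o in TIMEOUT_RE.findall(text)}
    stray = {(int(t), int(o)) for t, o in STRAY_RE.findall(text)}
    return timed_out, stray


def parse_osdc(text):
    """Return {tid: osd} for every pending op line in an osdc debugfs dump."""
    return {int(t): int(o) for t, o in OSDC_RE.findall(text)}


def summarize(dmesg_text, osdc_text):
    """Cross-reference stray replies with ops that are still pending."""
    timed_out, stray = parse_dmesg(dmesg_text)
    pending = parse_osdc(osdc_text)
    stray_tids = {tid for tid, _ in stray}
    return {
        "timed_out": sorted(tid for tid, _ in timed_out),
        "pending": sorted(pending),
        "stray": sorted(stray_tids),
        # tids that are still pending but whose reply was dropped as unknown
        "pending_and_stray": sorted(stray_tids & set(pending)),
    }
```

Feed it the saved dmesg output and the contents of /sys/kernel/debug/ceph/*/osdc as strings; a non-empty "pending_and_stray" list would point at replies that arrived for ops the client still considers outstanding.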

History

#1 Updated by Sage Weil about 9 years ago

Finally found this; fixed by commit:77eb74b92fee7340d104b24a9ee2800196b0f140

#2 Updated by Sage Weil about 9 years ago

  • Status changed from New to Testing

#3 Updated by Sage Weil about 9 years ago

  • Status changed from Testing to Resolved
