Bug #13587 (open)

IO hang when writing to a cached erasure pool

Added by Jay Ring over 8 years ago. Updated over 8 years ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I am testing erasure pools for performance and I am getting a hang similar to the one in Bug #8818.

The process never completes, and the cluster health goes to "requests are blocked > 32 sec". Only a hard reboot clears the problem.

This happens when writing to an erasure pool (k=2, m=1) that has a cache tier attached.
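
The report does not include the exact setup commands. A setup along these lines matches the topology described; the profile and pool names (ec21, ecpool, cachepool) and PG counts are placeholders, not values taken from this report:

# erasure profile with k=2, m=1 as described above
ceph osd erasure-code-profile set ec21 k=2 m=1
ceph osd pool create ecpool 128 128 erasure ec21
# replicated pool used as a writeback cache tier in front of the EC pool
ceph osd pool create cachepool 128 128
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool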

The dd command works for small transfers (< 500M). Somewhere around 1G the bug appears; that may be related to the kernel's own page caching.
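
The exact dd invocation is not given; the blkdev_put/filemap_write_and_wait frames in the trace below suggest dd was writing straight to a kernel-mapped rbd block device. Presumably something along these lines, where the device path and counts are illustrative rather than from the report:

# completes (well under the ~1G threshold)
dd if=/dev/zero of=/dev/rbd0 bs=1M count=400
# hangs: dd sits in the D state flushing dirty pages when the device is closed
dd if=/dev/zero of=/dev/rbd0 bs=1M count=1024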

I can reproduce this 100% of the time.

cat /proc/version
Linux version 3.19.0-30-generic (buildd@lgw01-13) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #34~14.04.1-Ubuntu SMP Fri Oct 2 22:09:39 UTC 2015

ceph -v
ceph version 0.87.2 (87a7cec9ab11c677de2ab23a7668a77d2f5b955e)

Oct 23 13:52:43 mca-h3 kernel: [14040.688974] INFO: task dd:18142 blocked for more than 120 seconds.
Oct 23 13:52:43 mca-h3 kernel: [14040.689577] Tainted: G W 3.19.0-30-generic #34~14.04.1-Ubuntu
Oct 23 13:52:43 mca-h3 kernel: [14040.690225] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 23 13:52:43 mca-h3 kernel: [14040.690895] dd D ffff8800c6f8fc18 0 18142 11875 0x00000004
Oct 23 13:52:43 mca-h3 kernel: [14040.690897] ffff8800c6f8fc18 ffff88040cc344b0 0000000000013e80 ffff8800c6f8ffd8
Oct 23 13:52:43 mca-h3 kernel: [14040.690899] 0000000000013e80 ffff88040d1dd850 ffff88040cc344b0 ffff8800c6f8fcc0
Oct 23 13:52:43 mca-h3 kernel: [14040.690901] ffff88041fad4778 ffff8800c6f8fcc0 ffff88041fdb2ce8 0000000000000002
Oct 23 13:52:43 mca-h3 kernel: [14040.690902] Call Trace:
Oct 23 13:52:43 mca-h3 kernel: [14040.690908] [<ffffffff817b35d0>] ? bit_wait+0x50/0x50
Oct 23 13:52:43 mca-h3 kernel: [14040.690910] [<ffffffff817b2da0>] io_schedule+0xa0/0x130
Oct 23 13:52:43 mca-h3 kernel: [14040.690911] [<ffffffff817b35fc>] bit_wait_io+0x2c/0x50
Oct 23 13:52:43 mca-h3 kernel: [14040.690913] [<ffffffff817b3235>] wait_on_bit+0x65/0x90
Oct 23 13:52:43 mca-h3 kernel: [14040.690915] [<ffffffff8117667d>] ? find_get_pages_tag+0xcd/0x170
Oct 23 13:52:43 mca-h3 kernel: [14040.690917] [<ffffffff81175657>] wait_on_page_bit+0xc7/0xd0
Oct 23 13:52:43 mca-h3 kernel: [14040.690920] [<ffffffff810b4e70>] ? autoremove_wake_function+0x40/0x40
Oct 23 13:52:43 mca-h3 kernel: [14040.690921] [<ffffffff81175759>] filemap_fdatawait_range+0xf9/0x190
Oct 23 13:52:43 mca-h3 kernel: [14040.690923] [<ffffffff81175817>] filemap_fdatawait+0x27/0x30
Oct 23 13:52:43 mca-h3 kernel: [14040.690924] [<ffffffff8117760b>] filemap_write_and_wait+0x3b/0x60
Oct 23 13:52:43 mca-h3 kernel: [14040.690927] [<ffffffff812247bf>] __sync_blockdev+0x1f/0x40
Oct 23 13:52:43 mca-h3 kernel: [14040.690929] [<ffffffff81224b1c>] __blkdev_put+0x5c/0x1a0
Oct 23 13:52:43 mca-h3 kernel: [14040.690930] [<ffffffff8122558e>] blkdev_put+0x4e/0x140
Oct 23 13:52:43 mca-h3 kernel: [14040.690932] [<ffffffff81225735>] blkdev_close+0x25/0x30
Oct 23 13:52:43 mca-h3 kernel: [14040.690934] [<ffffffff811edf77>] __fput+0xe7/0x220
Oct 23 13:52:43 mca-h3 kernel: [14040.690936] [<ffffffff811ee0fe>] ____fput+0xe/0x10
Oct 23 13:52:43 mca-h3 kernel: [14040.690938] [<ffffffff81091de7>] task_work_run+0xb7/0xf0
Oct 23 13:52:43 mca-h3 kernel: [14040.690942] [<ffffffff81015007>] do_notify_resume+0x97/0xb0
Oct 23 13:52:43 mca-h3 kernel: [14040.690943] [<ffffffff817b70ef>] int_signal+0x12/0x17

#1

Updated by Josh Durgin over 8 years ago

Does the I/O eventually complete? What does 'ceph -s' say? This could be caused by promotion/demotion thrashing in the cache tier slowing things down. I'd recommend upgrading to hammer in any case, since giant is EOL (http://docs.ceph.com/docs/master/releases/#timeline).
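
For anyone checking for the promotion/demotion thrashing mentioned above, a minimal set of commands to inspect cluster state and the cache-tier sizing knobs (cachepool is a placeholder pool name, not taken from this report):

ceph -s
ceph health detail    # shows which OSDs have the blocked requests
ceph df               # shows how full the cache pool is
ceph osd pool get cachepool target_max_bytes
ceph osd pool get cachepool cache_target_dirty_ratio
ceph osd pool get cachepool cache_target_full_ratio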
