Bug #15694

process enters D state when running on rbd

Added by Benetnash Alkaid almost 8 years ago. Updated about 5 years ago.

Status:
Closed
Priority:
Low
Assignee:
Category:
rbd
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rbd
Crash signature (v1):
Crash signature (v2):

Description

Hi,
Occasionally, a process whose data is located on rbd (formatted with ext4) may enter D state (stuck in uninterruptible sleep).
I can reproduce it by running a minecraft server on rbd. Many other processes such as mysql, apt-get, nginx, etc. have hit this problem. I know it sounds really weird, but the minecraft server is the only method I know of to reproduce it reliably (probably due to its particular I/O workload).
Meanwhile, I found this backtrace in dmesg:

Apr 22 00:49:46 scgyshell-1 kernel: INFO: task java:177716 blocked for more than 120 seconds.
Apr 22 00:49:46 scgyshell-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 00:49:46 scgyshell-1 kernel: java D ffff880111448000 0 177716 169372 7 0x00000004
Apr 22 00:49:46 scgyshell-1 kernel: ffff88010ddd7af0 0000000000000086 ffff880111448000 ffff88010ddd7fd8
Apr 22 00:49:46 scgyshell-1 kernel: ffff88010ddd7fd8 ffff88010ddd7fd8 0000000000000007 ffff8804495d6580
Apr 22 00:49:46 scgyshell-1 kernel: ffff88045fcdec40 0000000000000000 7fffffffffffffff ffffffff8117ae20
Apr 22 00:49:46 scgyshell-1 kernel: Call Trace:
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae20>] ? wait_on_page_read+0x60/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81635d69>] schedule+0x29/0x70
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81633b39>] schedule_timeout+0x239/0x2d0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa0470e5c>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa043fcc9>] ? ext4_da_write_end+0x139/0x2e0 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae20>] ? wait_on_page_read+0x60/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8163544e>] io_schedule_timeout+0xae/0x130
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff816354e8>] io_schedule+0x18/0x1a
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae2e>] sleep_on_page+0xe/0x20
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81633c90>] __wait_on_bit+0x60/0x90
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117abb6>] wait_on_page_bit+0x86/0xb0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff810a8790>] ? wake_atomic_t_function+0x40/0x40
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8118c80b>] truncate_inode_pages_range+0x3bb/0x740
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff810b2c04>] ? __wake_up+0x44/0x50
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa017325a>] ? jbd2_journal_stop+0x1ea/0x3d0 [jbd2]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa0449f94>] ? ext4_unlink+0x304/0x390 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8125ddda>] ? __dquot_initialize+0x3a/0x1c0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8118cc0e>] truncate_inode_pages_final+0x5e/0x90
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa043f40c>] ext4_evict_inode+0x10c/0x520 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff812163d7>] evict+0xa7/0x170
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81216cbb>] iput+0x18b/0x1f0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8120b26e>] do_unlinkat+0x1ae/0x2b0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff811ffa0e>] ? SYSC_newstat+0x3e/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8120c276>] SyS_unlink+0x16/0x20
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81640e49>] system_call_fastpath+0x16/0x1b

In addition, dmesg contains many warnings like:
[Mon May 2 11:00:31 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 11:35:01 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 11:37:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 11:57:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 12:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 12:36:11 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 12:37:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 12:57:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 13:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 13:36:10 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 13:37:16 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 13:57:11 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 14:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 14:37:10 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 15:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 15:37:11 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 16:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 16:37:41 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)

I don't know whether it is related to this problem.

System: CentOS 7
Kernel: 3.10.0-327.3.1.vz7.10.15 (based on 3.10.0-327.el7; the Ceph-related code is unmodified)
Kernel source: https://src.openvz.org/projects/OVZ/repos/vzkernel/browse
Ceph version:
ceph-common-0.94.6-0.el7.x86_64
ceph-0.94.6-0.el7.x86_64

The rbd clients, ceph-osd, and ceph-mon run on the same machines (4 servers, 4 OSDs, 3 mons; the servers are also rbd clients).

I originally reported this bug at https://bugs.openvz.org/browse/OVZ-6734 and then found that rbd was to blame.

I ran the same test set on Debian jessie (kernel 3.16) and it works fine. I guess this might be a known bug that has already been fixed upstream. Since I am limited to the CentOS 7 kernel, how can I locate the bug-fix commit or get a patch?

History

#1 Updated by Kefu Chai almost 8 years ago

  • Project changed from Ceph to rbd

#2 Updated by Jason Dillaman almost 8 years ago

  • Project changed from rbd to Linux kernel client

#3 Updated by Ilya Dryomov almost 8 years ago

"socket closed (con state OPEN)" are harmless, most likely.

Could you check your OSD logs for any "slow request" warnings around Apr 22 00:49:46?

Could you try to reproduce this stall with minecraft on a newer upstream kernel, like 4.5 or 4.6?

#4 Updated by Ilya Dryomov almost 8 years ago

  • Category set to rbd
  • Status changed from New to Need More Info
  • Priority changed from High to Low

#5 Updated by Benetnash Alkaid almost 8 years ago

I am very sorry for the late reply. Thank you for paying attention to my issue.

Ilya Dryomov wrote:

"socket closed (con state OPEN)" are harmless, most likely.

Could you check your OSD logs for any "slow request" warnings around Apr 22 00:49:46?

The OSD logs around Apr 22 seem to be incomplete. I tried to reproduce it again, and this time the umount process entered D state.

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      719426  0.0  0.0      0     0 ?        D    13:14   0:00 [kworker/u16:1]  
root      719950  0.0  0.0 125620  1188 pts/0    D+   13:44   0:00 umount /mnt/freeshell/pool/

No "slow request" is found in OSD logs above.

Could you try to reproduce this stall with minecraft on a newer upstream kernel, like 4.5 or 4.6?

I tested on jessie (3.16) and on my desktop (Arch Linux, kernel 4.6) by mapping an rbd image that contains a minecraft server. I can't reproduce the same stall.

#6 Updated by Ilya Dryomov over 7 years ago

  • Assignee set to Ilya Dryomov

If you can't upgrade to one of the newer kernels and can build your own, try cherry-picking commits

5a60e87603c4c533492c515b7f62578189b03c9c "rbd: use GFP_NOIO in rbd_obj_request_create()" 
2224d879c7c0f85c14183ef82eb48bd875ceb599 "rbd: use GFP_NOIO consistently for request allocations"

from upstream.
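
The common idea behind both commits is that memory allocated on the rbd I/O submission path must use GFP_NOIO rather than GFP_KERNEL, so that direct reclaim triggered by the allocation cannot recurse into writeback against the same rbd device and deadlock. Below is a simplified, hypothetical sketch of the pattern (not the exact upstream diff; the function and cache names are from memory of the 3.10-era driver and are for illustration only):

static struct rbd_obj_request *
rbd_obj_request_create(const char *object_name, u64 offset, u64 length)
{
	struct rbd_obj_request *obj_request;

	/*
	 * Sketch only.  GFP_KERNEL here can enter direct reclaim, which may
	 * try to write dirty pages back to this very rbd device and block
	 * forever; that is the D-state hang shown in the backtrace above.
	 * GFP_NOIO forbids the allocator from starting any I/O while
	 * reclaiming.
	 */
	obj_request = kmem_cache_zalloc(rbd_obj_request_cache, GFP_NOIO);
	if (!obj_request)
		return NULL;

	/* remaining initialization unchanged */
	return obj_request;
}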

#7 Updated by Ilya Dryomov about 5 years ago

  • Status changed from Need More Info to Closed
