Bug #15694
process enters D state when running on rbd
Description
Hi,
Occasionally, a process whose data is located on rbd (formatted with ext4) may enter D state (uninterruptible sleep) and get stuck.
I can reproduce it by running a Minecraft server on rbd. Many other processes, such as mysql, apt-get, and nginx, have hit the same problem. I know it sounds really weird, but a Minecraft server is the only method I know of that reproduces it reliably (possibly due to its particular I/O workload).
Meanwhile, I found this backtrace in dmesg:
Apr 22 00:49:46 scgyshell-1 kernel: INFO: task java:177716 blocked for more than 120 seconds.
Apr 22 00:49:46 scgyshell-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 00:49:46 scgyshell-1 kernel: java D ffff880111448000 0 177716 169372 7 0x00000004
Apr 22 00:49:46 scgyshell-1 kernel: ffff88010ddd7af0 0000000000000086 ffff880111448000 ffff88010ddd7fd8
Apr 22 00:49:46 scgyshell-1 kernel: ffff88010ddd7fd8 ffff88010ddd7fd8 0000000000000007 ffff8804495d6580
Apr 22 00:49:46 scgyshell-1 kernel: ffff88045fcdec40 0000000000000000 7fffffffffffffff ffffffff8117ae20
Apr 22 00:49:46 scgyshell-1 kernel: Call Trace:
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae20>] ? wait_on_page_read+0x60/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81635d69>] schedule+0x29/0x70
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81633b39>] schedule_timeout+0x239/0x2d0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa0470e5c>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa043fcc9>] ? ext4_da_write_end+0x139/0x2e0 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae20>] ? wait_on_page_read+0x60/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8163544e>] io_schedule_timeout+0xae/0x130
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff816354e8>] io_schedule+0x18/0x1a
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae2e>] sleep_on_page+0xe/0x20
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81633c90>] __wait_on_bit+0x60/0x90
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117abb6>] wait_on_page_bit+0x86/0xb0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff810a8790>] ? wake_atomic_t_function+0x40/0x40
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8118c80b>] truncate_inode_pages_range+0x3bb/0x740
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff810b2c04>] ? __wake_up+0x44/0x50
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa017325a>] ? jbd2_journal_stop+0x1ea/0x3d0 [jbd2]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa0449f94>] ? ext4_unlink+0x304/0x390 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8125ddda>] ? __dquot_initialize+0x3a/0x1c0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8118cc0e>] truncate_inode_pages_final+0x5e/0x90
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa043f40c>] ext4_evict_inode+0x10c/0x520 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff812163d7>] evict+0xa7/0x170
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81216cbb>] iput+0x18b/0x1f0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8120b26e>] do_unlinkat+0x1ae/0x2b0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff811ffa0e>] ? SYSC_newstat+0x3e/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8120c276>] SyS_unlink+0x16/0x20
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81640e49>] system_call_fastpath+0x16/0x1b
Besides, there are many warnings in dmesg like:
[Mon May 2 11:00:31 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 11:35:01 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 11:37:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 11:57:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 12:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 12:36:11 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 12:37:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 12:57:12 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 13:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 13:36:10 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 13:37:16 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 13:57:11 2016] libceph: osd0 114.214.197.1:6800 socket closed (con state OPEN)
[Mon May 2 14:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 14:37:10 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 15:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 15:37:11 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 16:00:30 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
[Mon May 2 16:37:41 2016] libceph: osd3 114.214.197.4:6800 socket closed (con state OPEN)
I don't know whether these are related to this problem.
System: CentOS 7
Kernel: 3.10.0-327.3.1.vz7.10.15 (based on 3.10.0-327.el7, with no changes to the Ceph code)
The kernel source is here: https://src.openvz.org/projects/OVZ/repos/vzkernel/browse
Ceph version:
ceph-common-0.94.6-0.el7.x86_64
ceph-0.94.6-0.el7.x86_64
The rbd clients, ceph-osd, and ceph-mon all run the same versions (4 servers with 4 OSDs and 3 mons; the servers are also the rbd clients).
I originally reported this bug at https://bugs.openvz.org/browse/OVZ-6734 and then found that rbd is the one to blame.
I ran the same test set on Debian jessie (kernel 3.16) and it works fine. I guess this might be a known bug that has already been fixed upstream. Since I am limited to the CentOS 7 kernel, how can I locate the fix commit or get a patch?
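One way to narrow down a candidate fix is to scan the upstream history of the rbd/libceph code between the broken and working kernel versions. A sketch, assuming a local clone of the upstream kernel tree (tag names are the real upstream release tags; the path list is illustrative):

```shell
# List rbd and libceph changes between the 3.10 base and the known-good 3.16;
# the fix is likely among these commits.
git log --oneline v3.10..v3.16 -- drivers/block/rbd.c net/ceph/
```

Commit subjects mentioning deadlocks, allocation flags, or hangs are the first candidates to try backporting.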
History
#1 Updated by Kefu Chai almost 8 years ago
- Project changed from Ceph to rbd
#2 Updated by Jason Dillaman almost 8 years ago
- Project changed from rbd to Linux kernel client
#3 Updated by Ilya Dryomov almost 8 years ago
"socket closed (con state OPEN)" are harmless, most likely.
Could you check your OSD logs for any "slow request" warnings around Apr 22 00:49:46?
Could you try to reproduce this stall with minecraft on a newer upstream kernel, like 4.5 or 4.6?
#4 Updated by Ilya Dryomov almost 8 years ago
- Category set to rbd
- Status changed from New to Need More Info
- Priority changed from High to Low
#5 Updated by Benetnash Alkaid almost 8 years ago
I am very sorry for the late reply. Thank you for paying attention to my issue.
Ilya Dryomov wrote:
"socket closed (con state OPEN)" are harmless, most likely.
Could you check your OSD logs for any "slow request" warnings around Apr 22 00:49:46?
The OSD logs from around Apr 22 seem to be incomplete. I tried to reproduce it again, and this time the umount process entered D state.
USER       PID %CPU %MEM    VSZ  RSS TTY   STAT START TIME COMMAND
root    719426  0.0  0.0      0    0 ?     D    13:14 0:00 [kworker/u16:1]
root    719950  0.0  0.0 125620 1188 pts/0 D+   13:44 0:00 umount /mnt/freeshell/pool/
No "slow request" warnings are found in the OSD logs for the period above.
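In case it helps for future occurrences, here is a sketch of ways to gather more detail on a D-state task (needs root; the PID below is just the stuck umount from the ps output above):

```shell
PID=719950                     # substitute the stuck task's PID
cat /proc/$PID/stack           # kernel-side stack of the blocked task
echo w > /proc/sysrq-trigger   # ask the kernel to log all blocked (D-state) tasks
dmesg | tail -n 50             # read the resulting report
```

The sysrq 'w' dump in particular shows the same kind of backtrace as the hung-task warning, but on demand rather than after the 120-second timeout.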
Could you try to reproduce this stall with minecraft on a newer upstream kernel, like 4.5 or 4.6?
I tested on jessie (3.16) and on my desktop (Arch Linux, kernel 4.6) by mapping an rbd image that contains a Minecraft server. I can't reproduce the same stall.
#6 Updated by Ilya Dryomov over 7 years ago
- Assignee set to Ilya Dryomov
If you can't upgrade to one of the newer kernels and can build your own, try cherry-picking commits
5a60e87603c4c533492c515b7f62578189b03c9c "rbd: use GFP_NOIO in rbd_obj_request_create()"
2224d879c7c0f85c14183ef82eb48bd875ceb599 "rbd: use GFP_NOIO consistently for request allocations"
from upstream.
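A sketch of how the backport might look, assuming the vzkernel source tree with the upstream kernel added as a remote (the directory and remote names are illustrative):

```shell
cd vzkernel
git remote add upstream https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git fetch upstream master
# rbd: use GFP_NOIO in rbd_obj_request_create()
git cherry-pick 5a60e87603c4c533492c515b7f62578189b03c9c
# rbd: use GFP_NOIO consistently for request allocations
git cherry-pick 2224d879c7c0f85c14183ef82eb48bd875ceb599
```

Both commits switch allocations in the rbd I/O path from GFP_KERNEL to GFP_NOIO, which fits the reported hang: a GFP_KERNEL allocation on the writeback path can recurse into memory reclaim, which in turn issues more I/O to the same stalled rbd device.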
#7 Updated by Ilya Dryomov about 5 years ago
- Status changed from Need More Info to Closed