Project

General

Profile

Bug #48617

task fsstress:62206 blocked for more than 120 seconds.

Added by Patrick Donnelly about 1 month ago. Updated about 1 month ago.

Status:
New
Priority:
Urgent
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Crash signature:

Description

See: /ceph/teuthology-archive/pdonnell-2020-12-14_23:22:08-fs-wip-pdonnell-testing-20201214.211711-distro-basic-smithi/5708743

dmesg attached. RHEL 8 stock kernel.

dmesg.log.gz (24.5 KB) Patrick Donnelly, 12/16/2020 03:11 AM

History

#1 Updated by Jeff Layton about 1 month ago

All of the stack traces are stuck doing a sync().

A lot of connectivity problems in this log, before during and after the stuck sync warnings:

[  687.046134] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  688.391634] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  689.288276] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  690.622717] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  690.927763] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  692.228218] libceph: osd7 weight 0x0 (out)
[  695.081607] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  696.447083] libceph: osd4 (1)172.21.15.131:6802 socket closed (con state OPEN)
[  696.449527] libceph: osd5 (1)172.21.15.131:6817 socket error on write
[  696.560801] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  697.523859] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  697.608880] libceph: osd0 weight 0x10000 (in)
[  698.649820] libceph: osd1 weight 0x10000 (in)
[  699.052360] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  700.234082] libceph: osd2 weight 0x10000 (in)
[  700.668132] libceph: osd3 weight 0x10000 (in)
[  700.787121] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  701.215661] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  702.212773] libceph: osd4 weight 0x10000 (in)
[  702.579148] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  702.854344] libceph: osd5 weight 0x10000 (in)
[  703.478728] libceph: mds0 (1)172.21.15.192:6835 socket error on write
[  861.638318] INFO: task fsstress:62206 blocked for more than 120 seconds.
[  861.645288]       Not tainted 4.18.0-240.1.1.el8_3.x86_64 #1
[  861.651202] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  861.659305] fsstress        D    0 62206  62205 0x00004080
[  861.665053] Call Trace:
[  861.667787]  __schedule+0x2a6/0x700
[  861.671577]  ? __switch_to_asm+0x41/0x70
[  861.675784]  ? __switch_to_asm+0x41/0x70
[  861.679993]  schedule+0x38/0xa0
[  861.683456]  schedule_timeout+0x246/0x2f0
[  861.687764]  ? __switch_to_asm+0x35/0x70
[  861.691987]  ? __switch_to_asm+0x41/0x70
[  861.696190]  ? __switch_to_asm+0x35/0x70
[  861.700440]  ? __switch_to_asm+0x41/0x70
[  861.704663]  ? __switch_to_asm+0x35/0x70
[  861.708882]  ? _cond_resched+0x15/0x30
[  861.712909]  ? __mutex_lock.isra.5+0x474/0x4a0
[  861.717659]  ? __switch_to+0x101/0x3f0
[  861.721715]  ? _cond_resched+0x15/0x30
[  861.725946]  wait_for_completion+0x11f/0x190
[  861.730563]  ? wake_up_q+0x80/0x80
[  861.734297]  ceph_mdsc_sync+0x173/0x350 [ceph]
[  861.739042]  ? _cond_resched+0x15/0x30
[  861.743106]  ? mutex_lock+0xe/0x30
[  861.746823]  ? __ia32_sys_fdatasync+0x20/0x20
[  861.751508]  ceph_sync_fs+0x2f/0xb0 [ceph]
[  861.755902]  iterate_supers+0x98/0x100
[  861.759952]  ksys_sync+0x60/0xb0
[  861.763548]  __ia32_sys_sync+0xa/0x10
[  861.767589]  do_syscall_64+0x5b/0x1a0
[  861.771584]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  861.776959] RIP: 0033:0x7fc99bb3d75b
[  861.780832] Code: Bad RIP value.
[  861.784383] RSP: 002b:00007ffe2c147cc8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a2
[  861.792281] RAX: ffffffffffffffda RBX: 0000000000405c5a RCX: 00007fc99bb3d75b
[  861.799745] RDX: 0000000028bb447d RSI: 0000000028bb447d RDI: 00000000000000d9
[  861.807195] RBP: 00007ffe2c147ce0 R08: 00007fc99be0624c R09: 00007fc99be06260
[  861.814686] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000401760
[  861.822157] R13: 00007ffe2c148020 R14: 0000000000000000 R15: 0000000000000000

My guess is that this is just happening because we're stuck trying to write back data or metadata, and we aren't able to talk to the server well. It might be interesting to look at the daemon logs and see if they note why the sockets were being closed so aggressively.

Another thing that stands out here is that the wait_for_completion is done unconditionally, and we probably only ought to wait when the VFS asks for sync_fs to wait. Still, that's probably not the source of this problem.

#2 Updated by Jeff Layton about 1 month ago

This may be a duplicate of #48618. If there are connectivity problems when you go to sync, then stalls like this are likely.

#3 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

This may be a duplicate of #48618. If there are connectivity problems when you go to sync, then stalls like this are likely.

#48618 was a testing kernel issue so I don't think so.

We'll see if this continues to crop up...

#4 Updated by Jeff Layton about 1 month ago

Ok. Why are all of the peers closing the sockets? Is this some sort of stress test?

[  699.052360] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  700.234082] libceph: osd2 weight 0x10000 (in)
[  700.668132] libceph: osd3 weight 0x10000 (in)
[  700.787121] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  701.215661] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  702.212773] libceph: osd4 weight 0x10000 (in)
[  702.579148] libceph: mds0 (1)172.21.15.192:6835 socket closed (con state OPEN)
[  702.854344] libceph: osd5 weight 0x10000 (in)
[  703.478728] libceph: mds0 (1)172.21.15.192:6835 socket error on write

sync/syncfs will just hang if there is dirty data that can't be flushed.

#5 Updated by Patrick Donnelly about 1 month ago

Jeff Layton wrote:

Ok. Why are all of the peers closing the sockets? Is this some sort of stress test?

[...]

sync/syncfs will just hang if there is dirty data that can't be flushed.

Probably because of the msgr-failures/osd-mds-delay yaml which is causing messenger failures deliberately.

#6 Updated by Jeff Layton about 1 month ago

In that case, I'm not sure what we can do. The WARNING is unfortunate, but it's sort of beyond our control given the writeback delays.

#7 Updated by Jeff Layton about 1 month ago

I wonder if this may be helped by this patch by Ilya. The reconnection delay could just be getting to be very high.

https://github.com/ceph/ceph-client/pull/22/commits/a2446715bdeb227f8c49df7fd6955bc39a278965

This will probably make RHEL 8.4, if so.

Also available in: Atom PDF