Bug #50826
Status: Closed
kceph: stock RHEL kernel hangs on snaptests with mon|osd thrashers
Description
/ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115757/teuthology.log
and
/ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115769/teuthology.log
I see:
[ 2459.191432] INFO: task git:214273 blocked for more than 120 seconds.
[ 2459.198266]       Not tainted 4.18.0-240.1.1.el8_3.x86_64 #1
[ 2459.204471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2459.212884] git             D    0 214273 136000 0x00000080
[ 2459.218901] Call Trace:
[ 2459.221802]  __schedule+0x2a6/0x700
[ 2459.225931]  ? __dentry_kill+0x121/0x170
[ 2459.230363]  schedule+0x38/0xa0
[ 2459.233922]  io_schedule+0x12/0x40
[ 2459.237741]  __lock_page+0x141/0x240
[ 2459.241724]  ? file_check_and_advance_wb_err+0xd0/0xd0
[ 2459.247309]  pagecache_get_page+0x19a/0x2d0
[ 2459.252075]  grab_cache_page_write_begin+0x1f/0x40
[ 2459.257309]  ceph_write_begin+0x40/0x130 [ceph]
[ 2459.262265]  generic_perform_write+0xf4/0x1b0
[ 2459.267044]  ? file_update_time+0xed/0x130
[ 2459.271559]  ceph_write_iter+0xa75/0xc90 [ceph]
[ 2459.276495]  ? atime_needs_update+0x77/0xe0
[ 2459.281076]  ? touch_atime+0x33/0xe0
[ 2459.285318]  ? _copy_to_user+0x26/0x30
[ 2459.289542]  ? cp_new_stat+0x150/0x180
[ 2459.293712]  ? new_sync_write+0x124/0x170
[ 2459.298131]  ? ceph_fallocate+0x5f0/0x5f0 [ceph]
[ 2459.303149]  new_sync_write+0x124/0x170
[ 2459.307393]  vfs_write+0xa5/0x1a0
[ 2459.311111]  ksys_write+0x4f/0xb0
[ 2459.314826]  do_syscall_64+0x5b/0x1a0
[ 2459.318895]  entry_SYSCALL_64_after_hwframe+0x65/0xca
in one of the dmesg logs and
[ 4169.826043] libceph: reset on mds4
[ 4169.826044] ceph: mds4 closed our session
[ 4169.826045] ceph: mds4 reconnect start
[ 4169.826062] libceph: mds1 (1)172.21.15.74:6836 connection reset
[ 4169.826064] libceph: reset on mds1
[ 4169.826064] ceph: mds1 closed our session
[ 4169.826065] ceph: mds1 reconnect start
[ 4169.827221] ceph: mds1 reconnect denied
[ 4169.827231] ceph: mds4 reconnect denied
[ 4169.831925] libceph: reset on mds2
[ 4169.838044] libceph: reset on mds3
[ 4169.844154] ceph: mds2 closed our session
[ 4169.844154] ceph: mds2 reconnect start
[ 4169.861696] ceph: mds2 reconnect denied
[ 4169.863550] ceph: mds3 closed our session
[ 4169.911453] ceph: mds3 reconnect start
[ 4169.916937] ceph: mds3 reconnect denied
[ 4221.027310] libceph: mon0 (1)172.21.15.16:6789 socket closed (con state OPEN)
[ 4221.034962] libceph: mon0 (1)172.21.15.16:6789 session lost, hunting for new mon
.. ad infinitum
in the other
Updated by Patrick Donnelly almost 3 years ago
- Related to Bug #50281: qa: untar_snap_rm timeout added
Updated by Patrick Donnelly almost 3 years ago
Might be related to #50281 but that was with the testing kernel.
Updated by Jeff Layton almost 3 years ago
The bad patch involved in #50281 was never merged into RHEL, so I doubt this is related.
The hung task warning in the log only popped once, which implies that the task eventually did get the page lock and proceed. With thrasher testing, there's not a lot we can do to silence those warnings. Sometimes the client just has to wait for a lock held by a task that is itself stuck waiting for a remote host to come back.
I'm a little unclear on the problem here though. How did you determine that the test was hung and not making progress?
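As an aside, the knob the dmesg message itself points at is the hung-task watchdog sysctl. A minimal sketch of tuning it for noisy thrasher runs (the 600-second value is illustrative, not a project recommendation; writing requires root):

```shell
# Disable "blocked for more than N seconds" warnings entirely
# (this is the exact command quoted in the dmesg output above):
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Or just raise the reporting threshold instead of silencing it,
# e.g. to 600 seconds, persistently via sysctl.conf syntax:
#   kernel.hung_task_timeout_secs = 600
sysctl -w kernel.hung_task_timeout_secs=600
```

Note the watchdog only warns; it does not kill or unblock the task, so tuning it changes log noise, not behavior.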
Updated by Patrick Donnelly almost 3 years ago
Jeff Layton wrote:
The bad patch involved in #50281 was never merged into RHEL, so I doubt this is related.
The hung task warning in the log only popped once, which implies that the task eventually did get the page lock and proceed. With thrasher testing, there's not a lot we can do to silence those warnings. Sometimes the client just has to wait for a lock held by a task that is itself stuck waiting for a remote host to come back.
I'm a little unclear on the problem here though. How did you determine that the test was hung and not making progress?
Unfortunately we can't know (yet) whether the test was really hung, because we don't have the MDS logs (dead job). I'll see if I can dig into the logs next time I run tests.
Updated by Jeff Layton almost 2 years ago
- Assignee changed from Jeff Layton to Patrick Donnelly
Handing this back to Patrick for now. I haven't seen this occur myself. Is this still a problem? Should we close it out?