Bug #50826

kceph: stock RHEL kernel hangs on snaptests with mon|osd thrashers

Added by Patrick Donnelly over 1 year ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115757/teuthology.log

and

/ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115769/teuthology.log

I see:

[ 2459.191432] INFO: task git:214273 blocked for more than 120 seconds.
[ 2459.198266]       Not tainted 4.18.0-240.1.1.el8_3.x86_64 #1
[ 2459.204471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2459.212884] git             D    0 214273 136000 0x00000080
[ 2459.218901] Call Trace:
[ 2459.221802]  __schedule+0x2a6/0x700
[ 2459.225931]  ? __dentry_kill+0x121/0x170
[ 2459.230363]  schedule+0x38/0xa0
[ 2459.233922]  io_schedule+0x12/0x40
[ 2459.237741]  __lock_page+0x141/0x240
[ 2459.241724]  ? file_check_and_advance_wb_err+0xd0/0xd0
[ 2459.247309]  pagecache_get_page+0x19a/0x2d0
[ 2459.252075]  grab_cache_page_write_begin+0x1f/0x40
[ 2459.257309]  ceph_write_begin+0x40/0x130 [ceph]
[ 2459.262265]  generic_perform_write+0xf4/0x1b0
[ 2459.267044]  ? file_update_time+0xed/0x130
[ 2459.271559]  ceph_write_iter+0xa75/0xc90 [ceph]
[ 2459.276495]  ? atime_needs_update+0x77/0xe0
[ 2459.281076]  ? touch_atime+0x33/0xe0
[ 2459.285318]  ? _copy_to_user+0x26/0x30
[ 2459.289542]  ? cp_new_stat+0x150/0x180
[ 2459.293712]  ? new_sync_write+0x124/0x170
[ 2459.298131]  ? ceph_fallocate+0x5f0/0x5f0 [ceph]
[ 2459.303149]  new_sync_write+0x124/0x170
[ 2459.307393]  vfs_write+0xa5/0x1a0
[ 2459.311111]  ksys_write+0x4f/0xb0
[ 2459.314826]  do_syscall_64+0x5b/0x1a0
[ 2459.318895]  entry_SYSCALL_64_after_hwframe+0x65/0xca

in one of the dmesg logs and

[ 4169.826043] libceph: reset on mds4
[ 4169.826044] ceph: mds4 closed our session
[ 4169.826045] ceph: mds4 reconnect start
[ 4169.826062] libceph: mds1 (1)172.21.15.74:6836 connection reset
[ 4169.826064] libceph: reset on mds1
[ 4169.826064] ceph: mds1 closed our session
[ 4169.826065] ceph: mds1 reconnect start
[ 4169.827221] ceph: mds1 reconnect denied
[ 4169.827231] ceph: mds4 reconnect denied
[ 4169.831925] libceph: reset on mds2
[ 4169.838044] libceph: reset on mds3
[ 4169.844154] ceph: mds2 closed our session
[ 4169.844154] ceph: mds2 reconnect start
[ 4169.861696] ceph: mds2 reconnect denied
[ 4169.863550] ceph: mds3 closed our session
[ 4169.911453] ceph: mds3 reconnect start
[ 4169.916937] ceph: mds3 reconnect denied
[ 4221.027310] libceph: mon0 (1)172.21.15.16:6789 socket closed (con state OPEN)
[ 4221.034962] libceph: mon0 (1)172.21.15.16:6789 session lost, hunting for new mon
... ad infinitum

in the other.
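When triaging runs like these, it helps to distinguish a one-off stall (a single hung-task warning usually means the task eventually got its lock) from a client that has been denied reconnection and will never recover. A small, hypothetical triage helper along those lines (not part of the teuthology suite; the patterns simply match the dmesg lines shown above):

```python
import re

# Patterns for the two failure signatures seen in the dmesg logs above
# (hypothetical triage helper, not part of any Ceph tooling).
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
RECONNECT_DENIED = re.compile(r"ceph: (mds\d+) reconnect denied")

def triage(dmesg_text):
    """Count hung-task warnings and MDS reconnect denials in a dmesg capture.

    A single hung-task warning often resolves on its own; repeated warnings
    or any 'reconnect denied' line point at a client whose session was
    closed and which will not make progress without intervention.
    """
    hung = HUNG_TASK.findall(dmesg_text)
    denied = RECONNECT_DENIED.findall(dmesg_text)
    return {"hung_tasks": len(hung), "reconnects_denied": len(denied)}

sample = (
    "[ 2459.191432] INFO: task git:214273 blocked for more than 120 seconds.\n"
    "[ 4169.827221] ceph: mds1 reconnect denied\n"
    "[ 4169.916937] ceph: mds3 reconnect denied\n"
)
print(triage(sample))  # {'hung_tasks': 1, 'reconnects_denied': 2}
```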


Related issues

Related to CephFS - Bug #50281: qa: untar_snap_rm timeout Resolved

History

#1 Updated by Patrick Donnelly over 1 year ago

  • Related to Bug #50281: qa: untar_snap_rm timeout added

#2 Updated by Patrick Donnelly over 1 year ago

Might be related to #50281, but that was with the testing kernel.

#3 Updated by Jeff Layton over 1 year ago

The bad patch involved in #50281 was never merged into RHEL, so I doubt this is related.

The hung task warning in the log only popped once, which implies that the task eventually did get the page lock and proceeded. With thrasher testing, there's not a lot we can do to silence those warnings. Sometimes the client just has to wait to get a lock held by a task that is stuck waiting for a remote host to come back.

I'm a little unclear on the problem here though. How did you determine that the test was hung and not making progress?

#4 Updated by Patrick Donnelly over 1 year ago

Jeff Layton wrote:

The bad patch involved in #50281 was never merged into RHEL, so I doubt this is related.

The hung task warning in the log only popped once, which implies that the task eventually did get the page lock and proceeded. With thrasher testing, there's not a lot we can do to silence those warnings. Sometimes the client just has to wait to get a lock held by a task that is stuck waiting for a remote host to come back.

I'm a little unclear on the problem here though. How did you determine that the test was hung and not making progress?

Unfortunately we can't know (yet) if the test was really hung because we don't have the MDS logs (dead job). I'll see if I can dig into the logs next time I run tests.

#5 Updated by Jeff Layton 7 months ago

  • Assignee changed from Jeff Layton to Patrick Donnelly

Handing this back to Patrick for now. I haven't seen this occur myself. Is this still a problem? Should we close it out?
