Bug #50826

kceph: stock RHEL kernel hangs on snaptests with mon|osd thrashers

Added by Patrick Donnelly over 1 year ago. Updated 7 months ago.

Status:
New
Priority:
Normal
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
kceph
Labels (FS):
qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115757/teuthology.log

and

/ceph/teuthology-archive/pdonnell-2021-05-14_21:45:42-fs-master-distro-basic-smithi/6115769/teuthology.log

I see:

[ 2459.191432] INFO: task git:214273 blocked for more than 120 seconds.
[ 2459.198266]       Not tainted 4.18.0-240.1.1.el8_3.x86_64 #1
[ 2459.204471] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2459.212884] git             D    0 214273 136000 0x00000080
[ 2459.218901] Call Trace:
[ 2459.221802]  __schedule+0x2a6/0x700
[ 2459.225931]  ? __dentry_kill+0x121/0x170
[ 2459.230363]  schedule+0x38/0xa0
[ 2459.233922]  io_schedule+0x12/0x40
[ 2459.237741]  __lock_page+0x141/0x240
[ 2459.241724]  ? file_check_and_advance_wb_err+0xd0/0xd0
[ 2459.247309]  pagecache_get_page+0x19a/0x2d0
[ 2459.252075]  grab_cache_page_write_begin+0x1f/0x40
[ 2459.257309]  ceph_write_begin+0x40/0x130 [ceph]
[ 2459.262265]  generic_perform_write+0xf4/0x1b0
[ 2459.267044]  ? file_update_time+0xed/0x130
[ 2459.271559]  ceph_write_iter+0xa75/0xc90 [ceph]
[ 2459.276495]  ? atime_needs_update+0x77/0xe0
[ 2459.281076]  ? touch_atime+0x33/0xe0
[ 2459.285318]  ? _copy_to_user+0x26/0x30
[ 2459.289542]  ? cp_new_stat+0x150/0x180
[ 2459.293712]  ? new_sync_write+0x124/0x170
[ 2459.298131]  ? ceph_fallocate+0x5f0/0x5f0 [ceph]
[ 2459.303149]  new_sync_write+0x124/0x170
[ 2459.307393]  vfs_write+0xa5/0x1a0
[ 2459.311111]  ksys_write+0x4f/0xb0
[ 2459.314826]  do_syscall_64+0x5b/0x1a0
[ 2459.318895]  entry_SYSCALL_64_after_hwframe+0x65/0xca

in one of the dmesg logs and

[ 4169.826043] libceph: reset on mds4
[ 4169.826044] ceph: mds4 closed our session
[ 4169.826045] ceph: mds4 reconnect start
[ 4169.826062] libceph: mds1 (1)172.21.15.74:6836 connection reset
[ 4169.826064] libceph: reset on mds1
[ 4169.826064] ceph: mds1 closed our session
[ 4169.826065] ceph: mds1 reconnect start
[ 4169.827221] ceph: mds1 reconnect denied
[ 4169.827231] ceph: mds4 reconnect denied
[ 4169.831925] libceph: reset on mds2
[ 4169.838044] libceph: reset on mds3
[ 4169.844154] ceph: mds2 closed our session
[ 4169.844154] ceph: mds2 reconnect start
[ 4169.861696] ceph: mds2 reconnect denied
[ 4169.863550] ceph: mds3 closed our session
[ 4169.911453] ceph: mds3 reconnect start
[ 4169.916937] ceph: mds3 reconnect denied
[ 4221.027310] libceph: mon0 (1)172.21.15.16:6789 socket closed (con state OPEN)
[ 4221.034962] libceph: mon0 (1)172.21.15.16:6789 session lost, hunting for new mon
... ad infinitum

in the other.
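When triaging runs like these, it helps to distinguish a one-off stall (a single hung-task warning usually means the task eventually got its lock) from a client that has been denied reconnection and will never recover. A small, hypothetical triage helper along those lines (not part of the teuthology suite; the patterns simply match the dmesg lines shown above):

```python
import re

# Patterns for the two failure signatures seen in the dmesg logs above
# (hypothetical triage helper, not part of any Ceph tooling).
HUNG_TASK = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")
RECONNECT_DENIED = re.compile(r"ceph: (mds\d+) reconnect denied")

def triage(dmesg_text):
    """Count hung-task warnings and MDS reconnect denials in a dmesg capture.

    A single hung-task warning often resolves on its own; repeated warnings
    or any 'reconnect denied' line point at a client whose session was
    closed and which will not make progress without intervention.
    """
    hung = HUNG_TASK.findall(dmesg_text)
    denied = RECONNECT_DENIED.findall(dmesg_text)
    return {"hung_tasks": len(hung), "reconnects_denied": len(denied)}

sample = (
    "[ 2459.191432] INFO: task git:214273 blocked for more than 120 seconds.\n"
    "[ 4169.827221] ceph: mds1 reconnect denied\n"
    "[ 4169.916937] ceph: mds3 reconnect denied\n"
)
print(triage(sample))  # {'hung_tasks': 1, 'reconnects_denied': 2}
```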


Related issues

Related to CephFS - Bug #50281: qa: untar_snap_rm timeout Resolved

History

#1 Updated by Patrick Donnelly over 1 year ago

  • Related to Bug #50281: qa: untar_snap_rm timeout added

#2 Updated by Patrick Donnelly over 1 year ago

Might be related to #50281, but that was with the testing kernel.

#3 Updated by Jeff Layton over 1 year ago

The bad patch involved in #50281 was never merged into RHEL, so I doubt this is related.

The hung task warning in the log only popped once, which implies that the task eventually did get the page lock and proceeded. With thrasher testing, there's not a lot we can do to silence those warnings. Sometimes the client just has to wait to get a lock held by a task that is stuck waiting for a remote host to come back.

I'm a little unclear on the problem here though. How did you determine that the test was hung and not making progress?

#4 Updated by Patrick Donnelly over 1 year ago

Jeff Layton wrote:

The bad patch involved in #50281 was never merged into RHEL, so I doubt this is related.

The hung task warning in the log only popped once, which implies that the task eventually did get the page lock and proceeded. With thrasher testing, there's not a lot we can do to silence those warnings. Sometimes the client just has to wait to get a lock held by a task that is stuck waiting for a remote host to come back.

I'm a little unclear on the problem here though. How did you determine that the test was hung and not making progress?

Unfortunately we can't know (yet) if the test was really hung because we don't have the MDS logs (dead job). I'll see if I can dig into the logs next time I run tests.

#5 Updated by Jeff Layton 7 months ago

  • Assignee changed from Jeff Layton to Patrick Donnelly

Handing this back to Patrick for now. I haven't seen this occur myself. Is this still a problem? Should we close it out?
