Bug #58715 (open): File systems going read-only in Pacific (16.2.11)

Added by Bryan Stillwell about 1 year ago. Updated about 1 year ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have multiple NVMe-based clusters running the latest Pacific release
(16.2.11) that are experiencing a problem where their filesystems randomly go
read-only.

This problem first affected us in October of 2022 while the clusters were
running either 16.2.7, 16.2.9, or 16.2.10. Digging into the logs, it appeared
that we were hitting the problem reported in issue #54701, since concurrent
lookup and unlink operations were seen before the filesystem went read-only.

After patiently waiting for 16.2.11 to be released (which included a fix for
#54701), we gradually rolled it out to all of the clusters. The rollout finished
last Wednesday (February 8th, 2023), but this weekend the problem reappeared.

The log entries look a little different now. The problem seems to start with
slow requests for files matching wpo_logo_small.png*:

2023-02-11T15:05:43.135-0700 7fede94a8700 0 log_channel(cluster) log [WRN] : slow request 511.708058 seconds old, received at 2023-02-11T14:57:11.427532-0700: client_request(client.168395754:122198341 lookup #0x1000a03a2ba/wpo_logo_small.png.webp.lossless.webp 2023-02-11T14:57:11.426323-0700 caller_uid=[redacted], caller_gid=[redacted]{}) currently dispatched
2023-02-11T15:05:43.135-0700 7fede94a8700 0 log_channel(cluster) log [WRN] : slow request 514.370971 seconds old, received at 2023-02-11T14:57:08.764619-0700: client_request(client.168335863:181891244 lookup #0x1000a03a2ba/wpo_logo_small.png.webp.lossless.webp 2023-02-11T14:57:08.761738-0700 caller_uid=[redacted], caller_gid=[redacted]{}) currently dispatched
2023-02-11T15:05:43.135-0700 7fede94a8700 0 log_channel(cluster) log [WRN] : slow request 514.370960 seconds old, received at 2023-02-11T14:57:08.764630-0700: client_request(client.168335863:181891245 lookup #0x1000a03a2ba/wpo_logo_small.png.webp.lossless.webp 2023-02-11T14:57:08.761738-0700 caller_uid=[redacted], caller_gid=[redacted]{}) currently dispatched
2023-02-11T15:05:43.135-0700 7fede94a8700 0 log_channel(cluster) log [WRN] : slow request 514.370821 seconds old, received at 2023-02-11T14:57:08.764769-0700: client_request(client.168335863:181891247 create #0x1000a03a2ba/wpo_logo_small.png.webp.lossy.webp 2023-02-11T14:57:08.762738-0700 caller_uid=[redacted], caller_gid=[redacted]{}) currently failed to wrlock, waiting

followed by an MDS map being updated:

2023-02-11T15:05:51.283-0700 7fedeacab700 1 mds.[redacted] Updating MDS map to version 315291 from mon.2

then the file system going read-only:

2023-02-11T15:05:55.971-0700 7fede4c9f700 -1 mds.0.277011 unhandled write error (90) Message too long, force readonly...
2023-02-11T15:05:55.971-0700 7fede4c9f700 1 mds.0.cache force file system read-only
2023-02-11T15:05:55.971-0700 7fede4c9f700 0 log_channel(cluster) log [WRN] : force file system read-only
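
For context, error code 90 is EMSGSIZE on Linux, and its message text is the
"Message too long" seen above. A quick check in Python (purely illustrative)
confirms the mapping:

import errno, os

# On Linux, errno 90 is EMSGSIZE and strerror() renders it as "Message too long",
# matching the "unhandled write error (90)" line in the MDS log above.
print(errno.errorcode[90])  # -> 'EMSGSIZE'
print(os.strerror(90))      # -> 'Message too long'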

We added some auto-remediation that restarts the MDS once it goes read-only:

2023-02-11T15:07:25.510-0700 7fedec4ae700 -1 received signal: Terminated from /usr/lib/systemd/systemd --system --deserialize 25 (PID: 1) UID: 0
2023-02-11T15:07:25.510-0700 7fedec4ae700 -1 mds.[redacted] * got signal Terminated *
2023-02-11T15:07:25.510-0700 7fedec4ae700 1 mds.[redacted] suicide! Wanted state up:active
2023-02-11T15:07:29.510-0700 7fedec4ae700 1 mds.0.277011 shutdown: shutting down rank 0

This causes the standby-replay MDS to take over, which then goes read-only
shortly afterwards:

2023-02-11T15:09:13.238-0700 7fd104c9d700 1 mds.[redacted] Updating MDS map to version 315305 from mon.0
2023-02-11T15:09:13.238-0700 7fd104c9d700 1 mds.0.315301 handle_mds_map i am now mds.0.315301
2023-02-11T15:09:13.238-0700 7fd104c9d700 1 mds.0.315301 handle_mds_map state change up:rejoin --> up:active
2023-02-11T15:09:13.238-0700 7fd104c9d700 1 mds.0.315301 recovery_done -- successful recovery!
2023-02-11T15:09:18.236-0700 7fd104c9d700 1 mds.0.315301 active_start
2023-02-11T15:09:25.221-0700 7fd104c9d700 1 mds.0.315301 cluster recovered.
2023-02-11T15:09:30.020-0700 7fd104c9d700 1 mds.[redacted] Updating MDS map to version 315306 from mon.0
2023-02-11T15:09:30.049-0700 7fd0fec91700 -1 mds.0.315301 unhandled write error (90) Message too long, force readonly...
2023-02-11T15:09:30.049-0700 7fd0fec91700 1 mds.0.cache force file system read-only
2023-02-11T15:09:30.049-0700 7fd0fec91700 0 log_channel(cluster) log [WRN] : force file system read-only

Our auto-remediation then restarts that MDS as well, and the file system
becomes healthy again.

Unfortunately I haven't been able to reproduce the problem in an isolated
environment, but this code change to the wp-optimize plugin appears to be
related:

https://plugins.trac.wordpress.org/changeset/2794595/wp-optimize/trunk/cache/file-based-page-cache-functions.php?old=2737167&old_path=wp-optimize%2Ftrunk%2Fcache%2Ffile-based-page-cache-functions.php

A version with that change was released around the time this read-only problem
started affecting us.

#1 - Updated by Bryan Stillwell about 1 year ago

This appears to be related to this mailing list post:

https://www.spinics.net/lists/ceph-users/msg52977.html?fbclid=IwAR37ojpTE8IHf-tyGGb7TQllZ-mny1xZQmxv4lNprqMGgMK54UzcNTAKPI8

Looking at the OSD logs, I see:

2023-02-11T15:05:55.971-0700 7fc19fdc5700 -1 osd.135 pg_epoch: 130463 pg[11.b( v 130463'96620709 (130459'96618662,130463'96620709] local-lis/les=129379/129380 n=492579 ec=1330/1330 lis/c=129379/129379 les/c/f=129380/129380/0 sis=129379) [135,206,70] r=0 lpr=129379 crt=130463'96620708 lcod 130463'96620708 mlcod 130463'96620708 active+clean] do_op msg data len 94382290 > osd_max_write_size 94371840 on osd_op(mds.0.277011:226984764 11.b 11:d0630e4c:::mds0_sessionmap:head [omap-set-header in=14b,omap-set-vals in=94382276b] snapc 0=[] ondisk+write+known_if_redirected+full_force e130463) v8
2023-02-11T15:09:26.846-0700 7fc19fdc5700 -1 osd.135 pg_epoch: 130464 pg[11.b( v 130463'96620726 (130460'96618762,130463'96620726] local-lis/les=129379/129380 n=492581 ec=1330/1330 lis/c=129379/129379 les/c/f=129380/129380/0 sis=129379) [135,206,70] r=0 lpr=129379 crt=130463'96620726 lcod 130463'96620725 mlcod 130463'96620725 active+clean] do_op msg data len 94372125 > osd_max_write_size 94371840 on osd_op(mds.166925851.315301:2028390 11.b 11:d0630e4c:::mds0_sessionmap:head [omap-set-header in=14b,omap-set-vals in=94369167b,omap-rm-keys in=2944b] snapc 0=[] ondisk+write+known_if_redirected+full_force e130464) v8
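
Both rejected writes are omap updates to the mds0_sessionmap object whose
message size exceeds osd_max_write_size (94371840 bytes, i.e. the 90 MiB
default). A quick back-of-the-envelope check in Python, using the
"do_op msg data len" values taken straight from the two log lines above, shows
how close to the limit they are:

# Values below are copied from the OSD log lines above; 94371840 = 90 MiB,
# the default osd_max_write_size.
limit = 90 * 1024 ** 2                 # 94371840 bytes
for msg_len in (94382290, 94372125):   # "do_op msg data len" from the two rejected ops
    print(msg_len - limit)             # overage: 10450 and 285 bytes respectively

So both SessionMap flushes only barely exceed the limit, which fits the idea
that the sessionmap has been growing gradually until it finally crossed the
threshold.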

The clients are using the 5.4.206 kernel from here:

http://mirror.centos.org/altarch/7/kernel/x86_64/Packages/kernel-5.4.206-200.el7.x86_64.rpm

Perhaps this old kernel is misbehaving and causing the sessionmap to become too large?
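
If the sessionmap is indeed the thing that's growing, one rough way to keep an
eye on it (just a sketch; it assumes the metadata pool is called
"cephfs_metadata", so adjust the name for your cluster) is to count the omap
keys on the mds0_sessionmap object, since the MDS stores roughly one omap entry
per client session:

import subprocess

# Assumption: "cephfs_metadata" is the CephFS metadata pool name; change as needed.
pool = "cephfs_metadata"
result = subprocess.run(
    ["rados", "-p", pool, "listomapkeys", "mds0_sessionmap"],
    capture_output=True, text=True, check=True,
)
keys = result.stdout.splitlines()
print(f"{len(keys)} omap keys (roughly one per client session) in mds0_sessionmap")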
