Bug #52260: 1 MDSs are read only | pacific 16.2.5 - CephFS - Ceph

Actions

Copy link

Bug #52260

closed

1 MDSs are read only | pacific 16.2.5

Added by cephuser2345 user over 2 years ago. Updated about 2 months ago.

Status:

Duplicate

Priority:

Low

Assignee:

Milind Changire

Category:

fsck/damage handling

Target version:

Ceph - v18.0.0

% Done:

Source:

Community (user)

Tags:

Backport:

Regression:

Severity:

1 - critical

Reviewed:

08/14/2021

Affected Versions:

Ceph - v16.2.5

ceph-qa-suite:

Component(FS):

MDS

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

Hi,

We've upgraded from Ceph 14.2.20 to 16.2.5 a couple of weeks ago and suddently the MDS metadata pool OSD's filled up to 100%, because the MDS was behind trimming.

When this occures we increase the mds cephfs_metadata pool OSDs size (it is a virtual ssd drive) and restart the data's OSD and everything comes back to normal (the trimming is working and reducing the cephfs_metadata objects),

What happend here is that after restarting the MDS the fs return failed commit on the 0x1 directory (which is the root directory) and it is up in read-only mode.

This is the second time it happend to us, we thought it was related to the Ceph upgrade to version 16 because it happend immidiately after upgrade, now this occured a couple of weeks later,

Last time it happend we tried everything we found online, and eventually we rebuilt the metadata pool, as it was a very large storage and it took too much time we gaveup and rebuilt a new fs (so we can't tell if it worked).

Any idea why this occures and how we can solve this without rebuilding the cephfs metadata?

ceph -s
cluster:
id: b41468a7-45b9-4812-a943-3b531a72ea6d
health: HEALTH_ERR
1 MDSs are read only
1 MDSs behind on trimming
3 full osd(s)
2 pool(s) full

services:
mon: 3 daemons, quorum mon01,mon02,mon03 (age 3d)
mgr: mon01(active, since 2w), standbys: mon02, mon03
mds: 1/1 daemons up, 2 standby
osd: 45 osds: 45 up (since 6h), 45 in (since 2w)

data:
volumes: 1/1 healthy
pools: 3 pools, 1041 pgs
objects: 70.64M objects, 262 TiB
usage: 356 TiB used, 181 TiB / 538 TiB avail
pgs: 1038 active+clean
3 active+clean+scrubbing+deep

io:
client: 1.6 KiB/s rd, 18 MiB/s wr, 0 op/s rd, 4 op/s wr

ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 537 TiB 181 TiB 356 TiB 356 TiB 66.26
ssd 510 GiB 15 GiB 495 GiB 495 GiB 97.03
TOTAL 538 TiB 181 TiB 356 TiB 356 TiB 66.29

--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 1 1024 262 TiB 70.45M 353 TiB 83.70 41 TiB
cephfs_metadata 2 16 654 GiB 189.74k 492 GiB 100.00 0 B
device_health_metrics 3 1 5.6 MiB 84 22 MiB 100.00 0 B

ceph mds log:
2021-08-13T11:12:36.575-0500 7f5f5dfb2700 1 mds.0.68933 reconnect_done
2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.mds03 Updating MDS map to version 68939 from mon.1
2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map i am now mds.0.68933
2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map state change up:reconnect --> up:rejoin
2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 rejoin_start
2021-08-13T11:12:36.911-0500 7f5f5dfb2700 1 mds.0.68933 rejoin_joint_start
2021-08-13T11:12:38.319-0500 7f5f57fa6700 1 mds.0.68933 rejoin_done
2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.mds03 Updating MDS map to version 68940 from mon.1
2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map i am now mds.0.68933
2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map state change up:rejoin --> up:active
2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 recovery_done -- successful recovery!
2021-08-13T11:12:38.967-0500 7f5f5dfb2700 1 mds.0.68933 active_start
2021-08-13T11:12:38.979-0500 7f5f5dfb2700 1 mds.0.68933 cluster recovered.
2021-08-13T11:12:41.175-0500 7f5f5b7ad700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache.dir(0x1) commit error -22 v 84709885
2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 mds.0.68933 unhandled write error (22) Invalid argument, force readonly...
2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache force file system read-only
2021-08-13T11:12:41.203-0500 7f5f57fa6700 0 log_channel(cluster) log [WRN] : force file system read-only

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » CephFS

Custom queries

Bug #52260

1 MDSs are read only | pacific 16.2.5

Updated by Patrick Donnelly over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by Dan van der Ster over 2 years ago

Updated by Patrick Donnelly over 2 years ago

Updated by Milind Changire over 2 years ago

Updated by Milind Changire over 2 years ago

Updated by Patrick Donnelly over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by Neha Ojha over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by Sridhar Seshasayee over 2 years ago

Updated by Greg Farnum over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by Sridhar Seshasayee over 2 years ago

Updated by cephuser2345 user over 2 years ago

Updated by Loïc Dachary over 2 years ago

Updated by cephuser2345 user about 2 years ago

Updated by Milind Changire about 2 years ago

Updated by Milind Changire over 1 year ago

Updated by Milind Changire over 1 year ago

Updated by Xiubo Li over 1 year ago

Updated by Prayank Saxena over 1 year ago

Updated by Patrick Donnelly about 2 months ago

Updated by Patrick Donnelly about 2 months ago