Bug #52260

closed

1 MDSs are read only | pacific 16.2.5

Added by cephuser2345 user over 2 years ago. Updated about 2 months ago.

Status:
Duplicate
Priority:
Low
Category:
fsck/damage handling
Target version:
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
08/14/2021
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We upgraded from Ceph 14.2.20 to 16.2.5 a couple of weeks ago, and suddenly the OSDs backing the MDS metadata pool filled up to 100% because the MDS was behind on trimming.
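
For context, we notice the trimming backlog through the standard status commands before the OSDs fill up completely (just a sketch; the MDS_TRIM health code name is taken from the Ceph docs):

    ceph health detail    # reports MDS_TRIM "Behind on trimming" when the MDS log backlog grows
    ceph fs status        # shows the active MDS and metadata pool usage
    ceph df               # shows the cephfs_metadata pool filling up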

When this occurs, we increase the size of the cephfs_metadata pool OSDs (each is a virtual SSD drive), restart the affected OSDs, and everything comes back to normal (trimming works again and the number of cephfs_metadata objects goes down). A rough sketch of the steps is below.
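
Concretely, the workaround on our side looks roughly like this (we run systemd-managed OSDs outside containers; the OSD id is a placeholder):

    ceph osd df tree                 # identify the full ssd-class OSDs backing cephfs_metadata
    # grow the virtual SSD behind each full OSD at the hypervisor, then:
    systemctl restart ceph-osd@<id>  # restart each affected OSD
    ceph -s                          # watch trimming resume and the metadata objects shrink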

What happened here is that after restarting the MDS, the filesystem reports a failed commit on the 0x1 directory (which is the root directory) and the MDS is up in read-only mode.

This is the second time this has happened to us. We initially thought it was related to the upgrade to Ceph 16 because it happened immediately after the upgrade, but now it has occurred again a couple of weeks later.

Last time it happened we tried everything we could find online, and eventually we started rebuilding the metadata pool; since the storage is very large it took too much time, so we gave up and created a new fs instead (so we can't tell whether the rebuild would have worked).

Any idea why this occurs and how we can solve it without rebuilding the CephFS metadata?
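
In the meantime, before touching the metadata pool again, these are the diagnostics we understand are available for a damaged directory (a sketch; "cephfs" is our fs name and rank 0 is the single active MDS):

    ceph tell mds.cephfs:0 damage ls                        # list damage entries recorded by the MDS
    ceph tell mds.cephfs:0 scrub start / recursive,repair   # forward scrub of the root with repair
    ceph tell mds.cephfs:0 scrub status                     # follow scrub progress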

  1. ceph -s

    cluster:
    id: b41468a7-45b9-4812-a943-3b531a72ea6d
    health: HEALTH_ERR
    1 MDSs are read only
    1 MDSs behind on trimming
    3 full osd(s)
    2 pool(s) full

    services:
    mon: 3 daemons, quorum mon01,mon02,mon03 (age 3d)
    mgr: mon01(active, since 2w), standbys: mon02, mon03
    mds: 1/1 daemons up, 2 standby
    osd: 45 osds: 45 up (since 6h), 45 in (since 2w)

    data:
    volumes: 1/1 healthy
    pools: 3 pools, 1041 pgs
    objects: 70.64M objects, 262 TiB
    usage: 356 TiB used, 181 TiB / 538 TiB avail
    pgs: 1038 active+clean
    3 active+clean+scrubbing+deep

    io:
    client: 1.6 KiB/s rd, 18 MiB/s wr, 0 op/s rd, 4 op/s wr

  2. ceph df
    --- RAW STORAGE ---
    CLASS SIZE AVAIL USED RAW USED %RAW USED
    hdd 537 TiB 181 TiB 356 TiB 356 TiB 66.26
    ssd 510 GiB 15 GiB 495 GiB 495 GiB 97.03
    TOTAL 538 TiB 181 TiB 356 TiB 356 TiB 66.29

--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 1 1024 262 TiB 70.45M 353 TiB 83.70 41 TiB
cephfs_metadata 2 16 654 GiB 189.74k 492 GiB 100.00 0 B
device_health_metrics 3 1 5.6 MiB 84 22 MiB 100.00 0 B

  3. ceph mds log:
    2021-08-13T11:12:36.575-0500 7f5f5dfb2700 1 mds.0.68933 reconnect_done
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.mds03 Updating MDS map to version 68939 from mon.1
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map i am now mds.0.68933
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map state change up:reconnect --> up:rejoin
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 rejoin_start
    2021-08-13T11:12:36.911-0500 7f5f5dfb2700 1 mds.0.68933 rejoin_joint_start
    2021-08-13T11:12:38.319-0500 7f5f57fa6700 1 mds.0.68933 rejoin_done
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.mds03 Updating MDS map to version 68940 from mon.1
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map i am now mds.0.68933
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map state change up:rejoin --> up:active
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 recovery_done -- successful recovery!
    2021-08-13T11:12:38.967-0500 7f5f5dfb2700 1 mds.0.68933 active_start
    2021-08-13T11:12:38.979-0500 7f5f5dfb2700 1 mds.0.68933 cluster recovered.
    2021-08-13T11:12:41.175-0500 7f5f5b7ad700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache.dir(0x1) commit error -22 v 84709885
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 mds.0.68933 unhandled write error (22) Invalid argument, force readonly...
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache force file system read-only
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 0 log_channel(cluster) log [WRN] : force file system read-only
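
If it would help, we can try to capture the failing commit with more verbose MDS logging; a sketch of what we would run (and revert afterwards), using the standard debug options:

    ceph config set mds debug_mds 20    # verbose MDS logging
    ceph config set mds debug_ms 1      # messenger logging
    ceph mds fail 0                     # make a standby take over rank 0 and hit the commit again
    # ...collect the MDS log, then revert:
    ceph config set mds debug_mds 1/5
    ceph config set mds debug_ms 0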

Related issues 1 (0 open, 1 closed)

Is duplicate of CephFS - Bug #58082: cephfs: filesystem became read only after Quincy upgrade (Status: Resolved, Assignee: Xiubo Li)
