Bug #52260

closed

1 MDSs are read only | pacific 16.2.5

Added by cephuser2345 user over 2 years ago. Updated about 1 month ago.

Status:
Duplicate
Priority:
Low
Category:
fsck/damage handling
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
08/14/2021
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi,

We upgraded from Ceph 14.2.20 to 16.2.5 a couple of weeks ago, and suddenly the MDS metadata pool OSDs filled up to 100% because the MDS was behind on trimming.

When this occurs we increase the size of the cephfs_metadata pool OSDs (they are virtual SSD drives) and restart the affected OSDs, and everything comes back to normal (trimming works again and reduces the cephfs_metadata objects).

What happened here is that after restarting the MDS, the filesystem returned a failed commit on the 0x1 directory (which is the root directory) and the MDS came up in read-only mode.

This is the second time it has happened to us. We thought it was related to the Ceph upgrade to version 16 because it happened immediately after the upgrade, but now it has occurred a couple of weeks later.

Last time it happened we tried everything we found online and eventually started rebuilding the metadata pool; as it was a very large filesystem and took too much time, we gave up and built a new fs (so we can't tell if it worked).

Any idea why this occurs and how we can solve it without rebuilding the CephFS metadata?
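
(For reference, the same state can also be checked with standard commands like the following, in addition to the ceph -s and ceph df output below; commands are illustrative only:)

    $ ceph health detail   # lists the "MDSs behind on trimming" segment counts and the full OSDs
    $ ceph fs status       # MDS state plus metadata/data pool usage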

  1. ceph -s

    cluster:
    id: b41468a7-45b9-4812-a943-3b531a72ea6d
    health: HEALTH_ERR
    1 MDSs are read only
    1 MDSs behind on trimming
    3 full osd(s)
    2 pool(s) full

    services:
    mon: 3 daemons, quorum mon01,mon02,mon03 (age 3d)
    mgr: mon01(active, since 2w), standbys: mon02, mon03
    mds: 1/1 daemons up, 2 standby
    osd: 45 osds: 45 up (since 6h), 45 in (since 2w)

    data:
    volumes: 1/1 healthy
    pools: 3 pools, 1041 pgs
    objects: 70.64M objects, 262 TiB
    usage: 356 TiB used, 181 TiB / 538 TiB avail
    pgs: 1038 active+clean
    3 active+clean+scrubbing+deep

    io:
    client: 1.6 KiB/s rd, 18 MiB/s wr, 0 op/s rd, 4 op/s wr

  2. ceph df
    --- RAW STORAGE ---
    CLASS SIZE AVAIL USED RAW USED %RAW USED
    hdd 537 TiB 181 TiB 356 TiB 356 TiB 66.26
    ssd 510 GiB 15 GiB 495 GiB 495 GiB 97.03
    TOTAL 538 TiB 181 TiB 356 TiB 356 TiB 66.29

--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
cephfs_data 1 1024 262 TiB 70.45M 353 TiB 83.70 41 TiB
cephfs_metadata 2 16 654 GiB 189.74k 492 GiB 100.00 0 B
device_health_metrics 3 1 5.6 MiB 84 22 MiB 100.00 0 B

  3. ceph mds log:
    2021-08-13T11:12:36.575-0500 7f5f5dfb2700 1 mds.0.68933 reconnect_done
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.mds03 Updating MDS map to version 68939 from mon.1
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map i am now mds.0.68933
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map state change up:reconnect --> up:rejoin
    2021-08-13T11:12:36.907-0500 7f5f5dfb2700 1 mds.0.68933 rejoin_start
    2021-08-13T11:12:36.911-0500 7f5f5dfb2700 1 mds.0.68933 rejoin_joint_start
    2021-08-13T11:12:38.319-0500 7f5f57fa6700 1 mds.0.68933 rejoin_done
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.mds03 Updating MDS map to version 68940 from mon.1
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map i am now mds.0.68933
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 handle_mds_map state change up:rejoin --> up:active
    2021-08-13T11:12:38.963-0500 7f5f5dfb2700 1 mds.0.68933 recovery_done -- successful recovery!
    2021-08-13T11:12:38.967-0500 7f5f5dfb2700 1 mds.0.68933 active_start
    2021-08-13T11:12:38.979-0500 7f5f5dfb2700 1 mds.0.68933 cluster recovered.
    2021-08-13T11:12:41.175-0500 7f5f5b7ad700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache.dir(0x1) commit error -22 v 84709885
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 mds.0.68933 unhandled write error (22) Invalid argument, force readonly...
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache force file system read-only
    2021-08-13T11:12:41.203-0500 7f5f57fa6700 0 log_channel(cluster) log [WRN] : force file system read-only

Related issues 1 (0 open, 1 closed)

Is duplicate of CephFS - Bug #58082: cephfs:filesystem became read only after Quincy upgrade (Resolved) - Xiubo Li

Actions #1

Updated by Patrick Donnelly over 2 years ago

  • Assignee set to Milind Changire
Actions #2

Updated by cephuser2345 user over 2 years ago

Hi,
Any update on this?

Actions #3

Updated by cephuser2345 user over 2 years ago

We think there’s a major bug here with great risk.

Do you need any further logs/information? Will you try to replicate this?

Actions #5

Updated by Patrick Donnelly over 2 years ago

  • Priority changed from Normal to Urgent
Actions #6

Updated by Milind Changire over 2 years ago

  • Priority changed from Urgent to Normal

@cephuser2345
On any of the clusters still running Nautilus, please run the following command to extract the inode backtrace for inode 0x1:

$ rados --pool=<meta-data-pool-name> getxattr 1.00000000 parent | hexdump

  • Remember to specify the correct metadata pool name of the chosen cluster
  • Please paste the output of the command here
Actions #7

Updated by Milind Changire over 2 years ago

  • Priority changed from Normal to Urgent

resetting priority to Urgent; probably got set to Normal due to some mouse or keyboard activity on my part

Actions #8

Updated by Patrick Donnelly over 2 years ago

Milind Changire wrote:

@cephuser2345
On any of the clusters still running Nautilus, please run the following command to extract the inode backtrace for inode 0x1:
[...]

  • Remember to specify the correct metadata pool name of the chosen cluster
  • Please paste the output of the command here

Better is:

$ rados --pool=<meta-data-pool-name> getxattr 1.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
Actions #9

Updated by cephuser2345 user over 2 years ago

Unfortunately we needed the space and terminated the pools.

This issue is not similar to #52406; it's a different behavior.

Actions #10

Updated by cephuser2345 user over 2 years ago

We ran into the same issue on a different cluster - same problem. This is the result:

root@admin01:~# rados --pool=cephfs_metadata getxattr 1.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json
{
"ino": 1,
"ancestors": [],
"pool": 8,
"old_pools": []
}

Actions #11

Updated by cephuser2345 user over 2 years ago

As this is the third CephFS that has crashed with the same symptom and only comes up in read-only mode,

we must find a solution for this.

We see similar behavior across the cases, which might help resolve this ASAP:

- In all cases, the cephfs_metadata pool was full before it happened.

- After the MDS restarted we received the error message and then everything stopped working.

- In all clusters we set osd_op_queue to mclock_scheduler - and for some unknown reason, there was an override configuration of multiple values such as osd_recovery_max_active and osd_max_backfills. We couldn't change these manually, not with injectargs, the config file, or the config database - nothing - as if it had its own hardcoded configuration:

like this:

"osd_recovery_max_active": {
"default": 0,
"file": 1,
"override": 1000,
"final": 1000
},
"osd_recovery_max_active_hdd": {
"default": 3,
"override": 1000,
"final": 1000
},
"osd_recovery_max_active_ssd": {
"default": 10,
"override": 1000,
"final": 1000
},
"osd_max_backfills": {
"default": 1,
"file": 24,
"override": 1000,
"final": 1000
},

Our assumption is that the osd_op_queue misconfiguration caused stalled ops, so trimming couldn't proceed and the pool filled up; once the pool was full, incorrect data was written to cephfs_metadata, and that is why the fs won't come up.
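
(To check for stalled ops on a suspect OSD, admin-socket commands along these lines can be used; osd.12 is an illustrative id and the commands have to run on that OSD's host:)

$ ceph daemon osd.12 dump_ops_in_flight       # in-flight ops with their age and current state
$ ceph daemon osd.12 dump_historic_slow_ops   # recently completed slow ops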

We kept the cephfs_metadata and cephfs_data pools in their dysfunctional state - we need help fast here, as we need to get the cluster up and free space.

Thanks.

Actions #12

Updated by Neha Ojha over 2 years ago

cephuser2345 user wrote:

As this is the third CephFS that has crashed with the same symptom and only comes up in read-only mode,

we must find a solution for this.

We see similar behavior across the cases, which might help resolve this ASAP:

- In all cases, the cephfs_metadata pool was full before it happened.

- After the MDS restarted we received the error message and then everything stopped working.

- In all clusters we set osd_op_queue to mclock_scheduler - and for some unknown reason, there was an override configuration of multiple values such as osd_recovery_max_active and osd_max_backfills. We couldn't change these manually, not with injectargs, the config file, or the config database - nothing - as if it had its own hardcoded configuration:

These values are set for mclock_scheduler's built-in profiles. This is documented in https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#mclock-built-in-profiles

As a workaround, you can use the "custom" profile, which allows you to override all settings, https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#mclock-profile-types
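
A minimal sketch of that workaround, assuming the standard ceph config interface; the numeric values are illustrative, see the linked docs for the authoritative options:

$ ceph config set osd osd_mclock_profile custom
$ ceph config set osd osd_max_backfills 4
$ ceph config set osd osd_recovery_max_active_ssd 10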

I'll also ask Sridhar to take a closer look at this.

like this:

"osd_recovery_max_active": {
"default": 0,
"file": 1,
"override": 1000,
"final": 1000
},
"osd_recovery_max_active_hdd": {
"default": 3,
"override": 1000,
"final": 1000
},
"osd_recovery_max_active_ssd": {
"default": 10,
"override": 1000,
"final": 1000
},
"osd_max_backfills": {
"default": 1,
"file": 24,
"override": 1000,
"final": 1000
},

Our assumption is that the osd_op_queue misconfiguration caused stalled ops, so trimming couldn't proceed and the pool filled up; once the pool was full, incorrect data was written to cephfs_metadata, and that is why the fs won't come up.

We kept the cephfs_metadata and cephfs_data pools in their dysfunctional state - we need help fast here, as we need to get the cluster up and free space.

Thanks.

Actions #13

Updated by cephuser2345 user over 2 years ago

Thanks, that seems to work now.

What about the MDS crashing and not coming up? This is the major issue here.

Actions #14

Updated by Sridhar Seshasayee over 2 years ago

Neha Ojha wrote:

These values are set for mclock_scheduler's built-in profiles. This is documented in https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#mclock-built-in-profiles

As a workaround, you can use the "custom" profile, which allows you to override all settings, https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#mclock-profile-types

I'll also ask Sridhar to take a closer look at this.

like this:

"osd_recovery_max_active": {
"default": 0,
"file": 1,
"override": 1000,
"final": 1000
},
"osd_recovery_max_active_hdd": {
"default": 3,
"override": 1000,
"final": 1000
},
"osd_recovery_max_active_ssd": {
"default": 10,
"override": 1000,
"final": 1000
},
"osd_max_backfills": {
"default": 1,
"file": 24,
"override": 1000,
"final": 1000
},

Yes, setting the "custom" profile will allow changes to the max backfills and max recovery settings, as you have already indicated.

Our assumption is that the osd_op_queue misconfiguration caused stalled ops, so trimming couldn't proceed and the pool filled up; once the pool was full, incorrect data was written to cephfs_metadata, and that is why the fs won't come up.

I am curious to know a couple of things:
1. When was osd_op_queue changed to mclock_scheduler? Was it after the upgrade to Pacific?
2. When was the trimming issue observed - before or after the osd_op_queue change? This could help determine whether changing osd_op_queue played a part in the metadata pool becoming full.

Actions #15

Updated by Greg Farnum over 2 years ago

How large is the metadata pool in use here? How many SSDs, of what size? Does Ceph detect them as SSDs, or as HDDs?

Do you have any of the map dumps? How far behind is the MDS reporting itself on trimming? Was it making any progress? Or was it somehow generating more log than it could remove?

...actually, I'm a bit confused by the report. It sounds like you've gotten a read-only CephFS twice, once during upgrade and once now. But did the metadata pool go full in both cases, and then the root inode commit failed? Or did the metadata pool go full the first time, and now the root inode commit failure is a second case that happened differently?
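
(For whoever gathers this, the device-class and pool-sizing questions can be answered with commands along these lines; the metadata pool name cephfs_metadata is assumed:)

$ ceph osd tree                            # the CLASS column shows whether each OSD is detected as ssd or hdd
$ ceph osd pool get cephfs_metadata size   # replication size of the metadata pool
$ ceph health detail                       # "behind on trimming" reports current vs. max journal segments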

Actions #16

Updated by cephuser2345 user over 2 years ago

1. We changed osd_op_queue to mclock_scheduler right after upgrading to Pacific. That was our main reason for the upgrade: to handle the I/O load during snaptrims and overloads that caused the MDS to fall behind on trimming, as ops were stuck in the OSDs.

2. We had MDS behind-on-trimming issues before the upgrade too.

3. The metadata pool size before it fell behind on trimming was approx. 1 GB, not big. It started to grow until it filled the pool, which is now as big as 160 GB.
The metadata pool is configured with 3 OSDs (SSD drives, 200 GB each), and the pool is replicated with size 3.

The symptom is that it grew until it filled the pool and stopped responding because it couldn't grow further; after a restart it didn't come up.

We had this in Nautilus too; what we've seen while the MDS was behind on trimming is that there were ops stuck in the OSDs where cephfs_data is.

But in Nautilus we were able to grow the metadata OSDs and everything returned to normal; here in Pacific it returned the error above (0x1) and came up read-only.

Maybe the upgrade from Nautilus to Pacific caused this?

After the upgrade everything worked perfectly; only after the metadata pool filled did it fail to come up, loading in read-only mode.

Actions #17

Updated by Sridhar Seshasayee over 2 years ago

cephuser2345 user wrote:

1. We changed osd_op_queue to mclock_scheduler right after upgrading to Pacific. That was our main reason for the upgrade: to handle the I/O load during snaptrims and overloads that caused the MDS to fall behind on trimming, as ops were stuck in the OSDs.

A couple of additional queries with respect to mclock to get a better understanding:

1. What necessitated the change of the mclock profile to "custom"? Is it because cluster operations like recovery, backfills and trimming were getting backed-up more with the default mclock profile?

2. Did changing the profile to "custom" and then changing the limits for max backfills and recovery help? It would be good if you could share the osd & mds logs so that we can better understand what is going on.

Actions #18

Updated by cephuser2345 user over 2 years ago

We didn't change the mclock profile to custom; it is set to high_client_ops.

I am not sure the op_queue is the problem - it seems like the MDS metadata pool being full is what causes the whole issue.

Why is the MDS not coming up normally and instead staying in read-only mode?

How do we solve this:
"
2021-08-13T11:12:41.203-0500 7f5f57fa6700 1 mds.0.cache.dir(0x1) commit error -22 v 84709885
2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2021-08-13T11:12:41.203-0500 7f5f57fa6700 -1 mds.0.68933 unhandled write error (22) Invalid argument, force readonly...
"

Actions #19

Updated by Loïc Dachary over 2 years ago

  • Target version deleted (v16.2.6)
Actions #20

Updated by cephuser2345 user about 2 years ago

Hi

Any update on this? :)

Actions #21

Updated by Milind Changire about 2 years ago

cephuser2345 user wrote:

Hi

Any update on this? :)

It's been a while since I attempted to reproduce this issue in-house using https://github.com/ceph/ceph/pull/42902

If possible, it would be great if you could reproduce this issue at your end with the global config option "debug mds = 20".
So the steps are (see the command sketch after the list):
1. install nautilus
2. set "debug mds = 20"
3. push data to the volume
4. get backtrace for inode 0x1
5. upgrade to pacific
6. watch for issue to reproduce
7. get backtrace for inode 0x1
8. attach the mds logs to this tracker
9. set "debug mds = 5"
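
A sketch of the commands for steps 2, 4/7 and 9, assuming the metadata pool is named cephfs_metadata:

$ ceph config set mds debug_mds 20    # step 2
$ rados --pool=cephfs_metadata getxattr 1.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json    # steps 4 and 7
$ ceph config set mds debug_mds 5     # step 9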

Actions #22

Updated by Milind Changire over 1 year ago

  • Priority changed from Urgent to Low

closing tracker for now
lowering priority to low
please reopen in case this is seen again

Actions #23

Updated by Milind Changire over 1 year ago

  • Status changed from New to Closed
Actions #24

Updated by Xiubo Li over 1 year ago

  • Status changed from Closed to Duplicate
  • Target version set to v18.0.0
  • Parent task set to #58082

Will track and fix it in https://tracker.ceph.com/issues/58082.

Actions #25

Updated by Prayank Saxena over 1 year ago

Hello Team,

We hit an issue similar to 'MDS read-only' in Pacific 16.2.9, where one write commit failed and made the entire filesystem read-only.

2022-12-12T04:55:12.095+0000 7f6dbfb56700 1 mds.0.cache.dir(0x10009535b89) commit error -22 v 8533429
2022-12-12T04:55:12.095+0000 7f6dbfb56700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x10009535b89 object, errno -22
2022-12-12T04:55:12.095+0000 7f6dbfb56700 -1 mds.0.29196 unhandled write error (22) Invalid argument, force readonly...

In our case we did not see any pool getting filled up, but there was no client I/O visible in the ceph status output before restarting the active MDS service; once we restarted the active MDS we saw the above logs.

Could you please let us know what the cause of hitting this issue is?
Also, do let me know if any more information is required.
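
(If useful, we can dump the backtrace of the affected directory object the same way as earlier in this tracker; the object name below assumes the usual <hex-inode>.<frag> naming and a metadata pool called cephfs_metadata:)

$ rados --pool=cephfs_metadata getxattr 10009535b89.00000000 parent | ceph-dencoder type inode_backtrace_t import - decode dump_json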

Thanks

Actions #26

Updated by Patrick Donnelly about 1 month ago

  • Parent task deleted (#58082)
Actions #27

Updated by Patrick Donnelly about 1 month ago

  • Is duplicate of Bug #58082: cephfs:filesystem became read only after Quincy upgrade added