
Bug #58082

cephfs:filesystem became read only after Quincy upgrade

Added by Xiubo Li 2 months ago. Updated 17 days ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

100%

Source:
Tags:
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Copied from Adrien's report on the ceph-users mailing list:

Hi,

We upgraded a Pacific Ceph cluster this morning to the latest Quincy version.
The cluster was healthy before the upgrade, everything was done according to the upgrade procedure (non-cephadm) [1], and all services restarted correctly, but the filesystem switched to read-only mode when it became active.
HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster.
In the MDS log, the error "failed to commit dir 0x1 object, errno -22" seems to be the root cause:
2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) log [WRN] : force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server force_clients_readonly

I couldn't get more info with 'ceph config set mds.x debug_mds 20'.

ceph fs status
cephfs - 17 clients
======
RANK  STATE       MDS         ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  cccephadm32  Reqs:    0 /s  12.9k  12.8k   673   1538
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   513G  48.6T
  cephfs_data      data    2558M  48.6T
  cephfs_data2     data     471G  48.6T
  cephfs_data3     data     433G  48.6T
STANDBY MDS
cccephadm30
cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)

Any idea what could have gone wrong and how to solve it before starting a disaster recovery procedure?

Cheers,
Adrien
This bug looks very similar to an issue opened last year and closed without any solution: https://tracker.ceph.com/issues/52260
Hi Xiubo,

We did the upgrade in rolling mode as always, with only a few Kubernetes pods as clients accessing their PVCs on CephFS.

I can reproduce the problem every time I restart the MDS daemon.
You can find the MDS log with debug_mds 25 and debug_ms 1 here : https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd
(The last timestamp: 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 mds.0.server force_clients_readonly)

I couldn't find any errors in the OSD logs; is there anything specific I should be looking for?

Best,
Adrien

Subtasks

Bug #52260: 1 MDSs are read only | pacific 16.2.5 (Duplicate, assigned to Milind Changire)

History

#1 Updated by Xiubo Li 2 months ago

  • Pull request ID set to 49048

#2 Updated by Xiubo Li 2 months ago

From the logs, dir(0x1) submits the 'volumes' dentry to the metadata pool:

1726069 2022-11-24T09:18:10.177+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) commit want 0 on [dir 0x1 / [2,head] auth pv=5067819 v=5067817 cv=0/0 dir_auth=0 ap=2+1 state=1610612800|fetching f(v0 m2022-02-14T14:08:22.630497+0100 1=0+1) n(v203259 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=1 authpin=1 0x5569a9e49600]
1726070 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) auth_pin by 0x5569a9e49600 on [dir 0x1 / [2,head] auth pv=5067819 v=5067817 cv=0/0 dir_auth=0 ap=3+1 state=1610612800|fetching f(v0 m2022-02-14T14:08:22.630497+0100 1=0+1) n(v203259 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3 145360=132068+13292) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=1 authpin=1 0x5569a9e49600] count now         3
1726071 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) _commit want 5067817 on [dir 0x1 / [2,head] auth pv=5067819 v=5067817 cv=0/0 dir_auth=0 ap=3+1 state=1610612800|fetching f(v0 m2022-02-14T14:08:22.630497+0100 1=0+1) n(v203259 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3 145360=132068+13292) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=1 authpin=1 0x5569a9e49600]
1726072 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) marking committing
1726073 2022-11-24T09:18:10.178+0100 7fe0347eb700 20 mds.0.bal hit_dir 4 pop is 1, frag * size 1 [pop IRD:[C 0.00e+00] IWR:[C 0.00e+00] RDR:[C 0.00e+00] FET:[C 9.82e-01] STR:[C 1.00e+00] *LOAD:6.0]
1726074 
1726075 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) _omap_commit
1726076 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1)  set volumes [dentry #0x1/volumes [2,head] auth (dversion lock) pv=5067818 v=5067816 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5569b24b2280]
1726077 2022-11-24T09:18:10.178+0100 7fe0347eb700 14 mds.0.cache.dir(0x1)  dn 'volumes' inode [inode 0x10000000000 [...c0d9,head] /volumes/ auth v5067816 pv5067818 ap=1 f(v0 m2022-11-23T11:31:31.711158+0100 5235=5229+6) n(v377862 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3 145360=132068+13292)/n(v377861 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3 145360=132068+13292) old_inodes=25874 (inest lock w=1 dirty) (iversion lock w=1) | dirtyscattered=1 lock=2 dirfrag=1 dirtyrstat=0 dirty=1 authpin=1 0x5569ae8beb00]
1726078 2022-11-24T09:18:10.178+0100 7fe0347eb700 15 mds.0.journal try_to_expire committing [dir 0x100 ~mds0/ [2,head] auth pv=226292767 v=226292765 cv=0/0 dir_auth=0 ap=1+1 state=1610612737|complete f(v0 10=0+10) n(v497237 rc2022-11-23T17:09:36.965133+0100 b1041 14=3+11) hs=10+0,ss=0+0 dirty=10 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 authpin=1 0x5569a9e49a80]
1726079 2022-11-24T09:18:10.178+0100 7fe02ffe2700 10 mds.0.cache.dir(0x1) _omap_commit_ops

But it submitted two OSD requests:

The first one, which was an invalid request (note the empty op vector []):

1733304 2022-11-24T09:18:10.228+0100 7fe02ffe2700  1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] --> [v2:134.158.80.251:6820/1129259,v1:134.158.80.251:6821/1129259] -- osd_op(unknown.0.12135:45216 2.1f 2:ff5b34d6:::1.00000000:head [] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e62331) v8 -- 0x5569b7730800 con 0x5569aab94400   

And it got a -22 error back:

1862249 2022-11-24T09:18:11.481+0100 7fe0397f5700  1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] <== osd.20 v2:134.158.80.251:6820/1129259 1455 ==== osd_op_reply(45216 1.00000000 [] v0'0 uv0 ondisk = -22 ((22) Invalid argument)) v8 ==== 112+0+0 (crc 0 0 0) 0x5569af77ab40 con 0x5569aab94400    

And this finally forced the cephfs read-only:

2175076 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 MDSContext::complete: 18C_IO_Dir_Committed                                                        
2175077 2022-11-24T09:18:12.965+0100 7fe02ffe2700  1 mds.0.cache.dir(0x1) commit error -22 v 5067817
2175078 2022-11-24T09:18:12.965+0100 7fe035fee700  1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] <== client.4166749 v1:134.158.80.110:0/3015200204 732 ==== client_request(client.4166749:132710 lookup #0x1000002dc6e/1 2022-11-24T09:18:12.403544+0100 caller_uid=0, caller_gid=0{}) v4 ==== 137+0+0 (unknown 2206901006 0 0) 0x5569bce93340 con 0x5569b7730400
2175079 2022-11-24T09:18:12.965+0100 7fe02ffe2700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2175080 2022-11-24T09:18:12.965+0100 7fe02ffe2700 -1 mds.0.12135 unhandled write error (22) Invalid argument, force readonly...
2175081 2022-11-24T09:18:12.965+0100 7fe02ffe2700  1 mds.0.cache force file system read-only
2175082 2022-11-24T09:18:12.965+0100 7fe02ffe2700  0 log_channel(cluster) log [WRN] : force file system read-only
2175083 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 mds.0.server force_clients_readonly

The second one, which carried the actual omap payload:

1733315 2022-11-24T09:18:10.229+0100 7fe02ffe2700  1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] --> [v2:134.158.80.251:6820/1129259,v1:134.158.80.251:6821/1129259] -- osd_op(unknown.0.12135:45217 2.1f 2:ff5b34d6:::1.00000000:head [omap-set-header in=274b,omap-set-vals in=11928426b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e62331) v8 -- 0x5569b7731400 con 0x5569aab94400 

The 'volumes' CInode seems to include a large number of snapshots [...c0d9,head]:

1726077 2022-11-24T09:18:10.178+0100 7fe0347eb700 14 mds.0.cache.dir(0x1)  dn 'volumes' inode [inode 0x10000000000 [...c0d9,head] /volumes/ auth v5067816 pv5067818 ap=1 f(v0 m2022-11-23T11:31:31.711158+0100 5235=5229+6) n(v377862 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3 145360=132068+13292)/n(v377861 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3 145360=132068+13292) old_inodes=25874 (inest lock w=1 dirty) (iversion lock w=1) | dirtyscattered=1 lock=2 dirfrag=1 dirtyrstat=0 dirty=1 authpin=1 0x5569ae8beb00]

So the 'volumes' dentry itself could have a very large size when submitted to the metadata pool.

The MDS source code is buggy here.
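
The failure mode above can be illustrated with a simplified sketch. This is purely illustrative Python, not the real logic (which lives in the C++ _omap_commit_ops path seen in the log): the MDS batches a dirfrag's dirty omap entries into OSD requests bounded by a maximum commit size, and a batching loop that flushes before checking whether the current batch is empty can emit a request with no ops at all, consistent with the empty [] op vector in the first OSD request above.

```python
# Simplified sketch (assumption: not the real MDS code) of chunking a
# dirfrag's dirty omap entries into OSD requests bounded by a maximum
# commit size, the quantity mds_dir_max_commit_size controls.

MAX_COMMIT_SIZE = 10 * 1024 * 1024  # assumed 10 MiB cap for this sketch

def chunk_omap_entries(entries):
    """Group (key, value) pairs into batches whose total size stays
    under MAX_COMMIT_SIZE, without ever emitting an empty batch."""
    batches = []
    current, current_size = [], 0
    for key, value in entries:
        size = len(key) + len(value)
        # Flush only if the current batch is non-empty: flushing an
        # empty batch would produce an OSD request with no ops, which
        # the OSD rejects with -EINVAL (-22).
        if current and current_size + size > MAX_COMMIT_SIZE:
            batches.append(current)
            current, current_size = [], 0
        current.append((key, value))
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Note the `if current and ...` guard: a naive flush on size alone would emit an empty request whenever the very first entry already exceeds the cap, which is exactly the shape of failure the empty [] request above suggests.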

#3 Updated by Konstantin Shalygin 2 months ago

  • Status changed from New to Fix Under Review
  • Target version set to v18.0.0

#4 Updated by Xiubo Li 2 months ago

  • Assignee set to Xiubo Li

#5 Updated by Prayank Saxena 25 days ago

Hello Team,

We hit an issue similar to this 'mds read-only' one in Pacific 16.2.9, where one write commit failed and made the entire cluster read-only.

2022-12-12T04:55:12.095+0000 7f6dbfb56700 1 mds.0.cache.dir(0x10009535b89) commit error -22 v 8533429
2022-12-12T04:55:12.095+0000 7f6dbfb56700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x10009535b89 object, errno -22
2022-12-12T04:55:12.095+0000 7f6dbfb56700 -1 mds.0.29196 unhandled write error (22) Invalid argument, force readonly...

In our case, no pool was filling up, but no client I/O was visible in the ceph status output before the active MDS service was restarted; once we restarted the active MDS we saw the above logs.

Could you please let us know under what circumstances this issue is hit?
Also, let me know if any more information is required.

Thanks

#6 Updated by Prayank Saxena 25 days ago

Prayank Saxena wrote:

[...]

In our case, to get the cluster back to a healthy state, we marked the particular OSD that was failing the write commit for the '0x10009535b89' object out of the cluster and restarted the active MDS service; this brought client I/O back to normal.

#7 Updated by Xiubo Li 25 days ago

Prayank Saxena wrote:

[...]

I am sure this is the same issue, and it has been fixed. It's a bug in the MDS code.

#8 Updated by Prayank Saxena 24 days ago

Thanks, Xiubo Li, for the update.

We are currently facing a similar issue where client I/O is not visible in ceph status but is actually happening; we tried uploading and reading files and it works.

I am not seeing any error logs generated on any of the MDS nodes, and the subvolume listing command gives no output: 'ceph fs subvolume ls cephfs --group_name <group-name>'

although 'ceph fs dump' does give output:

ceph fs dump
      e29958
      enable_multiple, ever_enabled_multiple: 1,1
      default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
      legacy client fscid: 1

      Filesystem 'cephfs' (1)
      fs_name    cephfs
      epoch    29958
      flags    12
      created    2021-06-17T07:57:45.058546+0000
      modified    2023-01-05T00:58:52.048613+0000
      tableserver    0
      root    0
      session_timeout    60
      session_autoclose    300
      max_file_size    1099511627776
      required_client_features    {7=luminous}
      last_failure    0
      last_failure_osd_epoch    85648
      compat    compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
      max_mds    1
      in    0
      up    {0=201959591}
      failed
      damaged
      stopped
data_pools    [2]
metadata_pool    3
inline_data    disabled
balancer
standby_count_wanted    1
[mds.mds-1002.lsnopy{0:201959591} state up:active seq 515327 join_fscid=1 addr [IP-address:port] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds-1001.unghcl{-1:202000231} state up:standby seq 1 join_fscid=1 addr [IP-address:port] compat {c=[1],r=[1],i=[7ff]}]
[mds.mds-1003.dvqloi{-1:215936600} state up:standby seq 1 join_fscid=1 addr [IP-address:port] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 29958

I am sure that once we restart the active MDS service we will see the "write commit error" log mentioned in my previous comment.

Is there any alternative way to recover from such issues without restarting the active MDS service?

Thanks

#9 Updated by Xiubo Li 24 days ago

Prayank Saxena wrote:

Thanks Xiubo Li for the update

We are facing similar issue currently where client I/O is not visible in ceph status but its happening, we tried uploading and reading files and its working.

I am not seeing any error logs getting generated in any of mds nodes, also ceph fs command wrt subvolume is not giving any output. 'ceph fs subvolume ls cephfs --group_name <group-name>'

although 'ceph fs dump' is giving the output:

[...]

I am sure once we restart active mds service we will see "write commit error" log as mentioned in my previous comment.

Is there any alternate way to resolve from such issues without restarting active mds service?

As I remember, the last time others hit this issue I suggested cleaning up the journal and it worked. For more detail please see: https://www.spinics.net/lists/ceph-users/msg74790.html

Thanks

#10 Updated by Prayank Saxena 24 days ago

Okay, I see. Thanks, Xiubo Li.

I went through the link and found that resetting the journal and sessions resolved the issue for Adrien. I also went through the Ceph documentation for this (https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/), where it is mentioned that these tools must be run with extreme caution. Is it safe to perform this on a production setup, given that the documentation says data loss is possible?

Thanks

#11 Updated by Venky Shankar 17 days ago

Prayank Saxena wrote:

[...]

Resetting the journal can lead to data loss.

To work around this issue you can increase the omap commit size (courtesy of Patrick) in the MDS via:

ceph config set mds mds_dir_max_commit_size 80
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true
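
For context on why raising the value helps, here is some illustrative arithmetic (the exact split behavior inside the MDS is an assumption here, as is the default cap of 10 MB): mds_dir_max_commit_size caps the size, in MB, of a single dirfrag commit, and commits above the cap take the split path that misbehaved in this ticket. The omap payload in the log above was 11928426 bytes, just over a 10 MB cap, so raising the cap to 80 keeps the whole commit in one OSD request.

```python
import math

# Rough model (assumption, not the MDS's exact algorithm): a dirfrag
# commit of payload_bytes is split into ceil(payload / cap) OSD
# requests, where cap is mds_dir_max_commit_size expressed in bytes.
def num_commit_requests(payload_bytes, max_commit_size_mb):
    cap = max_commit_size_mb * 1024 * 1024
    return max(1, math.ceil(payload_bytes / cap))

payload = 11_928_426  # "omap-set-vals in=11928426b" from the log above
```

Under this model the payload needs two requests at a 10 MB cap but only one at 80 MB, so the workaround sidesteps the splitting code entirely.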
