Bug #58082
cephfs: filesystem became read only after Quincy upgrade
Status: Closed
Description
Copied from the ceph-users mailing list, reported by Adrien:
Hi,

We upgraded this morning a Pacific Ceph cluster to the latest Quincy version. The cluster was healthy before the upgrade, everything was done according to the upgrade procedure (non-cephadm) [1], and all services restarted correctly, but the filesystem switched to read-only mode when it became active.

HEALTH_WARN 1 MDSs are read only
[WRN] MDS_READ_ONLY: 1 MDSs are read only
    mds.cccephadm32(mds.0): MDS in read-only mode

This is the only warning we got on the cluster. In the MDS log, this error "failed to commit dir 0x1 object, errno -22" seems to be the root cause:

2022-11-23T12:41:09.843+0100 7f930f56d700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2022-11-23T12:41:09.843+0100 7f930f56d700 -1 mds.0.11963 unhandled write error (22) Invalid argument, force readonly...
2022-11-23T12:41:09.843+0100 7f930f56d700  1 mds.0.cache force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700  0 log_channel(cluster) log [WRN] : force file system read-only
2022-11-23T12:41:09.843+0100 7f930f56d700 10 mds.0.server force_clients_readonly

I couldn't get more info with ceph config set mds.x debug_mds 20.

ceph fs status:

cephfs - 17 clients
======
RANK  STATE      MDS        ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active  cccephadm32  Reqs: 0 /s   12.9k  12.8k   673    1538
      POOL         TYPE     USED  AVAIL
cephfs_metadata  metadata   513G  48.6T
  cephfs_data      data    2558M  48.6T
 cephfs_data2      data     471G  48.6T
 cephfs_data3      data     433G  48.6T
STANDBY MDS
cccephadm30
cccephadm31
MDS version: ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)

Any idea what could go wrong and how to solve it before starting a disaster recovery procedure?

Cheers,
Adrien
This bug looks very similar to an issue opened last year and closed without any solution: https://tracker.ceph.com/issues/52260
Hi Xiubo,

We did the upgrade in rolling mode as always, with only a few Kubernetes pods as clients accessing their PVCs on CephFS. I can reproduce the problem every time I restart the MDS daemon. You can find the MDS log with debug_mds 25 and debug_ms 1 here: https://filesender.renater.fr/?s=download&token=4b413a71-480c-4c1a-b80a-7c9984e4decd

(The last timestamp: 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 mds.0.server force_clients_readonly)

I couldn't find any errors in the OSD logs; is there anything specific I should look for?

Best,
Adrien
Updated by Xiubo Li over 1 year ago
From the logs, dir(0x1) submits the 'volumes' dentry to the metadata pool:
1726069 2022-11-24T09:18:10.177+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) commit want 0 on [dir 0x1 / [2,head] auth pv=5067819 v=5067817 cv=0/0 dir_auth=0 ap=2+1 state=1610612800|fetching f(v0 m2022-02-14T14:08:22.630497+0100 1=0+1) n(v203259 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=1 authpin=1 0x5569a9e49600]
1726070 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) auth_pin by 0x5569a9e49600 on [dir 0x1 / [2,head] auth pv=5067819 v=5067817 cv=0/0 dir_auth=0 ap=3+1 state=1610612800|fetching f(v0 m2022-02-14T14:08:22.630497+0100 1=0+1) n(v203259 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=1 authpin=1 0x5569a9e49600] count now 3
1726071 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) _commit want 5067817 on [dir 0x1 / [2,head] auth pv=5067819 v=5067817 cv=0/0 dir_auth=0 ap=3+1 state=1610612800|fetching f(v0 m2022-02-14T14:08:22.630497+0100 1=0+1) n(v203259 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292) hs=1+0,ss=0+0 dirty=1 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=1 authpin=1 0x5569a9e49600]
1726072 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) marking committing
1726073 2022-11-24T09:18:10.178+0100 7fe0347eb700 20 mds.0.bal hit_dir 4 pop is 1, frag * size 1 [pop IRD:[C 0.00e+00] IWR:[C 0.00e+00] RDR:[C 0.00e+00] FET:[C 9.82e-01] STR:[C 1.00e+00] *LOAD:6.0]
1726074
1726075 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) _omap_commit
1726076 2022-11-24T09:18:10.178+0100 7fe0347eb700 10 mds.0.cache.dir(0x1) set volumes [dentry #0x1/volumes [2,head] auth (dversion lock) pv=5067818 v=5067816 ino=0x10000000000 state=1610612736 | inodepin=1 dirty=1 0x5569b24b2280]
1726077 2022-11-24T09:18:10.178+0100 7fe0347eb700 14 mds.0.cache.dir(0x1) dn 'volumes' inode [inode 0x10000000000 [...c0d9,head] /volumes/ auth v5067816 pv5067818 ap=1 f(v0 m2022-11-23T11:31:31.711158+0100 5235=5229+6) n(v377862 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292)/n(v377861 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292) old_inodes=25874 (inest lock w=1 dirty) (iversion lock w=1) | dirtyscattered=1 lock=2 dirfrag=1 dirtyrstat=0 dirty=1 authpin=1 0x5569ae8beb00]
1726078 2022-11-24T09:18:10.178+0100 7fe0347eb700 15 mds.0.journal try_to_expire committing [dir 0x100 ~mds0/ [2,head] auth pv=226292767 v=226292765 cv=0/0 dir_auth=0 ap=1+1 state=1610612737|complete f(v0 10=0+10) n(v497237 rc2022-11-23T17:09:36.965133+0100 b1041 14=3+11) hs=10+0,ss=0+0 dirty=10 | child=1 subtree=1 subtreetemp=0 dirty=1 waiter=0 authpin=1 0x5569a9e49a80]
1726079 2022-11-24T09:18:10.178+0100 7fe02ffe2700 10 mds.0.cache.dir(0x1) _omap_commit_ops
But it submitted two OSD requests.

The first one, which was an invalidate request:
1733304 2022-11-24T09:18:10.228+0100 7fe02ffe2700 1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] --> [v2:134.158.80.251:6820/1129259,v1:134.158.80.251:6821/1129259] -- osd_op(unknown.0.12135:45216 2.1f 2:ff5b34d6:::1.00000000:head [] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e62331) v8 -- 0x5569b7730800 con 0x5569aab94400
And this request, whose ops vector is empty ([] in the line above), gets a -22 (EINVAL) error back:
1862249 2022-11-24T09:18:11.481+0100 7fe0397f5700 1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] <== osd.20 v2:134.158.80.251:6820/1129259 1455 ==== osd_op_reply(45216 1.00000000 [] v0'0 uv0 ondisk = -22 ((22) Invalid argument)) v8 ==== 112+0+0 (crc 0 0 0) 0x5569af77ab40 con 0x5569aab94400
And finally this makes the CephFS read-only:
2175076 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 MDSContext::complete: 18C_IO_Dir_Committed
2175077 2022-11-24T09:18:12.965+0100 7fe02ffe2700  1 mds.0.cache.dir(0x1) commit error -22 v 5067817
2175078 2022-11-24T09:18:12.965+0100 7fe035fee700  1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] <== client.4166749 v1:134.158.80.110:0/3015200204 732 ==== client_request(client.4166749:132710 lookup #0x1000002dc6e/1 2022-11-24T09:18:12.403544+0100 caller_uid=0, caller_gid=0{}) v4 ==== 137+0+0 (unknown 2206901006 0 0) 0x5569bce93340 con 0x5569b7730400
2175079 2022-11-24T09:18:12.965+0100 7fe02ffe2700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x1 object, errno -22
2175080 2022-11-24T09:18:12.965+0100 7fe02ffe2700 -1 mds.0.12135 unhandled write error (22) Invalid argument, force readonly...
2175081 2022-11-24T09:18:12.965+0100 7fe02ffe2700  1 mds.0.cache force file system read-only
2175082 2022-11-24T09:18:12.965+0100 7fe02ffe2700  0 log_channel(cluster) log [WRN] : force file system read-only
2175083 2022-11-24T09:18:12.965+0100 7fe02ffe2700 10 mds.0.server force_clients_readonly
The second one:
1733315 2022-11-24T09:18:10.229+0100 7fe02ffe2700 1 -- [v2:134.158.80.254:6800/3072911038,v1:134.158.80.254:6801/3072911038] --> [v2:134.158.80.251:6820/1129259,v1:134.158.80.251:6821/1129259] -- osd_op(unknown.0.12135:45217 2.1f 2:ff5b34d6:::1.00000000:head [omap-set-header in=274b,omap-set-vals in=11928426b] snapc 0=[] ondisk+write+known_if_redirected+full_force+supports_pool_eio e62331) v8 -- 0x5569b7731400 con 0x5569aab94400
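As a side note, the size of that omap-set-vals payload can be checked against the MDS per-commit cap. This is only an illustrative sketch: the 10 MiB default for mds_dir_max_commit_size, and its MiB unit, are assumptions on my part, not something stated in the logs.

```python
# Payload size taken from the osd_op line above (omap-set-vals in=11928426b).
payload_bytes = 11_928_426

# Assumed default of mds_dir_max_commit_size (10, interpreted as MiB).
default_cap_bytes = 10 * 1024 * 1024

print(f"payload ~ {payload_bytes / (1024 * 1024):.1f} MiB")  # ~ 11.4 MiB
print(payload_bytes > default_cap_bytes)                     # True
```

So the single omap update for dir 0x1 is larger than the assumed per-commit cap, which is consistent with the commit taking the request-splitting path.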
The 'volumes' CInode seems to include a large number of snapshots ([...c0d9,head]):
1726077 2022-11-24T09:18:10.178+0100 7fe0347eb700 14 mds.0.cache.dir(0x1) dn 'volumes' inode [inode 0x10000000000 [...c0d9,head] /volumes/ auth v5067816 pv5067818 ap=1 f(v0 m2022-11-23T11:31:31.711158+0100 5235=5229+6) n(v377862 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292)/n(v377861 rc2022-11-23T17:09:36.965133+0100 b324608642500 rs3145360=132068+13292) old_inodes=25874 (inest lock w=1 dirty) (iversion lock w=1) | dirtyscattered=1 lock=2 dirfrag=1 dirtyrstat=0 dirty=1 authpin=1 0x5569ae8beb00]
So the 'volumes' entry itself could have a very large size when being submitted to the metadata pool.

The MDS source code is buggy here.
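For illustration only, the intended behavior (splitting a large dirfrag update into several size-capped omap commits rather than one oversized request) can be sketched as below. The function and parameter names (batch_omap_updates, max_bytes) are hypothetical; the real logic is C++ in the MDS (_omap_commit / _omap_commit_ops, visible in the log above), not this code.

```python
def batch_omap_updates(entries, max_bytes):
    """Split (key, value) omap updates into batches whose combined
    size stays under max_bytes, so no single OSD write is oversized.

    Hypothetical sketch of the technique, not the actual MDS code.
    """
    batches, current, size = [], [], 0
    for key, value in entries:
        entry_size = len(key) + len(value)
        # Close the current batch before it would exceed the cap.
        if current and size + entry_size > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append((key, value))
        size += entry_size
    if current:
        batches.append(current)
    return batches


# Roughly the shape of the failing commit: ~12 MB of dentry/old-inode
# entries split under a 10 MiB cap ends up as two OSD writes.
entries = [(f"volumes_{i}_head", b"x" * 1024) for i in range(11650)]
batches = batch_omap_updates(entries, max_bytes=10 * 1024 * 1024)
```

The key property is that every batch stays under the cap, so the bug class seen here (a single commit producing an invalid or oversized OSD request) cannot occur for well-formed entries.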
Updated by Konstantin Shalygin over 1 year ago
- Status changed from New to Fix Under Review
- Target version set to v18.0.0
Updated by Prayank Saxena over 1 year ago
Hello Team,
We hit an issue similar to this 'MDS read-only' one in Pacific 16.2.9, where one failed write commit made the entire cluster read-only.
2022-12-12T04:55:12.095+0000 7f6dbfb56700 1 mds.0.cache.dir(0x10009535b89) commit error -22 v 8533429
2022-12-12T04:55:12.095+0000 7f6dbfb56700 -1 log_channel(cluster) log [ERR] : failed to commit dir 0x10009535b89 object, errno -22
2022-12-12T04:55:12.095+0000 7f6dbfb56700 -1 mds.0.29196 unhandled write error (22) Invalid argument, force readonly...
In our case, we did not see any pool filling up, but there was no client I/O visible in the ceph status output before restarting the active MDS service; once we restarted the active MDS we saw the above logs.
Could you please let us know under what circumstances this issue is hit?
Also, let me know if any more information is required.
Thanks
Updated by Prayank Saxena over 1 year ago
Prayank Saxena wrote:
[...]
In our case, to bring the cluster back to a healthy state, we marked the particular OSD that was failing the write commit for the '0x10009535b89' object out of the cluster and restarted the active MDS service; this brought client I/O back to normal.
Updated by Xiubo Li over 1 year ago
Prayank Saxena wrote:
[...]
I am sure this is the same issue, and it has been fixed. The bug is in the MDS code.
Updated by Prayank Saxena over 1 year ago
Thanks Xiubo Li for the update
We are currently facing a similar issue where client I/O is not visible in ceph status but is actually happening; we tried uploading and reading files and it works.
I am not seeing any error logs generated on any of the MDS nodes. Also, the subvolume listing command gives no output: 'ceph fs subvolume ls cephfs --group_name <group-name>'
However, 'ceph fs dump' does give output:
ceph fs dump
e29958
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch 29958
flags 12
created 2021-06-17T07:57:45.058546+0000
modified 2023-01-05T00:58:52.048613+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {7=luminous}
last_failure 0
last_failure_osd_epoch 85648
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=201959591}
failed
damaged
stopped
data_pools [2]
metadata_pool 3
inline_data disabled
balancer
standby_count_wanted 1
[mds.mds-1002.lsnopy{0:201959591} state up:active seq 515327 join_fscid=1 addr [IP-address:port] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.mds-1001.unghcl{-1:202000231} state up:standby seq 1 join_fscid=1 addr [IP-address:port] compat {c=[1],r=[1],i=[7ff]}]
[mds.mds-1003.dvqloi{-1:215936600} state up:standby seq 1 join_fscid=1 addr [IP-address:port] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 29958
I am sure that once we restart the active MDS service we will see the "write commit error" log mentioned in my previous comment.
Is there any alternate way to recover from such issues without restarting the active MDS service?
Thanks
Updated by Xiubo Li over 1 year ago
Prayank Saxena wrote:
[...]
Is there any alternate way to recover from such issues without restarting the active MDS service?
As I remember, the last time others hit this issue I suggested cleaning up the journal and it worked; for more detail please see: https://www.spinics.net/lists/ceph-users/msg74790.html
Thanks
Updated by Prayank Saxena over 1 year ago
Okay, I see. Thanks, Xiubo Li.
I was going through the link and found that resetting the journal and sessions resolved the issue for Adrien. I was also going through the Ceph documentation for this (https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/), where it is mentioned that these tools must be run with extreme caution. Is it safe to perform on a production setup, given that the documentation mentions possible data loss?
Thanks
Updated by Venky Shankar over 1 year ago
Prayank Saxena wrote:
[...]
Resetting the journal can lead to data loss.
To work around this issue you can increase the omap commit size in the MDS (courtesy Patrick) via:
ceph config set mds mds_dir_max_commit_size 80
ceph fs fail <fs_name>
ceph fs set <fs_name> joinable true
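For context, this is my inference from the analysis above rather than something stated in the workaround itself: with the cap raised to 80, the ~11.9 MB dir 0x1 payload fits in a single commit, so the MDS should avoid the multi-request splitting path that produced the rejected request. A quick sanity check of the numbers (the MiB unit for mds_dir_max_commit_size is an assumption):

```python
payload_bytes = 11_928_426           # omap-set-vals size from the MDS log
raised_cap_bytes = 80 * 1024 * 1024  # mds_dir_max_commit_size 80 (MiB assumed)

# With the raised cap, the whole dirfrag update fits in one OSD write.
print(payload_bytes < raised_cap_bytes)  # True
```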
Updated by Venky Shankar about 1 year ago
- Category set to Correctness/Safety
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot about 1 year ago
- Copied to Backport #58608: pacific: cephfs:filesystem became read only after Quincy upgrade added
Updated by Backport Bot about 1 year ago
- Copied to Backport #58609: quincy: cephfs:filesystem became read only after Quincy upgrade added
Updated by Konstantin Shalygin about 1 year ago
- Status changed from Pending Backport to Resolved
Updated by Patrick Donnelly about 1 month ago
- Has duplicate Bug #52260: 1 MDSs are read only | pacific 16.2.5 added