Bug #63259
mds: failed to store backtrace and force file system read-only
Description
From teuthology file:
2023-10-17T00:51:05.801 INFO:tasks.workunit.client.0.smithi114.stderr:+ set -e
2023-10-17T00:51:05.802 INFO:tasks.workunit.client.0.smithi114.stderr:+ sudo rsync -av /tmp/multiple_rsync_payload.271406 payload.1
2023-10-17T00:51:05.855 INFO:tasks.workunit.client.0.smithi114.stdout:sending incremental file list
2023-10-17T00:51:05.865 INFO:tasks.workunit.client.0.smithi114.stderr:rsync: mkdir "/home/ubuntu/cephtest/mnt.0/client.0/tmp/payload.1" failed: Read-only file system (30)
2023-10-17T00:51:05.866 INFO:tasks.workunit.client.0.smithi114.stderr:rsync error: error in file IO (code 11) at main.c(664) [Receiver=3.1.3]
2023-10-17T00:51:05.869 DEBUG:teuthology.orchestra.run:got remote process result: 11
2023-10-17T00:51:05.966 INFO:tasks.workunit:Stopping ['fs/misc'] on client.0...
And from mds.b:
2023-10-17T00:50:12.583+0000 7fa035298700 10 mds.0.cache.ino(0x1000000c567) clear_dirty_parent
2023-10-17T00:50:12.583+0000 7fa035298700  1 mds.0.cache.ino(0x1000000c566) store backtrace error -2 v 36221
2023-10-17T00:50:12.583+0000 7fa035298700 -1 log_channel(cluster) log [ERR] : failed to store backtrace on ino 0x1000000c566 object, pool 2, errno -2
2023-10-17T00:50:12.583+0000 7fa035298700 -1 mds.0.14 unhandled write error (2) No such file or directory, force readonly...
2023-10-17T00:50:12.583+0000 7fa035298700  1 mds.0.cache force file system read-only
2023-10-17T00:50:12.583+0000 7fa035298700  0 log_channel(cluster) log [WRN] : force file system read-only
2023-10-17T00:50:12.583+0000 7fa035298700 10 mds.0.server force_clients_readonly
2023-10-17T00:50:12.583+0000 7fa035298700 10 mds.0.14 send_message_client client.15078 192.168.0.1:0/1906848847 client_session(force_ro) v5
2023-10-17T00:50:12.583+0000 7fa035298700  1 -- [v2:172.21.15.139:6838/1590276002,v1:172.21.15.139:6839/1590276002] --> 192.168.0.1:0/1906848847 -- client_session(force_ro) v5 -- 0x5630e2d58e00 con 0x5630ce089800
2023-10-17T00:50:12.583+0000 7fa035298700 10 mds.0.locker eval 3648 [inode 0x1000000c568 [...2,head] /volumes/_nogroup/sv_1/327a5477-e5c2-4ade-b24e-f477c29c079e/client.0/tmp/ auth v828 ap=1 DIRTYPARENT f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) caps={15078=pAsxLsXsxFsx/-@1},l=15078 | request=0 dirfrag=0 caps=1 dirtyparent=1 dirty=0 authpin=1 0x5630dadb6580]
2023-10-17T00:50:12.583+0000 7fa035298700 10 mds.0.locker eval want loner: client.-1 but failed to set it
2023-10-17T00:50:12.583+0000 7fa035298700  7 mds.0.locker file_eval wanted= loner_wanted= other_wanted= filelock=(ifile excl) on [inode 0x1000000c568 [...2,head] /volumes/_nogroup/sv_1/327a5477-e5c2-4ade-b24e-f477c29c079e/client.0/tmp/ auth v828 ap=1 DIRTYPARENT f() n(v0 1=0+1) (iauth excl) (inest lock) (ifile excl) (ixattr excl) (iversion lock) caps={15078=pAsxLsXsxFsx/-@1},l=15078(-1) | request=0 dirfrag=0 caps=1 dirtyparent=1 dirty=0 authpin=1 0x5630dadb6580]
The MDS failed to store the backtrace and forced the CephFS to read-only.
Updated by Milind Changire 6 months ago
- Assignee set to Kotresh Hiremath Ravishankar
Updated by Venky Shankar 6 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Target version set to v19.0.0
- Backport set to quincy,reef
- Component(FS) MDS added
Updated by Kotresh Hiremath Ravishankar 6 months ago
Hi Xiubo,
The logs for the job link in the description do not match the log snippet you provided.
I see the job has failed with the following Traceback:
2023-10-17T00:24:07.651 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: 2023-10-17T00:24:07.359+0000 7f3f16afe700 -1 log_channel(cephadm) log [ERR] : Can't communicate with remote host `172.21.15.70`, possibly because the host is not reachable or python3 is not installed on the host. [Errno 113] Connect call failed ('172.21.15.70', 22)
2023-10-17T00:24:07.651 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: Traceback (most recent call last):
2023-10-17T00:24:07.651 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 122, in redirect_log
2023-10-17T00:24:07.651 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: yield
2023-10-17T00:24:07.652 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/usr/share/ceph/mgr/cephadm/ssh.py", line 101, in _remote_connection
2023-10-17T00:24:07.652 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: preferred_auth=['publickey'], options=ssh_options)
2023-10-17T00:24:07.652 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 6804, in connect
2023-10-17T00:24:07.652 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: 'Opening SSH connection to')
2023-10-17T00:24:07.652 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/lib/python3.6/site-packages/asyncssh/connection.py", line 299, in _connect
2023-10-17T00:24:07.653 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: local_addr=local_addr)
2023-10-17T00:24:07.653 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/lib64/python3.6/asyncio/base_events.py", line 794, in create_connection
2023-10-17T00:24:07.653 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: raise exceptions[0]
2023-10-17T00:24:07.653 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/lib64/python3.6/asyncio/base_events.py", line 781, in create_connection
2023-10-17T00:24:07.653 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: yield from self.sock_connect(sock, address)
2023-10-17T00:24:07.653 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/lib64/python3.6/asyncio/selector_events.py", line 439, in sock_connect
2023-10-17T00:24:07.654 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: return (yield from fut)
2023-10-17T00:24:07.654 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: File "/lib64/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
2023-10-17T00:24:07.654 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: raise OSError(err, 'Connect call failed %s' % (address,))
2023-10-17T00:24:07.654 INFO:journalctl@ceph.mgr.z.smithi177.stdout:Oct 17 00:24:07 smithi177 ceph-b278b73a-6c81-11ee-8db6-212e2dc638e7-mgr-z[105683]: OSError: [Errno 113] Connect call failed ('172.21.15.70', 22)
And 25% of the PGs are degraded:
2023-10-17T00:23:52.324 INFO:journalctl@ceph.mon.b.smithi079.stdout:Oct 17 00:23:52 smithi079 ceph-mon[103382]: pgmap v4: 129 pgs: 3 down, 32 active+clean, 40 active+undersized, 31 undersized+peered, 1 unknown, 16 active+undersized+degraded, 6 undersized+degraded+peered; 20 MiB data, 356 MiB used, 715 GiB / 715 GiB avail; 59/227 objects degraded (25.991%)
2023-10-17T00:23:52.324 INFO:journalctl@ceph.mon.b.smithi079.stdout:Oct 17 00:23:52 smithi079 ceph-mon[103382]: Health check failed: Reduced data availability: 12 pgs inactive, 3 pgs down (PG_AVAILABILITY)
2023-10-17T00:23:52.325 INFO:journalctl@ceph.mon.b.smithi079.stdout:Oct 17 00:23:52 smithi079 ceph-mon[103382]: Health check failed: Degraded data redundancy: 59/227 objects degraded (25.991%), 22 pgs degraded (PG_DEGRADED)
2023-10-17T00:23:52.325 INFO:journalctl@ceph.mon.b.smithi079.stdout:Oct 17 00:23:52 smithi079 ceph-mon[103382]: mgrmap e29: z(active, since 2s), standbys: y
2023-10-17T00:23:52.401 INFO:journalctl@ceph.mon.c.smithi177.stdout:Oct 17 00:23:52 smithi177 ceph-mon[103807]: pgmap v4: 129 pgs: 3 down, 32 active+clean, 40 active+undersized, 31 undersized+peered, 1 unknown, 16 active+undersized+degraded, 6 undersized+degraded+peered; 20 MiB data, 356 MiB used, 715 GiB / 715 GiB avail; 59/227 objects degraded (25.991%)
And I also see the following on `smithi070`
2023-10-17T00:19:19.766315+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered blocking state
2023-10-17T00:19:19.766380+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered disabled state
2023-10-17T00:19:19.766415+00:00 smithi070 kernel: device brx.0 entered promiscuous mode
2023-10-17T00:19:19.776687+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered blocking state
2023-10-17T00:19:19.776728+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered forwarding state
2023-10-17T00:20:43.463574+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered disabled state
2023-10-17T00:20:43.474848+00:00 smithi070 kernel: device brx.0 left promiscuous mode
2023-10-17T00:20:43.474898+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered disabled state
2023-10-17T00:20:46.599641+00:00 smithi070 kernel: IPv6: ADDRCONF(NETDEV_UP): veth0: link is not ready
2023-10-17T00:20:46.733074+00:00 smithi070 kernel: IPv6: ADDRCONF(NETDEV_UP): brx.0: link is not ready
2023-10-17T00:20:46.733124+00:00 smithi070 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): brx.0: link becomes ready
2023-10-17T00:20:46.733145+00:00 smithi070 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
2023-10-17T00:20:46.770505+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered blocking state
2023-10-17T00:20:46.770583+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered disabled state
2023-10-17T00:20:46.770609+00:00 smithi070 kernel: device brx.0 entered promiscuous mode
2023-10-17T00:20:46.782244+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered blocking state
2023-10-17T00:20:46.782290+00:00 smithi070 kernel: ceph-brx: port 1(brx.0) entered forwarding state
2023-10-17T00:20:46.992612+00:00 smithi070 kernel: Key type dns_resolver registered
2023-10-17T00:20:47.020613+00:00 smithi070 kernel: Key type ceph registered
2023-10-17T00:20:47.027618+00:00 smithi070 kernel: libceph: loaded (mon/osd proto 15/24)
2023-10-17T00:20:47.069320+00:00 smithi070 kernel: ceph: loaded (mds proto 32)
2023-10-17T00:20:47.092579+00:00 smithi070 kernel: ceph: device name is missing path (no : separator in 0@b278b73a-6c81-11ee-8db6-212e2dc638e7.cephfs=/volumes/_nogroup/sv_1/01bcc01b-872b-44d5-ae90-55cda190fd63)
2023-10-17T00:20:47.101591+00:00 smithi070 kernel: libceph: mon1 (1)172.21.15.79:6789 session established
2023-10-17T00:20:47.109585+00:00 smithi070 kernel: libceph: client25127 fsid b278b73a-6c81-11ee-8db6-212e2dc638e7
2023-10-17T00:20:47.117605+00:00 smithi070 kernel: ceph: mds1 session blocklisted
2023-10-17T00:20:47.172627+00:00 smithi070 kernel: ceph: mds0 session blocklisted
Could you please double-check?
Updated by Xiubo Li 6 months ago
Kotresh Hiremath Ravishankar wrote:
Hi Xiubo,
The logs for the job link in the description do not match the log snippet you provided.
I see the job has failed with the following Traceback:
[...]
And 25% of the PGs are degraded:
[...]
And I also see the following on `smithi070`
[...]
Could you please double-check?
Hi Kotresh,
I noticed this before, but it happened around 30 minutes before the backtrace failure, and during that period I didn't see any test failures, so I assumed it wasn't directly related to the rsync failure. Maybe the above logs were expected for this teuthology test.
We need to find out what caused the backtrace storing failure; maybe it's related to the degraded PGs issue.
Updated by Xiubo Li 6 months ago
Sorry, it should be this link https://pulpito.ceph.com/yuriw-2023-10-16_14:43:00-fs-wip-yuri4-testing-2023-10-11-0735-reef-distro-default-smithi/7429561/.
Updated by Venky Shankar 4 months ago
- Assignee changed from Kotresh Hiremath Ravishankar to Venky Shankar
- Priority changed from Normal to High
- Severity changed from 3 - minor to 2 - major
Reproduced again in main branch integration run: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511468/
Kotresh, I'm taking this one.
Updated by Venky Shankar 4 months ago
Venky Shankar wrote:
Reproduced again in main branch integration run: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511468/
The backtrace update failure in this run is:
2024-01-11T17:50:32.510 INFO:journalctl@ceph.mds.b.smithi102.stdout:Jan 11 17:50:32 smithi102 ceph-aab47ca4-b0a7-11ee-95ab-87774f69a715-mds-b[72601]: 2024-01-11T17:50:32.190+0000 7effb2392700 -1 log_channel(cluster) log [ERR] : failed to store backtrace on ino 0x100000060f5 object, pool 2, errno -2
2024-01-11T17:50:32.510 INFO:journalctl@ceph.mds.b.smithi102.stdout:Jan 11 17:50:32 smithi102 ceph-aab47ca4-b0a7-11ee-95ab-87774f69a715-mds-b[72601]: 2024-01-11T17:50:32.190+0000 7effb2392700 -1 mds.0.14 unhandled write error (2) No such file or directory, force readonly...
2024-01-11T17:50:32.510 INFO:journalctl@ceph.mds.b.smithi102.stdout:Jan 11 17:50:32 smithi102 ceph-aab47ca4-b0a7-11ee-95ab-87774f69a715-mds-b[72601]: 2024-01-11T17:50:32.191+0000 7effb2392700 -1 log_channel(cluster) log [ERR] : failed to store backtrace on ino 0x100000060f7 object, pool 2, errno -2
Inode 0x100000060f5 is a directory, so the backtrace update goes to the metadata pool.
The OSD (osd.5) throws the following error:
2024-01-11T17:50:32.177+0000 7f9db5e1b700 15 bluestore(/var/lib/ceph/osd/ceph-5) getattr 2.d_head #2:b330f730:::100000060f5.00000000:head# _
2024-01-11T17:50:32.177+0000 7f9db5e1b700 20 bluestore(/var/lib/ceph/osd/ceph-5).collection(2.d_head 0x55dcc713c1e0) get_onode oid #2:b330f730:::100000060f5.00000000:head# key 0x7F8000000000000002B330F7'0!100000060f5.00000000!='0xFFFFFFFFFFFFFFFEFFFFFFFFFFFFFFFF6F
2024-01-11T17:50:32.177+0000 7f9db3616700 10 osd.5 pg_epoch: 93 pg[2.12( v 89'418 (0'0,89'418] local-lis/les=78/79 n=6 ec=78/78 lis/c=78/78 les/c/f=79/79/0 sis=78) [5,3,7] r=0 lpr=78 crt=89'418 lcod 89'417 mlcod 89'417 active+clean] final snapset 0=[]:{} in 2:4bec3886:::10000000002.00000000:head
2024-01-11T17:50:32.177+0000 7f9db3616700 20 osd.5 pg_epoch: 93 pg[2.12( v 89'418 (0'0,89'418] local-lis/les=78/79 n=6 ec=78/78 lis/c=78/78 les/c/f=79/79/0 sis=78) [5,3,7] r=0 lpr=78 crt=89'418 lcod 89'417 mlcod 89'417 active+clean] finish_ctx object 2:4bec3886:::10000000002.00000000:head marks clean_regions clean_offsets: [(0, 18446744073709551615)], clean_omap: false, new_object: false
2024-01-11T17:50:32.177+0000 7f9db5e1b700 20 bluestore(/var/lib/ceph/osd/ceph-5).collection(2.d_head 0x55dcc713c1e0) r -2 v.len 0
2024-01-11T17:50:32.177+0000 7f9db5e1b700 10 bluestore(/var/lib/ceph/osd/ceph-5) getattr 2.d_head #2:b330f730:::100000060f5.00000000:head# _ = -2
which means it failed to get the onode for the object, which happens in BlueStore.cc::get_onode():
  bufferlist v;
  int r = -ENOENT;
  Onode *on;
  if (!is_createop) {
    r = store->db->get(PREFIX_OBJ, key.c_str(), key.size(), &v);
    ldout(store->cct, 20) << " r " << r << " v.len " << v.length() << dendl;
  }
  if (v.length() == 0) {
    ceph_assert(r == -ENOENT);
    if (!create)
      return OnodeRef();
  } else {
    ceph_assert(r >= 0);
  }
I.e., an empty OnodeRef() is returned to the caller (getattr), which returns -ENOENT:
  OnodeRef o = c->get_onode(oid, false);
  if (!o || !o->exists) {
    r = -ENOENT;
    goto out;
  }
RADOS is unable to find the object, and the backtrace update operation is composed as follows:
void CInodeCommitOperation::update(ObjectOperation &op, inode_backtrace_t &bt) {
  using ceph::encode;

  op.priority = priority;
  op.create(false);

  bufferlist parent_bl;
  encode(bt, parent_bl);
  op.setxattr("parent", parent_bl);
I.e., with op.create(false), the object is expected to exist - which is correct. But for some reason, the directory object is missing o_O
Updated by Venky Shankar 4 months ago
Venky Shankar wrote:
Venky Shankar wrote:
Reproduced again in main branch integration run: https://pulpito.ceph.com/vshankar-2024-01-10_15:00:23-fs-wip-vshankar-testing-20240103.072409-1-testing-default-smithi/7511468/
The backtrace update failure in this run is:
[...]
Inode 0x100000060f5 is a directory, so the backtrace update goes to the metadata pool.
The OSD (osd.5) throws the following error:
[...]
which means it failed to get the onode for the object, which happens in BlueStore.cc::get_onode():
[...]
I.e., an OnodeRef() is returned to the caller (getattr) which returns back -ENOENT:
[...]
RADOS is unable to find the object and since the backtrace updation operation is done as follows:
[...]
I.e., with op.create(false), the object is expected to exist - which is correct. But for some reason, the directory object is missing o_O
I might be misreading this - exclusive=false implies that the operation continues even if the object exists.
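The create(exclusive) distinction can be illustrated with a toy in-memory model (this is not the librados implementation; ToyStore and its return conventions are purely illustrative): with exclusive=true the op fails with -EEXIST on an existing object, while create(false) creates a missing object and is a no-op on an existing one, so create(false) alone can never produce -ENOENT.

```cpp
#include <cassert>
#include <cerrno>
#include <set>
#include <string>

// Toy model of ObjectOperation::create(bool exclusive) semantics (NOT the
// librados implementation). exclusive=true: fail with -EEXIST if the
// object already exists. exclusive=false: create a missing object,
// succeed silently on an existing one.
struct ToyStore {
    std::set<std::string> objects;

    int create(const std::string& oid, bool exclusive) {
        if (objects.count(oid))
            return exclusive ? -EEXIST : 0;  // exists: only exclusive fails
        objects.insert(oid);                 // missing: create it
        return 0;
    }
};
```

This matches the correction above: the -ENOENT must come from somewhere other than the create step itself.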
Updated by Venky Shankar 4 months ago
I'll continue debugging this tomorrow given that the "-2" from rados is now likely not the actual problem.
Updated by Venky Shankar 3 months ago
Couldn't get to this today. Will continue tomorrow.
Updated by Venky Shankar 3 months ago
So, the issue is that one of the commit operations in the set of backtrace updates failed with ENOENT: the previous test added a data pool, created a file, deleted it, and then removed the data pool. The mdlog still had a reference to the (now gone) data pool, for which the backtrace update fails (pool nuked). Since the commit ops use C_Gather, the error from one failed commit op trickles to every other commit op in the set.
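The error propagation described above can be sketched with a minimal gather context (a simplification for illustration, not Ceph's C_GatherBuilder; MiniGather is an invented name): each sub-op reports its return value, any failure overwrites the stored result, and the final callback sees that error even though the other sub-ops succeeded.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>
#include <utility>

// Minimal sketch of gather-style completion: N sub-ops share one final
// callback; a negative rval from any sub-op taints the final result.
class MiniGather {
    int pending_;
    int result_ = 0;
    std::function<void(int)> onfinish_;
public:
    MiniGather(int n, std::function<void(int)> onfinish)
        : pending_(n), onfinish_(std::move(onfinish)) {}

    void sub_finish(int rval) {
        if (rval < 0)
            result_ = rval;          // one failure taints the whole set
        if (--pending_ == 0)
            onfinish_(result_);      // final callback gets the error
    }
};
```

With three sub-ops where only the write to the removed data pool fails, the final callback still receives -ENOENT, which is exactly how a single stale pool reference can force the whole file system read-only.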
Updated by Venky Shankar 3 months ago
One way to solve this would be to "split" the backtrace commit operation based on file or directory. File backtrace updates go to the data pool (which can be removed) and directory backtrace updates go to the metadata pool - which technically can be removed too, but if users choose to shoot themselves in the foot, then let them :)
But that does not totally avoid the problem, since files can have different layouts and the errno can then trickle to the commit ops for which the data pool exists. So, we could split the backtrace commit based on the pool-id, but maybe that's too much to solve this.
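The per-pool split could be sketched as follows (a hypothetical design sketch, not code from the Ceph tree; BtOp and commit_by_pool are invented names): bucket the backtrace ops by target pool so that an error in one pool's batch cannot taint batches for pools that still exist.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch of "split the backtrace commit by pool-id": track
// one result per pool instead of one gathered result for all ops.
struct BtOp {
    int64_t pool_id;  // target pool of this backtrace update
    int result;       // per-op return value (0 or -errno)
};

std::map<int64_t, int> commit_by_pool(const std::vector<BtOp>& ops) {
    std::map<int64_t, int> per_pool;  // pool-id -> first error (or 0)
    for (const auto& op : ops) {
        int& r = per_pool[op.pool_id];
        if (r == 0 && op.result < 0)
            r = op.result;            // error stays confined to its pool
    }
    return per_pool;
}
```

Under this scheme an ENOENT from a nuked pool would mark only that pool's batch as failed, leaving updates against live pools untouched.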
Updated by Venky Shankar 3 months ago
Venky Shankar wrote:
One way to solve this would be to "split" the backtrace commit operation based on file or directory. File backtrace updates go to the data pool (which can be removed) and directory backtrace updates go to the metadata pool - which technically can be removed too, but if users choose to shoot themselves in the foot, then let them :)
But that does not totally avoid the problem, since files can have different layouts and the errno can then trickle to the commit ops for which the data pool exists. So, we could split the backtrace commit based on the pool-id, but maybe that's too much to solve this.
Slight correction - the ops_vec vector is for a CInode, so the set of updates it tracks is the backtrace for the file and the dirty parent, and that's where I believe the issue stems from.
Updated by Xiubo Li 3 months ago
Venky Shankar wrote:
So, the issue is that one of the commit operations in the set of backtrace updates failed with ENOENT: the previous test added a data pool, created a file, deleted it, and then removed the data pool. The mdlog still had a reference to the (now gone) data pool, for which the backtrace update fails (pool nuked). Since the commit ops use C_Gather, the error from one failed commit op trickles to every other commit op in the set.
Venky, isn't it a correct operation to mark the filesystem read-only in case the corresponding data pool was deleted?
Updated by Venky Shankar 3 months ago
Xiubo Li wrote:
Venky Shankar wrote:
So, the issue is that one of the commit operations in the set of backtrace updates failed with ENOENT: the previous test added a data pool, created a file, deleted it, and then removed the data pool. The mdlog still had a reference to the (now gone) data pool, for which the backtrace update fails (pool nuked). Since the commit ops use C_Gather, the error from one failed commit op trickles to every other commit op in the set.
Venky, isn't it a correct operation to mark the filesystem read-only in case the corresponding data pool was deleted?
There is special handling for ENOENT, where the operation is treated as a success - and rightly so, since CephFS allows removing a data pool (but an event in the mdlog can still have a reference to it and fail when the event is flushed out at a later point in time).
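That special-casing can be sketched as follows (a hypothetical illustration of the handling described above, not the actual MDS code; filter_backtrace_error is an invented name): an ENOENT from a backtrace write is downgraded to success, while any other write error still propagates and would force the file system read-only.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical sketch: downgrade ENOENT from a backtrace write to
// success, since the data pool a journaled event points at may have been
// legitimately removed; other errors keep propagating.
inline int filter_backtrace_error(int rval) {
    if (rval == -ENOENT)
        return 0;    // stale pool/object: treat the update as done
    return rval;     // e.g. -EIO still propagates
}
```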
Updated by Venky Shankar 3 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 55421