Bug #48562

open

qa: scrub - object missing on disk; some files may be lost

Added by Milind Changire over 3 years ago. Updated 4 days ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
fsck/damage handling
Target version:
% Done:
0%
Source:
Q/A
Tags:
backport_processed
Backport:
squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
crash, qa-failure, scrub
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-12-10T05:14:53.213 INFO:tasks.ceph.mds.b.smithi165.stderr:2020-12-10T05:14:53.212+0000 7f27f1562700 -1 log_channel(cluster) log [ERR] : dir 0x10000000070.110101* object missing on disk; some files may be lost (/client.0/tmp/testdir/dir1/dir2)

teuthology run URL:
http://pulpito.front.sepia.ceph.com/mchangir-2020-12-10_04:47:36-fs:workload-wip-mchangir-qa-forward-scrub-task-distro-basic-smithi/5697353/
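
For context, each CephFS directory fragment is stored as a separate object in the metadata pool, named after the directory's inode number and fragment, so a quick way to confirm whether the object is really gone is to stat it with the rados CLI. A minimal sketch, assuming a default metadata pool name and an unsplit fragment (both illustrative, not values taken from this run):

#!/usr/bin/env python3
# Hedged sketch: check whether a CephFS dirfrag backing object exists in the
# metadata pool. The pool name, inode number and fragment encoding below are
# illustrative assumptions.
import subprocess

METADATA_POOL = "cephfs_metadata"        # assumption: default metadata pool
DIRFRAG_OBJECT = "10000000070.00000000"  # assumption: unsplit fragment of dir 0x10000000070

def dirfrag_exists(pool: str, obj: str) -> bool:
    # "rados stat" exits non-zero when the object is missing on disk.
    result = subprocess.run(["rados", "-p", pool, "stat", obj],
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    if dirfrag_exists(METADATA_POOL, DIRFRAG_OBJECT):
        print("dirfrag object present")
    else:
        print("dirfrag object missing on disk")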


Related issues 3 (2 open, 1 closed)

Related to CephFS - Bug #65966: qa: cluster [ERR] dir 0x10000000000 object missing on disk; some files may be lost (/dir) - Duplicate

Copied to CephFS - Backport #65987: reef: qa: scrub - object missing on disk; some files may be lost - New - Venky Shankar
Copied to CephFS - Backport #65988: squid: qa: scrub - object missing on disk; some files may be lost - New - Venky Shankar
Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v16.0.0
  • Source set to Q/A
  • Component(FS) MDS added
  • Labels (FS) qa-failure added
Actions #2

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Milind Changire
Actions #3

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus
Actions #4

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)
Actions #5

Updated by Milind Changire over 1 year ago

  • Status changed from Triaged to Closed
  • Priority changed from Urgent to Low

closing tracker for now
lowering priority to low
please reopen in case this is seen again

Actions #6

Updated by Patrick Donnelly 2 months ago

  • Category set to fsck/damage handling
  • Status changed from Closed to New
  • Priority changed from Low to High
  • Target version set to v20.0.0
  • Backport changed from pacific,octopus,nautilus to squid,reef
Actions #7

Updated by Venky Shankar about 2 months ago

Oh wow, after 3 years. Did we merge something that made this show up again? Especially since https://tracker.ceph.com/issues/64730 also showed up around the same time.

Actions #8

Updated by Venky Shankar about 2 months ago

/a/yuriw-2024-03-16_15:03:17-fs-wip-yuri10-testing-2024-03-15-1653-reef-distro-default-smithi/7606353

Actions #9

Updated by Milind Changire about 2 months ago

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that its state is not stable and that any checks could potentially fail?

Actions #10

Updated by Venky Shankar about 2 months ago

Milind Changire wrote:

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that its state is not stable and that any checks could potentially fail?

Is this the underlying reason for the test failure? The projected state is an interim state (say, for an inode) until it gets journaled, after which the projection is popped. At that point (especially for an inode), the parent is marked dirty, which scrub then checks so that the item is not considered damaged.
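
To make the above concrete, here is a purely illustrative sketch of that pattern (this is not the MDS C++ code; the names and structure are assumptions): an interim projected value is stacked on the item, popped once the corresponding journal entry commits, and the parent is flagged dirty so scrub can tell a pending rewrite apart from real damage.

# Illustrative sketch of the "projected state" pattern described above.
# Not the MDS implementation; names and structure are assumptions.
class Inode:
    def __init__(self, value):
        self.stable = value        # last journaled (on-disk consistent) state
        self.projected = []        # pending, not-yet-journaled states
        self.parent_dirty = False  # parent flagged dirty after journaling

    def project(self, new_value):
        # Start an update: push an interim state. The backing object no longer
        # matches the intended state, but that is expected and transient.
        self.projected.append(new_value)

    def journal_commit(self):
        # Journal entry committed: pop the projection and mark the parent
        # dirty so scrub knows the backing object is pending rewrite rather
        # than damaged.
        self.stable = self.projected.pop(0)
        self.parent_dirty = True

def scrub_check(inode, on_disk_value):
    if on_disk_value == inode.stable:
        return "clean"
    if inode.projected or inode.parent_dirty:
        return "in-flight update, not damage"
    return "damaged"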

Actions #11

Updated by Milind Changire about 2 months ago

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log, which is odd. Can anybody explain this teuthology behavior?

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object, which leads to the ERR log. So this might not be a CephFS failure at all.
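
For reference, a rough sketch of the kind of steps that test performs (this is not the actual test code; the pool name, MDS name and object name are assumptions):

# Rough sketch of the steps test_health_status_after_dirfrag_repair exercises;
# not the actual test code. Pool, MDS and object names are assumptions.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

METADATA_POOL = "cephfs_metadata"        # assumption
DIRFRAG_OBJECT = "10000000070.00000000"  # assumption: dirfrag backing object
MDS = "mds.a"                            # assumption

# 1. Deliberately delete the dirfrag's backing object; this is what produces
#    the "object missing on disk; some files may be lost" cluster log error.
run("rados", "-p", METADATA_POOL, "rm", DIRFRAG_OBJECT)

# 2. Ask the MDS to scrub and repair the damaged subtree.
run("ceph", "tell", MDS, "scrub", "start", "/", "recursive,repair,force")

# 3. The test then waits for the damage to be repaired and for cluster health
#    to return to HEALTH_OK, e.g. by polling:
run("ceph", "health", "detail")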

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted CephFS volume?

Actions #13

Updated by Venky Shankar about 2 months ago

Milind Changire wrote:

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log, which is odd. Can anybody explain this teuthology behavior?

/a/yuriw-2024-03-12_14:59:27-fs-wip-yuri11-testing-2024-03-11-0838-reef-distro-default-smithi/7593867 does have test_health_status_after_dirfrag_repair

2024-03-12T19:04:01.704 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:01.719+0000 7f640abb9640  1 -- 172.21.15.92:0/3726236821 --> [v2:172.21.15.92:3300/0,v1:172.21.15.92:6789/0] -- mon_command({"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]} v 0) v1 -- 0x7f64040b3500 con 0x7f64040b1680
2024-03-12T19:04:02.031 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:02.045+0000 7f64017fa640  1 -- 172.21.15.92:0/3726236821 <== mon.0 v2:172.21.15.92:3300/0 7 ==== mon_command_ack([{"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]}]=0  v377) v1 ==== 167+0+0 (secure 0 0 0) 0x7f63fc018020 con 0x7f64040b1680

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object, which leads to the ERR log. So this might not be a CephFS failure at all.

In that case, this warning needs to be added to the ignorelist.

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted CephFS volume?

I think yes.

Actions #14

Updated by Venky Shankar about 1 month ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Milind Changire to Venky Shankar
  • Pull request ID set to 56699
  • Labels (FS) crash added

Milind, I'm taking this one.

Actions #15

Updated by Venky Shankar 4 days ago

  • Status changed from Fix Under Review to Pending Backport
Actions #16

Updated by Backport Bot 4 days ago

  • Copied to Backport #65987: reef: qa: scrub - object missing on disk; some files may be lost added
Actions #17

Updated by Backport Bot 4 days ago

  • Copied to Backport #65988: squid: qa: scrub - object missing on disk; some files may be lost added
Actions #18

Updated by Backport Bot 4 days ago

  • Tags set to backport_processed
Actions #19

Updated by Venky Shankar 2 days ago

  • Related to Bug #65966: qa: cluster [ERR] dir 0x10000000000 object missing on disk; some files may be lost (/dir) added