Bug #48562

open

qa: scrub - object missing on disk; some files may be lost

Added by Milind Changire over 3 years ago. Updated 16 days ago.

Status: Fix Under Review
Priority: High
Assignee: Venky Shankar
Category: fsck/damage handling
Target version: v20.0.0
% Done: 0%
Source: Q/A
Tags:
Backport: squid,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS
Labels (FS): crash, qa-failure, scrub
Pull request ID: 56699
Crash signature (v1):
Crash signature (v2):

Description

2020-12-10T05:14:53.213 INFO:tasks.ceph.mds.b.smithi165.stderr:2020-12-10T05:14:53.212+0000 7f27f1562700 -1 log_channel(cluster) log [ERR] : dir 0x10000000070.110101* object missing on disk; some files may be lost (/client.0/tmp/testdir/dir1/dir2)

teuthology run URL:
http://pulpito.front.sepia.ceph.com/mchangir-2020-12-10_04:47:36-fs:workload-wip-mchangir-qa-forward-scrub-task-distro-basic-smithi/5697353/

Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v16.0.0
  • Source set to Q/A
  • Component(FS) MDS added
  • Labels (FS) qa-failure added
Actions #2

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Milind Changire
Actions #3

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus
Actions #4

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)
Actions #5

Updated by Milind Changire over 1 year ago

  • Status changed from Triaged to Closed
  • Priority changed from Urgent to Low

closing tracker for now
lowering priority to low
please reopen in case this is seen again

Actions #6

Updated by Patrick Donnelly about 1 month ago

  • Category set to fsck/damage handling
  • Status changed from Closed to New
  • Priority changed from Low to High
  • Target version set to v20.0.0
  • Backport changed from pacific,octopus,nautilus to squid,reef
Actions #7

Updated by Venky Shankar about 1 month ago

Oh wow, after 3 years. Did we merge something that made this show up, especially since https://tracker.ceph.com/issues/64730 also started showing up around the same time?

Actions #8

Updated by Venky Shankar about 1 month ago

/a/yuriw-2024-03-16_15:03:17-fs-wip-yuri10-testing-2024-03-15-1653-reef-distro-default-smithi/7606353

Actions #9

Updated by Milind Changire about 1 month ago

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that the element's state is not stable and that any checks could potentially fail?

Actions #10

Updated by Venky Shankar about 1 month ago

Milind Changire wrote:

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that the element's state is not stable and that any checks could potentially fail?

Is this the underlying reason for the test failure? The projected state is an interim state, say for an inode, until it gets journaled, after which the projection is popped. At that point (esp. for an inode), the parent gets marked as dirty, which scrub then checks so as not to consider the item damaged.
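A minimal sketch of the ordering described above, illustrative only (not actual Ceph MDS code) and with hypothetical flag names standing in for the real MDS state:

    # Illustrative only: hypothetical flags standing in for MDS state.
    def scrub_should_flag_missing_object(backing_object_found: bool,
                                         is_projected: bool,
                                         parent_dirty: bool) -> bool:
        # Report damage only when the on-disk object is missing *and* the
        # in-memory state is stable (no pending projection, parent not dirty).
        if backing_object_found:
            return False   # backing object exists; nothing to flag
        if is_projected:
            return False   # interim state; update not yet journaled
        if parent_dirty:
            return False   # journaled but not yet flushed; missing object is expected
        return True        # stable, clean state with no backing object => damage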

Actions #11

Updated by Milind Changire about 1 month ago

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log. This is odd. Can anybody explain this teuthology behavior?

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object leading to the ERR log. So this might not be a cephfs failure at all.
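For context, a hedged sketch of what the test roughly does (not the verbatim test body; helper names follow qa/tasks/cephfs conventions but may differ in detail):

    # Hedged sketch of the flow in test_health_status_after_dirfrag_repair.
    self.mount_a.run_shell(["mkdir", "testdir"])
    self.mount_a.run_shell(["touch", "testdir/file"])
    dir_ino = self.mount_a.path_to_ino("testdir")

    # Flush the journal so the dirfrag object is on disk, then take the MDS down.
    self.fs.rank_asok(["flush", "journal"])
    self.fs.fail()

    # Intentionally remove the dirfrag object from the metadata pool -- this is
    # what produces the "object missing on disk; some files may be lost" ERR log.
    self.fs.radosm(["rm", "{0:x}.00000000".format(dir_ino)])

    self.fs.set_joinable()
    self.fs.wait_for_daemons()

    # Scrub with repair and expect the damage/health warning to clear.
    self.fs.rank_tell(["scrub", "start", "/", "recursive,repair,force"])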

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted cephfs volume?

Actions #13

Updated by Venky Shankar 19 days ago

Milind Changire wrote:

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log. This is odd. Can anybody explain this teuthology behavior?

/a/yuriw-2024-03-12_14:59:27-fs-wip-yuri11-testing-2024-03-11-0838-reef-distro-default-smithi/7593867 does have test_health_status_after_dirfrag_repair

2024-03-12T19:04:01.704 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:01.719+0000 7f640abb9640  1 -- 172.21.15.92:0/3726236821 --> [v2:172.21.15.92:3300/0,v1:172.21.15.92:6789/0] -- mon_command({"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]} v 0) v1 -- 0x7f64040b3500 con 0x7f64040b1680
2024-03-12T19:04:02.031 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:02.045+0000 7f64017fa640  1 -- 172.21.15.92:0/3726236821 <== mon.0 v2:172.21.15.92:3300/0 7 ==== mon_command_ack([{"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]}]=0  v377) v1 ==== 167+0+0 (secure 0 0 0) 0x7f63fc018020 con 0x7f64040b1680

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object leading to the ERR log. So this might not be a cephfs failure at all.

In that case, this warning needs to be ignore-listed.
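(For reference: in the fs suite this is normally done with a log-ignorelist override in the relevant yaml facet; the exact entry below is an assumption about the final wording.)

    overrides:
      ceph:
        log-ignorelist:
          - object missing on disk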

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted cephfs volume?

I think yes.

Actions #14

Updated by Venky Shankar 16 days ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Milind Changire to Venky Shankar
  • Pull request ID set to 56699
  • Labels (FS) crash added

Milind, I'm taking this one.
