Bug #48562

open

qa: scrub - object missing on disk; some files may be lost

Added by Milind Changire over 3 years ago. Updated 16 days ago.

Status: Fix Under Review
Priority: High
Assignee: Venky Shankar
Category: fsck/damage handling
Target version: v20.0.0
% Done: 0%
Source: Q/A
Tags:
Backport: squid,reef
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): MDS
Labels (FS): crash, qa-failure, scrub
Pull request ID: 56699
Crash signature (v1):
Crash signature (v2):

Description

2020-12-10T05:14:53.213 INFO:tasks.ceph.mds.b.smithi165.stderr:2020-12-10T05:14:53.212+0000 7f27f1562700 -1 log_channel(cluster) log [ERR] : dir 0x10000000070.110101* object missing on disk; some files may be lost (/client.0/tmp/testdir/dir1/dir2)

teuthology run URL:
http://pulpito.front.sepia.ceph.com/mchangir-2020-12-10_04:47:36-fs:workload-wip-mchangir-qa-forward-scrub-task-distro-basic-smithi/5697353/

Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v16.0.0
  • Source set to Q/A
  • Component(FS) MDS added
  • Labels (FS) qa-failure added
Actions #2

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Milind Changire
Actions #3

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus
Actions #4

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)
Actions #5

Updated by Milind Changire over 1 year ago

  • Status changed from Triaged to Closed
  • Priority changed from Urgent to Low

closing tracker for now
lowering priority to low
please reopen in case this is seen again

Actions #6

Updated by Patrick Donnelly about 1 month ago

  • Category set to fsck/damage handling
  • Status changed from Closed to New
  • Priority changed from Low to High
  • Target version set to v20.0.0
  • Backport changed from pacific,octopus,nautilus to squid,reef
Actions #7

Updated by Venky Shankar about 1 month ago

Oh wow, after 3 years. Did we merge something that made this show up, especially since https://tracker.ceph.com/issues/64730 also started showing up around the same time?

Actions #8

Updated by Venky Shankar about 1 month ago

/a/yuriw-2024-03-16_15:03:17-fs-wip-yuri10-testing-2024-03-15-1653-reef-distro-default-smithi/7606353

Actions #9

Updated by Milind Changire about 1 month ago

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that the element's state is not stable and that any checks could potentially fail?

Actions #10

Updated by Venky Shankar about 1 month ago

Milind Changire wrote:

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that the element's state is not stable and that any checks could potentially fail?

Is this the underlying reason for the test failure? The projected state is an interim state, say for an inode, until it gets journaled, after which the projection is popped. At that point (esp. for an inode), the parent gets marked as dirty, which scrub then checks so as not to consider the item damaged.
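A minimal sketch of the ordering described above, illustrative only (not actual Ceph MDS code) and with hypothetical flag names standing in for the real MDS state:

    # Illustrative only: hypothetical flags standing in for MDS state.
    def scrub_should_flag_missing_object(backing_object_found: bool,
                                         is_projected: bool,
                                         parent_dirty: bool) -> bool:
        # Report damage only when the on-disk object is missing *and* the
        # in-memory state is stable (no pending projection, parent not dirty).
        if backing_object_found:
            return False   # backing object exists; nothing to flag
        if is_projected:
            return False   # interim state; update not yet journaled
        if parent_dirty:
            return False   # journaled but not yet flushed; missing object is expected
        return True        # stable, clean state with no backing object => damage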

Actions #11

Updated by Milind Changire about 1 month ago

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log. This is odd. Can anybody explain this teuthology behavior?

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object leading to the ERR log. So this might not be a cephfs failure at all.
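For context, a hedged sketch of what the test roughly does (not the verbatim test body; helper names follow qa/tasks/cephfs conventions but may differ in detail):

    # Hedged sketch of the flow in test_health_status_after_dirfrag_repair.
    self.mount_a.run_shell(["mkdir", "testdir"])
    self.mount_a.run_shell(["touch", "testdir/file"])
    dir_ino = self.mount_a.path_to_ino("testdir")

    # Flush the journal so the dirfrag object is on disk, then take the MDS down.
    self.fs.rank_asok(["flush", "journal"])
    self.fs.fail()

    # Intentionally remove the dirfrag object from the metadata pool -- this is
    # what produces the "object missing on disk; some files may be lost" ERR log.
    self.fs.radosm(["rm", "{0:x}.00000000".format(dir_ino)])

    self.fs.set_joinable()
    self.fs.wait_for_daemons()

    # Scrub with repair and expect the damage/health warning to clear.
    self.fs.rank_tell(["scrub", "start", "/", "recursive,repair,force"])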

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted cephfs volume?

Actions #13

Updated by Venky Shankar 19 days ago

Milind Changire wrote:

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log. This is odd. Can anybody explain this teuthology behavior?

/a/yuriw-2024-03-12_14:59:27-fs-wip-yuri11-testing-2024-03-11-0838-reef-distro-default-smithi/7593867 does have test_health_status_after_dirfrag_repair

2024-03-12T19:04:01.704 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:01.719+0000 7f640abb9640  1 -- 172.21.15.92:0/3726236821 --> [v2:172.21.15.92:3300/0,v1:172.21.15.92:6789/0] -- mon_command({"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]} v 0) v1 -- 0x7f64040b3500 con 0x7f64040b1680
2024-03-12T19:04:02.031 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:02.045+0000 7f64017fa640  1 -- 172.21.15.92:0/3726236821 <== mon.0 v2:172.21.15.92:3300/0 7 ==== mon_command_ack([{"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]}]=0  v377) v1 ==== 167+0+0 (secure 0 0 0) 0x7f63fc018020 con 0x7f64040b1680

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object leading to the ERR log. So this might not be a cephfs failure at all.

In that case, this warning needs to be ignore-listed.
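(For reference: in the fs suite this is normally done with a log-ignorelist override in the relevant yaml facet; the exact entry below is an assumption about the final wording.)

    overrides:
      ceph:
        log-ignorelist:
          - object missing on disk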

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted cephfs volume?

I think yes.

Actions #14

Updated by Venky Shankar 16 days ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Milind Changire to Venky Shankar
  • Pull request ID set to 56699
  • Labels (FS) crash added

Milind, I'm taking this one.
