Bug #48562

open

qa: scrub - object missing on disk; some files may be lost

Added by Milind Changire over 3 years ago. Updated 4 days ago.

Status:
Pending Backport
Priority:
High
Assignee:
Category:
fsck/damage handling
Target version:
% Done:
0%
Source:
Q/A
Tags:
backport_processed
Backport:
squid,reef
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
fs
Component(FS):
MDS
Labels (FS):
crash, qa-failure, scrub
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2020-12-10T05:14:53.213 INFO:tasks.ceph.mds.b.smithi165.stderr:2020-12-10T05:14:53.212+0000 7f27f1562700 -1 log_channel(cluster) log [ERR] : dir 0x10000000070.110101* object missing on disk; some files may be lost (/client.0/tmp/testdir/dir1/dir2)

teuthology run URL:
http://pulpito.front.sepia.ceph.com/mchangir-2020-12-10_04:47:36-fs:workload-wip-mchangir-qa-forward-scrub-task-distro-basic-smithi/5697353/
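
For context, each CephFS directory fragment is stored as a separate object in the metadata pool, named after the directory's inode number and fragment, so a quick way to confirm whether the object is really gone is to stat it with the rados CLI. A minimal sketch, assuming a default metadata pool name and an unsplit fragment (both illustrative, not values taken from this run):

#!/usr/bin/env python3
# Hedged sketch: check whether a CephFS dirfrag backing object exists in the
# metadata pool. The pool name, inode number and fragment encoding below are
# illustrative assumptions.
import subprocess

METADATA_POOL = "cephfs_metadata"        # assumption: default metadata pool
DIRFRAG_OBJECT = "10000000070.00000000"  # assumption: unsplit fragment of dir 0x10000000070

def dirfrag_exists(pool: str, obj: str) -> bool:
    # "rados stat" exits non-zero when the object is missing on disk.
    result = subprocess.run(["rados", "-p", pool, "stat", obj],
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    if dirfrag_exists(METADATA_POOL, DIRFRAG_OBJECT):
        print("dirfrag object present")
    else:
        print("dirfrag object missing on disk")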


Related issues 3 (2 open, 1 closed)

Related to CephFS - Bug #65966: qa: cluster [ERR] dir 0x10000000000 object missing on disk; some files may be lost (/dir) - Duplicate

Copied to CephFS - Backport #65987: reef: qa: scrub - object missing on disk; some files may be lost - New - Venky Shankar
Copied to CephFS - Backport #65988: squid: qa: scrub - object missing on disk; some files may be lost - New - Venky Shankar
Actions #1

Updated by Patrick Donnelly over 3 years ago

  • Priority changed from Normal to Urgent
  • Target version set to v16.0.0
  • Source set to Q/A
  • Component(FS) MDS added
  • Labels (FS) qa-failure added
Actions #2

Updated by Patrick Donnelly over 3 years ago

  • Status changed from New to Triaged
  • Assignee set to Milind Changire
Actions #3

Updated by Patrick Donnelly over 3 years ago

  • Target version changed from v16.0.0 to v17.0.0
  • Backport set to pacific,octopus,nautilus
Actions #4

Updated by Patrick Donnelly almost 2 years ago

  • Target version deleted (v17.0.0)
Actions #5

Updated by Milind Changire over 1 year ago

  • Status changed from Triaged to Closed
  • Priority changed from Urgent to Low

closing tracker for now
lowering priority to low
please reopen in case this is seen again

Actions #6

Updated by Patrick Donnelly 2 months ago

  • Category set to fsck/damage handling
  • Status changed from Closed to New
  • Priority changed from Low to High
  • Target version set to v20.0.0
  • Backport changed from pacific,octopus,nautilus to squid,reef
Actions #7

Updated by Venky Shankar about 2 months ago

Oh wow, after 3 years. Did we merge something that made this show up again? Especially since https://tracker.ceph.com/issues/64730 also showed up around the same time.

Actions #8

Updated by Venky Shankar about 2 months ago

/a/yuriw-2024-03-16_15:03:17-fs-wip-yuri10-testing-2024-03-15-1653-reef-distro-default-smithi/7606353

Actions #9

Updated by Milind Changire about 2 months ago

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that its state is not stable and that any checks could potentially fail?

Actions #10

Updated by Venky Shankar about 2 months ago

Milind Changire wrote:

Is it okay to ignore a dir/inode/dentry during scrub if there are corresponding projections active for it, implying that its state is not stable and that any checks could potentially fail?

Is this the underlying reason for the test failure? The projected state is an interim state (say, for an inode) until it gets journaled, after which the projection is popped. At that point (especially for an inode), the parent is marked dirty, which scrub then checks so that the item is not considered damaged.
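
To make the above concrete, here is a purely illustrative sketch of that pattern (this is not the MDS C++ code; the names and structure are assumptions): an interim projected value is stacked on the item, popped once the corresponding journal entry commits, and the parent is flagged dirty so scrub can tell a pending rewrite apart from real damage.

# Illustrative sketch of the "projected state" pattern described above.
# Not the MDS implementation; names and structure are assumptions.
class Inode:
    def __init__(self, value):
        self.stable = value        # last journaled (on-disk consistent) state
        self.projected = []        # pending, not-yet-journaled states
        self.parent_dirty = False  # parent flagged dirty after journaling

    def project(self, new_value):
        # Start an update: push an interim state. The backing object no longer
        # matches the intended state, but that is expected and transient.
        self.projected.append(new_value)

    def journal_commit(self):
        # Journal entry committed: pop the projection and mark the parent
        # dirty so scrub knows the backing object is pending rewrite rather
        # than damaged.
        self.stable = self.projected.pop(0)
        self.parent_dirty = True

def scrub_check(inode, on_disk_value):
    if on_disk_value == inode.stable:
        return "clean"
    if inode.projected or inode.parent_dirty:
        return "in-flight update, not damage"
    return "damaged"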

Actions #11

Updated by Milind Changire about 2 months ago

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log, which is odd. Can anybody explain this teuthology behavior?

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object, which leads to the ERR log. So this might not be a CephFS failure at all.
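
For reference, a rough sketch of the kind of steps that test performs (this is not the actual test code; the pool name, MDS name and object name are assumptions):

# Rough sketch of the steps test_health_status_after_dirfrag_repair exercises;
# not the actual test code. Pool, MDS and object names are assumptions.
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

METADATA_POOL = "cephfs_metadata"        # assumption
DIRFRAG_OBJECT = "10000000070.00000000"  # assumption: dirfrag backing object
MDS = "mds.a"                            # assumption

# 1. Deliberately delete the dirfrag's backing object; this is what produces
#    the "object missing on disk; some files may be lost" cluster log error.
run("rados", "-p", METADATA_POOL, "rm", DIRFRAG_OBJECT)

# 2. Ask the MDS to scrub and repair the damaged subtree.
run("ceph", "tell", MDS, "scrub", "start", "/", "recursive,repair,force")

# 3. The test then waits for the damage to be repaired and for cluster health
#    to return to HEALTH_OK, e.g. by polling:
run("ceph", "health", "detail")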

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted CephFS volume?

Actions #13

Updated by Venky Shankar about 2 months ago

Milind Changire wrote:

According to qa/tasks/cephfs/test_forward_scrub.py, the test that causes 'stat testdir/hardlink' to fail is test_health_status_after_dirfrag_repair.
However, there is no trace of teuthology ever starting this test in teuthology.log, which is odd. Can anybody explain this teuthology behavior?

/a/yuriw-2024-03-12_14:59:27-fs-wip-yuri11-testing-2024-03-11-0838-reef-distro-default-smithi/7593867 does have test_health_status_after_dirfrag_repair

2024-03-12T19:04:01.704 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:01.719+0000 7f640abb9640  1 -- 172.21.15.92:0/3726236821 --> [v2:172.21.15.92:3300/0,v1:172.21.15.92:6789/0] -- mon_command({"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]} v 0) v1 -- 0x7f64040b3500 con 0x7f64040b1680
2024-03-12T19:04:02.031 INFO:teuthology.orchestra.run.smithi092.stderr:2024-03-12T19:04:02.045+0000 7f64017fa640  1 -- 172.21.15.92:0/3726236821 <== mon.0 v2:172.21.15.92:3300/0 7 ==== mon_command_ack([{"prefix": "log", "logtext": ["Ended test tasks.cephfs.test_forward_scrub.TestForwardScrub.test_health_status_after_dirfrag_repair"]}]=0  v377) v1 ==== 167+0+0 (secure 0 0 0) 0x7f63fc018020 con 0x7f64040b1680

Apart from the odd behavior mentioned above, the test test_health_status_after_dirfrag_repair intentionally deletes the RADOS object, which leads to the ERR log. So this might not be a CephFS failure at all.

In that case, this warning needs to be added to the ignorelist.

BTW, where is the teuthology.log created when running the tests?
Is it on the mounted CephFS volume?

I think yes.

Actions #14

Updated by Venky Shankar about 1 month ago

  • Status changed from New to Fix Under Review
  • Assignee changed from Milind Changire to Venky Shankar
  • Pull request ID set to 56699
  • Labels (FS) crash added

Milind, I'm taking this one.

Actions #15

Updated by Venky Shankar 4 days ago

  • Status changed from Fix Under Review to Pending Backport
Actions #16

Updated by Backport Bot 4 days ago

  • Copied to Backport #65987: reef: qa: scrub - object missing on disk; some files may be lost added
Actions #17

Updated by Backport Bot 4 days ago

  • Copied to Backport #65988: squid: qa: scrub - object missing on disk; some files may be lost added
Actions #18

Updated by Backport Bot 4 days ago

  • Tags set to backport_processed
Actions #19

Updated by Venky Shankar 2 days ago

  • Related to Bug #65966: qa: cluster [ERR] dir 0x10000000000 object missing on disk; some files may be lost (/dir) added