Bug #62381

open

mds: Bug still exists: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())

Added by Igor Fedotov 8 months ago. Updated 7 months ago.

Status: In Progress
Priority: Normal
Assignee:
Category: Correctness/Safety
Target version:
% Done: 0%
Source:
Tags:
Backport: quincy, reef
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Although https://tracker.ceph.com/issues/53597 is marked as resolved, we still hit the problem in v17.2.5.

It occurred on multiple MDSes quite a few times within a span of a few hours and was finally repaired by scrubbing.


Files

cephfs_bug.txt (65.8 KB), Igor Fedotov, 08/09/2023 06:18 PM

Related issues 1 (0 open, 1 closed)

Related to CephFS - Bug #53597: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) (Resolved, 玮文 胡)

Actions #1

Updated by Igor Fedotov 8 months ago

The attached file contains log snippets with apparently relevant information for a few crashes as well as intermediate and final scrubbings.

Actions #2

Updated by Igor Fedotov 8 months ago

  • Related to Bug #53597: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) added
Actions #3

Updated by Igor Fedotov 8 months ago

  • Backport set to quincy, reef
  • Severity changed from 3 - minor to 2 - major
Actions #4

Updated by Venky Shankar 8 months ago

  • Category set to Correctness/Safety
  • Assignee set to Venky Shankar
  • Target version set to v19.0.0
Actions #5

Updated by Venky Shankar 8 months ago

Igor Fedotov wrote:

The attached file contains log snippets with apparently relevant information for a few crashes as well as intermediate and final scrubbings.

Thanks, Igor. I'll have a look.

Actions #6

Updated by Venky Shankar 8 months ago

  • Status changed from New to In Progress
Actions #7

Updated by Venky Shankar 8 months ago

FWIW, logs hint at missing (RADOS) objects:

Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700  0 mds.0.cache.dir(0x600012ddb0e) _fetched missing object for [dir 0x600012ddb0e /volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>/ [2,head] auth v=0 cv=0/0 ap=1+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x5572ee796880]
Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700 -1 log_channel(cluster) log [ERR] : dir 0x600012ddb0e object missing on disk; some files may be lost (/volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>)
Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700 -1 log_channel(cluster) log [ERR] : dir 0x600012ddb0e object missing on disk; some files may be lost (/volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>)

I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?
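For anyone checking their own MDS logs for the same symptom, a minimal Python sketch along these lines could list the affected dirfrags. The regular expression is an assumption derived only from the two [ERR] lines quoted above, not from the MDS source, and `find_missing_dirs` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the "[ERR] : dir 0x... object missing on disk" lines
# quoted in this comment; it is an assumption, not taken from the MDS code.
MISSING_DIR_RE = re.compile(
    r"dir (0x[0-9a-f]+) object missing on disk; some files may be lost \((.*)\)"
)


def find_missing_dirs(lines):
    """Return (paths, counts) keyed by the dir inode number as logged.

    paths  maps each inode to the path reported in the error message.
    counts maps each inode to the number of matching log lines.
    """
    paths = {}
    counts = Counter()
    for line in lines:
        match = MISSING_DIR_RE.search(line)
        if match is None:
            continue  # not a "missing on disk" error line
        ino, path = match.group(1), match.group(2)
        paths[ino] = path
        counts[ino] += 1
    return paths, counts
```

Feeding it the lines of a journal or MDS log file (for example `find_missing_dirs(open(log_path))`, with `log_path` chosen by the reader) gives one entry per damaged directory, which may help when deciding which paths to scrub.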

Actions #8

Updated by Igor Fedotov 8 months ago

Venky Shankar wrote:

FWIW, logs hint at missing (RADOS) objects:

[...]

I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?

Unfortunately no.

Actions #9

Updated by Venky Shankar 7 months ago

Venky Shankar wrote:

FWIW, logs hint at missing (RADOS) objects:

[...]

I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?

I believe the crash is caused by the missing directory objects. The MDS migrator flushes the mdlog so that the fnode version catches up with the latest projected fnode version; in this case the two did not match because of the missing dir objects. The MDS invokes CDir::go_bad() in various places when it loads a dirfrag, but it does not treat every error as fatal (a fatal error would make it mark itself as damaged and abort). So I think the damaged dirfrag is being picked up by the migrator in this case.
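To make the asserted condition concrete, here is a toy Python model; it is not Ceph code. `ToyDir`, `flush_journal`, and `export_dir` are hypothetical names. The only things taken from this ticket are the asserted condition (projected version equals version) and the idea that a dirfrag with a non-fatal load error stays in the cache. How the two versions actually diverge inside the MDS is not established here, so the model simply assumes that pending projections on a damaged dirfrag are never applied.

```python
class ToyDir:
    """Toy stand-in for a dirfrag; not the real CDir."""

    def __init__(self, name, damaged=False):
        self.name = name
        self.version = 0      # last applied fnode version
        self.projected = []   # pending (projected) fnode versions
        self.damaged = damaged

    def get_version(self):
        return self.version

    def get_projected_version(self):
        # With nothing pending, the projected version equals the version.
        return self.projected[-1] if self.projected else self.version

    def project_update(self):
        self.projected.append(self.get_projected_version() + 1)


def flush_journal(d):
    """Apply pending projections, as a journal flush would.

    Assumption for illustration only: a damaged dirfrag never gets its
    projections applied, so its versions stay out of sync.
    """
    if d.damaged:
        return
    while d.projected:
        d.version = d.projected.pop(0)


def export_dir(d):
    """Flush, then check the condition from the failed assert."""
    flush_journal(d)
    if d.get_projected_version() != d.get_version():
        raise AssertionError(
            f"{d.name}: projected {d.get_projected_version()} "
            f"!= version {d.get_version()}"
        )
    return True
```

In this model a healthy `ToyDir` exports cleanly after any number of `project_update()` calls, while one created with `damaged=True` trips the check as soon as it has a pending update.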
