Bug #62381: mds: Bug still exists: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) - CephFS - Ceph

Added by Igor Fedotov 9 months ago. Updated 8 months ago.

Status: In Progress
Priority: Normal
Assignee: Venky Shankar
Category: Correctness/Safety
Target version: Ceph - v19.0.0
Backport: quincy, reef
Severity: 2 - major
Affected Versions: Ceph - v17.2.5

Description

Although https://tracker.ceph.com/issues/53597 is marked as resolved, we still hit the problem in v17.2.5.

It occurred on multiple MDSes several times within a few hours and was finally repaired by scrubbing.

Files

cephfs_bug.txt (65.8 KB)

Igor Fedotov, 08/09/2023 06:18 PM

Related issues 1 (0 open — 1 closed)

Updated by Igor Fedotov 9 months ago

File cephfs_bug.txt added

The attached file contains log snippets with apparently relevant information for a few crashes as well as intermediate and final scrubbings.

Updated by Igor Fedotov 9 months ago

Related to Bug #53597: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) added

Updated by Igor Fedotov 9 months ago

Backport set to quincy, reef
Severity changed from 3 - minor to 2 - major

Updated by Venky Shankar 9 months ago

Category set to Correctness/Safety
Assignee set to Venky Shankar
Target version set to v19.0.0

Updated by Venky Shankar 8 months ago

Igor Fedotov wrote:

The attached file contains log snippets with apparently relevant information for a few crashes as well as intermediate and final scrubbings.

Thanks, Igor. I'll have a look.

Updated by Venky Shankar 8 months ago

Status changed from New to In Progress

Updated by Venky Shankar 8 months ago

FWIW, logs hint at missing (RADOS) objects:

Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700  0 mds.0.cache.dir(0x600012ddb0e) _fetched missing object for [dir 0x600012ddb0e /volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>/ [2,head] auth v=0 cv=0/0 ap=1+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x5572ee796880]
Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700 -1 log_channel(cluster) log [ERR] : dir 0x600012ddb0e object missing on disk; some files may be lost (/volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>)
Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700 -1 log_channel(cluster) log [ERR] : dir 0x600012ddb0e object missing on disk; some files may be lost (/volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>)

I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?
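
For anyone triaging similar logs, here is a minimal Python sketch that counts the dirfrags reported as missing. The regular expression is derived only from the error lines quoted above and may need adjusting for other releases; the helper name `find_missing_dirs` and the trimmed sample lines are illustrative assumptions, not part of any Ceph tooling.

```python
import re
from collections import Counter

# Pattern inferred from the "[ERR]" lines quoted in this comment (an assumption,
# not an official log format specification).
MISSING_DIR_RE = re.compile(
    r"dir (?P<ino>0x[0-9a-f]+) object missing on disk; "
    r"some files may be lost \((?P<path>.*)\)"
)

def find_missing_dirs(lines):
    """Return a Counter mapping (dir inode, path) to the number of matching lines."""
    hits = Counter()
    for line in lines:
        match = MISSING_DIR_RE.search(line)
        if match:
            hits[(match.group("ino"), match.group("path"))] += 1
    return hits

# Sample input: the two error lines from above with the syslog prefix trimmed.
sample = [
    "log_channel(cluster) log [ERR] : dir 0x600012ddb0e object missing on disk; "
    "some files may be lost (/volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/"
    "db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>)",
] * 2

result = find_missing_dirs(sample)
for (ino, path), count in result.items():
    print(f"{ino}\t{count}\t{path}")
```

To scan a real log, the same function could be given an open file object instead of the `sample` list.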

Updated by Igor Fedotov 8 months ago

Venky Shankar wrote:

FWIW, logs hint at missing (RADOS) objects:

[...]

I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?

Unfortunately no.

Updated by Venky Shankar 8 months ago

Venky Shankar wrote:

FWIW, logs hint at missing (RADOS) objects:

[...]

I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?

I believe the crash is related to the missing directory objects. The MDS migrator flushes the mdlog so that the fnode version catches up to the latest projected fnode version; in this case the two mismatched because of the missing dir objects. The MDS invokes CDir::go_bad() in various places when it loads a dirfrag, but it does not treat every error as fatal (a fatal error would make it mark itself as damaged and abort). So I think the damaged dirfrag is being picked up by the migrator in this case.
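
The invariant behind the assert can be illustrated with a toy model. This is a hypothetical sketch, not Ceph's CDir or Migrator code: the class `ToyDirFrag`, the function `toy_export`, and in particular the rule that a flush skips a bad dirfrag are assumptions made only to show how a damaged dirfrag could reach an export with mismatched versions.

```python
from dataclasses import dataclass

@dataclass
class ToyDirFrag:
    """Hypothetical stand-in for a dirfrag; not Ceph's CDir."""
    version: int = 0            # last version actually applied
    projected_version: int = 0  # version once all pending updates are applied
    bad: bool = False           # set when loading hit an error that was not fatal

    def project_update(self):
        # A pending update advances only the projected version.
        self.projected_version += 1

    def flush(self):
        # ASSUMPTION for this toy: flushing applies pending updates only on a
        # healthy dirfrag; a bad one is left as it was.
        if not self.bad:
            self.version = self.projected_version

def toy_export(frag):
    """Flush, then require projected_version == version, like the failing assert."""
    frag.flush()
    if frag.projected_version != frag.version:
        raise AssertionError("projected_version != version")
    return "exported"

healthy = ToyDirFrag()
healthy.project_update()
healthy_result = toy_export(healthy)

damaged = ToyDirFrag(bad=True)
damaged.project_update()
try:
    damaged_result = toy_export(damaged)
except AssertionError:
    damaged_result = "assert failed"

print(healthy_result, "/", damaged_result)
```

In this toy, the healthy dirfrag exports cleanly while the damaged one trips the check, which mirrors the hypothesis above; whether the real code path behaves this way is still to be confirmed.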
