Bug #62381
open
mds: Bug still exists: FAILED ceph_assert(dir->get_projected_version() == dir->get_version())
Added by Igor Fedotov 9 months ago.
Updated 3 days ago.
Category:
Correctness/Safety
Description
Despite https://tracker.ceph.com/issues/53597 being marked as resolved, we still hit the problem in v17.2.5.
It occurred on multiple MDSes quite a few times within a few-hour window and was finally repaired by scrubbing.
Files
The attached file contains log snippets with what appears to be relevant information for a few crashes, as well as the intermediate and final scrub runs.
- Related to Bug #53597: mds: FAILED ceph_assert(dir->get_projected_version() == dir->get_version()) added
- Backport set to quincy, reef
- Severity changed from 3 - minor to 2 - major
- Category set to Correctness/Safety
- Assignee set to Venky Shankar
- Target version set to v19.0.0
Igor Fedotov wrote:
The attached file contains log snippets with what appears to be relevant information for a few crashes, as well as the intermediate and final scrub runs.
Thanks, Igor. I'll have a look.
- Status changed from New to In Progress
FWIW, logs hint at missing (RADOS) objects:
Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700 0 mds.0.cache.dir(0x600012ddb0e) _fetched missing object for [dir 0x600012ddb0e /volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>/ [2,head] auth v=0 cv=0/0 ap=1+0 state=1073741888|fetching f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x5572ee796880]
Jul 27 15:53:34 R07-NVME-03 ceph-mds[2482685]: 2023-07-27T15:53:34.908+0000 7fcbd0c39700 -1 log_channel(cluster) log [ERR] : dir 0x600012ddb0e object missing on disk; some files may be lost (/volumes/_deleting/8aa153c0-53d8-41c9-be90-270ad4a91c11/db5a8e9a-e491-4ca8-a8ec-0b47f8c19626/<redacted>)
I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?
Venky Shankar wrote:
FWIW, logs hint at missing (RADOS) objects:
[...]
I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?
Unfortunately no.
Venky Shankar wrote:
FWIW, logs hint at missing (RADOS) objects:
[...]
I'm not certain yet if this is the source of the problem or a contributing factor to it, but do we know why this happened, Igor?
I believe the crash has to do with the missing directory objects. The MDS migrator flushes the mdlog so that the fnode version is updated to the latest projected fnode version; in this case they mismatched due to the missing dir objects. The MDS invokes CDir::go_bad() at various places when it loads a dirfrag, but it does not treat all errors as fatal (where it would mark itself as damaged and abort). So, I think, the damaged dirfrag is being picked up by the migrator in this case.
- Target version changed from v19.0.0 to v20.0.0