Bug #37543

closed

mds: purge queue recovery hangs during boot if PQ journal is damaged

Added by Patrick Donnelly over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: High
Category: Correctness/Safety
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Failure: Test failure: test_object_deletion (tasks.cephfs.test_damage.TestDamage)
2 jobs: ['3291762', '3291647']
suites intersection: ['clusters/1-mds-4-client-coloc.yaml', 'conf/{client.yaml', 'fs/basic_functional/{begin.yaml', 'mds.yaml', 'mon.yaml', 'mount/fuse.yaml', 'no_client_pidfile.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'tasks/damage.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-4-client-coloc.yaml', 'conf/{client.yaml', 'fs/basic_functional/{begin.yaml', 'mds.yaml', 'mon.yaml', 'mount/fuse.yaml', 'no_client_pidfile.yaml', 'objectstore/bluestore-ec-root.yaml', 'objectstore/bluestore.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'supported-random-distros$/{ubuntu_16.04.yaml}', 'supported-random-distros$/{ubuntu_latest.yaml}', 'tasks/damage.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']

e.g.:

2018-11-29 11:59:29.590 7fc726f32700 -1 mds.0.purge_queue operator(): Error -22 loading Journaler
2018-11-29 11:59:29.590 7fc726f32700 -1 mds.0.139 unhandled write error (22) Invalid argument, force readonly...
2018-11-29 11:59:29.590 7fc726f32700  1 mds.0.cache force file system read-only
2018-11-29 11:59:29.590 7fc726f32700  0 log_channel(cluster) log [WRN] : force file system read-only
...
2018-11-29 11:59:29.594 7fc725f30700  2 mds.0.139 boot_start 2: replaying mds log
2018-11-29 11:59:29.594 7fc725f30700  2 mds.0.139 boot_start 2: waiting for purge queue recovered

From: /ceph/teuthology-archive/pdonnell-2018-11-29_06:44:45-fs-wip-pdonnell-testing-20181129.042324-distro-basic-smithi/3291762/remote/smithi061/log/ceph-mds.a-s.log.gz

The purge queue never recovers so the MDS sits in up:replay.
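
The hang can be pictured as a waiter that is never woken on the error path. The following is a minimal sketch of that general failure mode only, not Ceph's PurgeQueue code and not the eventual fix; the class `PurgeQueueSketch`, its `notify_on_error` flag, and the `boot` helper are hypothetical names invented for illustration.

```python
class PurgeQueueSketch:
    """Toy stand-in for a queue that must recover before boot continues.

    Hypothetical illustration; this is not Ceph's PurgeQueue.
    """

    def __init__(self, notify_on_error):
        self.notify_on_error = notify_on_error
        self.recovered = False
        self.error = 0
        self._waiters = []  # callbacks taking an integer result code

    def wait_for_recovery(self, callback):
        # Run the callback at once if the outcome is already known,
        # otherwise park it until finish_recovery() is called.
        if self.recovered or (self.error and self.notify_on_error):
            callback(self.error)
        else:
            self._waiters.append(callback)

    def finish_recovery(self, result):
        if result == 0:
            self.recovered = True
        else:
            self.error = result
            if not self.notify_on_error:
                # Hang pattern: the error is recorded, but the parked
                # waiters are never woken.
                return
        waiters, self._waiters = self._waiters, []
        for cb in waiters:
            cb(result)


def boot(queue, journal_result):
    """Simulate a boot step that waits for the queue to recover."""
    outcome = []
    queue.wait_for_recovery(outcome.append)
    queue.finish_recovery(journal_result)
    if not outcome:
        return "stuck waiting for purge queue"
    return "recovered" if outcome[0] == 0 else "error %d" % outcome[0]


print(boot(PurgeQueueSketch(notify_on_error=False), -22))
print(boot(PurgeQueueSketch(notify_on_error=True), -22))
```

With `notify_on_error=False` the boot step never hears back after the -22 load error, which mirrors the "waiting for purge queue recovered" line above. Passing the error to the waiters lets the caller decide what to do next, for example marking the rank damaged.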

This was seen while testing https://github.com/ceph/ceph/pull/25270. I will proceed with merging #25270, since an MDS sitting in up:replay is not much different from a damaged rank from a user's perspective. This still needs to be fixed.


Related issues 4 (0 open, 4 closed)

Related to CephFS - Bug #37394: mds: PurgeQueue write error handler does not handle EBLACKLISTED (Resolved, Patrick Donnelly)
Related to CephFS - Bug #37944: qa: test_damage needs to silence MDS_READ_ONLY (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #37898: mimic: mds: purge queue recovery hangs during boot if PQ journal is damaged (Resolved, Nathan Cutler)
Copied to CephFS - Backport #37899: luminous: mds: purge queue recovery hangs during boot if PQ journal is damaged (Resolved, Patrick Donnelly)

#1

Updated by Patrick Donnelly over 5 years ago

  • Related to Bug #37394: mds: PurgeQueue write error handler does not handle EBLACKLISTED added
#2

Updated by Patrick Donnelly over 5 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 25621
#3

Updated by Patrick Donnelly over 5 years ago

  • Status changed from Fix Under Review to Pending Backport
#4

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #37898: mimic: mds: purge queue recovery hangs during boot if PQ journal is damaged added
#5

Updated by Nathan Cutler over 5 years ago

  • Copied to Backport #37899: luminous: mds: purge queue recovery hangs during boot if PQ journal is damaged added
#6

Updated by Patrick Donnelly over 5 years ago

  • Related to Bug #37944: qa: test_damage needs to silence MDS_READ_ONLY added
#7

Updated by Patrick Donnelly about 5 years ago

  • Status changed from Pending Backport to Resolved