Bug #37543

mds: purge queue recovery hangs during boot if PQ journal is damaged

Added by Patrick Donnelly 4 days ago.

Status: New
Priority: High
Category: Correctness/Safety
Target version:
Start date:
Due date:
% Done: 0%
Source: Q/A
Tags:
Backport: mimic,luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:

Description

Failure: Test failure: test_object_deletion (tasks.cephfs.test_damage.TestDamage)
2 jobs: ['3291762', '3291647']
suites intersection: ['clusters/1-mds-4-client-coloc.yaml', 'conf/{client.yaml', 'fs/basic_functional/{begin.yaml', 'mds.yaml', 'mon.yaml', 'mount/fuse.yaml', 'no_client_pidfile.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'tasks/damage.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/1-mds-4-client-coloc.yaml', 'conf/{client.yaml', 'fs/basic_functional/{begin.yaml', 'mds.yaml', 'mon.yaml', 'mount/fuse.yaml', 'no_client_pidfile.yaml', 'objectstore/bluestore-ec-root.yaml', 'objectstore/bluestore.yaml', 'osd.yaml}', 'overrides/{frag_enable.yaml', 'supported-random-distros$/{ubuntu_16.04.yaml}', 'supported-random-distros$/{ubuntu_latest.yaml}', 'tasks/damage.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']

e.g.:

2018-11-29 11:59:29.590 7fc726f32700 -1 mds.0.purge_queue operator(): Error -22 loading Journaler
2018-11-29 11:59:29.590 7fc726f32700 -1 mds.0.139 unhandled write error (22) Invalid argument, force readonly...
2018-11-29 11:59:29.590 7fc726f32700  1 mds.0.cache force file system read-only
2018-11-29 11:59:29.590 7fc726f32700  0 log_channel(cluster) log [WRN] : force file system read-only
...
2018-11-29 11:59:29.594 7fc725f30700  2 mds.0.139 boot_start 2: replaying mds log
2018-11-29 11:59:29.594 7fc725f30700  2 mds.0.139 boot_start 2: waiting for purge queue recovered

From: /ceph/teuthology-archive/pdonnell-2018-11-29_06:44:45-fs-wip-pdonnell-testing-20181129.042324-distro-basic-smithi/3291762/remote/smithi061/log/ceph-mds.a-s.log.gz

The purge queue never recovers so the MDS sits in up:replay.
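
To triage other runs for the same symptom, the two log lines above can be matched mechanically. The following is only a rough sketch: the function name `find_purge_queue_hang` is hypothetical, the patterns are taken solely from the excerpt in this report, and a match is a hint rather than proof of a hang, since the check cannot tell whether recovery completed later.

```python
import re

# Patterns copied from the log excerpt in this report; other Ceph versions
# may word these messages differently (assumption).
LOAD_ERROR = re.compile(r"mds\.\d+\.purge_queue .*Error (-?\d+) loading Journaler")
WAITING = re.compile(r"boot_start \d+: waiting for purge queue recovered")


def find_purge_queue_hang(lines):
    """Return the Journaler load error code if the log shows a purge queue
    load error followed by boot_start waiting on purge queue recovery.
    Return None if that sequence does not appear."""
    error_code = None
    for line in lines:
        match = LOAD_ERROR.search(line)
        if match:
            # Remember the most recent load error seen so far.
            error_code = int(match.group(1))
        elif error_code is not None and WAITING.search(line):
            # The MDS started waiting after the load already failed.
            return error_code
    return None
```

Passing the lines of an uncompressed MDS log to `find_purge_queue_hang` returns the error code (for the excerpt above, -22) when the sequence is present. Whether the MDS is still in up:replay should be confirmed separately.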

This was found while testing https://github.com/ceph/ceph/pull/25270. I will proceed with merging #25270, since an MDS stuck in up:replay is not much different from a damaged rank from a user's perspective. This still needs to be fixed.


Related issues

Related to fs - Bug #37394: mds: PurgeQueue write error handler does not handle EBLACKLISTED (Pending Backport)

History

#1 Updated by Patrick Donnelly 4 days ago

  • Related to Bug #37394: mds: PurgeQueue write error handler does not handle EBLACKLISTED added
