Bug #44638

closed

test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) fails intermittently

Added by Venky Shankar about 4 years ago. Updated over 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:
0%

Source:
Q/A
Tags:
Backport:
nautilus, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Greg saw this in nautilus during test: http://pulpito.front.sepia.ceph.com/gregf-2020-03-13_20:56:54-fs-wip-greg-testing-nautilus-313-distro-basic-smithi/

Happens in master too: http://pulpito.ceph.com/vshankar-2020-03-16_10:45:29-fs-master-testing-basic-smithi/

The problem seems to arise when a laggy (active) MDS receives an mgrmap. When an active MDS is laggy, an incoming message is queued to be processed later (once the MDS is no longer laggy). If this message is an mgrmap, the laggy MDS still queues it and thereby marks it as processed (by returning `true` from `ms_dispatch()`). Later, when the MDS processes the message queue, it does not handle this message and so drops it.
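The failure mode described above can be illustrated with a toy model. This is not Ceph code: `ToyDaemon`, the message type strings, and the `handled`/`dropped` lists are hypothetical names chosen for illustration. Only the idea comes from the description: the dispatcher claims a message by returning true when it queues it while laggy, and the deferred path has no handler for the mgrmap.

```python
from collections import deque


class ToyDaemon:
    """Toy model of a daemon that defers messages while laggy (not Ceph code)."""

    # Message types the daemon's own handlers understand (hypothetical names).
    # "mgrmap" is deliberately absent, mirroring the bug described above.
    OWN_HANDLERS = {"mdsmap", "client_request"}

    def __init__(self):
        self.laggy = False
        self.waiting = deque()  # messages queued while laggy
        self.handled = []       # messages a handler actually consumed
        self.dropped = []       # messages claimed but never handled

    def ms_dispatch(self, msg_type):
        """Return True if this dispatcher claims the message."""
        if self.laggy:
            # Claiming the message here means no other dispatcher
            # (for example a mgr client) ever gets to see it.
            self.waiting.append(msg_type)
            return True
        if msg_type in self.OWN_HANDLERS:
            self.handled.append(msg_type)
            return True
        return False  # not ours: let another dispatcher handle it

    def process_waiting(self):
        """Drain the deferred queue once the daemon is no longer laggy."""
        self.laggy = False
        while self.waiting:
            msg_type = self.waiting.popleft()
            if msg_type in self.OWN_HANDLERS:
                self.handled.append(msg_type)
            else:
                self.dropped.append(msg_type)


daemon = ToyDaemon()
daemon.laggy = True
claimed = daemon.ms_dispatch("mgrmap")  # queued and claimed while laggy
daemon.process_waiting()                # no handler for it: silently dropped
print(claimed, daemon.handled, daemon.dropped)
```

In this sketch a non-laggy daemon returns false for an mgrmap, so another dispatcher could pick it up; only the laggy path both claims and loses the message.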

The side effect is that the mgr client instance in the MDS never gets a chance to process the mgrmap, which is what kickstarts things like periodic report updates to the manager. The mgr report (`MMgrReport`) carries `task_status`, which contains the MDS scrub status. Since the updated scrub status is not sent to ceph-mgr, it does not get recorded in the service map (and is not displayed in `ceph status`), causing the test to fail.
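The chain from a missed mgrmap to a missing scrub status can be sketched in the same toy style. Everything here is an assumption made for illustration: `ToyMgrClient`, `update_service_map`, the daemon name `mds.a`, and the `scrub status` key are hypothetical and do not reflect the real mgr client or service map layout.

```python
class ToyMgrClient:
    """Toy stand-in for a mgr client inside a daemon (not Ceph code)."""

    def __init__(self):
        self.have_mgrmap = False

    def handle_mgr_map(self):
        # In this model, seeing a mgrmap is what starts periodic reporting.
        self.have_mgrmap = True

    def build_report(self, task_status):
        """Return a report dict, or None if reporting never started."""
        if not self.have_mgrmap:
            return None
        return {"task_status": dict(task_status)}


def update_service_map(service_map, daemon_name, report):
    """Record a daemon's task_status in the toy service map, if reported."""
    if report is not None:
        service_map[daemon_name] = report["task_status"]


service_map = {}
client = ToyMgrClient()
status = {"scrub status": "paused"}

# The mgrmap was dropped, so no report is built and nothing is recorded.
update_service_map(service_map, "mds.a", client.build_report(status))
missing_before = "mds.a" not in service_map

# Once the mgrmap is processed, the status reaches the service map.
client.handle_mgr_map()
update_service_map(service_map, "mds.a", client.build_report(status))
print(missing_before, service_map)
```

A test that polls the status output for the scrub state would, in this model, keep finding no entry for the daemon until the mgrmap is finally processed.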


Related issues 3 (0 open, 3 closed)

Has duplicate CephFS - Bug #46058: qa: test_scrub_pause_and_resume KeyError: 'a' (Duplicate, Venky Shankar)
Copied to CephFS - Backport #46151: nautilus: test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) fails intermittently (Resolved, Nathan Cutler)
Copied to CephFS - Backport #46152: octopus: test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) fails intermittently (Resolved, Nathan Cutler)

Actions #1

Updated by Venky Shankar about 4 years ago

  • Subject changed from test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) failed intermittently to test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) failes intermittently
Actions #2

Updated by Venky Shankar about 4 years ago

  • Subject changed from test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) failes intermittently to test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) fails intermittently
Actions #3

Updated by Venky Shankar about 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 34024
Actions #4

Updated by Venky Shankar almost 4 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from nautilus to nautilus, octopus
Actions #5

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #46151: nautilus: test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) fails intermittently added
Actions #6

Updated by Nathan Cutler almost 4 years ago

  • Copied to Backport #46152: octopus: test_scrub_pause_and_resume (tasks.cephfs.test_scrub_checks.TestScrubControls) fails intermittently added
Actions #7

Updated by Patrick Donnelly almost 4 years ago

  • Has duplicate Bug #46058: qa: test_scrub_pause_and_resume KeyError: 'a' added
Actions #8

Updated by Nathan Cutler over 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
