Bug #16881

RuntimeError: Files in flight high water is unexpectedly low (0 / 6)

Added by Jeff Layton over 3 years ago. Updated 3 months ago.

Status: Pending Backport
Priority: Urgent
Category: -
Target version: v15.0.0
% Done: 0%
Source: other
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: fs
Component(FS): qa-suite
Labels (FS):
Pull request ID: 32667
Crash signature:

Description

During a test_files_throttle test run, I hit the following error:

2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:======================================================================
2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:ERROR: test_files_throttle (tasks.cephfs.test_strays.TestStrays)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 27, in test_files_throttle
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:    self._test_throttling(self.FILES_THROTTLE)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 208, in _test_throttling
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:    ops_high_water, mds_max_purge_files
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:RuntimeError: Files in flight high water is unexpectedly low (0 / 6)
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------

It's not clear to me what this is actually warning about, so I'll need to track that down to see if it's an actual bug. The test run output is here:

http://pulpito.ceph.com/jlayton-2016-07-29_18:51:42-fs-wip-jlayton-nlink---basic-mira/341563
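For context, here is a minimal sketch (not the actual qa-suite code) of the kind of check _test_throttling performs: poll the MDS perf counters while the mass deletion is in progress, track the high-water mark of strays being purged in flight, and fail if it never gets anywhere near the configured limit. get_mds_perf and the thresholds below are hypothetical stand-ins.

import time


def watch_purge_high_water(get_mds_perf, mds_max_purge_files=8, timeout=600):
    # get_mds_perf() is assumed to return the "perf dump" output for the
    # active MDS as a dict.
    files_high_water = 0
    deadline = time.time() + timeout

    while time.time() < deadline:
        mdc = get_mds_perf()["mds_cache"]
        files_high_water = max(files_high_water, mdc["num_strays_purging"])
        # Stop once everything created as a stray has been purged.
        if mdc["strays_purged"] >= mdc["strays_created"]:
            break
        time.sleep(1)

    # Racy by design (as the in-tree comment admits): if all deletions finish
    # before we ever poll, the high water mark can legitimately stay at 0.
    if files_high_water < mds_max_purge_files / 2:
        raise RuntimeError(
            "Files in flight high water is unexpectedly low "
            "({0} / {1})".format(files_high_water, mds_max_purge_files)
        )
    return files_high_water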

Related issues

Duplicated by fs - Bug #20118: Test failure: test_ops_throttle (tasks.cephfs.test_strays.TestStrays) Duplicate 05/31/2017
Copied to fs - Backport #43790: nautilus: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) Resolved
Copied to fs - Backport #43791: mimic: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) New

History

#1 Updated by Jeff Layton over 3 years ago

  • Component(FS) qa-suite added

Comments in test_strays.py seem to indicate that this test is racy anyway:

        # Check that we got up to a respectable rate during the purge.  This is totally
        # racy, but should be safeish unless the cluster is pathologically slow, or
        # insanely fast such that the deletions all pass before we have polled the
        # statistics.                                                           

I'm guessing maybe the latter happened here?

#2 Updated by John Spray over 3 years ago

They are racy, but this particular case is a bit odd:

2016-07-30T21:07:08.248 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "mds_cache": {
        "num_strays": 1447,
        "num_strays_purging": 0,
        "num_strays_delayed": 996,
        "num_purge_ops": 0,
        "strays_created": 3201,
        "strays_purged": 1754,
        "strays_reintegrated": 0,
        "strays_migrated": 0,
        "num_recovering_processing": 0,
        "num_recovering_enqueued": 0,
        "num_recovering_prioritized": 0,
        "recovery_started": 0,
        "recovery_completed": 0
    }
}

2016-07-30T21:07:08.249 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1447, 1754/3201

It's got all those strays waiting to be purged, so why aren't any in flight (num_strays_purging=0)?
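For reference, those counters come from the MDS admin socket; a rough way to pull the same mds_cache section outside the test harness, assuming the ceph CLI is on the PATH of the MDS host and the daemon is named mds.a:

import json
import subprocess


def mds_cache_counters(mds_id="a"):
    # "ceph daemon <name> perf dump <logger>" returns just that counter
    # section as JSON.
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.{0}".format(mds_id), "perf", "dump", "mds_cache"]
    )
    return json.loads(out)["mds_cache"]


counters = mds_cache_counters()
print("strays waiting: {num_strays}, purging: {num_strays_purging}".format(**counters))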

#3 Updated by John Spray over 3 years ago

  • Assignee set to John Spray

#4 Updated by John Spray over 3 years ago

  • Status changed from New to Can't reproduce

This hasn't happened again, AFAIK.

#5 Updated by Greg Farnum about 3 years ago

  • Status changed from Can't reproduce to 12

This exact error popped up in a branch of mine (http://pulpito.ceph.com/gregf-2017-02-04_03:30:50-fs-wip-17594---basic-smithi/782344/); that run has many other failures, but they're about permissions enforcement. Searching, I also see two failures of this test on kraken: http://pulpito.ceph.com/teuthology-2016-12-31_11:10:02-fs-kraken---basic-smithi/678355/ and http://pulpito.ceph.com/teuthology-2017-01-12_11:10:02-fs-kraken---basic-smithi/711224/

#6 Updated by John Spray about 3 years ago

I've also seen this one rear its ugly head again in my last fs run.

I looked at this one:
/a/jspray-2017-02-06_11:13:20-fs-wip-jcsp-testing-20170204-distro-basic-smithi/789958

The teuthology.log shows the test code is waiting far too long before its calls to "perf dump", hence not seeing its expected operations in flight. It could be a symptom of an overloaded teuthology host.

I do notice that the MDS is spewing a lot of "was purgeable but no longer is!" messages (the reception of which may be generating load on the teuthology host?), which is a bit odd, but I'll probably only look at that if it's still happening post-purgequeue.
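A hypothetical way to check the overloaded-host theory: timestamp each "perf dump" poll so that slippage shows up as gaps much larger than the intended polling interval. poll_fn below is a stand-in for whatever actually fetches the counters.

import time


def poll_with_lag_warning(poll_fn, interval=1.0, samples=30, log=print):
    # Warn whenever the time between successive polls exceeds the intended
    # interval by more than a whole extra interval.
    last = time.monotonic()
    for _ in range(samples):
        result = poll_fn()
        now = time.monotonic()
        lag = (now - last) - interval
        if lag > interval:
            log("poll lagged by {0:.1f}s; host may be overloaded".format(lag))
        last = now
        time.sleep(interval)
    return result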

#7 Updated by Patrick Donnelly over 2 years ago

  • Duplicated by Bug #20118: Test failure: test_ops_throttle (tasks.cephfs.test_strays.TestStrays) added

#8 Updated by Patrick Donnelly over 2 years ago

  • Assignee changed from John Spray to Zheng Yan

Zheng, please take a look.

#9 Updated by Nathan Cutler over 2 years ago

Hello CephFS developers, I am reproducing this bug in the latest jewel integration branch. Here are the prime suspects:

The full list of PRs included in the integration branch is here: http://tracker.ceph.com/issues/20613#note-2

The failing test is: fs/recovery/{clusters/4-remote-clients.yaml debug/mds_client.yaml dirfrag/frag_enable.yaml mounts/ceph-fuse.yaml tasks/strays.yaml xfs.yaml}

And the runs with the failure are here: http://tracker.ceph.com/issues/20613#note-7

#10 Updated by Nathan Cutler over 2 years ago

Note: the failure is transient (occurred in 2 out of 5 runs so far).

#12 Updated by Patrick Donnelly about 1 year ago

  • Assignee deleted (Zheng Yan)

#13 Updated by Patrick Donnelly 4 months ago

  • Status changed from 12 to New

#14 Updated by Patrick Donnelly 3 months ago

  • Assignee set to Patrick Donnelly

#15 Updated by Patrick Donnelly 3 months ago

  • Status changed from New to Fix Under Review
  • Priority changed from Normal to Urgent
  • Target version set to v15.0.0
  • Backport set to nautilus,mimic
  • Pull request ID set to 32667

#16 Updated by Patrick Donnelly 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#17 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #43790: nautilus: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added

#18 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #43791: mimic: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
