Bug #16881

RuntimeError: Files in flight high water is unexpectedly low (0 / 6)

Added by Jeff Layton over 7 years ago. Updated almost 4 years ago.

Status: Resolved
Priority: Urgent
Category: -
Target version: v15.0.0
% Done: 0%
Source: other
Tags: -
Backport: nautilus,mimic
Regression: No
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: fs
Component(FS): qa-suite
Labels (FS): -
Pull request ID: 32667
Crash signature (v1): -
Crash signature (v2): -

Description

During a test_files_throttle test run, I hit the following error:

2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:======================================================================
2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:ERROR: test_files_throttle (tasks.cephfs.test_strays.TestStrays)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 27, in test_files_throttle
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:    self._test_throttling(self.FILES_THROTTLE)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 208, in _test_throttling
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:    ops_high_water, mds_max_purge_files
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:RuntimeError: Files in flight high water is unexpectedly low (0 / 6)
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------

It's not clear to me what this is actually warning about, so I'll need to track that down to see if it's an actual bug. The test run output is here:

http://pulpito.ceph.com/jlayton-2016-07-29_18:51:42-fs-wip-jlayton-nlink---basic-mira/341563
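
For context on what the test is checking: _test_throttling polls the MDS perf counters while a large batch of files is deleted, records the highest number of purge files it ever saw in flight, and then requires that high-water mark to have come reasonably close to the configured mds_max_purge_files limit (6 in this run). A minimal sketch of the shape of that final check; the helper name and the exact fraction are illustrative assumptions, not the real test code:

    # Sketch only: the real logic lives in tasks/cephfs/test_strays.py.
    def assert_throttle_was_exercised(files_high_water, mds_max_purge_files):
        # The purge machinery should have been driven near its configured limit
        # at some point during the deletion storm; a high water of 0 means no
        # poll ever observed a purge in flight.
        if files_high_water < mds_max_purge_files * 0.5:   # fraction is an assumption
            raise RuntimeError(
                "Files in flight high water is unexpectedly low ({0} / {1})".format(
                    files_high_water, mds_max_purge_files))

    # In this failure: assert_throttle_was_exercised(0, 6) -> the RuntimeError above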

Related issues (3): 0 open, 3 closed

  • Has duplicate: CephFS - Bug #20118: Test failure: test_ops_throttle (tasks.cephfs.test_strays.TestStrays) (Duplicate, 05/31/2017)
  • Copied to: CephFS - Backport #43790: nautilus: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) (Resolved, Patrick Donnelly)
  • Copied to: CephFS - Backport #43791: mimic: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) (Rejected)
#1

Updated by Jeff Layton over 7 years ago

  • Component(FS) qa-suite added

Comments in test_strays.py seem to indicate that this test is racy anyway:

        # Check that we got up to a respectable rate during the purge.  This is totally
        # racy, but should be safeish unless the cluster is pathologically slow, or
        # insanely fast such that the deletions all pass before we have polled the
        # statistics.                                                           

I'm guessing maybe the latter happened here?
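
To make the race concrete: the high-water figure is just the maximum value seen across periodic samples of the perf counters, so if every purge starts and finishes in the gaps between samples the recorded maximum stays at 0 even though plenty of purging happened. A rough sketch of that sampling pattern; the counter name and helper are assumptions for illustration:

    import time

    def sample_files_high_water(get_mds_cache_perf, duration=30.0, interval=1.0):
        # get_mds_cache_perf is assumed to return the "mds_cache" section of a
        # "perf dump"; num_strays_purging stands in for the in-flight files count.
        high_water = 0
        deadline = time.time() + duration
        while time.time() < deadline:
            stats = get_mds_cache_perf()
            high_water = max(high_water, stats.get("num_strays_purging", 0))
            time.sleep(interval)
        # If every purge completed between two samples, this is still 0.
        return high_water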

#2

Updated by John Spray over 7 years ago

They are racy, but this particular case is a bit odd:

2016-07-30T21:07:08.248 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "mds_cache": {
        "num_strays": 1447,
        "num_strays_purging": 0,
        "num_strays_delayed": 996,
        "num_purge_ops": 0,
        "strays_created": 3201,
        "strays_purged": 1754,
        "strays_reintegrated": 0,
        "strays_migrated": 0,
        "num_recovering_processing": 0,
        "num_recovering_enqueued": 0,
        "num_recovering_prioritized": 0,
        "recovery_started": 0,
        "recovery_completed": 0
    }
}

2016-07-30T21:07:08.249 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1447, 1754/3201

It's got all those strays waiting to be purged, so why aren't any in flight (num_strays_purging=0)?
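
For anyone digging into this, the counters above can be watched live from the MDS admin socket with "ceph daemon <mds> perf dump" while the purge is running: num_strays should drain steadily and num_strays_purging should be non-zero whenever work is genuinely in flight. A small sketch, where the daemon name mds.a and the one-second interval are assumptions:

    import json
    import subprocess
    import time

    def mds_cache_counters(daemon="mds.a"):
        # Queries the admin socket on the node running the MDS, much like the
        # test's own asok helpers do via teuthology.
        out = subprocess.check_output(["ceph", "daemon", daemon, "perf", "dump"])
        return json.loads(out)["mds_cache"]

    for _ in range(30):
        c = mds_cache_counters()
        print(c["num_strays"], c["num_strays_purging"], c["strays_purged"])
        time.sleep(1)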

#3

Updated by John Spray over 7 years ago

  • Assignee set to John Spray
#4

Updated by John Spray over 7 years ago

  • Status changed from New to Can't reproduce

This hasn't happened again, as far as I know.

#5

Updated by Greg Farnum about 7 years ago

  • Status changed from Can't reproduce to 12

This exact error popped up in a branch of mine (http://pulpito.ceph.com/gregf-2017-02-04_03:30:50-fs-wip-17594---basic-smithi/782344/; the run has many other failures but they're about permissions enforcement), and I see two failures of this test on kraken when searching: http://pulpito.ceph.com/teuthology-2016-12-31_11:10:02-fs-kraken---basic-smithi/678355/ and http://pulpito.ceph.com/teuthology-2017-01-12_11:10:02-fs-kraken---basic-smithi/711224/

#6

Updated by John Spray about 7 years ago

I've also seen this one rear its ugly head again in my last fs run.

I looked at this one:
/a/jspray-2017-02-06_11:13:20-fs-wip-jcsp-testing-20170204-distro-basic-smithi/789958

The teuthology.log shows the test code is waiting far too long before its calls to "perf dump", hence not seeing its expected operations in flight. It could be a symptom of an overloaded teuthology host.

I do notice that the MDS is spewing a lot of "was purgeable but no longer is!" messages (the reception of which may be generating load on the teuthology host?), which is a bit odd, but I'll probably only look at that if it's still happening post-purgequeue.
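
A quick back-of-the-envelope illustration of why delayed polling alone can explain a high water of 0: if the whole backlog drains before the first "perf dump" sample lands, no poll can ever observe an in-flight purge. All numbers below are illustrative assumptions, not measurements from this run:

    files_to_purge = 3200            # of the order of the strays_created figure earlier in the thread
    purge_rate = 500.0               # files purged per second (assumed)
    delay_before_first_poll = 10.0   # seconds of scheduling delay on a loaded host (assumed)

    purge_duration = files_to_purge / purge_rate    # ~6.4 s to drain the backlog
    if delay_before_first_poll > purge_duration:
        print("purge drained before the first sample; observed high water stays 0")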

#7

Updated by Patrick Donnelly over 6 years ago

  • Has duplicate Bug #20118: Test failure: test_ops_throttle (tasks.cephfs.test_strays.TestStrays) added
#8

Updated by Patrick Donnelly over 6 years ago

  • Assignee changed from John Spray to Zheng Yan

Zheng, please take a look.

#9

Updated by Nathan Cutler over 6 years ago

Hello CephFS developers, I am reproducing this bug in the latest jewel integration branch. Here are the prime suspects:

The full list of PRs included in the integration branch is here: http://tracker.ceph.com/issues/20613#note-2

The failing test is: fs/recovery/{clusters/4-remote-clients.yaml debug/mds_client.yaml dirfrag/frag_enable.yaml mounts/ceph-fuse.yaml tasks/strays.yaml xfs.yaml}

And the runs with the failure are here: http://tracker.ceph.com/issues/20613#note-7

#10

Updated by Nathan Cutler over 6 years ago

Note: the failure is transient (occurred in 2 out of 5 runs so far).

#12

Updated by Patrick Donnelly about 5 years ago

  • Assignee deleted (Zheng Yan)
#13

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
#14

Updated by Patrick Donnelly over 4 years ago

  • Assignee set to Patrick Donnelly
#15

Updated by Patrick Donnelly over 4 years ago

  • Status changed from New to Fix Under Review
  • Priority changed from Normal to Urgent
  • Target version set to v15.0.0
  • Backport set to nautilus,mimic
  • Pull request ID set to 32667
#16

Updated by Patrick Donnelly about 4 years ago

  • Status changed from Fix Under Review to Pending Backport
#17

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43790: nautilus: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
#18

Updated by Nathan Cutler about 4 years ago

  • Copied to Backport #43791: mimic: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
#19

Updated by Nathan Cutler almost 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
