Bug #16881
RuntimeError: Files in flight high water is unexpectedly low (0 / 6)
Status: Closed
Description
During a test_files_throttle test run, I hit the following error:
2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:======================================================================
2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:ERROR: test_files_throttle (tasks.cephfs.test_strays.TestStrays)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 27, in test_files_throttle
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:    self._test_throttling(self.FILES_THROTTLE)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 208, in _test_throttling
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:    ops_high_water, mds_max_purge_files
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:RuntimeError: Files in flight high water is unexpectedly low (0 / 6)
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
It's not clear to me what this is actually warning about, so I'll need to track that down to see if it's an actual bug. The test run output is here:
http://pulpito.ceph.com/jlayton-2016-07-29_18:51:42-fs-wip-jlayton-nlink---basic-mira/341563
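For context, here is a minimal sketch of the shape of the check that raises this error. This is not the actual tasks/cephfs/test_strays.py code; the function name and the 0.5 saturation threshold are illustrative assumptions. The test records the highest in-flight purge count it observes while polling and expects that value to have come reasonably close to the configured mds_max_purge_files throttle.

# Illustrative sketch only -- not the actual tasks/cephfs/test_strays.py code.
def assert_throttle_saturated(files_high_water, mds_max_purge_files):
    # files_high_water: highest in-flight purge count observed while polling
    # mds_max_purge_files: the purge throttle configured for the test run
    # The 0.5 saturation threshold is an assumption for illustration.
    if files_high_water < mds_max_purge_files * 0.5:
        raise RuntimeError(
            "Files in flight high water is unexpectedly low (%s / %s)"
            % (files_high_water, mds_max_purge_files))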
Updated by Jeff Layton over 7 years ago
- Component(FS) qa-suite added
Comments in test_strays.py seem to indicate that this test is racy anyway:
# Check that we got up to a respectable rate during the purge.  This is totally
# racy, but should be safeish unless the cluster is pathologically slow, or
# insanely fast such that the deletions all pass before we have polled the
# statistics.
I'm guessing maybe the latter happened here?
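A rough sketch of the polling pattern involved, and of how the "insanely fast" case produces a high-water mark of 0 (get_perf_counters is a hypothetical helper standing in for the test's admin-socket query; the interval and timeout values are assumptions):

import time

def poll_purge_high_water(get_perf_counters, poll_interval=0.5, timeout=120):
    # get_perf_counters is a hypothetical callable returning the "mds_cache"
    # section of a perf dump as a dict (stand-in for the test's asok query).
    # Track the highest number of strays seen purging at once.  If the purge
    # finishes between two samples, the high-water mark can stay at 0 even
    # though throttling worked -- the race described in the comment above.
    high_water = 0
    deadline = time.time() + timeout
    while time.time() < deadline:
        counters = get_perf_counters()
        high_water = max(high_water, counters["num_strays_purging"])
        if counters["strays_purged"] >= counters["strays_created"]:
            break  # purge complete; later samples cannot raise the mark
        time.sleep(poll_interval)
    return high_water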
Updated by John Spray over 7 years ago
They are racy, but this particular case is a bit odd:
2016-07-30T21:07:08.248 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "mds_cache": {
        "num_strays": 1447,
        "num_strays_purging": 0,
        "num_strays_delayed": 996,
        "num_purge_ops": 0,
        "strays_created": 3201,
        "strays_purged": 1754,
        "strays_reintegrated": 0,
        "strays_migrated": 0,
        "num_recovering_processing": 0,
        "num_recovering_enqueued": 0,
        "num_recovering_prioritized": 0,
        "recovery_started": 0,
        "recovery_completed": 0
    }
}
2016-07-30T21:07:08.249 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1447, 1754/3201
It's got all those strays waiting to be purged, so why aren't any in flight (num_strays_purging=0)?
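For reference, one way to grab the same counter snapshot outside the test harness, assuming the MDS admin socket is reachable from the local host (the helper name and the mds_id value are placeholders):

import json
import subprocess

def mds_cache_counters(mds_id):
    # Hypothetical helper: query the MDS admin socket the same way the
    # _json_asok output above was produced.  "ceph daemon <name> perf dump"
    # prints the full perf counter JSON; we keep only the mds_cache section.
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.%s" % mds_id, "perf", "dump"])
    return json.loads(out)["mds_cache"]

# Example: mds_cache_counters("a")["num_strays_purging"] should be non-zero
# whenever strays are actively being purged.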
Updated by John Spray over 7 years ago
- Status changed from New to Can't reproduce
This hasn't happened again afaik
Updated by Greg Farnum about 7 years ago
- Status changed from Can't reproduce to 12
This exact error popped up in a branch of mine (http://pulpito.ceph.com/gregf-2017-02-04_03:30:50-fs-wip-17594---basic-smithi/782344/; that run has many other failures, but they're about permissions enforcement), and searching turns up two failures of this test on kraken: http://pulpito.ceph.com/teuthology-2016-12-31_11:10:02-fs-kraken---basic-smithi/678355/ and http://pulpito.ceph.com/teuthology-2017-01-12_11:10:02-fs-kraken---basic-smithi/711224/
Updated by John Spray about 7 years ago
I've also seen this one rear its ugly head again in my last fs run.
I looked at this one:
/a/jspray-2017-02-06_11:13:20-fs-wip-jcsp-testing-20170204-distro-basic-smithi/789958
The teuthology.log shows the test code is waiting far too long before its calls to "perf dump", hence not seeing its expected operations in flight. It could be a symptom of an overloaded teuthology host.
I do notice that the MDS is spewing a lot of "was purgeable but no longer is!" messages (receiving all of that log output may itself be generating load on the teuthology host?), which is a bit odd, but I'll probably only look into that if it's still happening post-PurgeQueue.
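A small diagnostic sketch for the overloaded-host hypothesis (get_perf_counters is again a hypothetical stand-in for the perf-dump query; the sample count and interval are arbitrary): if the measured gaps between samples are much larger than the requested interval, it is the test host's polling, not the MDS, that missed the in-flight window.

import time

def sample_polling_gaps(get_perf_counters, samples=20, poll_interval=0.5):
    # Record the wall-clock gap between successive perf-dump samples alongside
    # the in-flight purge count.  Gaps much larger than poll_interval point at
    # the polling side as the reason no in-flight operations were observed.
    gaps = []
    last = time.time()
    for _ in range(samples):
        counters = get_perf_counters()
        now = time.time()
        gaps.append((now - last, counters["num_strays_purging"]))
        last = now
        time.sleep(poll_interval)
    return gaps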
Updated by Patrick Donnelly over 6 years ago
- Has duplicate Bug #20118: Test failure: test_ops_throttle (tasks.cephfs.test_strays.TestStrays) added
Updated by Patrick Donnelly over 6 years ago
- Assignee changed from John Spray to Zheng Yan
Zheng, please take a look.
Updated by Nathan Cutler over 6 years ago
Hello CephFS developers, I am reproducing this bug in the latest jewel integration branch. Here are the prime suspects:
- https://github.com/ceph/ceph/pull/14673
- https://github.com/ceph/ceph/pull/14699
- https://github.com/ceph/ceph/pull/15442
The full list of PRs included in the integration branch is here: http://tracker.ceph.com/issues/20613#note-2
The failing test is: fs/recovery/{clusters/4-remote-clients.yaml debug/mds_client.yaml dirfrag/frag_enable.yaml mounts/ceph-fuse.yaml tasks/strays.yaml xfs.yaml}
And the runs with the failure are here: http://tracker.ceph.com/issues/20613#note-7
Updated by Nathan Cutler over 6 years ago
Note: the failure is transient (occurred in 2 out of 5 runs so far).
Updated by Patrick Donnelly about 6 years ago
Updated by Patrick Donnelly over 4 years ago
- Status changed from New to Fix Under Review
- Priority changed from Normal to Urgent
- Target version set to v15.0.0
- Backport set to nautilus,mimic
- Pull request ID set to 32667
Updated by Patrick Donnelly about 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #43790: nautilus: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #43791: mimic: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".