Bug #16881
RuntimeError: Files in flight high water is unexpectedly low (0 / 6)
Status: Closed
Description
During a test_files_throttle test run, I hit the following error:
2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:======================================================================
2016-07-30T21:07:16.863 INFO:tasks.cephfs_test_runner:ERROR: test_files_throttle (tasks.cephfs.test_strays.TestStrays)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 27, in test_files_throttle
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:    self._test_throttling(self.FILES_THROTTLE)
2016-07-30T21:07:16.864 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/test_strays.py", line 208, in _test_throttling
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:    ops_high_water, mds_max_purge_files
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:RuntimeError: Files in flight high water is unexpectedly low (0 / 6)
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:
2016-07-30T21:07:16.865 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
It's not clear to me what this is actually warning about, so I'll need to track that down to see if it's an actual bug. The test run output is here:
http://pulpito.ceph.com/jlayton-2016-07-29_18:51:42-fs-wip-jlayton-nlink---basic-mira/341563
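For context, here is a minimal sketch of the shape of the check that raises this error. This is not the actual tasks/cephfs/test_strays.py code; the function name and the 0.5 saturation threshold are illustrative assumptions. The test records the highest in-flight purge count it observes while polling and expects that value to have come reasonably close to the configured mds_max_purge_files throttle.

# Illustrative sketch only -- not the actual tasks/cephfs/test_strays.py code.
def assert_throttle_saturated(files_high_water, mds_max_purge_files):
    # files_high_water: highest in-flight purge count observed while polling
    # mds_max_purge_files: the purge throttle configured for the test run
    # The 0.5 saturation threshold is an assumption for illustration.
    if files_high_water < mds_max_purge_files * 0.5:
        raise RuntimeError(
            "Files in flight high water is unexpectedly low (%s / %s)"
            % (files_high_water, mds_max_purge_files))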
Updated by Jeff Layton over 7 years ago
- Component(FS) qa-suite added
Comments in test_strays.py seem to indicate that this test is racy anyway:
# Check that we got up to a respectable rate during the purge.  This is totally
# racy, but should be safeish unless the cluster is pathologically slow, or
# insanely fast such that the deletions all pass before we have polled the
# statistics.
I'm guessing maybe the latter happened here?
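A rough sketch of the polling pattern involved, and of how the "insanely fast" case produces a high-water mark of 0 (get_perf_counters is a hypothetical helper standing in for the test's admin-socket query; the interval and timeout values are assumptions):

import time

def poll_purge_high_water(get_perf_counters, poll_interval=0.5, timeout=120):
    # get_perf_counters is a hypothetical callable returning the "mds_cache"
    # section of a perf dump as a dict (stand-in for the test's asok query).
    # Track the highest number of strays seen purging at once.  If the purge
    # finishes between two samples, the high-water mark can stay at 0 even
    # though throttling worked -- the race described in the comment above.
    high_water = 0
    deadline = time.time() + timeout
    while time.time() < deadline:
        counters = get_perf_counters()
        high_water = max(high_water, counters["num_strays_purging"])
        if counters["strays_purged"] >= counters["strays_created"]:
            break  # purge complete; later samples cannot raise the mark
        time.sleep(poll_interval)
    return high_water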
Updated by John Spray over 7 years ago
They are racy, but this particular case is a bit odd:
2016-07-30T21:07:08.248 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "mds_cache": {
        "num_strays": 1447,
        "num_strays_purging": 0,
        "num_strays_delayed": 996,
        "num_purge_ops": 0,
        "strays_created": 3201,
        "strays_purged": 1754,
        "strays_reintegrated": 0,
        "strays_migrated": 0,
        "num_recovering_processing": 0,
        "num_recovering_enqueued": 0,
        "num_recovering_prioritized": 0,
        "recovery_started": 0,
        "recovery_completed": 0
    }
}
2016-07-30T21:07:08.249 INFO:tasks.cephfs.test_strays:Waiting for purge to complete 0/1447, 1754/3201
It's got all those strays waiting to be purged, so why aren't any in flight (num_strays_purging=0)?
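For reference, one way to grab the same counter snapshot outside the test harness, assuming the MDS admin socket is reachable from the local host (the helper name and the mds_id value are placeholders):

import json
import subprocess

def mds_cache_counters(mds_id):
    # Hypothetical helper: query the MDS admin socket the same way the
    # _json_asok output above was produced.  "ceph daemon <name> perf dump"
    # prints the full perf counter JSON; we keep only the mds_cache section.
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.%s" % mds_id, "perf", "dump"])
    return json.loads(out)["mds_cache"]

# Example: mds_cache_counters("a")["num_strays_purging"] should be non-zero
# whenever strays are actively being purged.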
Updated by John Spray over 7 years ago
- Status changed from New to Can't reproduce
This hasn't happened again afaik
Updated by Greg Farnum about 7 years ago
- Status changed from Can't reproduce to 12
This exact error popped up in a branch of mine (http://pulpito.ceph.com/gregf-2017-02-04_03:30:50-fs-wip-17594---basic-smithi/782344/; that run has many other failures, but they're about permissions enforcement), and searching turns up two failures of this test on kraken: http://pulpito.ceph.com/teuthology-2016-12-31_11:10:02-fs-kraken---basic-smithi/678355/ and http://pulpito.ceph.com/teuthology-2017-01-12_11:10:02-fs-kraken---basic-smithi/711224/
Updated by John Spray about 7 years ago
I've also seen this one rear its ugly head again in my last fs run.
I looked at this one:
/a/jspray-2017-02-06_11:13:20-fs-wip-jcsp-testing-20170204-distro-basic-smithi/789958
The teuthology.log shows the test code is waiting far too long before its calls to "perf dump", hence not seeing its expected operations in flight. It could be a symptom of an overloaded teuthology host.
I do notice that the MDS is spewing a lot of "was purgeable but no longer is!" messages (receiving all of that log output may itself be generating load on the teuthology host?), which is a bit odd, but I'll probably only look into that if it's still happening post-PurgeQueue.
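A small diagnostic sketch for the overloaded-host hypothesis (get_perf_counters is again a hypothetical stand-in for the perf-dump query; the sample count and interval are arbitrary): if the measured gaps between samples are much larger than the requested interval, it is the test host's polling, not the MDS, that missed the in-flight window.

import time

def sample_polling_gaps(get_perf_counters, samples=20, poll_interval=0.5):
    # Record the wall-clock gap between successive perf-dump samples alongside
    # the in-flight purge count.  Gaps much larger than poll_interval point at
    # the polling side as the reason no in-flight operations were observed.
    gaps = []
    last = time.time()
    for _ in range(samples):
        counters = get_perf_counters()
        now = time.time()
        gaps.append((now - last, counters["num_strays_purging"]))
        last = now
        time.sleep(poll_interval)
    return gaps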
Updated by Patrick Donnelly over 6 years ago
- Has duplicate Bug #20118: Test failure: test_ops_throttle (tasks.cephfs.test_strays.TestStrays) added
Updated by Patrick Donnelly over 6 years ago
- Assignee changed from John Spray to Zheng Yan
Zheng, please take a look.
Updated by Nathan Cutler over 6 years ago
Hello CephFS developers, I am reproducing this bug in the latest jewel integration branch. Here are the prime suspects:
- https://github.com/ceph/ceph/pull/14673
- https://github.com/ceph/ceph/pull/14699
- https://github.com/ceph/ceph/pull/15442
The full list of PRs included in the integration branch is here: http://tracker.ceph.com/issues/20613#note-2
The failing test is: fs/recovery/{clusters/4-remote-clients.yaml debug/mds_client.yaml dirfrag/frag_enable.yaml mounts/ceph-fuse.yaml tasks/strays.yaml xfs.yaml}
And the runs with the failure are here: http://tracker.ceph.com/issues/20613#note-7
Updated by Nathan Cutler over 6 years ago
Note: the failure is transient (occurred in 2 out of 5 runs so far).
Updated by Patrick Donnelly about 6 years ago
Updated by Patrick Donnelly over 4 years ago
- Status changed from New to Fix Under Review
- Priority changed from Normal to Urgent
- Target version set to v15.0.0
- Backport set to nautilus,mimic
- Pull request ID set to 32667
Updated by Patrick Donnelly about 4 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #43790: nautilus: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
Updated by Nathan Cutler about 4 years ago
- Copied to Backport #43791: mimic: RuntimeError: Files in flight high water is unexpectedly low (0 / 6) added
Updated by Nathan Cutler almost 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".