Bug #14195: test_full_fclose fails (tasks.cephfs.test_full.TestClusterFull)
Status: Closed
Updated by John Spray over 8 years ago
Did you mean to link a different log? The one above shows failure in TestStrays.test_files_throttle (like #13903)
Updated by John Spray over 8 years ago
OK, here are some matching instances:
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-21_23:04:02-fs-master---basic-openstack/48392/teuthology.log
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-30_23:04:01-fs-master---basic-openstack/56499/teuthology.log
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-28_23:04:02-fs-master---basic-openstack/54409/
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-23_23:04:02-fs-master---basic-openstack/51162/
Updated by John Spray over 8 years ago
The symptom is that we're getting ENOSPC from write() calls during buffered IO, where we should be getting them from fclose() calls. I think I've also seen this fail recently on the equivalent fsync() test in TestFull.
We're only seeing this on openstack.
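To make the expected behaviour concrete, here is a minimal sketch (the path, buffer size, and write size are all hypothetical) of where the ENOSPC should surface under buffered IO: a small write lands in the userspace buffer and succeeds, and the error should only appear when the buffer is flushed on close.

```python
import errno

# Hypothetical mount point; assumes the pool behind it has the full flag set.
PATH = "/mnt/cephfs/full_test_file"

f = open(PATH, "wb", buffering=1024 * 1024)  # large userspace buffer
try:
    f.write(b"x" * 4096)  # should succeed: data only lands in the buffer
    f.close()             # should fail here: the flush hits the full pool
except OSError as e:
    # The bug: on the slow OpenStack nodes the ENOSPC comes out of write()
    # (or fsync()) instead, because the client already saw the new OSDMap.
    assert e.errno == errno.ENOSPC
```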
I think the key event is that the client is getting an updated OSDMap when the test doesn't expect it to. The test does this:
A. write(enough data to trigger full flag)
B. fsync() <- should succeed because of the delay between writing data and the full flag getting set
C. sleep(30) <- full flag should get set during this period, but the client shouldn't find out because it doesn't subscribe continuously to the map
D. write(a little more data) <- should succeed because buffered
E. fclose() <- should fail because when flushing buffers we'll get the new map and realise that the full flag is set
So during C, we are getting an OSDMap, which is causing D to fail because the client now knows that the full flag is set.
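For reference, a simplified sketch of that sequence (the real test lives in tasks.cephfs.test_full and goes through the mount helpers; the path and sizes here are illustrative):

```python
import errno
import os
import time

fd = os.open("/mnt/cephfs/full_test", os.O_WRONLY | os.O_CREAT)
os.write(fd, b"x" * (100 * 1024 * 1024))  # A: enough data to trip the full flag
os.fsync(fd)                              # B: succeeds; the flag isn't set yet
time.sleep(30)                            # C: flag gets set; client shouldn't hear about it
os.write(fd, b"y" * 4096)                 # D: buffered in the page cache, should succeed
try:
    os.close(fd)                          # E: flushing fetches the new map -> ENOSPC
except OSError as e:
    assert e.errno == errno.ENOSPC
```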
Updated by John Spray over 8 years ago
- Status changed from New to In Progress
- Assignee set to John Spray
Updated by John Spray over 8 years ago
So there are two different failure modes here, we either get the exception out of the fsync() or out of the following write(). It's a result of the flush of buffers taking an absurdly long time. This is using memstore with 100MB stores, and the nodes (in theory) have 8GB of RAM, so I don't know why it's running so slowly.
/a/jspray-2016-01-12_12:26:08-fs:recovery-master---basic-openstack/2460 failed in fsync()
/a/jspray-2016-01-12_12:25:44-fs:recovery-master---basic-openstack/2447/ failed in write()
Ah. Objecter is finding laggy ops, and calling _maybe_request_map in tick().
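In other words (a rough Python model of the C++ Objecter behaviour, with illustrative names and thresholds, not Ceph's real API): the periodic tick finds ops that have been in flight too long and requests a fresh OSDMap, which is how the client learns about the full flag ahead of the test's schedule.

```python
import time

LAGGY_THRESHOLD = 2.0  # seconds; stand-in for the real timeout


class ObjecterModel:
    def __init__(self):
        self.inflight = {}          # op id -> submit time
        self.map_requested = False

    def submit(self, op_id):
        self.inflight[op_id] = time.monotonic()

    def _maybe_request_map(self):
        # The real client subscribes to the next OSDMap epoch here;
        # this model just records that a request would go out.
        self.map_requested = True

    def tick(self):
        now = time.monotonic()
        if any(now - t > LAGGY_THRESHOLD for t in self.inflight.values()):
            # On a pathologically slow cluster every op looks laggy, so the
            # client fetches the map early, sees the full flag, and steps D/E
            # of the test no longer happen in the expected order.
            self._maybe_request_map()
```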
Updated by John Spray almost 8 years ago
- Status changed from In Progress to Won't Fix
This was only a test race, and it only happens on pathologically slow clusters.
Updated by Patrick Donnelly over 7 years ago
I just ran across this in http://pulpito.ceph.com/pdonnell-2016-10-19_23:26:58-fs:recovery-master---basic-mira/486355/
I'm going to add a comment to the test so future eyes know about the race...
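Something along these lines, perhaps (a hypothetical draft of that comment):

```python
# NOTE: this check is racy on pathologically slow clusters (tracker #14195).
# If ops go laggy, the Objecter's tick() calls _maybe_request_map(), the client
# learns the full flag earlier than this test assumes, and the ENOSPC surfaces
# from write()/fsync() instead of the final close.
```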