Bug #14195: test_full_fclose fails (tasks.cephfs.test_full.TestClusterFull)
Status: Closed
Updated by John Spray over 8 years ago
Did you mean to link a different log? The one above shows failure in TestStrays.test_files_throttle (like #13903)
Updated by John Spray over 8 years ago
OK, here are some matching instances:
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-21_23:04:02-fs-master---basic-openstack/48392/teuthology.log
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-30_23:04:01-fs-master---basic-openstack/56499/teuthology.log
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-28_23:04:02-fs-master---basic-openstack/54409/
http://teuthology.ovh.sepia.ceph.com/teuthology/teuthology-2015-12-23_23:04:02-fs-master---basic-openstack/51162/
Updated by John Spray over 8 years ago
The symptom is that we're getting ENOSPC from write() calls during buffered IO, where we should be getting them from fclose() calls. I think I've also seen this fail recently on the equivalent fsync() test in TestFull.
We're only seeing this on openstack.
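To make the expected behaviour concrete, here is a minimal sketch (the path, buffer size, and write size are all hypothetical) of where the ENOSPC should surface under buffered IO: a small write lands in the userspace buffer and succeeds, and the error should only appear when the buffer is flushed on close.

```python
import errno

# Hypothetical mount point; assumes the pool behind it has the full flag set.
PATH = "/mnt/cephfs/full_test_file"

f = open(PATH, "wb", buffering=1024 * 1024)  # large userspace buffer
try:
    f.write(b"x" * 4096)  # should succeed: data only lands in the buffer
    f.close()             # should fail here: the flush hits the full pool
except OSError as e:
    # The bug: on the slow OpenStack nodes the ENOSPC comes out of write()
    # (or fsync()) instead, because the client already saw the new OSDMap.
    assert e.errno == errno.ENOSPC
```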
I think the key event is that the client is getting an updated OSDMap when the test doesn't expect it to. The test does this:
A. write(enough data to trigger full flag)
B. fsync() <- should succeed because of the delay between writing data and the full flag getting set
C. sleep(30) <- full flag should get set during this period, but the client shouldn't find out because it doesn't subscribe continuously to the map
D. write(a little more data) <- should succeed because buffered
E. fclose() <- should fail because when flushing buffers we'll get the new map and realise that the full flag is set
So during C, we are getting an OSDMap, which is causing D to fail because the client now knows that the full flag is set.
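For reference, a simplified sketch of that sequence (the real test lives in tasks.cephfs.test_full and goes through the mount helpers; the path and sizes here are illustrative):

```python
import errno
import os
import time

fd = os.open("/mnt/cephfs/full_test", os.O_WRONLY | os.O_CREAT)
os.write(fd, b"x" * (100 * 1024 * 1024))  # A: enough data to trip the full flag
os.fsync(fd)                              # B: succeeds; the flag isn't set yet
time.sleep(30)                            # C: flag gets set; client shouldn't hear about it
os.write(fd, b"y" * 4096)                 # D: buffered in the page cache, should succeed
try:
    os.close(fd)                          # E: flushing fetches the new map -> ENOSPC
except OSError as e:
    assert e.errno == errno.ENOSPC
```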
Updated by John Spray over 8 years ago
- Status changed from New to In Progress
- Assignee set to John Spray
Updated by John Spray over 8 years ago
So there are two different failure modes here, we either get the exception out of the fsync() or out of the following write(). It's a result of the flush of buffers taking an absurdly long time. This is using memstore with 100MB stores, and the nodes (in theory) have 8GB of RAM, so I don't know why it's running so slowly.
/a/jspray-2016-01-12_12:26:08-fs:recovery-master---basic-openstack/2460 failed in fsync()
/a/jspray-2016-01-12_12:25:44-fs:recovery-master---basic-openstack/2447/ failed in write()
Ah. Objecter is finding laggy ops, and calling _maybe_request_map in tick().
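In other words (a rough Python model of the C++ Objecter behaviour, with illustrative names and thresholds, not Ceph's real API): the periodic tick finds ops that have been in flight too long and requests a fresh OSDMap, which is how the client learns about the full flag ahead of the test's schedule.

```python
import time

LAGGY_THRESHOLD = 2.0  # seconds; stand-in for the real timeout


class ObjecterModel:
    def __init__(self):
        self.inflight = {}          # op id -> submit time
        self.map_requested = False

    def submit(self, op_id):
        self.inflight[op_id] = time.monotonic()

    def _maybe_request_map(self):
        # The real client subscribes to the next OSDMap epoch here;
        # this model just records that a request would go out.
        self.map_requested = True

    def tick(self):
        now = time.monotonic()
        if any(now - t > LAGGY_THRESHOLD for t in self.inflight.values()):
            # On a pathologically slow cluster every op looks laggy, so the
            # client fetches the map early, sees the full flag, and steps D/E
            # of the test no longer happen in the expected order.
            self._maybe_request_map()
```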
Updated by John Spray almost 8 years ago
- Status changed from In Progress to Won't Fix
This was only a test race, and it only happens on pathologically slow clusters.
Updated by Patrick Donnelly over 7 years ago
I just ran across this in http://pulpito.ceph.com/pdonnell-2016-10-19_23:26:58-fs:recovery-master---basic-mira/486355/
I'm going to add a comment to the test so future eyes know about the race...
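Something along these lines, perhaps (a hypothetical draft of that comment):

```python
# NOTE: this check is racy on pathologically slow clusters (tracker #14195).
# If ops go laggy, the Objecter's tick() calls _maybe_request_map(), the client
# learns the full flag earlier than this test assumes, and the ENOSPC surfaces
# from write()/fsync() instead of the final close.
```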