Bug #14195 (closed): test_full_fclose fails (tasks.cephfs.test_full.TestClusterFull)

Added by Zheng Yan over 8 years ago. Updated over 7 years ago.

Status: Won't Fix
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

#1 - Updated by John Spray over 8 years ago

Did you mean to link a different log? The one above shows a failure in TestStrays.test_files_throttle (like #13903).

#4 - Updated by John Spray over 8 years ago

The symptom is that we're getting ENOSPC from write() calls during buffered IO, when we should be getting it from the fclose() call. I think I've also seen this fail recently on the equivalent fsync() test in TestFull.

We're only seeing this on openstack.

I think the key event is that the client is getting an updated OSDMap when the test doesn't expect it to. The test does this:

 A. write(enough data to trigger the full flag)
 B. fsync() <- should succeed, because of the delay between writing the data and the full flag getting set
 C. sleep(30) <- the full flag should get set during this period, but the client shouldn't find out, because it doesn't subscribe continuously to the map
 D. write(a little more data) <- should succeed, because the write is buffered
 E. fclose() <- should fail, because when flushing buffers we'll get the new map and realise that the full flag is set

So during C, we are getting an OSDMap, which is causing D to fail because the client now knows that the full flag is set.
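
The sequence the test drives on the client looks roughly like the sketch below. This is a minimal, self-contained illustration rather than the actual test code: the mount path, write sizes and chunking are assumptions, and os.close() stands in for the fclose() that gives the test its name.

    import errno
    import os
    import time

    PATH = "/mnt/cephfs/full_test_file"   # assumed mount point, not the test's actual path

    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)

    # A: write enough data to push the (tiny memstore) pool over its full threshold
    for _ in range(128):
        os.write(fd, b"x" * (4 * 1024 * 1024))

    # B: expected to succeed -- the full flag hasn't been set yet
    os.fsync(fd)

    # C: the full flag gets set during this window, but the client shouldn't
    #    find out, because it doesn't subscribe continuously to the OSDMap
    time.sleep(30)

    # D: expected to succeed, because the data only lands in the client's buffers
    os.write(fd, b"y" * 4096)

    # E: flushing buffers on close fetches the new map, so ENOSPC is expected here
    try:
        os.close(fd)
    except OSError as e:
        assert e.errno == errno.ENOSPC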

#5 - Updated by John Spray over 8 years ago

  • Status changed from New to In Progress
  • Assignee set to John Spray

#6 - Updated by John Spray over 8 years ago

So there are two different failure modes here: we either get the exception out of the fsync() or out of the following write(). It's a result of the buffer flush taking an absurdly long time. This is using memstore with 100MB stores, and the nodes (in theory) have 8GB of RAM, so I don't know why it's running so slowly.

/a/jspray-2016-01-12_12:26:08-fs:recovery-master---basic-openstack/2460 - failed in fsync()

/a/jspray-2016-01-12_12:25:44-fs:recovery-master---basic-openstack/2447/ - failed in write()

Ah: the Objecter is finding laggy ops and calling _maybe_request_map() in tick().
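
In other words, the laggy-op handling pulls the new OSDMap (with the full flag) onto the client earlier than the test assumes. The toy model below is plain Python written for this note, not Ceph code; all class and attribute names are invented for illustration.

    import errno

    class ToyOSDMap:
        def __init__(self, epoch, full):
            self.epoch = epoch
            self.full = full

    class ToyClient:
        """The client only learns the full flag when it picks up a new map."""
        def __init__(self, initial_map):
            self.osdmap = initial_map
            self.have_laggy_ops = False

        def tick(self, latest_map):
            # Mirrors the behaviour described above: laggy ops make the
            # client ask for a fresh OSDMap instead of waiting.
            if self.have_laggy_ops:
                self.osdmap = latest_map

        def buffered_write(self):
            # The test expects this to succeed because the client's map is stale.
            if self.osdmap.full:
                raise OSError(errno.ENOSPC, "cluster marked full")

    latest = ToyOSDMap(epoch=2, full=True)        # map published while the test sleeps
    client = ToyClient(ToyOSDMap(epoch=1, full=False))

    client.have_laggy_ops = True                  # slow memstore -> ops look laggy
    client.tick(latest)                           # tick() fetches the new map early

    try:
        client.buffered_write()                   # step D fails here instead of step E
    except OSError as e:
        print("write failed early:", e)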

#7 - Updated by John Spray almost 8 years ago

  • Status changed from In Progress to Won't Fix

This was only a test race, and it only happens on pathologically slow clusters.

#8 - Updated by Patrick Donnelly over 7 years ago

I just ran across this in http://pulpito.ceph.com/pdonnell-2016-10-19_23:26:58-fs:recovery-master---basic-mira/486355/

I'm going to add a comment to the test so future eyes know about the race...
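
A comment along these lines would capture the race for future readers (the wording here is hypothetical, not the actual change that was made to the test):

    # NOTE: on pathologically slow clusters the Objecter can see laggy ops and
    # request a new OSDMap during the sleep below, so the client learns about
    # the full flag early and the write()/fsync() fails with ENOSPC instead of
    # the final close/fclose.  See tracker #14195.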
