Bug #12971

closed

TestQuotaFull fails

Added by Zheng Yan over 8 years ago. Updated over 8 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

http://qa-proxy.ceph.com/teuthology/teuthology-2015-09-01_23:04:01-fs-infernalis---basic-multi/1041649/teuthology.log

2015-09-04T08:46:56.854 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2015-09-04T08:46:56.854 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_infernalis/tasks/cephfs/test_full.py", line 202, in test_full_different_file
2015-09-04T08:46:56.854 INFO:tasks.cephfs_test_runner:    self._test_full(True)
2015-09-04T08:46:56.854 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_infernalis/tasks/cephfs/test_full.py", line 187, in _test_full
2015-09-04T08:46:56.855 INFO:tasks.cephfs_test_runner:    timeout=osd_mon_report_interval_max * 5)
2015-09-04T08:46:56.855 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_infernalis/tasks/cephfs/cephfs_test_case.py", line 205, in wait_until_true
2015-09-04T08:46:56.855 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out after {0} seconds".format(elapsed))
2015-09-04T08:46:56.855 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out after 25 seconds
2015-09-04T08:46:56.855 INFO:tasks.cephfs_test_runner:
2015-09-04T08:46:56.856 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2015-09-04T08:46:56.856 INFO:tasks.cephfs_test_runner:Ran 7 tests in 518.185s
2015-09-04T08:46:56.856 INFO:tasks.cephfs_test_runner:
2015-09-04T08:46:56.856 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2015-09-04T08:46:56.856 INFO:tasks.cephfs_test_runner:
2015-09-04T08:46:56.857 INFO:tasks.cephfs_test_runner:======================================================================
2015-09-04T08:46:56.857 INFO:tasks.cephfs_test_runner:ERROR: test_full_different_file (tasks.cephfs.test_full.TestQuotaFull)
2015-09-04T08:46:56.857 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2015-09-04T08:46:56.857 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2015-09-04T08:46:56.857 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_infernalis/tasks/cephfs/test_full.py", line 202, in test_full_different_file
2015-09-04T08:46:56.857 INFO:tasks.cephfs_test_runner:    self._test_full(True)
2015-09-04T08:46:56.858 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_infernalis/tasks/cephfs/test_full.py", line 187, in _test_full
2015-09-04T08:46:56.858 INFO:tasks.cephfs_test_runner:    timeout=osd_mon_report_interval_max * 5)
2015-09-04T08:46:56.858 INFO:tasks.cephfs_test_runner:  File "/var/lib/teuthworker/src/ceph-qa-suite_infernalis/tasks/cephfs/cephfs_test_case.py", line 205, in wait_until_true
2015-09-04T08:46:56.858 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out after {0} seconds".format(elapsed))
2015-09-04T08:46:56.858 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out after 25 seconds
Actions #1

Updated by Zheng Yan over 8 years ago

2015-09-04 11:46:30.042620 7f00e7929700 10 mds.0.cache.strays _consume: allocating allowance 5 to 0 in flight
2015-09-04 11:46:30.042624 7f00e7929700 10 mds.0.cache.strays purge [dentry #100/stray1/10000000000 [2,head] auth (dversion lock) v=23 inode=0x7f00f439c320 state=new | request=0 lock=0 inodepin=1 purging=1 dirty=1 authpin=0 0x7f00f43d12c0] [inode 10000000000 [2,head] ~mds0/stray1/10000000000 auth v23 s=16777216 nl=0 n(v0 b16777216 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 caps=0 dirtyparent=0 dirty=1 waiter=0 authpin=0 0x7f00f439c320]
2015-09-04 11:46:30.042658 7f00e7929700 10 mds.0.cache.strays  realm snaprealm(100 seq 1 lc 0 cr 0 cps 1 snaps={} 0x7f00f421ba80)
2015-09-04 11:46:30.042662 7f00e7929700 10 mds.0.cache.strays purge 0~16777216 objects 0~4 snapc 0x7f00f421bc00 on [inode 10000000000 [2,head] ~mds0/stray1/10000000000 auth v23 s=16777216 nl=0 n(v0 b16777216 1=1+0) (iversion lock) | ptrwaiter=0 request=0 lock=0 caps=0 dirtyparent=0 dirty=1 waiter=0 authpin=0 0x7f00f439c320]
2015-09-04 11:46:30.042696 7f00e7929700  0 mds.0.objecter  FULL, paused modify 0x7f00f421e300 tid 55
2015-09-04 11:46:30.042709 7f00e7929700  0 mds.0.objecter  FULL, paused modify 0x7f00f421e540 tid 56
2015-09-04 11:46:30.042719 7f00e7929700  0 mds.0.objecter  FULL, paused modify 0x7f00f4219f80 tid 57
2015-09-04 11:46:30.042728 7f00e7929700  0 mds.0.objecter  FULL, paused modify 0x7f00f421e0c0 tid 58

Looks like the objecter pauses the OSD requests that delete the objects.

Actions #2

Updated by Zheng Yan over 8 years ago

  • Assignee set to John Spray

The objecter pauses all write OSD requests (including delete requests) when the cluster is full or the pool quota is reached. The test case 'test_full_different_file' first writes data to files until the pool quota is reached, then deletes the files and expects the pool's full flag to be cleared. That can't work, because the MDS's objecter pauses the very requests that purge the strays.

Actions #3

Updated by Greg Farnum over 8 years ago

That's not quite the problem. The MDS calls Objecter::unset_honor_osdmap_full(), at which point the Objecter should let write operations pass through. Perhaps we're not setting the flag early enough in startup and this was during a replay or something?
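
For illustration, here is a minimal sketch (under assumed names, not the actual Ceph source) of how the honor_osdmap_full setting is meant to gate write pausing; the Objecter struct, its fields, and should_pause_write() are stand-ins, while unset_honor_osdmap_full() is the real call mentioned above:

#include <cstdio>

// Toy stand-in for the client-side Objecter, just to show the intended gating.
struct Objecter {
  bool honor_osdmap_full = true;   // default: writes respect the FULL state
  bool osdmap_full = false;        // current map reports FULL / pool quota full

  void unset_honor_osdmap_full() { honor_osdmap_full = false; }

  // A write should only be paused when the map says FULL *and* this client
  // has asked to honor that flag.  The MDS clears the flag so its
  // stray-purge deletes keep flowing and can free up space.
  bool should_pause_write() const { return osdmap_full && honor_osdmap_full; }
};

int main() {
  Objecter mds_objecter;
  mds_objecter.osdmap_full = true;
  std::printf("pause before unset: %d\n", mds_objecter.should_pause_write());  // 1
  mds_objecter.unset_honor_osdmap_full();
  std::printf("pause after unset:  %d\n", mds_objecter.should_pause_write());  // 0
  return 0;
}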

Actions #4

Updated by John Spray over 8 years ago

Greg's theory sounds plausible; will take a look.

Actions #5

Updated by John Spray over 8 years ago

  • Status changed from New to In Progress

Looks like this commit might have broken it:

commit dbcf2e40d3e8a92f280879c74b7ca954a902b2d1
Author: xinxin shu <xinxin.shu@intel.com>
Date:   Thu Jun 18 09:44:47 2015 +0800

    Fixes : #12018

    resend writes after pool loses full flag

    Signed-off-by: xinxin shu <xinxin.shu@intel.com>

as it introduced an unguarded read of the FULL flag that didn't respect the honor_osdmap_full setting.
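
For comparison, a hypothetical before/after sketch (again not the actual Ceph code; MapState and the stay_paused_* helpers are invented for illustration) of the kind of unguarded check described above:

#include <cstdio>

struct MapState {                  // stand-in for the relevant OSDMap state
  bool cluster_full = true;        // map-level FULL flag set
  bool pool_quota_full = true;     // per-pool quota reached
};

// Buggy shape: the pause/resend decision reads the FULL state directly,
// so even a client that called unset_honor_osdmap_full() stays paused.
bool stay_paused_buggy(const MapState& m) {
  return m.cluster_full || m.pool_quota_full;
}

// Guarded shape: a client that opted out (e.g. the MDS) is never held back.
bool stay_paused_guarded(const MapState& m, bool honor_osdmap_full) {
  if (!honor_osdmap_full)
    return false;
  return m.cluster_full || m.pool_quota_full;
}

int main() {
  MapState m;
  bool mds_honors_full = false;    // the MDS cleared honor_osdmap_full
  std::printf("buggy:   paused=%d\n", stay_paused_buggy(m));                    // 1: purge ops stuck
  std::printf("guarded: paused=%d\n", stay_paused_guarded(m, mds_honors_full)); // 0: purge ops proceed
  return 0;
}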

Actions #6

Updated by John Spray over 8 years ago

  • Status changed from In Progress to Fix Under Review
Actions #7

Updated by Greg Farnum over 8 years ago

  • Status changed from Fix Under Review to Resolved