Bug #39634

qa: test_full_same_file timeout

Added by Patrick Donnelly almost 5 years ago. Updated over 2 years ago.

Status:
Fix Under Review
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
qa
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2019-05-08T05:26:30.146 INFO:tasks.cephfs_test_runner:======================================================================
2019-05-08T05:26:30.146 INFO:tasks.cephfs_test_runner:ERROR: test_full_same_file (tasks.cephfs.test_full.TestClusterFull)
2019-05-08T05:26:30.146 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2019-05-08T05:26:30.146 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2019-05-08T05:26:30.147 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20190501.220632/qa/tasks/cephfs/test_full.py", line 205, in test_full_same_file
2019-05-08T05:26:30.147 INFO:tasks.cephfs_test_runner:    self._test_full(False)
2019-05-08T05:26:30.147 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20190501.220632/qa/tasks/cephfs/test_full.py", line 187, in _test_full
2019-05-08T05:26:30.147 INFO:tasks.cephfs_test_runner:    timeout=osd_mon_report_interval * 5)
2019-05-08T05:26:30.147 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-pdonnell-testing-20190501.220632/qa/tasks/ceph_test_case.py", line 163, in wait_until_true
2019-05-08T05:26:30.148 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out after {0}s".format(elapsed))
2019-05-08T05:26:30.148 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out after 25s

/ceph/teuthology-archive/pdonnell-2019-05-08_04:00:30-fs-wip-pdonnell-testing-20190501.220632-distro-basic-smithi/3939426/teuthology.log
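
The 25 s comes from the timeout=osd_mon_report_interval * 5 argument passed to the polling helper in qa/tasks/ceph_test_case.py. As a minimal sketch of that kind of wait loop (in Python; illustrative only, the real wait_until_true may differ in its details):

import time

def wait_until_true(condition, timeout, period=5):
    # Poll `condition` until it returns True or `timeout` seconds elapse.
    # Sketch of the kind of helper the traceback points at, not the exact qa code.
    elapsed = 0
    while True:
        if condition():
            return
        if elapsed >= timeout:
            raise RuntimeError("Timed out after {0}s".format(elapsed))
        time.sleep(period)
        elapsed += period

In other words, whatever condition _test_full was polling never became true within the 25 seconds it was willing to wait.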

History

#1 Updated by Patrick Donnelly about 4 years ago

  • Target version deleted (v15.0.0)

#2 Updated by Kotresh Hiremath Ravishankar over 2 years ago

Seen in the pacific run below:

http://pulpito.front.sepia.ceph.com/yuriw-2021-11-08_20:21:11-fs-wip-yuri5-testing-2021-11-08-1003-pacific-distro-basic-smithi/6492110/

Log: /ceph/teuthology-archive/yuriw-2021-11-08_20:21:11-fs-wip-yuri5-testing-2021-11-08-1003-pacific-distro-basic-smithi/6492110/teuthology.log

#3 Updated by Xiubo Li over 2 years ago

  • Assignee set to Xiubo Li

Will work on it.

#4 Updated by Xiubo Li over 2 years ago

The test_full test case was deleting "large_file_b" and "large_file_a". From /ceph/teuthology-archive/yuriw-2021-11-08_20:21:11-fs-wip-yuri5-testing-2021-11-08-1003-pacific-distro-basic-smithi/6492110/teuthology.log:

2021-11-09T15:08:25.412 DEBUG:teuthology.orchestra.run.smithi053:> (cd /home/ubuntu/cephtest/mnt.0 && exec rm -f large_file_b)
2021-11-09T15:08:25.485 INFO:teuthology.orchestra.run:Running command with timeout 900
2021-11-09T15:08:25.486 DEBUG:teuthology.orchestra.run.smithi053:> (cd /home/ubuntu/cephtest/mnt.0 && exec rm -f large_file_a)

In /ceph/teuthology-archive/yuriw-2021-11-08_20:21:11-fs-wip-yuri5-testing-2021-11-08-1003-pacific-distro-basic-smithi/6492110/remote/smithi060/log/ceph-osd.4.log.gz:

2021-11-09T15:08:20.659+0000 7f0e1144e700 20 osd.4 110 check_full_status cur ratio 1, physical ratio 1, new state full
2021-11-09T15:08:20.659+0000 7f0e1144e700 10 osd.4 110 check_full_status none -> full
...
2021-11-09T15:08:26.563+0000 7f0e1144e700 20 osd.4 112 check_full_status cur ratio 1, physical ratio 1, new state full
...
2021-11-09T15:08:27.659+0000 7f0e1144e700 20 osd.4 113 check_full_status cur ratio 1, physical ratio 1, new state full
...
2021-11-09T15:08:27.763+0000 7f0e13452700 20 osd.4 pg_epoch: 114 pg[14.24( v 114'9 (0'0,114'9] local-lis/les=113/114 n=0 ec=107/107 lis/c=113/107 les/c/f=114/108/110 sis=113) [7,4] r=1 lpr=113 pi=[107,113)/1 luod=0'0 lua=114'5 crt=114'9 lcod 114'7 mlcod 109'2 active mbc={}] rollforward: entry=114'9 (109'4) delete   14:276f434d:::10000000001.0000003d:head by mds.0.43:200 2021-11-09T15:08:27.255282+0000 0 ObjectCleanRegions clean_offsets: [4194304~18446744073705357311], clean_omap: 0, new_object: 0
...
2021-11-09T15:08:33.563+0000 7f0e1144e700 20 osd.4 116 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
2021-11-09T15:08:33.563+0000 7f0e1144e700 10 osd.4 116 check_full_status full -> failsafe
...
2021-11-09T15:20:43.168+0000 7f0e1144e700 20 osd.4 127 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe

While the objects for those two files were being deleted, osd.4 went into the failsafe state and stayed there until test_full timed out.
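
For scale, a quick check of the two timestamps above (plain Python, values taken from the ceph-osd.4.log excerpts):

from datetime import datetime

# osd.4 entered failsafe at 15:08:33 and was still in failsafe at 15:20:43,
# more than 12 minutes later, while the test only waits 25 seconds.
fmt = "%Y-%m-%dT%H:%M:%S"
entered_failsafe = datetime.strptime("2021-11-09T15:08:33", fmt)
still_failsafe = datetime.strptime("2021-11-09T15:20:43", fmt)
print((still_failsafe - entered_failsafe).total_seconds())  # 730.0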

This looks like an OSD-side bug.

#5 Updated by Xiubo Li over 2 years ago

Checked all the other OSDs: none of them reached the "mon osd full ratio: 0.7"; only osd.4 did. That means the PGs are still not evenly distributed among the OSDs for the fs pools.
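
One way to confirm the skew (a sketch, assuming the usual JSON layout of the ceph osd df output with "nodes", "name", "pgs" and "utilization" fields) is to dump per-OSD PG counts and utilization side by side:

import json
import subprocess

# Compare per-OSD PG counts and utilization to spot the imbalance described above.
out = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
for node in json.loads(out)["nodes"]:
    print("{name}: {pgs} PGs, {utilization:.1f}% used".format(**node))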

#6 Updated by Xiubo Li over 2 years ago

In /ceph/teuthology-archive/yuriw-2021-11-08_20:21:11-fs-wip-yuri5-testing-2021-11-08-1003-pacific-distro-basic-smithi/6492110/remote/smithi060/log/ceph-osd.4.log.gz:

2021-11-09T15:08:27.659+0000 7f0e1144e700  5 osd.4 113 heartbeat osd_stat(store_statfs(0x0/0x0/0xbebc200, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0), peers [0,1,2,3,5,6,7] op hist [])
2021-11-09T15:08:27.659+0000 7f0e1144e700 20 osd.4 113 check_full_status cur ratio 1, physical ratio 1, new state full

...

2021-11-09T15:08:33.563+0000 7f0e1144e700  5 osd.4 116 heartbeat osd_stat(store_statfs(0xc215534/0x0/0xbebc200, data 0x0/0x0, compress 0x0/0x0/0x0, omap 0x0, meta 0x0), peers [0,1,2,3,5,6,7] op hist [])
2021-11-09T15:08:33.563+0000 7f0e1144e700 20 osd.4 116 check_full_status cur ratio 9.22337e+10, physical ratio 9.22337e+10, new state failsafe
2021-11-09T15:08:33.563+0000 7f0e1144e700 10 osd.4 116 check_full_status full -> failsafe

Something is wrong in the statfs calculation: "0xc215534/0x0/0xbebc200" is "available/internally_reserved/total", so the reported available disk space is larger than the total.
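
The bogus ratio is consistent with used = total - available wrapping around in unsigned 64-bit arithmetic once available exceeds total. A quick check (plain Python, not OSD code) reproduces the exact figure from the log:

# Values from the heartbeat line above: store_statfs(0xc215534/0x0/0xbebc200)
total = 0x0bebc200        # 200,000,000 bytes
available = 0x0c215534    # 203,511,092 bytes, i.e. larger than total

used = (total - available) % 2**64   # unsigned 64-bit wrap-around, ~1.84e19
print(used)                          # 18446744073706040524
print(used / total)                  # 92233720368.5..., the 9.22337e+10 in the log

The earlier "cur ratio 1" entries match the same formula with available = 0x0.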

#7 Updated by Xiubo Li over 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 44128
