Bug #57147


qa: test_full_fsync (tasks.cephfs.test_full.TestClusterFull) failure

Added by Kotresh Hiremath Ravishankar over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Teuthology run: https://pulpito.ceph.com/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968267

The MDS didn't become healthy and the test timed out.

2022-08-11T23:44:41.536 INFO:tasks.cephfs_test_runner:======================================================================
2022-08-11T23:44:41.536 INFO:tasks.cephfs_test_runner:ERROR: test_full_fsync (tasks.cephfs.test_full.TestClusterFull)
2022-08-11T23:44:41.536 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/test_full.py", line 395, in setUp
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:    super(TestClusterFull, self).setUp()
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/test_full.py", line 32, in setUp
2022-08-11T23:44:41.538 INFO:tasks.cephfs_test_runner:    CephFSTestCase.setUp(self)
2022-08-11T23:44:41.538 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/cephfs_test_case.py", line 169, in setUp
2022-08-11T23:44:41.538 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/filesystem.py", line 1108, in wait_for_daemons
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out waiting for MDS daemons to become healthy
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:
2022-08-11T23:44:41.540 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
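For context, the failure comes from a plain poll-with-timeout loop in the test harness. Below is a minimal Python sketch of that pattern; the real logic lives in qa/tasks/cephfs/filesystem.py (Filesystem.wait_for_daemons), and the daemons_healthy() predicate and timeout value here are illustrative, not the actual implementation.

import time

def wait_for_daemons(daemons_healthy, timeout=300, interval=5):
    """Poll until the health predicate passes, else raise the error seen above."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if daemons_healthy():
            return
        time.sleep(interval)
    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")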

I think the OSD backing the MDS crashed, causing the MDS to get stuck in the up:creating state.

 ceph version 16.2.10-668-geb4319a2 (eb4319a2b19ca3fba01742173e97dd5b50b2f291) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f19659eeb20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556d542ce711]
 5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x198) [0x556d5477b758]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x556d5477d8f8]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x556d545ad242]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x556d545509fe]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x556d543d7b39]
 10: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556d54637328]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x556d543f51b8]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x556d54a74a64]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556d54a77944]
 14: /lib64/libpthread.so.0(+0x814a) [0x7f19659e414a]
 15: clone()

The crash log can be found on teuthology at `/a/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968267/remote/smithi163/crash/posted/2022-08-11T23:56:04.325332Z_e45de76b-08f9-4145-bc3c-5dd9acb3942d`
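The abort in ReplicatedBackend::_do_push fires when a recovery push arrives on an OSD that is already past its failsafe-full threshold, which is what the crash log at the path above records. A rough, illustrative Python rendering of that guard follows; the real check is C++ in src/osd/ReplicatedBackend.cc, and the names and the 0.97 default for osd_failsafe_full_ratio here are assumptions, not the actual Ceph code.

# Illustrative only; names and threshold are placeholders, not the Ceph API.
OSD_FAILSAFE_FULL_RATIO = 0.97  # assumed default for osd_failsafe_full_ratio

def is_failsafe_full(used_bytes, total_bytes):
    """Compare physical usage against the failsafe ratio (cf. _check_full)."""
    return used_bytes >= OSD_FAILSAFE_FULL_RATIO * total_bytes

def do_push(apply_push, used_bytes, total_bytes):
    """A recovery push landing on a failsafe-full OSD cannot be applied safely,
    so the daemon aborts (ceph_abort in the real code) instead of proceeding."""
    if is_failsafe_full(used_bytes, total_bytes):
        raise RuntimeError("Out of space (failsafe) processing push request.")
    apply_push()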

Actions #1

Updated by Venky Shankar over 1 year ago

  • Project changed from CephFS to RADOS
  • Assignee set to Neha Ojha
Actions #2

Updated by Neha Ojha over 1 year ago

How reproducible is this? The following logs indicate that we ran out of space.

   -68> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 94 maybe_share_map con 0x556d6030c000 v2:172.21.15.191:6826/77065 map epoch 93 -> 94 (as per caller)
   -67> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 pg_epoch: 94 pg[11.30( v 90'4 lc 90'1 (0'0,90'4] local-lis/les=92/93 n=2 ec=88/88 lis/c=92/88 les/c/f=93/89/0 sis=92) [4,0] r=1 lpr=92 pi=[88,92)/1 luod=0'0 crt=90'4 lcod 0'0 mlcod 0'0 active m=2 mbc={}] _handle_message: 0x556d82f2edc0
   -66> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 pg_epoch: 94 pg[11.30( v 90'4 lc 90'1 (0'0,90'4] local-lis/les=92/93 n=2 ec=88/88 lis/c=92/88 les/c/f=93/89/0 sis=92) [4,0] r=1 lpr=92 pi=[88,92)/1 luod=0'0 crt=90'4 lcod 0'0 mlcod 0'0 active m=2 mbc={}] _check_full current usage is 9.22337e+10 physical 9.22337e+10
   -65> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 pg_epoch: 94 pg[11.30( v 90'4 lc 90'1 (0'0,90'4] local-lis/les=92/93 n=2 ec=88/88 lis/c=92/88 les/c/f=93/89/0 sis=92) [4,0] r=1 lpr=92 pi=[88,92)/1 luod=0'0 crt=90'4 lcod 0'0 mlcod 0'0 active m=2 mbc={}] _do_push Out of space (failsafe) processing push request.
...
    -1> 2022-08-11T23:56:04.320+0000 7f194a5ff700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10-668-geb4319a2/rpm/el8/BUILD/ceph-16.2.10-668-geb4319a2/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::_do_push(OpRequestRef)' thread 7f194a5ff700 time 2022-08-11T23:56:04.317976+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10-668-geb4319a2/rpm/el8/BUILD/ceph-16.2.10-668-geb4319a2/src/osd/ReplicatedBackend.cc: 789: ceph_abort_msg("abort() called")
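If this shows up again, a quick way to confirm the out-of-space theory is to compare per-OSD utilization against the failsafe ratio. A hypothetical helper is sketched below; `ceph osd df` and `ceph config get` are standard CLI commands, but the JSON field names should be double-checked against the cluster's actual output.

import json
import subprocess

def osd_utilization():
    """Per-OSD utilization (%) from `ceph osd df --format=json`."""
    out = subprocess.check_output(["ceph", "osd", "df", "--format=json"])
    return {n["name"]: n["utilization"] for n in json.loads(out)["nodes"]}

def failsafe_ratio(osd="osd.0"):
    """Failsafe full ratio configured for the given OSD daemon."""
    out = subprocess.check_output(
        ["ceph", "config", "get", osd, "osd_failsafe_full_ratio"])
    return float(out.decode().strip())

if __name__ == "__main__":
    ratio = failsafe_ratio()
    for name, util in sorted(osd_utilization().items()):
        tag = "over failsafe" if util / 100.0 >= ratio else "ok"
        print(f"{name}: {util:.1f}% ({tag})")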
Actions #3

Updated by Kotresh Hiremath Ravishankar over 1 year ago

  • Project changed from RADOS to CephFS
  • Assignee deleted (Neha Ojha)

Neha Ojha wrote:

How reproducible is this? The following logs indicate that we ran out of space.

We have seen this only once. The subsequent re-run of the test passed.

Actions #4

Updated by Kotresh Hiremath Ravishankar over 1 year ago

  • Project changed from CephFS to RADOS
  • Assignee set to Neha Ojha