Bug #57147
qa: test_full_fsync (tasks.cephfs.test_full.TestClusterFull) failure
Description
Teuthology run: https://pulpito.ceph.com/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968267
The MDS didn't become healthy and the test timed out.
2022-08-11T23:44:41.536 INFO:tasks.cephfs_test_runner:======================================================================
2022-08-11T23:44:41.536 INFO:tasks.cephfs_test_runner:ERROR: test_full_fsync (tasks.cephfs.test_full.TestClusterFull)
2022-08-11T23:44:41.536 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/test_full.py", line 395, in setUp
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:    super(TestClusterFull, self).setUp()
2022-08-11T23:44:41.537 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/test_full.py", line 32, in setUp
2022-08-11T23:44:41.538 INFO:tasks.cephfs_test_runner:    CephFSTestCase.setUp(self)
2022-08-11T23:44:41.538 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/cephfs_test_case.py", line 169, in setUp
2022-08-11T23:44:41.538 INFO:tasks.cephfs_test_runner:    self.fs.wait_for_daemons()
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_eb4319a2b19ca3fba01742173e97dd5b50b2f291/qa/tasks/cephfs/filesystem.py", line 1108, in wait_for_daemons
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:RuntimeError: Timed out waiting for MDS daemons to become healthy
2022-08-11T23:44:41.539 INFO:tasks.cephfs_test_runner:
2022-08-11T23:44:41.540 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
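For context, the timeout comes from the health poll that CephFSTestCase.setUp() runs via self.fs.wait_for_daemons(). A minimal sketch of that style of wait loop (hypothetical helper and names, not the actual qa/tasks/cephfs/filesystem.py code) would be:

```python
import time

def wait_for_daemons_healthy(get_mds_states, timeout=300, interval=5):
    """Poll MDS daemon states until every rank is active, or give up.

    get_mds_states is assumed to return a dict like {"mds.a": "up:active", ...};
    the real teuthology helper inspects the mdsmap / `ceph fs status` output
    instead, so this is only an illustration of the polling pattern.
    """
    elapsed = 0
    while elapsed < timeout:
        states = get_mds_states()
        if states and all(s == "up:active" for s in states.values()):
            return states
        time.sleep(interval)
        elapsed += interval
    # Mirrors the error seen in the traceback above.
    raise RuntimeError("Timed out waiting for MDS daemons to become healthy")
```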
I think the OSD backing the MDS crashed, causing the MDS to get stuck in the up:creating state.
ceph version 16.2.10-668-geb4319a2 (eb4319a2b19ca3fba01742173e97dd5b50b2f291) pacific (stable)
 1: /lib64/libpthread.so.0(+0x12b20) [0x7f19659eeb20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x556d542ce711]
 5: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x198) [0x556d5477b758]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2a8) [0x556d5477d8f8]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x556d545ad242]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x556d545509fe]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x556d543d7b39]
 10: (ceph::osd::scheduler::PGRecoveryMsg::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x556d54637328]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x556d543f51b8]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x556d54a74a64]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x556d54a77944]
 14: /lib64/libpthread.so.0(+0x814a) [0x7f19659e414a]
 15: clone()
The crash log can be found on teuthology at `/a/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968267/remote/smithi163/crash/posted/2022-08-11T23:56:04.325332Z_e45de76b-08f9-4145-bc3c-5dd9acb3942d`
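For anyone triaging similar failures, a posted crash directory like the one above should contain a JSON `meta` file written by ceph-crash. A small illustrative helper to summarize it (the field names such as entity_name, ceph_version and backtrace are assumptions based on typical ceph-crash output and may differ between releases) could look like:

```python
import json
import os

def summarize_crash(crash_dir):
    """Print the key fields from a posted ceph-crash directory.

    Assumes the directory contains a JSON `meta` file; keys are the ones
    usually emitted by ceph-crash, but this is a sketch, not a supported API.
    """
    with open(os.path.join(crash_dir, "meta")) as f:
        meta = json.load(f)
    print("entity:   ", meta.get("entity_name"))
    print("version:  ", meta.get("ceph_version"))
    print("timestamp:", meta.get("timestamp"))
    for frame in meta.get("backtrace", []):
        print("   ", frame)

# e.g. summarize_crash("/path/to/crash/posted/2022-08-11T23:56:04.325332Z_e45de76b-...")
```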
History
#1 Updated by Venky Shankar about 1 year ago
- Project changed from CephFS to RADOS
- Assignee set to Neha Ojha
#2 Updated by Neha Ojha about 1 year ago
How reproducible is this? The following logs indicate that we ran out of space.
-68> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 94 maybe_share_map con 0x556d6030c000 v2:172.21.15.191:6826/77065 map epoch 93 -> 94 (as per caller)
-67> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 pg_epoch: 94 pg[11.30( v 90'4 lc 90'1 (0'0,90'4] local-lis/les=92/93 n=2 ec=88/88 lis/c=92/88 les/c/f=93/89/0 sis=92) [4,0] r=1 lpr=92 pi=[88,92)/1 luod=0'0 crt=90'4 lcod 0'0 mlcod 0'0 active m=2 mbc={}] _handle_message: 0x556d82f2edc0
-66> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 pg_epoch: 94 pg[11.30( v 90'4 lc 90'1 (0'0,90'4] local-lis/les=92/93 n=2 ec=88/88 lis/c=92/88 les/c/f=93/89/0 sis=92) [4,0] r=1 lpr=92 pi=[88,92)/1 luod=0'0 crt=90'4 lcod 0'0 mlcod 0'0 active m=2 mbc={}] _check_full current usage is 9.22337e+10 physical 9.22337e+10
-65> 2022-08-11T23:56:04.316+0000 7f194a5ff700 10 osd.0 pg_epoch: 94 pg[11.30( v 90'4 lc 90'1 (0'0,90'4] local-lis/les=92/93 n=2 ec=88/88 lis/c=92/88 les/c/f=93/89/0 sis=92) [4,0] r=1 lpr=92 pi=[88,92)/1 luod=0'0 crt=90'4 lcod 0'0 mlcod 0'0 active m=2 mbc={}] _do_push Out of space (failsafe) processing push request.
...
-1> 2022-08-11T23:56:04.320+0000 7f194a5ff700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10-668-geb4319a2/rpm/el8/BUILD/ceph-16.2.10-668-geb4319a2/src/osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::_do_push(OpRequestRef)' thread 7f194a5ff700 time 2022-08-11T23:56:04.317976+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.10-668-geb4319a2/rpm/el8/BUILD/ceph-16.2.10-668-geb4319a2/src/osd/ReplicatedBackend.cc: 789: ceph_abort_msg("abort() called")
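The `_do_push Out of space (failsafe)` message corresponds to the OSD's failsafe-full guard tripping while processing a recovery push; the real check is C++ code in ReplicatedBackend/PrimaryLogPG and ends in ceph_abort_msg(). A rough Python illustration of the arithmetic (the 0.97 ratio mirrors the default osd_failsafe_full_ratio; everything else here is an assumption for illustration) is:

```python
# Illustrative only: the actual check lives in the OSD's C++ code and aborts
# the daemon rather than returning a boolean.
OSD_FAILSAFE_FULL_RATIO = 0.97  # assumed default for osd_failsafe_full_ratio

def failsafe_full_tripped(used_bytes, total_bytes, ratio=OSD_FAILSAFE_FULL_RATIO):
    """Return True if usage is past the failsafe threshold.

    In the log above the OSD reports ~9.22e10 bytes used against a ~9.22e10
    byte device, i.e. effectively 100% full, so the failsafe trips while a
    recovery push is being processed.
    """
    return total_bytes > 0 and (used_bytes / total_bytes) >= ratio

if __name__ == "__main__":
    print(failsafe_full_tripped(9.22337e10, 9.22337e10))  # True: failsafe tripped
```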
#3 Updated by Kotresh Hiremath Ravishankar about 1 year ago
- Project changed from RADOS to CephFS
- Assignee deleted (Neha Ojha)
Neha Ojha wrote:
How reproducible is this? Following logs indicate that we ran out of space.
We have seen this only once. The subsequent re-run of the test passed.
#4 Updated by Kotresh Hiremath Ravishankar about 1 year ago
- Project changed from CephFS to RADOS
- Assignee set to Neha Ojha