Bug #61201
closed
qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific
Description
Description: fs/functional/{begin clusters/1a3s-mds-4c-client conf/{client mds mon osd} distro/{ubuntu_latest} mount/kclient/{mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} objectstore/bluestore-bitmap overrides/{ignorelist_health ignorelist_wrongly_marked_down no_client_pidfile} tasks/data-scan}
Sentry event: https://sentry.ceph.com/organizations/ceph/?query=1660ede559ce46d6bcc4eff3f8ec2d71
Failure Reason:
Test failure: test_rebuild_moved_file (tasks.cephfs.test_data_scan.TestDataScan)
The test failed waiting for mds to become up:standby
2023-05-16T05:53:56.526 INFO:tasks.cephfs_test_runner:======================================================================
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:ERROR: test_rebuild_moved_file (tasks.cephfs.test_data_scan.TestDataScan)
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/cephfs/test_data_scan.py", line 399, in test_rebuild_moved_file
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:    self._rebuild_metadata(MovedFile(self.fs, self.mount_a))
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/cephfs/test_data_scan.py", line 352, in _rebuild_metadata
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:    self.wait_until_equal(
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/ceph_test_case.py", line 198, in wait_until_equal
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0} seconds waiting for {1} (currently {2})".format(
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 60 seconds waiting for up:standby (currently None)
2023-05-16T05:53:56.529 INFO:tasks.cephfs_test_runner:
2023-05-16T05:53:56.529 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
The mds stack trace is as follows:
-30> 2023-05-16T05:52:41.036+0000 7f7365170700 10 log_client will send 2023-05-16T05:52:41.038437+0000 mds.c (mds.0) 1 : cluster [ERR] error reading table object 'mds0_inotable' -2 ((2) No such file or directory)
-29> 2023-05-16T05:52:41.036+0000 7f7365170700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.16:3300/0
-28> 2023-05-16T05:52:41.036+0000 7f7365170700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- log(1 entries from seq 1 at 2023-05-16T05:52:41.038437+0000) v1 -- 0x55df90ab2700 con 0x55df90b83000
-27> 2023-05-16T05:52:41.036+0000 7f736d981700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.1 v2:172.21.15.16:6816/116597 1 ==== osd_op_reply(5 mds0_openfiles.0 [omap-get-header,omap-get-vals] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 202+0+0 (crc 0 0 0) 0x55df91780fc0 con 0x55df918c5800
-26> 2023-05-16T05:52:41.036+0000 7f7365170700 5 mds.beacon.c Sending beacon down:damaged seq 2
-25> 2023-05-16T05:52:41.037+0000 7f7365170700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.16:3300/0
-24> 2023-05-16T05:52:41.037+0000 7f7365170700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mdsbeacon(10998/c down:damaged seq=2 v152) v8 -- 0x55df90b4cdc0 con 0x55df90b83000
-23> 2023-05-16T05:52:41.037+0000 7f7365170700 20 mds.beacon.c send_and_wait: awaiting 2 for up to 5s
-22> 2023-05-16T05:52:41.037+0000 7f736d981700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.1 v2:172.21.15.16:6816/116597 2 ==== osd_op_reply(6 mds_snaptable [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 157+0+0 (crc 0 0 0) 0x55df91780fc0 con 0x55df918c5800
-21> 2023-05-16T05:52:41.038+0000 7f736b17c700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== mon.0 v2:172.21.15.16:3300/0 17 ==== osd_map(168..168 src has 1..168) v4 ==== 366+0+0 (secure 0 0 0) 0x55df90ab2700 con 0x55df90b83000
-20> 2023-05-16T05:52:41.039+0000 7f736e182700 1 --2- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] >> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] conn(0x55df918c4c00 0x55df918ab200 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
-19> 2023-05-16T05:52:41.039+0000 7f736e182700 10 monclient: get_auth_request con 0x55df918c4c00 auth_method 0
-18> 2023-05-16T05:52:41.039+0000 7f736e182700 1 --2- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] >> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] conn(0x55df918c4c00 0x55df918ab200 crc :-1 s=READY pgs=66 cs=0 l=1 rev1=1 rx=0 tx=0).ready entity=osd.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
-17> 2023-05-16T05:52:41.039+0000 7f736e182700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 1 ==== osd_op_reply(4 500.00000000 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
-16> 2023-05-16T05:52:41.039+0000 7f7366172700 0 mds.0.journaler.pq(ro) error getting journal off disk
-15> 2023-05-16T05:52:41.039+0000 7f736e182700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 2 ==== osd_op_reply(3 400.00000000 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
-14> 2023-05-16T05:52:41.039+0000 7f7366172700 1 mds.0.purge_queue operator(): Purge Queue not found, assuming this is an upgrade and creating it.
-13> 2023-05-16T05:52:41.039+0000 7f7366172700 4 mds.0.purge_queue create: creating
-12> 2023-05-16T05:52:41.039+0000 7f7366172700 1 mds.0.journaler.pq(ro) set_writeable
-11> 2023-05-16T05:52:41.039+0000 7f7366172700 1 mds.0.journaler.pq(rw) created blank journal at inode 0x0x500, format=1
-10> 2023-05-16T05:52:41.039+0000 7f736496f700 1 mds.0.journalpointer Journal pointer '400.00000000' read failed: (2) No such file or directory
-9> 2023-05-16T05:52:41.039+0000 7f736496f700 4 mds.0.journalpointer Writing pointer object '400.00000000': 0x0x200:0x0x0
-8> 2023-05-16T05:52:41.039+0000 7f7366172700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] -- osd_op(unknown.0.152:7 21.15 21:aa448500:::500.00000000:head [writefull 0~90 in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force e168) v8 -- 0x55df918c5000 con 0x55df918c4c00
-7> 2023-05-16T05:52:41.039+0000 7f736496f700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] -- osd_op(unknown.0.152:8 21.f 21:f1f69726:::400.00000000:head [writefull 0~22 in=22b] snapc 0=[] ondisk+write+known_if_redirected+full_force e168) v8 -- 0x55df918c5400 con 0x55df918c4c00
-6> 2023-05-16T05:52:41.041+0000 7f736e182700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 3 ==== osd_op_reply(7 500.00000000 [writefull 0~90] v0'0 uv0 ondisk = -108 ((108) Cannot send after transport endpoint shutdown)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
-5> 2023-05-16T05:52:41.041+0000 7f7366172700 -1 mds.0.journaler.pq(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
-4> 2023-05-16T05:52:41.041+0000 7f7366172700 -1 mds.0.journaler.pq(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
-3> 2023-05-16T05:52:41.041+0000 7f736e182700 1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 4 ==== osd_op_reply(8 400.00000000 [writefull 0~22] v0'0 uv0 ondisk = -108 ((108) Cannot send after transport endpoint shutdown)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
-2> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown
-1> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.13-145-g20531b61/rpm/el8/BUILD/ceph-16.2.13-145-g20531b61/src/mds/MDLog.cc: In function 'void MDLog::_recovery_thread(MDSContext*)' thread 7f736496f700 time 2023-05-16T05:52:41.043527+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.13-145-g20531b61/rpm/el8/BUILD/ceph-16.2.13-145-g20531b61/src/mds/MDLog.cc: 979: FAILED ceph_assert(write_result >= 0)
ceph version 16.2.13-145-g20531b61 (20531b61f434caac5e29cea07411fac6911e8e91) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f73739911e2]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2793fc) [0x7f73739913fc]
 3: (MDLog::_recovery_thread(MDSContext*)+0xd72) [0x55df8ef15132]
 4: (MDLog::RecoveryThread::entry()+0x15) [0x55df8ec07845]
 5: /lib64/libpthread.so.0(+0x814a) [0x7f737297514a]
 6: clone()
Updated by Kotresh Hiremath Ravishankar 12 months ago
PRs in the test batch:
- pacific: qa: mirror tests should cleanup fs during unwind (Pull Request #50765 by batrick)
- pacific: qa: data-scan/journal-tool do not output debugging in upstream testing (Pull Request #50773 by batrick)
- pacific: mgr/nfs: disallow non-existent paths when creating export (Pull Request #50809 by dparmar18)
- pacific: mds: record and dump last tid for trimming completed requests (or flushes) (Pull Request #50811 by vshankar)
- pacific: mds: fix stray evaluation using scrub and introduce new option (Pull Request #50814 by dparmar18)
- pacific: mds: wait for unlink operation to finish (Pull Request #50986 by lxbsz)
- pacific: client: clear the suid/sgid in fallocate path (Pull Request #50988 by lxbsz)
- pacific: qa: enable kclient test for newop test
Updated by Jos Collin 10 months ago
-10> 2023-05-16T05:52:41.039+0000 7f736496f700 1 mds.0.journalpointer Journal pointer '400.00000000' read failed: (2) No such file or directory
-2> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown
From the teuthology.log, mds.c had been blocklisted because the user ran `fs fail`. But in `MDLog::_recovery_thread`, even though the blocklist was already in effect, `jp.load()` returned a -CEPHFS_ENOENT instead of a -CEPHFS_EBLOCKLISTED, which tripped the assert. We need to figure out why `jp.load()` returned -CEPHFS_ENOENT rather than -CEPHFS_EBLOCKLISTED; there are no hints in the osd logs, so we need to check the osd code to see what happened.
There is also no need for a ceph_assert here; it could be changed to a graceful shutdown, which would fix the issue.
Updated by Venky Shankar 10 months ago
- Category set to Administration/Usability
- Target version set to v19.0.0
- Backport set to reef,quincy,pacific
- Component(FS) MDS added
Updated by Jos Collin 10 months ago
- Status changed from In Progress to Fix Under Review
Updated by Venky Shankar 9 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 9 months ago
- Copied to Backport #62054: reef: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
Updated by Backport Bot 9 months ago
- Copied to Backport #62055: pacific: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
Updated by Backport Bot 9 months ago
- Copied to Backport #62056: quincy: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
Updated by Yuri Weinstein 9 months ago
Updated by Yuri Weinstein 8 months ago
Updated by Jos Collin 8 months ago
- Status changed from Pending Backport to Resolved