Bug #61201 (closed)

qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific

Added by Kotresh Hiremath Ravishankar 12 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Administration/Usability
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: fs/functional/{begin clusters/1a3s-mds-4c-client conf/{client mds mon osd} distro/{ubuntu_latest} mount/kclient/{mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} objectstore/bluestore-bitmap overrides/{ignorelist_health ignorelist_wrongly_marked_down no_client_pidfile} tasks/data-scan}

Log: http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-15_21:56:33-fs-wip-yuri2-testing-2023-05-15-0810-pacific_2-distro-default-smithi/7274986/teuthology.log

Job: https://pulpito.ceph.com/yuriw-2023-05-15_21:56:33-fs-wip-yuri2-testing-2023-05-15-0810-pacific_2-distro-default-smithi/7274986/

Sentry event: https://sentry.ceph.com/organizations/ceph/?query=1660ede559ce46d6bcc4eff3f8ec2d71

Failure Reason:
Test failure: test_rebuild_moved_file (tasks.cephfs.test_data_scan.TestDataScan)

The test failed waiting for mds to become up:standby

2023-05-16T05:53:56.526 INFO:tasks.cephfs_test_runner:======================================================================
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:ERROR: test_rebuild_moved_file (tasks.cephfs.test_data_scan.TestDataScan)
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/cephfs/test_data_scan.py", line 399, in test_rebuild_moved_file
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:    self._rebuild_metadata(MovedFile(self.fs, self.mount_a))
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/cephfs/test_data_scan.py", line 352, in _rebuild_metadata
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:    self.wait_until_equal(
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/ceph_test_case.py", line 198, in wait_until_equal
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0} seconds waiting for {1} (currently {2})".format(
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 60 seconds waiting for up:standby (currently None)
2023-05-16T05:53:56.529 INFO:tasks.cephfs_test_runner:
2023-05-16T05:53:56.529 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
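
For reference, the failing helper is just a bounded poll: wait_until_equal repeatedly reads the MDS state and gives up after the timeout (60 seconds here). Below is a self-contained sketch of that behaviour; the names, types and the 1-second poll interval are illustrative, not the actual qa/tasks/ceph_test_case.py code.

// Illustrative sketch of a wait_until_equal-style helper; not the real qa code.
#include <chrono>
#include <functional>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>
#include <thread>

// Poll get_state() until it returns `expected`, or throw once `timeout`
// has elapsed (the equivalent of the TestTimeoutError in the traceback above).
void wait_until_equal(const std::function<std::optional<std::string>()>& get_state,
                      const std::string& expected,
                      std::chrono::seconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (get_state() == expected)
      return;  // e.g. the MDS finally reported up:standby
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  throw std::runtime_error("timed out waiting for " + expected);
}

int main() {
  try {
    // Toy usage: a state that never becomes up:standby, as in this failure.
    wait_until_equal([] { return std::optional<std::string>{}; },
                     "up:standby", std::chrono::seconds(3));
  } catch (const std::exception& e) {
    std::cerr << e.what() << "\n";
  }
  return 0;
}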

The MDS stack trace is as follows:

      -30> 2023-05-16T05:52:41.036+0000 7f7365170700 10 log_client  will send 2023-05-16T05:52:41.038437+0000 mds.c (mds.0) 1 : cluster [ERR] error reading table object 'mds0_inotable' -2 ((2) No such file or directory)
   -29> 2023-05-16T05:52:41.036+0000 7f7365170700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.16:3300/0
   -28> 2023-05-16T05:52:41.036+0000 7f7365170700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- log(1 entries from seq 1 at 2023-05-16T05:52:41.038437+0000) v1
 -- 0x55df90ab2700 con 0x55df90b83000
   -27> 2023-05-16T05:52:41.036+0000 7f736d981700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.1 v2:172.21.15.16:6816/116597 1 ==== osd_op_reply(5 mds0_openfiles.0 [omap-get-header,omap-get-vals] v0'0 uv0 
ondisk = -2 ((2) No such file or directory)) v8 ==== 202+0+0 (crc 0 0 0) 0x55df91780fc0 con 0x55df918c5800
   -26> 2023-05-16T05:52:41.036+0000 7f7365170700  5 mds.beacon.c Sending beacon down:damaged seq 2
   -25> 2023-05-16T05:52:41.037+0000 7f7365170700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.16:3300/0
   -24> 2023-05-16T05:52:41.037+0000 7f7365170700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mdsbeacon(10998/c down:damaged seq=2 v152) v8 -- 0x55df90b4cdc0
 con 0x55df90b83000
   -23> 2023-05-16T05:52:41.037+0000 7f7365170700 20 mds.beacon.c send_and_wait: awaiting 2 for up to 5s
   -22> 2023-05-16T05:52:41.037+0000 7f736d981700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.1 v2:172.21.15.16:6816/116597 2 ==== osd_op_reply(6 mds_snaptable [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such
 file or directory)) v8 ==== 157+0+0 (crc 0 0 0) 0x55df91780fc0 con 0x55df918c5800
   -21> 2023-05-16T05:52:41.038+0000 7f736b17c700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== mon.0 v2:172.21.15.16:3300/0 17 ==== osd_map(168..168 src has 1..168) v4 ==== 366+0+0 (secure 0 0 0) 0x55df90ab270
0 con 0x55df90b83000
   -20> 2023-05-16T05:52:41.039+0000 7f736e182700  1 --2- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] >> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] conn(0x55df918c4c00 0x55df918ab200 unknown :-1 s=BANNER
_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
   -19> 2023-05-16T05:52:41.039+0000 7f736e182700 10 monclient: get_auth_request con 0x55df918c4c00 auth_method 0
   -18> 2023-05-16T05:52:41.039+0000 7f736e182700  1 --2- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] >> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] conn(0x55df918c4c00 0x55df918ab200 crc :-1 s=READY pgs=
66 cs=0 l=1 rev1=1 rx=0 tx=0).ready entity=osd.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
   -17> 2023-05-16T05:52:41.039+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 1 ==== osd_op_reply(4 500.00000000 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such 
file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
   -16> 2023-05-16T05:52:41.039+0000 7f7366172700  0 mds.0.journaler.pq(ro) error getting journal off disk
   -15> 2023-05-16T05:52:41.039+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 2 ==== osd_op_reply(3 400.00000000 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such 
file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
   -14> 2023-05-16T05:52:41.039+0000 7f7366172700  1 mds.0.purge_queue operator(): Purge Queue not found, assuming this is an upgrade and creating it.
   -13> 2023-05-16T05:52:41.039+0000 7f7366172700  4 mds.0.purge_queue create: creating
   -12> 2023-05-16T05:52:41.039+0000 7f7366172700  1 mds.0.journaler.pq(ro) set_writeable
   -11> 2023-05-16T05:52:41.039+0000 7f7366172700  1 mds.0.journaler.pq(rw) created blank journal at inode 0x0x500, format=1
   -10> 2023-05-16T05:52:41.039+0000 7f736496f700  1 mds.0.journalpointer Journal pointer '400.00000000' read failed: (2) No such file or directory
    -9> 2023-05-16T05:52:41.039+0000 7f736496f700  4 mds.0.journalpointer Writing pointer object '400.00000000': 0x0x200:0x0x0
    -8> 2023-05-16T05:52:41.039+0000 7f7366172700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] -- osd_op(unknown.0.152:7 21.15 21:aa448500:::500.000000
00:head [writefull 0~90 in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force e168) v8 -- 0x55df918c5000 con 0x55df918c4c00
    -7> 2023-05-16T05:52:41.039+0000 7f736496f700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] -- osd_op(unknown.0.152:8 21.f 21:f1f69726:::400.0000000
0:head [writefull 0~22 in=22b] snapc 0=[] ondisk+write+known_if_redirected+full_force e168) v8 -- 0x55df918c5400 con 0x55df918c4c00
    -6> 2023-05-16T05:52:41.041+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 3 ==== osd_op_reply(7 500.00000000 [writefull 0~90] v0'0 uv0 ondisk = -108 ((108
) Cannot send after transport endpoint shutdown)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
    -5> 2023-05-16T05:52:41.041+0000 7f7366172700 -1 mds.0.journaler.pq(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
    -4> 2023-05-16T05:52:41.041+0000 7f7366172700 -1 mds.0.journaler.pq(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
    -3> 2023-05-16T05:52:41.041+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 4 ==== osd_op_reply(8 400.00000000 [writefull 0~22] v0'0 uv0 ondisk = -108 ((108
) Cannot send after transport endpoint shutdown)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
    -2> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown
    -1> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.13-145-g20531b61/rpm/el8/
BUILD/ceph-16.2.13-145-g20531b61/src/mds/MDLog.cc: In function 'void MDLog::_recovery_thread(MDSContext*)' thread 7f736496f700 time 2023-05-16T05:52:41.043527+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.13-145-g20531b61/rpm/el8/BUILD/ceph-16.2.13-145-g20531b61/src/mds/MDLog.cc: 979: FAILED ceph_assert(write_result >= 0)

 ceph version 16.2.13-145-g20531b61 (20531b61f434caac5e29cea07411fac6911e8e91) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f73739911e2]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2793fc) [0x7f73739913fc]
 3: (MDLog::_recovery_thread(MDSContext*)+0xd72) [0x55df8ef15132]
 4: (MDLog::RecoveryThread::entry()+0x15) [0x55df8ec07845]
 5: /lib64/libpthread.so.0(+0x814a) [0x7f737297514a]
 6: clone()

Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #62054: reef: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific (Resolved, Jos Collin)
Copied to CephFS - Backport #62055: pacific: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific (Resolved, Jos Collin)
Copied to CephFS - Backport #62056: quincy: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific (Resolved, Jos Collin)
#1

Updated by Kotresh Hiremath Ravishankar 12 months ago

PRs in the test batch:

  1. pacific: qa: mirror tests should cleanup fs during unwind (ceph/ceph PR #50765, batrick)
  2. pacific: qa: data-scan/journal-tool do not output debugging in upstream testing (ceph/ceph PR #50773, batrick)
  3. pacific: mgr/nfs: disallow non-existent paths when creating export (ceph/ceph PR #50809, dparmar18)
  4. pacific: mds: record and dump last tid for trimming completed requests (or flushes) (ceph/ceph PR #50811, vshankar)
  5. pacific: mds: fix stray evaluation using scrub and introduce new option (ceph/ceph PR #50814, dparmar18)
  6. pacific: mds: wait for unlink operation to finish (ceph/ceph PR #50986, lxbsz)
  7. pacific: client: clear the suid/sgid in fallocate path (ceph/ceph PR #50988, lxbsz)
  8. pacific: qa: enable kclient test for newop test
#2

Updated by Milind Changire 11 months ago

  • Assignee set to Jos Collin
#3

Updated by Jos Collin 11 months ago

  • Status changed from New to In Progress
#4

Updated by Jos Collin 10 months ago

-10> 2023-05-16T05:52:41.039+0000 7f736496f700  1 mds.0.journalpointer Journal pointer '400.00000000' read failed: (2) No such file or directory
-2> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown

http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-15_21:56:33-fs-wip-yuri2-testing-2023-05-15-0810-pacific_2-distro-default-smithi/7274986/teuthology.log

From the teuthology.log above, mds.c had been blocklisted because the user ran `fs fail`. However, in `MDLog::_recovery_thread`, even though the blocklist was already in place, `jp.load()` returned -CEPHFS_ENOENT instead of -CEPHFS_EBLOCKLISTED, which caused the assert. We need to figure out why `jp.load()` returned -CEPHFS_ENOENT rather than -CEPHFS_EBLOCKLISTED; there are no hints in the OSD logs, so we need to check the OSD code to see what happened.

There is also no need for a ceph_assert here. It could be changed to something like a graceful shutdown, which would fix the issue.
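
A minimal sketch of that idea follows. It is illustrative only, with stand-in names and error codes, not the actual src/mds/MDLog.cc code (the actual change, tracked under pull request ID 52173 below, may look different): on a failed journal-pointer write the recovery thread would inspect the error and shut down cleanly when the MDS has been blocklisted, instead of hitting ceph_assert(write_result >= 0).

// Illustrative only: a standalone sketch, not the real MDS recovery code.
#include <cerrno>
#include <iostream>

// Stand-ins for Ceph's error codes; EBLOCKLISTED corresponds to
// ESHUTDOWN (108), matching the "(108) Cannot send after transport
// endpoint shutdown" replies in the log above.
constexpr int CEPHFS_ENOENT = ENOENT;
constexpr int CEPHFS_EBLOCKLISTED = 108;

// Hypothetical replacement for ceph_assert(write_result >= 0) in the
// journal-pointer recovery path: handle the error instead of aborting.
void handle_journal_pointer_write(int write_result) {
  if (write_result >= 0) {
    std::cout << "journal pointer written\n";
    return;
  }
  if (write_result == -CEPHFS_EBLOCKLISTED) {
    // The MDS was blocklisted (e.g. after `ceph fs fail`); shut down
    // gracefully rather than crashing on a failed assert.
    std::cout << "blocklisted during recovery, shutting down gracefully\n";
    return;
  }
  // Any other error is still unexpected, but reporting it (or marking the
  // rank damaged) is preferable to an abort.
  std::cerr << "journal pointer write failed: " << write_result << "\n";
}

int main() {
  handle_journal_pointer_write(-CEPHFS_EBLOCKLISTED);  // what this run hit (-108)
  handle_journal_pointer_write(-CEPHFS_ENOENT);
  return 0;
}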

#5

Updated by Jos Collin 10 months ago

  • Pull request ID set to 52173
#6

Updated by Venky Shankar 10 months ago

  • Category set to Administration/Usability
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
  • Component(FS) MDS added
#7

Updated by Jos Collin 10 months ago

  • Status changed from In Progress to Fix Under Review
#8

Updated by Venky Shankar 9 months ago

  • Status changed from Fix Under Review to Pending Backport
#9

Updated by Backport Bot 9 months ago

  • Copied to Backport #62054: reef: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
#10

Updated by Backport Bot 9 months ago

  • Copied to Backport #62055: pacific: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
#11

Updated by Backport Bot 9 months ago

  • Copied to Backport #62056: quincy: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
#12

Updated by Backport Bot 9 months ago

  • Tags set to backport_processed
#15

Updated by Jos Collin 8 months ago

  • Status changed from Pending Backport to Resolved