Bug #61201 (closed)

qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific

Added by Kotresh Hiremath Ravishankar 12 months ago. Updated 8 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Administration/Usability
Target version:
% Done: 0%
Source:
Tags: backport_processed
Backport: reef,quincy,pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Description: fs/functional/{begin clusters/1a3s-mds-4c-client conf/{client mds mon osd} distro/{ubuntu_latest} mount/kclient/{mount overrides/{distro/stock/{k-stock rhel_8} ms-die-on-skipped}} objectstore/bluestore-bitmap overrides/{ignorelist_health ignorelist_wrongly_marked_down no_client_pidfile} tasks/data-scan}

Log: http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-15_21:56:33-fs-wip-yuri2-testing-2023-05-15-0810-pacific_2-distro-default-smithi/7274986/teuthology.log

Job: https://pulpito.ceph.com/yuriw-2023-05-15_21:56:33-fs-wip-yuri2-testing-2023-05-15-0810-pacific_2-distro-default-smithi/7274986/

Sentry event: https://sentry.ceph.com/organizations/ceph/?query=1660ede559ce46d6bcc4eff3f8ec2d71

Failure Reason:
Test failure: test_rebuild_moved_file (tasks.cephfs.test_data_scan.TestDataScan)

The test failed waiting for mds to become up:standby

2023-05-16T05:53:56.526 INFO:tasks.cephfs_test_runner:======================================================================
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:ERROR: test_rebuild_moved_file (tasks.cephfs.test_data_scan.TestDataScan)
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/cephfs/test_data_scan.py", line 399, in test_rebuild_moved_file
2023-05-16T05:53:56.527 INFO:tasks.cephfs_test_runner:    self._rebuild_metadata(MovedFile(self.fs, self.mount_a))
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/cephfs/test_data_scan.py", line 352, in _rebuild_metadata
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:    self.wait_until_equal(
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_20531b61f434caac5e29cea07411fac6911e8e91/qa/tasks/ceph_test_case.py", line 198, in wait_until_equal
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:    raise TestTimeoutError("Timed out after {0} seconds waiting for {1} (currently {2})".format(
2023-05-16T05:53:56.528 INFO:tasks.cephfs_test_runner:tasks.ceph_test_case.TestTimeoutError: Timed out after 60 seconds waiting for up:standby (currently None)
2023-05-16T05:53:56.529 INFO:tasks.cephfs_test_runner:
2023-05-16T05:53:56.529 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
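
For reference, the failing helper is just a bounded poll: wait_until_equal repeatedly reads the MDS state and gives up after the timeout (60 seconds here). Below is a self-contained sketch of that behaviour; the names, types and the 1-second poll interval are illustrative, not the actual qa/tasks/ceph_test_case.py code.

// Illustrative sketch of a wait_until_equal-style helper; not the real qa code.
#include <chrono>
#include <functional>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>
#include <thread>

// Poll get_state() until it returns `expected`, or throw once `timeout`
// has elapsed (the equivalent of the TestTimeoutError in the traceback above).
void wait_until_equal(const std::function<std::optional<std::string>()>& get_state,
                      const std::string& expected,
                      std::chrono::seconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    if (get_state() == expected)
      return;  // e.g. the MDS finally reported up:standby
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  throw std::runtime_error("timed out waiting for " + expected);
}

int main() {
  try {
    // Toy usage: a state that never becomes up:standby, as in this failure.
    wait_until_equal([] { return std::optional<std::string>{}; },
                     "up:standby", std::chrono::seconds(3));
  } catch (const std::exception& e) {
    std::cerr << e.what() << "\n";
  }
  return 0;
}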

The MDS stack trace is as follows:

      -30> 2023-05-16T05:52:41.036+0000 7f7365170700 10 log_client  will send 2023-05-16T05:52:41.038437+0000 mds.c (mds.0) 1 : cluster [ERR] error reading table object 'mds0_inotable' -2 ((2) No such file or directory)
   -29> 2023-05-16T05:52:41.036+0000 7f7365170700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.16:3300/0
   -28> 2023-05-16T05:52:41.036+0000 7f7365170700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- log(1 entries from seq 1 at 2023-05-16T05:52:41.038437+0000) v1
 -- 0x55df90ab2700 con 0x55df90b83000
   -27> 2023-05-16T05:52:41.036+0000 7f736d981700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.1 v2:172.21.15.16:6816/116597 1 ==== osd_op_reply(5 mds0_openfiles.0 [omap-get-header,omap-get-vals] v0'0 uv0 
ondisk = -2 ((2) No such file or directory)) v8 ==== 202+0+0 (crc 0 0 0) 0x55df91780fc0 con 0x55df918c5800
   -26> 2023-05-16T05:52:41.036+0000 7f7365170700  5 mds.beacon.c Sending beacon down:damaged seq 2
   -25> 2023-05-16T05:52:41.037+0000 7f7365170700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.16:3300/0
   -24> 2023-05-16T05:52:41.037+0000 7f7365170700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:3300/0,v1:172.21.15.16:6789/0] -- mdsbeacon(10998/c down:damaged seq=2 v152) v8 -- 0x55df90b4cdc0
 con 0x55df90b83000
   -23> 2023-05-16T05:52:41.037+0000 7f7365170700 20 mds.beacon.c send_and_wait: awaiting 2 for up to 5s
   -22> 2023-05-16T05:52:41.037+0000 7f736d981700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.1 v2:172.21.15.16:6816/116597 2 ==== osd_op_reply(6 mds_snaptable [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such
 file or directory)) v8 ==== 157+0+0 (crc 0 0 0) 0x55df91780fc0 con 0x55df918c5800
   -21> 2023-05-16T05:52:41.038+0000 7f736b17c700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== mon.0 v2:172.21.15.16:3300/0 17 ==== osd_map(168..168 src has 1..168) v4 ==== 366+0+0 (secure 0 0 0) 0x55df90ab270
0 con 0x55df90b83000
   -20> 2023-05-16T05:52:41.039+0000 7f736e182700  1 --2- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] >> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] conn(0x55df918c4c00 0x55df918ab200 unknown :-1 s=BANNER
_CONNECTING pgs=0 cs=0 l=1 rev1=0 rx=0 tx=0)._handle_peer_banner_payload supported=1 required=0
   -19> 2023-05-16T05:52:41.039+0000 7f736e182700 10 monclient: get_auth_request con 0x55df918c4c00 auth_method 0
   -18> 2023-05-16T05:52:41.039+0000 7f736e182700  1 --2- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] >> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] conn(0x55df918c4c00 0x55df918ab200 crc :-1 s=READY pgs=
66 cs=0 l=1 rev1=1 rx=0 tx=0).ready entity=osd.2 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
   -17> 2023-05-16T05:52:41.039+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 1 ==== osd_op_reply(4 500.00000000 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such 
file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
   -16> 2023-05-16T05:52:41.039+0000 7f7366172700  0 mds.0.journaler.pq(ro) error getting journal off disk
   -15> 2023-05-16T05:52:41.039+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 2 ==== osd_op_reply(3 400.00000000 [read 0~0] v0'0 uv0 ondisk = -2 ((2) No such 
file or directory)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
   -14> 2023-05-16T05:52:41.039+0000 7f7366172700  1 mds.0.purge_queue operator(): Purge Queue not found, assuming this is an upgrade and creating it.
   -13> 2023-05-16T05:52:41.039+0000 7f7366172700  4 mds.0.purge_queue create: creating
   -12> 2023-05-16T05:52:41.039+0000 7f7366172700  1 mds.0.journaler.pq(ro) set_writeable
   -11> 2023-05-16T05:52:41.039+0000 7f7366172700  1 mds.0.journaler.pq(rw) created blank journal at inode 0x0x500, format=1
   -10> 2023-05-16T05:52:41.039+0000 7f736496f700  1 mds.0.journalpointer Journal pointer '400.00000000' read failed: (2) No such file or directory
    -9> 2023-05-16T05:52:41.039+0000 7f736496f700  4 mds.0.journalpointer Writing pointer object '400.00000000': 0x0x200:0x0x0
    -8> 2023-05-16T05:52:41.039+0000 7f7366172700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] -- osd_op(unknown.0.152:7 21.15 21:aa448500:::500.000000
00:head [writefull 0~90 in=90b] snapc 0=[] ondisk+write+known_if_redirected+full_force e168) v8 -- 0x55df918c5000 con 0x55df918c4c00
    -7> 2023-05-16T05:52:41.039+0000 7f736496f700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] --> [v2:172.21.15.16:6802/116595,v1:172.21.15.16:6804/116595] -- osd_op(unknown.0.152:8 21.f 21:f1f69726:::400.0000000
0:head [writefull 0~22 in=22b] snapc 0=[] ondisk+write+known_if_redirected+full_force e168) v8 -- 0x55df918c5400 con 0x55df918c4c00
    -6> 2023-05-16T05:52:41.041+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 3 ==== osd_op_reply(7 500.00000000 [writefull 0~90] v0'0 uv0 ondisk = -108 ((108
) Cannot send after transport endpoint shutdown)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
    -5> 2023-05-16T05:52:41.041+0000 7f7366172700 -1 mds.0.journaler.pq(rw) _finish_write_head got (108) Cannot send after transport endpoint shutdown
    -4> 2023-05-16T05:52:41.041+0000 7f7366172700 -1 mds.0.journaler.pq(rw) handle_write_error (108) Cannot send after transport endpoint shutdown
    -3> 2023-05-16T05:52:41.041+0000 7f736e182700  1 -- [v2:172.21.15.16:6833/627211370,v1:172.21.15.16:6835/627211370] <== osd.2 v2:172.21.15.16:6802/116595 4 ==== osd_op_reply(8 400.00000000 [writefull 0~22] v0'0 uv0 ondisk = -108 ((108
) Cannot send after transport endpoint shutdown)) v8 ==== 156+0+0 (crc 0 0 0) 0x55df9188ab40 con 0x55df918c4c00
    -2> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown
    -1> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.13-145-g20531b61/rpm/el8/
BUILD/ceph-16.2.13-145-g20531b61/src/mds/MDLog.cc: In function 'void MDLog::_recovery_thread(MDSContext*)' thread 7f736496f700 time 2023-05-16T05:52:41.043527+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.13-145-g20531b61/rpm/el8/BUILD/ceph-16.2.13-145-g20531b61/src/mds/MDLog.cc: 979: FAILED ceph_assert(write_result >= 0)

 ceph version 16.2.13-145-g20531b61 (20531b61f434caac5e29cea07411fac6911e8e91) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14f) [0x7f73739911e2]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2793fc) [0x7f73739913fc]
 3: (MDLog::_recovery_thread(MDSContext*)+0xd72) [0x55df8ef15132]
 4: (MDLog::RecoveryThread::entry()+0x15) [0x55df8ec07845]
 5: /lib64/libpthread.so.0(+0x814a) [0x7f737297514a]
 6: clone()

Related issues 3 (0 open, 3 closed)

Copied to CephFS - Backport #62054: reef: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific (Resolved, Jos Collin)
Copied to CephFS - Backport #62055: pacific: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific (Resolved, Jos Collin)
Copied to CephFS - Backport #62056: quincy: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific (Resolved, Jos Collin)
#1

Updated by Kotresh Hiremath Ravishankar 12 months ago

PRs in the test batch:

  1. pacific: qa: mirror tests should cleanup fs during unwind (ceph/ceph PR #50765, batrick)
  2. pacific: qa: data-scan/journal-tool do not output debugging in upstream testing (ceph/ceph PR #50773, batrick)
  3. pacific: mgr/nfs: disallow non-existent paths when creating export (ceph/ceph PR #50809, dparmar18)
  4. pacific: mds: record and dump last tid for trimming completed requests (or flushes) (ceph/ceph PR #50811, vshankar)
  5. pacific: mds: fix stray evaluation using scrub and introduce new option (ceph/ceph PR #50814, dparmar18)
  6. pacific: mds: wait for unlink operation to finish (ceph/ceph PR #50986, lxbsz)
  7. pacific: client: clear the suid/sgid in fallocate path (ceph/ceph PR #50988, lxbsz)
  8. pacific: qa: enable kclient test for newop test
#2

Updated by Milind Changire 11 months ago

  • Assignee set to Jos Collin
#3

Updated by Jos Collin 11 months ago

  • Status changed from New to In Progress
#4

Updated by Jos Collin 10 months ago

-10> 2023-05-16T05:52:41.039+0000 7f736496f700  1 mds.0.journalpointer Journal pointer '400.00000000' read failed: (2) No such file or directory
-2> 2023-05-16T05:52:41.042+0000 7f736496f700 -1 mds.0.journalpointer Error writing pointer object '400.00000000': (108) Cannot send after transport endpoint shutdown

http://qa-proxy.ceph.com/teuthology/yuriw-2023-05-15_21:56:33-fs-wip-yuri2-testing-2023-05-15-0810-pacific_2-distro-default-smithi/7274986/teuthology.log

From the teuthology.log above, mds.c had been blocklisted because the user ran `fs fail`. However, in `MDLog::_recovery_thread`, even though the blocklist was already in place, `jp.load()` returned -CEPHFS_ENOENT instead of -CEPHFS_EBLOCKLISTED, which caused the assert. We need to figure out why `jp.load()` returned -CEPHFS_ENOENT rather than -CEPHFS_EBLOCKLISTED; there are no hints in the OSD logs, so we need to check the OSD code to see what happened.

There is also no need for a ceph_assert here. It could be changed to something like a graceful shutdown, which would fix the issue.
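
A minimal sketch of that idea follows. It is illustrative only, with stand-in names and error codes, not the actual src/mds/MDLog.cc code (the actual change, tracked under pull request ID 52173 below, may look different): on a failed journal-pointer write the recovery thread would inspect the error and shut down cleanly when the MDS has been blocklisted, instead of hitting ceph_assert(write_result >= 0).

// Illustrative only: a standalone sketch, not the real MDS recovery code.
#include <cerrno>
#include <iostream>

// Stand-ins for Ceph's error codes; EBLOCKLISTED corresponds to
// ESHUTDOWN (108), matching the "(108) Cannot send after transport
// endpoint shutdown" replies in the log above.
constexpr int CEPHFS_ENOENT = ENOENT;
constexpr int CEPHFS_EBLOCKLISTED = 108;

// Hypothetical replacement for ceph_assert(write_result >= 0) in the
// journal-pointer recovery path: handle the error instead of aborting.
void handle_journal_pointer_write(int write_result) {
  if (write_result >= 0) {
    std::cout << "journal pointer written\n";
    return;
  }
  if (write_result == -CEPHFS_EBLOCKLISTED) {
    // The MDS was blocklisted (e.g. after `ceph fs fail`); shut down
    // gracefully rather than crashing on a failed assert.
    std::cout << "blocklisted during recovery, shutting down gracefully\n";
    return;
  }
  // Any other error is still unexpected, but reporting it (or marking the
  // rank damaged) is preferable to an abort.
  std::cerr << "journal pointer write failed: " << write_result << "\n";
}

int main() {
  handle_journal_pointer_write(-CEPHFS_EBLOCKLISTED);  // what this run hit (-108)
  handle_journal_pointer_write(-CEPHFS_ENOENT);
  return 0;
}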

#5

Updated by Jos Collin 10 months ago

  • Pull request ID set to 52173
#6

Updated by Venky Shankar 10 months ago

  • Category set to Administration/Usability
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
  • Component(FS) MDS added
#7

Updated by Jos Collin 10 months ago

  • Status changed from In Progress to Fix Under Review
#8

Updated by Venky Shankar 9 months ago

  • Status changed from Fix Under Review to Pending Backport
#9

Updated by Backport Bot 9 months ago

  • Copied to Backport #62054: reef: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
#10

Updated by Backport Bot 9 months ago

  • Copied to Backport #62055: pacific: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
#11

Updated by Backport Bot 9 months ago

  • Copied to Backport #62056: quincy: qa: test_rebuild_moved_file (tasks/data-scan) fails because mds crashes in pacific added
#12

Updated by Backport Bot 9 months ago

  • Tags set to backport_processed
#15

Updated by Jos Collin 8 months ago

  • Status changed from Pending Backport to Resolved