Bug #50224

qa: test_mirroring_init_failure_with_recovery failure

Added by Patrick Donnelly over 1 year ago. Updated 4 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS): qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-04-07T04:01:07.734 INFO:tasks.cephfs_test_runner:======================================================================
2021-04-07T04:01:07.734 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2021-04-07T04:01:07.734 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2021-04-07T04:01:07.735 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-04-07T04:01:07.735 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_e4f7126b95459fb0239be660756fb943526cec3f/qa/tasks/cephfs/test_mirroring.py", line 755, in test_mirroring_init_failure_with_recovery
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:    while proceed():
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/contextutil.py", line 133, in __call__
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2021-04-07T04:01:07.737 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds

From: /ceph/teuthology-archive/pdonnell-2021-04-07_02:12:41-fs-wip-pdonnell-testing-20210406.213012-distro-basic-smithi/6025522/teuthology.log
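The failure is the qa retry helper exhausting its budget: the test loops on teuthology's `safe_while` context manager until the mirror daemon reaches the failed state. A simplified, self-contained sketch of that pattern (not the real teuthology code; details such as when the sleep happens are approximated) reproduces the error message seen above:

```python
import time


class MaxWhileTries(Exception):
    """Stands in for teuthology.exceptions.MaxWhileTries."""


class SafeWhile:
    """Simplified sketch of teuthology.contextutil.safe_while: yields a
    proceed() callable that sleeps between attempts and raises
    MaxWhileTries once `tries` attempts have been made."""

    def __init__(self, sleep=5, tries=20, action=None, _sleep_fn=time.sleep):
        self.sleep = sleep
        self.tries = tries
        self.action = action
        self.attempt = 0
        self._sleep_fn = _sleep_fn  # injectable so tests need not really sleep

    def __enter__(self):
        return self.proceed

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False

    def proceed(self):
        if self.attempt >= self.tries:
            waited = self.sleep * self.attempt
            raise MaxWhileTries(
                "'%s' reached maximum tries (%d) after waiting for %d seconds"
                % (self.action, self.tries, waited))
        if self.attempt > 0:
            self._sleep_fn(self.sleep)
        self.attempt += 1
        return True


# Usage mirroring the failing test (the condition check is hypothetical):
# with SafeWhile(sleep=5, tries=20, action='wait for failed state') as proceed:
#     while proceed():
#         if mirror_daemon_in_failed_state():
#             break
```

With sleep=5 and tries=20 the budget is 100 seconds, matching the traceback.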


Related issues

Related to CephFS - Bug #50035: cephfs-mirror: use sensible mount/shutdown timeouts Resolved
Copied to CephFS - Backport #50877: pacific: qa: test_mirroring_init_failure_with_recovery failure Resolved

History

#1 Updated by Venky Shankar over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Patrick Donnelly over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41000

#3 Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to In Progress
  • Pull request ID deleted (41000)

Patrick,

PR #41000 is not a fix for this tracker. I've started looking into this today.

PR #41000 is an additional test case fix on top of PR #40157.

#4 Updated by Venky Shankar over 1 year ago

Hit this again recently: https://pulpito.ceph.com/vshankar-2021-04-30_17:19:54-fs-wip-cephfs-mirror-incremental-sync-20210430-093812-distro-basic-gibba/6086766/

Analysis:

Before `test_mirroring_init_failure_with_recovery` starts, `test_mirroring_init_failure` is run. That test simulates a mirror initialization failure by enabling mirroring via the monitor interface (which does not create the `cephfs_mirror` index object, causing mirror initialization for the given file system to fail -- which the test expects). The test then disables mirroring. cephfs_test_runner then deletes the file system and the MDSs and starts running the next test. However, the mirror daemon is still retrying mirror initialization for the file system (note that the mirroring-disabled event is received only after the restart gets kicked off). The restart then blocks on mount:

2021-04-30T20:44:23.066+0000 10b70700 20 cephfs::mirror::Mirror update_fs_mirrors
2021-04-30T20:44:23.067+0000 10b70700  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=44, fs_name=cephfs} failed mirroring -- restarting
2021-04-30T20:44:23.068+0000 10b70700 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:23.069+0000 10b70700 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x1716f830) after 2 seconds
2021-04-30T20:44:23.069+0000 1036f700 20 cephfs::mirror::FSMirror ~FSMirror
2021-04-30T20:44:23.070+0000 1036f700 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:23.070+0000 1036f700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 add_or_update_fs_attribute: fscid=44
2021-04-30T20:44:23.070+0000 1036f700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 schedule_update_status
2021-04-30T20:44:23.070+0000 1036f700 20 cephfs::mirror::FSMirror init
2021-04-30T20:44:23.070+0000 1036f700 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2021-04-30T20:44:23.261+0000 1036f700 -1 asok(0x168a1c60) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2021-04-30T20:44:23.393+0000 1036f700 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2021-04-30T20:44:23.407+0000 1036f700 20 cephfs::mirror::Utils mount: filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:23.817+0000 cb68700 20 cephfs::mirror::ClusterWatcher handle_fsmap
2021-04-30T20:44:23.819+0000 cb68700  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=44, fs_name=cephfs}]
2021-04-30T20:44:23.819+0000 cb68700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 remove_filesystem: fscid=44
2021-04-30T20:44:23.819+0000 cb68700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 schedule_update_status
2021-04-30T20:44:23.819+0000 cb68700 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:24.070+0000 14b78700 20 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 update_status: 0 filesystem(s)
2021-04-30T20:44:25.070+0000 10b70700 20 cephfs::mirror::Mirror update_fs_mirrors
2021-04-30T20:49:12.059+0000 231f3700  0 client.11380 ms_handle_reset on v2:172.21.2.115:3300/0

The mount timeout is 300 seconds, so the updater thread is not freed up in time to process mirroring events for other file systems.

I'll run a test with this PR: https://github.com/ceph/ceph/pull/40885, which sets the mount timeout to a much lower value to avoid this lockup.
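cephfs-mirror is C++, but the idea behind bounding the mount can be illustrated with a short Python sketch (names like `do_mount` are stand-ins, not the daemon's API): run the possibly hanging mount on a helper thread and give up after a timeout, so the single updater thread is not parked for the full 300-second default.

```python
import threading
import time


def mount_with_timeout(do_mount, timeout):
    """Illustrative only: run a (possibly hanging) mount callable on a
    helper thread and wait at most `timeout` seconds for it, so the
    caller's update loop is not blocked indefinitely."""
    result = {}

    def worker():
        try:
            result['rc'] = do_mount()
        except Exception as e:  # surface mount errors to the caller
            result['error'] = e

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # Timed out: treat the mount as failed and let the update loop
        # move on; the daemon can retry this filesystem later.
        return None
    return result.get('rc')
```

With a short timeout, a mount that hangs (as in the log above) returns control to the update loop quickly instead of serializing all other mirroring events behind it.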

#5 Updated by Venky Shankar over 1 year ago

Tested with https://github.com/ceph/ceph/pull/40885 and the failures due to the blocked updater thread have gone away: https://pulpito.ceph.com/vshankar-2021-05-03_17:19:07-fs-wip-cephfs-mirror-incremental-sync-20210503-010749-distro-basic-gibba/

However, the failure now seems to be the admin socket command not getting registered:

2021-05-03T19:44:15.576 INFO:teuthology.orchestra.run:Running command with timeout 30
2021-05-03T19:44:15.576 DEBUG:teuthology.orchestra.run.gibba029:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@46
2021-05-03T19:44:15.647 INFO:teuthology.orchestra.run.gibba029.stderr:admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
2021-05-03T19:44:15.648 DEBUG:teuthology.orchestra.run:got remote process result: 22
2021-05-03T19:44:15.650 WARNING:tasks.cephfs.test_mirroring:mirror daemon command with label "mirror status for fs: cephfs" failed: Command failed (mirror status for fs: cephfs) on gibba029 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@46'
2021-05-03T19:44:19.330 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.331 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.331 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.332 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.332 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.335 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.336 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.336 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.336 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.337 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:20.652 INFO:teuthology.orchestra.run:Running command with timeout 30
2021-05-03T19:44:20.652 DEBUG:teuthology.orchestra.run.gibba029:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@46
2021-05-03T19:44:20.722 INFO:teuthology.orchestra.run.gibba029.stderr:admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Running with "debug asok = 20" for further debugging.
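Since the admin socket command may only be registered some time after the daemon (re)starts, one mitigation on the qa side is to retry the socket query with a bounded budget. A minimal, hypothetical sketch of such a retry wrapper (names are illustrative, not the actual qa helper; `run_command` would issue e.g. `ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status ...` and return a (returncode, output) pair):

```python
import time


def wait_for_asok_command(run_command, tries=5, sleep=5, _sleep_fn=time.sleep):
    """Illustrative retry wrapper: keep re-issuing an admin-socket query
    until the daemon has registered the command (returncode 0) or the
    retry budget is spent."""
    for attempt in range(tries):
        rc, out = run_command()
        if rc == 0:
            return out
        _sleep_fn(sleep)  # give the daemon time to register the command
    raise RuntimeError(
        'admin socket command not registered after %d tries' % tries)
```

This only papers over the race from the test side; the real fix is making the daemon's command registration/unregistration consistent across restarts.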

#6 Updated by Venky Shankar over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41171

#7 Updated by Venky Shankar over 1 year ago

  • Related to Bug #50035: cephfs-mirror: use sensible mount/shutdown timeouts added

#8 Updated by Venky Shankar over 1 year ago

Requires PR https://github.com/ceph/ceph/pull/40885 to fully fix the failing test.

#9 Updated by Patrick Donnelly over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#10 Updated by Backport Bot over 1 year ago

  • Copied to Backport #50877: pacific: qa: test_mirroring_init_failure_with_recovery failure added

#11 Updated by Loïc Dachary over 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
