Bug #50224

qa: test_mirroring_init_failure_with_recovery failure

Added by Patrick Donnelly over 1 year ago. Updated 4 months ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS): qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2021-04-07T04:01:07.734 INFO:tasks.cephfs_test_runner:======================================================================
2021-04-07T04:01:07.734 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2021-04-07T04:01:07.734 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2021-04-07T04:01:07.735 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2021-04-07T04:01:07.735 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_ceph-c_e4f7126b95459fb0239be660756fb943526cec3f/qa/tasks/cephfs/test_mirroring.py", line 755, in test_mirroring_init_failure_with_recovery
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:    while proceed():
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_git_teuthology_85d61eae4759f46ce21e9a37cd816a7a1a66c9d5/teuthology/contextutil.py", line 133, in __call__
2021-04-07T04:01:07.736 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2021-04-07T04:01:07.737 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (20) after waiting for 100 seconds

From: /ceph/teuthology-archive/pdonnell-2021-04-07_02:12:41-fs-wip-pdonnell-testing-20210406.213012-distro-basic-smithi/6025522/teuthology.log
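The failure is the qa retry helper exhausting its budget: the test loops on teuthology's `safe_while` context manager until the mirror daemon reaches the failed state. A simplified, self-contained sketch of that pattern (not the real teuthology code; details such as when the sleep happens are approximated) reproduces the error message seen above:

```python
import time


class MaxWhileTries(Exception):
    """Stands in for teuthology.exceptions.MaxWhileTries."""


class SafeWhile:
    """Simplified sketch of teuthology.contextutil.safe_while: yields a
    proceed() callable that sleeps between attempts and raises
    MaxWhileTries once `tries` attempts have been made."""

    def __init__(self, sleep=5, tries=20, action=None, _sleep_fn=time.sleep):
        self.sleep = sleep
        self.tries = tries
        self.action = action
        self.attempt = 0
        self._sleep_fn = _sleep_fn  # injectable so tests need not really sleep

    def __enter__(self):
        return self.proceed

    def __exit__(self, exc_type, exc_val, exc_tb):
        return False

    def proceed(self):
        if self.attempt >= self.tries:
            waited = self.sleep * self.attempt
            raise MaxWhileTries(
                "'%s' reached maximum tries (%d) after waiting for %d seconds"
                % (self.action, self.tries, waited))
        if self.attempt > 0:
            self._sleep_fn(self.sleep)
        self.attempt += 1
        return True


# Usage mirroring the failing test (the condition check is hypothetical):
# with SafeWhile(sleep=5, tries=20, action='wait for failed state') as proceed:
#     while proceed():
#         if mirror_daemon_in_failed_state():
#             break
```

With sleep=5 and tries=20 the budget is 100 seconds, matching the traceback.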


Related issues

Related to CephFS - Bug #50035: cephfs-mirror: use sensible mount/shutdown timeouts Resolved
Copied to CephFS - Backport #50877: pacific: qa: test_mirroring_init_failure_with_recovery failure Resolved

History

#1 Updated by Venky Shankar over 1 year ago

  • Status changed from New to In Progress

#2 Updated by Patrick Donnelly over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41000

#3 Updated by Venky Shankar over 1 year ago

  • Status changed from Fix Under Review to In Progress
  • Pull request ID deleted (41000)

Patrick,

PR #41000 is not a fix for this tracker. I've started looking into this today.

PR #41000 is an additional test case fix on top of PR #40157.

#4 Updated by Venky Shankar over 1 year ago

Hit this again recently: https://pulpito.ceph.com/vshankar-2021-04-30_17:19:54-fs-wip-cephfs-mirror-incremental-sync-20210430-093812-distro-basic-gibba/6086766/

Analysis:

Before `test_mirroring_init_failure_with_recovery` starts, `test_mirroring_init_failure` is run. That test simulates a mirror initialization failure by enabling mirroring via the monitor interface (which does not create the `cephfs_mirror` index object, causing mirror initialization for the given file system to fail -- which the test expects). The test then disables mirroring. cephfs_test_runner then deletes the file system and the MDSs and starts running the next test. However, the mirror daemon is still retrying mirror initialization for the file system (note that the mirroring-disabled event is received only after the restart gets kicked off). The restart then blocks on mount:

2021-04-30T20:44:23.066+0000 10b70700 20 cephfs::mirror::Mirror update_fs_mirrors
2021-04-30T20:44:23.067+0000 10b70700  5 cephfs::mirror::Mirror update_fs_mirrors: filesystem={fscid=44, fs_name=cephfs} failed mirroring -- restarting
2021-04-30T20:44:23.068+0000 10b70700 10 cephfs::mirror::Mirror disable_mirroring: init failed for filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:23.069+0000 10b70700 20 cephfs::mirror::Mirror schedule_mirror_update_task: scheduling fs mirror update (0x1716f830) after 2 seconds
2021-04-30T20:44:23.069+0000 1036f700 20 cephfs::mirror::FSMirror ~FSMirror
2021-04-30T20:44:23.070+0000 1036f700 10 cephfs::mirror::Mirror enable_mirroring: starting FSMirror: filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:23.070+0000 1036f700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 add_or_update_fs_attribute: fscid=44
2021-04-30T20:44:23.070+0000 1036f700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 schedule_update_status
2021-04-30T20:44:23.070+0000 1036f700 20 cephfs::mirror::FSMirror init
2021-04-30T20:44:23.070+0000 1036f700 20 cephfs::mirror::Utils connect: connecting to cluster=ceph, client=client.mirror, mon_host=
2021-04-30T20:44:23.261+0000 1036f700 -1 asok(0x168a1c60) AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to bind the UNIX domain socket to '/var/run/ceph/cephfs-mirror.asok': (17) File exists
2021-04-30T20:44:23.393+0000 1036f700 10 cephfs::mirror::Utils connect: connected to cluster=ceph using client=client.mirror
2021-04-30T20:44:23.407+0000 1036f700 20 cephfs::mirror::Utils mount: filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:23.817+0000 cb68700 20 cephfs::mirror::ClusterWatcher handle_fsmap
2021-04-30T20:44:23.819+0000 cb68700  5 cephfs::mirror::ClusterWatcher handle_fsmap: mirroring enabled=[], mirroring_disabled=[{fscid=44, fs_name=cephfs}]
2021-04-30T20:44:23.819+0000 cb68700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 remove_filesystem: fscid=44
2021-04-30T20:44:23.819+0000 cb68700 10 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 schedule_update_status
2021-04-30T20:44:23.819+0000 cb68700 10 cephfs::mirror::Mirror mirroring_disabled: filesystem={fscid=44, fs_name=cephfs}
2021-04-30T20:44:24.070+0000 14b78700 20 cephfs::mirror::ServiceDaemon: 0xb0ea9e0 update_status: 0 filesystem(s)
2021-04-30T20:44:25.070+0000 10b70700 20 cephfs::mirror::Mirror update_fs_mirrors
2021-04-30T20:49:12.059+0000 231f3700  0 client.11380 ms_handle_reset on v2:172.21.2.115:3300/0

The mount timeout is 300 seconds, so the updater thread is not freed up in time to process mirroring events for other file systems.

I'll run a test with this PR: https://github.com/ceph/ceph/pull/40885, which sets the mount timeout to a much lower value to avoid this lockup.
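cephfs-mirror is C++, but the idea behind bounding the mount can be illustrated with a short Python sketch (names like `do_mount` are stand-ins, not the daemon's API): run the possibly hanging mount on a helper thread and give up after a timeout, so the single updater thread is not parked for the full 300-second default.

```python
import threading
import time


def mount_with_timeout(do_mount, timeout):
    """Illustrative only: run a (possibly hanging) mount callable on a
    helper thread and wait at most `timeout` seconds for it, so the
    caller's update loop is not blocked indefinitely."""
    result = {}

    def worker():
        try:
            result['rc'] = do_mount()
        except Exception as e:  # surface mount errors to the caller
            result['error'] = e

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # Timed out: treat the mount as failed and let the update loop
        # move on; the daemon can retry this filesystem later.
        return None
    return result.get('rc')
```

With a short timeout, a mount that hangs (as in the log above) returns control to the update loop quickly instead of serializing all other mirroring events behind it.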

#5 Updated by Venky Shankar over 1 year ago

Tested with https://github.com/ceph/ceph/pull/40885 and the failures due to the blocked updater thread have gone away: https://pulpito.ceph.com/vshankar-2021-05-03_17:19:07-fs-wip-cephfs-mirror-incremental-sync-20210503-010749-distro-basic-gibba/

However, the failure now seems to be the admin socket command not getting registered:

2021-05-03T19:44:15.576 INFO:teuthology.orchestra.run:Running command with timeout 30
2021-05-03T19:44:15.576 DEBUG:teuthology.orchestra.run.gibba029:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@46
2021-05-03T19:44:15.647 INFO:teuthology.orchestra.run.gibba029.stderr:admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
2021-05-03T19:44:15.648 DEBUG:teuthology.orchestra.run:got remote process result: 22
2021-05-03T19:44:15.650 WARNING:tasks.cephfs.test_mirroring:mirror daemon command with label "mirror status for fs: cephfs" failed: Command failed (mirror status for fs: cephfs) on gibba029 with status 22: 'ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@46'
2021-05-03T19:44:19.330 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.331 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.331 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.332 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.332 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.328+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.335 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.336 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.336 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.336 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:19.337 INFO:tasks.ceph.mgr.x.gibba029.stderr:2021-05-03T19:44:19.334+0000 7f3dcc33c700 -1 client.0 error registering admin socket command: (17) File exists
2021-05-03T19:44:20.652 INFO:teuthology.orchestra.run:Running command with timeout 30
2021-05-03T19:44:20.652 DEBUG:teuthology.orchestra.run.gibba029:mirror status for fs: cephfs> ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status cephfs@46
2021-05-03T19:44:20.722 INFO:teuthology.orchestra.run.gibba029.stderr:admin_socket: exception getting command descriptions: [Errno 2] No such file or directory

Running with "debug asok = 20" for further debugging.
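Since the admin socket command may only be registered some time after the daemon (re)starts, one mitigation on the qa side is to retry the socket query with a bounded budget. A minimal, hypothetical sketch of such a retry wrapper (names are illustrative, not the actual qa helper; `run_command` would issue e.g. `ceph --admin-daemon /var/run/ceph/cephfs-mirror.asok fs mirror status ...` and return a (returncode, output) pair):

```python
import time


def wait_for_asok_command(run_command, tries=5, sleep=5, _sleep_fn=time.sleep):
    """Illustrative retry wrapper: keep re-issuing an admin-socket query
    until the daemon has registered the command (returncode 0) or the
    retry budget is spent."""
    for attempt in range(tries):
        rc, out = run_command()
        if rc == 0:
            return out
        _sleep_fn(sleep)  # give the daemon time to register the command
    raise RuntimeError(
        'admin socket command not registered after %d tries' % tries)
```

This only papers over the race from the test side; the real fix is making the daemon's command registration/unregistration consistent across restarts.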

#6 Updated by Venky Shankar over 1 year ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 41171

#7 Updated by Venky Shankar over 1 year ago

  • Related to Bug #50035: cephfs-mirror: use sensible mount/shutdown timeouts added

#8 Updated by Venky Shankar over 1 year ago

Requires PR https://github.com/ceph/ceph/pull/40885 to fully fix the failing test.

#9 Updated by Patrick Donnelly over 1 year ago

  • Status changed from Fix Under Review to Pending Backport

#10 Updated by Backport Bot over 1 year ago

  • Copied to Backport #50877: pacific: qa: test_mirroring_init_failure_with_recovery failure added

#11 Updated by Loïc Dachary over 1 year ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
