Bug #63931


qa: test_mirroring_init_failure_with_recovery failure

Added by Kotresh Hiremath Ravishankar 4 months ago. Updated 28 days ago.

Status:
In Progress
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
quincy,reef,squid
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
cephfs-mirror
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Seen in a few qa runs recently.

https://pulpito.ceph.com/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968329
https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862
https://pulpito.ceph.com/vshankar-2023-12-13_09:42:45-fs-wip-vshankar-testing3-2023-12-13-1225-quincy-testing-default-smithi/7489625
https://pulpito.ceph.com/yuriw-2023-12-28_15:45:00-fs-wip-yuri7-testing-2023-12-27-1008-pacific-distro-default-smithi/7502932

Latest:
https://pulpito.ceph.com/yuriw-2023-12-27_16:33:25-fs-wip-yuri-testing-2023-12-26-0957-quincy-distro-default-smithi/7501960

tasks.cephfs.test_mirroring.TestMirroring.test_mirroring_init_failure_with_recovery
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure ... ERROR
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:======================================================================
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_5b5a60d2d510bfaa9213c41e32227fb008245a30/qa/tasks/cephfs/test_mirroring.py", line 742, in test_mirroring_init_failure_with_recovery
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:    while proceed():
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_9680d2aba1e76f64ade295c81599c97907457848/teuthology/contextutil.py", line 134, in __call__
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (21) after waiting for 100 seconds
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:Ran 29 tests in 6496.852s
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:======================================================================
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_5b5a60d2d510bfaa9213c41e32227fb008245a30/qa/tasks/cephfs/test_mirroring.py", line 742, in test_mirroring_init_failure_with_recovery
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:    while proceed():
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_9680d2aba1e76f64ade295c81599c97907457848/teuthology/contextutil.py", line 134, in __call__
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2023-12-27T18:49:39.572 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (21) after waiting for 100 seconds
2023-12-27T18:49:39.572 INFO:tasks.cephfs_test_runner:
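
The MaxWhileTries above is raised by teuthology's contextutil.safe_while helper, which the test drives via the proceed() calls shown in the traceback. A minimal sketch of that polling pattern (the status check and the sleep/tries values are hypothetical; only safe_while and the 'wait for failed state' action come from the log):

from teuthology.contextutil import safe_while

def mirror_state():
    # Hypothetical stand-in for the test's real check, which inspects the
    # mirror daemon status for a "state" field.
    return "unknown"

# proceed() returns True until the retry budget runs out, then safe_while
# raises MaxWhileTries with the "'wait for failed state' reached maximum
# tries" message seen above.
with safe_while(sleep=5, tries=20, action='wait for failed state') as proceed:
    while proceed():
        if mirror_state() == 'failed':
            break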

Files

client.mirror_remote.367187.log (62.1 KB), Jos Collin, 04/04/2024 08:35 AM
client.admin.367187.log (103 KB), Jos Collin, 04/04/2024 08:35 AM

Related issues 1 (0 open, 1 closed)

Related to CephFS - Bug #50224: qa: test_mirroring_init_failure_with_recovery failure (Resolved, Venky Shankar)

Actions #1

Updated by Kotresh Hiremath Ravishankar 4 months ago

  • Related to Bug #50224: qa: test_mirroring_init_failure_with_recovery failure added
Actions #2

Updated by Venky Shankar 4 months ago

  • Category set to Correctness/Safety
  • Status changed from New to Triaged
  • Assignee set to Jos Collin
  • Target version set to v19.0.0
  • Backport set to quincy,reef
Actions #3

Updated by Jos Collin about 1 month ago

  • Status changed from Triaged to In Progress

@Venky,

When we disable mirroring using 'ceph mgr module disable mirroring' and then enable it using 'ceph fs mirror enable <primary_fs_name>', the 'fs mirror status' command doesn't output the "state: failed" status. Mirroring does appear to have failed, but "state: failed" is never shown.

Execution doesn't hit 'FSMirror::init', which is where m_init_failed would get set to true.
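
For reference, the sequence described above amounts to the following; a minimal sketch for a vstart cluster, where the ceph subcommands come from this comment but the Python wrapper, the ./bin/ceph path, and the fs name 'a' are assumptions:

import subprocess

def ceph(*args):
    # Thin wrapper around the vstart ceph binary (path is an assumption).
    return subprocess.check_output(["./bin/ceph", *args], text=True)

# Disable the mirroring mgr module, then enable mirroring for the primary fs.
ceph("mgr", "module", "disable", "mirroring")
ceph("fs", "mirror", "enable", "a")  # 'a' as <primary_fs_name> is an assumption
# 'fs mirror status' on the daemon's admin socket should then eventually
# report "state: failed", which is what fails to happen here.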

Actions #4

Updated by Venky Shankar 29 days ago

  • Target version changed from v19.0.0 to v20.0.0
  • Backport changed from quincy,reef to quincy,reef,squid
Actions #5

Updated by Venky Shankar 29 days ago

Jos Collin wrote:

[...]

Execution doesn't hit 'FSMirror::init', which is where m_init_failed would get set to true.

So, some change likely broke the part where m_init_failed gets set or there is a delayed check for m_init_failed. I would start with the latter and hash that out first.

Actions #6

Updated by Jos Collin 28 days ago

Venky Shankar wrote:

[...]

So, some change likely broke the part where m_init_failed gets set or there is a delayed check for m_init_failed. I would start with the latter and hash that out first.

It doesn't seem to be a delayed check for m_init_failed. The debugger always shows m_init_failed = false, even after waiting a few minutes:

 460  void FSMirror::mirror_status(Formatter *f) {
 461    std::scoped_lock locker(m_lock);
!462    f->open_object_section("status");
 463    if (m_init_failed) {
 464      f->dump_string("state", "failed");
 465    } else if (is_blocklisted(locker)) {
 466      f->dump_string("state", "blocklisted");
 467    } else {
 468      // dump rados addr for blocklist test
 469      f->dump_string("rados_inst", m_addrs);
 470      f->open_object_section("peers");
 471      for ([[maybe_unused]] auto &[peer, peer_replayer] : m_peer_replayers) {
 472        peer.dump(f);
 473      }
─── Variables ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
arg this = 0x5601b6cfe000: {m_cct = 0x5601b1e9a000,m_filesystem = {fscid = 1,fs_name = "a"},m_pool_id …, f = 0x5601b6d28e00: {_vptr.Formatter = 0x7fb7cf694b30 <vtable for ceph::JSONFormatter+16>}
loc locker = {_M_device = @0x5601b6cfe060}
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
>>> p m_init_failed
$4 = false

The command:

./bin/ceph --admin-daemon asok/client.admin.342404.asok fs mirror status a@1

works, but it doesn't show the 'state'. So this is likely a problem with the m_init_failed value failing to get set. Checking that.
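
For completeness, that admin-socket query can be scripted and its JSON inspected directly; a minimal sketch, assuming the vstart asok path above and the field names from the FSMirror::mirror_status listing:

import json
import subprocess

def mirror_status(asok, fs_spec="a@1"):
    # 'fs mirror status' output is JSON produced by FSMirror::mirror_status.
    out = subprocess.check_output(
        ["./bin/ceph", "--admin-daemon", asok, "fs", "mirror", "status", fs_spec],
        text=True)
    return json.loads(out)

status = mirror_status("asok/client.admin.342404.asok")
# A failed init should yield {"state": "failed"}; instead the daemon dumps
# "rados_inst"/"peers" with no "state" key at all.
print(status.get("state", "<no state key>"))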

Actions #7

Updated by Venky Shankar 28 days ago

Jos Collin wrote:

[...]

It doesn't seem to be a delayed check for m_init_failed. The debugger always shows m_init_failed = false, even after waiting a few minutes:
[...]

The command:
[...]
works, but it doesn't show the 'state'. So this is likely a problem with the m_init_failed value failing to get set. Checking that.

Do you have the run link? I'll have a look now.

Actions #8

Updated by Jos Collin 28 days ago

Venky Shankar wrote:

[...]

Do you have the run link? I'll have a look now.

The above finding is from a reproduction on a vstart cluster.
I was also looking at https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862.

Actions #9

Updated by Venky Shankar 28 days ago

Jos Collin wrote:

[...]

The above finding is from a reproduction on a vstart cluster.
I was also looking at https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862.

No problem. Just upload the cephfs-mirror debug logs from the vstart cluster when running this test.

Actions #10

Updated by Jos Collin 28 days ago

Venky Shankar wrote:

[...]

No problem. Just upload the cephfs-mirror debug logs from the vstart cluster when running this test.

Logs attached.
