Bug #63931
openqa: test_mirroring_init_failure_with_recovery failure
Description
Seen in a few qa runs recently.
https://pulpito.ceph.com/yuriw-2022-08-11_16:57:01-fs-wip-yuri3-testing-2022-08-11-0809-pacific-distro-default-smithi/6968329
https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862
https://pulpito.ceph.com/vshankar-2023-12-13_09:42:45-fs-wip-vshankar-testing3-2023-12-13-1225-quincy-testing-default-smithi/7489625
https://pulpito.ceph.com/yuriw-2023-12-28_15:45:00-fs-wip-yuri7-testing-2023-12-27-1008-pacific-distro-default-smithi/7502932
tasks.cephfs.test_mirroring.TestMirroring.test_mirroring_init_failure_with_recovery

2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure ... ERROR
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:======================================================================
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:ERROR: test_mirroring_init_failure_with_recovery (tasks.cephfs.test_mirroring.TestMirroring)
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Test if the mirror daemon can recover from a init failure
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/github.com_ceph_ceph-c_5b5a60d2d510bfaa9213c41e32227fb008245a30/qa/tasks/cephfs/test_mirroring.py", line 742, in test_mirroring_init_failure_with_recovery
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:    while proceed():
2023-12-27T18:49:39.570 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/git.ceph.com_teuthology_9680d2aba1e76f64ade295c81599c97907457848/teuthology/contextutil.py", line 134, in __call__
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:    raise MaxWhileTries(error_msg)
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:teuthology.exceptions.MaxWhileTries: 'wait for failed state' reached maximum tries (21) after waiting for 100 seconds
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:Ran 29 tests in 6496.852s
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:
2023-12-27T18:49:39.571 INFO:tasks.cephfs_test_runner:FAILED (errors=1)
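For context, the failure comes from the test's polling loop: it uses teuthology's safe_while()/proceed() to repeatedly check 'fs mirror status' for "state: failed", and raises MaxWhileTries when the state never appears. A minimal sketch of that polling pattern (the helper names wait_for and check_failed_state are illustrative, not the actual qa code):

```python
import time


class MaxWhileTries(Exception):
    """Raised when the polled condition never becomes true."""
    pass


def wait_for(action, check, tries=21, sleep=5):
    """Poll check() up to `tries` times, sleeping between attempts,
    mimicking the retry pattern of teuthology's safe_while()/proceed().
    Returns the attempt number on success, raises MaxWhileTries otherwise."""
    for attempt in range(tries):
        if check():
            return attempt + 1
        time.sleep(sleep)
    raise MaxWhileTries(
        f"'{action}' reached maximum tries ({tries}) "
        f"after waiting for {tries * sleep} seconds")


# Illustrative status check: the real test inspects the parsed
# 'fs mirror status' JSON for '"state": "failed"'.
def check_failed_state(status):
    return status.get('state') == 'failed'
```

In this bug the status output never carries the "state" key at all, so the check fails on every attempt and the loop exhausts its 21 tries.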
Files
Updated by Kotresh Hiremath Ravishankar 4 months ago
- Related to Bug #50224: qa: test_mirroring_init_failure_with_recovery failure added
Updated by Venky Shankar 4 months ago
- Category set to Correctness/Safety
- Status changed from New to Triaged
- Assignee set to Jos Collin
- Target version set to v19.0.0
- Backport set to quincy,reef
Updated by Jos Collin about 1 month ago
- Status changed from Triaged to In Progress
@Venky,
When we disable mirroring using 'ceph mgr module disable mirroring' and then re-enable it using 'ceph fs mirror enable <primary_fs_name>', the 'fs mirror status' command doesn't report "state: failed". Mirroring appears to have failed, but "state: failed" never shows.
'FSMirror::init' is never reached, so m_init_failed never gets set to true.
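The disable/enable sequence described above can be scripted for a local repro. A minimal sketch, assuming a vstart cluster (hence './bin/ceph'), fs name 'a', and an admin socket path of the shape used later in this thread; reproduce() needs a live cluster, while ceph_cmd() only builds argv lists:

```python
import subprocess

CEPH = './bin/ceph'  # vstart build path; adjust for an installed cluster


def ceph_cmd(*args, asok=None):
    """Build a ceph CLI argv; an asok path routes it via the admin socket."""
    cmd = [CEPH]
    if asok:
        cmd += ['--admin-daemon', asok]
    return cmd + list(args)


def run(cmd):
    """Execute one command and return stdout (requires a running cluster)."""
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout


def reproduce(fs_name='a', asok='asok/client.admin.342404.asok'):
    # The sequence described above: disable the mgr module, re-enable
    # mirroring on the primary fs.
    run(ceph_cmd('mgr', 'module', 'disable', 'mirroring'))
    run(ceph_cmd('fs', 'mirror', 'enable', fs_name))
    # The bug: this status output lacks the expected '"state": "failed"'.
    return run(ceph_cmd('fs', 'mirror', 'status', f'{fs_name}@1', asok=asok))
```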
Updated by Venky Shankar 29 days ago
- Target version changed from v19.0.0 to v20.0.0
- Backport changed from quincy,reef to quincy,reef,squid
Updated by Venky Shankar 29 days ago
Jos Collin wrote:
@Venky,
When we disable mirroring using 'ceph mgr module disable mirroring' and then re-enable it using 'ceph fs mirror enable <primary_fs_name>', the 'fs mirror status' command doesn't report "state: failed". Mirroring appears to have failed, but "state: failed" never shows.
'FSMirror::init' is never reached, so m_init_failed never gets set to true.
So, some change likely broke the part where m_init_failed gets set, or there is a delayed check for m_init_failed. I would start with the latter and hash that out first.
Updated by Jos Collin 28 days ago
Venky Shankar wrote:
Jos Collin wrote:
[...]
So, some change likely broke the part where m_init_failed gets set, or there is a delayed check for m_init_failed. I would start with the latter and hash that out first.
It doesn't seem to be a delayed check for m_init_failed. The debug always shows m_init_failed = false, even after waiting for a few minutes:
460 void FSMirror::mirror_status(Formatter *f) {
461   std::scoped_lock locker(m_lock);
!462   f->open_object_section("status");
463   if (m_init_failed) {
464     f->dump_string("state", "failed");
465   } else if (is_blocklisted(locker)) {
466     f->dump_string("state", "blocklisted");
467   } else {
468     // dump rados addr for blocklist test
469     f->dump_string("rados_inst", m_addrs);
470     f->open_object_section("peers");
471     for ([[maybe_unused]] auto &[peer, peer_replayer] : m_peer_replayers) {
472       peer.dump(f);
473     }
─── Variables ───────────────────────────────────────────────────────────
arg this = 0x5601b6cfe000: {m_cct = 0x5601b1e9a000, m_filesystem = {fscid = 1, fs_name = "a"}, m_pool_id …}
arg f = 0x5601b6d28e00: {_vptr.Formatter = 0x7fb7cf694b30 <vtable for ceph::JSONFormatter+16>}
loc locker = {_M_device = @0x5601b6cfe060}
─────────────────────────────────────────────────────────────────────────
>>> p m_init_failed
$4 = false
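The branching above explains the symptom: "state" is only emitted on the failed/blocklisted paths, so a never-set m_init_failed leaves the key out of the output entirely. A Python model of that branching (the dict shape is an assumption standing in for the Formatter output):

```python
def mirror_status(init_failed, blocklisted, addrs, peers):
    """Model of FSMirror::mirror_status branching: 'state' is only set on
    the failed/blocklisted branches; the healthy branch dumps rados_inst
    and peers instead."""
    status = {}
    if init_failed:
        status['state'] = 'failed'
    elif blocklisted:
        status['state'] = 'blocklisted'
    else:
        status['rados_inst'] = addrs
        status['peers'] = dict(peers)
    return status


# With m_init_failed stuck at false (this bug), no 'state' key is emitted,
# so the test's wait for '"state": "failed"' can never succeed:
buggy = mirror_status(False, False, '192.168.0.1:0/123', {})
```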
The command:
./bin/ceph --admin-daemon asok/client.admin.342404.asok fs mirror status a@1
works, but it doesn't show the 'state' field. So the problem should be that the m_init_failed value is never set. Checking that.
Updated by Venky Shankar 28 days ago
Jos Collin wrote:
Venky Shankar wrote:
[...]
It doesn't seem to be a delayed check for m_init_failed. The debug always shows m_init_failed = false, even after waiting for a few minutes:
[...]
The command:
[...]
works, but it doesn't show the 'state' field. So the problem should be that the m_init_failed value is never set. Checking that.
Do you have the run link? I'll have a look now.
Updated by Jos Collin 28 days ago
Venky Shankar wrote:
[...]
Do you have the run link? I'll have a look now.
The above finding is from a reproduction on a vstart cluster.
I was also looking at https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862.
Updated by Venky Shankar 28 days ago
Jos Collin wrote:
[...]
The above finding is from a reproduction on a vstart cluster.
I was also looking at https://pulpito.ceph.com/rishabh-2023-06-23_17:37:30-fs-wip-rishabh-improvements-authmon-distro-default-smithi/7313862.
No problem. Just upload the cephfs-mirror debug logs from the vstart cluster when running this test.
Updated by Jos Collin 28 days ago
- File client.admin.367187.log client.admin.367187.log added
- File client.mirror_remote.367187.log client.mirror_remote.367187.log added
Venky Shankar wrote:
[...]
No problem. Just upload the cephfs-mirror debug logs from the vstart cluster when running this test.
Logs attached.