Bug #54108
openqa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}"
Description
Updated by Venky Shankar about 2 years ago
- Assignee set to Ramana Raja
- Priority changed from High to Low
Updated by Venky Shankar about 2 years ago
Ramana, this is showing up a bit in master. Please take a look.
Updated by Ramana Raja almost 2 years ago
The counters mds.imported and mds.exported were not incremented during the iogen workloads. I'm not sure why the exports were not generated, or why they were not detected.
Looking at the code,
l_mds_exported is incremented in Migrator::export_go_synced() ← Migrator::handle_export_ack() ← Migrator::dispatch() case: MSG_MDS_EXPORTDIRACK
l_mds_imported is incremented in Migrator::handle_export_dir() ← Migrator::dispatch case MSG_MDS_EXPORTDIR
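The message flow above can be modeled with a toy mock (hypothetical Python sketch, not Ceph code; the real counters live in the C++ Migrator, and the method names simply mirror the call chains listed above):

```python
from collections import Counter

class MiniMigrator:
    """Toy model of the Migrator paths that bump the perf counters:
    the exporter increments mds.exported on MSG_MDS_EXPORTDIRACK, and
    the importer increments mds.imported on MSG_MDS_EXPORTDIR."""

    def __init__(self):
        self.perf = Counter()

    def dispatch(self, msg_type):
        if msg_type == "MSG_MDS_EXPORTDIRACK":   # exporter side
            self.handle_export_ack()
        elif msg_type == "MSG_MDS_EXPORTDIR":    # importer side
            self.handle_export_dir()

    def handle_export_ack(self):
        self.export_go_synced()

    def export_go_synced(self):
        self.perf["mds.exported"] += 1

    def handle_export_dir(self):
        self.perf["mds.imported"] += 1
```

So neither counter moves unless an export actually completes the message handshake, which is why an iogen run that never triggers a migration fails the check.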
In the latest teuthology runs on the "main" branch, I see the same test both pass and fail.
Passed:
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886671/
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886683/
Failed:
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879552/
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879616/
Looking at the passed test,
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886671/
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-bitmap omap_limit/10 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/no standby-replay subvolume/{with-namespace-isolated-and-quota} tasks/{0-check-counter workunit/suites/iogen}}
In the teuthology log,
iogen workload start
2022-06-19T09:10:17.235 INFO:tasks.workunit:Running workunits matching suites/iogen.sh on client.0...
2022-06-19T09:10:17.236 INFO:tasks.workunit:Running workunit suites/iogen.sh...
2022-06-19T09:10:17.237 DEBUG:teuthology.orchestra.run.smithi060:workunit test suites/iogen.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=4322afc801c50aea640f0a7d6c01cfe9e520b178 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/iogen.sh
2022-06-19T09:10:17.325 INFO:tasks.workunit.client.0.smithi060.stdout:getting iogen
10 minutes later, the iogen workload is stopped:
2022-06-19T09:20:23.910 INFO:tasks.workunit:Stopping ['suites/iogen.sh'] on client.0...
Then, the mds.exported and mds.imported counters are checked in each of the 12 mds daemons. The counters were incremented in the active daemons mds.i and mds.l:
2022-06-19T09:21:05.207 INFO:tasks.check_counter:Daemon mds.i mds.exported=1
2022-06-19T09:21:05.207 INFO:tasks.check_counter:Daemon mds.i mds.imported=1
2022-06-19T09:21:05.207 DEBUG:tasks.check_counter:Getting stats from l
2022-06-19T09:21:05.208 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:4322afc801c50aea640f0a7d6c01cfe9e520b178 shell --fsid bf08e008-efad-11ec-8427-001a4aab830c -- ceph daemon mds.l perf dump
...
2022-06-19T09:21:08.177 INFO:tasks.check_counter:Daemon mds.l mds.exported=1
2022-06-19T09:21:08.178 INFO:tasks.check_counter:Daemon mds.l mds.imported=1
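Conceptually, the check-counter step resolves each dotted counter name against the nested dict that `ceph daemon mds.X perf dump` returns and fails if any expected counter is absent or zero. A minimal sketch (hypothetical helper names; the real logic lives in the qa check_counter task):

```python
import json

def counter_value(perf_dump_json: str, counter: str):
    """Resolve a dotted counter name like 'mds.exported' against the
    nested JSON that `ceph daemon mds.X perf dump` emits, e.g.
    {"mds": {"exported": 1, "imported": 1, ...}, ...}."""
    node = json.loads(perf_dump_json)
    for part in counter.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

def failed_counters(perf_dump_json: str,
                    expected=("mds.exported", "mds.imported")):
    """Return the set of expected counters that are missing or zero,
    mirroring the failure message quoted in this ticket's title."""
    return {c for c in expected if not counter_value(perf_dump_json, c)}
```

In the failed runs, no active daemon ever performed an export or import, so both counters stayed at zero on every daemon and the check reported the full set.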
In the mds.i log,
2022-06-19T09:10:29.178+0000 7fcae3608700 7 mds.1.mig decode_import_dir done [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{0=1} v=13 cv=0/0 dir_auth=1,0 state=1610614787|complete|frozentree|importing f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x5615dcf35680]
2022-06-19T09:10:29.178+0000 7fcae3608700 10 mds.1.mig handle_export_dir 0 imported bounds
2022-06-19T09:10:29.178+0000 7fcae3608700 7 mds.1.mig handle_export_dir did [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{0=1} v=13 cv=0/0 dir_auth=1,0 state=1610614787|complete|frozentree|importing f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x5615dcf35680]
2022-06-19T09:10:29.178+0000 7fcadcdfb700 5 mds.1.log _submit_thread 4195880~8871 : EImportStart 0x10000000202 from mds.0 [metablob 0x1, 9 dirs]
...
2022-06-19T09:20:25.529+0000 7fcae3608700 7 mds.1.mig export_go [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{0=2} v=18 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=0 request=0 child=0 frozen=1 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 0x5615dcf35680] to 0
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSContext::complete: 12C_IO_Wrapper
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSContext::complete: 12C_M_ExportGo
2022-06-19T09:20:25.529+0000 7fcadd5fc700 7 mds.1.mig export_go_synced [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{0=2} v=18 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=1 request=0 child=0 frozen=1 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 0x5615dcf35680] to 0
2022-06-19T09:20:25.529+0000 7fcadd5fc700 15 mds.1.cache show_subtrees
In the mds.l log,
2022-06-19T09:10:29.177+0000 7f6d76d88700 7 mds.0.mig export_go [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{1=1} v=13 cv=0/0 dir_auth=0,0 state=1610875907|complete|frozentree|exporting f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 replicated=1 dirty=1 waiter=0 authpin=0 0x558940f95200] to 1
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSContext::complete: 12C_IO_Wrapper
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSContext::complete: 12C_M_ExportGo
2022-06-19T09:10:29.177+0000 7f6d70d7c700 7 mds.0.mig export_go_synced [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{1=1} v=13 cv=0/0 dir_auth=0,0 state=1610875907|complete|frozentree|exporting f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 replicated=1 dirty=1 waiter=0 authpin=0 0x558940f95200] to 1
2022-06-19T09:10:29.177+0000 7f6d70d7c700 15 mds.0.cache show_subtrees
...
2022-06-19T09:20:25.530+0000 7f6d76d88700 10 mds.0.mig handle_export_dir 0 imported bounds
2022-06-19T09:20:25.530+0000 7f6d76d88700 7 mds.0.mig handle_export_dir did [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{1=1} v=18 cv=0/0 dir_auth=0,1 state=1073743875|complete|frozentree|importing f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=1 request=0 child=0 frozen=1 subtree=1 importing=1 replicated=1 dirty=0 waiter=0 authpin=0 tempexporting=0 0x558940f95200]
2022-06-19T09:20:25.530+0000 7f6d7057b700 5 mds.0.log _submit_thread 4924336~520 : EImportStart 0x10000000202 from mds.1 [metablob 0x10000000202, 1 dirs]
Looking at the failed test,
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879552/
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-bitmap omap_limit/10 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/no standby-replay subvolume/{with-namespace-isolated-and-quota} tasks/{0-check-counter workunit/suites/iogen}}
As in the tests that passed, the iogen workload ran in client.0 for close to 10 minutes. The mds.exported and mds.imported counters were then checked in all 12 mds daemons. The active daemons (mds.l, mds.i and mds.h) did not export or import subtrees.
Need to dig further into why the same iogen workload sometimes triggers subtree migration and sometimes doesn't. Maybe identify CephFS settings that lower the threshold at which subtree migration is initiated?
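For example, the MDS balancer exposes knobs that should make migrations more likely (a sketch only; `mds_bal_interval` and `mds_bal_min_rebalance` are real MDS balancer options, but the values below are illustrative and have not been validated against this test):

```
[mds]
# run the balancer tick more often
mds_bal_interval = 5
# migrate subtrees at a lower load-imbalance ratio
mds_bal_min_rebalance = 0.01
```

Whether tuning these makes the iogen run deterministically trigger an export would still need to be confirmed in a teuthology run.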
Updated by Ramana Raja almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 46861
Updated by Rishabh Dave almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot almost 2 years ago
- Copied to Backport #56590: quincy: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" added
Updated by Backport Bot almost 2 years ago
- Copied to Backport #56591: pacific: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" added
Updated by Rishabh Dave about 1 year ago
Venky,
This failure occurred in Pacific backport runs - https://pulpito.ceph.com/yuriw-2023-04-04_15:06:57-fs-wip-yuri8-testing-2023-03-31-0812-pacific-distro-default-smithi/7230918. Should the patch for this ticket be backported?