Bug #54108

qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}"

Added by Patrick Donnelly 10 months ago. Updated 4 months ago.

Status:
Pending Backport
Priority:
Low
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
backport_processed
Backport:
quincy,pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
qa-suite
Labels (FS):
qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Related issues

Copied to CephFS - Backport #56590: quincy: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" New
Copied to CephFS - Backport #56591: pacific: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" New

History

#1 Updated by Venky Shankar 10 months ago

  • Assignee set to Ramana Raja
  • Priority changed from High to Low

#2 Updated by Venky Shankar 8 months ago

Ramana, this is showing up a bit in master. Please take a look.

#4 Updated by Ramana Raja 6 months ago

The mds.imported and mds.exported counters were not incremented during the iogen workload. I'm not sure why the exports were either not generated or not detected.

Looking at the code:
l_mds_exported is incremented in Migrator::export_go_synced() ← Migrator::handle_export_ack() ← Migrator::dispatch(), case MSG_MDS_EXPORTDIRACK
l_mds_imported is incremented in Migrator::handle_export_dir() ← Migrator::dispatch(), case MSG_MDS_EXPORTDIR

In the latest teuthology runs on the "main" branch, I see the same test both pass and fail.

Passed:
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886671/
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886683/

Failed:
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879552/
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879616/

Looking at one of the tests that passed:
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886671/
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-bitmap omap_limit/10 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/no standby-replay subvolume/{with-namespace-isolated-and-quota} tasks/{0-check-counter workunit/suites/iogen}}

In the teuthology log,

The iogen workload starts:

2022-06-19T09:10:17.235 INFO:tasks.workunit:Running workunits matching suites/iogen.sh on client.0...
2022-06-19T09:10:17.236 INFO:tasks.workunit:Running workunit suites/iogen.sh...
2022-06-19T09:10:17.237 DEBUG:teuthology.orchestra.run.smithi060:workunit test suites/iogen.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=4322afc801c50aea640f0a7d6c01cfe9e520b178 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/iogen.sh
2022-06-19T09:10:17.325 INFO:tasks.workunit.client.0.smithi060.stdout:getting iogen

About 10 minutes later, the iogen workload is stopped:

2022-06-19T09:20:23.910 INFO:tasks.workunit:Stopping ['suites/iogen.sh'] on client.0...

Then the mds.exported and mds.imported counters are checked on each of the 12 MDS daemons. The counters were incremented on the active daemons mds.l and mds.i:

2022-06-19T09:21:05.207 INFO:tasks.check_counter:Daemon mds.i mds.exported=1
2022-06-19T09:21:05.207 INFO:tasks.check_counter:Daemon mds.i mds.imported=1

2022-06-19T09:21:05.207 DEBUG:tasks.check_counter:Getting stats from l
2022-06-19T09:21:05.208 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:4322afc801c50aea640f0a7d6c01cfe9e520b178 shell --fsid bf08e008-efad-11ec-8427-001a4aab830c -- ceph daemon mds.l perf dump

...

2022-06-19T09:21:08.177 INFO:tasks.check_counter:Daemon mds.l mds.exported=1
2022-06-19T09:21:08.178 INFO:tasks.check_counter:Daemon mds.l mds.imported=1
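For reference, the failing check can be approximated like this. The `unset_counters` helper below is hypothetical (it is not the actual check_counter task code) and assumes the usual `perf dump` JSON layout, where counters live under a section such as "mds":

```python
import json

def unset_counters(perf_dump_json, expected=("mds.exported", "mds.imported")):
    """Report which of the expected counters were never incremented on a
    daemon, given the JSON text of `ceph daemon mds.<id> perf dump`."""
    perf = json.loads(perf_dump_json)
    unset = set()
    for name in expected:
        # "mds.exported" -> section "mds", key "exported"
        section, key = name.split(".", 1)
        if perf.get(section, {}).get(key, 0) == 0:
            unset.add(name)
    return unset

# A daemon that imported a subtree but never exported one:
sample = json.dumps({"mds": {"exported": 0, "imported": 1}})
print(unset_counters(sample))   # -> {'mds.exported'}
```

In the failed runs, every daemon would report both counters at zero, producing exactly the "failed to be set" set from the issue title.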

In the mds.i log,

2022-06-19T09:10:29.178+0000 7fcae3608700  7 mds.1.mig decode_import_dir  done [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{0=1} v=13 cv=0/0 dir_auth=1,0 state=1610614787|complete|frozentree|importing f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x5615dcf35680]
2022-06-19T09:10:29.178+0000 7fcae3608700 10 mds.1.mig handle_export_dir  0 imported bounds
2022-06-19T09:10:29.178+0000 7fcae3608700  7 mds.1.mig handle_export_dir did [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{0=1} v=13 cv=0/0 dir_auth=1,0 state=1610614787|complete|frozentree|importing f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x5615dcf35680]
2022-06-19T09:10:29.178+0000 7fcadcdfb700  5 mds.1.log _submit_thread 4195880~8871 : EImportStart 0x10000000202 from mds.0 [metablob 0x1, 9 dirs]

...

2022-06-19T09:20:25.529+0000 7fcae3608700  7 mds.1.mig export_go [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{0=2} v=18 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=0 request=0 child=0 frozen=1 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 0x5615dcf35680] to 0
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSContext::complete: 12C_IO_Wrapper
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSContext::complete: 12C_M_ExportGo
2022-06-19T09:20:25.529+0000 7fcadd5fc700  7 mds.1.mig export_go_synced [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{0=2} v=18 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=1 request=0 child=0 frozen=1 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 0x5615dcf35680] to 0
2022-06-19T09:20:25.529+0000 7fcadd5fc700 15 mds.1.cache show_subtrees

In the mds.l log,

2022-06-19T09:10:29.177+0000 7f6d76d88700  7 mds.0.mig export_go [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{1=1} v=13 cv=0/0 dir_auth=0,0 state=1610875907|complete|frozentree|exporting f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 replicated=1 dirty=1 waiter=0 authpin=0 0x558940f95200] to 1
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSContext::complete: 12C_IO_Wrapper
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSContext::complete: 12C_M_ExportGo
2022-06-19T09:10:29.177+0000 7f6d70d7c700  7 mds.0.mig export_go_synced [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{1=1} v=13 cv=0/0 dir_auth=0,0 state=1610875907|complete|frozentree|exporting f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 replicated=1 dirty=1 waiter=0 authpin=0 0x558940f95200] to 1
2022-06-19T09:10:29.177+0000 7f6d70d7c700 15 mds.0.cache show_subtrees

...

2022-06-19T09:20:25.530+0000 7f6d76d88700 10 mds.0.mig handle_export_dir  0 imported bounds
2022-06-19T09:20:25.530+0000 7f6d76d88700  7 mds.0.mig handle_export_dir did [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{1=1} v=18 cv=0/0 dir_auth=0,1 state=1073743875|complete|frozentree|importing f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=1 request=0 child=0 frozen=1 subtree=1 importing=1 replicated=1 dirty=0 waiter=0 authpin=0 tempexporting=0 0x558940f95200]
2022-06-19T09:20:25.530+0000 7f6d7057b700  5 mds.0.log _submit_thread 4924336~520 : EImportStart 0x10000000202 from mds.1 [metablob 0x10000000202, 1 dirs]

Looking at one of the tests that failed:
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879552/
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-bitmap omap_limit/10 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/no standby-replay subvolume/{with-namespace-isolated-and-quota} tasks/{0-check-counter workunit/suites/iogen}}

As in the tests that passed, the iogen workload ran on client.0 for close to 10 minutes, and the mds.exported and mds.imported counters were then checked on all 12 MDS daemons. This time, however, the active daemons (mds.l, mds.i and mds.h) did not export or import any subtrees.

Need to dig further into why the same iogen workload sometimes triggers subtree migration and sometimes doesn't. Maybe identify CephFS settings that lower the threshold at which subtree migration is initiated?
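For the last point, the MDS balancer options would be the knobs to look at. A sketch of a conf fragment that should make migrations more likely (option names from the Ceph MDS config reference; the values here are illustrative and have not been tested against this workload):

```ini
[mds]
# Run the balancer more often (default mds_bal_interval is 10 seconds)
mds_bal_interval = 5
# Migrate on smaller load imbalances (default mds_bal_min_rebalance is 0.1)
mds_bal_min_rebalance = 0.01
```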

#5 Updated by Ramana Raja 5 months ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46861

#6 Updated by Rishabh Dave 5 months ago

  • Status changed from Fix Under Review to Pending Backport

#7 Updated by Backport Bot 5 months ago

  • Copied to Backport #56590: quincy: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" added

#8 Updated by Backport Bot 5 months ago

  • Copied to Backport #56591: pacific: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" added

#9 Updated by Backport Bot 4 months ago

  • Tags set to backport_processed
