Bug #54108

qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}"

Added by Patrick Donnelly about 2 years ago. Updated about 1 year ago.

Status: Pending Backport
Priority: Low
Assignee:
Category: -
Target version:
% Done: 0%
Source: Q/A
Tags: backport_processed
Backport: quincy, pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): qa-suite
Labels (FS): qa, qa-failure
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Related issues (2): 1 open, 1 closed

Copied to CephFS - Backport #56590: quincy: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" (New; assignee: Ramana Raja)
Copied to CephFS - Backport #56591: pacific: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" (Rejected; assignee: Ramana Raja)
Actions #1

Updated by Venky Shankar about 2 years ago

  • Assignee set to Ramana Raja
  • Priority changed from High to Low
Actions #2

Updated by Venky Shankar about 2 years ago

Ramana, this is showing up a bit in master. Please take a look.

Actions #4

Updated by Ramana Raja almost 2 years ago

The mds.imported and mds.exported counters were not incremented during the iogen workload. I'm not sure whether the exports were not generated or just not detected.

Looking at the code:
l_mds_exported is incremented in Migrator::export_go_synced() ← Migrator::handle_export_ack() ← Migrator::dispatch(), case MSG_MDS_EXPORTDIRACK
l_mds_imported is incremented in Migrator::handle_export_dir() ← Migrator::dispatch(), case MSG_MDS_EXPORTDIR

In the latest teuthology runs on the "main" branch, I see the same test both pass and fail:

Passed:
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886671/
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886683/

Failed:
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879552/
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879616/

Looking at the passed test,
https://pulpito.ceph.com/vshankar-2022-06-19_08:22:46-fs-wip-vshankar-testing1-20220619-102531-testing-default-smithi/6886671/
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-bitmap omap_limit/10 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/no standby-replay subvolume/{with-namespace-isolated-and-quota} tasks/{0-check-counter workunit/suites/iogen}}

In the teuthology log,

The iogen workload starts:

2022-06-19T09:10:17.235 INFO:tasks.workunit:Running workunits matching suites/iogen.sh on client.0...
2022-06-19T09:10:17.236 INFO:tasks.workunit:Running workunit suites/iogen.sh...
2022-06-19T09:10:17.237 DEBUG:teuthology.orchestra.run.smithi060:workunit test suites/iogen.sh> mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=4322afc801c50aea640f0a7d6c01cfe9e520b178 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 CEPH_MNT=/home/ubuntu/cephtest/mnt.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/suites/iogen.sh
2022-06-19T09:10:17.325 INFO:tasks.workunit.client.0.smithi060.stdout:getting iogen

About 10 minutes later, the iogen workload is stopped:

2022-06-19T09:20:23.910 INFO:tasks.workunit:Stopping ['suites/iogen.sh'] on client.0...

Then, the mds.exported and mds.imported counters are checked in each of the 12 mds daemons. The counters are found to be incremented in the active daemons mds.i and mds.l:

2022-06-19T09:21:05.207 INFO:tasks.check_counter:Daemon mds.i mds.exported=1
2022-06-19T09:21:05.207 INFO:tasks.check_counter:Daemon mds.i mds.imported=1

2022-06-19T09:21:05.207 DEBUG:tasks.check_counter:Getting stats from l
2022-06-19T09:21:05.208 DEBUG:teuthology.orchestra.run.smithi133:> sudo /home/ubuntu/cephtest/cephadm --image quay.ceph.io/ceph-ci/ceph:4322afc801c50aea640f0a7d6c01cfe9e520b178 shell --fsid bf08e008-efad-11ec-8427-001a4aab830c -- ceph daemon mds.l perf dump

...

2022-06-19T09:21:08.177 INFO:tasks.check_counter:Daemon mds.l mds.exported=1
2022-06-19T09:21:08.178 INFO:tasks.check_counter:Daemon mds.l mds.imported=1
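
For context, this is roughly what the 0-check-counter step does: dump each MDS daemon's perf counters over its admin socket and require the listed counters to be non-zero on at least one daemon. The sketch below is not the qa task's actual code; the JSON layout (an "mds" section with "exported"/"imported" keys) and the direct admin-socket call are assumptions based on the log lines above (in a cephadm deployment the command is wrapped in "cephadm shell", as shown).

import json
import subprocess

def mds_perf_counter(daemon_id, section, name):
    # Dump perf counters over the MDS admin socket; in a cephadm deployment
    # this would be wrapped in "cephadm shell ... --", as in the log above.
    out = subprocess.check_output(
        ["sudo", "ceph", "daemon", f"mds.{daemon_id}", "perf", "dump"])
    perf = json.loads(out)
    # "mds.exported" in the check_counter output is assumed to map to
    # perf["mds"]["exported"] in the dumped JSON.
    return perf.get(section, {}).get(name, 0)

# Hypothetical check: each expected counter must be non-zero on at least one MDS.
daemons = ["i", "l", "h"]  # the active ranks in the runs discussed here
ok = all(
    any(mds_perf_counter(d, "mds", c) > 0 for d in daemons)
    for c in ("exported", "imported")
)
print("counters set on at least one mds:", ok)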

In the mds.i log,

2022-06-19T09:10:29.178+0000 7fcae3608700  7 mds.1.mig decode_import_dir  done [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{0=1} v=13 cv=0/0 dir_auth=1,0 state=1610614787|complete|frozentree|importing f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x5615dcf35680]
2022-06-19T09:10:29.178+0000 7fcae3608700 10 mds.1.mig handle_export_dir  0 imported bounds
2022-06-19T09:10:29.178+0000 7fcae3608700  7 mds.1.mig handle_export_dir did [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{0=1} v=13 cv=0/0 dir_auth=1,0 state=1610614787|complete|frozentree|importing f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 child=1 frozen=1 subtree=1 importing=1 replicated=1 dirty=1 0x5615dcf35680]
2022-06-19T09:10:29.178+0000 7fcadcdfb700  5 mds.1.log _submit_thread 4195880~8871 : EImportStart 0x10000000202 from mds.0 [metablob 0x1, 9 dirs]

...

2022-06-19T09:20:25.529+0000 7fcae3608700  7 mds.1.mig export_go [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{0=2} v=18 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=0 request=0 child=0 frozen=1 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 0x5615dcf35680] to 0
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSContext::complete: 12C_IO_Wrapper
2022-06-19T09:20:25.529+0000 7fcadd5fc700 10 MDSContext::complete: 12C_M_ExportGo
2022-06-19T09:20:25.529+0000 7fcadd5fc700  7 mds.1.mig export_go_synced [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{0=2} v=18 cv=0/0 dir_auth=1,1 state=1073742851|complete|frozentree|exporting f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=1 request=0 child=0 frozen=1 subtree=1 importing=0 replicated=1 dirty=0 waiter=0 authpin=0 0x5615dcf35680] to 0
2022-06-19T09:20:25.529+0000 7fcadd5fc700 15 mds.1.cache show_subtrees

In the mds.l log,

2022-06-19T09:10:29.177+0000 7f6d76d88700  7 mds.0.mig export_go [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{1=1} v=13 cv=0/0 dir_auth=0,0 state=1610875907|complete|frozentree|exporting f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 replicated=1 dirty=1 waiter=0 authpin=0 0x558940f95200] to 1
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSIOContextBase::complete: 12C_IO_Wrapper
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSContext::complete: 12C_IO_Wrapper
2022-06-19T09:10:29.177+0000 7f6d70d7c700 10 MDSContext::complete: 12C_M_ExportGo
2022-06-19T09:10:29.177+0000 7f6d70d7c700  7 mds.0.mig export_go_synced [dir 0x10000000202 /volumes/_nogroup/sv_0/d458e51a-f962-4bd1-bcdd-9898e57977dd/client.0/tmp/iogen_3.1p0/src/ [2,head] auth{1=1} v=13 cv=0/0 dir_auth=0,0 state=1610875907|complete|frozentree|exporting f(v0 m2022-06-19T09:10:17.450796+0000 2=2+0) n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=2+0,ss=0+0 dirty=2 | ptrwaiter=1 request=0 child=1 frozen=1 subtree=1 replicated=1 dirty=1 waiter=0 authpin=0 0x558940f95200] to 1
2022-06-19T09:10:29.177+0000 7f6d70d7c700 15 mds.0.cache show_subtrees

...

2022-06-19T09:20:25.530+0000 7f6d76d88700 10 mds.0.mig handle_export_dir  0 imported bounds
2022-06-19T09:20:25.530+0000 7f6d76d88700  7 mds.0.mig handle_export_dir did [dir 0x10000000202 ~mds0/stray1/10000000202/ [2,head] auth{1=1} v=18 cv=0/0 dir_auth=0,1 state=1073743875|complete|frozentree|importing f(v1 m2022-06-19T09:20:20.897774+0000) n(v0 rc2022-06-19T09:20:20.897774+0000)/n(v0 rc2022-06-19T09:10:17.454201+0000 b23855 2=2+0) hs=0+0,ss=0+0 | ptrwaiter=1 request=0 child=0 frozen=1 subtree=1 importing=1 replicated=1 dirty=0 waiter=0 authpin=0 tempexporting=0 0x558940f95200]
2022-06-19T09:20:25.530+0000 7f6d7057b700  5 mds.0.log _submit_thread 4924336~520 : EImportStart 0x10000000202 from mds.1 [metablob 0x10000000202, 1 dirs]
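
As a side note, a quick way to confirm from the MDS debug logs whether any migration completed is to scan for the markers quoted above ("export_go_synced" on the exporting rank, "handle_export_dir did" on the importing rank). A rough sketch; the log path and exact message text are assumptions taken from the excerpts in this comment.

import re
import sys

# Patterns based on the mds.i / mds.l excerpts above: "export_go_synced" marks a
# completed export on the source rank, "handle_export_dir did" an import on the
# target rank.
PATTERNS = {
    "exports": re.compile(r"mds\.\d+\.mig export_go_synced"),
    "imports": re.compile(r"mds\.\d+\.mig handle_export_dir did"),
}

def count_migrations(log_path):
    counts = dict.fromkeys(PATTERNS, 0)
    with open(log_path) as f:
        for line in f:
            for kind, pat in PATTERNS.items():
                if pat.search(line):
                    counts[kind] += 1
    return counts

if __name__ == "__main__":
    # usage: python scan_migrations.py <path-to-mds-log>
    print(count_migrations(sys.argv[1]))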

Looking at the failed test,
https://pulpito.ceph.com/vshankar-2022-06-15_04:03:39-fs-wip-vshankar-testing1-20220615-072516-testing-default-smithi/6879552/
Description: fs/workload/{0-rhel_8 begin/{0-install 1-cephadm 2-logrotate} clusters/1a11s-mds-1c-client-3node conf/{client mds mon osd} mount/fuse objectstore-ec/bluestore-bitmap omap_limit/10 overrides/{frag ignorelist_health ignorelist_wrongly_marked_down osd-asserts session_timeout} ranks/3 scrub/no standby-replay subvolume/{with-namespace-isolated-and-quota} tasks/{0-check-counter workunit/suites/iogen}}

As in the tests that passed, the iogen workload ran in client.0 for close to 10 minutes. The mds.exported and mds.imported counters were then checked in all 12 mds daemons. The active daemons (mds.l, mds.i and mds.h) did not export or import subtrees.

Need to dig further into why the same iogen workload sometimes triggers subtree migration and sometimes doesn't. Maybe identify CephFS settings that lower the threshold for subtree migration to be initiated?
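
One way to make the check less racy would be to lower the MDS balancer thresholds so that a 10-minute window is very likely to see at least one migration. A minimal sketch, assuming the balancer options named below exist with these semantics in the release under test (names and values are illustrative and should be double-checked):

import subprocess

# Illustrative overrides only; the option names are assumptions about the MDS
# balancer configuration and must be verified before use.
OVERRIDES = {
    "mds_bal_interval": "5",         # run the balancer more frequently (seconds)
    "mds_bal_min_rebalance": "0.01", # migrate even on small load imbalance
}

for opt, val in OVERRIDES.items():
    subprocess.check_call(["ceph", "config", "set", "mds", opt, val])

Alternatively, the workload could pin directories explicitly (e.g., via ceph.dir.pin) so that exports happen regardless of load.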

Actions #5

Updated by Ramana Raja almost 2 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 46861
Actions #6

Updated by Rishabh Dave almost 2 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #7

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #56590: quincy: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" added
Actions #8

Updated by Backport Bot almost 2 years ago

  • Copied to Backport #56591: pacific: qa: iogen workunit: "The following counters failed to be set on mds daemons: {'mds.exported', 'mds.imported'}" added
Actions #9

Updated by Backport Bot over 1 year ago

  • Tags set to backport_processed
Actions #10

Updated by Rishabh Dave about 1 year ago

Venky,

This failure occurred in a Pacific backport run - https://pulpito.ceph.com/yuriw-2023-04-04_15:06:57-fs-wip-yuri8-testing-2023-03-31-0812-pacific-distro-default-smithi/7230918. Should the patch for this ticket be backported?
