Bug #12869
override debug settings are not being applied
Status: Closed
Description
commit aa84941cf9fb1c52c8992da21be157e70fe99b98
Author: John Spray <john.spray@redhat.com>
Date:   Thu Aug 13 19:08:16 2015 +0100

    tasks/kcephfs: enable MDS debug

    To help us debug #11482

    Signed-off-by: John Spray <john.spray@redhat.com>

diff --git a/suites/kcephfs/cephfs/conf.yaml b/suites/kcephfs/cephfs/conf.yaml
index 30da870..b3ef404 100644
--- a/suites/kcephfs/cephfs/conf.yaml
+++ b/suites/kcephfs/cephfs/conf.yaml
@@ -3,3 +3,5 @@ overrides:
     conf:
       global:
         ms die on skipped message: false
+      mds:
+        debug mds: 20
applied in
commit 641169f2542d8fa23c1452b53288fe732be74503
Merge: 48a8b23 aa84941
Author: Yan, Zheng <ukernel@gmail.com>
Date:   Tue Aug 18 17:38:34 2015 +0800

    Merge pull request #531 from ceph/wip-mds-debug

    tasks/kcephfs: enable MDS debug
is still not showing up in the kcephfs suite runs. I believe these should be auto-updating, so I'm not sure how it could be failing. For instance, http://pulpito.ceph.com/teuthology-2015-08-24_23:08:02-kcephfs-master-testing-basic-multi/1030719/ does not have any mention of "debug mds", and that's a master run. I've checked the master branch and the yaml fragment is good:
gregf@rex004:~/src/ceph-qa-suite [master]$ cat suites/kcephfs/cephfs/conf.yaml
overrides:
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
The only thing I can think of is that there are override stanzas in other conf.yaml files which are also being included in the run (from different fragment directories).
Updated by Zack Cerza over 8 years ago
teuthology-suite --dry-run -v -s kcephfs:cephfs -l 1
[...]
2015-08-31 11:28:03,236.236 INFO:teuthology.suite:dry-run: /Users/zack/inkdev/teuthology/virtualenv/bin/teuthology-schedule --name zack-2015-08-31_11:27:58-kcephfs:cephfs-master---basic-magna --num 1 --worker magna --priority 1000 -v --description 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}' -- /var/folders/lh/723f8c417xz2n8dfzjqnmk3c0000gn/T/schedule_suite_bco4er /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/clusters/fixed-3-cephfs.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/inline/no.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/tasks/kclient_workunit_direct_io.yaml
conf.yaml is indeed being ignored because of fs/btrfs.yaml.
When multiple fragments are specified, they are concatenated and then parsed by PyYAML (as opposed to being parsed first, then deep-merged). Changing this would probably break lots of things :)
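For illustration, here is a minimal sketch (not teuthology's actual code; the fragment contents are trimmed versions of the ones quoted in this ticket) of why concatenate-then-parse can silently drop an override stanza: PyYAML's default loader keeps only the last value when a top-level mapping key repeats, so the earlier overrides block, and with it "debug mds: 20", simply disappears.

import yaml

conf_yaml = """
overrides:
  ceph:
    conf:
      mds:
        debug mds: 20
"""

btrfs_yaml = """
overrides:
  ceph:
    fs: btrfs
    conf:
      osd:
        osd sloppy crc: true
"""

# Concatenate the raw fragment text, then parse the result once.
# PyYAML does not complain about the duplicate "overrides" key; it just
# keeps the last occurrence, so the mds debug override is lost.
combined = yaml.safe_load(conf_yaml + btrfs_yaml)
print(combined["overrides"]["ceph"]["conf"])
# {'osd': {'osd sloppy crc': True}}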
Updated by Zack Cerza over 8 years ago
- Status changed from New to Won't Fix
So, it would be best to merge your overrides by hand, I think.
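Concretely, merging by hand for this suite would mean collapsing the colliding stanzas into a single fragment. Combining the conf.yaml and fs/btrfs.yaml contents quoted in this ticket would look roughly like this (illustrative only; where such a merged fragment should live is a separate question):

overrides:
  ceph:
    fs: btrfs
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
      osd:
        osd sloppy crc: true
        osd op thread timeout: 60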
Updated by Greg Farnum over 8 years ago
- Status changed from Won't Fix to 12
- Priority changed from Normal to Urgent
Copying my argument from irc:
[18:49:59] <gregsfortytwo> if you can't have two overrides the whole fragment system breaks down
[18:50:40] <gregsfortytwo> we need to override the ceph.conf default values for all kinds of stuff
[18:50:59] <vasu> but it can be still done with one overrides, this would be less confusing?
[18:51:11] <gregsfortytwo> having *conflicting* overrides would be user error, and the system could barf on that and it would be fine/good
[18:51:24] <gregsfortytwo> but right now it's apparently silently discarding one of the override values
[18:51:32] <gregsfortytwo> vasu: no, it can't be done with one override
[18:51:41] <vasu> its a static file?
[18:52:04] <gregsfortytwo> I can't say "this suite needs a different value for foo than the ceph task normally does, and this one workload also needs a different value for bar"
[18:52:10] <gregsfortytwo> in only one file
[18:52:44] <gregsfortytwo> I could push the "foo" override into all the workload folder yamls and that would work
[18:53:07] <gregsfortytwo> but then if I also need to specify a different value for eg the btrfs-backed OSDs, it breaks down and I'd need to put them in different suites or something
And I've also discovered that most of the suites include a msgr-failures folder that sets override values, and then have overrides in the workloads folder. If these two override stanzas aren't merging we have a serious problem and it needs to be fixed in teuthology to do a proper merge. (I'm not sure if that's actually the case or if this is insufficiently diagnosed.)
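For reference, a proper merge would mean parsing each fragment separately and deep-merging the resulting dicts, so that sibling keys from different overrides stanzas survive and only genuinely conflicting scalars need special handling. A rough sketch in Python, assuming nothing about teuthology's actual internals (the helper names and example paths are illustrative):

import yaml

def deep_merge(a, b):
    """Return a copy of dict a with dict b merged in.

    Nested dicts are merged recursively; on a scalar conflict the later
    fragment (b) wins.  This is the point where conflicting overrides could
    instead be treated as user error and rejected outright.
    """
    out = dict(a)
    for key, value in b.items():
        if isinstance(out.get(key), dict) and isinstance(value, dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

def merge_fragments(paths):
    """Parse each YAML fragment and fold it into one job config dict."""
    config = {}
    for path in paths:
        with open(path) as f:
            config = deep_merge(config, yaml.safe_load(f) or {})
    return config

# e.g. merge_fragments(["conf.yaml", "msgr-failures/many.yaml", "tasks/some_workload.yaml"])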
Updated by Zack Cerza over 8 years ago
Upon closer inspection with a hacked teuthology-suite:
2015-08-31 12:28:11,913.913 INFO:teuthology.suite:Scheduling kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}

['/Users/zack/inkdev/teuthology/virtualenv/bin/teuthology-schedule', '--name', 'zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna', '--num', '1', '--worker', 'magna', '--dry-run', '--priority', '1000', '-v', '--description', 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}', '--', '/var/folders/lh/723f8c417xz2n8dfzjqnmk3c0000gn/T/schedule_suite_NerDkv', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/clusters/fixed-3-cephfs.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/inline/no.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/tasks/kclient_workunit_direct_io.yaml']

{'branch': 'master', 'description': 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}', 'email': 'zack@redhat.com', 'last_in_suite': False, 'log-rotate': {'ceph-mds': '10G', 'ceph-osd': '10G'}, 'machine_type': 'magna', 'name': 'zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna', 'nuke-on-error': True, 'overrides': {'admin_socket': {'branch': 'master'}, 'ceph': {'conf': {'global': {'ms die on skipped message': False}, 'mds': {'debug mds': 20}, 'mon': {'debug mon': 20, 'debug ms': 1, 'debug paxos': 20}, 'osd': {'debug filestore': 20, 'debug journal': 20, 'debug ms': 1, 'debug osd': 20, 'osd op thread timeout': 60, 'osd sloppy crc': True}}, 'fs': 'btrfs', 'log-whitelist': ['slow request'], 'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}, 'ceph-deploy': {'branch': {'dev-commit': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}, 'conf': {'client': {'log file': '/var/log/ceph/ceph-$name.$pid.log'}, 'mon': {'debug mon': 1, 'debug ms': 20, 'debug paxos': 20, 'osd default pool size': 2}}}, 'install': {'ceph': {'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}}, 'workunit': {'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}}, 'owner': 'scheduled_zack@zwork.local', 'priority': 1000, 'roles': [['mon.a', 'mds.a', 'osd.0', 'osd.1'], ['mon.b', 'mds.a-s', 'mon.c', 'osd.2', 'osd.3'], ['client.0']], 'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2', 'suite': 'kcephfs:cephfs', 'suite_branch': 'master', 'tasks': [{'ansible.cephlab': None}, {'clock.check': None}, {'install': None}, {'ceph': None}, {'kclient': None}, {'workunit': {'clients': {'all': ['direct_io']}}}], 'teuthology_branch': 'master', 'tube': 'magna', 'verbose': True}

2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs scheduled 1 jobs.
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs -- 23 jobs were filtered out.
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs scheduled 0 jobs with missing packages.
2015-08-31 12:28:12,502.502 INFO:teuthology.suite:Test results viewable at http://pulpito.ceph.com/zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna/
(virtualenv)12:28:12 zack@zwork teuthology master ?
cat /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml
overrides:
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
(virtualenv)12:29:05 zack@zwork teuthology master ? cat /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml
overrides:
  ceph:
    fs: btrfs
    conf:
      osd:
        osd sloppy crc: true
        osd op thread timeout: 60
This actually appears to be working as you want it to. I just scheduled a job with a stock teuthology-suite:

teuthology-suite -v -s kcephfs:cephfs -l 1
Here it is in pulpito:
http://pulpito.ceph.com/zack-2015-08-31_11:33:52-kcephfs:cephfs-master---basic-multi/1039781/
Here is its overrides stanza:
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
  ceph-deploy:
    branch:
      dev-commit: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
  workunit:
    sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
This of course does not explain why the correct bits didn't get included on the particular job which inspired this ticket. I would be curious to see the teuthology-suite line which scheduled that job.
Updated by Zack Cerza over 8 years ago
I just noticed that this was a scheduled rados run, and those are scheduled using teuthology-suite --subset. At this point this really seems like a bug in --subset.
Updated by Zack Cerza over 8 years ago
I just filed a teuthology PR (https://github.com/ceph/teuthology/pull/607) to give teuthology-suite a -vv (double-verbose) mode, which causes teuthology-schedule --dry-run to be run for each generated job in the suite.

Here is its output for teuthology-suite --dry-run -vv -s kcephfs:cephfs:

http://fpaste.org/261703/14410566/
Updated by Zack Cerza over 8 years ago
- Status changed from 12 to Need More Info
- Assignee set to Greg Farnum
Greg, the feature I mentioned in the previous comment is merged into master. Please see if you can reproduce the problem using that.
Updated by Greg Farnum over 8 years ago
- Assignee changed from Greg Farnum to Zack Cerza
Well, I tried to on three different machines.
rex004 hangs when I execute it, I think because it doesn't have a VPN connection to sepia and so can't reach the lock server.
Sepia's teuthology box seems to have a version of teuthology from June, and when I attempted to update it, everything barfed because it wanted libpython-dev (which doesn't seem to exist for its Ubuntu release, although python-dev is installed).
magna002 failed on some stupid thing but started working when I updated, deleted the virtualenv, and recreated it. I get much the same output as you do. I also tried checking out the version of teuthology that sepia's machine seems to be running and scheduling a single job, and things looked fine. (This was my only idea as to what might be broken.)
So at this point all I can say is yes, the things you are showing me here seem to indicate that it's working. But nonetheless, when running scheduled jobs we are not getting "debug mds: 20" included in the configs. We need it to do so. Perhaps the sepia teuthology box is not actually grabbing the newest ceph-qa-suite checkouts to schedule against? But if I look at /home/teuthology/src/ceph-qa-suite_master it's currently on
commit 8331556b9b8f947319433b6a0bb234088ba073c0
Merge: 5df0ceb e8d4cf1
Author: David Zafman <dzafman@redhat.com>
Date:   Tue Sep 1 12:27:12 2015 -0700
which looks to be the newest one. Despite that, looking at http://pulpito.ceph.com/teuthology-2015-08-31_23:08:01-kcephfs-master-testing-basic-multi/ (the newest one in pulpito), we still aren't getting the "debug mds" line included. I would wonder if I had done something truly stupid, like putting it in the wrong branch or suite or something, if all our other attempts to reproduce it elsewhere weren't behaving as expected...
Updated by Greg Farnum over 8 years ago
- Status changed from Need More Info to Rejected
Ugh. Okay, the last run that didn't have the debug mds settings I was interested in... didn't have that yaml fragment included, by virtue of being in a different subsuite. I should have spotted this earlier, but I was primed for it to be broken because it also took a long time between our merging the yaml fragment and its showing up in a suite result; I think that's just the unfortunately long queue times we've got going on right now.