Bug #12869

override debug settings are not being applied

Added by Greg Farnum over 8 years ago. Updated over 8 years ago.

Status: Rejected
Priority: Urgent
Assignee:
Category: Core Tasks
% Done: 0%
Source: Q/A
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

commit aa84941cf9fb1c52c8992da21be157e70fe99b98
Author: John Spray <john.spray@redhat.com>
Date:   Thu Aug 13 19:08:16 2015 +0100

    tasks/kcephfs: enable MDS debug

    To help us debug #11482

    Signed-off-by: John Spray <john.spray@redhat.com>

diff --git a/suites/kcephfs/cephfs/conf.yaml b/suites/kcephfs/cephfs/conf.yaml
index 30da870..b3ef404 100644
--- a/suites/kcephfs/cephfs/conf.yaml
+++ b/suites/kcephfs/cephfs/conf.yaml
@@ -3,3 +3,5 @@ overrides:
     conf:
       global:
         ms die on skipped message: false
+      mds:
+        debug mds: 20

applied in

commit 641169f2542d8fa23c1452b53288fe732be74503
Merge: 48a8b23 aa84941
Author: Yan, Zheng <ukernel@gmail.com>
Date:   Tue Aug 18 17:38:34 2015 +0800

    Merge pull request #531 from ceph/wip-mds-debug

    tasks/kcephfs: enable MDS debug

This change is still not showing up in the kcephfs suite runs. I believe these should be auto-updating, so I'm not sure how it could be failing. For instance, http://pulpito.ceph.com/teuthology-2015-08-24_23:08:02-kcephfs-master-testing-basic-multi/1030719/ does not have any mention of "debug mds". That's a master run. I've checked the master branch and the yaml fragment is good:

gregf@rex004:~/src/ceph-qa-suite [master]$ cat suites/kcephfs/cephfs/conf.yaml 
overrides:
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20

The only thing I can think of is that there are override stanzas in other conf.yaml files which are also being included in the run (from different fragment directories).

History

#1 Updated by Zack Cerza over 8 years ago

teuthology-suite --dry-run -v -s kcephfs:cephfs -l 1
[...]
2015-08-31 11:28:03,236.236 INFO:teuthology.suite:dry-run: /Users/zack/inkdev/teuthology/virtualenv/bin/teuthology-schedule --name zack-2015-08-31_11:27:58-kcephfs:cephfs-master---basic-magna --num 1 --worker magna --priority 1000 -v --description 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}' -- /var/folders/lh/723f8c417xz2n8dfzjqnmk3c0000gn/T/schedule_suite_bco4er /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/clusters/fixed-3-cephfs.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/inline/no.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/tasks/kclient_workunit_direct_io.yaml

conf.yaml is indeed being ignored because of fs/btrfs.yaml.

When multiple fragments are specified, they are concatenated and then parsed by PyYAML (as opposed to being parsed first, then deep-merged). Changing this would probably break lots of things :)
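
As a rough illustration of why concatenation can silently drop settings (a minimal sketch calling PyYAML directly, not teuthology's actual code path): if two fragments each contain a top-level overrides mapping, the concatenated document ends up with a duplicate key, and PyYAML's default loaders keep only the last occurrence.

import yaml  # PyYAML; minimal sketch, not teuthology's actual merge code

conf_yaml = """
overrides:
  ceph:
    conf:
      mds:
        debug mds: 20
"""

btrfs_yaml = """
overrides:
  ceph:
    fs: btrfs
"""

# Concatenate-then-parse: the second top-level 'overrides' mapping is a
# duplicate key, and PyYAML silently keeps only the last one, so the
# 'debug mds: 20' setting from the first fragment is lost.
print(yaml.safe_load(conf_yaml + btrfs_yaml))
# -> {'overrides': {'ceph': {'fs': 'btrfs'}}}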

#2 Updated by Zack Cerza over 8 years ago

  • Status changed from New to Won't Fix

So, it would be best to merge your overrides by hand, I think.

#3 Updated by Greg Farnum over 8 years ago

  • Status changed from Won't Fix to 12
  • Priority changed from Normal to Urgent

Copying my argument from IRC:

[18:49:59]  <gregsfortytwo>    if you can't have two overrides the whole fragment system breaks down
[18:50:40]  <gregsfortytwo>    we need to override the ceph.conf default values for all kinds of stuff
[18:50:59]  <vasu>    but it can be still done with one overrides, this would be less confusing?
[18:51:11]  <gregsfortytwo>    having *conflicting* overrides would be user error, and the system could barf on that and it would be fine/good
[18:51:24]  <gregsfortytwo>    but right now it's apparently silently discarding one of the override values
[18:51:32]  <gregsfortytwo>    vasu: no, it can't be done with one override
[18:51:41]  <vasu>    its a static file?
[18:52:04]  <gregsfortytwo>    I can't say "this suite needs a different value for foo than the ceph task normally does, and this one workload also needs a different value for bar" 
[18:52:10]  <gregsfortytwo>    in only one file
[18:52:44]  <gregsfortytwo>    I could push the "foo" override into all the workload folder yamls and that would work
[18:53:07]  <gregsfortytwo>    but then if I also need to specify a different value for eg the btrfs-backed OSDs, it breaks down and I'd need to put them in different suites or something

And I've also discovered that most of the suites include a msgr-failures folder that sets override values, and then have further overrides in the workloads folder. If these two override stanzas aren't merging, we have a serious problem, and teuthology needs to be fixed to do a proper merge. (I'm not sure if that's actually the case or if this is insufficiently diagnosed.)
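
For reference, the merge semantics being asked for here is an ordinary recursive dictionary merge of the parsed override stanzas. Here is a minimal sketch of that behavior; the function name and exact conflict rule are illustrative assumptions, not teuthology's actual implementation.

# Illustrative deep merge of parsed override stanzas; a sketch of the
# desired semantics, not teuthology's actual code.
def deep_merge(base, update):
    """Recursively merge 'update' into 'base'; on conflicts, 'update' wins."""
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base

conf_overrides = {'ceph': {'conf': {'mds': {'debug mds': 20}}}}
btrfs_overrides = {'ceph': {'fs': 'btrfs',
                            'conf': {'osd': {'osd sloppy crc': True}}}}

print(deep_merge(conf_overrides, btrfs_overrides))
# -> {'ceph': {'conf': {'mds': {'debug mds': 20},
#                       'osd': {'osd sloppy crc': True}},
#              'fs': 'btrfs'}}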

#4 Updated by Zack Cerza over 8 years ago

Upon closer inspection with a hacked teuthology-suite:

2015-08-31 12:28:11,913.913 INFO:teuthology.suite:Scheduling kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}
['/Users/zack/inkdev/teuthology/virtualenv/bin/teuthology-schedule', '--name', 'zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna', '--num', '1', '--worker', 'magna', '--dry-run', '--priority', '1000', '-v', '--description', 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}', '--', '/var/folders/lh/723f8c417xz2n8dfzjqnmk3c0000gn/T/schedule_suite_NerDkv', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/clusters/fixed-3-cephfs.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/inline/no.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/tasks/kclient_workunit_direct_io.yaml']
{'branch': 'master',
 'description': 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}',
 'email': 'zack@redhat.com',
 'last_in_suite': False,
 'log-rotate': {'ceph-mds': '10G', 'ceph-osd': '10G'},
 'machine_type': 'magna',
 'name': 'zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna',
 'nuke-on-error': True,
 'overrides': {'admin_socket': {'branch': 'master'},
               'ceph': {'conf': {'global': {'ms die on skipped message': False},
                                 'mds': {'debug mds': 20},
                                 'mon': {'debug mon': 20,
                                         'debug ms': 1,
                                         'debug paxos': 20},
                                 'osd': {'debug filestore': 20,
                                         'debug journal': 20,
                                         'debug ms': 1,
                                         'debug osd': 20,
                                         'osd op thread timeout': 60,
                                         'osd sloppy crc': True}},
                        'fs': 'btrfs',
                        'log-whitelist': ['slow request'],
                        'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'},
               'ceph-deploy': {'branch': {'dev-commit': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'},
                               'conf': {'client': {'log file': '/var/log/ceph/ceph-$name.$pid.log'},
                                        'mon': {'debug mon': 1,
                                                'debug ms': 20,
                                                'debug paxos': 20,
                                                'osd default pool size': 2}}},
               'install': {'ceph': {'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}},
               'workunit': {'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}},
 'owner': 'scheduled_zack@zwork.local',
 'priority': 1000,
 'roles': [['mon.a', 'mds.a', 'osd.0', 'osd.1'],
           ['mon.b', 'mds.a-s', 'mon.c', 'osd.2', 'osd.3'],
           ['client.0']],
 'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2',
 'suite': 'kcephfs:cephfs',
 'suite_branch': 'master',
 'tasks': [{'ansible.cephlab': None},
           {'clock.check': None},
           {'install': None},
           {'ceph': None},
           {'kclient': None},
           {'workunit': {'clients': {'all': ['direct_io']}}}],
 'teuthology_branch': 'master',
 'tube': 'magna',
 'verbose': True}
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs scheduled 1 jobs.
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs -- 23 jobs were filtered out.
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs scheduled 0 jobs with missing packages.
2015-08-31 12:28:12,502.502 INFO:teuthology.suite:Test results viewable at http://pulpito.ceph.com/zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna/
(virtualenv)12:28:12 zack@zwork teuthology master ? cat /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml
overrides:
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
(virtualenv)12:29:05 zack@zwork teuthology master ? cat /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml
overrides:
  ceph:
    fs: btrfs
    conf:
      osd:
        osd sloppy crc: true
        osd op thread timeout: 60

This actually appears to be working as you want it to. I just scheduled a job with a stock teuthology-suite:
teuthology-suite -v -s kcephfs:cephfs -l 1

Here it is in pulpito:
http://pulpito.ceph.com/zack-2015-08-31_11:33:52-kcephfs:cephfs-master---basic-multi/1039781/

Here is its overrides stanza:

overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
  ceph-deploy:
    branch:
      dev-commit: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
  workunit:
    sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2

This of course does not explain why the correct bits didn't get included on the particular job which inspired this ticket. I would be curious to see the teuthology-suite line which scheduled that job.

#5 Updated by Zack Cerza over 8 years ago

I just noticed that this was a scheduled rados run and those are scheduled using teuthology-suite --subset. At this point this really seems like a bug in --subset.

#6 Updated by Zack Cerza over 8 years ago

I just filed a teuthology PR (https://github.com/ceph/teuthology/pull/607) to give teuthology-suite a -vv (double-verbose) mode, which causes teuthology-schedule --dry-run to be run for each generated job in the suite.

Here is its output for teuthology-suite --dry-run -vv -s kcephfs:cephfs :
http://fpaste.org/261703/14410566/

#7 Updated by Zack Cerza over 8 years ago

  • Status changed from 12 to Need More Info
  • Assignee set to Greg Farnum

Greg, the feature I mentioned in the previous comment is merged into master. Please see if you can reproduce the problem using that.

#8 Updated by Greg Farnum over 8 years ago

  • Assignee changed from Greg Farnum to Zack Cerza

Well, I tried to reproduce it on three different machines:
  • rex004 hangs when I execute it, I think because it doesn't have a VPN connection to sepia and so can't reach the lock server.
  • Sepia's teuthology box seems to have a version of teuthology from June, and when I attempted to update it, everything barfed because it wanted libpython-dev (which doesn't seem to exist for its Ubuntu, although python-dev is installed).
  • magna002 failed on some stupid thing but started working after I updated, deleted the virtualenv, and recreated it. There I get similar (or the same) output to yours. I also tried checking out the version of teuthology that sepia's machine seems to be running and scheduled a single job, and things looked fine. (This was my only idea as to what might be broken.)

So at this point all I can say is: yes, everything you are showing me here seems to indicate that it's working. But nonetheless, when running scheduled jobs we are not getting "debug mds: 20" included in the configs, and we need it to be. Perhaps the sepia teuthology box is not actually grabbing the newest ceph-qa-suite checkout to schedule against? But if I look at /home/teuthology/src/ceph-qa-suite_master, it's currently on

commit 8331556b9b8f947319433b6a0bb234088ba073c0
Merge: 5df0ceb e8d4cf1
Author: David Zafman <dzafman@redhat.com>
Date:   Tue Sep 1 12:27:12 2015 -0700

which looks to be the newest one. Despite that, looking at http://pulpito.ceph.com/teuthology-2015-08-31_23:08:01-kcephfs-master-testing-basic-multi/ (the newest run in pulpito), we still aren't getting the "debug mds" line included. I would suspect I had done something truly stupid, like putting it in the wrong branch or suite, if all our other attempts to reproduce it elsewhere weren't behaving as expected...

#9 Updated by Greg Farnum over 8 years ago

  • Status changed from Need More Info to Rejected

Ugh. Okay, the last run that didn't have the debug mds settings I was interested in...didn't include that yaml fragment at all, by virtue of being in a different subsuite. I should have spotted this earlier, but I was primed to believe the system was broken because it also took a long time from our adding the yaml fragment to its showing up in a suite result; I think that's just the unfortunately long queue times we've got going on right now.
