Bug #12869
override debug settings are not being applied
Status: Closed
Description
commit aa84941cf9fb1c52c8992da21be157e70fe99b98
Author: John Spray <john.spray@redhat.com>
Date:   Thu Aug 13 19:08:16 2015 +0100

    tasks/kcephfs: enable MDS debug

    To help us debug #11482

    Signed-off-by: John Spray <john.spray@redhat.com>

diff --git a/suites/kcephfs/cephfs/conf.yaml b/suites/kcephfs/cephfs/conf.yaml
index 30da870..b3ef404 100644
--- a/suites/kcephfs/cephfs/conf.yaml
+++ b/suites/kcephfs/cephfs/conf.yaml
@@ -3,3 +3,5 @@ overrides:
     conf:
       global:
         ms die on skipped message: false
+      mds:
+        debug mds: 20
applied in
commit 641169f2542d8fa23c1452b53288fe732be74503
Merge: 48a8b23 aa84941
Author: Yan, Zheng <ukernel@gmail.com>
Date:   Tue Aug 18 17:38:34 2015 +0800

    Merge pull request #531 from ceph/wip-mds-debug

    tasks/kcephfs: enable MDS debug
is still not showing up in the kcephfs suite runs. I believe these should be auto-updating, so I'm not sure how it could be failing. For instance, http://pulpito.ceph.com/teuthology-2015-08-24_23:08:02-kcephfs-master-testing-basic-multi/1030719/ does not have any mention of "debug mds", and that's a master run. I've checked the master branch and the yaml fragment is good:
gregf@rex004:~/src/ceph-qa-suite [master]$ cat suites/kcephfs/cephfs/conf.yaml
overrides:
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
The only thing I can think of is that there are override stanzas in other conf.yaml files which are also being included in the run (from different fragment directories).
Updated by Zack Cerza over 8 years ago
teuthology-suite --dry-run -v -s kcephfs:cephfs -l 1
[...]
2015-08-31 11:28:03,236.236 INFO:teuthology.suite:dry-run: /Users/zack/inkdev/teuthology/virtualenv/bin/teuthology-schedule --name zack-2015-08-31_11:27:58-kcephfs:cephfs-master---basic-magna --num 1 --worker magna --priority 1000 -v --description 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}' -- /var/folders/lh/723f8c417xz2n8dfzjqnmk3c0000gn/T/schedule_suite_bco4er /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/clusters/fixed-3-cephfs.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/inline/no.yaml /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/tasks/kclient_workunit_direct_io.yaml
conf.yaml is indeed being ignored because of fs/btrfs.yaml.
When multiple fragments are specified, they are concatenated and then parsed by PyYAML (as opposed to being parsed first, then deep-merged). Changing this would probably break lots of things :)
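For illustration, here is a minimal sketch (not teuthology's actual code; the fragment contents are trimmed versions of the ones quoted in this ticket) of why concatenate-then-parse can silently drop an override stanza: PyYAML's default loader keeps only the last value when a top-level mapping key repeats, so the earlier overrides block, and with it "debug mds: 20", simply disappears.

import yaml

conf_yaml = """
overrides:
  ceph:
    conf:
      mds:
        debug mds: 20
"""

btrfs_yaml = """
overrides:
  ceph:
    fs: btrfs
    conf:
      osd:
        osd sloppy crc: true
"""

# Concatenate the raw fragment text, then parse the result once.
# PyYAML does not complain about the duplicate "overrides" key; it just
# keeps the last occurrence, so the mds debug override is lost.
combined = yaml.safe_load(conf_yaml + btrfs_yaml)
print(combined["overrides"]["ceph"]["conf"])
# {'osd': {'osd sloppy crc': True}}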
Updated by Zack Cerza over 8 years ago
- Status changed from New to Won't Fix
So, it would be best to merge your overrides by hand, I think.
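Concretely, merging by hand for this suite would mean collapsing the colliding stanzas into a single fragment. Combining the conf.yaml and fs/btrfs.yaml contents quoted in this ticket would look roughly like this (illustrative only; where such a merged fragment should live is a separate question):

overrides:
  ceph:
    fs: btrfs
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
      osd:
        osd sloppy crc: true
        osd op thread timeout: 60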
Updated by Greg Farnum over 8 years ago
- Status changed from Won't Fix to 12
- Priority changed from Normal to Urgent
Copying my argument from irc:
[18:49:59] <gregsfortytwo> if you can't have two overrides the whole fragment system breaks down
[18:50:40] <gregsfortytwo> we need to override the ceph.conf default values for all kinds of stuff
[18:50:59] <vasu> but it can be still done with one overrides, this would be less confusing?
[18:51:11] <gregsfortytwo> having *conflicting* overrides would be user error, and the system could barf on that and it would be fine/good
[18:51:24] <gregsfortytwo> but right now it's apparently silently discarding one of the override values
[18:51:32] <gregsfortytwo> vasu: no, it can't be done with one override
[18:51:41] <vasu> its a static file?
[18:52:04] <gregsfortytwo> I can't say "this suite needs a different value for foo than the ceph task normally does, and this one workload also needs a different value for bar"
[18:52:10] <gregsfortytwo> in only one file
[18:52:44] <gregsfortytwo> I could push the "foo" override into all the workload folder yamls and that would work
[18:53:07] <gregsfortytwo> but then if I also need to specify a different value for eg the btrfs-backed OSDs, it breaks down and I'd need to put them in different suites or something
And I've also discovered that most of the suites include a msgr-failures folder that sets override values, and then have overrides in the workloads folder. If these two override stanzas aren't merging we have a serious problem and it needs to be fixed in teuthology to do a proper merge. (I'm not sure if that's actually the case or if this is insufficiently diagnosed.)
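For reference, a proper merge would mean parsing each fragment separately and deep-merging the resulting dicts, so that sibling keys from different overrides stanzas survive and only genuinely conflicting scalars need special handling. A rough sketch in Python, assuming nothing about teuthology's actual internals (the helper names and example paths are illustrative):

import yaml

def deep_merge(a, b):
    """Return a copy of dict a with dict b merged in.

    Nested dicts are merged recursively; on a scalar conflict the later
    fragment (b) wins.  This is the point where conflicting overrides could
    instead be treated as user error and rejected outright.
    """
    out = dict(a)
    for key, value in b.items():
        if isinstance(out.get(key), dict) and isinstance(value, dict):
            out[key] = deep_merge(out[key], value)
        else:
            out[key] = value
    return out

def merge_fragments(paths):
    """Parse each YAML fragment and fold it into one job config dict."""
    config = {}
    for path in paths:
        with open(path) as f:
            config = deep_merge(config, yaml.safe_load(f) or {})
    return config

# e.g. merge_fragments(["conf.yaml", "msgr-failures/many.yaml", "tasks/some_workload.yaml"])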
Updated by Zack Cerza over 8 years ago
Upon closer inspection with a hacked teuthology-suite:
2015-08-31 12:28:11,913.913 INFO:teuthology.suite:Scheduling kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}

['/Users/zack/inkdev/teuthology/virtualenv/bin/teuthology-schedule', '--name', 'zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna', '--num', '1', '--worker', 'magna', '--dry-run', '--priority', '1000', '-v', '--description', 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}', '--', '/var/folders/lh/723f8c417xz2n8dfzjqnmk3c0000gn/T/schedule_suite_NerDkv', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/clusters/fixed-3-cephfs.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/inline/no.yaml', '/Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/tasks/kclient_workunit_direct_io.yaml']

{'branch': 'master', 'description': 'kcephfs:cephfs/{conf.yaml clusters/fixed-3-cephfs.yaml fs/btrfs.yaml inline/no.yaml tasks/kclient_workunit_direct_io.yaml}', 'email': 'zack@redhat.com', 'last_in_suite': False, 'log-rotate': {'ceph-mds': '10G', 'ceph-osd': '10G'}, 'machine_type': 'magna', 'name': 'zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna', 'nuke-on-error': True, 'overrides': {'admin_socket': {'branch': 'master'}, 'ceph': {'conf': {'global': {'ms die on skipped message': False}, 'mds': {'debug mds': 20}, 'mon': {'debug mon': 20, 'debug ms': 1, 'debug paxos': 20}, 'osd': {'debug filestore': 20, 'debug journal': 20, 'debug ms': 1, 'debug osd': 20, 'osd op thread timeout': 60, 'osd sloppy crc': True}}, 'fs': 'btrfs', 'log-whitelist': ['slow request'], 'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}, 'ceph-deploy': {'branch': {'dev-commit': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}, 'conf': {'client': {'log file': '/var/log/ceph/ceph-$name.$pid.log'}, 'mon': {'debug mon': 1, 'debug ms': 20, 'debug paxos': 20, 'osd default pool size': 2}}}, 'install': {'ceph': {'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}}, 'workunit': {'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2'}}, 'owner': 'scheduled_zack@zwork.local', 'priority': 1000, 'roles': [['mon.a', 'mds.a', 'osd.0', 'osd.1'], ['mon.b', 'mds.a-s', 'mon.c', 'osd.2', 'osd.3'], ['client.0']], 'sha1': '6dc9ed581441aade22750d1eb541cdbeddeb37d2', 'suite': 'kcephfs:cephfs', 'suite_branch': 'master', 'tasks': [{'ansible.cephlab': None}, {'clock.check': None}, {'install': None}, {'ceph': None}, {'kclient': None}, {'workunit': {'clients': {'all': ['direct_io']}}}], 'teuthology_branch': 'master', 'tube': 'magna', 'verbose': True}

2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs scheduled 1 jobs.
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs -- 23 jobs were filtered out.
2015-08-31 12:28:12,239.239 INFO:teuthology.suite:Suite kcephfs:cephfs in /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs scheduled 0 jobs with missing packages.
2015-08-31 12:28:12,502.502 INFO:teuthology.suite:Test results viewable at http://pulpito.ceph.com/zack-2015-08-31_12:28:08-kcephfs:cephfs-master---basic-magna/
(virtualenv)12:28:12 zack@zwork teuthology master ?
cat /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/conf.yaml
overrides:
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
(virtualenv)12:29:05 zack@zwork teuthology master ? cat /Users/zack/src/ceph-qa-suite_master/suites/kcephfs/cephfs/fs/btrfs.yaml
overrides:
  ceph:
    fs: btrfs
    conf:
      osd:
        osd sloppy crc: true
        osd op thread timeout: 60
This actually appears to be working as you want it to. I just scheduled a job with a stock teuthology-suite:

teuthology-suite -v -s kcephfs:cephfs -l 1
Here it is in pulpito:
http://pulpito.ceph.com/zack-2015-08-31_11:33:52-kcephfs:cephfs-master---basic-multi/1039781/
Here is its overrides stanza:
overrides:
  admin_socket:
    branch: master
  ceph:
    conf:
      global:
        ms die on skipped message: false
      mds:
        debug mds: 20
      mon:
        debug mon: 20
        debug ms: 1
        debug paxos: 20
      osd:
        debug filestore: 20
        debug journal: 20
        debug ms: 1
        debug osd: 20
        osd op thread timeout: 60
        osd sloppy crc: true
    fs: btrfs
    log-whitelist:
    - slow request
    sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
  ceph-deploy:
    branch:
      dev-commit: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
    conf:
      client:
        log file: /var/log/ceph/ceph-$name.$pid.log
      mon:
        debug mon: 1
        debug ms: 20
        debug paxos: 20
        osd default pool size: 2
  install:
    ceph:
      sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
  workunit:
    sha1: 6dc9ed581441aade22750d1eb541cdbeddeb37d2
This of course does not explain why the correct bits didn't get included on the particular job which inspired this ticket. I would be curious to see the teuthology-suite line which scheduled that job.
Updated by Zack Cerza over 8 years ago
I just noticed that this was a scheduled rados run, and those are scheduled using teuthology-suite --subset. At this point this really seems like a bug in --subset.
Updated by Zack Cerza over 8 years ago
I just filed a teuthology PR (https://github.com/ceph/teuthology/pull/607) to give teuthology-suite a -vv (double-verbose) mode, which causes teuthology-schedule --dry-run to be run for each generated job in the suite.

Here is its output for teuthology-suite --dry-run -vv -s kcephfs:cephfs:

http://fpaste.org/261703/14410566/
Updated by Zack Cerza over 8 years ago
- Status changed from 12 to Need More Info
- Assignee set to Greg Farnum
Greg, the feature I mentioned in the previous comment is merged into master. Please see if you can reproduce the problem using that.
Updated by Greg Farnum over 8 years ago
- Assignee changed from Greg Farnum to Zack Cerza
Well, I tried to on three different machines.
rex004 hangs when I execute it, I think because it doesn't have a VPN connection to sepia and so can't reach the lock server.
Sepia's teuthology box seems to have a version of teuthology from June, and when I attempted to update it, everything barfed because it wanted libpython-dev (which doesn't seem to exist for its Ubuntu release, although python-dev is installed).
magna002 failed on some stupid thing but started working when I updated, deleted the virtualenv, and recreated it. I get much the same output as you do. I also tried checking out the version of teuthology that sepia's machine seems to be running and scheduling a single job, and things looked fine. (This was my only idea as to what might be broken.)
So at this point all I can say is yes, the things you are showing me here seem to indicate that it's working. But nonetheless, when running scheduled jobs we are not getting "debug mds: 20" included in the configs. We need it to do so. Perhaps the sepia teuthology box is not actually grabbing the newest ceph-qa-suite checkouts to schedule against? But if I look at /home/teuthology/src/ceph-qa-suite_master it's currently on
commit 8331556b9b8f947319433b6a0bb234088ba073c0
Merge: 5df0ceb e8d4cf1
Author: David Zafman <dzafman@redhat.com>
Date:   Tue Sep 1 12:27:12 2015 -0700
which looks to be the newest one. Despite that, looking at http://pulpito.ceph.com/teuthology-2015-08-31_23:08:01-kcephfs-master-testing-basic-multi/ (the newest one in pulpito), we still aren't getting the "debug mds" line included. I would wonder if I had done something truly stupid, like putting it in the wrong branch or suite or something, if all our other attempts to reproduce it elsewhere weren't behaving as expected...
Updated by Greg Farnum over 8 years ago
- Status changed from Need More Info to Rejected
Ugh. Okay, the last run that didn't have the debug mds settings I was interested in... didn't have that yaml fragment included, by virtue of being in a different subsuite. I should have spotted this earlier, but I was primed for it to be broken because it also took a long time between our merging the yaml fragment and its showing up in a suite result; I think that's just the unfortunately long queue times we've got going on right now.