Bug #38115
qa/tasks/ceph.py monmap bootstrap does not support multiple clusters
Status: Closed
% Done: 0%
Description
The rgw multisite suites have been failing recently with valgrind issues in ceph-mon, but the runs start with this error from the 'ceph' command:
2019-01-30T14:17:21.881 INFO:tasks.ceph:Setting crush tunables to default
2019-01-30T14:17:21.881 INFO:teuthology.orchestra.run.smithi055:Running:
2019-01-30T14:17:21.881 INFO:teuthology.orchestra.run.smithi055:> sudo ceph --cluster c1 osd crush tunables default
2019-01-30T14:17:28.923 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:==00:00:00:06.882 32990== Warning: unimplemented fcntl command: 1036
2019-01-30T14:17:29.445 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:2019-01-30 14:17:29.425 405b040 -1 no public_addr or public_network specified, and mon.a not present in monmap or ceph.conf
2019-01-30T14:17:30.595 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:daemon-helper: command failed with exit status 1
2019-01-30T14:22:22.147 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:22:22.152 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:27:22.145 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:27:22.153 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:32:22.138 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:32:22.154 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:37:22.133 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:37:22.155 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:42:22.129 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:42:22.156 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:47:22.130 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:47:22.157 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:52:22.130 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:52:22.158 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:57:22.130 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:57:22.158 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T15:02:22.159 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 15:02:22.160 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T15:07:22.049 INFO:tasks.ceph.c1.mgr.x.smithi055.stderr:failed to fetch mon config (--no-mon-config to skip)
2019-01-30T15:07:22.160 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 15:07:22.160 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T15:07:22.184 INFO:teuthology.orchestra.run.smithi055.stderr:[errno 110] error connecting to the cluster
2019-01-30T15:07:22.205 DEBUG:teuthology.orchestra.run:got remote process result: 1
2019-01-30T15:07:22.205 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 30, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-cbodley-testing/qa/tasks/ceph.py", line 366, in crush_setup
    'osd', 'crush', 'tunables', profile])
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 194, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 435, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi055 with status 1: 'sudo ceph --cluster c1 osd crush tunables default'
The other rgw suites aren't hitting this, so I suspect that the --cluster flag is the issue here.
Updated by Casey Bodley about 5 years ago
It appears that ceph-mon isn't actually starting:
2019-01-30T14:17:29.445 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:2019-01-30 14:17:29.425 405b040 -1 no public_addr or public_network specified, and mon.a not present in monmap or ceph.conf
2019-01-30T14:17:30.595 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:daemon-helper: command failed with exit status 1
Updated by Casey Bodley about 5 years ago
2019-01-30 14:40:23.420 405b040 10 main monmap:
{
  "epoch": 0,
  "fsid": "847f8c08-e09c-400a-a9a9-742f8ccce8da",
  "modified": "2019-01-30 14:40:01.541788",
  "created": "2019-01-30 14:40:01.541788",
  "features": {
    "persistent": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus" ],
    "optional": []
  },
  "mons": [
    {
      "rank": 0,
      "name": "on.a",
      "public_addrs": {
        "addrvec": [
          { "type": "v2", "addr": "172.21.15.159:3300", "nonce": 0 },
          { "type": "v1", "addr": "172.21.15.159:6789", "nonce": 0 }
        ]
      },
      "addr": "172.21.15.159:6789/0",
      "public_addr": "172.21.15.159:6789/0"
    }
  ]
}
2019-01-30 14:40:23.447 405b040 0 mon.a does not exist in monmap, will attempt to join an existing cluster
Somehow the monitor name "on.a" ends up in the monmap.
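The mangled name "on.a" is consistent with a fixed-length 'mon.' prefix being sliced off a role that actually carries a cluster prefix. A minimal sketch of the suspected slicing bug and a cluster-aware parse (the `split_role` helper here is hypothetical, not the actual qa/tasks/ceph.py code):

```python
# With a single cluster, roles look like 'mon.a' and stripping the
# fixed-length 'mon.' prefix yields the monitor id 'a'. With multiple
# clusters the role is 'c1.mon.a', and the same slice cuts the wrong
# characters, producing exactly the 'on.a' seen in the monmap.
role = 'c1.mon.a'
mangled = role[len('mon.'):]
print(mangled)  # 'on.a'

def split_role(role, default_cluster='ceph'):
    """Hypothetical cluster-aware parse: (cluster, type, id)."""
    parts = role.split('.')
    if len(parts) == 3:
        cluster, type_, id_ = parts
    else:
        cluster, (type_, id_) = default_cluster, parts
    return cluster, type_, id_

print(split_role('c1.mon.a'))  # ('c1', 'mon', 'a')
print(split_role('mon.a'))     # ('ceph', 'mon', 'a')
```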
Updated by Casey Bodley about 5 years ago
It looks like the ceph task isn't handling the monitors correctly across clusters:
2019-01-30T14:40:01.393 DEBUG:tasks.ceph:Ceph mon addresses: [('c1.mon.a', '172.21.15.198'), ('c2.mon.a', '172.21.15.159')]
2019-01-30T14:40:01.393 INFO:teuthology.orchestra.run.smithi198:Running:
2019-01-30T14:40:01.393 INFO:teuthology.orchestra.run.smithi198:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage monmaptool --create --clobber --enable-all-features --add on.a 172.21.15.198 --add on.a 172.21.15.159 --print /home/ubuntu/cephtest/c1.monmap
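Note that the c1 monmap is being built with the monitors of both clusters, each under the mangled name "on.a". A sketch of the shape a fix would need to take, grouping the discovered mon addresses by cluster before invoking monmaptool (illustrative only; this is not the code from the PR below):

```python
# Group mon addresses by cluster so each cluster's monmap contains
# only its own monitors, under correctly parsed monitor ids.
from collections import defaultdict

mon_addrs = [('c1.mon.a', '172.21.15.198'), ('c2.mon.a', '172.21.15.159')]

by_cluster = defaultdict(list)
for role, addr in mon_addrs:
    cluster, _, mon_id = role.split('.')  # e.g. ('c1', 'mon', 'a')
    by_cluster[cluster].append((mon_id, addr))

# One monmaptool invocation per cluster, instead of one for all roles.
for cluster, mons in sorted(by_cluster.items()):
    args = ['monmaptool', '--create', '--clobber']
    for mon_id, addr in mons:
        args += ['--add', mon_id, addr]
    args += ['--print', '/home/ubuntu/cephtest/%s.monmap' % cluster]
    print(' '.join(args))
```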
Updated by Casey Bodley about 5 years ago
- Subject changed from 'ceph osd crush tunables default' command not accepting --cluster? to qa/tasks/ceph.py monmap bootstrap does not support multiple clusters
- Status changed from New to 7
- Assignee set to Casey Bodley
testing fix in https://github.com/ceph/ceph/pull/26205