Bug #38115
qa/tasks/ceph.py monmap bootstrap does not support multiple clusters
Status: Closed
% Done: 0%
Description
The rgw multisite suites have been failing recently with valgrind issues in ceph-mon, but the runs start with this error from the 'ceph' command:
2019-01-30T14:17:21.881 INFO:tasks.ceph:Setting crush tunables to default
2019-01-30T14:17:21.881 INFO:teuthology.orchestra.run.smithi055:Running:
2019-01-30T14:17:21.881 INFO:teuthology.orchestra.run.smithi055:> sudo ceph --cluster c1 osd crush tunables default
2019-01-30T14:17:28.923 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:==00:00:00:06.882 32990== Warning: unimplemented fcntl command: 1036
2019-01-30T14:17:29.445 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:2019-01-30 14:17:29.425 405b040 -1 no public_addr or public_network specified, and mon.a not present in monmap or ceph.conf
2019-01-30T14:17:30.595 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:daemon-helper: command failed with exit status 1
2019-01-30T14:22:22.147 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:22:22.152 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:27:22.145 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:27:22.153 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:32:22.138 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:32:22.154 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:37:22.133 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:37:22.155 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:42:22.129 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:42:22.156 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:47:22.130 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:47:22.157 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:52:22.130 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:52:22.158 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T14:57:22.130 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 14:57:22.158 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T15:02:22.159 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 15:02:22.160 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T15:07:22.049 INFO:tasks.ceph.c1.mgr.x.smithi055.stderr:failed to fetch mon config (--no-mon-config to skip)
2019-01-30T15:07:22.160 INFO:teuthology.orchestra.run.smithi055.stderr:2019-01-30 15:07:22.160 7f8d5c960700 0 monclient(hunting): authenticate timed out after 300
2019-01-30T15:07:22.184 INFO:teuthology.orchestra.run.smithi055.stderr:[errno 110] error connecting to the cluster
2019-01-30T15:07:22.205 DEBUG:teuthology.orchestra.run:got remote process result: 1
2019-01-30T15:07:22.205 ERROR:teuthology.contextutil:Saw exception from nested tasks
Traceback (most recent call last):
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/contextutil.py", line 30, in nested
    vars.append(enter())
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/teuthworker/src/git.ceph.com_ceph-c_wip-cbodley-testing/qa/tasks/ceph.py", line 366, in crush_setup
    'osd', 'crush', 'tunables', profile])
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/remote.py", line 194, in run
    r = self._runner(client=self.ssh, name=self.shortname, **kwargs)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 435, in run
    r.wait()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 162, in wait
    self._raise_for_status()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_master/teuthology/orchestra/run.py", line 184, in _raise_for_status
    node=self.hostname, label=self.label
CommandFailedError: Command failed on smithi055 with status 1: 'sudo ceph --cluster c1 osd crush tunables default'
The other rgw suites aren't hitting this, so I suspect that the --cluster flag is the issue here.
Updated by Casey Bodley about 5 years ago
It appears that ceph-mon isn't actually starting:
2019-01-30T14:17:29.445 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:2019-01-30 14:17:29.425 405b040 -1 no public_addr or public_network specified, and mon.a not present in monmap or ceph.conf
2019-01-30T14:17:30.595 INFO:tasks.ceph.c1.mon.a.smithi055.stderr:daemon-helper: command failed with exit status 1
Updated by Casey Bodley about 5 years ago
2019-01-30 14:40:23.420 405b040 10 main monmap:
{
  "epoch": 0,
  "fsid": "847f8c08-e09c-400a-a9a9-742f8ccce8da",
  "modified": "2019-01-30 14:40:01.541788",
  "created": "2019-01-30 14:40:01.541788",
  "features": {
    "persistent": [ "kraken", "luminous", "mimic", "osdmap-prune", "nautilus" ],
    "optional": []
  },
  "mons": [
    {
      "rank": 0,
      "name": "on.a",
      "public_addrs": {
        "addrvec": [
          { "type": "v2", "addr": "172.21.15.159:3300", "nonce": 0 },
          { "type": "v1", "addr": "172.21.15.159:6789", "nonce": 0 }
        ]
      },
      "addr": "172.21.15.159:6789/0",
      "public_addr": "172.21.15.159:6789/0"
    }
  ]
}
2019-01-30 14:40:23.447 405b040 0 mon.a does not exist in monmap, will attempt to join an existing cluster
Somehow the monitor name "on.a" ends up in the monmap.
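The mangled name "on.a" is consistent with a fixed-length 'mon.' prefix being sliced off a role that actually carries a cluster prefix. A minimal sketch of the suspected slicing bug and a cluster-aware parse (the `split_role` helper here is hypothetical, not the actual qa/tasks/ceph.py code):

```python
# With a single cluster, roles look like 'mon.a' and stripping the
# fixed-length 'mon.' prefix yields the monitor id 'a'. With multiple
# clusters the role is 'c1.mon.a', and the same slice cuts the wrong
# characters, producing exactly the 'on.a' seen in the monmap.
role = 'c1.mon.a'
mangled = role[len('mon.'):]
print(mangled)  # 'on.a'

def split_role(role, default_cluster='ceph'):
    """Hypothetical cluster-aware parse: (cluster, type, id)."""
    parts = role.split('.')
    if len(parts) == 3:
        cluster, type_, id_ = parts
    else:
        cluster, (type_, id_) = default_cluster, parts
    return cluster, type_, id_

print(split_role('c1.mon.a'))  # ('c1', 'mon', 'a')
print(split_role('mon.a'))     # ('ceph', 'mon', 'a')
```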
Updated by Casey Bodley about 5 years ago
It looks like the ceph task isn't handling the monitors correctly across clusters:
2019-01-30T14:40:01.393 DEBUG:tasks.ceph:Ceph mon addresses: [('c1.mon.a', '172.21.15.198'), ('c2.mon.a', '172.21.15.159')]
2019-01-30T14:40:01.393 INFO:teuthology.orchestra.run.smithi198:Running:
2019-01-30T14:40:01.393 INFO:teuthology.orchestra.run.smithi198:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage monmaptool --create --clobber --enable-all-features --add on.a 172.21.15.198 --add on.a 172.21.15.159 --print /home/ubuntu/cephtest/c1.monmap
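Note that the c1 monmap is being built with the monitors of both clusters, each under the mangled name "on.a". A sketch of the shape a fix would need to take, grouping the discovered mon addresses by cluster before invoking monmaptool (illustrative only; this is not the code from the PR below):

```python
# Group mon addresses by cluster so each cluster's monmap contains
# only its own monitors, under correctly parsed monitor ids.
from collections import defaultdict

mon_addrs = [('c1.mon.a', '172.21.15.198'), ('c2.mon.a', '172.21.15.159')]

by_cluster = defaultdict(list)
for role, addr in mon_addrs:
    cluster, _, mon_id = role.split('.')  # e.g. ('c1', 'mon', 'a')
    by_cluster[cluster].append((mon_id, addr))

# One monmaptool invocation per cluster, instead of one for all roles.
for cluster, mons in sorted(by_cluster.items()):
    args = ['monmaptool', '--create', '--clobber']
    for mon_id, addr in mons:
        args += ['--add', mon_id, addr]
    args += ['--print', '/home/ubuntu/cephtest/%s.monmap' % cluster]
    print(' '.join(args))
```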
Updated by Casey Bodley about 5 years ago
- Subject changed from 'ceph osd crush tunables default' command not accepting --cluster? to qa/tasks/ceph.py monmap bootstrap does not support multiple clusters
- Status changed from New to 7
- Assignee set to Casey Bodley
testing fix in https://github.com/ceph/ceph/pull/26205