Bug #49870

closed

When 'ceph mgr dump' returns invalid JSON in the middle of spec application, the spec application fails

Added by John Fulton about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
orchestrator
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When using cephadm-15.2.9-2.el8.noarch with the command below to apply the attached spec:

/usr/sbin/cephadm --image undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid 91c9b592-5317-4d92-97c1-6a9f0e9460cc --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 172.16.11.239

The output is the following, ending in a failure caused by invalid JSON:

Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: 91c9b592-5317-4d92-97c1-6a9f0e9460cc
Verifying IP 172.16.11.239 port 3300 ...
Verifying IP 172.16.11.239 port 6789 ...
Mon IP 172.16.11.239 is in CIDR network 172.16.11.0/24
Pulling container image undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64...
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network...
Creating mgr...
Verifying port 9283 ...
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Wrote config to /etc/ceph/ceph.conf
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/10)...
mgr not available, waiting (2/10)...
mgr not available, waiting (3/10)...
mgr is available
Enabling cephadm module...
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 6151, in <module>
    r = args.func()
  File "/usr/sbin/cephadm", line 1410, in _default_image
    return func()
  File "/usr/sbin/cephadm", line 3145, in command_bootstrap
    wait_for_mgr_restart()
  File "/usr/sbin/cephadm", line 3124, in wait_for_mgr_restart
    j = json.loads(out)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 3217 column 25 (char 114687)

The failure occurs here in the code:

https://github.com/ceph/ceph/blob/octopus/src/cephadm/cephadm#L3123

In the following:

out = cli(['mgr', 'dump'])
j = json.loads(out)

out contains invalid JSON, so json.loads(out) throws an exception.
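
For illustration, this is what that failure mode looks like in isolation (the truncated payload below is made up for the example, not actual mgr output):

import json

# Hypothetical truncated 'ceph mgr dump' payload: the last string is cut off,
# which produces the same "Unterminated string" error seen in the traceback above.
out = '{"epoch": 7, "active_name": "oc0-controller-0'
try:
    json.loads(out)
except json.decoder.JSONDecodeError as e:
    print(e)  # e.g. "Unterminated string starting at: line 1 column 29 (char 28)"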

I can reproduce the failure condition by doing the following:

While the spec is applying, run the following on the same host:

sudo cephadm shell -- ceph mgr dump  | jq .

There will be times that jq complains about invalid JSON.
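
The same observation can be scripted; a minimal sketch (the exact command invocation and polling interval are assumptions, not part of the reproducer above):

import json
import subprocess
import time

# Poll 'ceph mgr dump' once a second while the spec is applying and report
# every iteration whose output fails to parse as JSON.
while True:
    out = subprocess.run(
        ['sudo', 'cephadm', 'shell', '--', 'ceph', 'mgr', 'dump'],
        stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    try:
        json.loads(out)
    except json.decoder.JSONDecodeError as e:
        print('invalid JSON from mgr dump: %s' % e)
    time.sleep(1)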

I conjecture it's because the MGR is restarting. Perhaps a try/except retry is needed in case invalid JSON is returned because the mgr is not available?
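
A minimal standalone sketch of that retry idea (the function name, retry count, delay, and direct 'ceph' invocation are illustrative assumptions, not the actual cephadm code):

import json
import subprocess
import time

def mgr_dump_with_retry(retries=5, delay=1):
    # Retry 'ceph mgr dump' until it returns parseable JSON, assuming that
    # invalid output is transient while the mgr restarts.
    for _ in range(retries):
        out = subprocess.run(['ceph', 'mgr', 'dump'],
                             stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
        try:
            return json.loads(out)
        except json.decoder.JSONDecodeError:
            time.sleep(delay)
    raise RuntimeError('mgr dump did not return valid JSON after %d tries' % retries)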


Files

ceph_spec.yaml (963 Bytes) John Fulton, 03/17/2021 04:35 PM
Actions #1

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
Actions #2

Updated by Sebastian Wagner about 3 years ago

Let me guess: podman 2.2.1?

Actions #3

Updated by John Fulton about 3 years ago

[ceph-admin@oc0-controller-0 ~]$ podman --version
podman version 3.0.0-dev
[ceph-admin@oc0-controller-0 ~]$

Actions #4

Updated by John Fulton about 3 years ago

I made the modification below and, after repeated testing, I am no longer hitting the issue. My logs show the following. I'll send a PR in.

```
"Enabling cephadm module...",
"Waiting 1 second for mgr to return valid JSON...",
"Waiting 1 second for mgr to return valid JSON...",
"Waiting 1 second for mgr to return valid JSON...",
"Waiting for the mgr to restart...",
```

```
def wait_for_mgr_restart():
    # first get latest mgrmap epoch from the mon
    retries = 5
    retry = 0
    while (retry < retries):
        try:
            out = cli(['mgr', 'dump'])
            j = json.loads(out)
            break
        except json.decoder.JSONDecodeError:
            time.sleep(1)
            logger.info('Waiting 1 second for mgr to return valid JSON...')
        retry += 1
    epoch = j['epoch']
    # wait for mgr to have it
    logger.info('Waiting for the mgr to restart...')
    def mgr_has_latest_epoch():
        # type: () -> bool
        try:
            out = cli(['tell', 'mgr', 'mgr_status'])
            j = json.loads(out)
            return j['mgrmap_epoch'] >= epoch
        except Exception as e:
            logger.debug('tell mgr mgr_status failed: %s' % e)
            return False
    is_available('Mgr epoch %d' % epoch, mgr_has_latest_epoch)
```

Actions #5

Updated by John Fulton about 3 years ago

Sorry, still getting used to the tracker formatting; here is the code again, properly indented. I can send a PR.

    # wait for mgr to restart (after enabling a module)
    def wait_for_mgr_restart():
        # first get latest mgrmap epoch from the mon
        retries = 5
        retry = 0
        while (retry < retries):
            try:
                out = cli(['mgr', 'dump'])
                j = json.loads(out)
                break
            except json.decoder.JSONDecodeError:
                time.sleep(1)
                logger.info('Waiting 1 second for mgr to return valid JSON...')
            retry += 1
        epoch = j['epoch']
        # wait for mgr to have it
        logger.info('Waiting for the mgr to restart...')
        def mgr_has_latest_epoch():
            # type: () -> bool
            try:
                out = cli(['tell', 'mgr', 'mgr_status'])
                j = json.loads(out)
                return j['mgrmap_epoch'] >= epoch
            except Exception as e:
                logger.debug('tell mgr mgr_status failed: %s' % e)
                return False
        is_available('Mgr epoch %d' % epoch, mgr_has_latest_epoch)

Actions #6

Updated by John Fulton about 3 years ago

Here's a potential fix: https://github.com/ceph/ceph/pull/40203

This is my first patch to cephadm; happy to update as needed. Thanks for your review.

(I was not able to edit the original bug to set the above as the "Pull request ID".)

Actions #7

Updated by Sage Weil about 3 years ago

  • Status changed from New to Pending Backport
  • Backport set to pacific,octopus
  • Pull request ID set to 40203
Actions #8

Updated by Sage Weil about 3 years ago

  • Status changed from Pending Backport to Resolved