Bug #49870

closed

When 'ceph mgr dump' returns invalid JSON in the middle of spec application, the spec application fails

Added by John Fulton about 3 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
orchestrator
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
pacific,octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When using cephadm-15.2.9-2.el8.noarch with the command below to apply the attached spec:

/usr/sbin/cephadm --image undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64 bootstrap --skip-firewalld --ssh-private-key /home/ceph-admin/.ssh/id_rsa --ssh-public-key /home/ceph-admin/.ssh/id_rsa.pub --ssh-user ceph-admin --allow-fqdn-hostname --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --fsid 91c9b592-5317-4d92-97c1-6a9f0e9460cc --config /home/ceph-admin/bootstrap_ceph.conf --skip-monitoring-stack --skip-dashboard --mon-ip 172.16.11.239

The output is the following, ending in a failure caused by invalid JSON:

Verifying podman|docker is present...
Verifying lvm2 is present...
Verifying time synchronization is in place...
Unit chronyd.service is enabled and running
Repeating the final host check...
podman|docker (/bin/podman) is present
systemctl is present
lvcreate is present
Unit chronyd.service is enabled and running
Host looks OK
Cluster fsid: 91c9b592-5317-4d92-97c1-6a9f0e9460cc
Verifying IP 172.16.11.239 port 3300 ...
Verifying IP 172.16.11.239 port 6789 ...
Mon IP 172.16.11.239 is in CIDR network 172.16.11.0/24
Pulling container image undercloud.ctlplane.mydomain.tld:8787/ceph-ci/daemon:v5.0.7-stable-5.0-octopus-centos-8-x86_64...
Extracting ceph user uid/gid from container image...
Creating initial keys...
Creating initial monmap...
Creating mon...
Waiting for mon to start...
Waiting for mon...
mon is available
Assimilating anything we can from ceph.conf...
Generating new minimal ceph.conf...
Restarting the monitor...
Setting mon public_network...
Creating mgr...
Verifying port 9283 ...
Wrote keyring to /etc/ceph/ceph.client.admin.keyring
Wrote config to /etc/ceph/ceph.conf
Waiting for mgr to start...
Waiting for mgr...
mgr not available, waiting (1/10)...
mgr not available, waiting (2/10)...
mgr not available, waiting (3/10)...
mgr is available
Enabling cephadm module...
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 6151, in <module>
    r = args.func()
  File "/usr/sbin/cephadm", line 1410, in _default_image
    return func()
  File "/usr/sbin/cephadm", line 3145, in command_bootstrap
    wait_for_mgr_restart()
  File "/usr/sbin/cephadm", line 3124, in wait_for_mgr_restart
    j = json.loads(out)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 3217 column 25 (char 114687)

The failure occurs here in the code:

https://github.com/ceph/ceph/blob/octopus/src/cephadm/cephadm#L3123

In the following:

out = cli(['mgr', 'dump'])
j = json.loads(out)

out contains invalid JSON, so json.loads(out) throws an exception.
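
For illustration, this is what that failure mode looks like in isolation (the truncated payload below is made up for the example, not actual mgr output):

import json

# Hypothetical truncated 'ceph mgr dump' payload: the last string is cut off,
# which produces the same "Unterminated string" error seen in the traceback above.
out = '{"epoch": 7, "active_name": "oc0-controller-0'
try:
    json.loads(out)
except json.decoder.JSONDecodeError as e:
    print(e)  # e.g. "Unterminated string starting at: line 1 column 29 (char 28)"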

I can reproduce the failure condition by doing the following:

While the spec is applying, run the following on the same host:

sudo cephadm shell -- ceph mgr dump  | jq .

There will be times that jq complains about invalid JSON.
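
The same observation can be scripted; a minimal sketch (the exact command invocation and polling interval are assumptions, not part of the reproducer above):

import json
import subprocess
import time

# Poll 'ceph mgr dump' once a second while the spec is applying and report
# every iteration whose output fails to parse as JSON.
while True:
    out = subprocess.run(
        ['sudo', 'cephadm', 'shell', '--', 'ceph', 'mgr', 'dump'],
        stdout=subprocess.PIPE, universal_newlines=True,
    ).stdout
    try:
        json.loads(out)
    except json.decoder.JSONDecodeError as e:
        print('invalid JSON from mgr dump: %s' % e)
    time.sleep(1)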

I conjecture it's because the MGR is restarting. Perhaps a try/except retry is needed in case invalid JSON is returned because the mgr is not available?
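
A minimal standalone sketch of that retry idea (the function name, retry count, delay, and direct 'ceph' invocation are illustrative assumptions, not the actual cephadm code):

import json
import subprocess
import time

def mgr_dump_with_retry(retries=5, delay=1):
    # Retry 'ceph mgr dump' until it returns parseable JSON, assuming that
    # invalid output is transient while the mgr restarts.
    for _ in range(retries):
        out = subprocess.run(['ceph', 'mgr', 'dump'],
                             stdout=subprocess.PIPE,
                             universal_newlines=True).stdout
        try:
            return json.loads(out)
        except json.decoder.JSONDecodeError:
            time.sleep(delay)
    raise RuntimeError('mgr dump did not return valid JSON after %d tries' % retries)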


Files

ceph_spec.yaml (963 Bytes) John Fulton, 03/17/2021 04:35 PM
Actions #1

Updated by Sebastian Wagner about 3 years ago

  • Description updated (diff)
Actions #2

Updated by Sebastian Wagner about 3 years ago

Let me guess: podman 2.2.1?

Actions #3

Updated by John Fulton about 3 years ago

[ceph-admin@oc0-controller-0 ~]$ podman --version
podman version 3.0.0-dev
[ceph-admin@oc0-controller-0 ~]$

Actions #4

Updated by John Fulton about 3 years ago

I made the modification below and, after repeated testing, I am no longer hitting the issue. My logs show the following. I'll send a PR in.

```
"Enabling cephadm module...",
"Waiting 1 second for mgr to return valid JSON...",
"Waiting 1 second for mgr to return valid JSON...",
"Waiting 1 second for mgr to return valid JSON...",
"Waiting for the mgr to restart...",
```

```
def wait_for_mgr_restart():
    # first get latest mgrmap epoch from the mon
    retries = 5
    retry = 0
    while (retry < retries):
        try:
            out = cli(['mgr', 'dump'])
            j = json.loads(out)
            break
        except json.decoder.JSONDecodeError:
            time.sleep(1)
            logger.info('Waiting 1 second for mgr to return valid JSON...')
        retry += 1
    epoch = j['epoch']
    # wait for mgr to have it
    logger.info('Waiting for the mgr to restart...')
    def mgr_has_latest_epoch():
        # type: () -> bool
        try:
            out = cli(['tell', 'mgr', 'mgr_status'])
            j = json.loads(out)
            return j['mgrmap_epoch'] >= epoch
        except Exception as e:
            logger.debug('tell mgr mgr_status failed: %s' % e)
            return False
    is_available('Mgr epoch %d' % epoch, mgr_has_latest_epoch)
```

Actions #5

Updated by John Fulton about 3 years ago

Sorry, still getting used to the tracker formatting; here is the code again, properly indented. I can send a PR.

    # wait for mgr to restart (after enabling a module)
    def wait_for_mgr_restart():
        # first get latest mgrmap epoch from the mon
        retries = 5
        retry = 0
        while (retry < retries):
            try:
                out = cli(['mgr', 'dump'])
                j = json.loads(out)
                break
            except json.decoder.JSONDecodeError:
                time.sleep(1)
                logger.info('Waiting 1 second for mgr to return valid JSON...')
            retry += 1
        epoch = j['epoch']
        # wait for mgr to have it
        logger.info('Waiting for the mgr to restart...')
        def mgr_has_latest_epoch():
            # type: () -> bool
            try:
                out = cli(['tell', 'mgr', 'mgr_status'])
                j = json.loads(out)
                return j['mgrmap_epoch'] >= epoch
            except Exception as e:
                logger.debug('tell mgr mgr_status failed: %s' % e)
                return False
        is_available('Mgr epoch %d' % epoch, mgr_has_latest_epoch)

Actions #6

Updated by John Fulton about 3 years ago

Here's a potential fix: https://github.com/ceph/ceph/pull/40203

This is my first patch to cephadm; happy to update as needed. Thanks for your review.

(I was not able to edit the original bug to set the above as the "Pull request ID".)

Actions #7

Updated by Sage Weil about 3 years ago

  • Status changed from New to Pending Backport
  • Backport set to pacific,octopus
  • Pull request ID set to 40203
Actions #8

Updated by Sage Weil about 3 years ago

  • Status changed from Pending Backport to Resolved