Bug #44642
Status: Closed
cephadm: mgr dump might be too huge
Description
/usr/bin/python3 /usr/sbin/cephadm --image registry.opensuse.org/filesystems/ceph/octopus/upstream/images/ceph/ceph --verbose bootstrap --mon-ip 10.20.62.200 --config /root/ceph.conf --initial-dashboard-user admin --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --skip-prepare-host --skip-pull --skip-ssh
...
DEBUG:cephadm:/usr/bin/ceph:stdout         "name": "log_to_cluster",
DEBUG:cephadm:/usr/bin/ceph:stdout         "type": "bool",
DEBUG:cephadm:/usr/bin/ceph:stdout         "level": "advanced",
DEBUG:cephadm:/usr/bin/ceph:stdout
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 4246, in <module>
    r = args.func()
  File "/usr/sbin/cephadm", line 2362, in command_bootstrap
    wait_for_mgr_restart()
  File "/usr/sbin/cephadm", line 2293, in wait_for_mgr_restart
    j = json.loads(out)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 3042 column 13 (char 107520)
DEBUG:cephadm:Releasing lock 140670058269104 on /run/cephadm/5c6aa640-6920-11ea-8417-5254009bd4ee.lock
DEBUG:cephadm:Lock 140670058269104 released on /run/cephadm/5c6aa640-6920-11ea-8417-5254009bd4ee.lock
This might also appear as a hang when calling `mgr dump`:
podman run --rm --net=host -e CONTAINER_IMAGE=registry.opensuse.org/filesystems/ceph/master/upstream/images/ceph/ceph -e NODE_NAME=master -v /var/log/ceph/cd0bbbb4-6849-11ea-a3be-52540052cb35:/var/log/ceph:z -v /tmp/ceph-tmp7ws_6wkt:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmp8dk1r1k2:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph registry.opensuse.org/filesystems/ceph/master/upstream/images/ceph/ceph mgr dump
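The failure mode in the traceback above can be reproduced in isolation: feeding json.loads() a document that has been cut off mid-object produces exactly this error. A minimal sketch (the payload below is made up for illustration, not real `mgr dump` output):

```python
import json

# A made-up fragment shaped like "ceph mgr dump" output, cut off
# mid-object the way cephadm saw it when the output was truncated.
truncated = '{\n    "epoch": 5,\n    "modules": {\n        '

try:
    json.loads(truncated)
except json.JSONDecodeError as e:
    # The parser stops at the point where a property name should start.
    print(e.msg)  # Expecting property name enclosed in double quotes
```

So the error message points at the truncation point, not at any particular value in the JSON.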
Updated by Sebastian Wagner about 4 years ago
- Related to Bug #44612: mgr stuck in activating state added
Updated by Sebastian Wagner about 4 years ago
- Status changed from In Progress to Rejected
seems to be a downstream issue.
Updated by Sebastian Wagner about 4 years ago
- Subject changed from cephadm: cephadm bootstrap hangs indefinitely to cephadm: mgr dump might be too huge
- Description updated (diff)
- Status changed from Rejected to New
- Priority changed from Urgent to High
Updated by Sebastian Wagner about 4 years ago
- Assignee changed from Ricardo Marques to Sebastian Wagner
Updated by Sebastian Wagner about 4 years ago
- Status changed from New to In Progress
Updated by Nathan Cutler about 4 years ago
Now, with both cephadm and container at 15.1.1-168-g06ecd31e39 I am seeing "cephadm bootstrap" fail on "ceph mgr dump" with the following Traceback:
DEBUG:cephadm:Running command: /usr/bin/podman run --rm --net=host -e CONTAINER_IMAGE=registry.suse.de/devel/storage/7.0/cr/containers/ses/7/ceph/ceph -e NODE_NAME=master -v /var/log/ceph/41e32c20-6afb-11ea-8147-525400a7f085:/var/log/ceph:z -v /tmp/ceph-tmp6nv9tfvk:/etc/ceph/ceph.client.admin.keyring:z -v /tmp/ceph-tmpjq0etg1v:/etc/ceph/ceph.conf:z --entrypoint /usr/bin/ceph registry.suse.de/devel/storage/7.0/cr/containers/ses/7/ceph/ceph mgr dump
DEBUG:cephadm:/usr/bin/ceph:stdout {
DEBUG:cephadm:/usr/bin/ceph:stdout     "epoch": 5,
... [THOUSANDS OF LINES OMITTED] ...
DEBUG:cephadm:/usr/bin/ceph:stdout         },
DEBUG:cephadm:/usr/bin/ceph:stdout         "organization": {
DEBUG:cephadm:/usr/bin/ceph:stdout             "name": "organization",
DEBUG:cephadm:/usr/bin/ceph:stdout             "type": "str",
DEBUG:cephadm:/usr/bin/ceph:stdout             "level": "advanced",
DEBUG:cephadm:/usr/bin/ceph:stdout
Traceback (most recent call last):
  File "/usr/sbin/cephadm", line 4261, in <module>
    r = args.func()
  File "/usr/sbin/cephadm", line 967, in _default_image
    return func()
  File "/usr/sbin/cephadm", line 2396, in command_bootstrap
    wait_for_mgr_restart()
  File "/usr/sbin/cephadm", line 2327, in wait_for_mgr_restart
    j = json.loads(out)
  File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.6/json/decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 3854 column 13 (char 135168)
Now, of course, line 3854 is the very last line in the "ceph mgr dump" output, but it is truncated so we don't see the whole line - just the first 12 characters, which are all spaces.
However, if I run "ceph mgr dump" without the container, I can see what's on that line:
"flags": 0,
Do you know why the line "j = json.loads(out)" is choking on the integer value sent by "ceph mgr dump"?
Updated by Sebastian Wagner about 4 years ago
- Priority changed from High to Low
I don't know what caused this. Might actually be an artifact of our podman hang. prio=low for now.
edit: oh, you can now actually reproduce it?
Updated by Sebastian Wagner about 4 years ago
- Priority changed from Low to Normal
Updated by Sebastian Wagner about 4 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 34031
Updated by Sebastian Wagner about 4 years ago
- Assignee changed from Sebastian Wagner to Tim Serong
Updated by Nathan Cutler about 4 years ago
- Assignee changed from Tim Serong to Sebastian Wagner
> Do you know why the line "j = json.loads(out)" is choking on the integer value sent by "ceph mgr dump"?
Now I see it was not choking on the integer value. The error message "Expecting property name enclosed in double quotes" is there because the "ceph mgr dump" output was truncated.
> oh, you can now actually reproduce it?
Apparently not. I'm testing 15.2.0 now and it's no longer happening.
Updated by Tim Serong about 4 years ago
I can't help but think this is somehow related to the hangs we're getting after ~100k output when podman is run via salt (https://github.com/ceph/ceph-salt/issues/130).
This works fine for me:
CEPHADM_IMAGE=192.168.121.1:5000/devel/storage/7.0/cr/containers/ses/7/ceph/ceph cephadm --verbose bootstrap --mon-ip 10.20.146.200 --config /root/ceph.conf --initial-dashboard-user admin --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --skip-prepare-host --skip-pull --skip-ssh --allow-overwrite > /var/log/ceph/cephadm.log 2>&1
...whereas this dies with truncated output and thus a json.decoder.JSONDecodeError:
salt master.ses7-mini.com cmd.run cmd='unset NOTIFY_SOCKET; CEPHADM_IMAGE=192.168.121.1:5000/devel/storage/7.0/cr/containers/ses/7/ceph/ceph cephadm --verbose bootstrap --mon-ip 10.20.146.200 --config /root/ceph.conf --initial-dashboard-user admin --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --skip-prepare-host --skip-pull --skip-ssh --allow-overwrite > /var/log/ceph/cephadm.log.via-salt 2>&1'
It's not always truncated at the same point either. One run was at 141302 chars, another at 129024 chars, a third at 100352, a fourth at 96256. I rebooted and tried the same command again and it truncated at 90112. Odd that each run seems to truncate sooner than the last... Note that I didn't really clean up the old clusters created; I just stopped them with `systemctl stop ceph.target`.
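One pattern in those numbers (an observation, not a diagnosis): four of the five truncation offsets are exact multiples of 2048, which would be consistent with the output being cut at a buffer boundary rather than at any particular JSON token. A quick check:

```python
# Truncation offsets (in chars) reported in the runs above.
offsets = [141302, 129024, 100352, 96256, 90112]

for n in offsets:
    tag = "multiple of 2048" if n % 2048 == 0 else "not a multiple of 2048"
    print(n, tag)
```

The error offsets in the two tracebacks (135168 and 107520) are 66 * 2048 and 105 * 1024 respectively, so most (though not all) of the cut points land on power-of-two boundaries.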
Updated by Sebastian Wagner about 4 years ago
interesting. does it work when using https://github.com/ceph/ceph/pull/34031 ?
Updated by Tim Serong about 4 years ago
Sebastian Wagner wrote:
interesting. does it work when using https://github.com/ceph/ceph/pull/34031 ?
Yeah, it works with that patch applied.
This is nuts.
Updated by Tim Serong about 4 years ago
OK, I've now seen it work at least once without that patch applied, and I've also seen it fail at least once without salt in the picture (and also without that patch applied).
Updated by Tim Serong about 4 years ago
To try to obtain some further clarity, I ran this:
for n in $(seq 1 10); do sleep 5; systemctl stop ceph.target ; sleep 5 ; CEPHADM_IMAGE=192.168.121.1:5000/devel/storage/7.0/cr/containers/ses/7/ceph/ceph cephadm --verbose bootstrap --mon-ip 10.20.59.200 --initial-dashboard-user admin --output-keyring /etc/ceph/ceph.client.admin.keyring --output-config /etc/ceph/ceph.conf --skip-prepare-host --skip-pull --skip-ssh --allow-overwrite > /var/log/ceph/cephadm.log.$n 2>&1 ; done
Doing that with ceph 15.1.1-190-g5d09c481a4, I hit the truncated output five times, and it completed successfully the other five times.
I repeated the above loop with https://github.com/ceph/ceph/pull/34031 applied, and it worked fine all ten times.
I'm not sure we should apply that patch though, because I'm concerned that while it seems to fix the problem, we still don't know the underlying cause of the truncated output.
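If the worry is that the patch only papers over the truncation, one defensive option is to retry the dump whenever the JSON fails to parse and give up only after several attempts. A purely hypothetical sketch, not the actual cephadm code: run_ceph_cmd is a stand-in name for whatever runs the command in the container.

```python
import json
import time

def get_mgr_dump(run_ceph_cmd, attempts=5, delay=2):
    """Run 'ceph mgr dump' and parse it, retrying on truncated output.

    run_ceph_cmd is a caller-supplied function (hypothetical here) that
    returns the command's stdout as a string.
    """
    last_err = None
    for _ in range(attempts):
        out = run_ceph_cmd(['mgr', 'dump'])
        try:
            return json.loads(out)
        except json.JSONDecodeError as e:
            last_err = e      # likely truncated output; try again
            time.sleep(delay)
    raise last_err

# Example with a fake runner that returns truncated output once,
# then complete output:
outputs = iter(['{"epoch": 5, "mod', '{"epoch": 5, "modules": {}}'])
print(get_mgr_dump(lambda cmd: next(outputs), delay=0))
# → {'epoch': 5, 'modules': {}}
```

This would mask intermittent truncation without hiding a persistent one, though it still doesn't explain where the bytes go.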
I'm going to test again (same ten runs on SLE 15 SP2), but with the latest master image from registry.opensuse.org, and also the latest from upstream.
Updated by Tim Serong about 4 years ago
Ten runs using cephadm from https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm (so the container image is docker.io/ceph/ceph:v15), the first one failed with truncated output, the following nine succeeded.
Updated by Tim Serong about 4 years ago
Just ran another loop of ten with https://github.com/ceph/ceph/raw/octopus/src/cephadm/cephadm and docker.io/ceph/ceph:v15, on the same test VM, without re-imaging, just to make sure it wasn't only the first run that failed. Hit one more truncated output failure on the sixth run. So it's still there, and intermittent, just less often than in my previous tests with older downstream images.
Updated by Sebastian Wagner about 4 years ago
while sleep 1 ; do cephadm shell -- ceph mgr dump | wc -l ; done
works perfectly every time
Updated by Sebastian Wagner about 4 years ago
master:~ # diff -u cephadm /usr/sbin/cephadm
--- cephadm     2020-03-26 16:14:40.596852119 +0100
+++ /usr/sbin/cephadm   2020-03-26 16:20:44.014178437 +0100
@@ -2594,8 +2594,9 @@
                    envs=args.env,
                    privileged=True)
     command = c.shell_cmd(command)
-
-    return call_timeout(command, args.timeout)
+
+    return call(command, verbose_on_failure=False)[0]
+    #return call_timeout(command, args.timeout)
 ##################################
master:~ # while true ; do cephadm shell -- ceph mgr dump 2>&1 | wc -c ; done
152521
152521
152521
152521
... (152521 on every iteration)
^C
also works perfectly
Updated by Tim Serong about 4 years ago
Same here: `cephadm shell -- ceph mgr dump` seems perfectly happy for me too. So it's only a problem during bootstrap somehow?
Updated by Sage Weil about 4 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to octopus
Updated by Sage Weil about 4 years ago
- Status changed from Pending Backport to Resolved
Updated by Nathan Cutler about 4 years ago
Backported to octopus by https://github.com/ceph/ceph/pull/34258