
Bug #56485

ceph orch upgrade stuck, ceph orch not updating

Added by Guillaume Lefranc over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Ceph upgrade started with:

$ ceph orch upgrade start --ceph-version 16.2.9

caused the following error message on one of the nodes after some time:

debug 2022-07-06T15:53:34.429+0000 7ff0f9a36700 0 [cephadm ERROR cephadm.serve] cephadm exited with an error code: 1, stderr:Pulling container image 16.2.9...
Non-zero exit code 1 from /usr/bin/docker pull 16.2.9
/usr/bin/docker: stdout Using default tag: latest
/usr/bin/docker: stderr Error response from daemon: pull access denied for 16.2.9, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
ERROR: Failed command: /usr/bin/docker pull 16.2.9
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1429, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1326, in _run_cephadm
    code, '\n'.join(err)))
orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image 16.2.9...
Non-zero exit code 1 from /usr/bin/docker pull 16.2.9

The error above appears in a loop.

Following that, even if `ceph orch upgrade stop` is attempted, `ceph orch ps` stops updating. Nothing is possible with the orchestrator anymore; e.g. redeploying or restarting services no longer works.
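Note that the log shows `docker pull 16.2.9`, i.e. the bare version string was passed to Docker as the image name. For comparison, a fully qualified reference (the quay.io path suggested later in this thread) would be pullable:

$ docker pull quay.io/ceph/ceph:v16.2.9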

History

#1 Updated by Guillaume Lefranc over 1 year ago

debug logs:

2022-07-06T18:46:27.417593+0200 mgr.ceph-12.rdtjyq [DBG] Sleeping for 60 seconds
2022-07-06T18:47:27.445452+0200 mgr.ceph-12.rdtjyq [DBG] mon_command: 'config dump' -> 0 in 0.028s
2022-07-06T18:47:27.445727+0200 mgr.ceph-12.rdtjyq [DBG] _run_cephadm : command = pull
2022-07-06T18:47:27.445773+0200 mgr.ceph-12.rdtjyq [DBG] _run_cephadm : args = []
2022-07-06T18:47:27.445849+0200 mgr.ceph-12.rdtjyq [DBG] Have connection to 10.0.10.6
2022-07-06T18:47:27.445901+0200 mgr.ceph-12.rdtjyq [DBG] args: --image 16.2.9 pull
2022-07-06T18:47:29.211196+0200 mgr.ceph-12.rdtjyq [DBG] code: 1
2022-07-06T18:47:29.211265+0200 mgr.ceph-12.rdtjyq [DBG] err: Pulling container image 16.2.9...

#2 Updated by Adam King over 1 year ago

If cephadm is well and truly stuck, the best thing to do might be a mgr failover: "ceph mgr fail". That will at least get cephadm to restart and try something else. Often when it gets stuck like this, it has hit some sort of exception. Do you see anything relevant in "ceph health"? Also, I know we have the --ceph-version flag for the upgrade start command, but I'm always going to recommend using --image and just specifying the image (e.g. quay.io/ceph/ceph:v16.2.9). That option is tested significantly more than --ceph-version. Now that I think about it, I'm not even 100% sure --ceph-version has been working since the move from docker to quay.
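Concretely, the recovery sequence suggested above would look something like this (image path as given above; adjust for your target release):

$ ceph mgr fail
$ ceph health
$ ceph orch upgrade stop
$ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.9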

#3 Updated by Guillaume Lefranc over 1 year ago

Yes, I have tried a mgr failover, but to no effect: the next mgr carried on with the same task.
I assume there is some kind of persistence to the mgr state. The only way I could solve it, and it was a bit annoying, was to edit the cephadm source code on another mgr to work around the issue: I just hardcoded the image version; I did not find a permanent solution. That allowed cephadm to continue with the deployment and also refresh the cluster state in the orchestrator.

About the --ceph-version flag, the documentation is slightly misleading: the official https://docs.ceph.com/en/quincy/cephadm/upgrade/ page suggests using --ceph-version, though further down the page it says:

From version v16.2.6 the Docker Hub registry is no longer used, so if you use Docker you have to point it to the image in the quay.io registry

In this case I think the --ceph-version flag should at least be disabled when the requested version is 16.2.6 or higher, since, as demonstrated, it can cause the upgrade operation, and by consequence the whole orchestrator, to be stuck indefinitely.
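To illustrate, a guard of the kind proposed here could look roughly like the following Python sketch (purely illustrative; the function name and error text are made up, and this is not actual cephadm code):

# Hypothetical sketch of the proposed guard, not cephadm source.
from packaging.version import Version

QUAY_ONLY_CUTOFF = Version("16.2.6")  # releases >= this are only on quay.io

def resolve_upgrade_image(ceph_version=None, image=None):
    """Map the upgrade flags to a pullable image reference (illustrative)."""
    if image:
        return image
    if ceph_version is None:
        raise ValueError("either --image or --ceph-version is required")
    if Version(ceph_version) >= QUAY_ONLY_CUTOFF:
        # Refuse the bare version up front instead of letting
        # `docker pull 16.2.9` fail in a loop as seen in this ticket.
        raise ValueError(
            "--ceph-version is not supported for %s; use "
            "--image quay.io/ceph/ceph:v%s instead" % (ceph_version, ceph_version)
        )
    return "docker.io/ceph/ceph:v%s" % ceph_version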
