Bug #39355


running ceph command on a partially upgraded cluster might fail

Added by Guillaume Abrioux about 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Trying to perform an upgrade from mimic to octopus fails because running a ceph command on a partially upgraded cluster might end up with an error.

Environment:

mon0 192.168.1.10
mon1 192.168.1.11
mon2 192.168.1.12
mgr0 192.168.1.30
osd0 192.168.1.100
osd1 192.168.1.101

The upgrade proceeds node by node: mon0, mon1, mon2, mgr0, osd0, and osd1.

Once mon0 has been upgraded, here is what the cluster looks like (`ceph versions` output):

{
    "mon": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 2,
        "ceph version 15.0.0-408-gc74cffd (c74cffd8a8529f99a46bad67803112be483a81ba) octopus (dev)": 1
    },
    "mgr": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 1
    },
    "osd": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 4
    },
    "mds": {},
    "overall": {
        "ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)": 7,
        "ceph version 15.0.0-408-gc74cffd (c74cffd8a8529f99a46bad67803112be483a81ba) octopus (dev)": 1
    }
}
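
As an aside, here is a minimal sketch of how an upgrade tool could detect this mixed-version state from that output, assuming `ceph versions` itself can still be run successfully (hypothetical helper, not part of ceph-ansible):

import json
import subprocess

# Hypothetical check (not part of ceph-ansible): report whether daemons are
# running more than one ceph version, i.e. whether an upgrade is in progress.
# `ceph versions` prints JSON shaped like the output shown above.
out = subprocess.check_output(['ceph', 'versions']).decode('utf-8')
versions = json.loads(out)

mixed = len(versions.get('overall', {})) > 1
print('partially upgraded cluster' if mixed else 'all daemons on the same version')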

From either mon1 or mon2, a basic command like `ceph -s` might fail, depending on which monitor actually ends up executing it:

[root@mon1 ~]# ceph -s
Traceback (most recent call last):
  File "/bin/ceph", line 1222, in <module>
    retval = main()
  File "/bin/ceph", line 1146, in main
    sigdict = parse_json_funcsigs(outbuf.decode('utf-8'), 'cli')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 788, in parse_json_funcsigs
    cmd['sig'] = parse_funcsig(cmd['sig'])
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 728, in parse_funcsig
    raise JsonFormat(s)
ceph_argparse.JsonFormat: unknown type CephBool
[root@mon1 ~]# ceph -s
Traceback (most recent call last):
  File "/bin/ceph", line 1222, in <module>
    retval = main()
  File "/bin/ceph", line 1146, in main
    sigdict = parse_json_funcsigs(outbuf.decode('utf-8'), 'cli')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 788, in parse_json_funcsigs
    cmd['sig'] = parse_funcsig(cmd['sig'])
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 728, in parse_funcsig
    raise JsonFormat(s)
ceph_argparse.JsonFormat: unknown type CephBool
[root@mon1 ~]# ceph -s   # <---- this time the command must have run on mon0
  cluster:
    id:     68d9bc4b-ac11-43e0-850c-61a78a188b78
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            too few PGs per OSD (4 < min 30)

  services:
    mon: 3 daemons, quorum mon0,mon1,mon2 (age 2h)
    mgr: mon0(active)
    osd: 4 osds: 4 up, 4 in
         flags noout,norebalance

To verify this, I added the `-m` flag to the same command to force execution on mon0, the upgraded monitor, still from mon1:

[root@mon1 ~]# ceph -m 192.168.1.10:6789 -s
  cluster:
    id:     68d9bc4b-ac11-43e0-850c-61a78a188b78
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            too few PGs per OSD (4 < min 30)

  services:
    mon: 3 daemons, quorum mon0,mon1,mon2 (age 2h)
    mgr: mon0(active)
    osd: 4 osds: 4 up, 4 in
         flags noout,norebalance

With `-m 192.168.1.10:6789`, the command never fails.
As soon as I run the same command with `-m 192.168.1.11:6789` or `-m 192.168.1.12:6789`, it always fails:

[root@mon1 ~]# ceph -m 192.168.1.11:6789 -s
Traceback (most recent call last):
  File "/bin/ceph", line 1222, in <module>
    retval = main()
  File "/bin/ceph", line 1146, in main
    sigdict = parse_json_funcsigs(outbuf.decode('utf-8'), 'cli')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 788, in parse_json_funcsigs
    cmd['sig'] = parse_funcsig(cmd['sig'])
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 728, in parse_funcsig
    raise JsonFormat(s)
ceph_argparse.JsonFormat: unknown type CephBool
[root@mon1 ~]# ceph -m 192.168.1.12:6789 -s
Traceback (most recent call last):
  File "/bin/ceph", line 1222, in <module>
    retval = main()
  File "/bin/ceph", line 1146, in main
    sigdict = parse_json_funcsigs(outbuf.decode('utf-8'), 'cli')
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 788, in parse_json_funcsigs
    cmd['sig'] = parse_funcsig(cmd['sig'])
  File "/usr/lib/python2.7/site-packages/ceph_argparse.py", line 728, in parse_funcsig
    raise JsonFormat(s)
ceph_argparse.JsonFormat: unknown type CephBool
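
For context, the traceback is raised by the locally installed `ceph_argparse` while it validates the command signatures returned by the monitor: a parser that predates the `CephBool` argument type rejects any signature that uses it. A simplified sketch of that validation step (illustrative only, not the actual `ceph_argparse` code):

# Simplified illustration of the failing check, not the real ceph_argparse code.
class JsonFormat(Exception):
    pass

# Argument types an older (mimic-era) parser knows about -- illustrative subset.
KNOWN_TYPES = {'CephInt', 'CephFloat', 'CephString', 'CephChoices', 'CephPrefix'}

def parse_funcsig(sig):
    """Validate one command signature received from a monitor."""
    for arg in sig:
        if isinstance(arg, dict) and arg.get('type') not in KNOWN_TYPES:
            # A signature advertising the newer 'CephBool' type is one the
            # old parser has never heard of, hence the traceback above.
            raise JsonFormat('unknown type %s' % arg['type'])
    return sig

try:
    parse_funcsig([{'name': 'detail', 'type': 'CephBool'}])
except JsonFormat as e:
    print('JsonFormat: %s' % e)   # JsonFormat: unknown type CephBool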

In ceph-ansible, when we upgrade a cluster, we have to run ceph commands before and after each node is upgraded.
Until now, depending on which monitor node those commands landed on, the upgrade could fail right after the first monitor was upgraded.
I worked around this bug by using the `-m` flag, but I think it is still worth opening an issue for this behavior.
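
A minimal sketch of that workaround, assuming the address of an already-upgraded monitor is known (mon0 at 192.168.1.10 in the environment above); the wrapper itself is hypothetical:

import subprocess

# Hypothetical wrapper: pin every CLI call to the monitor that has already been
# upgraded, so every command is served by the same, known-good monitor.
UPGRADED_MON = '192.168.1.10:6789'   # mon0 in the environment above

def ceph(*args):
    """Run a ceph command against the upgraded monitor and return its stdout."""
    cmd = ['ceph', '-m', UPGRADED_MON] + list(args)
    return subprocess.check_output(cmd).decode('utf-8')

print(ceph('-s'))   # equivalent to: ceph -m 192.168.1.10:6789 -s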


Related issues: 1 (1 open, 0 closed)

Related to Ceph - Bug #41535: Trying to upgrade from Ceph Mimic to Nautilus can fail (New, 08/27/2019)

Actions #1

Updated by Igor Fedotov about 5 years ago

  • Project changed from bluestore to Ceph
Actions #2

Updated by Greg Farnum almost 5 years ago

  • Status changed from New to Closed

15.0.0 is obviously an in-development release; I believe I saw PRs go by fixing up issues with CephBool.

Actions #3

Updated by Chris MacNaughton over 4 years ago

I've seen exactly this error when upgrading from Mimic to Nautilus.

Actions #4

Updated by Chris MacNaughton over 4 years ago

I've opened https://tracker.ceph.com/issues/41535 to track this issue for Mimic->Nautilus.

Actions #5

Updated by Nathan Cutler over 4 years ago

  • Related to Bug #41535: Trying to upgrade from Ceph Mimic to Nautilus can fail added