Bug #53751
"N monitors have not enabled msgr2" is always shown for new clusters
Description
I am experiencing that for new clusters (currently Ceph 16.2.7), `ceph status` always shows e.g.:
3 monitors have not enabled msgr2
I read the docs at https://docs.ceph.com/en/pacific/rados/configuration/msgr2/ and https://docs.ceph.com/en/pacific/rados/operations/health-checks/#mon-msgr2-not-enabled which suggest that running
ceph mon enable-msgr2
fixes the issue, but it only does so for mons that were already added. When I add another mon, the error appears again.
Is there a story for being able to make msgr2 enabled by default for all new mons for new clusters, so that declarative cluster bootstraps can come up warning-free?
Thanks!
History
#1 Updated by Niklas Hambuechen about 2 years ago
Another thing I don't understand from the docs:
https://docs.ceph.com/en/pacific/rados/configuration/msgr2/#transitioning-from-v1-only-to-v2-plus-v1
By default, `ms_bind_msgr2` is true starting with Nautilus 14.2.z.
Why then does `ceph health detail` show on new clusters:
mon.test-node-1 is not bound to a msgr2 port, only v1:10.0.0.5:6789/0
Indeed the config says that binding is enabled, so why is it "not bound" then?
- ceph config get mon.test-node-1 ms_bind_msgr2
true
Finally https://docs.ceph.com/en/pacific/rados/configuration/network-config-ref/#monitor-ip-tables says:
Ceph Monitors listen on ports `3300` and `6789` by default
But that's not the case for my new cluster, `netstat -antp` shows only `:6789`.
On these systems, `ceph-mon`'s command line is:
ceph-mon -f --cluster ceph --id benaco-node-5 --setuser ceph --setgroup ceph
and `ceph.conf` is:
[global]
fsid = d9000ec0-93c2-479f-bd5d-94ae9673e347
mon initial members = test-node-1,test-node-2,test-node-3
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
# By default, new Ceph clusters go into WARN health mode, until
# the following setting is made strict by setting it to `false`:
# See: https://docs.ceph.com/en/latest/security/CVE-2021-20288/#recommendations
# As of writing, this setting is not documented outside of the CVE note :(
auth_allow_insecure_global_id_reclaim = false
public network = 10.0.0.0/24
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
and I've also tried with:
mon_host = v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0,v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0,v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0
So it doesn't look like any nonstandard config turns the port off.
#2 Updated by Neha Ojha about 2 years ago
Maybe you are missing the square brackets when specifying the mon_host like in https://docs.ceph.com/en/pacific/rados/configuration/msgr2/#updating-ceph-conf-and-mon-host? worth giving it a try.
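For illustration, the bracketed form groups each monitor's v2 and v1 addresses inside one pair of square brackets. A minimal shell sketch that builds such a value from plain IPs, assuming the default ports 3300 (v2) and 6789 (v1); `mon_host_line` is a hypothetical helper name, not a Ceph tool:

```shell
# Hypothetical helper (not part of Ceph): print a bracketed v2+v1 mon_host
# line for the given monitor IPs. Assumes default ports 3300 and 6789.
# The body runs in a subshell so its variables do not leak to the caller.
mon_host_line() (
    out=""
    for ip in "$@"; do
        # Append one "[v2:...,v1:...]" group per monitor, comma-separated.
        out="${out:+$out,}[v2:${ip}:3300/0,v1:${ip}:6789/0]"
    done
    printf 'mon_host = %s\n' "$out"
)
```

Calling `mon_host_line 10.0.0.1 10.0.0.2 10.0.0.3` prints a line that can be pasted under `[global]` in `ceph.conf`.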
#3 Updated by Neha Ojha about 2 years ago
- Project changed from Ceph to RADOS
#4 Updated by Niklas Hambuechen about 2 years ago
Neha Ojha wrote:
Maybe you are missing the square brackets when specifying the mon_host like in https://docs.ceph.com/en/pacific/rados/configuration/msgr2/#updating-ceph-conf-and-mon-host? worth giving it a try.
I'm quite sure I tried with square brackets as well that day (unfortunately Redmine doesn't allow editing posts, so I didn't write it down).
In any case, the docs say that it should also work with just
mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
right?
#5 Updated by Neha Ojha about 2 years ago
- Status changed from New to Need More Info
Can you share the output of "ceph mon dump"? And how did you install this cluster? We are not seeing this issue in a 16.2.7 cluster installed using cephadm.
#6 Updated by Niklas Hambuechen about 2 years ago
I installed the cluster using the "Manual Deployment" method (https://docs.ceph.com/en/pacific/install/manual-deployment/).
But I must have made a mistake somewhere. I can no longer reproduce any of the above: It works fine now without brackets, without `ceph mon enable-msgr2` and even without `auth_allow_insecure_global_id_reclaim = false`, and the netstat is listening as expected.
I'm very sorry for the noise -- this can be closed. I will report if I can reproduce it in the future.
#7 Updated by Niklas Hambuechen about 2 years ago
Hmm, I've just tried to get rid of
auth_allow_insecure_global_id_reclaim = false
on my production cluster, and `N monitors have not enabled msgr2` reappeared.
Is it possible that only fully-from-scratch deployed clusters do not need to configure this?
I got that from https://docs.ceph.com/en/latest/security/CVE-2021-20288/#recommendations
I had assumed that `ceph mon enable-msgr2` would make that unnecessary, but maybe there's more nuance to that?
#8 Updated by Radoslaw Zarzynski about 2 years ago
Hello. Could you please provide the output from `ceph health detail`? We suspect the warning might have been replaced with another one. Just to double check.
#9 Updated by Niklas Hambuechen 11 months ago
The fundamental issue here seems to be that in my newly deployed test cluster, nothing listens on port 3300 even though "ms_bind_msgr2" is true.
# ceph daemon mon.test-node-1 config show | grep ms_bind_msgr2
"ms_bind_msgr2": "true",
# netstat -antp | grep ceph-mon
tcp 0 0 10.0.0.1:6789 0.0.0.0:* LISTEN 100450/ceph-mon
tcp 0 0 10.0.0.1:58542 10.0.0.2:6789 ESTABLISHED 100450/ceph-mon
tcp 0 0 10.0.0.1:42444 10.0.0.3:6789 ESTABLISHED 100450/ceph-mon
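The check above can be scripted. A small sketch that filters `netstat -antp` output down to the ports on which `ceph-mon` is listening; `mon_listen_ports` is a hypothetical helper, and it assumes the usual seven-column netstat layout (protocol, queues, local address, foreign address, state, PID/program):

```shell
# Hypothetical helper (not part of Ceph): read `netstat -antp` output on
# stdin and print each port on which ceph-mon is in LISTEN state, one per
# line in ascending numeric order.
mon_listen_ports() {
    awk '$6 == "LISTEN" && $7 ~ /\/ceph-mon$/ {
        # The port is the last colon-separated piece of the local address.
        n = split($4, parts, ":")
        print parts[n]
    }' | sort -n
}
```

With `netstat -antp | mon_listen_ports`, a monitor bound to both protocols would be expected to report `3300` and `6789`; the output in this comment would yield only `6789`.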
I re-tried it now with bracket syntax, does not help:
[global]
mon_host = [v2:10.0.0.1:3300/0,v1:10.0.0.1:6789/0],[v2:10.0.0.2:3300/0,v1:10.0.0.2:6789/0],[v2:10.0.0.3:3300/0,v1:10.0.0.3:6789/0]
#10 Updated by Radoslaw Zarzynski 11 months ago
Hello Niklas!
Thanks for getting back to it! Could you please collect the monitor's logs with debug_ms=20 and debug_mon=20 during the bootup? Perhaps the underlying bind(2) syscall has failed for 3300.
#11 Updated by Niklas Hambuechen 11 months ago
Hi Radoslaw, before that, a quick thing for your consideration I just found:
Running monmaptool is step 13 in https://docs.ceph.com/en/pacific/install/manual-deployment/
I believe that the issue is that `monmaptool --create` generates only v1 addresses, contrary to the docs at:
https://docs.ceph.com/en/pacific/man/8/monmaptool/#cmdoption-monmaptool-add
If the nautilus feature is set, and the port is not, the monitor will be added for both messenger protocols.
Repro:
# monmaptool --create --add test-node-1 '10.0.0.1' --fsid 478b062f-a6e4-4ddf-96e0-7cdad91816e4 testmonmap
monmaptool: monmap file testmonmap
monmaptool: set fsid to 478b062f-a6e4-4ddf-96e0-7cdad91816e4
monmaptool: writing epoch 0 to testmonmap (1 monitors)
# monmaptool --print testmonmap
monmaptool: monmap file testmonmap
epoch 0
fsid 478b062f-a6e4-4ddf-96e0-7cdad91816e4
last_changed 2023-04-26T21:50:00.764055+0200
created 2023-04-26T21:50:00.764055+0200
min_mon_release 0 (unknown)
election_strategy: 1
0: v1:10.0.0.1:6789/0 mon.test-node-1
In contrast, this works with --addv (with a `v` at the end):
monmaptool --create --addv test-node-1 '[v2:10.0.0.1:3300,v1:10.0.0.1:6789]' --fsid 478b062f-a6e4-4ddf-96e0-7cdad91816e4 testmonmap-v2
Neither of the following helps:
--feature-set nautilus
--feature-set pacific
--set-min-mon-release nautilus
This works:
--enable-all-features
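To spot the problem in an existing monmap, a sketch along these lines lists monitors whose address entry carries no v2 address. `mons_without_v2` is a hypothetical helper, not a Ceph command; it assumes the `N: <addrs> mon.<name>` line format seen in the `monmaptool --print` output in this comment, and that v2-capable entries contain the string `v2:`.

```shell
# Hypothetical helper (not part of Ceph): read `monmaptool --print` output
# on stdin and print the name of every monitor whose address field contains
# no "v2:" entry. Assumes lines of the form "<rank>: <addrs> mon.<name>".
mons_without_v2() {
    awk '/^[0-9]+: / && $NF ~ /^mon\./ && $2 !~ /v2:/ {
        name = $NF
        sub(/^mon\./, "", name)   # strip the "mon." prefix
        print name
    }'
}
```

Usage would be `monmaptool --print testmonmap | mons_without_v2`; an empty result means every listed monitor has a v2 address. For the `--add` repro in this comment, it would print `test-node-1`.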
#12 Updated by Radoslaw Zarzynski 11 months ago
This might be a doc issue, but I'm not sure. Bumping for deep bug scrub.