Bug #58145
orch/cephadm: nfs tests failing to mount exports ('mount -t nfs 10.0.31.120:/fake /mnt/foo' fails)
Description
Since the sepia lab recovered, all tests that attempt to mount NFS exports have stopped passing. They all fail on some variation of
mount -t nfs 10.0.31.120:/fake /mnt/foo
Additional error messages include
mounting 10.0.31.120:/fake failed, reason given by server: No such file or directory
requested NFS version or transport protocol is not supported
mount.nfs: Protocol not supported
This has also been reproducible outside of teuthology tests and the sepia lab in general (at least the "mounting 10.0.31.120:/fake failed, reason given by server: No such file or directory" case), so it is likely not a problem with the images used for testing (or, if there is an issue there, it is not the only one).
This is seen both when using nfs through an ingress service and when mounting the export from the standalone nfs service directly.
Some examples from a main baseline test run (https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/)
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089333
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089334
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089339
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089351
Updated by John Mulligan over 1 year ago
I attempted to debug this situation locally on a 3-node VM cluster. I am able to reproduce the case where mount.nfs fails with 'No such file or directory'.
We first investigated the mgr nfs module, but it appears to be functioning as expected. It creates the nfs-ganesha containers and populates the .nfs rados pool with configuration objects. The cluster was deployed using cephadm from 'main' branch.
Performing the following steps I reproduced the behavior seen in some of the test runs:
On node 0:
[ceph@ceph0 ~]$ sudo cephadm shell
Inferring fsid 0a922e44-7195-11ed-8137-525400220000
Inferring config /var/lib/ceph/0a922e44-7195-11ed-8137-525400220000/mon.ceph0/config
Using ceph image with id '1606959841d3' and tag 'main' created on 2022-12-01 16:19:23 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:d7f07a8dc58edb9e4a6e64966a36cf3fd5e52698983308fd7a75d4c18fa957c3
[ceph: root@ceph0 /]# ceph -s
  cluster:
    id:     0a922e44-7195-11ed-8137-525400220000
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph0,ceph1,ceph2 (age 26m)
    mgr: ceph0.bbwifl(active, since 30m), standbys: ceph1.pmrxrr
    osd: 6 osds: 6 up (since 26m), 6 in (since 26m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   921 MiB used, 29 GiB / 30 GiB avail
    pgs:     1 active+clean

[ceph: root@ceph0 /]# ceph fs volume create fs1
[ceph: root@ceph0 /]# ceph fs volume ls
[
    {
        "name": "fs1"
    }
]
[ceph: root@ceph0 /]# ceph nfs cluster create nfs1
[ceph: root@ceph0 /]# ceph nfs export create cephfs --cluster-id=nfs1 --pseudo-path=/fs1 --path=/ --fsname=fs1
{
    "bind": "/fs1",
    "fs": "fs1",
    "path": "/",
    "cluster": "nfs1",
    "mode": "RW"
}
[ceph: root@ceph0 /]# ceph orch ps | grep nfs
nfs.nfs1.0.0.ceph0.mhjcwu  ceph0  *:2049  running (2m)  2m ago  2m  17.6M  -  4.2  1606959841d3  49e76424dbdf
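(As a sanity check, the export as seen by the mgr nfs module can also be inspected from the cephadm shell; a brief sketch using the standard export commands and the names from the run above:)

[ceph: root@ceph0 /]# ceph nfs export ls nfs1
[ceph: root@ceph0 /]# ceph nfs export info nfs1 /fs1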
On node 1:
[ceph@ceph1 ~]$ sudo mount.nfs ceph0.cx.fdopen.net:/fs1 /mnt
mount.nfs: mounting ceph0.cx.fdopen.net:/fs1 failed, reason given by server: No such file or directory
Prior to running the commands on node 1, I edited the unit.run file to start ganesha with NIV_DEBUG. The logs are attached as ganesha-log-2022-12-01-01.txt.xz.
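(For anyone reproducing this: the unit.run contents vary per deployment, so the fragment below is only illustrative; the relevant change is passing the debug level to the ganesha.nfsd invocation inside the container:)

# Illustrative fragment of the nfs daemon's unit.run; real paths and surrounding arguments will differ.
# ... podman run ... ganesha.nfsd -F -L STDERR -N NIV_DEBUG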
I examined the config object created by the mgr module:
[ceph: root@ceph0 /]# rados get --pool=.nfs --namespace=nfs1 export-1 /dev/stdout
EXPORT {
    FSAL {
        name = "CEPH";
        user_id = "nfs.nfs1.1";
        filesystem = "fs1";
        secret_access_key = "AQBE3ohj6c+aCRAAdOAqF2oNHTYbargyiX2bnw==";
    }
    export_id = 1;
    path = "/";
    pseudo = "/fs1";
    access_type = "RW";
    squash = "none";
    attr_expiration_time = 0;
    security_label = true;
    protocols = 4;
    transports = "TCP";
}
Nothing jumped out at me as wrong. Logs showed:
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] reclaim_reset :FSAL :DEBUG :Issuing reclaim reset for ganesha-nfs.nfs1.0-0001
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] create_export :FSAL :DEBUG :Ceph module export /.
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] dirmap_lru_init :NFS READDIR :DEBUG :Skipping dirmap Ceph/MDC
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :WARN :A protocol is specified for export 1 that is not enabled in NFS_CORE_PARAM, fixing up
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :INFO :Export 1 created at pseudo (/fs1) with path (/) and tag ((null)) perms (options=020031e0/077801e7 no_root_squash, RWrw, ---, ---, TCP, ----, , , , , expire= 0)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :INFO :Export 1 has 0 defined clients
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] build_default_root :EXPORT :DEBUG :Allocating Pseudo root export
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] pseudofs_create_export :FSAL :DEBUG :Created exp 0x15bb200 - /
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] dirmap_lru_init :NFS READDIR :DEBUG :Skipping dirmap PSEUDO/MDC
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] build_default_root :CONFIG :INFO :Export 0 (/) successfully created
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] ReadExports :EXPORT :INFO :Export 1 pseudo (/fs1) with path (/) and tag ((null)) perms (options=020031e0/077801e7 no_root_squash, RWrw, ---, ---, TCP, ----, , , , , expire= 0)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] ReadExports :EXPORT :INFO :Export 0 pseudo (/) with path (/) and tag ((null)) perms (options=0221f080/0771f3e7 no_root_squash, --r-, -4-, ---, TCP, ----, , , , , , none, sys, krb5, krb5i, krb5p)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] lower_my_caps :NFS STARTUP :EVENT :currently set capabilities are: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap=ep
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_pkginit :DBUS :DEBUG :init
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_pkginit :DBUS :CRIT :dbus_bus_get failed (Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] nfs_Init :NFS STARTUP :DEBUG :Now building NFSv4 ACL cache
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] nfs4_acls_init :NFS4 ACL :DEBUG :Initialize NFSv4 ACLs
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] nfs4_acls_init :NFS4 ACL :DEBUG :sizeof(fsal_ace_t)=20, sizeof(fsal_acl_t)=80
And at the connection attempt from the client:
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] cih_get_by_key_latch :HT CACHE :DEBUG :cih cache hit slot 18744
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] complete_op :NFS4 :DEBUG :Status of OP_PUTFH in position 1 = NFS4_OK, op response size is 4 total response size is 84
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] process_one_op :NFS4 :DEBUG :Request 2: opcode 15 is OP_LOOKUP
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] nfs4_op_lookup :NFS4 :DEBUG :name=fs1
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] mdc_lookup :NFS READDIR :DEBUG :Cache Miss detected for fs1
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup fs1 failed with No such file or directory
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] complete_op :NFS4 :DEBUG :Status of OP_LOOKUP in position 2 = NFS4ERR_NOENT, op response size is 4 total response size is 92
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] complete_nfs4_compound :NFS4 :DEBUG :End status = NFS4ERR_NOENT lastindex = 3
Versions:
[ceph@ceph0 ~]$ sudo podman exec -it ceph-0a922e44-7195-11ed-8137-525400220000-nfs-nfs1-0-0-ceph0-mhjcwu bash
[root@ceph0 /]# ganesha.nfsd -v
NFS-Ganesha Release = V4.2
[ceph@ceph0 ~]$ sudo podman image ls
REPOSITORY                        TAG      IMAGE ID      CREATED         SIZE
quay.ceph.io/ceph-ci/ceph         main     1606959841d3  57 minutes ago  1.4 GB
quay.io/ceph/ceph-grafana         8.3.5    dad864ee21e9  7 months ago    571 MB
quay.io/prometheus/prometheus     v2.33.4  514e6a882f6e  9 months ago    205 MB
quay.io/prometheus/node-exporter  v1.3.1   1dbe0e931976  12 months ago   22.3 MB
quay.io/prometheus/alertmanager   v0.23.0  ba2b418f427c  15 months ago   58.9 MB
I also tried the same using a cephfs subvolume path, after getting the path with ceph fs subvolume getpath. The mount.nfs result was the same: "reason given by server: No such file or directory". I can provide logs for this as well if requested.
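(The subvolume variant was along these lines; the subvolume name and the uuid component of the returned path are placeholders, not the exact values from my run:)

[ceph: root@ceph0 /]# ceph fs subvolume create fs1 sv1
[ceph: root@ceph0 /]# ceph fs subvolume getpath fs1 sv1
/volumes/_nogroup/sv1/<uuid>
[ceph: root@ceph0 /]# ceph nfs export create cephfs --cluster-id=nfs1 --pseudo-path=/sv1 --path=/volumes/_nogroup/sv1/<uuid> --fsname=fs1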
Updated by Laura Flores over 1 year ago
- Related to Bug #58096: test_cluster_set_reset_user_config: NFS mount fails due to missing ceph directory added
Updated by Ramana Raja over 1 year ago
I went through https://tracker.ceph.com/issues/58145#note-1 and the ganesha log. I don't see anything obviously incorrect with the setup.
The following warning in the ganesha log looks interesting. I've not seen this before:
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :WARN :A protocol is specified for export 1 that is not enabled in NFS_CORE_PARAM, fixing up
Can you share the actual ganesha config file, "/etc/ganesha/ganesha.conf"?
Can you temporarily increase the log level from 'NIV_DEBUG' to 'NIV_FULL_DEBUG' for all components of the NFS-Ganesha server using
https://docs.ceph.com/en/quincy/mgr/nfs/#set-customized-nfs-ganesha-configuration (See example use case 1.)
Maybe this will provide more hints?
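(A minimal sketch of what that would look like per the linked docs, assuming the cluster from the reproduction is named nfs1; the LOG block uses ganesha's standard component/level names:)

$ cat > nfs1-log.conf <<'EOF'
LOG {
    COMPONENTS {
        ALL = FULL_DEBUG;
    }
}
EOF
$ ceph nfs cluster config set nfs1 -i nfs1-log.conf
# and once done debugging:
$ ceph nfs cluster config reset nfs1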
I suggest reaching out to Frank Filz from the NFS-Ganesha team to look at this tracker ticket.
Updated by John Mulligan over 1 year ago
Here's the /etc/ganesha/ganesha.conf from the original run.
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 2049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfs1.0.0.ceph0.mhjcwu";
        nodeid = "nfs.nfs1.0";
        pool = ".nfs";
        namespace = "nfs1";
}

RADOS_URLS {
        UserId = "nfs.nfs1.0.0.ceph0.mhjcwu";
        watch_url = "rados://.nfs/nfs1/conf-nfs.nfs1";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfs1.0.0.ceph0.mhjcwu-rgw";
}

%url    rados://.nfs/nfs1/conf-nfs.nfs1
Next, I'll rerun my test setup with the NIV_FULL_DEBUG log level. I should have it soon.
Updated by John Mulligan over 1 year ago
New log attached, NIV_FULL_DEBUG level. Same procedure (in short):
[ceph@ceph0 ~]$ sudo cephadm shell
Inferring fsid 6a8b74bc-7579-11ed-b256-525400220000
Inferring config /var/lib/ceph/6a8b74bc-7579-11ed-b256-525400220000/mon.ceph0/config
Using ceph image with id '5700871d3e5a' and tag 'main' created on 2022-12-06 09:14:13 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:94a1dbe7c4ccbe6101d5f771c6763c553aa7da783bd0819147f44dd9617a4bfd
[ceph: root@ceph0 /]# ceph -s
  cluster:
    id:     6a8b74bc-7579-11ed-b256-525400220000
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph0,ceph2,ceph1 (age 9m)
    mgr: ceph0.egfash(active, since 9m), standbys: ceph2.xqmxcg
    osd: 6 osds: 6 up (since 8m), 6 in (since 8m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   920 MiB used, 29 GiB / 30 GiB avail
    pgs:     1 active+clean

[ceph: root@ceph0 /]# ceph fs volume create fs1
[ceph: root@ceph0 /]# ceph fs volume ls
[
    {
        "name": "fs1"
    }
]
[ceph: root@ceph0 /]# ceph nfs cluster create nfs1
[ceph: root@ceph0 /]# ceph nfs export create cephfs --cluster-id=nfs1 --pseudo-path=/fs1 --path=/ --fsname=fs1
{
    "bind": "/fs1",
    "fs": "fs1",
    "path": "/",
    "cluster": "nfs1",
    "mode": "RW"
}
[ceph: root@ceph0 /]# ceph orch ps | grep nfs
nfs.nfs1.0.0.ceph0.fqfjvf  ceph0  *:2049  running (21s)  17s ago  21s  17.5M  -  4.2  5700871d3e5a  700d096e6a4d
[ceph@ceph1 ~]$ sudo mount.nfs ceph0.cx.fdopen.net:/fs1 /mnt
mount.nfs: mounting ceph0.cx.fdopen.net:/fs1 failed, reason given by server: No such file or directory
Updated by John Mulligan over 1 year ago
$ sudo podman exec -it ceph-6a8b74bc-7579-11ed-b256-525400220000-nfs-nfs1-0-0-ceph0-fqfjvf cat /etc/ganesha/ganesha.conf ; echo
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 2049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        nodeid = "nfs.nfs1.0";
        pool = ".nfs";
        namespace = "nfs1";
}

RADOS_URLS {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        watch_url = "rados://.nfs/nfs1/conf-nfs.nfs1";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfs1.0.0.ceph0.fqfjvf-rgw";
}

%url    rados://.nfs/nfs1/conf-nfs.nfs1
Updated by Ramana Raja over 1 year ago
Frank suspects it's https://github.com/nfs-ganesha/nfs-ganesha/issues/888
Updated by John Mulligan over 1 year ago
Based on suggestions in ceph-devel IRC, I added an EXPORT_DEFAULTS section:
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 2049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        nodeid = "nfs.nfs1.0";
        pool = ".nfs";
        namespace = "nfs1";
}

RADOS_URLS {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        watch_url = "rados://.nfs/nfs1/conf-nfs.nfs1";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfs1.0.0.ceph0.fqfjvf-rgw";
}

EXPORT_DEFAULTS {
        Protocols = 4;
}

%url    rados://.nfs/nfs1/conf-nfs.nfs1
Restarted ganesha, and now the client can mount the export:
[ceph@ceph1 ~]$ sudo mount.nfs ceph0.cx.fdopen.net:/fs1 /mnt
[ceph@ceph1 ~]$ mount | grep nfs
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
ceph0.cx.fdopen.net:/fs1 on /mnt type nfs4 (rw,relatime,seclabel,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.76.201,local_lock=none,addr=192.168.76.200)
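(Note that editing the cephadm-generated /etc/ganesha/ganesha.conf in the container won't survive a redeploy; if the workaround needs to persist, the same block could presumably be supplied as user config through the mgr nfs module instead, e.g.:)

$ cat > export-defaults.conf <<'EOF'
EXPORT_DEFAULTS {
    Protocols = 4;
}
EOF
$ ceph nfs cluster config set nfs1 -i export-defaults.conf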
Updated by Ramana Raja over 1 year ago
Checked with Frank on IRC: we can add the following section to the template, src/pybind/mgr/cephadm/templates/services/nfs/ganesha.conf.j2,
EXPORT_DEFAULTS {
        Protocols = 4;
}
which should be backwards compatible. But as John pointed out in the orchestrators weekly, we may want to wait for the fix in NFS-Ganesha, which would remove the need for the EXPORT_DEFAULTS block.
The NFS-Ganesha issue is tracked here: https://github.com/nfs-ganesha/nfs-ganesha/issues/888
Updated by Frank Filz over 1 year ago
I believe I have a fix:
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/547188
There are two patches, so if you want to check it out, download:
git fetch ssh://ffilz@review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/88/547188/1 && git checkout FETCH_HEAD
It may be a few days before we tag a new V4.3 that includes this fix. In the meantime, if you can test this fix, it would be helpful.
Thanks
Frank
Updated by Ramana Raja over 1 year ago
Frank Filz wrote:
I believe I have a fix:
https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/547188
There are two patches, so if you want to check it out, download:
git fetch ssh://ffilz@review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/88/547188/1 && git checkout FETCH_HEAD
I tested this fix, and it worked for me. I locally built nfs-ganesha with this fix and used it with Ceph built from the main branch. With this fix, I didn't have to add the EXPORT_DEFAULTS section to make the mounting work.
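(For reference, a rough outline of such a local build with the Ceph FSAL enabled; the submodule and cmake steps here are general ganesha build assumptions, not taken from this ticket:)

$ git clone https://github.com/nfs-ganesha/nfs-ganesha.git && cd nfs-ganesha
$ git fetch https://review.gerrithub.io/ffilz/nfs-ganesha refs/changes/88/547188/1 && git checkout FETCH_HEAD
$ git submodule update --init --recursive
$ mkdir build && cd build
$ cmake ../src -DUSE_FSAL_CEPH=ON -DCMAKE_BUILD_TYPE=Debug
$ make -j"$(nproc)"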
It may be a few days before we tag a new V4.3 that includes this fix, in the meantime, if you can test this fix it would be helpful.
Thanks
Frank
Updated by Adam King over 1 year ago
- Status changed from New to Resolved
This was fixed on the ganesha side by https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/547188
Updated by Kamoltat (Junior) Sirivadhna 2 months ago
- Status changed from Resolved to New
Hi guys,
this problem popped up in a RADOS Pacific branch run:
/a/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566724/
Updated by Kamoltat (Junior) Sirivadhna 2 months ago
- Status changed from New to Pending Backport
As discussed offline with Adam King: "The tracker was 'fixed' by a change in ganesha itself and then the version we're using in main being updated. I assume it's the same here, where there is an issue with the ganesha version we use in pacific that's causing this."
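(A quick way to confirm that would be to check the ganesha release shipped in the pacific container image, since the fix was expected to land in V4.3; e.g., on the host running the nfs daemon:)

$ sudo podman exec -it <nfs-daemon-container-name> ganesha.nfsd -v
# Anything older than V4.3 would still be missing the fix from gerrithub change 547188.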