Bug #58145


orch/cephadm: nfs tests failing to mount exports ('mount -t nfs 10.0.31.120:/fake /mnt/foo' fails)

Added by Adam King over 1 year ago. Updated 2 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
backport_processed
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since the sepia lab recovered, all tests that attempt to mount NFS exports have been failing. They all fail on some variation of

mount -t nfs 10.0.31.120:/fake /mnt/foo

Additional error messages include

mounting 10.0.31.120:/fake failed, reason given by server: No such file or directory
requested NFS version or transport protocol is not supported
mount.nfs: Protocol not supported

This has also been reproducible outside of teuthology tests and the sepia lab in general (at least the "mounting 10.0.31.120:/fake failed, reason given by server: No such file or directory" case), so likely not a problem with the images used for testing (or, if there is an issue there, it's not the only one).

This is seen both when using NFS through an ingress service and when mounting the export directly on the standalone nfs daemon.
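
For illustration only (the cluster names and the ingress virtual IP below are hypothetical, not taken from the failing runs), the two scenarios look roughly like:

# standalone: mount directly against the host running the nfs daemon
ceph nfs cluster create nfs1
mount -t nfs <nfs-daemon-host>:/<pseudo-path> /mnt/foo

# through ingress: the cluster is created with an ingress service and the mount targets its virtual IP
ceph nfs cluster create nfs2 --ingress --virtual-ip 10.0.0.123/24
mount -t nfs 10.0.0.123:/<pseudo-path> /mnt/foo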

Some examples from a main baseline test run (https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/):
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089333
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089334
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089339
https://pulpito.ceph.com/yuriw-2022-11-23_17:41:56-orch-main-distro-default-smithi/7089351


Related issues (1 open, 0 closed)

Related to Orchestrator - Bug #58096: test_cluster_set_reset_user_config: NFS mount fails due to missing ceph directory (New)

Actions #1

Updated by John Mulligan over 1 year ago

I attempted to debug this situation locally on a 3-node VM cluster. I am able to reproduce the case where mount.nfs fails with 'No such file or directory'.

We first investigated the mgr nfs module, but it appears to be functioning as expected. It creates the nfs-ganesha containers and populates the .nfs rados pool with configuration objects. The cluster was deployed using cephadm from the 'main' branch.

Performing the following steps, I reproduced the behavior seen in some of the test runs:

On node 0:

[ceph@ceph0 ~]$ sudo cephadm shell
Inferring fsid 0a922e44-7195-11ed-8137-525400220000
Inferring config /var/lib/ceph/0a922e44-7195-11ed-8137-525400220000/mon.ceph0/config
Using ceph image with id '1606959841d3' and tag 'main' created on 2022-12-01 16:19:23 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:d7f07a8dc58edb9e4a6e64966a36cf3fd5e52698983308fd7a75d4c18fa957c3
[ceph: root@ceph0 /]# ceph -s
  cluster:
    id:     0a922e44-7195-11ed-8137-525400220000
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph0,ceph1,ceph2 (age 26m)
    mgr: ceph0.bbwifl(active, since 30m), standbys: ceph1.pmrxrr
    osd: 6 osds: 6 up (since 26m), 6 in (since 26m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   921 MiB used, 29 GiB / 30 GiB avail
    pgs:     1 active+clean

[ceph: root@ceph0 /]# ceph fs volume create fs1
[ceph: root@ceph0 /]# ceph fs volume ls
[
    {
        "name": "fs1" 
    }
]

[ceph: root@ceph0 /]# ceph nfs cluster create nfs1

[ceph: root@ceph0 /]# ceph nfs export create cephfs --cluster-id=nfs1  --pseudo-path=/fs1 --path=/ --fsname=fs1
{
    "bind": "/fs1",
    "fs": "fs1",
    "path": "/",
    "cluster": "nfs1",
    "mode": "RW" 
}

[ceph: root@ceph0 /]# ceph orch ps | grep nfs
nfs.nfs1.0.0.ceph0.mhjcwu  ceph0  *:2049       running (2m)      2m ago   2m    17.6M        -  4.2                    1606959841d3  49e76424dbdf  

On node 1:

[ceph@ceph1 ~]$ sudo mount.nfs ceph0.cx.fdopen.net:/fs1  /mnt
mount.nfs: mounting ceph0.cx.fdopen.net:/fs1 failed, reason given by server: No such file or directory

Prior to running the commands on node 1, I edited the unit.run file to start ganesha with NIV_DEBUG logging. The logs are attached as ganesha-log-2022-12-01-01.txt.xz.
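
(For reference, a sketch of that kind of change rather than the exact diff from this run: the ganesha.nfsd invocation near the end of the daemon's unit.run is given an explicit log level flag, e.g.

ganesha.nfsd -F -L STDERR -N NIV_DEBUG

where -F keeps the daemon in the foreground, -L selects the log destination and -N sets the default log level; the surrounding podman arguments are elided here. The daemon's systemd unit is then restarted to pick up the change.)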

I examined the config object created by the mgr module:

[ceph: root@ceph0 /]# rados get --pool=.nfs --namespace=nfs1 export-1 /dev/stdout
EXPORT {
    FSAL {
        name = "CEPH";
        user_id = "nfs.nfs1.1";
        filesystem = "fs1";
        secret_access_key = "AQBE3ohj6c+aCRAAdOAqF2oNHTYbargyiX2bnw==";
    }
    export_id = 1;
    path = "/";
    pseudo = "/fs1";
    access_type = "RW";
    squash = "none";
    attr_expiration_time = 0;
    security_label = true;
    protocols = 4;
    transports = "TCP";
}

Nothing jumped out at me as wrong. Logs showed:

01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] reclaim_reset :FSAL :DEBUG :Issuing reclaim reset for ganesha-nfs.nfs1.0-0001
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] create_export :FSAL :DEBUG :Ceph module export /.
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] dirmap_lru_init :NFS READDIR :DEBUG :Skipping dirmap Ceph/MDC
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :WARN :A protocol is specified for export 1 that is not enabled in NFS_CORE_PARAM, fixing up
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :INFO :Export 1 created at pseudo (/fs1) with path (/) and tag ((null)) perms (options=020031e0/077801e7 no_root_squash, RWrw, ---, ---, TCP, ----,               ,         ,                ,                , expire=       0)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :INFO :Export 1 has 0 defined clients
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] build_default_root :EXPORT :DEBUG :Allocating Pseudo root export
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] pseudofs_create_export :FSAL :DEBUG :Created exp 0x15bb200 - / 
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] dirmap_lru_init :NFS READDIR :DEBUG :Skipping dirmap PSEUDO/MDC
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] build_default_root :CONFIG :INFO :Export 0 (/) successfully created
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] ReadExports :EXPORT :INFO :Export     1 pseudo (/fs1) with path (/) and tag ((null)) perms (options=020031e0/077801e7 no_root_squash, RWrw, ---, ---, TCP, ----,               ,         ,                ,                , expire=       0)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] ReadExports :EXPORT :INFO :Export     0 pseudo (/) with path (/) and tag ((null)) perms (options=0221f080/0771f3e7 no_root_squash, --r-, -4-, ---, TCP, ----,               ,         ,                ,                ,                , none, sys, krb5, krb5i, krb5p)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] lower_my_caps :NFS STARTUP :EVENT :CAP_SYS_RESOURCE was successfully removed for proper quota management in FSAL
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] lower_my_caps :NFS STARTUP :EVENT :currently set capabilities are: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap=ep
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_pkginit :DBUS :DEBUG :init
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_pkginit :DBUS :CRIT :dbus_bus_get failed (Failed to connect to socket /run/dbus/system_bus_socket: No such file or directory)
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] gsh_dbus_register_path :DBUS :CRIT :dbus_connection_register_object_path called with no DBUS connection
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] nfs_Init :NFS STARTUP :DEBUG :Now building NFSv4 ACL cache
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] nfs4_acls_init :NFS4 ACL :DEBUG :Initialize NFSv4 ACLs
01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] nfs4_acls_init :NFS4 ACL :DEBUG :sizeof(fsal_ace_t)=20, sizeof(fsal_acl_t)=80

And at the connection attempt from the client:

01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] cih_get_by_key_latch :HT CACHE :DEBUG :cih cache hit slot 18744
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] complete_op :NFS4 :DEBUG :Status of OP_PUTFH in position 1 = NFS4_OK, op response size is 4 total response size is 84
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] process_one_op :NFS4 :DEBUG :Request 2: opcode 15 is OP_LOOKUP
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] nfs4_op_lookup :NFS4 :DEBUG :name=fs1
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] mdc_lookup :NFS READDIR :DEBUG :Cache Miss detected for fs1
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] mdc_lookup_uncached :NFS READDIR :DEBUG :lookup fs1 failed with No such file or directory
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] complete_op :NFS4 :DEBUG :Status of OP_LOOKUP in position 2 = NFS4ERR_NOENT, op response size is 4 total response size is 92
01/12/2022 17:08:24 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[svc_4] complete_nfs4_compound :NFS4 :DEBUG :End status = NFS4ERR_NOENT lastindex = 3

Versions:

[ceph@ceph0 ~]$ sudo podman exec -it ceph-0a922e44-7195-11ed-8137-525400220000-nfs-nfs1-0-0-ceph0-mhjcwu  bash
[root@ceph0 /]# ganesha.nfsd -v
NFS-Ganesha Release = V4.2

[ceph@ceph0 ~]$ sudo podman image ls
REPOSITORY                        TAG         IMAGE ID      CREATED         SIZE
quay.ceph.io/ceph-ci/ceph         main        1606959841d3  57 minutes ago  1.4 GB
quay.io/ceph/ceph-grafana         8.3.5       dad864ee21e9  7 months ago    571 MB
quay.io/prometheus/prometheus     v2.33.4     514e6a882f6e  9 months ago    205 MB
quay.io/prometheus/node-exporter  v1.3.1      1dbe0e931976  12 months ago   22.3 MB
quay.io/prometheus/alertmanager   v0.23.0     ba2b418f427c  15 months ago   58.9 MB

I also tried the same using a cephfs subvolume path, after getting the path with ceph fs subvolume getpath. The mount.nfs result was the same "reason given by server: No such file or directory". I can provide logs for this as well if requested.
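
For reference, a sketch of that subvolume variant (the subvolume name and the returned path are illustrative, not copied from my run):

[ceph: root@ceph0 /]# ceph fs subvolume create fs1 sub1
[ceph: root@ceph0 /]# ceph fs subvolume getpath fs1 sub1
/volumes/_nogroup/sub1/<uuid>
[ceph: root@ceph0 /]# ceph nfs export create cephfs --cluster-id=nfs1 --pseudo-path=/sub1 --fsname=fs1 --path=/volumes/_nogroup/sub1/<uuid>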

Actions #2

Updated by Laura Flores over 1 year ago

  • Related to Bug #58096: test_cluster_set_reset_user_config: NFS mount fails due to missing ceph directory added
Actions #3

Updated by Ramana Raja over 1 year ago

I went through https://tracker.ceph.com/issues/58145#note-1 and the ganesha log. I don't see anything obviously incorrect with the setup.

The following warning in the ganesha log looks interesting. I've not seen this before:

01/12/2022 17:06:13 : epoch 6388df02 : ceph0 : ganesha.nfsd-7[main] export_commit_common :CONFIG :WARN :A protocol is specified for export 1 that is not enabled in NFS_CORE_PARAM, fixing up

Can you share the actual ganesha config file, "/etc/ganesha/ganesha.conf"?

Can you temporarily increase the log level from 'NIV_DEBUG' to 'NIV_FULL_DEBUG' for all components of the NFS-Ganesha server using
https://docs.ceph.com/en/quincy/mgr/nfs/#set-customized-nfs-ganesha-configuration (see example use case 1)?
Maybe this will provide more hints.
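
For reference, a sketch of that documented mechanism (the file name is illustrative; the cluster id matches the one used earlier in this ticket):

# log-debug.conf
LOG {
    COMPONENTS {
        ALL = FULL_DEBUG;
    }
}

[ceph: root@ceph0 /]# ceph nfs cluster config set nfs1 -i log-debug.conf

and, once done debugging:

[ceph: root@ceph0 /]# ceph nfs cluster config reset nfs1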

I suggest reaching out to Frank Filz from the NFS-Ganesha team to look at this tracker ticket.

Actions #4

Updated by John Mulligan over 1 year ago

Here's the /etc/ganesha/ganesha.conf from the original run.

# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 2049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfs1.0.0.ceph0.mhjcwu";
        nodeid = "nfs.nfs1.0";
        pool = ".nfs";
        namespace = "nfs1";
}

RADOS_URLS {
        UserId = "nfs.nfs1.0.0.ceph0.mhjcwu";
        watch_url = "rados://.nfs/nfs1/conf-nfs.nfs1";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfs1.0.0.ceph0.mhjcwu-rgw";
}

%url    rados://.nfs/nfs1/conf-nfs.nfs1

Next, I'll rerun my test setup with the NIV_FULL_DEBUG log level. I should have it soon.

Actions #5

Updated by John Mulligan over 1 year ago

New log attached, NIV_FULL_DEBUG level. Same procedure (in short):

[ceph@ceph0 ~]$ sudo cephadm shell
Inferring fsid 6a8b74bc-7579-11ed-b256-525400220000
Inferring config /var/lib/ceph/6a8b74bc-7579-11ed-b256-525400220000/mon.ceph0/config
Using ceph image with id '5700871d3e5a' and tag 'main' created on 2022-12-06 09:14:13 +0000 UTC
quay.ceph.io/ceph-ci/ceph@sha256:94a1dbe7c4ccbe6101d5f771c6763c553aa7da783bd0819147f44dd9617a4bfd
[ceph: root@ceph0 /]# ceph -s
  cluster:
    id:     6a8b74bc-7579-11ed-b256-525400220000
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph0,ceph2,ceph1 (age 9m)
    mgr: ceph0.egfash(active, since 9m), standbys: ceph2.xqmxcg
    osd: 6 osds: 6 up (since 8m), 6 in (since 8m)

  data:
    pools:   1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage:   920 MiB used, 29 GiB / 30 GiB avail
    pgs:     1 active+clean

[ceph: root@ceph0 /]# ceph fs volume create fs1
[ceph: root@ceph0 /]# ceph fs volume ls
[
    {
        "name": "fs1" 
    }
]
[ceph: root@ceph0 /]# ceph nfs cluster create nfs1
[ceph: root@ceph0 /]# ceph nfs export create cephfs --cluster-id=nfs1  --pseudo-path=/fs1 --path=/ --fsname=fs1
{
    "bind": "/fs1",
    "fs": "fs1",
    "path": "/",
    "cluster": "nfs1",
    "mode": "RW" 
}

[ceph: root@ceph0 /]# ceph orch ps | grep nfs
nfs.nfs1.0.0.ceph0.fqfjvf  ceph0  *:2049       running (21s)    17s ago  21s    17.5M        -  4.2                    5700871d3e5a  700d096e6a4d  

[ceph@ceph1 ~]$ sudo mount.nfs ceph0.cx.fdopen.net:/fs1  /mnt
mount.nfs: mounting ceph0.cx.fdopen.net:/fs1 failed, reason given by server: No such file or directory

Actions #6

Updated by John Mulligan over 1 year ago

$ sudo podman exec -it ceph-6a8b74bc-7579-11ed-b256-525400220000-nfs-nfs1-0-0-ceph0-fqfjvf  cat /etc/ganesha/ganesha.conf ;echo
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 2049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        nodeid = "nfs.nfs1.0";
        pool = ".nfs";
        namespace = "nfs1";
}

RADOS_URLS {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        watch_url = "rados://.nfs/nfs1/conf-nfs.nfs1";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfs1.0.0.ceph0.fqfjvf-rgw";
}

%url    rados://.nfs/nfs1/conf-nfs.nfs1

Actions #8

Updated by John Mulligan over 1 year ago

Based on suggestions in ceph-devel IRC, I added an EXPORT_DEFAULTS section:

# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
        NFS_Port = 2049;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        nodeid = "nfs.nfs1.0";
        pool = ".nfs";
        namespace = "nfs1";
}

RADOS_URLS {
        UserId = "nfs.nfs1.0.0.ceph0.fqfjvf";
        watch_url = "rados://.nfs/nfs1/conf-nfs.nfs1";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfs1.0.0.ceph0.fqfjvf-rgw";
}

EXPORT_DEFAULTS {
    Protocols = 4;
}

%url    rados://.nfs/nfs1/conf-nfs.nfs1

Restarted ganesha, and now the client can mount the export:

[ceph@ceph1 ~]$ sudo mount.nfs ceph0.cx.fdopen.net:/fs1  /mnt
[ceph@ceph1 ~]$ mount | grep nfs
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw,relatime)
ceph0.cx.fdopen.net:/fs1 on /mnt type nfs4 (rw,relatime,seclabel,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.76.201,local_lock=none,addr=192.168.76.200)

Actions #9

Updated by Ramana Raja over 1 year ago

Checked with Frank on IRC; we can add the following section to the template src/pybind/mgr/cephadm/templates/services/nfs/ganesha.conf.j2:

EXPORT_DEFAULTS {
 Protocols = 4;
}

which should be backwards compatible. But as John pointed out in the orchestrator weekly, we may want to wait for the fix in NFS-Ganesha that would remove the need for the EXPORT_DEFAULTS block.

The NFS-Ganesha issue is tracked here: https://github.com/nfs-ganesha/nfs-ganesha/issues/888

Actions #10

Updated by Frank Filz over 1 year ago

I believe I have a fix:

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/547188

There are two patches, so if you want to check it out, download:

git fetch ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/88/547188/1 && git checkout FETCH_HEAD

It may be a few days before we tag a new V4.3 that includes this fix, in the meantime, if you can test this fix it would be helpful.

Thanks

Frank

Actions #11

Updated by Ramana Raja over 1 year ago

Frank Filz wrote:

I believe I have a fix:

https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/547188

There are two patches, so if you want to check it out, download:

git fetch ssh://review.gerrithub.io:29418/ffilz/nfs-ganesha refs/changes/88/547188/1 && git checkout FETCH_HEAD

I tested this fix, and it worked for me. I locally built nfs-ganesha with this fix and used it with Ceph built from main branch. With this fix, I didn't have to add the EXPORT_DEFAULTS section to make the mounting work.

It may be a few days before we tag a new V4.3 that includes this fix, in the meantime, if you can test this fix it would be helpful.

Thanks

Frank

Actions #12

Updated by Adam King over 1 year ago

  • Status changed from New to Resolved

This was fixed on the ganesha side by https://review.gerrithub.io/c/ffilz/nfs-ganesha/+/547188.

Actions #13

Updated by Kamoltat (Junior) Sirivadhna 2 months ago

  • Status changed from Resolved to New

Hi guys,
this problem popped up again in a RADOS Pacific branch run:

/a/yuriw-2024-02-19_19:25:49-rados-pacific-release-distro-default-smithi/7566724/

Actions #14

Updated by Kamoltat (Junior) Sirivadhna 2 months ago

  • Status changed from New to Pending Backport

As discussed offline with Adam King:

``this tracker was "fixed" by a change in ganesha itself and then by the version we're using in main being updated. I assume it's the same here: there is an issue with the ganesha version we use in pacific that's causing this``
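
For reference, one quick check (the container name is hypothetical; it can be taken from ceph orch ps as in note 1) would be to compare the ganesha version shipped in the pacific image:

sudo podman exec -it <nfs-daemon-container> ganesha.nfsd -v

Per note 10, the fix was expected to land in NFS-Ganesha V4.3, so an affected build such as the V4.2 seen in note 1 would be consistent with the recurrence.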

Actions #15

Updated by Backport Bot 2 months ago

  • Tags set to backport_processed