Bug #57614: "ceph nfs cluster create ..." always show process bound to 2049: unable to deploy ingress - Orchestrator - Ceph

Bug #57614

closed

Added by Francesco Pantano over 1 year ago. Updated about 1 month ago.

Status: Resolved
Priority: Normal
Assignee: Adam King
Category:
Target version:
% Done:
Source: Community (dev)
Tags: backport_processed
Backport: reef, quincy
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID: 53008
Crash signature (v1):
Crash signature (v2):

Description

Here is an example of the issue described in the subject:

root@devstack:/# ceph orch ls
    NAME   PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
    crash             1/1  4m ago     2h   *
    mgr               2/2  4m ago     2h   count:2
    mon               1/5  4m ago     2h   count:5
    osd                 1  4m ago     -    <unmanaged>

root@devstack:/# ceph nfs cluster create cephfs --placement=devstack.localdomain --ingress  --virtual-ip 192.168.24.75/24 --port 2049
    NFS Cluster Created Successfully

root@devstack:/# ceph orch ls
    NAME                PORTS                    RUNNING  REFRESHED  AGE  PLACEMENT
    crash                                            1/1  3s ago     2h   *
    ingress.nfs.cephfs  192.168.24.75:2049,9049      0/4  -          10s  count:2
    mgr                                              2/2  3s ago     2h   count:2
    mon                                              1/5  3s ago     2h   count:5
    nfs.cephfs          ?:12049                      1/1  3s ago     10s  devstack.localdomain
    osd                                                1  3s ago     -    <unmanaged>

root@devstack:/var/log/ceph# ceph orch ps
    NAME                             HOST                  PORTS    STATUS        REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
    crash.devstack                   devstack.localdomain           running (2h)     4m ago   2h    7201k        -  17.2.3   0912465dcea5  b79b84f7c463
    mgr.devstack.jykurw              devstack.localdomain           running (2h)     4m ago   2h     432M        -  17.2.3   0912465dcea5  e70c1bd70f43
    mgr.devstack.localdomain.yworig  devstack.localdomain  *:9283   running (2h)     4m ago   2h     539M        -  17.2.3   0912465dcea5  4b0232569f76
    mon.devstack.localdomain         devstack.localdomain           running (2h)     4m ago   2h     440M    2048M  17.2.3   0912465dcea5  1e53d1de366a
    nfs.cephfs.0.0.devstack.kfvcel   devstack.localdomain  *:12049  running (4m)     4m ago   4m    9353k        -  4.0      0912465dcea5  dab8ed3fd5bb
    osd.0                            devstack.localdomain           running (2h)     4m ago   2h     101M    4096M  17.2.3   0912465dcea5  88a226cee3e7

root@devstack:/# ceph -W cephadm --watch-debug
      cluster:
        id:     15b994ed-4341-4522-94e9-56e75279659a
        health: HEALTH_WARN
                Failed to place 2 daemon(s)

      services:
        mon: 1 daemons, quorum devstack.localdomain (age 26h)
        mgr: devstack.localdomain.yworig(active, since 26h), standbys: devstack.jykurw
        osd: 1 osds: 1 up (since 26h), 1 in (since 26h)

      data:
        pools:   2 pools, 9 pgs
        objects: 5 objects, 449 KiB
        usage:   21 MiB used, 10 GiB / 10 GiB avail
        pgs:     9 active+clean

      io:
        client:   767 B/s rd, 511 B/s wr, 0 op/s rd, 0 op/s wr

But the ingress daemon fails with the following:

2022-09-20 07:42:53,430 7faa31849740 INFO Deploy daemon haproxy.nfs.cephfs.devstack.ultjer ...
  2022-09-20 07:42:53,751 7faa31849740 DEBUG stat: 0 0
  2022-09-20 07:42:53,906 7faa31849740 INFO Verifying port 2049 ...
  2022-09-20 07:42:53,907 7faa31849740 WARNING Cannot bind to IP 0.0.0.0 port 2049: [Errno 98] Address already in use
  2022-09-20 07:42:53,907 7faa31849740 INFO Verifying port 9049 ...
  2022-09-20 07:42:53,907 7faa31849740 ERROR ERROR: TCP Port(s) '2049,9049' required for haproxy already in use
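
The "Verifying port" and "Cannot bind to IP 0.0.0.0" lines suggest a check that tries to bind each required port on the wildcard address and reports any that fail. A minimal sketch of that kind of check, assuming plain Python sockets; the names `port_in_use_any_ip` and `busy_ports` are hypothetical and this is not cephadm's actual code:

```python
import errno
import socket


def port_in_use_any_ip(port: int) -> bool:
    """Return True if `port` cannot be bound on the IPv4 wildcard address.

    Hypothetical helper for illustration only. A listener on ANY local
    IPv4 address using this port makes the wildcard bind fail.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind(("0.0.0.0", port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:  # Errno 98 on Linux
                return True
            raise  # permission errors etc. are not "port in use"
    return False


def busy_ports(ports):
    """Return the subset of `ports` that are already in use on some IP."""
    return [p for p in ports if port_in_use_any_ip(p)]
```

With a check of this shape, `busy_ports([2049, 9049])` would report 2049 whenever anything on the host listens on that port, regardless of which address the new daemon actually intends to bind.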

netstat shows the following:

  LISTEN     0       64                  0.0.0.0:2049               0.0.0.0:*
  LISTEN     0       128                       *:12049                    *:*      users:(("ganesha.nfsd",pid=611540,fd=35))
  LISTEN     0       64                     [::]:2049                  [::]:*

I see many problems here:

1. a process is bound on 2049, and it's not haproxy
2. ganesha, which is bound on $port + 10000 [1] (12049 here), is bound on '*', which is a limitation of the "ceph nfs cluster" CLI
3. even using a spec, the result is still the same

root@devstack:/# cat nfs
  service_type: nfs
  service_id: cephfs
  placement:
    hosts:
      - devstack.localdomain
  spec:
    port: 12345

root@devstack:/# ceph orch ps
  NAME                             HOST                  PORTS    STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
  crash.devstack                   devstack.localdomain           running (2h)     23s ago   2h    7201k        -  17.2.3   0912465dcea5  b79b84f7c463
  mgr.devstack.jykurw              devstack.localdomain           running (2h)     23s ago   2h     432M        -  17.2.3   0912465dcea5  e70c1bd70f43
  mgr.devstack.localdomain.yworig  devstack.localdomain  *:9283   running (2h)     23s ago   2h     540M        -  17.2.3   0912465dcea5  4b0232569f76
  mon.devstack.localdomain         devstack.localdomain           running (2h)     23s ago   2h     444M    2048M  17.2.3   0912465dcea5  1e53d1de366a
  nfs.cephfs.0.0.devstack.dhqbzc   devstack.localdomain  *:12345  running (26s)    23s ago  26s    9365k        -  4.0      0912465dcea5  2b4d379d2282
  osd.0                            devstack.localdomain           running (2h)     23s ago   2h     101M    4096M  17.2.3   0912465dcea5  88a226cee3e7

stack@devstack:~$ sudo ss -antop | grep 2049
  LISTEN     0       64                  0.0.0.0:2049               0.0.0.0:*
  LISTEN     0       64                     [::]:2049                  [::]:*

I still see a process on 2049: I have no ingress in this config, and it will fail with the error described above if I try to apply:

  service_type: ingress
  service_id: nfs.cephfs
  placement:
    count: 1
  spec:
    backend_service: nfs.cephfs
    frontend_port: 2049
    monitor_port: 8000
    virtual_ip: 192.168.24.75/24

[1] https://github.com/ceph/ceph/blob/beabb1fa114ea75151746817195176ddcf035aa0/src/pybind/mgr/nfs/cluster.py#L70

Related issues 2 (0 open — 2 closed)

Updated by Ilya Dryomov over 1 year ago

Target version deleted (~~v17.2.4~~)

Updated by Adam King about 1 year ago

It was later found that this issue only appears when the conflict is between the frontend port haproxy is trying to use and the port the backend service is using. In this case, it should actually work, because haproxy binds only to the VIP we set up while the backend service binds to the host IP. Fixing this will require making the port check in the binary, which runs when we deploy daemons, aware of which IP is being bound (currently it only checks whether the port is bound on any IP).
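
An IP-aware port check of the kind described in this comment could look roughly like the sketch below. It assumes plain Python sockets and IPv4 only; the function name `port_in_use` and its `ip` parameter are hypothetical, and this is not the code that went into the actual fix:

```python
import errno
import socket


def port_in_use(port: int, ip: str = "0.0.0.0") -> bool:
    """Return True if `port` cannot be bound on the given IPv4 address.

    Hypothetical helper for illustration only. With the default
    ip="0.0.0.0" this behaves like an "any IP" check; passing the
    address the daemon will really use (for example the ingress VIP)
    limits the check to conflicts on that address.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((ip, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return True
            raise  # e.g. the address is not configured on this host
    return False
```

Under these assumptions, a backend listening on the host IP would not make `port_in_use(2049, vip)` return True, so haproxy could still be deployed on the VIP. The check would still report a conflict if the backend listens on the wildcard address, since on Linux a wildcard listener also blocks binds of the same port on specific addresses.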
