Bug #51291

closed

Adoption fails for Ceph MDS servers

Added by Jesse Roland almost 3 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%

Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I'm migrating my Ceph cluster from `ceph-ansible` to `cephadm` by following the guide here: https://docs.ceph.com/en/octopus/cephadm/adoption/

I've made it to step 10 where one runs the command:

# ceph orch apply mds <fs-name> [--placement=<placement>]

After running this, nothing appears to change. The command did register the service, since `ceph orch ls` now lists an MDS service, but no daemons are deployed:

# ceph orch ls
NAME        RUNNING  REFRESHED  AGE  PLACEMENT                     IMAGE NAME                    IMAGE ID      
mds.cephfs      0/3  -          -    athos2;athos3;athos4;count:3  <unknown>                     <unknown>     
mgr             5/0  16m ago    -    <unmanaged>                   docker.io/ceph/ceph:v15.2.13  2cf504fded39  
mon             5/0  16m ago    -    <unmanaged>                   docker.io/ceph/ceph:v15.2.13  2cf504fded39 

The target FS is called `cephfs`

# ceph fs ls
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

If I run `cephadm ls` on the node, it returns only the legacy MDS server. I've tried disabling the legacy service on the target machine, with no success so far.

Digging deeper, I found the following from `ceph orch`:

# ceph orch ls --service_name=mds.cephfs --format yaml
service_type: mds
service_id: cephfs
service_name: mds.cephfs
placement:
  count: 3
  hosts:
  - athos2
  - athos3
  - athos4
status:
  running: 0
  size: 3
events:
- '2021-06-19T23:32:01.844902Z service:mds.cephfs [ERROR] "Failed while placing mds.cephfs.athos4.wqwvix
  on athos4: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15
  --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos4.wqwvix
  --config-json -"'
- '2021-06-19T23:32:01.949145Z service:mds.cephfs [ERROR] "Failed while placing mds.cephfs.athos2.vemowm
  on athos2: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15
  --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos2.vemowm
  --config-json -"'
- '2021-06-19T23:32:41.577409Z service:mds.cephfs [ERROR] "Failed while placing mds.cephfs.athos3.iubqwa
  on athos3: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15
  --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos3.iubqwa
  --config-json -"'
- '2021-06-19T23:32:43.647630Z service:mds.cephfs [ERROR] "Failed while placing mds.cephfs.athos4.amlogw
  on athos4: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15
  --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos4.amlogw
  --config-json -"'
- '2021-06-19T23:32:49.889821Z service:mds.cephfs [ERROR] "Failed while placing mds.cephfs.athos2.ebrxnm
  on athos2: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15
  --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos2.ebrxnm
  --config-json -"'

I've been stuck here. Running the command manually hangs without any further output. I had hoped that meant it'd be running in the foreground, but running `cephadm ls` on the node returned no active services.

Actions #1

Updated by Jesse Roland almost 3 years ago

Posting an update with additional details. I was able to get more verbose output by running `ceph log last cephadm`:

RuntimeError: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos4.nsevry --config-json -
2021-06-28T12:19:36.468436+0000 mgr.athos2 (mgr.5211678) 1623980 : cephadm [INF] Deploying daemon mds.cephfs.athos2.adupvw on athos2
2021-06-28T12:19:36.504448+0000 mgr.athos2 (mgr.5211678) 1623985 : cephadm [ERR] Traceback (most recent call last):
2021-06-28T12:19:36.504677+0000 mgr.athos2 (mgr.5211678) 1623986 : cephadm [ERR]   File "/lib/python3.6/site-packages/remoto/process.py", line 188, in check
2021-06-28T12:19:36.504883+0000 mgr.athos2 (mgr.5211678) 1623987 : cephadm [ERR]     response = result.receive(timeout)
2021-06-28T12:19:36.505087+0000 mgr.athos2 (mgr.5211678) 1623988 : cephadm [ERR]   File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
2021-06-28T12:19:36.505290+0000 mgr.athos2 (mgr.5211678) 1623989 : cephadm [ERR]     raise self._getremoteerror() or EOFError()
2021-06-28T12:19:36.505525+0000 mgr.athos2 (mgr.5211678) 1623990 : cephadm [ERR] execnet.gateway_base.RemoteError: Traceback (most recent call last):
2021-06-28T12:19:36.505741+0000 mgr.athos2 (mgr.5211678) 1623991 : cephadm [ERR]   File "<string>", line 1088, in executetask
2021-06-28T12:19:36.505951+0000 mgr.athos2 (mgr.5211678) 1623992 : cephadm [ERR]   File "/lib/python3.6/site-packages/remoto/process.py", line 151, in _remote_check
2021-06-28T12:19:36.506161+0000 mgr.athos2 (mgr.5211678) 1623993 : cephadm [ERR]   File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
2021-06-28T12:19:36.506370+0000 mgr.athos2 (mgr.5211678) 1623994 : cephadm [ERR]     stdout, stderr = self._communicate(input, endtime, timeout)
2021-06-28T12:19:36.506579+0000 mgr.athos2 (mgr.5211678) 1623995 : cephadm [ERR]   File "/usr/lib/python3.6/subprocess.py", line 1519, in _communicate
2021-06-28T12:19:36.506785+0000 mgr.athos2 (mgr.5211678) 1623996 : cephadm [ERR]     input_view = memoryview(self._input)
2021-06-28T12:19:36.506993+0000 mgr.athos2 (mgr.5211678) 1623997 : cephadm [ERR] TypeError: memoryview: a bytes-like object is required, not 'str'
2021-06-28T12:19:36.507202+0000 mgr.athos2 (mgr.5211678) 1623998 : cephadm [ERR] 
2021-06-28T12:19:36.507409+0000 mgr.athos2 (mgr.5211678) 1623999 : cephadm [ERR] 
2021-06-28T12:19:36.508516+0000 mgr.athos2 (mgr.5211678) 1624001 : cephadm [ERR] Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos2.adupvw --config-json -
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/remoto/process.py", line 188, in check
    response = result.receive(timeout)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
    raise self._getremoteerror() or EOFError()
execnet.gateway_base.RemoteError: Traceback (most recent call last):
  File "<string>", line 1088, in executetask
  File "/lib/python3.6/site-packages/remoto/process.py", line 151, in _remote_check
  File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1519, in _communicate
    input_view = memoryview(self._input)
TypeError: memoryview: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1021, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1157, in _run_cephadm
    stdin=stdin)
  File "/lib/python3.6/site-packages/remoto/process.py", line 209, in check
    'Failed to execute command: %s' % ' '.join(command)
RuntimeError: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mds.cephfs.athos2.adupvw --config-json -

This appears to be a Python-related error. There were a few tickets about this in the past, but none were MDS-related.

Actions #2

Updated by Jesse Roland almost 3 years ago

I had shelved this task for a while to work on other things, since the cluster was still in a functional state. Coming back and inspecting it, I've realized this Python error is occurring on all of my containers, including monitors, managers, and OSDs:

2021-07-12T20:35:52.738777+0000 mgr.athos2 (mgr.5211678) 3363601 : cephadm [ERR] Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mgr.athos6 --reconfig --config-json -
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/remoto/process.py", line 188, in check
    response = result.receive(timeout)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
    raise self._getremoteerror() or EOFError()
execnet.gateway_base.RemoteError: Traceback (most recent call last):
  File "<string>", line 1088, in executetask
  File "/lib/python3.6/site-packages/remoto/process.py", line 151, in _remote_check
  File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1519, in _communicate
    input_view = memoryview(self._input)
TypeError: memoryview: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1021, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1157, in _run_cephadm
    stdin=stdin)
  File "/lib/python3.6/site-packages/remoto/process.py", line 209, in check
    'Failed to execute command: %s' % ' '.join(command)
RuntimeError: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mgr.athos6 --reconfig --config-json -
2021-07-12T20:35:52.740158+0000 mgr.athos2 (mgr.5211678) 3363602 : cephadm [INF] Reconfiguring mon.athos6 (unknown last config time)...
2021-07-12T20:35:52.747412+0000 mgr.athos2 (mgr.5211678) 3363605 : cephadm [INF] Deploying daemon mon.athos6 on athos6
2021-07-12T20:35:54.859395+0000 mgr.athos2 (mgr.5211678) 3363611 : cephadm [ERR] Traceback (most recent call last):
2021-07-12T20:35:54.859597+0000 mgr.athos2 (mgr.5211678) 3363612 : cephadm [ERR]   File "/lib/python3.6/site-packages/remoto/process.py", line 188, in check
2021-07-12T20:35:54.859796+0000 mgr.athos2 (mgr.5211678) 3363613 : cephadm [ERR]     response = result.receive(timeout)
2021-07-12T20:35:54.860031+0000 mgr.athos2 (mgr.5211678) 3363614 : cephadm [ERR]   File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
2021-07-12T20:35:54.860260+0000 mgr.athos2 (mgr.5211678) 3363615 : cephadm [ERR]     raise self._getremoteerror() or EOFError()
2021-07-12T20:35:54.860484+0000 mgr.athos2 (mgr.5211678) 3363616 : cephadm [ERR] execnet.gateway_base.RemoteError: Traceback (most recent call last):
2021-07-12T20:35:54.860719+0000 mgr.athos2 (mgr.5211678) 3363617 : cephadm [ERR]   File "<string>", line 1088, in executetask
2021-07-12T20:35:54.860939+0000 mgr.athos2 (mgr.5211678) 3363618 : cephadm [ERR]   File "/lib/python3.6/site-packages/remoto/process.py", line 151, in _remote_check
2021-07-12T20:35:54.861150+0000 mgr.athos2 (mgr.5211678) 3363619 : cephadm [ERR]   File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
2021-07-12T20:35:54.861459+0000 mgr.athos2 (mgr.5211678) 3363620 : cephadm [ERR]     stdout, stderr = self._communicate(input, endtime, timeout)
2021-07-12T20:35:54.861677+0000 mgr.athos2 (mgr.5211678) 3363621 : cephadm [ERR]   File "/usr/lib/python3.6/subprocess.py", line 1519, in _communicate
2021-07-12T20:35:54.861887+0000 mgr.athos2 (mgr.5211678) 3363622 : cephadm [ERR]     input_view = memoryview(self._input)
2021-07-12T20:35:54.862103+0000 mgr.athos2 (mgr.5211678) 3363623 : cephadm [ERR] TypeError: memoryview: a bytes-like object is required, not 'str'
2021-07-12T20:35:54.862310+0000 mgr.athos2 (mgr.5211678) 3363624 : cephadm [ERR] 
2021-07-12T20:35:54.862519+0000 mgr.athos2 (mgr.5211678) 3363625 : cephadm [ERR] 
2021-07-12T20:35:54.863743+0000 mgr.athos2 (mgr.5211678) 3363627 : cephadm [ERR] Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mon.athos6 --config-json -
Traceback (most recent call last):
  File "/lib/python3.6/site-packages/remoto/process.py", line 188, in check
    response = result.receive(timeout)
  File "/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
    raise self._getremoteerror() or EOFError()
execnet.gateway_base.RemoteError: Traceback (most recent call last):
  File "<string>", line 1088, in executetask
  File "/lib/python3.6/site-packages/remoto/process.py", line 151, in _remote_check
  File "/usr/lib/python3.6/subprocess.py", line 863, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.6/subprocess.py", line 1519, in _communicate
    input_view = memoryview(self._input)
TypeError: memoryview: a bytes-like object is required, not 'str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1021, in _remote_connection
    yield (conn, connr)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 1157, in _run_cephadm
    stdin=stdin)
  File "/lib/python3.6/site-packages/remoto/process.py", line 209, in check
    'Failed to execute command: %s' % ' '.join(command)
RuntimeError: Failed to execute command: sudo /usr/bin/cephadm --image docker.io/ceph/ceph:v15 --no-container-init deploy --fsid 85361255-4989-4e27-bdb3-e017b9081911 --name mon.athos6 --config-json -

Actions #3

Updated by Jesse Roland almost 3 years ago

I've tracked down the issue. Details and a fix are here: https://github.com/alfredodeza/remoto/issues/65

The problem stems from the remoto library, which does not encode the `stdin` value to bytes before handing it to the subprocess. That matches the `TypeError: memoryview: a bytes-like object is required, not 'str'` in the tracebacks above. I've filed a PR to address this here: https://github.com/alfredodeza/remoto/pull/66/

There may be a better solution, but for now patching remoto/process.py in the container has fixed the issue for me.
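
For anyone hitting the same traceback, the failure mode can be reproduced outside of Ceph. The sketch below is not the remoto code and not the actual patch. It only illustrates the general problem: `subprocess.Popen.communicate()` on binary-mode pipes rejects a `str` input, and encoding the input to bytes first avoids the error. The names `ECHO_CMD`, `run_with_stdin`, `to_bytes`, and the sample `config_json` payload are all made up for illustration, and the echo child merely stands in for a command that reads its config from stdin.

```python
import subprocess
import sys

# Hypothetical stand-in for a command that reads its config from stdin:
# a child Python process that echoes stdin back to stdout.
ECHO_CMD = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


def run_with_stdin(payload):
    """Feed `payload` to the child's stdin and return its stdout as bytes."""
    # Both pipes are opened in binary mode (the Popen default).
    proc = subprocess.Popen(ECHO_CMD, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        out, _ = proc.communicate(payload)
        return out
    finally:
        # Clean up the child and its pipes even if communicate() raised.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        for pipe in (proc.stdin, proc.stdout):
            if pipe is not None and not pipe.closed:
                pipe.close()


def to_bytes(payload, encoding="utf-8"):
    """Encode str payloads; pass bytes through unchanged."""
    return payload.encode(encoding) if isinstance(payload, str) else payload


# Illustrative payload only; not the real cephadm config format.
config_json = '{"config": "", "keyring": ""}'

# Passing a str to binary-mode pipes raises TypeError.
try:
    run_with_stdin(config_json)
    str_failed = False
except TypeError as exc:
    str_failed = True
    print("str input rejected:", exc)

# Encoding first lets the child receive the data.
result = run_with_stdin(to_bytes(config_json))
print("bytes input accepted:", result)
```

The exact `TypeError` message depends on the platform and on how many pipes are open, so the sketch only relies on the exception type.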

Actions #4

Updated by Sebastian Wagner over 2 years ago

  • Project changed from Ceph to Orchestrator

Actions #5

Updated by Sebastian Wagner over 2 years ago

  • Status changed from New to Resolved

Awesome, thank you!

Resolved upstream.
