Bug #46990

execnet: EOFError: couldnt load message header, expected 9 bytes, got 0

Added by Sebastian Wagner over 3 years ago. Updated about 3 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

[ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: Failed to execute command: /usr/bin/python3 -u
    Module 'cephadm' has failed: Failed to execute command: /usr/bin/python3 -u
Aug 17 13:42:20 master bash[19272]: Traceback (most recent call last):
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 432, in from_io
Aug 17 13:42:20 master bash[19272]:     header = io.read(9)  # type 1, channel 4, payload 4
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
Aug 17 13:42:20 master bash[19272]:     raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
Aug 17 13:42:20 master bash[19272]: EOFError: expected 9 bytes, got 0
Aug 17 13:42:20 master bash[19272]: During handling of the above exception, another exception occurred:
Aug 17 13:42:20 master bash[19272]: Traceback (most recent call last):
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/remoto/process.py", line 188, in check
Aug 17 13:42:20 master bash[19272]:     response = result.receive(timeout)
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
Aug 17 13:42:20 master bash[19272]:     raise self._getremoteerror() or EOFError()
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 967, in _thread_receiver
Aug 17 13:42:20 master bash[19272]:     msg = Message.from_io(io)
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 437, in from_io
Aug 17 13:42:20 master bash[19272]:     raise EOFError("couldnt load message header, " + e.args[0])
Aug 17 13:42:20 master bash[19272]: EOFError: couldnt load message header, expected 9 bytes, got 0
Aug 17 13:42:20 master bash[19272]: During handling of the above exception, another exception occurred:
Aug 17 13:42:20 master bash[19272]: Traceback (most recent call last):
Aug 17 13:42:20 master bash[19272]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 1035, in _remote_connection
Aug 17 13:42:20 master bash[19272]:     yield (conn, connr)
Aug 17 13:42:20 master bash[19272]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 1131, in _run_cephadm
Aug 17 13:42:20 master bash[19272]:     stdin=script.encode('utf-8'))
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/remoto/process.py", line 209, in check
Aug 17 13:42:20 master bash[19272]:     'Failed to execute command: %s' % ' '.join(command)
Aug 17 13:42:20 master bash[19272]: RuntimeError: Failed to execute command: /usr/bin/python3 -u
Aug 17 13:42:20 master bash[19272]: debug 2020-08-17T11:42:20.900+0000 7f2bbda3c700 -1 log_channel(cephadm) log [ERR] : Failed to execute command: /usr/bin/python3 -u
Aug 17 13:42:20 master bash[19272]: Traceback (most recent call last):
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 432, in from_io
Aug 17 13:42:20 master bash[19272]:     header = io.read(9)  # type 1, channel 4, payload 4
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 402, in read
Aug 17 13:42:20 master bash[19272]:     raise EOFError("expected %d bytes, got %d" % (numbytes, len(buf)))
Aug 17 13:42:20 master bash[19272]: EOFError: expected 9 bytes, got 0
Aug 17 13:42:20 master bash[19272]: During handling of the above exception, another exception occurred:
Aug 17 13:42:20 master bash[19272]: Traceback (most recent call last):
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/remoto/process.py", line 188, in check
Aug 17 13:42:20 master bash[19272]:     response = result.receive(timeout)
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 749, in receive
Aug 17 13:42:20 master bash[19272]:     raise self._getremoteerror() or EOFError()
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 967, in _thread_receiver
Aug 17 13:42:20 master bash[19272]:     msg = Message.from_io(io)
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/execnet/gateway_base.py", line 437, in from_io
Aug 17 13:42:20 master bash[19272]:     raise EOFError("couldnt load message header, " + e.args[0])
Aug 17 13:42:20 master bash[19272]: EOFError: couldnt load message header, expected 9 bytes, got 0
Aug 17 13:42:20 master bash[19272]: During handling of the above exception, another exception occurred:
Aug 17 13:42:20 master bash[19272]: Traceback (most recent call last):
Aug 17 13:42:20 master bash[19272]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 1035, in _remote_connection
Aug 17 13:42:20 master bash[19272]:     yield (conn, connr)
Aug 17 13:42:20 master bash[19272]:   File "/usr/share/ceph/mgr/cephadm/module.py", line 1131, in _run_cephadm
Aug 17 13:42:20 master bash[19272]:     stdin=script.encode('utf-8'))
Aug 17 13:42:20 master bash[19272]:   File "/usr/lib/python3.6/site-packages/remoto/process.py", line 209, in check
Aug 17 13:42:20 master bash[19272]:     'Failed to execute command: %s' % ' '.join(command)
Aug 17 13:42:20 master bash[19272]: RuntimeError: Failed to execute command: /usr/bin/python3 -u
Aug 17 13:42:21 master bash[19272]: Warning: Permanently added 'master' (ECDSA) to the list of known hosts.

execnet is again super helpful.

Fortunately, we were able to recover from this, as we're calling _reset_con() in that case.
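The recovery described above can be illustrated with a generic reset-and-retry pattern. This is only a sketch, not the cephadm module's implementation: `run_with_reset`, `run`, and `reset` are hypothetical names, with `reset` standing in for whatever `_reset_con()` does, and whether cephadm retries immediately or on a later cycle is not stated in this ticket.

```python
def run_with_reset(run, reset, attempts=2):
    """Call run(); on a connection-style failure call reset() and try again.

    run and reset are caller-supplied callables (hypothetical stand-ins for
    "execute the remote command" and "drop the cached connection").
    """
    last_error = None
    for _ in range(attempts):
        try:
            return run()
        except (EOFError, RuntimeError) as e:
            last_error = e
            reset()  # discard the broken connection so the next try reconnects
    raise last_error


# Example: the first call fails as in the traceback, the second succeeds.
state = {"calls": 0, "resets": 0}


def flaky_command():
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("Failed to execute command: /usr/bin/python3 -u")
    return "ok"


def reset_connection():
    state["resets"] += 1


result = run_with_reset(flaky_command, reset_connection)
```

The point of the pattern is that a stale or half-open connection is thrown away rather than reused, so a transient failure does not leave the module permanently stuck.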


Related issues

Related to Orchestrator - Bug #38757: mgr/ssh orchestrator doesn't work Can't reproduce
Related to Orchestrator - Cleanup #44676: cephadm: Replace execnet (and remoto) Resolved
Duplicated by Orchestrator - Bug #46764: cephadm (ceph orch apply) sometimes gets "stuck" and cannot deploy any OSDs Can't reproduce

History

#1 Updated by Sebastian Wagner over 3 years ago

  • Description updated (diff)

#2 Updated by Sebastian Wagner over 3 years ago

  • Related to Bug #38757: mgr/ssh orchestrator doesn't work added

#3 Updated by Sebastian Wagner over 3 years ago

  • Subject changed from execnet: expected 9 bytes, got 0 to execnet: EOFError: couldnt load message header, expected 9 bytes, got 0

#5 Updated by Sebastian Wagner over 3 years ago

  • Related to Cleanup #44676: cephadm: Replace execnet (and remoto) added

#6 Updated by Sebastian Wagner over 3 years ago

  • Related to Bug #46764: cephadm (ceph orch apply) sometimes gets "stuck" and cannot deploy any OSDs added

#7 Updated by Nathan Cutler over 3 years ago

  • Affected Versions v15.2.5 added

Seems to:

(1) happen in libvirt VMs running on slower hardware (e.g. nested virt)
(2) be a recent regression

#8 Updated by Sebastian Wagner over 3 years ago

  • Description updated (diff)

#9 Updated by Nathan Cutler over 3 years ago

  • Related to deleted (Bug #46764: cephadm (ceph orch apply) sometimes gets "stuck" and cannot deploy any OSDs)

#10 Updated by Nathan Cutler over 3 years ago

  • Duplicated by Bug #46764: cephadm (ceph orch apply) sometimes gets "stuck" and cannot deploy any OSDs added

#11 Updated by Nathan Cutler over 3 years ago

Note: this problem is known to arise (only on machines with root filesystem on HDD) the first time "ceph orch apply" is run after "cephadm bootstrap" completes.

I found that waiting one minute after the "cephadm bootstrap" command completes, before issuing the "ceph orch apply" command to create OSDs, is enough to make the problem go away.

Could it be that this is just a consequence of cephadm being asynchronous? Is it possible that running "ceph orch apply" immediately after "cephadm bootstrap" returns catches mgr/cephadm unprepared - maybe it is still working through its startup routine, for example?

#12 Updated by Nathan Cutler over 3 years ago

The moral of the story is: wait for the bootstrap MON and MGR to appear in "cephadm ls" before proceeding with "ceph orch apply".
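A deployment script could automate that wait roughly as follows. This is a sketch under assumptions: it assumes "cephadm ls" prints a JSON array whose entries carry a "name" field of the form "mon.<host>" or "mgr.<host>.<id>", and the helper names (`daemon_types`, `wait_for_bootstrap`, `cephadm_ls`) as well as the timeout values are hypothetical.

```python
import json
import subprocess
import time


def daemon_types(ls_output):
    """Return the set of daemon types (text before the first dot in "name")."""
    return {
        d["name"].split(".", 1)[0]
        for d in json.loads(ls_output)
        if isinstance(d, dict) and "name" in d
    }


def wait_for_bootstrap(list_daemons, required=("mon", "mgr"),
                       timeout=120.0, interval=5.0, sleep=time.sleep):
    """Poll list_daemons() until all required daemon types are listed.

    Returns True once they appear, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if set(required) <= daemon_types(list_daemons()):
                return True
        except (ValueError, subprocess.CalledProcessError):
            pass  # output not ready or command failed; try again
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def cephadm_ls():
    """Run "cephadm ls" and return its stdout (assumed to be JSON)."""
    return subprocess.check_output(["cephadm", "ls"], universal_newlines=True)
```

A script would call `wait_for_bootstrap(cephadm_ls)` after "cephadm bootstrap" returns and only run "ceph orch apply" if it returns True. Passing the listing function as a parameter keeps the polling logic testable without a cluster.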

#13 Updated by Sebastian Wagner about 3 years ago

  • Status changed from New to Can't reproduce
