Bug #48498


octopus: timeout when running the "ceph" command

Added by Mathew Clarke over 3 years ago. Updated about 3 years ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
ceph cli
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running "Ubuntu 20.04.1" with kernel "5.4.77-217".

I've installed octopus version 15.2.5 from the distro armhf repo, as "deb-src https://download.ceph.com/debian-octopus/ focal main" fails with unmet dependencies (see issue: https://tracker.ceph.com/issues/45915).

When I run "ceph" from a bash prompt, I get the error below.

Traceback (most recent call last):
  File "/usr/bin/ceph", line 1278, in <module>
    retval = main()
  File "/usr/bin/ceph", line 984, in main
    cluster_handle = run_in_thread(rados.Rados,
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1339, in run_in_thread
    raise Exception("timed out")
Exception: timed out

Even when I run "ceph -h" I get the same error, halfway through the help output. I'm new to Ceph, so any pointers on how I can troubleshoot this would be greatly appreciated.

Thanks

Actions #1

Updated by Mathew Clarke over 3 years ago

The Ubuntu packages have just updated to 15.2.7 and I'm still getting the same issue.

ceph -v
ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
ceph -s
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1278, in <module>
    retval = main()
  File "/usr/bin/ceph", line 984, in main
    cluster_handle = run_in_thread(rados.Rados,
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1339, in run_in_thread
    raise Exception("timed out")
Exception: timed out
Actions #2

Updated by Rocky Cardwell over 3 years ago

Mathew Clarke wrote:

The Ubuntu packages have just updated to 15.2.7 and I'm still getting the same issue.

[...]

[...]

I managed to get ceph running on the Odroid HC2 after having this and other issues. To fix this one, I edited /usr/lib/python3/dist-packages/ceph_argparse.py, line 1332, changing the 32 to 16. I suspect it's something to do with armhf being 32-bit. Like this:

    if timeout == 0 or timeout is None:
        # python threading module will just get blocked if timeout is `None`,
        # otherwise it will keep polling until timeout or thread stops.
        # wait for INT32_MAX, as python 3.6.8 uses int32_t to represent the
        # timeout in integer when converting it to nanoseconds
        timeout = (1 << (16 - 1)) - 1
    t = RadosThread(func, *args, **kwargs)
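For what it's worth, a guess at why this works (my assumption, not verified against the CPython internals): the threading wait converts the relative timeout into an absolute deadline, and on 32-bit armhf `time_t` is a signed 32-bit value, so `now + INT32_MAX` overflows it while `now + INT16_MAX` does not:

```python
import time

INT32_MAX = (1 << (32 - 1)) - 1   # original timeout: 2147483647 s
INT16_MAX = (1 << (16 - 1)) - 1   # patched timeout: 32767 s (~9.1 hours)

now = int(time.time())  # roughly 1.6e9 seconds since the epoch in late 2020

# deadline = now + timeout; a signed 32-bit time_t can hold at most INT32_MAX
print(now + INT32_MAX > INT32_MAX)   # True: the original deadline overflows
print(now + INT16_MAX <= INT32_MAX)  # True: the patched deadline still fits
```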

The other issue I ran into was a segmentation fault when starting the manager daemon. I fixed that issue by adding UNW_ARM_UNWIND_METHOD=4 to /etc/default/ceph like:

# /etc/default/ceph
#
# Environment file for ceph daemon systemd unit files.
#

# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
UNW_ARM_UNWIND_METHOD=4

Hope that helps.

Actions #3

Updated by Mathew Clarke over 3 years ago

Thanks for responding.

I'm only running the OSDs on the Odroid HC2, but I applied "UNW_ARM_UNWIND_METHOD=4" anyway, as "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728" was already set.

I also patched the "timeout = (1 << (16 - 1)) - 1" line in "/usr/lib/python3/dist-packages/ceph_argparse.py".

It no longer errors as before, but hangs for around 50 minutes and then reports "[errno 110] RADOS timed out (error connecting to the cluster)" when running "ceph -s".

Is this something you've run into?

Actions #4

Updated by Rocky Cardwell over 3 years ago

I did have some issues with ceph -s hanging. It always turned out that the monitor process wasn't running, or not enough were running for quorum. In my case, there were errors in /var/log/syslog saying that it couldn't access files in /var/lib/ceph. I ended up fixing those by just changing the files to be owned by the ceph user/group.
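If it helps, here is a quick sketch for spotting wrongly-owned entries under /var/lib/ceph before chowning everything (the helper name `find_wrong_owner` is mine, and it assumes a POSIX system):

```python
import os

def find_wrong_owner(root, uid, gid):
    """Return paths under root whose owner uid or gid differ from the expected ones."""
    wrong = []
    for dirpath, dirnames, filenames in os.walk(root):
        # check the directory itself, then each file in it
        for name in [""] + filenames:
            path = os.path.join(dirpath, name) if name else dirpath
            st = os.lstat(path)  # lstat so symlinks are checked, not followed
            if st.st_uid != uid or st.st_gid != gid:
                wrong.append(path)
    return wrong

# e.g. on a ceph node (assumes the ceph user/group exist):
#   import pwd
#   u = pwd.getpwnam("ceph")
#   print(find_wrong_owner("/var/lib/ceph", u.pw_uid, u.pw_gid))
```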

Actions #5

Updated by Mathew Clarke over 3 years ago

Sorry for the slow reply, I just got around to checking this one. "ceph:ceph" was already set on "/var/lib/ceph", but I ran "chown -R ceph:ceph /var/lib/ceph" anyway, which didn't resolve the hanging issue.

The syslog only shows the following ceph-related warnings:

Dec 22 21:37:01 cn-svr-osd-01 ansible-ceph_volume: [WARNING] The value 5120 (type int) in a string field was converted to '5120' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.
Dec 22 21:37:01 cn-svr-osd-01 ansible-ceph_volume: [WARNING] The value -1 (type int) in a string field was converted to '-1' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.
Dec 22 21:37:01 cn-svr-osd-01 ansible-ceph_volume: [WARNING] The value False (type bool) in a string field was converted to 'False' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.

Thanks for your help with this one, but I think I'm going to invest in some x86 hardware to continue this project.

Actions #6

Updated by O W over 3 years ago

Hi,

I'm also struggling to get Ceph 15.2.7 running using ceph-ansible on Ubuntu hirsute on the Odroid HC2. Currently I only have two HC2s, so I mix them with Vagrant VMs to get a mon quorum. Your fixes helped a lot in getting further, but the mgr doesn't come up.

TASK [ceph-mgr : wait for all mgr to be up] ********************************************************************
Sunday 24 January 2021 15:10:08 +0000 (0:00:00.026) 0:02:34.816 **
FAILED - RETRYING: wait for all mgr to be up (30 retries left).
FAILED - RETRYING: wait for all mgr to be up (29 retries left).
FAILED - RETRYING: wait for all mgr to be up (28 retries left).
FAILED - RETRYING: wait for all mgr to be up (27 retries left).
FAILED - RETRYING: wait for all mgr to be up (26 retries left).
FAILED - RETRYING: wait for all mgr to be up (25 retries left).
FAILED - RETRYING: wait for all mgr to be up (24 retries left).
FAILED - RETRYING: wait for all mgr to be up (23 retries left).
FAILED - RETRYING: wait for all mgr to be up (22 retries left).
FAILED - RETRYING: wait for all mgr to be up (21 retries left).
FAILED - RETRYING: wait for all mgr to be up (20 retries left).
FAILED - RETRYING: wait for all mgr to be up (19 retries left).
FAILED - RETRYING: wait for all mgr to be up (18 retries left).
FAILED - RETRYING: wait for all mgr to be up (17 retries left).
FAILED - RETRYING: wait for all mgr to be up (16 retries left).
FAILED - RETRYING: wait for all mgr to be up (15 retries left).
FAILED - RETRYING: wait for all mgr to be up (14 retries left).
FAILED - RETRYING: wait for all mgr to be up (13 retries left).
FAILED - RETRYING: wait for all mgr to be up (12 retries left).
FAILED - RETRYING: wait for all mgr to be up (11 retries left).
FAILED - RETRYING: wait for all mgr to be up (10 retries left).
FAILED - RETRYING: wait for all mgr to be up (9 retries left).
FAILED - RETRYING: wait for all mgr to be up (8 retries left).
FAILED - RETRYING: wait for all mgr to be up (7 retries left).
FAILED - RETRYING: wait for all mgr to be up (6 retries left).
FAILED - RETRYING: wait for all mgr to be up (5 retries left).
FAILED - RETRYING: wait for all mgr to be up (4 retries left).
FAILED - RETRYING: wait for all mgr to be up (3 retries left).
FAILED - RETRYING: wait for all mgr to be up (2 retries left).
FAILED - RETRYING: wait for all mgr to be up (1 retries left).
fatal: [odroidxu4 -> odroidxu4]: FAILED! => changed=false
attempts: 30
cmd:
- ceph
- --cluster
- ceph
- mgr
- dump
- -f
- json
delta: '0:00:00.758196'
end: '2021-01-24 16:13:22.642994'
rc: 0
start: '2021-01-24 16:13:21.884798'
stderr: ''
stderr_lines: <omitted>
stdout: |2

{"epoch":1,"active_gid":0,"active_name":"","active_addrs":{"addrvec":[]},"active_addr":"(unrecognized address family 0)/0","active_change":"0.000000","active_mgr_features":0,"available":false,"standbys":[],"modules":["iostat","restful"],"available_modules":[],"services":{},"always_on_modules":{"nautilus":["balancer","crash","devicehealth","orchestrator_cli","progress","rbd_support","status","volumes"],"octopus":["balancer","crash","devicehealth","orchestrator","pg_autoscaler","progress","rbd_support","status","telemetry","volumes"]},"last_failure_osd_epoch":0,"active_clients":[]}
stdout_lines: <omitted>
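The dump above shows "available":false and "active_gid":0, i.e. no mgr has registered with the monitors at all, so the task keeps retrying. A minimal sketch of checking that from a script (the helper name `mgr_is_up` is mine):

```python
import json

def mgr_is_up(mgr_dump_json):
    """True if a `ceph mgr dump` output reports an active, available manager."""
    d = json.loads(mgr_dump_json)
    return bool(d.get("available")) and d.get("active_gid", 0) != 0

# trimmed-down example mirroring the dump above
sample = '{"epoch":1,"active_gid":0,"active_name":"","available":false,"standbys":[]}'
print(mgr_is_up(sample))  # False
```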

root@odroidxu4:~# dpkg -l ceph
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version         Architecture Description
+++-==============-===============-============-===================================
ii  ceph           15.2.7-0ubuntu4 armhf        distributed storage and file system

Any ideas?

Cheers,
Oliver

Actions #7

Updated by Kefu Chai about 3 years ago

Actions #8

Updated by Kefu Chai about 3 years ago

  • Subject changed from timeout when running the "ceph" command to octopus: timeout when running the "ceph" command
  • Status changed from New to Triaged
  • Assignee set to Kefu Chai
  • Pull request ID set to 40476
Actions #9

Updated by Kefu Chai about 3 years ago

  • Status changed from Triaged to Fix Under Review