Bug #48498

octopus: timeout when running the "ceph" command

Added by Mathew Clarke 5 months ago. Updated about 1 month ago.

Status: Fix Under Review
Priority: Normal
Assignee:
Category: ceph cli
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: ceph-ansible
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Running "Ubuntu 20.04.1" with kernel "5.4.77-217"

I've installed Octopus version 15.2.5 from the distro armhf repo, since "deb-src https://download.ceph.com/debian-octopus/ focal main" fails with unmet dependencies (see issue: https://tracker.ceph.com/issues/45915).

When I run "ceph" from a bash prompt, I get the error below:

Traceback (most recent call last):
  File "/usr/bin/ceph", line 1278, in <module>
    retval = main()
  File "/usr/bin/ceph", line 984, in main
    cluster_handle = run_in_thread(rados.Rados,
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1339, in run_in_thread
    raise Exception("timed out")
Exception: timed out

Even when I run "ceph -h", I get the same error halfway through the help output. I'm new to Ceph, so any pointers on how I can troubleshoot this would be greatly appreciated.
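
From the traceback, the CLI seems to run its librados calls in a worker thread via ceph_argparse.run_in_thread() and gives up when the join times out; roughly like this (a paraphrase of the shape based on the traceback, not the exact upstream code):

    # Rough paraphrase of what ceph_argparse.run_in_thread() appears to do
    # (illustrative only, not the upstream source).
    import threading

    def run_in_thread(func, *args, timeout=None, **kwargs):
        result = {}

        def worker():
            result['value'] = func(*args, **kwargs)

        t = threading.Thread(target=worker, daemon=True)
        t.start()
        # join with a huge default timeout rather than blocking forever
        t.join(timeout=timeout if timeout else (1 << 31) - 1)
        if t.is_alive():
            raise Exception("timed out")   # the exception seen in the traceback
        return result.get('value')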

Thanks

History

#1 Updated by Mathew Clarke 5 months ago

The Ubuntu packages have just been updated to 15.2.7 and I'm still getting the same issue:

ceph -v
ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
ceph -s
Traceback (most recent call last):
  File "/usr/bin/ceph", line 1278, in <module>
    retval = main()
  File "/usr/bin/ceph", line 984, in main
    cluster_handle = run_in_thread(rados.Rados,
  File "/usr/lib/python3/dist-packages/ceph_argparse.py", line 1339, in run_in_thread
    raise Exception("timed out")
Exception: timed out

#2 Updated by Rocky Cardwell 5 months ago

Mathew Clarke wrote:

The Ubuntu packages have just been updated to 15.2.7 and I'm still getting the same issue

[...]

[...]

I managed to get Ceph running on the Odroid HC2 after hitting this and other issues. To fix this one, I edited /usr/lib/python3/dist-packages/ceph_argparse.py, line 1332, changing the 32 to 16; I suspect it's something to do with armhf being 32-bit. Like this:

    if timeout == 0 or timeout == None:
        # The python threading module blocks if timeout is `None`, otherwise it
        # polls until the timeout expires or the thread stops. Upstream waits for
        # INT32_MAX seconds (python 3.6.8 uses int32_t for the timeout when
        # converting it to nanoseconds); changed here to INT16_MAX, which seems
        # to avoid the problem on 32-bit armhf.
        timeout = (1 << (16 - 1)) - 1
    t = RadosThread(func, *args, **kwargs)
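
If the 32-bit suspicion is right, a quick back-of-the-envelope check (mine, illustrative only) shows one way it could bite: an absolute deadline of "now plus INT32_MAX seconds" no longer fits in a signed 32-bit time value, while "now plus INT16_MAX seconds" does:

    # Back-of-the-envelope check of the 32-bit overflow suspicion (illustrative only).
    import time

    INT32_MAX = (1 << 31) - 1   # the original timeout value
    INT16_MAX = (1 << 15) - 1   # the value after the edit above

    now = int(time.time())                 # ~1.6e9 seconds since the epoch in late 2020
    print(now + INT32_MAX > INT32_MAX)     # True  -> deadline overflows a signed 32-bit time_t
    print(now + INT16_MAX > INT32_MAX)     # False -> patched deadline still fits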

The other issue I ran into was a segmentation fault when starting the manager daemon. I fixed that by adding UNW_ARM_UNWIND_METHOD=4 to /etc/default/ceph (the daemons only read this file at startup, so they need a restart to pick it up), like so:

# /etc/default/ceph
#
# Environment file for ceph daemon systemd unit files.
#

# Increase tcmalloc cache size
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728
UNW_ARM_UNWIND_METHOD=4
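
To double-check that a daemon actually picked the variable up after a restart, something along these lines works (a hypothetical check, assuming the mgr binary is named ceph-mgr and you have root to read its /proc entry):

    # Hypothetical check (not from this tracker): read the environment of a running
    # ceph daemon to confirm it inherited the values from /etc/default/ceph.
    import subprocess

    def daemon_environ(name):
        pid = subprocess.check_output(["pidof", "-s", name], text=True).strip()
        with open(f"/proc/{pid}/environ", "rb") as f:
            raw = f.read().decode()
        return dict(e.split("=", 1) for e in raw.split("\0") if "=" in e)

    env = daemon_environ("ceph-mgr")             # requires root
    print(env.get("UNW_ARM_UNWIND_METHOD"))      # expect '4' once the unit has restarted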

Hope that helps.

#3 Updated by Mathew Clarke 5 months ago

Thanks for responding.

I'm only running the OSDs on the Odroid HC2, but I applied "UNW_ARM_UNWIND_METHOD=4" anyway, as "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728" was already set.

I also patched the "timeout = (1 << (16 - 1)) - 1" line in "/usr/lib/python3/dist-packages/ceph_argparse.py".

It no longer errors as before, but it now hangs for around 50 minutes and then reports "[errno 110] RADOS timed out (error connecting to the cluster)" when running "ceph -s".

Is this something you've run into?

#4 Updated by Rocky Cardwell 5 months ago

I did have some issues with ceph -s hanging. It always turned out that the monitor process wasn't running, or that not enough monitors were running for quorum. In my case, there were errors in /var/log/syslog saying that it couldn't access files in /var/lib/ceph. I ended up fixing those by changing the files to be owned by the ceph user/group.
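
A quick way to spot that situation, sketched here assuming the standard ceph user and group names, is to walk /var/lib/ceph and print anything not owned by ceph:ceph:

    # Hypothetical helper: list paths under /var/lib/ceph whose owner or group
    # is not the ceph user/group (run as root so everything is readable).
    import grp
    import os
    import pwd

    CEPH_UID = pwd.getpwnam("ceph").pw_uid
    CEPH_GID = grp.getgrnam("ceph").gr_gid

    for root, dirs, files in os.walk("/var/lib/ceph"):
        for path in [root] + [os.path.join(root, f) for f in files]:
            st = os.lstat(path)
            if st.st_uid != CEPH_UID or st.st_gid != CEPH_GID:
                print(path)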

#5 Updated by Mathew Clarke 4 months ago

Sorry for the slow reply; I only just got around to checking this one. Ownership "ceph:ceph" was already set on "/var/lib/ceph", but I ran "chown -R ceph:ceph /var/lib/ceph" anyway, which didn't resolve the hanging issue.

Syslog only shows the following ceph-related messages:

Dec 22 21:37:01 cn-svr-osd-01 ansible-ceph_volume: [WARNING] The value 5120 (type int) in a string field was converted to '5120' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.
Dec 22 21:37:01 cn-svr-osd-01 ansible-ceph_volume: [WARNING] The value -1 (type int) in a string field was converted to '-1' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.
Dec 22 21:37:01 cn-svr-osd-01 ansible-ceph_volume: [WARNING] The value False (type bool) in a string field was converted to 'False' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.

Thanks for your help with this one, but I think I'm going to invest in some x86 hardware to continue this project.

#6 Updated by O W 4 months ago

Hi,

I'm also struggling to get Ceph 15.2.7 running with ceph-ansible on Ubuntu Hirsute on the Odroid HC2. Currently I only have two HC2s, and I mix them with Vagrant VMs to get a mon quorum. Your fixes helped a lot in getting further, but the mgr doesn't come up.

TASK [ceph-mgr : wait for all mgr to be up] ********************************************************************
Sunday 24 January 2021 15:10:08 +0000 (0:00:00.026) 0:02:34.816 **
FAILED - RETRYING: wait for all mgr to be up (30 retries left).
FAILED - RETRYING: wait for all mgr to be up (29 retries left).
[... identical retry lines elided ...]
FAILED - RETRYING: wait for all mgr to be up (2 retries left).
FAILED - RETRYING: wait for all mgr to be up (1 retries left).
fatal: [odroidxu4 > odroidxu4]: FAILED! => changed=false
attempts: 30
cmd:
- ceph
- --cluster
- ceph
- mgr
- dump
- -f
- json
delta: '0:00:00.758196'
end: '2021-01-24 16:13:22.642994'
rc: 0
start: '2021-01-24 16:13:21.884798'
stderr: ''
stderr_lines: <omitted>
stdout: |2

{"epoch":1,"active_gid":0,"active_name":"","active_addrs":{"addrvec":[]},"active_addr":"(unrecognized address family 0)/0","active_change":"0.000000","active_mgr_features":0,"available":false,"standbys":[],"modules":["iostat","restful"],"available_modules":[],"services":{},"always_on_modules":{"nautilus":["balancer","crash","devicehealth","orchestrator_cli","progress","rbd_support","status","volumes"],"octopus":["balancer","crash","devicehealth","orchestrator","pg_autoscaler","progress","rbd_support","status","telemetry","volumes"],"last_failure_osd_epoch":0,"active_clients":[]}}
stdout_lines: <omitted>
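
The mgr dump in stdout is what stands out: "available" is false and there is no active or standby mgr, which is presumably the condition the ceph-ansible task keeps polling for. Roughly the kind of check it is waiting on (my reading of the output above, not ceph-ansible's actual code):

    # Sketch of the condition being polled (not ceph-ansible's actual code):
    # retry `ceph mgr dump -f json` until an active mgr reports itself available.
    import json
    import subprocess
    import time

    def mgr_available():
        out = subprocess.check_output(
            ["ceph", "--cluster", "ceph", "mgr", "dump", "-f", "json"], text=True)
        dump = json.loads(out)
        return dump.get("available", False) and dump.get("active_name", "") != ""

    for _ in range(30):              # mirrors the 30 retries in the task output
        if mgr_available():
            print("mgr is up")
            break
        time.sleep(10)
    else:
        print("mgr never became available")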

root@odroidxu4:~# dpkg -l ceph
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version         Architecture Description
+++-==============-===============-============-===================================
ii  ceph           15.2.7-0ubuntu4 armhf        distributed storage and file system

Any ideas?

Cheers,
Oliver

#7 Updated by Kefu Chai about 1 month ago

#8 Updated by Kefu Chai about 1 month ago

  • Subject changed from timeout when running the "ceph" command to octopus: timeout when running the "ceph" command
  • Status changed from New to Triaged
  • Assignee set to Kefu Chai
  • Pull request ID set to 40476

#9 Updated by Kefu Chai about 1 month ago

  • Status changed from Triaged to Fix Under Review
