Bug #43185

ceph -s not showing client activity

Added by super xor 2 months ago. Updated 16 days ago.

Status:
Need More Info
Priority:
High
Assignee:
-
Category:
-
Target version:
v15.0.0
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
v14.2.4
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature:

Description

Since the Nautilus upgrade, ceph -s often (2 out of 3 times) does not show any client or recovery activity. Right now it has been showing nothing for several minutes.
This must be a bug, because we have heavy I/O going on.

I can't give much more information; there are no errors or anything in the logs that relate to this. It should be easy to reproduce on other setups.

History

#1 Updated by super xor 2 months ago

ceph -s only looks like this (note the missing client/recovery activity lines):

ceph -s
  cluster:
    id:     c4068f25-d46d-438d-af63-5679a2d56efb
    health: HEALTH_WARN
            1 nearfull osd(s)
            2 pool(s) nearfull
            Failed to send data to Zabbix
            1 subtrees have overcommitted pool target_size_bytes
            1 subtrees have overcommitted pool target_size_ratio

  services:
    mon: 3 daemons, quorum m1-1045558,m2-1045557,m3-1045556 (age 5w)
    mgr: m1-1045558(active, since 5w), standbys: m3-1045556, m2-1045557
    mds: xz:1 {0=m3-1045556=up:active} 2 up:standby
    osd: 136 osds: 136 up (since 3w), 136 in (since 3w)
    rgw: 1 daemon active (m1-1045558.rgw0)

  data:
    pools:   11 pools, 2657 pgs
    objects: 549.18M objects, 353 TiB
    usage:   752 TiB used, 350 TiB / 1.1 PiB avail
    pgs:     2627 active+clean
             29   active+clean+scrubbing+deep
             1    active+clean+scrubbing

#2 Updated by Neha Ojha 2 months ago

  • Project changed from mgr to RADOS
  • Priority changed from Normal to Urgent

#4 Updated by Neha Ojha about 1 month ago

  • Status changed from New to Need More Info

super xor wrote:

Possibly related to https://tracker.ceph.com/issues/43364 and https://tracker.ceph.com/issues/43317

Both of these issues were reported against 14.2.5; are you also running 14.2.5? The Affected Versions field indicates 14.2.4.

#5 Updated by super xor about 1 month ago

We run 14.2.4. I sometimes see the mgr process at 100%, and I have been told that the missing activity display might be caused by an overloaded mgr or something similar.
It's hard to debug, I get it, but it basically just doesn't show anything, as pasted above. This happens sporadically, usually when there is heavy I/O due to recovery or similar 'changes'. We have been adding OSDs, changing PG counts, etc. lately, so we see it a lot.

#6 Updated by Sage Weil 21 days ago

  • Target version set to v15.0.0

#7 Updated by Sage Weil 20 days ago

  • Priority changed from Urgent to High

#8 Updated by Neha Ojha 19 days ago

Are you observing any client activity in the cluster logs while "ceph -s" isn't reporting it?
It is sometimes possible for client I/O to be intermittently missing during periods of aggressive recovery and scrub activity. You can increase the value of "mon_stat_smooth_intervals" from the default of 6 to maybe 9 or higher and see if it helps. This increases the number of PGMap stats over which the average read/write throughput of the whole cluster is calculated before it is displayed.
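
If it helps, a minimal sketch of applying that at runtime via the centralized config (the value 9 is just an example, not a recommendation):

# Raise the stats smoothing window from the default of 6 PGMaps to 9:
ceph config set mon mon_stat_smooth_intervals 9
# Verify the value now in effect:
ceph config get mon mon_stat_smooth_intervals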

#9 Updated by super xor 17 days ago

There's almost no load apart from scrubbing; this is pretty average I/O:
client: 20 MiB/s rd, 61 MiB/s wr, 815 op/s rd, 1.14k op/s wr
And we have about 150 OSDs.

But the mgr process is stuck at 100% and crashes every 24h or so. Furthermore, I can't call certain functions like balancer status; they just hang forever.
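
For instance, one way to demonstrate the hang without blocking the shell (a sketch; the 10-second limit is arbitrary):

# Exits with status 124 if the command is still hanging after 10 seconds:
timeout 10 ceph balancer status || echo "balancer status hung"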

It crashes with the following error:

2020-01-30 14:54:56.872 7fc861370700  0 mgr[dashboard] [30/Jan/2020:14:54:56] ENGINE Error in HTTPServer.tick
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2021, in start
    self.tick()
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line 2090, in tick
    s, ssl_env = self.ssl_adapter.wrap(s)
  File "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/ssl_builtin.py", line 67, in wrap
    server_side=True)
  File "/usr/lib/python2.7/ssl.py", line 369, in wrap_socket
    _context=self)
  File "/usr/lib/python2.7/ssl.py", line 617, in __init__
    self.do_handshake()
  File "/usr/lib/python2.7/ssl.py", line 846, in do_handshake
    self._sslobj.do_handshake()
error: [Errno 0] Error

#10 Updated by super xor 17 days ago

strace of the hanging mgr thread (a capture sketch follows the trace):

...
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
getpid()                                = 1360760
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025e9d80e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x56025e9d8038, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 1360760
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=4000}) = 0 (Timeout)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x56025eacff60, FUTEX_WAKE_PRIVATE, 1) = 0
select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=8000}) = 0 (Timeout)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = 0
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x56025eacff60, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, 0xffffffff) = -1 EAGAIN (Resource temporarily unavailable)
...
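
For reference, one way a per-thread trace like this can be captured (the thread ID is a placeholder, to be taken from top's output):

# Find the spinning thread inside the mgr process:
top -H -p $(pidof ceph-mgr)
# Attach to that thread for a bounded time and print timestamps:
timeout 10 strace -tt -p <TID>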

#11 Updated by Neha Ojha 16 days ago

Can you grab a wallclock profiler dump from the mgr process when its CPU usage goes to 100%?
You can learn more about how to use it at https://github.com/markhpc/gdbpmp.
Grab the dump using something like: gdbpmp.py -p `pidof ceph-mgr` -n 1000 -o mgr.gdbpmp
View the results using: gdbpmp.py -i mgr.gdbpmp -t 1

You can also disable your mgr modules to see if the problem goes away. FWIW, we have fixed an issue with the balancer module that made the mgr hang; the fix will be available in 14.2.8.
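
A sketch of how that can be done per module (the dashboard is only an example; in Nautilus the balancer is an always-on module, so it is paused with "ceph balancer off" rather than disabled):

# List enabled, always-on, and available mgr modules:
ceph mgr module ls
# Disable a suspect module, e.g. the dashboard:
ceph mgr module disable dashboard
# Pause automatic balancing instead of disabling the module:
ceph balancer off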
