
Bug #24421

async messenger thread CPU high, OSD service not normal until restart

Added by Yong Wang over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature:

Description

ceph version 12.2.5
ms_type: default (async+posix)
2+1 EC CephFS
three nodes
I/O bandwidth performance testing

Two messenger threads (msgr-worker-0 and msgr-worker-2) are at 100% CPU, while msgr-worker-1's usage is much lower.
According to strace, they call epoll_wait far too many times, returning 0 or 1 events each time.
Client I/O appears to be blocked until we restart the OSD.
In the osd dump output, no running process with PID 1004721 can be found, and osd.62 is bound to its four ports with two different PIDs; every other OSD shows a single PID.
The abnormal OSD also has a much wider last_clean_interval.
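One way to quantify the busy loop is to run `strace -c -f -p <tid>` on a hot worker thread for a few seconds and compare syscall counts. The sketch below parses such a summary; the summary text and helper function are hypothetical illustrations, not output from the affected host:

```python
# Hypothetical strace -c summary from a busy-looping msgr-worker thread.
summary = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.00    4.950000           1   4950000           epoll_wait
  1.00    0.050000          50      1000           futex
"""

def calls_per_syscall(text):
    """Map syscall name -> call count from an `strace -c` summary table."""
    calls = {}
    for line in text.splitlines():
        parts = line.split()
        # data rows start with a numeric "% time" column
        if len(parts) >= 5 and parts[0].replace('.', '').isdigit():
            calls[parts[-1]] = int(parts[3])
    return calls
```

A worker dominated by millions of near-zero-duration `epoll_wait` calls, as in the fake summary above, matches the strace observation in this report.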

below is for detail logs for the osd.62 (cpu high and io blocked osd)

===============================

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                
5106 ceph 20 0 4470312 2.686g 14584 R 99.9 4.3 260:29.18 msgr-worker-0
5114 ceph 20 0 4470312 2.686g 14584 R 99.9 4.3 1128:47 msgr-worker-2

=================================

ceph osd dump |grep osd.62
osd.62 up in weight 1 up_from 1313 up_thru 1507 down_at 1311 last_clean_interval [974,1312) 10.0.30.56:6811/4721 10.0.30.56:6900/1004721 10.0.30.56:6901/1004721 10.0.30.56:6902/1004721 exists,up 49af9dab-a098-468b-b1c1-695e6b38b497
[root@node56 infinityfs1]# ceph osd dump |grep osd.61
osd.61 up in weight 1 up_from 949 up_thru 1282 down_at 948 last_clean_interval [924,947) 10.0.30.56:6802/4757 10.0.30.56:6820/4757 10.0.30.56:6852/4757 10.0.30.56:6878/4757 exists,up c502f177-8e71-4626-abfd-43dd68e60a75
[root@node56 infinityfs1]# ceph osd dump |grep osd.63
osd.63 up in weight 1 up_from 949 up_thru 1515 down_at 948 last_clean_interval [925,947) 10.0.30.56:6812/4732 10.0.30.56:6835/4732 10.0.30.56:6857/4732 10.0.30.56:6889/4732 exists,up 097973c6-f082-4340-82d9-4eed5d9f6728
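A quick way to spot the anomaly in the dump lines above is to extract the `ip:port/pid` address tokens and check whether all four bindings share one PID. The helper name is hypothetical and the address strings are shortened copies of the dump output:

```python
import re

def bound_pids(osd_dump_line):
    """Return the set of PIDs advertised in an osd dump line's
    ip:port/pid address tokens. A healthy OSD binds all four ports
    (public, cluster, heartbeat back/front) with a single PID."""
    return {m.group(1) for m in re.finditer(r':\d+/(\d+)', osd_dump_line)}

# address tokens copied from the osd dump output above
osd62 = ("10.0.30.56:6811/4721 10.0.30.56:6900/1004721 "
         "10.0.30.56:6901/1004721 10.0.30.56:6902/1004721")
osd61 = ("10.0.30.56:6802/4757 10.0.30.56:6820/4757 "
         "10.0.30.56:6852/4757 10.0.30.56:6878/4757")
```

For osd.62 this yields two PIDs (4721 and 1004721), while the healthy OSDs yield exactly one.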

===============================

2018-06-05 15:15:47.304673 7faebf4b7700 25 -- 10.0.30.56:6900/1004721 >> 10.0.30.55:6827/4730 conn(0x55618c856800 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=262 cs=1 l=0).read_until got 53 in buffer left is 0 buffer still has 853

2018-06-05 15:18:35.969029 7faebecb6700 30 Event(0x55616f530c80 nevent=5000 time_id=25911621).process_events shortest is 520967s it->first is 520937s
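The process_events line above reports a "shortest" timer of 520967 s, which is suspiciously large for a messenger event loop; converting it to days suggests a time event scheduled roughly six days in the future, which could point at a clock-related problem:

```python
shortest_s = 520967          # value from the osd.62 log line above
days = shortest_s / 86400    # seconds per day
# roughly 6 days -- far beyond any normal messenger timer interval
```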

==============================

ceph daemon osd.62 perf dump |grep msgr_recv_messages
"msgr_recv_messages": 31469372,
"msgr_recv_messages": 46608668,
"msgr_recv_messages": 29412686,
[root@node56 infinityfs1]# ceph daemon osd.61 perf dump |grep msgr_recv_messages
"msgr_recv_messages": 2060833,
"msgr_recv_messages": 2108391,
"msgr_recv_messages": 1851998,
[root@node56 infinityfs1]# ceph daemon osd.63 perf dump |grep msgr_recv_messages
"msgr_recv_messages": 5926178,
"msgr_recv_messages": 9847810,
"msgr_recv_messages": 4722634,

History

#1 Updated by Yong Wang over 2 years ago

Why does osd.62 have different PIDs?
Why does the async messenger accumulate so many external_num_events?
Could it be caused by time (clock) issues or something else?
Thanks for any help.

#2 Updated by Yong Wang over 2 years ago

Hi all, are there any updates on this?
