
Bug #24421

async messenger thread CPU high, OSD service not normal until restart

Added by Yong Wang over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
rados
Pull request ID:
Crash signature:

Description

ceph version 12.2.5
ms_type: default (async+posix)
2+1 EC CephFS
three nodes
I/O bandwidth performance testing

Two messenger threads (msgr-worker-0 and msgr-worker-2) are at 100% CPU, while msgr-worker-1's usage is much lower.
According to strace, they call epoll_wait far too many times, returning 0 or 1 events each time.
Client I/O appears to be blocked until we restart the OSD.
In the osd dump output, no running process with PID 1004721 can be found, and osd.62 is bound to its four ports with two different PIDs; every other OSD shows a single PID.
The abnormal OSD also has a much wider last_clean_interval.
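One way to quantify the busy loop is to run `strace -c -f -p <tid>` on a hot worker thread for a few seconds and compare syscall counts. The sketch below parses such a summary; the summary text and helper function are hypothetical illustrations, not output from the affected host:

```python
# Hypothetical strace -c summary from a busy-looping msgr-worker thread.
summary = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.00    4.950000           1   4950000           epoll_wait
  1.00    0.050000          50      1000           futex
"""

def calls_per_syscall(text):
    """Map syscall name -> call count from an `strace -c` summary table."""
    calls = {}
    for line in text.splitlines():
        parts = line.split()
        # data rows start with a numeric "% time" column
        if len(parts) >= 5 and parts[0].replace('.', '').isdigit():
            calls[parts[-1]] = int(parts[3])
    return calls
```

A worker dominated by millions of near-zero-duration `epoll_wait` calls, as in the fake summary above, matches the strace observation in this report.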

below is for detail logs for the osd.62 (cpu high and io blocked osd)

===============================

PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                
5106 ceph 20 0 4470312 2.686g 14584 R 99.9 4.3 260:29.18 msgr-worker-0
5114 ceph 20 0 4470312 2.686g 14584 R 99.9 4.3 1128:47 msgr-worker-2

=================================

ceph osd dump |grep osd.62
osd.62 up in weight 1 up_from 1313 up_thru 1507 down_at 1311 last_clean_interval [974,1312) 10.0.30.56:6811/4721 10.0.30.56:6900/1004721 10.0.30.56:6901/1004721 10.0.30.56:6902/1004721 exists,up 49af9dab-a098-468b-b1c1-695e6b38b497
[root@node56 infinityfs1]# ceph osd dump |grep osd.61
osd.61 up in weight 1 up_from 949 up_thru 1282 down_at 948 last_clean_interval [924,947) 10.0.30.56:6802/4757 10.0.30.56:6820/4757 10.0.30.56:6852/4757 10.0.30.56:6878/4757 exists,up c502f177-8e71-4626-abfd-43dd68e60a75
[root@node56 infinityfs1]# ceph osd dump |grep osd.63
osd.63 up in weight 1 up_from 949 up_thru 1515 down_at 948 last_clean_interval [925,947) 10.0.30.56:6812/4732 10.0.30.56:6835/4732 10.0.30.56:6857/4732 10.0.30.56:6889/4732 exists,up 097973c6-f082-4340-82d9-4eed5d9f6728
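A quick way to spot the anomaly in the dump lines above is to extract the `ip:port/pid` address tokens and check whether all four bindings share one PID. The helper name is hypothetical and the address strings are shortened copies of the dump output:

```python
import re

def bound_pids(osd_dump_line):
    """Return the set of PIDs advertised in an osd dump line's
    ip:port/pid address tokens. A healthy OSD binds all four ports
    (public, cluster, heartbeat back/front) with a single PID."""
    return {m.group(1) for m in re.finditer(r':\d+/(\d+)', osd_dump_line)}

# address tokens copied from the osd dump output above
osd62 = ("10.0.30.56:6811/4721 10.0.30.56:6900/1004721 "
         "10.0.30.56:6901/1004721 10.0.30.56:6902/1004721")
osd61 = ("10.0.30.56:6802/4757 10.0.30.56:6820/4757 "
         "10.0.30.56:6852/4757 10.0.30.56:6878/4757")
```

For osd.62 this yields two PIDs (4721 and 1004721), while the healthy OSDs yield exactly one.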

===============================

2018-06-05 15:15:47.304673 7faebf4b7700 25 -- 10.0.30.56:6900/1004721 >> 10.0.30.55:6827/4730 conn(0x55618c856800 :-1 s=STATE_OPEN_MESSAGE_HEADER pgs=262 cs=1 l=0).read_until got 53 in buffer left is 0 buffer still has 853

2018-06-05 15:18:35.969029 7faebecb6700 30 Event(0x55616f530c80 nevent=5000 time_id=25911621).process_events shortest is 520967s it->first is 520937s
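The process_events line above reports a "shortest" timer of 520967 s, which is suspiciously large for a messenger event loop; converting it to days suggests a time event scheduled roughly six days in the future, which could point at a clock-related problem:

```python
shortest_s = 520967          # value from the osd.62 log line above
days = shortest_s / 86400    # seconds per day
# roughly 6 days -- far beyond any normal messenger timer interval
```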

==============================

ceph daemon osd.62 perf dump |grep msgr_recv_messages
"msgr_recv_messages": 31469372,
"msgr_recv_messages": 46608668,
"msgr_recv_messages": 29412686,
[root@node56 infinityfs1]# ceph daemon osd.61 perf dump |grep msgr_recv_messages
"msgr_recv_messages": 2060833,
"msgr_recv_messages": 2108391,
"msgr_recv_messages": 1851998,
[root@node56 infinityfs1]# ceph daemon osd.63 perf dump |grep msgr_recv_messages
"msgr_recv_messages": 5926178,
"msgr_recv_messages": 9847810,
"msgr_recv_messages": 4722634,

History

#1 Updated by Yong Wang over 2 years ago

Why does osd.62 have different PIDs?
Why does the async messenger accumulate so many external_num_events?
Could it be caused by time (clock) issues or something else?
Thanks for any help.

#2 Updated by Yong Wang over 2 years ago

Hi all, are there any updates on this?
