Bug #23082

msg/Async drop message, io blocked a long time

Added by Yong Wang about 6 years ago. Updated about 5 years ago.

Status: New
Priority: Normal
Assignee:
Category: AsyncMessenger
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

msg/Async drops messages, and I/O is blocked for a long time.

2018-02-22 09:38:30.455263 7f7efec90700 0 -- 10.124.241.83:6967/25381 >> 10.124.241.86:7020/11018 conn(0x7f7e5052a000 sd=2255 :6967 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_OPEN
2018-02-22 09:38:30.455626 7f7effc92700 0 -- 10.124.241.83:6967/25381 >> 10.124.241.86:7020/11018 conn(0x7f7e5735c000 sd=2255 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=228 cs=3 l=0).process missed message? skipped from seq 621 to 624

2018-02-22 09:37:18.439586 7f7198893700 0 -- 10.124.241.87:7057/5952 >> 10.124.241.83:7057/26048 conn(0x7f70e5fc0000 sd=-1 :7057 s=STATE_OPEN pgs=179 cs=1 l=0).fault initiating reconnect

2018-02-22 09:37:02.443131 7f719608e700 0 log_channel(cluster) log [WRN] : slow request 30.351098 seconds old, received at 2018-02-22 09:36:32.091982: osd_op(client.16954133.0:61094 12.f2ed368a 0283b66d-01ef-4075-86a0-cf3abe4f7e44.7526188.1__shadow_(”掉色“的红糖2)揭秘红糖内幕+如何鉴别假红糖?[预编版]_8B210199-71C4-59FA-4459-97E3B6976C59/data/2017-11-01[乐惠苏州](卢中火)(乐惠苏州)(”掉色“的红糖2)揭秘红糖内幕+如何鉴>别假红糖?037FE8EF-78C2-448B-BA8C-419F85F97038.wav.2~XHTJnH02bCeNNR0SS76BFfTvCHee4tL.1_4 [read 0~434944] snapc 0=[] ack+read+known_if_redirected e18457) currently started

History

#1 Updated by Yong Wang about 6 years ago

ceph version is 10.2.10

#2 Updated by Yong Wang about 6 years ago

ceph-osd.74.log-20180223.gz:2018-02-22 09:38:10.106352 7ff67ac92700 0 -- 10.124.241.83:7042/25874 >> 10.124.241.81:7044/32362 conn(0x7ff5b4dd1000 sd=2662 :7042 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 2 vs existing csq=1 existing_state=STATE_OPEN

ceph-osd.74.log-20180223.gz:2018-02-22 09:38:10.106671 7ff679c90700 0 -- 10.124.241.83:7042/25874 >> 10.124.241.81:7044/32362 conn(0x7ff5c350a000 sd=2662 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=242 cs=3 l=0).process missed message? skipped from seq 4000 to 4002

ceph-osd.12.log-20180223.gz:2018-02-22 09:38:10.105752 7f8e8dc8f700 0 -- 10.124.241.81:7044/32362 >> 10.124.241.83:7042/25874 conn(0x7f8df9e7f000 sd=-1 :-1 s=STATE_OPEN pgs=251 cs=1 l=0).fault initiating reconnect

#3 Updated by Greg Farnum about 6 years ago

  • Assignee set to Haomai Wang

If it's a recurring issue, maybe just switch to SimpleMessenger since it was still the default for Jewel?

You will also probably need to provide more logs than just the error messages, but I'll let Haomai address any other details.

#4 Updated by Haomai Wang about 6 years ago

@wangyong Please move to SimpleMessenger for Jewel, since several bug fixes have not been backported to Jewel (the backports are not easy).
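
A minimal sketch of that switch, assuming the messenger is selected through the ms_type option discussed in this thread and that it is set in the [global] section of ceph.conf. The section placement is an assumption, and the daemons would most likely need a restart to pick up the change.

```
[global]
ms_type = simple
```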

#5 Updated by Yong Wang about 6 years ago

Yes. We changed ms_type from simple to async because the simple messenger framework consumes too many resources.

Do you have a plan to backport the fixes to Jewel 10.2.10 (or 10.2.11)?

Could you provide links to the MRs for those bug fixes?

From reading the code, I found that ms_die_on_skipped_message defaults to false. If ms_die_on_skipped_message is set to true (debug version), can the daemon restart itself and recover communication?

==================
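if (message->get_seq() > cur_seq + 1) {
  // A gap in sequence numbers means at least one incoming message was skipped.
  ldout(async_msgr->cct, 0) << __func__ << " missed message? skipped from seq "
                            << cur_seq << " to " << message->get_seq() << dendl;
  // Abort only when ms_die_on_skipped_message is enabled (default: false).
  if (async_msgr->cct->_conf->ms_die_on_skipped_message)
    assert(0 == "skipped incoming seq");
}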

================
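
The check quoted above can be tried outside of Ceph. What follows is only a standalone sketch, not Ceph code: `SeqTracker`, `on_message`, and `die_on_skipped` are hypothetical names, the flag merely plays the role of ms_die_on_skipped_message, and the sketch throws an exception where the real code asserts. Ignoring sequence numbers that are not newer than the last accepted one is also an assumption of the sketch.

```cpp
#include <cstdint>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-in for receive-side sequence tracking; not Ceph code.
struct SeqTracker {
  uint64_t in_seq = 0;          // last sequence number accepted
  bool die_on_skipped = false;  // plays the role of ms_die_on_skipped_message

  // Returns how many messages were skipped before `seq` (0 if none).
  // Throws instead of asserting so the behaviour can be observed in a test.
  uint64_t on_message(uint64_t seq) {
    if (seq <= in_seq) {
      // Assumption of this sketch: old or duplicate sequence numbers are ignored.
      return 0;
    }
    uint64_t skipped = 0;
    if (seq > in_seq + 1) {
      skipped = seq - in_seq - 1;
      std::cerr << "missed message? skipped from seq " << in_seq
                << " to " << seq << "\n";
      if (die_on_skipped)
        throw std::runtime_error("skipped incoming seq");
    }
    in_seq = seq;
    return skipped;
  }
};
```

With `in_seq` at 621, calling `on_message(624)` reports two skipped messages (622 and 623), which matches the "skipped from seq 621 to 624" line in the description. With `die_on_skipped` set, the same call throws instead of continuing.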

#6 Updated by Greg Farnum about 6 years ago

Most init systems will restart the daemon on a single assert, and Ceph's normal recovery mechanisms will come into play then and let things recover, yes.

#7 Updated by Nathan Cutler about 6 years ago

There is one jewel async messenger backport PR open - https://github.com/ceph/ceph/pull/13212 - but it's not mergeable in its current form because it causes a memory leak.

Please note that Jewel will be declared "End of Life" (EoL) when Mimic is released.

#8 Updated by Greg Farnum about 5 years ago

  • Project changed from RADOS to Messengers
  • Category deleted (Tests)

#9 Updated by Greg Farnum about 5 years ago

  • Category set to AsyncMessenger
