Project

General

Profile

Actions

Bug #13213

closed

stuck recovering apparently due to pg_query not sent by async messenger?

Added by Samuel Just over 8 years ago. Updated about 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

pg 0.b7 seems to be stuck due to a pg_query not getting sent by async messenger.

Interval starts around 2015-09-23 01:22:40.720925 epoch 1134. osd.0 never seems to get the pg_query message sent at:

2015-09-23 01:22:40.733947 7f12b5581700 1 -- 10.214.136.2:6812/1005806 >> 10.214.136.2:6803/46542 conn(0x7f12cdd41000 sd=40 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=777 cs=2 l=0). tx 0x7f12d041cc00 pg_query(0.37,0.b7,1.1 epoch 1134) v3

(search forward for '10.214.136.2:6803/46542.*pg_query\(0.37,0.b7,1.1 epoch 1134\)' in ceph-osd.sorted)

ubuntu@teuthology:/a/samuelj-2015-09-22_22:28:00-rados-wip-sam-testing-distro-basic-multi/1065726/remote/ceph-osd.sorted

Actions #1

Updated by Samuel Just over 8 years ago

ceph-osd.sorted contains the osd logs sorted together fyi.

Actions #2

Updated by Haomai Wang over 8 years ago

1. osd.0 build a successful connection with osd.1:
osd.0 log
2015-09-23 01:22:38.594434 7f12ba4ef700 1 -- 10.214.136.2:6812/1005806 >> 10.214.136.2:6803/46542 conn(0x7f12cdd41000 sd=40 :-1 s=STATE_OPEN pgs=777 cs=1 l=0). tx 0x7f12cde42d20 pg_info(1 pgs e1131:0.1e) v4
osd.1 log
2015-09-23 01:22:38.594589 7f73e67a1700 1 -- 10.214.136.2:6803/46542 >> 10.214.136.2:6812/1005806 conn(0x7f73f9e8b000 sd=126 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=303 cs=1 l=0). rx osd.1 0x7f73f5e9e1e0 pg_info(1 pgs e1131:0.1e) v4
2. osd.0 get peer closed log:
2015-09-23 01:22:39.812188 7f12bf4f9700 1 -- 10.214.136.2:6812/1005806 >> 10.214.136.2:6803/46542 conn(0x7f12cdd41000 sd=40 :-1 s=STATE_OPEN pgs=777 cs=1 l=0).read_bulk peer close file descriptor 40
we can't find any log in osd.1 side about what happened
3. osd.0 try to connect to osd.1 again, and increase his "connect_seq" to 2
4. osd.1 accept new connection but find peer "connect_seq" > 0, it send "RESETSESSION" tag
5. osd.0 receive "RESETSESSION" tag and clear its send queue
6. osd.1 won't receive pg_query message

The secret is I can't know why osd.1 close its connection in step 2, I searched AsyncConnection log, I guess it may caller mark_down this connection, so it won't output failure log.

Do you have new ideas? Otherwise I need to lower some log level.

Actions #3

Updated by Sage Weil over 8 years ago

  • Priority changed from Urgent to High
Actions #4

Updated by Samuel Just over 8 years ago

Haomai, you can create a branch of the suite with debug_ms = 20 to reproduce.

Actions #5

Updated by Samuel Just over 8 years ago

  • Priority changed from High to Urgent
Actions #6

Updated by Sage Weil over 8 years ago

let's schedule a run of -s rados/thrash with debug ms = 20 and ms type = async to hunt these last few bugs down!

Actions #7

Updated by Haomai Wang over 8 years ago

thanks! let me know if reproduce

Actions #8

Updated by Haomai Wang over 8 years ago

I ran rados/thrash suite here(http://pulpito.ceph.com/haomai-2015-12-20_00:32:30-rados:thrash-master-distro-basic-multi/). It seemed no related async bug. Looks like this bug is related to https://github.com/ceph/ceph/pull/6478 .

Actions #9

Updated by Haomai Wang about 8 years ago

  • Status changed from New to Resolved
Actions

Also available in: Atom PDF