Bug #13213: stuck recovering apparently due to pg_query not sent by async messenger? - Ceph - Ceph

Actions

Copy link

Bug #13213

closed

stuck recovering apparently due to pg_query not sent by async messenger?

Added by Samuel Just over 8 years ago. Updated about 8 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Haomai Wang

Category:

Target version:

% Done:

Source:

other

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

pg 0.b7 seems to be stuck due to a pg_query not getting sent by async messenger.

Interval starts around 2015-09-23 01:22:40.720925 epoch 1134. osd.0 never seems to get the pg_query message sent at:

2015-09-23 01:22:40.733947 7f12b5581700 1 -- 10.214.136.2:6812/1005806 >> 10.214.136.2:6803/46542 conn(0x7f12cdd41000 sd=40 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY pgs=777 cs=2 l=0). tx 0x7f12d041cc00 pg_query(0.37,0.b7,1.1 epoch 1134) v3

(search forward for '10.214.136.2:6803/46542.*pg_query\(0.37,0.b7,1.1 epoch 1134\)' in ceph-osd.sorted)

ubuntu@teuthology:/a/samuelj-2015-09-22_22:28:00-rados-wip-sam-testing-distro-basic-multi/1065726/remote/ceph-osd.sorted

Actions

Copy link

Updated by Samuel Just over 8 years ago

ceph-osd.sorted contains the osd logs sorted together fyi.

Actions

Copy link

Updated by Haomai Wang over 8 years ago

1. osd.0 build a successful connection with osd.1:
osd.0 log
2015-09-23 01:22:38.594434 7f12ba4ef700 1 -- 10.214.136.2:6812/1005806 >> 10.214.136.2:6803/46542 conn(0x7f12cdd41000 sd=40 :-1 s=STATE_OPEN pgs=777 cs=1 l=0). tx 0x7f12cde42d20 pg_info(1 pgs e1131:0.1e) v4
osd.1 log
2015-09-23 01:22:38.594589 7f73e67a1700 1 -- 10.214.136.2:6803/46542 >> 10.214.136.2:6812/1005806 conn(0x7f73f9e8b000 sd=126 :6803 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=303 cs=1 l=0). rx osd.1 0x7f73f5e9e1e0 pg_info(1 pgs e1131:0.1e) v4
2. osd.0 get peer closed log:
2015-09-23 01:22:39.812188 7f12bf4f9700 1 -- 10.214.136.2:6812/1005806 >> 10.214.136.2:6803/46542 conn(0x7f12cdd41000 sd=40 :-1 s=STATE_OPEN pgs=777 cs=1 l=0).read_bulk peer close file descriptor 40
we can't find any log in osd.1 side about what happened
3. osd.0 try to connect to osd.1 again, and increase his "connect_seq" to 2
4. osd.1 accept new connection but find peer "connect_seq" > 0, it send "RESETSESSION" tag
5. osd.0 receive "RESETSESSION" tag and clear its send queue
6. osd.1 won't receive pg_query message

The secret is I can't know why osd.1 close its connection in step 2, I searched AsyncConnection log, I guess it may caller mark_down this connection, so it won't output failure log.

Do you have new ideas? Otherwise I need to lower some log level.

Actions

Copy link