Project

General

Profile

Actions

Bug #2393

closed

objecter: dropping messages (old connection being used)

Added by Yehuda Sadeh almost 12 years ago. Updated almost 12 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (dev)
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2012-05-10 02:01:02.593957 7f9cda1b0700  0 -- [2607:f298:4:2243::5795]:0/1019484 send_message dropped message ping v1 because of no pipe on con 0x7f9cc459e660
2012-05-10 02:01:02.593970 7f9cda1b0700  1 -- [2607:f298:4:2243::5795]:0/1019484 --> [2607:f298:4:2243::7066]:6807/2644 -- ping v1 -- ?+0 0x7f9cbc3d2f40 con 0x7f9cc4144db0

2012-05-10 02:01:02.668326 7f9cdc2b5700  1 -- [2607:f298:4:2243::5795]:0/1019484 <== osd.290 [2607:f298:4:2243::7066]:6807/2644 46191 ==== watch-notify(c=1 v=1 i=740219 opcode=1) v1 ==== 543+0+0 (3452665825 0 0) 0x7f9cb43ac330 con 0x7f9cc4144db0

...
2012-05-10 02:01:04.271134 7f9c9cff9700  0 -- [2607:f298:4:2243::5795]:0/1019484 send_message dropped message osd_op(client.92928.0:753509 .dir.93029.10938 [call rgw.bucket_prepare_op] 5.6182dfe9) v4 because of no pipe on con 0x7f9cc459e660

One point to note is that we do have an open connection to that osd (where we got the watch-notify msg from). However, we still try to send requests over the broken connection.


Related issues 1 (0 open1 closed)

Is duplicate of Ceph - Bug #2316: rbd: restart of OSD leeds to stale qem-VM's with "ceph version 0.45-207-g3053e47"ResolvedJosh Durgin04/19/2012

Actions
Actions #1

Updated by Yehuda Sadeh almost 12 years ago

One more point to note is that all the following ping messages show the same issue (dropped message).

Actions #2

Updated by Sage Weil almost 12 years ago

  • Priority changed from Normal to Urgent
  • Source changed from Development to Community (dev)
Actions #3

Updated by Yehuda Sadeh almost 12 years ago

I think the ping is a red herring. In tick() we go over all the regular sessions, and then over all the lingering sessions for determining the ping destinations. The only option for both getting the "dropped message" message and the ping message is by having two different connection objects to the same destination: one regular and another lingering. Since we see the ping going out (at least the message), I assume the regular is broken.

Actions #4

Updated by Yehuda Sadeh almost 12 years ago

Ah, need to look at the logs again. There's nothing in this excerpt to say that both requests were supposed to go to the same osd.

Actions #5

Updated by Yehuda Sadeh almost 12 years ago

Objecter::ms_handle_reset() was not acquiring a lock. ms_handle_reset() racing with any operation that grabs the session may lead to such issues, where the session->con is bad.

Actions #6

Updated by Josh Durgin almost 12 years ago

The caller always holds the lock for the objecter. RadosClient::ms_handle_reset grabs the lock and calls objecter->ms_handle_reset.

Actions #7

Updated by Yehuda Sadeh almost 12 years ago

Ah, too bad.

Actions #8

Updated by Greg Farnum almost 12 years ago

Just to make sure we're on the same page: there is nothing in that snippet indicating that there is an active Connection to osd.290.
Right?

Actions #9

Updated by Sage Weil almost 12 years ago

  • Status changed from New to Duplicate
Actions

Also available in: Atom PDF