Bug #2393
closedobjecter: dropping messages (old connection being used)
0%
Description
2012-05-10 02:01:02.593957 7f9cda1b0700 0 -- [2607:f298:4:2243::5795]:0/1019484 send_message dropped message ping v1 because of no pipe on con 0x7f9cc459e660 2012-05-10 02:01:02.593970 7f9cda1b0700 1 -- [2607:f298:4:2243::5795]:0/1019484 --> [2607:f298:4:2243::7066]:6807/2644 -- ping v1 -- ?+0 0x7f9cbc3d2f40 con 0x7f9cc4144db0 2012-05-10 02:01:02.668326 7f9cdc2b5700 1 -- [2607:f298:4:2243::5795]:0/1019484 <== osd.290 [2607:f298:4:2243::7066]:6807/2644 46191 ==== watch-notify(c=1 v=1 i=740219 opcode=1) v1 ==== 543+0+0 (3452665825 0 0) 0x7f9cb43ac330 con 0x7f9cc4144db0 ... 2012-05-10 02:01:04.271134 7f9c9cff9700 0 -- [2607:f298:4:2243::5795]:0/1019484 send_message dropped message osd_op(client.92928.0:753509 .dir.93029.10938 [call rgw.bucket_prepare_op] 5.6182dfe9) v4 because of no pipe on con 0x7f9cc459e660
One point to note is that we do have an open connection to that osd (where we got the watch-notify msg from). However, we still try to send requests over the broken connection.
Updated by Yehuda Sadeh almost 12 years ago
One more point to note is that all the following ping messages show the same issue (dropped message).
Updated by Sage Weil almost 12 years ago
- Priority changed from Normal to Urgent
- Source changed from Development to Community (dev)
Updated by Yehuda Sadeh almost 12 years ago
I think the ping is a red herring. In tick() we go over all the regular sessions, and then over all the lingering sessions for determining the ping destinations. The only option for both getting the "dropped message" message and the ping message is by having two different connection objects to the same destination: one regular and another lingering. Since we see the ping going out (at least the message), I assume the regular is broken.
Updated by Yehuda Sadeh almost 12 years ago
Ah, need to look at the logs again. There's nothing in this excerpt to say that both requests were supposed to go to the same osd.
Updated by Yehuda Sadeh almost 12 years ago
Objecter::ms_handle_reset() was not acquiring a lock. ms_handle_reset() racing with any operation that grabs the session may lead to such issues, where the session->con is bad.
Updated by Josh Durgin almost 12 years ago
The caller always holds the lock for the objecter. RadosClient::ms_handle_reset grabs the lock and calls objecter->ms_handle_reset.
Updated by Greg Farnum almost 12 years ago
Just to make sure we're on the same page: there is nothing in that snippet indicating that there is an active Connection to osd.290.
Right?