Project

General

Profile

Bug #592

osd: rebind cluster_messenger when wrongly marked down

Added by Colin McCabe almost 9 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
Start date:
11/18/2010
Due date:
% Done:

0%

Spent time:
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

This happened with commit:323565343071ce695f7d454ed29590688de64d5d on flab.ceph.dreamhost.com

While running test_unfound.sh, I noticed a situation where OSD peering never completed.

This happened in stray_test. Basically the sequence of events was like this:

1. brought up all OSDs with vstart.sh

2. Took out osd0 with

./ceph osd out 0

3. ran radostool with

./rados -p data put obj01 /tmp/tmp.UeAgBcigql/ver1

It blocked forever and never returned.

So what is going on on the backend?
Well, it seems that we are requesting something on PG 0.5. osd2 blocks forever with this message:

2010-11-18 15:05:33.161056 7fd48dc51710 osd2 14 _dispatch 0x2d24d80 osd_op(client4113.0:1 obj01 [writefull 0~1000000] 0.92bd) v1
2010-11-18 15:05:33.161069 7fd48dc51710 osd2 14 require_same_or_newer_map 9 (i am 14) 0x2d24d80
2010-11-18 15:05:33.161079 7fd48dc51710 osd2 14 _share_map_incoming client4113 10.3.14.10:0/1691 9
2010-11-18 15:05:33.161088 7fd48dc51710 osd2 14 client4113 has old map 9 < 14
2010-11-18 15:05:33.161094 7fd48dc51710 osd2 14 send_incremental_map 9 - > 14 to client4113 10.3.14.10:0/1691
2010-11-18 15:05:33.161131 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.14_0 0~116 = 116
2010-11-18 15:05:33.161152 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.13_0 0~116 = 116
2010-11-18 15:05:33.161170 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.12_0 0~116 = 116
2010-11-18 15:05:33.161187 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.11_0 0~116 = 116
2010-11-18 15:05:33.161214 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.10_0 0~424 = 424
2010-11-18 15:05:33.161224 7fd48dc51710 -- 10.3.14.10:6804/29550 --> client4113 10.3.14.10:0/1691 -- osd_map(10,14) v1 -- ?+0 0x2e94000
2010-11-18 15:05:33.161260 7fd48dc51710 osd2 14 hit non-existent pg 0.5, waiting

Further inspection reveals that pg 0.5 is stuck in the peering state (see pg_dump.txt).

For PG 0.5, the primary is osd1. The replica is osd2. The stray is osd0.

The primary (osd1) has requested PG info from osd2. However, for some reason, it never seems to get the PG info that it wanted. So the PG stays in its peering state and the user blocks forever.

Logs are under /home/cmccabe/log/osd_peering_doesnt_complete on flab.

test_log.txt View - console log (21.8 KB) Colin McCabe, 11/18/2010 05:03 PM

pg_dump.txt View - PG dump at epoch 14 (10.1 KB) Colin McCabe, 11/18/2010 05:03 PM

osd_dump.txt View (1.83 KB) Colin McCabe, 11/18/2010 05:04 PM

notes.txt View (20.4 KB) Colin McCabe, 11/18/2010 09:50 PM

Associated revisions

Revision 924b1fcb (diff)
Added by Sage Weil almost 9 years ago

osd: bind to new cluster address when wrongly marked down

If we come back up on the same address, there is a possible race. Other
nodes will mark_down when they see us go down. If we go up first, queue
some messages, and then they see that we're down and mark_down, the
messages we queued will get lost. Since it's stateful on the cluster
backend, we need to introduce an ordering so that closing out the old
session doesn't break the new session. We do this by binding to a new
address (just a different port, actually) before marking ourselves back
up.

Fixes #592.

Signed-off-by: Sage Weil <>

History

#1 Updated by Colin McCabe almost 9 years ago

#2 Updated by Colin McCabe almost 9 years ago

I should also add that Greg Farnum helped me examine the logs for this bug.

#3 Updated by Sage Weil almost 9 years ago

i suspect 0.5 didn't get set up on osd1 or 2 before osd0 went down? do you have the full logs for the other instances?

#4 Updated by Sage Weil almost 9 years ago

  • Category set to OSD
  • Assignee changed from Colin McCabe to Sage Weil
  • Target version set to v0.24

ok, this is a problem with how the osd is interacting with the messenger. looking at the history of 0.5, we see

grep -h 0\\.5\( osd.?| sort|less

that osd1 clearly queries osd2, but osd2 doesn't respond. comparing
grep 10.3.14.10:6804/29550 osd.1
grep 10.3.14.10:6802/29157 osd.2

we see that osd1 sends 9look at --> lines)
2010-11-18 15:04:59.743555 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGq v1 -- ?+0 0x181b8c0
2010-11-18 15:05:01.931516 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 1 ==== PGq v1 ==== 1472+0+0 (503857468 0 0) 0x181b8c0
2010-11-18 15:05:01.935455 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x17b1e00
2010-11-18 15:05:01.935593 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 2 ==== PGnot v1 ==== 1240+0+0 (733914456 0 0) 0x1392000
2010-11-18 15:05:12.130370 7fe83b066710 osd1 10 send_incremental_map 6 -> 10 to osd2 10.3.14.10:6804/29550
2010-11-18 15:05:12.130475 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- osd_map(7,10) v1 -- ?+0 0x1717400
2010-11-18 15:05:12.130506 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x13fdc40
2010-11-18 15:05:12.130596 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGq v1 -- ?+0 0x13c4a80
2010-11-18 15:05:13.191322 7fe835b50710 -- 10.3.14.10:6802/29157 >> 10.3.14.10:6804/29550 pipe(0x181e000 sd=18 pgs=7 cs=1 l=0).fault with nothing to send, going to standby
2010-11-18 15:05:16.053316 7fe836358710 -- 10.3.14.10:6802/29157 >> 10.3.14.10:6804/29550 pipe(0x181e780 sd=17 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 1 state 3
2010-11-18 15:05:16.053340 7fe836358710 -- 10.3.14.10:6802/29157 >> 10.3.14.10:6804/29550 pipe(0x181e780 sd=17 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
2010-11-18 15:05:16.056617 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 1 ==== PGnot v1 ==== 1548+0+0 (3823422094 0 0) 0x182b380
2010-11-18 15:05:16.059076 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 2 ==== PGq v1 ==== 2713+0+0 (3673062997 0 0) 0x1835e00
2010-11-18 15:05:16.069644 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x18888c0
2010-11-18 15:05:19.726386 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 3 ==== PGnot v1 ==== 1548+0+0 (1870374711 0 0) 0x18888c0
2010-11-18 15:05:21.439128 7fe83b066710 osd1 11 send_incremental_map 10 -> 11 to osd2 10.3.14.10:6804/29550
2010-11-18 15:05:21.439170 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- osd_map(11,11) v1 -- ?+0 0x13fc400
2010-11-18 15:05:21.439205 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x17a6380

(where that PGq is the query in question). but osd2 doesn't get all that (look at <== lines)
2010-11-18 15:05:01.931076 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGq v1 -- ?+0 0x2d75a80
2010-11-18 15:05:01.931549 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 1 ==== PGq v1 ==== 304+0+0 (1391314930 0 0) 0x2a12000
2010-11-18 15:05:01.934263 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGnot v1 -- ?+0 0x2dafe00
2010-11-18 15:05:01.943923 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 2 ==== PGnot v1 ==== 6168+0+0 (2334785073 0 0) 0x2dafe00
2010-11-18 15:05:13.191211 7fd48dc51710 -- 10.3.14.10:6804/29550 mark_down 10.3.14.10:6802/29157 -- 0x2a11780
2010-11-18 15:05:16.052676 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGnot v1 -- ?+0 0x2e871c0
2010-11-18 15:05:16.052891 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGq v1 -- ?+0 0x2a121c0
2010-11-18 15:05:19.726086 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGnot v1 -- ?+0 0x2e99a80
2010-11-18 15:05:24.101482 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 1 ==== PGnot v1 ==== 11404+0+0 (1765133000 0 0) 0x2a121c0
2010-11-18 15:05:24.111139 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 2 ==== osd_map(11,11) v1 ==== 148+0+0 (40936685 0 0) 0x29fe600
2010-11-18 15:05:24.111377 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 3 ==== PGnot v1 ==== 3088+0+0 (1681852847 0 0) 0x2e99a80

The PGq in question and a few other messages are lost, because of that mark_down. Whoops. The osd does that when a peer goes down (it closes out the connection).. in this case, osd1 went down in epoch 8. But in this case osd1 has already moved on to epoch 10 and sent new messages, but osd2 is behind and is losing those as a result.

The interaction is tricky. It's not immediately obvious what the osd should be doing here, need to think about it some more.

#5 Updated by Colin McCabe almost 9 years ago

Ah. Looks like you got it figured out.
I wasn't aware of what mark_down did.

Just in case anyone finds it useful, here are some notes I made about the situation.

#6 Updated by Sage Weil almost 9 years ago

  • Subject changed from OSD peering doesn't complete to osd: rebind cluster_messenger when wrongly marked down
  • Priority changed from Normal to High

I think the cleanest solution here is to re-bind the cluster_messenger to a new port when we are marked down and go back up again. Because the intra-cluster msgr is stateful, I don't think there is an easy way around this. Probably we should always instantiate a separate cluster msgr just to handle this case, so that we don't have to tear down our monclient session too.

#7 Updated by Sage Weil almost 9 years ago

  • Status changed from New to Resolved

Also available in: Atom PDF