ok, this is a problem with how the osd is interacting with the messenger. looking at the history of pg 0.5, we see
grep -h 0\\.5\( osd.?| sort|less
that osd1 clearly queries osd2, but osd2 doesn't respond. comparing
grep 10.3.14.10:6804/29550 osd.1
grep 10.3.14.10:6802/29157 osd.2
we see that osd1 sends (look at --> lines)
2010-11-18 15:04:59.743555 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGq v1 -- ?+0 0x181b8c0
2010-11-18 15:05:01.931516 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 1 ==== PGq v1 ==== 1472+0+0 (503857468 0 0) 0x181b8c0
2010-11-18 15:05:01.935455 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x17b1e00
2010-11-18 15:05:01.935593 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 2 ==== PGnot v1 ==== 1240+0+0 (733914456 0 0) 0x1392000
2010-11-18 15:05:12.130370 7fe83b066710 osd1 10 send_incremental_map 6 -> 10 to osd2 10.3.14.10:6804/29550
2010-11-18 15:05:12.130475 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- osd_map(7,10) v1 -- ?+0 0x1717400
2010-11-18 15:05:12.130506 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x13fdc40
2010-11-18 15:05:12.130596 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGq v1 -- ?+0 0x13c4a80
2010-11-18 15:05:13.191322 7fe835b50710 -- 10.3.14.10:6802/29157 >> 10.3.14.10:6804/29550 pipe(0x181e000 sd=18 pgs=7 cs=1 l=0).fault with nothing to send, going to standby
2010-11-18 15:05:16.053316 7fe836358710 -- 10.3.14.10:6802/29157 >> 10.3.14.10:6804/29550 pipe(0x181e780 sd=17 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 1 state 3
2010-11-18 15:05:16.053340 7fe836358710 -- 10.3.14.10:6802/29157 >> 10.3.14.10:6804/29550 pipe(0x181e780 sd=17 pgs=0 cs=0 l=0).accept peer reset, then tried to connect to us, replacing
2010-11-18 15:05:16.056617 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 1 ==== PGnot v1 ==== 1548+0+0 (3823422094 0 0) 0x182b380
2010-11-18 15:05:16.059076 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 2 ==== PGq v1 ==== 2713+0+0 (3673062997 0 0) 0x1835e00
2010-11-18 15:05:16.069644 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x18888c0
2010-11-18 15:05:19.726386 7fe83b066710 -- 10.3.14.10:6802/29157 <== osd2 10.3.14.10:6804/29550 3 ==== PGnot v1 ==== 1548+0+0 (1870374711 0 0) 0x18888c0
2010-11-18 15:05:21.439128 7fe83b066710 osd1 11 send_incremental_map 10 -> 11 to osd2 10.3.14.10:6804/29550
2010-11-18 15:05:21.439170 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- osd_map(11,11) v1 -- ?+0 0x13fc400
2010-11-18 15:05:21.439205 7fe83b066710 -- 10.3.14.10:6802/29157 --> osd2 10.3.14.10:6804/29550 -- PGnot v1 -- ?+0 0x17a6380
(where that PGq is the query in question). but osd2 doesn't get all that (look at <== lines)
2010-11-18 15:05:01.931076 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGq v1 -- ?+0 0x2d75a80
2010-11-18 15:05:01.931549 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 1 ==== PGq v1 ==== 304+0+0 (1391314930 0 0) 0x2a12000
2010-11-18 15:05:01.934263 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGnot v1 -- ?+0 0x2dafe00
2010-11-18 15:05:01.943923 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 2 ==== PGnot v1 ==== 6168+0+0 (2334785073 0 0) 0x2dafe00
2010-11-18 15:05:13.191211 7fd48dc51710 -- 10.3.14.10:6804/29550 mark_down 10.3.14.10:6802/29157 -- 0x2a11780
2010-11-18 15:05:16.052676 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGnot v1 -- ?+0 0x2e871c0
2010-11-18 15:05:16.052891 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGq v1 -- ?+0 0x2a121c0
2010-11-18 15:05:19.726086 7fd48dc51710 -- 10.3.14.10:6804/29550 --> osd1 10.3.14.10:6802/29157 -- PGnot v1 -- ?+0 0x2e99a80
2010-11-18 15:05:24.101482 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 1 ==== PGnot v1 ==== 11404+0+0 (1765133000 0 0) 0x2a121c0
2010-11-18 15:05:24.111139 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 2 ==== osd_map(11,11) v1 ==== 148+0+0 (40936685 0 0) 0x29fe600
2010-11-18 15:05:24.111377 7fd48dc51710 -- 10.3.14.10:6804/29550 <== osd1 10.3.14.10:6802/29157 3 ==== PGnot v1 ==== 3088+0+0 (1681852847 0 0) 0x2e99a80
The PGq in question and a few other messages are lost because of that mark_down. Whoops. The osd does that when a peer goes down (it closes out the connection); in this case, osd1 went down in epoch 8. But osd1 has already moved on to epoch 10 and sent new messages, while osd2 is still behind and is losing those as a result.
The interaction is tricky. It's not immediately obvious what the osd should be doing here; need to think about it some more.
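The by-eye comparison of --> lines on one side against <== lines on the other can be roughly automated. Below is a minimal sketch, not part of any ceph tooling: the helper names (message_types, lost_messages) are made up for illustration, and the regexes assume only the messenger line layout seen in the excerpts above. It counts message types per direction, so it can say that a PGq went missing but not which one, and it is only meaningful when both logs cover the same time window.

```python
import re
from collections import Counter

# Patterns assume the messenger log layout seen in the excerpts:
#   ... --> <peer> <addr> -- <msg> -- ...          (sent)
#   ... <== <peer> <addr> <seq> ==== <msg> ==== ... (received)
SENT = re.compile(r" --> (\S+) \S+ -- (.+?) -- ")
RECV = re.compile(r" <== (\S+) \S+ \d+ ==== (.+?) ==== ")


def message_types(lines, pattern, peer):
    """Count message types on lines that match `pattern` for peer `peer`."""
    counts = Counter()
    for line in lines:
        m = pattern.search(line)
        if m and m.group(1) == peer:
            counts[m.group(2)] += 1
    return counts


def lost_messages(sender_lines, receiver_lines, sender, receiver):
    """Return message types the sender logged more often than the receiver.

    Counter subtraction drops zero and negative counts, so only types
    with unmatched sends remain in the result.
    """
    sent = message_types(sender_lines, SENT, receiver)
    received = message_types(receiver_lines, RECV, sender)
    return sent - received
```

Fed the output of the two greps above (osd.1 lines as the sender, osd.2 lines as the receiver, with "osd1" and "osd2" as the names), the result lists whatever osd1 logged as sent but osd2 never logged as received.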