Bug #592

osd: rebind cluster_messenger when wrongly marked down

Added by Colin McCabe over 13 years ago. Updated over 13 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 0%

Description

This happened with commit 323565343071ce695f7d454ed29590688de64d5d on flab.ceph.dreamhost.com.

While running test_unfound.sh, I noticed a situation where OSD peering never completed.

This happened in stray_test. The sequence of events was as follows:

1. Brought up all OSDs with vstart.sh

2. Took out osd0 with

./ceph osd out 0

3. Ran the rados tool with

./rados -p data put obj01 /tmp/tmp.UeAgBcigql/ver1

It blocked forever and never returned.

So what is going on in the backend?
It seems that we are requesting something on PG 0.5, and osd2 blocks forever with this message:

2010-11-18 15:05:33.161056 7fd48dc51710 osd2 14 _dispatch 0x2d24d80 osd_op(client4113.0:1 obj01 [writefull 0~1000000] 0.92bd) v1
2010-11-18 15:05:33.161069 7fd48dc51710 osd2 14 require_same_or_newer_map 9 (i am 14) 0x2d24d80
2010-11-18 15:05:33.161079 7fd48dc51710 osd2 14 _share_map_incoming client4113 10.3.14.10:0/1691 9
2010-11-18 15:05:33.161088 7fd48dc51710 osd2 14 client4113 has old map 9 < 14
2010-11-18 15:05:33.161094 7fd48dc51710 osd2 14 send_incremental_map 9 -> 14 to client4113 10.3.14.10:0/1691
2010-11-18 15:05:33.161131 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.14_0 0~116 = 116
2010-11-18 15:05:33.161152 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.13_0 0~116 = 116
2010-11-18 15:05:33.161170 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.12_0 0~116 = 116
2010-11-18 15:05:33.161187 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.11_0 0~116 = 116
2010-11-18 15:05:33.161214 7fd48dc51710 filestore(dev/osd2) read dev/osd2/current/meta/inc_osdmap.10_0 0~424 = 424
2010-11-18 15:05:33.161224 7fd48dc51710 -- 10.3.14.10:6804/29550 --> client4113 10.3.14.10:0/1691 -- osd_map(10,14) v1 -- ?+0 0x2e94000
2010-11-18 15:05:33.161260 7fd48dc51710 osd2 14 hit non-existent pg 0.5, waiting

Further inspection reveals that pg 0.5 is stuck in the peering state (see pg_dump.txt).

For PG 0.5, the primary is osd1. The replica is osd2. The stray is osd0.

The primary (osd1) has requested PG info from osd2. However, for some reason, it never seems to get the PG info that it wanted. So the PG stays in its peering state and the user blocks forever.

Logs are under /home/cmccabe/log/osd_peering_doesnt_complete on flab.


Files

test_log.txt (21.8 KB) - console log - Colin McCabe, 11/18/2010 05:03 PM
pg_dump.txt (10.1 KB) - PG dump at epoch 14 - Colin McCabe, 11/18/2010 05:03 PM
osd_dump.txt (1.83 KB) - Colin McCabe, 11/18/2010 05:04 PM
notes.txt (20.4 KB) - Colin McCabe, 11/18/2010 09:50 PM