Project

General

Profile

Actions

Bug #599

closed

recover_master_log, doesn't

Added by Colin McCabe over 13 years ago. Updated over 13 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is another peering bug. We found it on wido's cluster. Basically, peering never completes.

I just examined PG 0.0, one of the eternally peering PGs.

root@logger:~# ceph pg dump o - | head
2010-11-22 21:16:39.315764 mon <
[pg,dump]
2010-11-22 21:16:39.342433 mon1 -> 'ok' (0)
version 90443
last_osdmap_epoch 6673
last_pg_scan 1693
full_ratio 0.95
nearfull_ratio 0.8
pg_stat objects mip unf degr kb bytes log disklog state v reported up acting last_scrub
0.0 420 0 0 0 444416 454918662 297 297 crashed+peering 1486'422 6549'4376 [0,9,3] [0,9,3] 1486'422 2010-11-15 20:37:17.925299
0.1 432 0 0 0 349946 358160785 198 198 crashed+peering 1486'432 6608'1391 [5,11,9] [5,11,9] 1486'432 2010-11-15 20:42:38.744814

On OSD0, the last function to be called from do_peer is recover_master_log.

2010-11-22 20:22:28.107043 7f965f12d710 osd0 6633 pg[0.0( v 1486'422 (1486'419,1486'422] n=420 ec=2 les=6491 6626/6626/6549) [0,9,3] r=0 lcod 0'0 mlcod 0'0 !hml crashed+peering] peer up [0,9,3],   acting [0,9,3]
2010-11-22 20:22:28.107056 7f965f12d710 osd0 6633 pg[0.0( v 1486'422 (1486'419,1486'422] n=420 ec=2 les=6491 6626/6626/6549) [0,9,3] r=0 lcod 0'0 mlcod 0'0 !hml crashed+peering] peer prior_set is 3,9
2010-11-22 20:22:28.107068 7f965f12d710 osd0 6633 pg[0.0( v 1486'422 (1486'419,1486'422] n=420 ec=2 les=6491 6626/6626/6549) [0,9,3] r=0 lcod 0'0 mlcod 0'0 !hml crashed+peering] recover_master_log
2010-11-22 20:22:28.107080 7f965f12d710 osd0 6633 pg[0.0( v 1486'422 (1486'419,1486'422] n=420 ec=2 les=6491 6626/6626/6549) [0,9,3] r=0 lcod 0'0 mlcod 0'0 !hml crashed+peering] waiting for osd3
2010-11-22 20:22:28.107092 7f965f12d710 osd0 6633 pg[0.0( v 1486'422 (1486'419,1486'422] n=420 ec=2 les=6491 6626/6626/6549) [0,9,3] r=0 lcod 0'0 mlcod 0'0 !hml crashed+peering] waiting for osd9
2010-11-22 20:22:28.107105 7f965f12d710 osd0 6633 pg[0.0( v 1486'422 (1486'419,1486'422] n=420 ec=2 les=6491 6626/6626/6549) [0,9,3] r=0 lcod 0'0 mlcod 0'0 !hml crashed+peering] update_stats 6549'4342

After that, we get pings and osd_maps from osd3 and osd9, but never the PGnot and other messages that we would need to make progress on peering.

It seems like osd3 and osd9 never send us anything except pings and osd_maps.

$ grep '<==' /var/log/ceph/osd.0.log | grep osd3 | grep -v osd_map | grep -v osd_ping 
$
$ grep '<==' /var/log/ceph/osd.0.log | grep osd9 | grep -v osd_map | grep -v osd_ping
$

By comparison, osd1 sent us a lot of stuff.

root@node01:~# grep '<==' /var/log/ceph/osd.0.log | grep osd1 | grep v osd_map | grep -v osd_ping | head -n 10
2010-11-22 20:19:21.961809 7f965f12d710 -
[2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 1 ==== PGnot v1 ==== 115508+0+0 (569735235 0 0) 0x979d700
2010-11-22 20:19:21.997131 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 2 ==== PGq v1 ==== 4757+0+0 (2153010741 0 0) 0xadd21c0
2010-11-22 20:19:36.383863 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 3 ==== PGq v1 ==== 888+0+0 (3717460735 0 0) 0x9ac2380
2010-11-22 20:19:36.385796 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 4 ==== PGnot v1 ==== 22184+0+0 (843986286 0 0) 0xb322540
2010-11-22 20:19:36.509454 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 5 ==== PGnot v1 ==== 101340+0+0 (35662047 0 0) 0xb322380
2010-11-22 20:19:36.821688 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 6 ==== PGq v1 ==== 450+0+0 (3299570323 0 0) 0xb3221c0
2010-11-22 20:19:49.564773 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 7 ==== PGq v1 ==== 888+0+0 (2410340724 0 0) 0x96358c0
2010-11-22 20:19:49.746765 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 8 ==== PGnot v1 ==== 101340+0+0 (3564056599 0 0) 0xa91be00
2010-11-22 20:19:54.402375 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 9 ==== PGnot v1 ==== 124440+0+0 (2647411595 0 0) 0xc070a80
2010-11-22 20:19:54.436352 7f965f12d710 -- [2001:16f8:10:2::c3c3:8f6b]:6803/31146 <== osd1 [2001:16f8:10:2::c3c3:a24f]:6803/25291 10 ==== PGq v1 ==== 1764+0+0 (449635568 0 0) 0xd6a4e00
Actions

Also available in: Atom PDF