Actions
Bug #5672
closedosd: stuck in DeletingState::try_stop_deletion
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
/a/dzafman-5154-7/71727
traced a random pg and found that osd.3 is having trouble talking to its peers:
from osd.3's log
2013-07-18 16:01:43.940661 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.28(6) epoch 96) v4 -- ?+0 0x2e1ba80 con 0x3164000 2013-07-18 16:01:43.941364 7f633b27b700 1 -- 10.214.132.27:6801/11871 >> 10.214.132.27:6805/7587 pipe(0x2e36500 sd=235 :40620 s=2 pgs=922 cs=1 l=0 c=0x3164000).setting up a delay queue on Pipe 0x2e36500 2013-07-18 16:01:43.942235 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 1 ==== osd pg remove(epoch 96; pg2.28; ) v1 ==== 25+0+0 (857214090 0 0) 0x2e1ba80 con 0x3164000 2013-07-18 16:01:44.144033 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.22(6),2.1f(6),0.1c(7),0.19(9),2.1a(5),2.17(4) epoch 96) v4 -- ?+0 0x2f78700 con 0x3164000 2013-07-18 16:01:44.145102 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 2 ==== osd pg remove(epoch 96; pg2.22; ) v1 ==== 25+0+0 (3814453790 0 0) 0x2f78700 con 0x3164000 2013-07-18 16:01:44.145906 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 3 ==== osd pg remove(epoch 96; pg2.1f; ) v1 ==== 25+0+0 (1077372323 0 0) 0x2f78540 con 0x3164000 2013-07-18 16:01:44.146568 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 4 ==== osd pg remove(epoch 96; pg0.1c; ) v1 ==== 25+0+0 (2055704519 0 0) 0x2a91e00 con 0x3164000 2013-07-18 16:01:44.726882 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 5 ==== osd pg remove(epoch 96; pg0.19; ) v1 ==== 25+0+0 (312897677 0 0) 0x3301700 con 0x3164000 2013-07-18 16:01:44.727661 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 6 ==== osd pg remove(epoch 96; pg2.1a; ) v1 ==== 25+0+0 (672570089 0 0) 0x3301540 con 0x3164000 2013-07-18 16:01:44.728645 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 7 ==== osd pg remove(epoch 96; pg2.17; ) v1 ==== 25+0+0 (34365817 0 0) 0x3301380 con 0x3164000 2013-07-18 16:01:44.735070 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.10(5),0.12(4),1.11(4),0.10(7),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 96) v4 -- ?+0 0x2c7ac40 con 0x3164000 2013-07-18 16:01:45.246472 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 8 ==== osd pg remove(epoch 96; pg2.10; ) v1 ==== 25+0+0 (4166345853 0 0) 0x2c7ac40 con 0x3164000 2013-07-18 16:01:45.512935 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 9 ==== osd pg remove(epoch 96; pg0.12; ) v1 ==== 25+0+0 (2346316606 0 0) 0x33011c0 con 0x3164000 2013-07-18 16:01:45.913696 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 10 ==== osd pg remove(epoch 96; pg1.11; ) v1 ==== 25+0+0 (2727442857 0 0) 0x2c7aa80 con 0x3164000 2013-07-18 16:01:46.005993 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 11 ==== osd pg remove(epoch 96; pg0.10; ) v1 ==== 25+0+0 (429984112 0 0) 0x33428c0 con 0x3164000 2013-07-18 16:01:46.255352 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.30(6) epoch 96) v4 -- ?+0 0x2c7a380 con 0x3164000 2013-07-18 16:01:46.255464 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 12 ==== osd pg remove(epoch 96; pg1.f; ) v1 ==== 25+0+0 (3619228900 0 0) 0x3342700 con 0x3164000 2013-07-18 16:01:46.605397 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.2e(9) epoch 96) v4 -- ?+0 0x2e93700 con 0x3164000 2013-07-18 16:01:46.614835 7f6346eab700 10 osd.3 97 send_incremental_map 96 -> 97 to 0x3164000 10.214.132.27:6805/7587 2013-07-18 16:01:46.614847 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- osd_map(97..97 src has 1..97) v3 -- ?+0 0x2dc26c0 con 0x3164000 2013-07-18 16:01:46.614887 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.30(6) epoch 97) v4 -- ?+0 0x2a91700 con 0x3164000 2013-07-18 16:01:46.615309 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 13 ==== osd pg remove(epoch 96; pg2.e; ) v1 ==== 25+0+0 (2373783856 0 0) 0x3342540 con 0x3164000 2013-07-18 16:01:46.616358 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 14 ==== osd pg remove(epoch 96; pg0.c; ) v1 ==== 25+0+0 (4277241459 0 0) 0x3342380 con 0x3164000 2013-07-18 16:01:46.616942 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 15 ==== osd pg remove(epoch 96; pg1.b; ) v1 ==== 25+0+0 (4137979529 0 0) 0x33421c0 con 0x3164000 2013-07-18 16:01:46.617588 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.2e(9),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 97) v4 -- ?+0 0x3189e00 con 0x3164000 2013-07-18 16:01:46.629076 7f6346eab700 10 osd.3 98 send_incremental_map 97 -> 98 to 0x3164000 10.214.132.27:6805/7587 2013-07-18 16:01:46.629091 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- osd_map(98..98 src has 1..98) v3 -- ?+0 0x2dd1480 con 0x3164000 2013-07-18 16:01:46.629119 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.30(6),2.2e(9) epoch 98) v4 -- ?+0 0x2c7a8c0 con 0x3164000 2013-07-18 16:01:46.630192 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.8(11) epoch 98) v4 -- ?+0 0x30efa80 con 0x3164000 2013-07-18 16:01:46.630332 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.1b(10),1.2b(9),0.9(11),2.a(7),2.7(10) epoch 98) v4 -- ?+0 0x31898c0 con 0x3164000 2013-07-18 16:01:46.631943 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.2b(9),1.1b(10) epoch 98) v4 -- ?+0 0x3189700 con 0x3164000 2013-07-18 16:01:46.632147 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 16 ==== osd pg remove(epoch 96; pg1.8; ) v1 ==== 25+0+0 (769657312 0 0) 0x3342000 con 0x3164000 2013-07-18 16:01:46.906194 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 17 ==== osd pg remove(epoch 96; pg0.9; ) v1 ==== 25+0+0 (2530244921 0 0) 0x3342e00 con 0x3164000 2013-07-18 16:01:46.906358 7f63476ac700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.28(11) epoch 98) v4 -- ?+0 0x3189540 con 0x3164000 2013-07-18 16:01:46.906489 7f6346eab700 1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.18(11) epoch 98) v4 -- ?+0 0x30ef700 con 0x3164000 2013-07-18 16:01:46.907169 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 18 ==== osd pg remove(epoch 96; pg2.a; ) v1 ==== 25+0+0 (2892010333 0 0) 0x3342c40 con 0x3164000 2013-07-18 16:01:46.907655 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 19 ==== osd pg remove(epoch 96; pg2.7; ) v1 ==== 25+0+0 (2256169165 0 0) 0x3342a80 con 0x3164000 2013-07-18 16:01:46.907974 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 20 ==== osd pg remove(epoch 97; pg2.30; ) v1 ==== 25+0+0 (3680619129 0 0) 0x2c7a380 con 0x3164000 2013-07-18 16:01:47.256836 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 21 ==== osd pg remove(epoch 98; pg2.2e; ) v1 ==== 25+0+0 (3692356499 0 0) 0x2e93700 con 0x3164000 2013-07-18 16:01:47.257234 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 22 ==== osd pg remove(epoch 98; pg1.1b; ) v1 ==== 25+0+0 (784011271 0 0) 0x31898c0 con 0x3164000 2013-07-18 16:01:47.257431 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 23 ==== osd pg remove(epoch 98; pg1.2b; ) v1 ==== 25+0+0 (2815261738 0 0) 0x30efa80 con 0x3164000 2013-07-18 16:01:47.257624 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 24 ==== osd pg remove(epoch 98; pg1.28; ) v1 ==== 25+0+0 (2089417539 0 0) 0x30ef700 con 0x3164000 2013-07-18 16:01:47.257809 7f634beb5700 1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 25 ==== osd pg remove(epoch 98; pg1.18; ) v1 ==== 25+0+0 (4127168366 0 0) 0x3189540 con 0x3164000 2013-07-18 16:02:29.804062 7f633b17a700 0 -- 10.214.132.27:6801/11871 >> 10.214.132.27:6805/7587 pipe(0x2e36500 sd=235 :40620 s=2 pgs=922 cs=1 l=0 c=0x3164000).fault with nothing to send, going to standby
vs osd.4's log
2013-07-18 16:01:43.941111 7f32edba4700 10 osd.4 96 new session 0x365a620 con=0x35bb9a0 addr=10.214.132.27:6801/11871 2013-07-18 16:01:43.941296 7f32edba4700 1 -- 10.214.132.27:6805/7587 >> 10.214.132.27:6801/11871 pipe(0x2e0e780 sd=179 :6805 s=2 pgs=600 cs=1 l=0 c=0x35bb9a0).setting up a delay queue on Pipe 0x2e0e780 2013-07-18 16:01:43.941620 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 1 ==== pg_notify(2.28(6) epoch 96) v4 ==== 842+0+0 (3330806545 0 0) 0x392ee00 con 0x35bb9a0 2013-07-18 16:01:43.941922 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.28; ) v1 -- ?+0 0x3d74000 2013-07-18 16:01:44.144493 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 2 ==== pg_notify(2.22(6),2.1f(6),0.1c(7),0.19(9),2.1a(5),2.17(4) epoch 96) v4 ==== 4983+0+0 (283124080 0 0) 0x3d74000 con 0x35bb9a0 2013-07-18 16:01:44.144763 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.22; ) v1 -- ?+0 0x3d74700 2013-07-18 16:01:44.144806 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.1f; ) v1 -- ?+0 0x337a540 2013-07-18 16:01:44.145012 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.1c; ) v1 -- ?+0 0x3d74a80 2013-07-18 16:01:44.145201 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.19; ) v1 -- ?+0 0x2fc5700 2013-07-18 16:01:44.145392 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.1a; ) v1 -- ?+0 0x2fc5540 2013-07-18 16:01:44.145578 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.17; ) v1 -- ?+0 0x2fc5e00 2013-07-18 16:01:44.735762 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 3 ==== pg_notify(2.10(5),0.12(4),1.11(4),0.10(7),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 96) v4 ==== 10679+0+0 (3560301366 0 0) 0x2f c5e00 con 0x35bb9a0 2013-07-18 16:01:44.736167 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.10; ) v1 -- ?+0 0x337aa80 2013-07-18 16:01:44.736359 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.12; ) v1 -- ?+0 0x3114e00 2013-07-18 16:01:44.736386 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.11; ) v1 -- ?+0 0x3e0d1c0 2013-07-18 16:01:44.736566 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.10; ) v1 -- ?+0 0x3e0dc40 2013-07-18 16:01:44.736766 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.f; ) v1 -- ?+0 0x3e0d380 2013-07-18 16:01:44.736975 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.e; ) v1 -- ?+0 0x3e0d540 2013-07-18 16:01:44.737158 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.c; ) v1 -- ?+0 0x3e0de00 2013-07-18 16:01:44.737354 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.b; ) v1 -- ?+0 0x3e0d700 2013-07-18 16:01:44.737537 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.8; ) v1 -- ?+0 0x3e0da80 2013-07-18 16:01:44.737696 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.9; ) v1 -- ?+0 0x3302000 2013-07-18 16:01:44.737859 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.a; ) v1 -- ?+0 0x3302a80 2013-07-18 16:01:44.738026 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.7; ) v1 -- ?+0 0x3302380 2013-07-18 16:01:46.255656 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 4 ==== pg_notify(2.30(6) epoch 96) v4 ==== 826+0+0 (3790595863 0 0) 0x3302380 con 0x35bb9a0 2013-07-18 16:01:46.255859 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 97; pg2.30; ) v1 -- ?+0 0x2f27540 2013-07-18 16:01:46.605701 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 5 ==== pg_notify(2.2e(9) epoch 96) v4 ==== 943+0+0 (2001736930 0 0) 0x2f27540 con 0x35bb9a0 2013-07-18 16:01:46.605946 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg2.2e; ) v1 -- ?+0 0x3e0d540 2013-07-18 16:01:46.615062 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 6 ==== osd_map(97..97 src has 1..97) v3 ==== 226+0+0 (31467472 0 0) 0x2e246c0 con 0x35bb9a0 2013-07-18 16:01:46.615162 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 7 ==== pg_notify(2.30(6) epoch 97) v4 ==== 826+0+0 (1612021928 0 0) 0x3e0d540 con 0x35bb9a0 2013-07-18 16:01:46.618136 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 8 ==== pg_notify(2.2e(9),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 97) v4 ==== 8494+0+0 (3141197251 0 0) 0x3302a80 con 0x35bb9a0 2013-07-18 16:01:46.629479 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 9 ==== osd_map(98..98 src has 1..98) v3 ==== 226+0+0 (2720530855 0 0) 0x392f000 con 0x35bb9a0 2013-07-18 16:01:46.629531 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 10 ==== pg_notify(2.30(6),2.2e(9) epoch 98) v4 ==== 1757+0+0 (2073546399 0 0) 0x3302000 con 0x35bb9a0 2013-07-18 16:01:46.630787 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 11 ==== pg_notify(1.8(11) epoch 98) v4 ==== 1029+0+0 (1359721621 0 0) 0x3e0da80 con 0x35bb9a0 2013-07-18 16:01:46.630939 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 12 ==== pg_notify(1.1b(10),1.2b(9),0.9(11),2.a(7),2.7(10) epoch 98) v4 ==== 4793+0+0 (328206156 0 0) 0x3e0d700 con 0x35bb9a0 2013-07-18 16:01:46.631686 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.1b; ) v1 -- ?+0 0x3e0d380 2013-07-18 16:01:46.631850 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.2b; ) v1 -- ?+0 0x3114700 2013-07-18 16:01:46.632233 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 13 ==== pg_notify(1.2b(9),1.1b(10) epoch 98) v4 ==== 1945+0+0 (3349095817 0 0) 0x3e0d380 con 0x35bb9a0 2013-07-18 16:01:46.906632 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 14 ==== pg_notify(1.28(11) epoch 98) v4 ==== 1029+0+0 (3796804803 0 0) 0x3114700 con 0x35bb9a0 2013-07-18 16:01:46.906855 7f32fb2da700 1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 15 ==== pg_notify(1.18(11) epoch 98) v4 ==== 1029+0+0 (2563999155 0 0) 0x438c700 con 0x35bb9a0 2013-07-18 16:01:46.906920 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.28; ) v1 -- ?+0 0x30dd540 2013-07-18 16:01:46.907082 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.18; ) v1 -- ?+0 0x3114a80 2013-07-18 16:01:47.518823 7f32f62d0700 10 osd.4 99 send_incremental_map 96 -> 99 to 0x35bb9a0 10.214.132.27:6801/11871 2013-07-18 16:01:47.518840 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- osd_map(97..99 src has 1..99) v3 -- ?+0 0x3202b40 con 0x35bb9a0 2013-07-18 16:01:47.518870 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.2e(1) epoch 99) v4 -- ?+0 0x30dd000 con 0x35bb9a0 2013-07-18 16:01:47.519883 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.30(1),2.33(2) epoch 99) v4 -- ?+0 0x438c700 con 0x35bb9a0 2013-07-18 16:01:47.525926 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.2b(1),1.24(3),0.24(1),2.22(1) epoch 99) v4 -- ?+0 0x2e528c0 con 0x35bb9a0 2013-07-18 16:01:47.526127 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_query(0.21,1.20,2.1f epoch 99) v2 -- ?+0 0x43c2c00 con 0x35bb9a0 2013-07-18 16:01:47.533229 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.1a(3),0.1b(2),2.19(2),1.18(1),0.19(1),1.14(2),0.15(2),2.17(1),0.12(1),2.10(1),3.12(6) epoch 99) v4 -- ?+0 0x3114700 con 0x35bb9a0 2013-07-18 16:01:47.547493 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.13(2),1.11(1),0.10(1),1.f(1),0.c(1),2.e(1),1.b(1),2.a(1) epoch 99) v4 -- ?+0 0x348f1c0 con 0x35bb9a0 2013-07-18 16:01:47.547755 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_query(3.9 epoch 99) v2 -- ?+0 0x42f9a00 con 0x35bb9a0 2013-07-18 16:01:47.549273 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(0.5(2),1.4(3),2.3(2),3.2(4) epoch 99) v4 -- ?+0 0x3e0d380 con 0x35bb9a0 2013-07-18 16:01:48.604028 7f32f62d0700 10 osd.4 100 send_incremental_map 99 -> 100 to 0x35bb9a0 10.214.132.27:6801/11871 2013-07-18 16:01:48.604043 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- osd_map(100..100 src has 1..100) v3 -- ?+0 0x326bd80 con 0x35bb9a0 2013-07-18 16:01:48.604073 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.30(1) epoch 100) v4 -- ?+0 0x3d74a80 con 0x35bb9a0 2013-07-18 16:01:48.606287 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.33(2),2.2e(1),1.2b(1) epoch 100) v4 -- ?+0 0x3e0da80 con 0x35bb9a0 2013-07-18 16:01:48.606621 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.24(3),0.24(1),2.22(1) epoch 100) v4 -- ?+0 0x3d748c0 con 0x35bb9a0 2013-07-18 16:01:48.609566 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.1a(3),0.1b(2),2.19(2),1.18(1),0.19(1),1.14(2),0.15(2),2.17(1) epoch 100) v4 -- ?+0 0x3e0d540 con 0x35bb9a0 2013-07-18 16:01:48.609605 7f32f62d0700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(0.12(1),2.10(1),3.12(6),2.13(2),1.11(1),0.10(1),1.f(1),0.c(1),2.e(1),1.b(1) epoch 100) v4 -- ?+0 0x3d74000 con 0x35bb9a0 2013-07-18 16:01:48.610623 7f32f6ad1700 1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(3.2(4) epoch 100) v4 -- ?+0 0x3e0dc40 con 0x35bb9a0
Updated by Sage Weil almost 11 years ago
- Priority changed from Urgent to Immediate
Updated by Greg Farnum almost 11 years ago
You're just noticing that osd4 dispatched messages that osd3 never received and faulted on the communication line for?
Updated by Sage Weil almost 11 years ago
ah, dispatch is stuck here:
(gdb) bt #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162 #1 0x00000000006b345a in Wait (mutex=..., this=0x3a67418) at ./common/Cond.h:55 #2 DeletingState::try_stop_deletion (this=0x3a673c0) at osd/OSD.h:225 #3 0x0000000000689647 in OSD::_try_resurrect_pg (this=0x32d2000, curmap=..., pgid=..., resurrected=0x7fa0f3fc8220, old_pg_state=0x7fa0f3fc8310) at osd/OSD.cc:1795 #4 0x00000000006a4b05 in OSD::handle_pg_peering_evt (this=0x32d2000, info=..., pi=..., epoch=99, from=5, primary=<optimized out>, evt=...) at osd/OSD.cc:2210 #5 0x00000000006a68df in OSD::handle_pg_log (this=0x32d2000, op=...) at osd/OSD.cc:6062 #6 0x00000000006a73f0 in OSD::dispatch_op (this=0x32d2000, op=...) at osd/OSD.cc:4401 #7 0x00000000006abcfe in OSD::_dispatch (this=0x32d2000, m=<optimized out>) at osd/OSD.cc:4524 #8 0x00000000006ac4a6 in OSD::ms_dispatch (this=0x32d2000, m=0x3324680) at osd/OSD.cc:4273 #9 0x000000000098a569 in ms_deliver_dispatch (m=0x3324680, this=0x32ddc00) at msg/Messenger.h:587 #10 DispatchQueue::entry (this=0x32ddce8) at msg/DispatchQueue.cc:123 #11 0x00000000008aecad in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104 #12 0x00007fa10215be9a in start_thread (arg=0x7fa0f3fc9700) at pthread_create.c:308 #13 0x00007fa1002eeccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 #14 0x0000000000000000 in ?? ()
last line for that thread is
2013-07-19 12:21:14.057957 7fa0f3fc9700 15 osd.0 99 project_pg_history end ec=1 les/c 65/65 97/97/97
on osd.0
Updated by Sage Weil almost 11 years ago
- Subject changed from osd: some msgr issue leads to hung recovery to osd: stuck in DeletingState::try_stop_deletion
Updated by Samuel Just almost 11 years ago
osd/filestore lock loop, I'll have a fix shortly
Actions