Project

General

Profile

Actions

Bug #5672

closed

osd: stuck in DeletingState::try_stop_deletion

Added by Sage Weil almost 11 years ago. Updated almost 11 years ago.

Status:
Resolved
Priority:
Immediate
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

/a/dzafman-5154-7/71727

traced a random pg and found that osd.3 is having trouble talking to its peers:

from osd.3's log

2013-07-18 16:01:43.940661 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.28(6) epoch 96) v4 -- ?+0 0x2e1ba80 con 0x3164000
2013-07-18 16:01:43.941364 7f633b27b700  1 -- 10.214.132.27:6801/11871 >> 10.214.132.27:6805/7587 pipe(0x2e36500 sd=235 :40620 s=2 pgs=922 cs=1 l=0 c=0x3164000).setting up a delay queue on Pipe 0x2e36500
2013-07-18 16:01:43.942235 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 1 ==== osd pg remove(epoch 96; pg2.28; ) v1 ==== 25+0+0 (857214090 0 0) 0x2e1ba80 con 0x3164000
2013-07-18 16:01:44.144033 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.22(6),2.1f(6),0.1c(7),0.19(9),2.1a(5),2.17(4) epoch 96) v4 -- ?+0 0x2f78700 con 0x3164000
2013-07-18 16:01:44.145102 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 2 ==== osd pg remove(epoch 96; pg2.22; ) v1 ==== 25+0+0 (3814453790 0 0) 0x2f78700 con 0x3164000
2013-07-18 16:01:44.145906 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 3 ==== osd pg remove(epoch 96; pg2.1f; ) v1 ==== 25+0+0 (1077372323 0 0) 0x2f78540 con 0x3164000
2013-07-18 16:01:44.146568 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 4 ==== osd pg remove(epoch 96; pg0.1c; ) v1 ==== 25+0+0 (2055704519 0 0) 0x2a91e00 con 0x3164000
2013-07-18 16:01:44.726882 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 5 ==== osd pg remove(epoch 96; pg0.19; ) v1 ==== 25+0+0 (312897677 0 0) 0x3301700 con 0x3164000
2013-07-18 16:01:44.727661 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 6 ==== osd pg remove(epoch 96; pg2.1a; ) v1 ==== 25+0+0 (672570089 0 0) 0x3301540 con 0x3164000
2013-07-18 16:01:44.728645 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 7 ==== osd pg remove(epoch 96; pg2.17; ) v1 ==== 25+0+0 (34365817 0 0) 0x3301380 con 0x3164000
2013-07-18 16:01:44.735070 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.10(5),0.12(4),1.11(4),0.10(7),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 96) v4 -- ?+0 0x2c7ac40 con 0x3164000
2013-07-18 16:01:45.246472 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 8 ==== osd pg remove(epoch 96; pg2.10; ) v1 ==== 25+0+0 (4166345853 0 0) 0x2c7ac40 con 0x3164000
2013-07-18 16:01:45.512935 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 9 ==== osd pg remove(epoch 96; pg0.12; ) v1 ==== 25+0+0 (2346316606 0 0) 0x33011c0 con 0x3164000
2013-07-18 16:01:45.913696 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 10 ==== osd pg remove(epoch 96; pg1.11; ) v1 ==== 25+0+0 (2727442857 0 0) 0x2c7aa80 con 0x3164000
2013-07-18 16:01:46.005993 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 11 ==== osd pg remove(epoch 96; pg0.10; ) v1 ==== 25+0+0 (429984112 0 0) 0x33428c0 con 0x3164000
2013-07-18 16:01:46.255352 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.30(6) epoch 96) v4 -- ?+0 0x2c7a380 con 0x3164000
2013-07-18 16:01:46.255464 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 12 ==== osd pg remove(epoch 96; pg1.f; ) v1 ==== 25+0+0 (3619228900 0 0) 0x3342700 con 0x3164000
2013-07-18 16:01:46.605397 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.2e(9) epoch 96) v4 -- ?+0 0x2e93700 con 0x3164000
2013-07-18 16:01:46.614835 7f6346eab700 10 osd.3 97 send_incremental_map 96 -> 97 to 0x3164000 10.214.132.27:6805/7587
2013-07-18 16:01:46.614847 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- osd_map(97..97 src has 1..97) v3 -- ?+0 0x2dc26c0 con 0x3164000
2013-07-18 16:01:46.614887 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.30(6) epoch 97) v4 -- ?+0 0x2a91700 con 0x3164000
2013-07-18 16:01:46.615309 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 13 ==== osd pg remove(epoch 96; pg2.e; ) v1 ==== 25+0+0 (2373783856 0 0) 0x3342540 con 0x3164000
2013-07-18 16:01:46.616358 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 14 ==== osd pg remove(epoch 96; pg0.c; ) v1 ==== 25+0+0 (4277241459 0 0) 0x3342380 con 0x3164000
2013-07-18 16:01:46.616942 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 15 ==== osd pg remove(epoch 96; pg1.b; ) v1 ==== 25+0+0 (4137979529 0 0) 0x33421c0 con 0x3164000
2013-07-18 16:01:46.617588 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.2e(9),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 97) v4 -- ?+0 0x3189e00 con 0x3164000
2013-07-18 16:01:46.629076 7f6346eab700 10 osd.3 98 send_incremental_map 97 -> 98 to 0x3164000 10.214.132.27:6805/7587
2013-07-18 16:01:46.629091 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- osd_map(98..98 src has 1..98) v3 -- ?+0 0x2dd1480 con 0x3164000
2013-07-18 16:01:46.629119 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(2.30(6),2.2e(9) epoch 98) v4 -- ?+0 0x2c7a8c0 con 0x3164000
2013-07-18 16:01:46.630192 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.8(11) epoch 98) v4 -- ?+0 0x30efa80 con 0x3164000
2013-07-18 16:01:46.630332 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.1b(10),1.2b(9),0.9(11),2.a(7),2.7(10) epoch 98) v4 -- ?+0 0x31898c0 con 0x3164000
2013-07-18 16:01:46.631943 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.2b(9),1.1b(10) epoch 98) v4 -- ?+0 0x3189700 con 0x3164000
2013-07-18 16:01:46.632147 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 16 ==== osd pg remove(epoch 96; pg1.8; ) v1 ==== 25+0+0 (769657312 0 0) 0x3342000 con 0x3164000
2013-07-18 16:01:46.906194 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 17 ==== osd pg remove(epoch 96; pg0.9; ) v1 ==== 25+0+0 (2530244921 0 0) 0x3342e00 con 0x3164000
2013-07-18 16:01:46.906358 7f63476ac700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.28(11) epoch 98) v4 -- ?+0 0x3189540 con 0x3164000
2013-07-18 16:01:46.906489 7f6346eab700  1 -- 10.214.132.27:6801/11871 --> 10.214.132.27:6805/7587 -- pg_notify(1.18(11) epoch 98) v4 -- ?+0 0x30ef700 con 0x3164000
2013-07-18 16:01:46.907169 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 18 ==== osd pg remove(epoch 96; pg2.a; ) v1 ==== 25+0+0 (2892010333 0 0) 0x3342c40 con 0x3164000
2013-07-18 16:01:46.907655 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 19 ==== osd pg remove(epoch 96; pg2.7; ) v1 ==== 25+0+0 (2256169165 0 0) 0x3342a80 con 0x3164000
2013-07-18 16:01:46.907974 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 20 ==== osd pg remove(epoch 97; pg2.30; ) v1 ==== 25+0+0 (3680619129 0 0) 0x2c7a380 con 0x3164000
2013-07-18 16:01:47.256836 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 21 ==== osd pg remove(epoch 98; pg2.2e; ) v1 ==== 25+0+0 (3692356499 0 0) 0x2e93700 con 0x3164000
2013-07-18 16:01:47.257234 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 22 ==== osd pg remove(epoch 98; pg1.1b; ) v1 ==== 25+0+0 (784011271 0 0) 0x31898c0 con 0x3164000
2013-07-18 16:01:47.257431 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 23 ==== osd pg remove(epoch 98; pg1.2b; ) v1 ==== 25+0+0 (2815261738 0 0) 0x30efa80 con 0x3164000
2013-07-18 16:01:47.257624 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 24 ==== osd pg remove(epoch 98; pg1.28; ) v1 ==== 25+0+0 (2089417539 0 0) 0x30ef700 con 0x3164000
2013-07-18 16:01:47.257809 7f634beb5700  1 -- 10.214.132.27:6801/11871 <== osd.4 10.214.132.27:6805/7587 25 ==== osd pg remove(epoch 98; pg1.18; ) v1 ==== 25+0+0 (4127168366 0 0) 0x3189540 con 0x3164000
2013-07-18 16:02:29.804062 7f633b17a700  0 -- 10.214.132.27:6801/11871 >> 10.214.132.27:6805/7587 pipe(0x2e36500 sd=235 :40620 s=2 pgs=922 cs=1 l=0 c=0x3164000).fault with nothing to send, going to standby

vs osd.4's log
2013-07-18 16:01:43.941111 7f32edba4700 10 osd.4 96  new session 0x365a620 con=0x35bb9a0 addr=10.214.132.27:6801/11871
2013-07-18 16:01:43.941296 7f32edba4700  1 -- 10.214.132.27:6805/7587 >> 10.214.132.27:6801/11871 pipe(0x2e0e780 sd=179 :6805 s=2 pgs=600 cs=1 l=0 c=0x35bb9a0).setting up a delay queue on Pipe 0x2e0e780
2013-07-18 16:01:43.941620 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 1 ==== pg_notify(2.28(6) epoch 96) v4 ==== 842+0+0 (3330806545 0 0) 0x392ee00 con 0x35bb9a0
2013-07-18 16:01:43.941922 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.28; ) v1 -- ?+0 0x3d74000
2013-07-18 16:01:44.144493 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 2 ==== pg_notify(2.22(6),2.1f(6),0.1c(7),0.19(9),2.1a(5),2.17(4) epoch 96) v4 ==== 4983+0+0 (283124080 0 0) 0x3d74000 con 0x35bb9a0
2013-07-18 16:01:44.144763 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.22; ) v1 -- ?+0 0x3d74700
2013-07-18 16:01:44.144806 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.1f; ) v1 -- ?+0 0x337a540
2013-07-18 16:01:44.145012 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.1c; ) v1 -- ?+0 0x3d74a80
2013-07-18 16:01:44.145201 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.19; ) v1 -- ?+0 0x2fc5700
2013-07-18 16:01:44.145392 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.1a; ) v1 -- ?+0 0x2fc5540
2013-07-18 16:01:44.145578 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.17; ) v1 -- ?+0 0x2fc5e00
2013-07-18 16:01:44.735762 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 3 ==== pg_notify(2.10(5),0.12(4),1.11(4),0.10(7),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 96) v4 ==== 10679+0+0 (3560301366 0 0) 0x2f
c5e00 con 0x35bb9a0
2013-07-18 16:01:44.736167 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.10; ) v1 -- ?+0 0x337aa80
2013-07-18 16:01:44.736359 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.12; ) v1 -- ?+0 0x3114e00
2013-07-18 16:01:44.736386 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.11; ) v1 -- ?+0 0x3e0d1c0
2013-07-18 16:01:44.736566 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.10; ) v1 -- ?+0 0x3e0dc40
2013-07-18 16:01:44.736766 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.f; ) v1 -- ?+0 0x3e0d380
2013-07-18 16:01:44.736975 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.e; ) v1 -- ?+0 0x3e0d540
2013-07-18 16:01:44.737158 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.c; ) v1 -- ?+0 0x3e0de00
2013-07-18 16:01:44.737354 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.b; ) v1 -- ?+0 0x3e0d700
2013-07-18 16:01:44.737537 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg1.8; ) v1 -- ?+0 0x3e0da80
2013-07-18 16:01:44.737696 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg0.9; ) v1 -- ?+0 0x3302000
2013-07-18 16:01:44.737859 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.a; ) v1 -- ?+0 0x3302a80
2013-07-18 16:01:44.738026 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 96; pg2.7; ) v1 -- ?+0 0x3302380
2013-07-18 16:01:46.255656 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 4 ==== pg_notify(2.30(6) epoch 96) v4 ==== 826+0+0 (3790595863 0 0) 0x3302380 con 0x35bb9a0
2013-07-18 16:01:46.255859 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 97; pg2.30; ) v1 -- ?+0 0x2f27540
2013-07-18 16:01:46.605701 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 5 ==== pg_notify(2.2e(9) epoch 96) v4 ==== 943+0+0 (2001736930 0 0) 0x2f27540 con 0x35bb9a0
2013-07-18 16:01:46.605946 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg2.2e; ) v1 -- ?+0 0x3e0d540
2013-07-18 16:01:46.615062 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 6 ==== osd_map(97..97 src has 1..97) v3 ==== 226+0+0 (31467472 0 0) 0x2e246c0 con 0x35bb9a0
2013-07-18 16:01:46.615162 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 7 ==== pg_notify(2.30(6) epoch 97) v4 ==== 826+0+0 (1612021928 0 0) 0x3e0d540 con 0x35bb9a0
2013-07-18 16:01:46.618136 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 8 ==== pg_notify(2.2e(9),1.f(10),2.e(8),0.c(7),1.b(9),1.8(11),0.9(11),2.a(7),2.7(10) epoch 97) v4 ==== 8494+0+0 (3141197251 0 0) 0x3302a80 con 0x35bb9a0
2013-07-18 16:01:46.629479 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 9 ==== osd_map(98..98 src has 1..98) v3 ==== 226+0+0 (2720530855 0 0) 0x392f000 con 0x35bb9a0
2013-07-18 16:01:46.629531 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 10 ==== pg_notify(2.30(6),2.2e(9) epoch 98) v4 ==== 1757+0+0 (2073546399 0 0) 0x3302000 con 0x35bb9a0
2013-07-18 16:01:46.630787 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 11 ==== pg_notify(1.8(11) epoch 98) v4 ==== 1029+0+0 (1359721621 0 0) 0x3e0da80 con 0x35bb9a0
2013-07-18 16:01:46.630939 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 12 ==== pg_notify(1.1b(10),1.2b(9),0.9(11),2.a(7),2.7(10) epoch 98) v4 ==== 4793+0+0 (328206156 0 0) 0x3e0d700 con 0x35bb9a0
2013-07-18 16:01:46.631686 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.1b; ) v1 -- ?+0 0x3e0d380
2013-07-18 16:01:46.631850 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.2b; ) v1 -- ?+0 0x3114700
2013-07-18 16:01:46.632233 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 13 ==== pg_notify(1.2b(9),1.1b(10) epoch 98) v4 ==== 1945+0+0 (3349095817 0 0) 0x3e0d380 con 0x35bb9a0
2013-07-18 16:01:46.906632 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 14 ==== pg_notify(1.28(11) epoch 98) v4 ==== 1029+0+0 (3796804803 0 0) 0x3114700 con 0x35bb9a0
2013-07-18 16:01:46.906855 7f32fb2da700  1 -- 10.214.132.27:6805/7587 <== osd.3 10.214.132.27:6801/11871 15 ==== pg_notify(1.18(11) epoch 98) v4 ==== 1029+0+0 (2563999155 0 0) 0x438c700 con 0x35bb9a0
2013-07-18 16:01:46.906920 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.28; ) v1 -- ?+0 0x30dd540
2013-07-18 16:01:46.907082 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> osd.3 10.214.132.27:6801/11871 -- osd pg remove(epoch 98; pg1.18; ) v1 -- ?+0 0x3114a80
2013-07-18 16:01:47.518823 7f32f62d0700 10 osd.4 99 send_incremental_map 96 -> 99 to 0x35bb9a0 10.214.132.27:6801/11871
2013-07-18 16:01:47.518840 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- osd_map(97..99 src has 1..99) v3 -- ?+0 0x3202b40 con 0x35bb9a0
2013-07-18 16:01:47.518870 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.2e(1) epoch 99) v4 -- ?+0 0x30dd000 con 0x35bb9a0
2013-07-18 16:01:47.519883 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.30(1),2.33(2) epoch 99) v4 -- ?+0 0x438c700 con 0x35bb9a0
2013-07-18 16:01:47.525926 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.2b(1),1.24(3),0.24(1),2.22(1) epoch 99) v4 -- ?+0 0x2e528c0 con 0x35bb9a0
2013-07-18 16:01:47.526127 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_query(0.21,1.20,2.1f epoch 99) v2 -- ?+0 0x43c2c00 con 0x35bb9a0
2013-07-18 16:01:47.533229 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.1a(3),0.1b(2),2.19(2),1.18(1),0.19(1),1.14(2),0.15(2),2.17(1),0.12(1),2.10(1),3.12(6) epoch 99) v4 -- ?+0 0x3114700 con 0x35bb9a0
2013-07-18 16:01:47.547493 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.13(2),1.11(1),0.10(1),1.f(1),0.c(1),2.e(1),1.b(1),2.a(1) epoch 99) v4 -- ?+0 0x348f1c0 con 0x35bb9a0
2013-07-18 16:01:47.547755 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_query(3.9 epoch 99) v2 -- ?+0 0x42f9a00 con 0x35bb9a0
2013-07-18 16:01:47.549273 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(0.5(2),1.4(3),2.3(2),3.2(4) epoch 99) v4 -- ?+0 0x3e0d380 con 0x35bb9a0
2013-07-18 16:01:48.604028 7f32f62d0700 10 osd.4 100 send_incremental_map 99 -> 100 to 0x35bb9a0 10.214.132.27:6801/11871
2013-07-18 16:01:48.604043 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- osd_map(100..100 src has 1..100) v3 -- ?+0 0x326bd80 con 0x35bb9a0
2013-07-18 16:01:48.604073 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.30(1) epoch 100) v4 -- ?+0 0x3d74a80 con 0x35bb9a0
2013-07-18 16:01:48.606287 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(2.33(2),2.2e(1),1.2b(1) epoch 100) v4 -- ?+0 0x3e0da80 con 0x35bb9a0
2013-07-18 16:01:48.606621 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.24(3),0.24(1),2.22(1) epoch 100) v4 -- ?+0 0x3d748c0 con 0x35bb9a0
2013-07-18 16:01:48.609566 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(1.1a(3),0.1b(2),2.19(2),1.18(1),0.19(1),1.14(2),0.15(2),2.17(1) epoch 100) v4 -- ?+0 0x3e0d540 con 0x35bb9a0
2013-07-18 16:01:48.609605 7f32f62d0700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(0.12(1),2.10(1),3.12(6),2.13(2),1.11(1),0.10(1),1.f(1),0.c(1),2.e(1),1.b(1) epoch 100) v4 -- ?+0 0x3d74000 con 0x35bb9a0
2013-07-18 16:01:48.610623 7f32f6ad1700  1 -- 10.214.132.27:6805/7587 --> 10.214.132.27:6801/11871 -- pg_notify(3.2(4) epoch 100) v4 -- ?+0 0x3e0dc40 con 0x35bb9a0

Actions #1

Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to Immediate
Actions #2

Updated by Greg Farnum almost 11 years ago

You're just noticing that osd4 dispatched messages that osd3 never received and faulted on the communication line for?

Actions #3

Updated by Sage Weil almost 11 years ago

ah, dispatch is stuck here:

(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00000000006b345a in Wait (mutex=..., this=0x3a67418) at ./common/Cond.h:55
#2  DeletingState::try_stop_deletion (this=0x3a673c0) at osd/OSD.h:225
#3  0x0000000000689647 in OSD::_try_resurrect_pg (this=0x32d2000, curmap=..., pgid=..., resurrected=0x7fa0f3fc8220, old_pg_state=0x7fa0f3fc8310) at osd/OSD.cc:1795
#4  0x00000000006a4b05 in OSD::handle_pg_peering_evt (this=0x32d2000, info=..., pi=..., epoch=99, from=5, primary=<optimized out>, evt=...) at osd/OSD.cc:2210
#5  0x00000000006a68df in OSD::handle_pg_log (this=0x32d2000, op=...) at osd/OSD.cc:6062
#6  0x00000000006a73f0 in OSD::dispatch_op (this=0x32d2000, op=...) at osd/OSD.cc:4401
#7  0x00000000006abcfe in OSD::_dispatch (this=0x32d2000, m=<optimized out>) at osd/OSD.cc:4524
#8  0x00000000006ac4a6 in OSD::ms_dispatch (this=0x32d2000, m=0x3324680) at osd/OSD.cc:4273
#9  0x000000000098a569 in ms_deliver_dispatch (m=0x3324680, this=0x32ddc00) at msg/Messenger.h:587
#10 DispatchQueue::entry (this=0x32ddce8) at msg/DispatchQueue.cc:123
#11 0x00000000008aecad in DispatchQueue::DispatchThread::entry (this=<optimized out>) at msg/DispatchQueue.h:104
#12 0x00007fa10215be9a in start_thread (arg=0x7fa0f3fc9700) at pthread_create.c:308
#13 0x00007fa1002eeccd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#14 0x0000000000000000 in ?? ()

last line for that thread is
2013-07-19 12:21:14.057957 7fa0f3fc9700 15 osd.0 99 project_pg_history end ec=1 les/c 65/65 97/97/97


on osd.0

Actions #4

Updated by Sage Weil almost 11 years ago

  • Subject changed from osd: some msgr issue leads to hung recovery to osd: stuck in DeletingState::try_stop_deletion
Actions #5

Updated by Samuel Just almost 11 years ago

osd/filestore lock loop, I'll have a fix shortly

Actions #6

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to 7
Actions #7

Updated by Ian Colle almost 11 years ago

  • Status changed from 7 to Resolved

Landed

Actions

Also available in: Atom PDF