Bug #14150

closed

OSD segmentation fault on network failure

Added by Xiaoxi Chen over 8 years ago. Updated about 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce:

1. Set up a Ceph cluster with two or more hosts.
2. Unplug the NIC of one host so that network_error = true (https://github.com/ceph/ceph/blob/master/src/osd/OSD.cc#L6649).

See the debug log below. This bug was introduced by commit c0c5a6e7d09439b8bc23ad7ab83889ae5a921def.

2015-12-21 15:26:39.086196 7f713c08d700 20 osd.2 60 kicking pg 0.6
2015-12-21 15:26:39.086206 7f713c08d700 30 osd.2 pg_epoch: 60 pg[0.6( v 59'262 (0'0,59'262] local-les=59 n=209 ec=1 les/c/f 59/59/0 60/60/57) [0,4] r=-1 lpr=60 pi=4-59/15 crt=59'260 lcod 59'261 inactive NOTIFY] lock
2015-12-21 15:26:39.086537 7f713c08d700 20 osd.2 60 kicking pg 0.0
2015-12-21 15:26:39.086551 7f713c08d700 30 osd.2 pg_epoch: 60 pg[0.0( v 59'295 (0'0,59'295] local-les=59 n=209 ec=1 les/c/f 59/59/0 60/60/58) [4,0] r=-1 lpr=60 pi=49-59/6 crt=59'291 lcod 59'294 inactive NOTIFY] lock
2015-12-21 15:26:39.086877 7f713c08d700 20 osd.2 60 kicking pg 0.2f
2015-12-21 15:26:39.086887 7f713c08d700 30 osd.2 pg_epoch: 60 pg[0.2f( v 59'289 (0'0,59'289] local-les=59 n=211 ec=1 les/c/f 59/59/0 60/60/60) [5,0] r=-1 lpr=60 pi=58-59/1 crt=59'284 lcod 59'288 inactive NOTIFY] lock
2015-12-21 15:26:39.087196 7f713c08d700 20 osd.2 60 kicking pg 0.1a
2015-12-21 15:26:39.087206 7f713c08d700 30 osd.2 pg_epoch: 60 pg[0.1a( v 59'327 (0'0,59'327] local-les=59 n=239 ec=1 les/c/f 59/59/0 60/60/60) [1,4] r=-1 lpr=60 pi=58-59/1 crt=59'318 lcod 59'326 inactive NOTIFY] lock
2015-12-21 15:26:39.087542 7f713c08d700 20 osd.2 60 kicking pg 0.e
2015-12-21 15:26:39.087553 7f713c08d700 30 osd.2 pg_epoch: 60 pg[0.e( v 59'274 (0'0,59'274] local-les=59 n=207 ec=1 les/c/f 59/59/0 60/60/57) [1,4] r=-1 lpr=60 pi=8-59/23 crt=59'271 lcod 59'273 inactive NOTIFY] lock
2015-12-21 15:26:39.087876 7f713c08d700 20 osd.2 60 kicking pg 0.3d
2015-12-21 15:26:39.087885 7f713c08d700 30 osd.2 pg_epoch: 60 pg[0.3d( v 59'271 (0'0,59'271] local-les=59 n=200 ec=1 les/c/f 59/59/0 60/60/58) [5,1] r=-1 lpr=60 pi=46-59/6 crt=59'266 lcod 59'270 inactive NOTIFY] lock
2015-12-21 15:26:39.088574 7f713c08d700 1 -- 192.168.8.24:6800/127319 mark_down 0x7f71532e9680 -- 0x7f71535b6000
2015-12-21 15:26:39.088602 7f713c08d700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=2 pgs=4 cs=1 l=1 c=0x7f71532e9680).unregister_pipe
2015-12-21 15:26:39.088612 7f713c08d700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=2 pgs=4 cs=1 l=1 c=0x7f71532e9680).stop
2015-12-21 15:26:39.088680 7f714e84d700 2 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=4 pgs=4 cs=1 l=1 c=0x7f71532e9680).reader couldn't read tag, (0) Success
2015-12-21 15:26:39.088729 7f714e84d700 2 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=4 pgs=4 cs=1 l=1 c=0x7f71532e9680).fault (0) Success
2015-12-21 15:26:39.088747 7f714e84d700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=4 pgs=4 cs=1 l=1 c=0x7f71532e9680).fault already closed|closing
2015-12-21 15:26:39.088766 7f714e84d700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=4 pgs=4 cs=1 l=1 c=0x7f71532e9680).reader done
2015-12-21 15:26:39.088771 7f714f480700 20 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=4 pgs=4 cs=1 l=1 c=0x7f71532e9680).writer finishing
2015-12-21 15:26:39.088834 7f714f480700 10 -- 192.168.8.24:6800/127319 queue_reap 0x7f71535b6000
2015-12-21 15:26:39.088849 7f714f480700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.24:6789/0 pipe(0x7f71535b6000 sd=24 :58959 s=4 pgs=4 cs=1 l=1 c=0x7f71532e9680).writer done
2015-12-21 15:26:39.104966 7f713c08d700 10 -- 192.168.8.24:6800/127319 shutdown 192.168.8.24:6800/127319
2015-12-21 15:26:39.104991 7f713c08d700 1 -- 192.168.8.24:6800/127319 mark_down_all
2015-12-21 15:26:39.104998 7f713c08d700 5 -- 192.168.8.24:6800/127319 mark_down_all 172.168.8.26:0/2952018400 0x7f7153bd4000
2015-12-21 15:26:39.105008 7f713c08d700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=2 pgs=4 cs=1 l=1 c=0x7f7153a15300).unregister_pipe - not registered
2015-12-21 15:26:39.105016 7f713c08d700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=2 pgs=4 cs=1 l=1 c=0x7f7153a15300).stop
2015-12-21 15:26:39.105052 7f713c08d700 10 -- 10.10.8.24:6800/1127319 shutdown 10.10.8.24:6800/1127319
2015-12-21 15:26:39.105060 7f713c08d700 1 -- 10.10.8.24:6800/1127319 mark_down_all
2015-12-21 15:26:39.105078 7f713c08d700 10 -- 10.10.8.24:0/127319 shutdown 10.10.8.24:0/127319
2015-12-21 15:26:39.105083 7f713c08d700 1 -- 10.10.8.24:0/127319 mark_down_all
2015-12-21 15:26:39.105095 7f713c08d700 10 -- :/127319 shutdown :/127319
2015-12-21 15:26:39.105098 7f713c08d700 1 -- :/127319 mark_down_all
2015-12-21 15:26:39.105110 7f713c08d700 10 -- 192.168.8.24:6802/1127319 shutdown 192.168.8.24:6802/1127319
2015-12-21 15:26:39.105116 7f713c08d700 1 -- 192.168.8.24:6802/1127319 mark_down_all
2015-12-21 15:26:39.105092 7f711fe29700 2 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).reader couldn't read tag, (11) Resource temporarily unavailable
2015-12-21 15:26:39.105130 7f713c08d700 10 -- 10.10.8.24:6801/1127319 shutdown 10.10.8.24:6801/1127319
2015-12-21 15:26:39.105136 7f713c08d700 1 -- 10.10.8.24:6801/1127319 mark_down_all
2015-12-21 15:26:39.105129 7f711fe29700 2 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).fault (11) Resource temporarily unavailable
2015-12-21 15:26:39.105145 7f711fe29700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).fault already closed|closing
2015-12-21 15:26:39.105155 7f711fe29700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).reader done
2015-12-21 15:26:39.105164 7f713c08d700 10 osd.2 0 handle_osd_ping canceling in-flight failure report for osd.0
2015-12-21 15:26:39.105216 7f711fd28700 20 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).writer finishing
2015-12-21 15:26:39.105260 7f711fd28700 10 -- 192.168.8.24:6800/127319 queue_reap 0x7f7153bd4000
2015-12-21 15:26:39.105280 7f711fd28700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).writer done
2015-12-21 15:26:39.105320 7f7147a1c700 10 -- 192.168.8.24:6800/127319 reaper
2015-12-21 15:26:39.105358 7f7147a1c700 10 -- 192.168.8.24:6800/127319 reaper reaping pipe 0x7f7153bd4000 172.168.8.26:0/2952018400
2015-12-21 15:26:39.105369 7f7147a1c700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).discard_queue
2015-12-21 15:26:39.105382 7f7147a1c700 10 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).unregister_pipe - not registered
2015-12-21 15:26:39.105393 7f7147a1c700 20 -- 192.168.8.24:6800/127319 >> 172.168.8.26:0/2952018400 pipe(0x7f7153bd4000 sd=31 :6800 s=4 pgs=4 cs=1 l=1 c=0x7f7153a15300).join
2015-12-21 15:26:39.105888 7f7147a1c700 10 -- 192.168.8.24:6800/127319 reaper reaped pipe 0x7f7153bd4000 172.168.8.26:0/2952018400
2015-12-21 15:26:39.105925 7f7147a1c700 10 -- 192.168.8.24:6800/127319 reaper deleted pipe 0x7f7153bd4000
2015-12-21 15:26:39.105930 7f7147a1c700 10 -- 192.168.8.24:6800/127319 reaper done
2015-12-21 15:26:39.117078 7f713c08d700 -1 *** Caught signal (Segmentation fault) **
in thread 7f713c08d700

ceph version 9.2.0-1515-gd7a1790 (d7a1790076581be2386ef6554126dd486ad7ad57)
1: (()+0x68c58a) [0x7f714eedd58a]
2: (()+0x10340) [0x7f714dd64340]
3: (OSD::handle_osd_map(MOSDMap*)+0x1b28) [0x7f714eb26e48]
4: (OSD::_dispatch(Message*)+0x2a7) [0x7f714eb31557]
5: (OSD::ms_dispatch(Message*)+0x20f) [0x7f714eb31bbf]
6: (DispatchQueue::entry()+0x64a) [0x7f714f08803a]
7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f714efb0fed]
8: (()+0x8182) [0x7f714dd5c182]
9: (clone()+0x6d) [0x7f714c0a347d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
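
The log above interleaves several threads, so it can help to isolate what the crashing thread (7f713c08d700) did just before the signal and to list the named frames of the backtrace. The following is a minimal Python sketch for that, not part of Ceph. The function names `thread_timeline` and `named_frames` are hypothetical, and the regular expressions assume the line layout seen in this report (date, time, thread ID, debug level, message); other Ceph versions or log settings may format lines differently. The sample lines are copied from the log and backtrace above.

```python
import re

# Assumed layout of a debug log line in this report:
# "<date> <time> <thread id> <level> <message>"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>[\d:.]+) "
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+) (?P<msg>.*)$"
)

# Assumed layout of a backtrace frame: "<n>: (<symbol>+<offset>) [<address>]"
FRAME = re.compile(
    r"^\s*(?P<idx>\d+): \((?P<sym>.*)\+(?P<off>0x[0-9a-f]+)\) "
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)


def thread_timeline(lines, thread_id):
    """Return (time, message) pairs logged by one thread, in log order."""
    timeline = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("thread") == thread_id:
            timeline.append((m.group("time"), m.group("msg")))
    return timeline


def named_frames(lines):
    """Return the symbols of backtrace frames that carry a symbol name."""
    frames = []
    for line in lines:
        m = FRAME.match(line)
        # Frames printed as "(()+0x...)" have no symbol name; skip them.
        if m and m.group("sym") != "()":
            frames.append(m.group("sym"))
    return frames


# Lines copied verbatim from this report.
SAMPLE = """\
2015-12-21 15:26:39.104966 7f713c08d700 10 -- 192.168.8.24:6800/127319 shutdown 192.168.8.24:6800/127319
2015-12-21 15:26:39.105260 7f711fd28700 10 -- 192.168.8.24:6800/127319 queue_reap 0x7f7153bd4000
2015-12-21 15:26:39.105164 7f713c08d700 10 osd.2 0 handle_osd_ping canceling in-flight failure report for osd.0
2015-12-21 15:26:39.105930 7f7147a1c700 10 -- 192.168.8.24:6800/127319 reaper done
1: (()+0x68c58a) [0x7f714eedd58a]
3: (OSD::handle_osd_map(MOSDMap*)+0x1b28) [0x7f714eb26e48]
4: (OSD::_dispatch(Message*)+0x2a7) [0x7f714eb31557]
"""

if __name__ == "__main__":
    lines = SAMPLE.splitlines()
    for time, msg in thread_timeline(lines, "7f713c08d700"):
        print(time, msg)
    for sym in named_frames(lines):
        print(sym)
```

Run against the full log, the timeline for 7f713c08d700 shows the messenger shutdown calls followed by the handle_osd_ping message, and the backtrace places the fault in OSD::handle_osd_map on the same thread. Resolving the unnamed frames still requires the executable, as the NOTE above says.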
#2

Updated by Nathan Cutler over 8 years ago

  • Status changed from New to Fix Under Review
#3

Updated by Josh Durgin about 7 years ago

  • Status changed from Fix Under Review to Resolved