Project

General

Profile

Actions

Bug #6992

closed

OSD assert fails after it found it was marked as down by monitor during high system load because of newly_acked_seq != out_seq

Added by Xihui He over 10 years ago. Updated about 10 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
emperor, dumpling
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

If the OSD load is high and heartbeat message is not replied timely, it may be marked as down by monitor. In this case SimpleMessenger::mark_down_all() is called. All local pipes are stopped, but the remote pipes may be still there. If the OSD reconnect the remote OSD, the existing remote pipe is used, but the local pipe is a new pipe, so the newly_acked_seq and out_seq mismatch, and the local out_q is empty, then assert(m) fails.

To fix this issue, there are two options in my mind:
1. In Pipe::stop, set the state = STATE_CLOSING rather than STATE_CLOSED, however as the system load is high, not sure if this would work.
2. Use lossless_peer_reuse Policy for Pipe between OSDs, so during the Pipe::accept it would do reset check, however I don't know if this would have other side effort.

I prefer to option 1.

bt:
#0 0x00000034732328a5 in raise () from /lib64/libc.so.6
#1 0x0000003473234085 in abort () from /lib64/libc.so.6
#2 0x00000034752be971 in _gnu_cxx::_verbose_terminate_handler() () from
/usr/lib64/libstdc++.so.6
#3 0x00000034752bcbe6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x00000034752bcc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x00000034752bcd0e in _cxa_throw () from /usr/lib64/libstdc++.so.6
#6 0x00000000008ad729 in ceph::
_ceph_assert_fail (assertion=0x35a01c0 "\001",
file=0xb8a32700 "@k\243\270", line=1043,
func=0xaf1500 "int Pipe::connect()") at common/assert.cc:77
#7 0x00000000009a2035 in Pipe::connect (this=0x35a0098) at msg/Pipe.cc:1043
#8 0x00000000009a2983 in Pipe::writer (this=0x5fce1680) at msg/Pipe.cc:1505
#9 0x00000000009afe8d in Pipe::Writer::entry (this=<value optimized out>) at
msg/Pipe.h:59
#10 0x0000003473607851 in start_thread () from /lib64/libpthread.so.0
#11 0x00000034732e890d in clone () from /lib64/libc.so.6

log from crashed osd:
8962> 2013-12-10 10:51:51.909688 7fc95849e700 0 log [WRN] : map e32948 wrongly marked me down
-8961> 2013-12-10 10:51:51.909702 7fc95849e700 1 osd.445 32948 start_waiting_for_healthy
-8960> 2013-12-10 10:51:51.909707 7fc95849e700 1 -
10.194.0.226:6817/17636 rebind avoid 6817,6818,6819
8959> 2013-12-10 10:51:51.909718 7fc95849e700 1 accepter.accepter.rebind avoid 6817,6818,6819
-8958> 2013-12-10 10:51:51.910718 7fc95849e700 1 -
10.194.0.226:0/17636 learned my addr 10.194.0.226:0/17636
8957> 2013-12-10 10:51:51.910741 7fc95849e700 1 accepter.accepter.bind my_inst.addr is 10.194.0.226:6850/1017636 need_addr=0
-8956> 2013-12-10 10:51:51.910747 7fc95849e700 1 accepter.accepter.start
-8955> 2013-12-10 10:51:51.910785 7fc95849e700 1 -
10.194.0.226:6850/1017636 mark_down_all

234> 2013-12-10 10:53:09.880259 7fc961708700 2 - 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x3d818000 sd=334 :57409 s=1 pgs=0 cs=0 l=0 c=0x4a3ca680). got newly_acked_seq 2022 vs out_seq 0

0> 2013-12-10 10:53:11.907809 7fc961708700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fc961708700 time 2013-12-10 10:53:09.880287
msg/Pipe.cc: 1043: FAILED assert(m)

ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
1: (Pipe::connect()+0x3cb5) [0x992095]
2: (Pipe::writer()+0x5f3) [0x9929e3]
3: (Pipe::Writer::entry()+0xd) [0x99feed]
4: /lib64/libpthread.so.0() [0x3da9e07851]
5: (clone()+0x6d) [0x3da9ae890d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

log from peer:
2013-12-10 10:53:09.879900 7fbb9c3a0700 0 -- 10.193.207.20:6829/1027483 >> 10.194.0.226:6850/1017636 pipe(0x250ddf00 sd=679 :6829 s=0 pgs=0 cs=0 l=0 c=0xb3617220).accept connect_seq 0 vs existing 59 state standby
2013-12-10 10:53:09.879929 7fbb9c3a0700 0 -- 10.193.207.20:6829/1027483 >> 10.194.0.226:6850/1017636 pipe(0x250ddf00 sd=679 :6829 s=0 pgs=0 cs=0 l=0 c=0xb3617220).accept peer reset, then tried to connect to us, replacing


Files

bug (5.33 MB) bug Xihui He, 12/16/2013 02:25 AM
Actions

Also available in: Atom PDF