Bug #6992: OSD assert fails after it found it was marked as down by monitor during high system load because of newly_acked_seq != out_seq - Ceph - Ceph

Actions

Copy link

Bug #6992

closed

OSD assert fails after it found it was marked as down by monitor during high system load because of newly_acked_seq != out_seq

Added by Xihui He over 10 years ago. Updated about 10 years ago.

Status:

Resolved

Priority:

High

Assignee:

Category:

OSD

Target version:

% Done:

Source:

Community (user)

Tags:

Backport:

emperor, dumpling

Regression:

Severity:

2 - major

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

If the OSD load is high and heartbeat message is not replied timely, it may be marked as down by monitor. In this case SimpleMessenger::mark_down_all() is called. All local pipes are stopped, but the remote pipes may be still there. If the OSD reconnect the remote OSD, the existing remote pipe is used, but the local pipe is a new pipe, so the newly_acked_seq and out_seq mismatch, and the local out_q is empty, then assert(m) fails.

To fix this issue, there are two options in my mind:
1. In Pipe::stop, set the state = STATE_CLOSING rather than STATE_CLOSED, however as the system load is high, not sure if this would work.
2. Use lossless_peer_reuse Policy for Pipe between OSDs, so during the Pipe::accept it would do reset check, however I don't know if this would have other side effort.

I prefer to option 1.

bt:
#0 0x00000034732328a5 in raise () from /lib64/libc.so.6
#1 0x0000003473234085 in abort () from /lib64/libc.so.6
#2 0x00000034752be971 in _gnu_cxx::_verbose_terminate_handler() () from
/usr/lib64/libstdc++.so.6
#3 0x00000034752bcbe6 in ?? () from /usr/lib64/libstdc++.so.6
#4 0x00000034752bcc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x00000034752bcd0e in _cxa_throw () from /usr/lib64/libstdc++.so.6
#6 0x00000000008ad729 in ceph::_ceph_assert_fail (assertion=0x35a01c0 "\001",
file=0xb8a32700 "@k\243\270", line=1043,
func=0xaf1500 "int Pipe::connect()") at common/assert.cc:77
#7 0x00000000009a2035 in Pipe::connect (this=0x35a0098) at msg/Pipe.cc:1043
#8 0x00000000009a2983 in Pipe::writer (this=0x5fce1680) at msg/Pipe.cc:1505
#9 0x00000000009afe8d in Pipe::Writer::entry (this=<value optimized out>) at
msg/Pipe.h:59
#10 0x0000003473607851 in start_thread () from /lib64/libpthread.so.0
#11 0x00000034732e890d in clone () from /lib64/libc.so.6

log from crashed osd:
8962> 2013-12-10 10:51:51.909688 7fc95849e700 0 log [WRN] : map e32948 wrongly marked me down
-8961> 2013-12-10 10:51:51.909702 7fc95849e700 1 osd.445 32948 start_waiting_for_healthy
-8960> 2013-12-10 10:51:51.909707 7fc95849e700 1 - 10.194.0.226:6817/17636 rebind avoid 6817,6818,6819
8959> 2013-12-10 10:51:51.909718 7fc95849e700 1 accepter.accepter.rebind avoid 6817,6818,6819
-8958> 2013-12-10 10:51:51.910718 7fc95849e700 1 - 10.194.0.226:0/17636 learned my addr 10.194.0.226:0/17636
8957> 2013-12-10 10:51:51.910741 7fc95849e700 1 accepter.accepter.bind my_inst.addr is 10.194.0.226:6850/1017636 need_addr=0
-8956> 2013-12-10 10:51:51.910747 7fc95849e700 1 accepter.accepter.start
-8955> 2013-12-10 10:51:51.910785 7fc95849e700 1 - 10.194.0.226:6850/1017636 mark_down_all

~~234> 2013-12-10 10:53:09.880259 7fc961708700 2 -~~ 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x3d818000 sd=334 :57409 s=1 pgs=0 cs=0 l=0 c=0x4a3ca680). got newly_acked_seq 2022 vs out_seq 0

0> 2013-12-10 10:53:11.907809 7fc961708700 -1 msg/Pipe.cc: In function 'int Pipe::connect()' thread 7fc961708700 time 2013-12-10 10:53:09.880287
msg/Pipe.cc: 1043: FAILED assert(m)

ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
1: (Pipe::connect()+0x3cb5) [0x992095]
2: (Pipe::writer()+0x5f3) [0x9929e3]
3: (Pipe::Writer::entry()+0xd) [0x99feed]
4: /lib64/libpthread.so.0() [0x3da9e07851]
5: (clone()+0x6d) [0x3da9ae890d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

log from peer:
2013-12-10 10:53:09.879900 7fbb9c3a0700 0 -- 10.193.207.20:6829/1027483 >> 10.194.0.226:6850/1017636 pipe(0x250ddf00 sd=679 :6829 s=0 pgs=0 cs=0 l=0 c=0xb3617220).accept connect_seq 0 vs existing 59 state standby
2013-12-10 10:53:09.879929 7fbb9c3a0700 0 -- 10.193.207.20:6829/1027483 >> 10.194.0.226:6850/1017636 pipe(0x250ddf00 sd=679 :6829 s=0 pgs=0 cs=0 l=0 c=0xb3617220).accept peer reset, then tried to connect to us, replacing

Files

bug (5.33 MB) bug

Xihui He, 12/16/2013 02:25 AM

Actions

Copy link

Updated by Xihui He over 10 years ago

https://github.com/ceph/ceph/pull/938

Actions

Copy link

Updated by Sage Weil over 10 years ago

Do you have complete logs of this? I don't think this is the full story.

If an osd is marked down, it will bind to complete new ip addresses and ports with unique none in entity_addr_t. Remote osds reconnecting will see a peer addr mismatch and stop, and outgoing connections will have a different addr and have no association with the old connect.

There is some weird bug hiding in this code, but I haven't been able to reproduce it with complete messenger logs (debug ms = 20); if you have/can, that would be immensely helpful!

Actions

Copy link

Updated by Sage Weil over 10 years ago

Status changed from New to Need More Info

Actions

Copy link

Updated by Xihui He over 10 years ago

File bug bug added

Then I think this may be a multi thread issue. In SimpleMessenger::rebind(), it calls accepter.rebind() before mark_down_all(). Below log shows for peer "10.193.207.20:6829/1027483", the nonce is increased from 17636 to 1017636, but the connection seq is not reset. The full log is also attached.

P.S. I don't think forcefully close the socket in stop() is a good idea.

2013-12-10 10:18:44.066739 7fc949a02700  0 -- 10.194.0.226:6817/17636 >> 10.193.207.20:6829/1027483 pipe(0x3db7d00 sd=297 :6817 s=0 pgs=0 cs=0 l=0 c=0x42709340).accept connect_seq 54 vs existing 53 state standby
2013-12-10 10:34:19.084310 7fc949a02700  0 -- 10.194.0.226:6817/17636 >> 10.193.207.20:6829/1027483 pipe(0x3db7d00 sd=297 :6817 s=2 pgs=1203932 cs=55 l=0 c=0x4001e1a0).fault with nothing to send, going to standby
2013-12-10 10:51:18.575358 7fc949414700  0 -- 10.194.0.226:6817/17636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=0 pgs=0 cs=0 l=0 c=0x58806040).accept connect_seq 56 vs existing 55 state standby
2013-12-10 10:51:52.207225 7fc949414700  0 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=2 pgs=1204224 cs=57 l=0 c=0x4001e1a0).fault, initiating reconnect
 -8127> 2013-12-10 10:51:52.207155 7fc949414700  2 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=2 pgs=1204224 cs=57 l=0 c=0x4001e1a0).reader couldn't read tag, Transport endpoint is not connected
 -8126> 2013-12-10 10:51:52.207192 7fc949414700  2 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=2 pgs=1204224 cs=57 l=0 c=0x4001e1a0).fault 107: Transport endpoint is not connected
 -8125> 2013-12-10 10:51:52.207225 7fc949414700  0 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=2 pgs=1204224 cs=57 l=0 c=0x4001e1a0).fault, initiating reconnect
 -6331> 2013-12-10 10:51:53.660409 7fc95849e700  5 -- 10.194.0.226:6850/1017636 mark_down_all 10.193.207.20:6829/1027483 0x41908f00
 -6327> 2013-12-10 10:51:53.660441 7fc949414700  2 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :44074 s=4 pgs=1204237 cs=59 l=0 c=0x4001e1a0).reader couldn't read tag, Success
 -6326> 2013-12-10 10:51:53.660522 7fc949414700  2 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :44074 s=4 pgs=1204237 cs=59 l=0 c=0x4001e1a0).fault 0: Success
  -234> 2013-12-10 10:53:09.880259 7fc961708700  2 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x3d818000 sd=334 :57409 s=1 pgs=0 cs=0 l=0 c=0x4a3ca680). got newly_acked_seq 2022 vs out_seq 0
2013-12-10 12:10:59.253319 7f8d9ac18700  0 -- 10.194.0.226:6816/43018 >> 10.193.207.20:6829/1027483 pipe(0x1e725780 sd=168 :44445 s=2 pgs=1204263 cs=1 l=0 c=0x1e70ca40).fault with nothing to send, going to standby

Actions

Copy link

Updated by Xihui He over 10 years ago

https://github.com/ceph/ceph/pull/1016/

From the log the rebind was at 2013-12-10 10:51:51.909718
2013-12-10 10:51:51.909718 7fc95849e700 1 accepter.accepter.rebind avoid 6817,6818,6819

The mark down was at 2013-12-10 10:51:53.660409
2013-12-10 10:51:53.660409 7fc95849e700 5 -- 10.194.0.226:6850/1017636 mark_down_all 10.193.207.20:6829/1027483 0x41908f00

And there was a reconnect between the rebind and mark down
2013-12-10 10:51:52.207225 7fc949414700 0 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=2 pgs=1204224 cs=57 l=0 c=0x4001e1a0).fault, initiating reconnect

After the rebind, the nonce is increased from 17636 to 1017636. In the reconnect a new connection is used (0x58806040 vs 0x4001e1a0), but the connect_seq is not reset.
2013-12-10 10:51:18.575358 7fc949414700 0 -- 10.194.0.226:6817/17636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=0 pgs=0 cs=0 l=0 c=0x58806040).accept connect_seq 56 vs existing 55 state standby
2013-12-10 10:51:52.207225 7fc949414700 0 -- 10.194.0.226:6850/1017636 >> 10.193.207.20:6829/1027483 pipe(0x41908f00 sd=29 :6817 s=2 pgs=1204224 cs=57 l=0 c=0x4001e1a0).fault, initiating reconnect

To avoid this case, I think we should stop the accepter and mark_down_all before rebind.

Actions

Copy link