Bug #15113
Updated by Kefu Chai about 8 years ago
see /a/kchai-2016-03-11_19:18:43-rados-wip-kefu-testing---basic-smithi/54873/remote/smithi029/

<pre>
2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183
mon/Monitor.cc: 2599: FAILED assert(session)

 ceph version 10.0.4-2138-g45b885e (45b885ec9e3458bd6af5d84089194cb91a26568f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f91f14f0e65]
 2: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x10cf) [0x7f91f116fcff]
 3: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7f91f117c22b]
 4: (Monitor::C_RetryMessage::_finish(int)+0x74) [0x7f91f119e1d4]
 5: (C_MonOp::finish(int)+0x79) [0x7f91f1183019]
 6: (Context::complete(int)+0x9) [0x7f91f11820d9]
 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7f91f11885cc]
 8: (Monitor::resend_routed_requests()+0x993) [0x7f91f11741e3]
 9: (Monitor::finish_election()+0x7b) [0x7f91f117446b]
 10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x285) [0x7f91f1174ad5]
 11: (Elector::victory()+0x40e) [0x7f91f12993be]
 12: (Elector::expire()+0x89) [0x7f91f129a6f9]
 13: (Context::complete(int)+0x9) [0x7f91f11820d9]
 14: (SafeTimer::timer_thread()+0x104) [0x7f91f14e1774]
 15: (SafeTimerThread::entry()+0xd) [0x7f91f14e31ad]
 16: (()+0x7dc5) [0x7f91efa5fdc5]
 17: (clone()+0x6d) [0x7f91ee32821d]
</pre>

<pre>
 -800> 2016-03-12 07:47:14.072707 7f91e9a36700 10 mon.a@1(peon).paxosservice(osdmap 207..387) dispatch 0x7f91fc9b2800 mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) v1 from client.4472 172.21.15.14:0/4107930439 con 0x7f91fdc63800
...
 -771> 2016-03-12 07:47:14.072932 7f91e9a36700 10 mon.a@1(peon) e1 forward_request 221 request log(1 entries from seq 235 at 2016-03-12 07:47:14.072681) v1 features 0
...
  -83> 2016-03-12 07:47:41.423231 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24
  -82> 2016-03-12 07:47:41.423263 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed
  -81> 2016-03-12 07:47:41.423270 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed
  -80> 2016-03-12 07:47:41.423287 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing
  -79> 2016-03-12 07:47:41.423294 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop
...
  -72> 2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439
  -71> 2016-03-12 07:47:41.423529 7f91e9235700  1 Event(0x7f91fc977448 owner=140264666785536 nevent=5000 time_id=64).wakeup
  -70> 2016-03-12 07:47:41.423575 7f91e9235700  1 Event(0x7f91fc976f48 owner=140264650000128 nevent=5000 time_id=43).wakeup
  -69> 2016-03-12 07:47:41.458297 7f91e8233700 10 mon.a@1(electing) e1 join_election
...
   -7> 2016-03-12 07:47:41.485128 7f91e8233700 10 mon.a@1(leader) e1 resend_routed_requests
   -6> 2016-03-12 07:47:41.485131 7f91e8233700 10 mon.a@1(leader) e1 requeue for self tid 220
   -5> 2016-03-12 07:47:41.485135 7f91e8233700  5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485135, event: retry routed request, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0)
   -4> 2016-03-12 07:47:41.485148 7f91e8233700 10 mon.a@1(leader) e1 requeue for self tid 221
   -3> 2016-03-12 07:47:41.485152 7f91e8233700  5 -- op tracker -- seq: 5075, time: 2016-03-12 07:47:41.485152, event: retry routed request, op: log(1 entries from seq 235 at 2016-03-12 07:47:14.072681)
   -2> 2016-03-12 07:47:41.485164 7f91e8233700  5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485164, event: callback finished, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0)
   -1> 2016-03-12 07:47:41.485171 7f91e8233700  5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485171, event: mon:dispatch_op, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0)
    0> 2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183
</pre>

The session of the routed request should have been removed in @Monitor::ms_handle_reset(Connection *con)@, just like before:

<pre>
2016-03-12 07:43:07.956767 7f91e8a34700 10 mon.a@1(peon) e1 ms_handle_reset 0x7f91fd14f800 172.21.15.14:0/4107930439
2016-03-12 07:43:07.956777 7f91e8a34700 10 mon.a@1(peon) e1 reset/close on session client.4472 172.21.15.14:0/4107930439
2016-03-12 07:43:07.956785 7f91e8a34700 10 mon.a@1(peon) e1 remove_session 0x7f91fd141640 client.4472 172.21.15.14:0/4107930439
</pre>

but what we actually have is:

<pre>
2016-03-12 07:47:41.423231 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24
2016-03-12 07:47:41.423263 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed
2016-03-12 07:47:41.423270 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed
2016-03-12 07:47:41.423287 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing
2016-03-12 07:47:41.423294 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop
2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439
</pre>

So the session was not removed; that is why the request was still handled after the connection had been reset. This is a race condition:

# elector: in win_election(), it calls resend_routed_requests() and collects the routed requests
# msgr: in ms_handle_reset(), the connection is reset
# msgr: it waits for the monitor lock before it can clean up the session
# elector: still in win_election(), it dispatches handle_command(), but the session has already been reset, hence it panics
# msgr: it finally removes the session and erases the related requests from Monitor::routed_requests