Bug #15113

Updated by Kefu Chai about 8 years ago

see /a/kchai-2016-03-11_19:18:43-rados-wip-kefu-testing---basic-smithi/54873/remote/smithi029/ 

 <pre> 
 2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183 
 mon/Monitor.cc: 2599: FAILED assert(session) 

  ceph version 10.0.4-2138-g45b885e (45b885ec9e3458bd6af5d84089194cb91a26568f) 
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f91f14f0e65] 
  2: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x10cf) [0x7f91f116fcff] 
  3: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7f91f117c22b] 
  4: (Monitor::C_RetryMessage::_finish(int)+0x74) [0x7f91f119e1d4] 
  5: (C_MonOp::finish(int)+0x79) [0x7f91f1183019] 
  6: (Context::complete(int)+0x9) [0x7f91f11820d9] 
  7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7f91f11885cc] 
  8: (Monitor::resend_routed_requests()+0x993) [0x7f91f11741e3] 
  9: (Monitor::finish_election()+0x7b) [0x7f91f117446b] 
  10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x285) [0x7f91f1174ad5] 
  11: (Elector::victory()+0x40e) [0x7f91f12993be] 
  12: (Elector::expire()+0x89) [0x7f91f129a6f9] 
  13: (Context::complete(int)+0x9) [0x7f91f11820d9] 
  14: (SafeTimer::timer_thread()+0x104) [0x7f91f14e1774] 
  15: (SafeTimerThread::entry()+0xd) [0x7f91f14e31ad] 
  16: (()+0x7dc5) [0x7f91efa5fdc5] 
  17: (clone()+0x6d) [0x7f91ee32821d] 
 </pre> 

 <pre> 
   -800> 2016-03-12 07:47:14.072707 7f91e9a36700 10 mon.a@1(peon).paxosservice(osdmap 207..387) dispatch 0x7f91fc9b2800 mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) v1 from client.4472 172.21.15.14:0/4107930439 con 0x7f91fdc63800 
 ... 
   -771> 2016-03-12 07:47:14.072932 7f91e9a36700 10 mon.a@1(peon) e1 forward_request 221 request log(1 entries from seq 235 at 2016-03-12 07:47:14.072681) v1 features 0 
 ... 
    -83> 2016-03-12 07:47:41.423231 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24 
    -82> 2016-03-12 07:47:41.423263 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed 
    -81> 2016-03-12 07:47:41.423270 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed 
    -80> 2016-03-12 07:47:41.423287 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing 
    -79> 2016-03-12 07:47:41.423294 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop 

 ... 
    -72> 2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439 
    -71> 2016-03-12 07:47:41.423529 7f91e9235700    1 Event(0x7f91fc977448 owner=140264666785536 nevent=5000 time_id=64).wakeup 
    -70> 2016-03-12 07:47:41.423575 7f91e9235700    1 Event(0x7f91fc976f48 owner=140264650000128 nevent=5000 time_id=43).wakeup 
    -69> 2016-03-12 07:47:41.458297 7f91e8233700 10 mon.a@1(electing) e1 join_election 
 ... 
     -7> 2016-03-12 07:47:41.485128 7f91e8233700 10 mon.a@1(leader) e1 resend_routed_requests 
     -6> 2016-03-12 07:47:41.485131 7f91e8233700 10 mon.a@1(leader) e1    requeue for self tid 220 
     -5> 2016-03-12 07:47:41.485135 7f91e8233700    5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485135, event: retry routed request, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) 
     -4> 2016-03-12 07:47:41.485148 7f91e8233700 10 mon.a@1(leader) e1    requeue for self tid 221 
     -3> 2016-03-12 07:47:41.485152 7f91e8233700    5 -- op tracker -- seq: 5075, time: 2016-03-12 07:47:41.485152, event: retry routed request, op: log(1 entries from seq 235 at 2016-03-12 07:47:14.072681) 
     -2> 2016-03-12 07:47:41.485164 7f91e8233700    5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485164, event: callback finished, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) 
     -1> 2016-03-12 07:47:41.485171 7f91e8233700    5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485171, event: mon:dispatch_op, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) 
      0> 2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183 
 </pre> 

 The session of the routed request should have been removed in @Monitor::ms_handle_reset(Connection *con)@, just as it was earlier (a simplified sketch of the expected cleanup follows the excerpt below): 

 <pre> 
 2016-03-12 07:43:07.956767 7f91e8a34700 10 mon.a@1(peon) e1 ms_handle_reset 0x7f91fd14f800 172.21.15.14:0/4107930439 
 2016-03-12 07:43:07.956777 7f91e8a34700 10 mon.a@1(peon) e1 reset/close on session client.4472 172.21.15.14:0/4107930439 
 2016-03-12 07:43:07.956785 7f91e8a34700 10 mon.a@1(peon) e1 remove_session 0x7f91fd141640 client.4472 172.21.15.14:0/4107930439 
 </pre> 
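
 For context, here is a simplified, standalone sketch of that expected cleanup. The types and members below are illustrative stand-ins, not the actual Ceph classes: the point is only that @ms_handle_reset@ removes the session attached to the connection and drops the routed requests that belong to it, so a later @resend_routed_requests()@ never sees them. 

 <pre> 
// Illustrative stand-ins only -- not the actual Ceph classes.
#include <cstdint>
#include <iostream>
#include <map>
#include <memory>
#include <set>
#include <string>

struct MonSession {
  std::string name;                        // e.g. "client.4472"
  std::set<uint64_t> routed_request_tids;  // tids of requests routed on this session
};

struct Monitor {
  // tid -> session that originated the routed request (simplified)
  std::map<uint64_t, std::shared_ptr<MonSession>> routed_requests;

  void remove_session(const std::shared_ptr<MonSession>& s) {
    std::cout << "remove_session " << s->name << "\n";
    // Removing the session also drops its pending routed requests, so a later
    // resend_routed_requests() has nothing of this client's left to re-dispatch.
    for (uint64_t tid : s->routed_request_tids)
      routed_requests.erase(tid);
  }

  void ms_handle_reset(const std::shared_ptr<MonSession>& s) {
    std::cout << "ms_handle_reset: reset/close on session " << s->name << "\n";
    remove_session(s);
  }
};

int main() {
  Monitor mon;
  auto s = std::make_shared<MonSession>();
  s->name = "client.4472";
  s->routed_request_tids = {220, 221};
  mon.routed_requests[220] = s;
  mon.routed_requests[221] = s;

  mon.ms_handle_reset(s);  // expected path: the session and tids 220/221 go away
  std::cout << "routed requests left: " << mon.routed_requests.size() << "\n";  // prints 0
}
 </pre> 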

 But what we have this time is: 
 <pre> 
 2016-03-12 07:47:41.423231 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24 
 2016-03-12 07:47:41.423263 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed 
 2016-03-12 07:47:41.423270 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed 
 2016-03-12 07:47:41.423287 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing 
 2016-03-12 07:47:41.423294 7f91e9a36700    1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop 
 2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439 
 </pre> 

 So the session was not removed; that is why the request was still handled after the connection was reset. This is a race condition (see the sketch after the list): 

# elector: in win_election(), it calls resend_routed_requests() and collects the routed requests 
# msgr: in ms_handle_reset(), it resets the session 
# msgr: it waits for the monitor lock 
# elector: in win_election(), it calls handle_command(), but the session has already been reset, hence the assertion failure 
# msgr: it removes the session and erases the related requests from Monitor::routed_requests 
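
 Here is a simplified, sequential sketch of that interleaving, again with illustrative stand-in types (not the actual Ceph classes) and with the two threads replayed one step at a time rather than actually raced: the elector collects the routed requests while holding the monitor lock, the messenger resets the connection's session but cannot take the lock to remove it, and the re-dispatched command then hits @assert(session)@. 

 <pre> 
// Illustrative stand-in types only -- not the real Ceph classes. The two
// threads from the list above are replayed sequentially; comments mark which
// thread performs each step in the actual race.
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

struct MonSession {};

struct Connection {
  // In this sketch the session hangs off the connection and is cleared when
  // the connection is reset.
  std::shared_ptr<MonSession> session;
};

struct MonOp {
  uint64_t tid = 0;
  std::shared_ptr<Connection> con;
  MonSession* get_session() const { return con ? con->session.get() : nullptr; }
};

struct Monitor {
  std::map<uint64_t, MonOp> routed_requests;

  void handle_command(const MonOp& op) {
    MonSession* session = op.get_session();
    assert(session);  // the check that fails at mon/Monitor.cc:2599 (build without NDEBUG)
  }
};

int main() {
  Monitor mon;
  auto con = std::make_shared<Connection>();
  con->session = std::make_shared<MonSession>();
  mon.routed_requests[221] = MonOp{221, con};

  // step 1 (elector, holding the monitor lock): resend_routed_requests()
  // collects the pending routed requests for re-dispatch.
  std::vector<MonOp> retry;
  for (const auto& kv : mon.routed_requests)
    retry.push_back(kv.second);

  // step 2 (msgr): the lossy connection faults and its session is reset.
  con->session.reset();
  // step 3 (msgr): ms_handle_reset() now blocks waiting for the monitor lock,
  // so remove_session() cannot clean up routed_requests yet.

  // step 4 (elector): the collected request is dispatched to handle_command(),
  // which finds no session and aborts.
  for (const auto& op : retry)
    mon.handle_command(op);

  // step 5 (msgr): remove_session() and the routed_requests cleanup would run
  // here, but the assert has already fired.
}
 </pre> 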
 
