Actions
Bug #15113
closedassert(session) in Monitor::handle_command() if client resets the conn
% Done:
0%
Source:
Development
Tags:
Backport:
infernalis
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
see /a/kchai-2016-03-11_19:18:43-rados-wip-kefu-testing---basic-smithi/54873/remote/smithi029/
2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183 mon/Monitor.cc: 2599: FAILED assert(session) ceph version 10.0.4-2138-g45b885e (45b885ec9e3458bd6af5d84089194cb91a26568f) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f91f14f0e65] 2: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x10cf) [0x7f91f116fcff] 3: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7f91f117c22b] 4: (Monitor::C_RetryMessage::_finish(int)+0x74) [0x7f91f119e1d4] 5: (C_MonOp::finish(int)+0x79) [0x7f91f1183019] 6: (Context::complete(int)+0x9) [0x7f91f11820d9] 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7f91f11885cc] 8: (Monitor::resend_routed_requests()+0x993) [0x7f91f11741e3] 9: (Monitor::finish_election()+0x7b) [0x7f91f117446b] 10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x285) [0x7f91f1174ad5] 11: (Elector::victory()+0x40e) [0x7f91f12993be] 12: (Elector::expire()+0x89) [0x7f91f129a6f9] 13: (Context::complete(int)+0x9) [0x7f91f11820d9] 14: (SafeTimer::timer_thread()+0x104) [0x7f91f14e1774] 15: (SafeTimerThread::entry()+0xd) [0x7f91f14e31ad] 16: (()+0x7dc5) [0x7f91efa5fdc5] 17: (clone()+0x6d) [0x7f91ee32821d]
-800> 2016-03-12 07:47:14.072707 7f91e9a36700 10 mon.a@1(peon).paxosservice(osdmap 207..387) dispatch 0x7f91fc9b2800 mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) v1 from client.4472 172.21.15.14:0/4107930439 con 0x7f91fdc63800 ... -771> 2016-03-12 07:47:14.072932 7f91e9a36700 10 mon.a@1(peon) e1 forward_request 221 request log(1 entries from seq 235 at 2016-03-12 07:47:14.072681) v1 features 0 ... -83> 2016-03-12 07:47:41.423231 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24 -82> 2016-03-12 07:47:41.423263 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed -81> 2016-03-12 07:47:41.423270 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed -80> 2016-03-12 07:47:41.423287 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing -79> 2016-03-12 07:47:41.423294 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop ... -72> 2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439 -71> 2016-03-12 07:47:41.423529 7f91e9235700 1 Event(0x7f91fc977448 owner=140264666785536 nevent=5000 time_id=64).wakeup -70> 2016-03-12 07:47:41.423575 7f91e9235700 1 Event(0x7f91fc976f48 owner=140264650000128 nevent=5000 time_id=43).wakeup -69> 2016-03-12 07:47:41.458297 7f91e8233700 10 mon.a@1(electing) e1 join_election ... -7> 2016-03-12 07:47:41.485128 7f91e8233700 10 mon.a@1(leader) e1 resend_routed_requests -6> 2016-03-12 07:47:41.485131 7f91e8233700 10 mon.a@1(leader) e1 requeue for self tid 220 -5> 2016-03-12 07:47:41.485135 7f91e8233700 5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485135, event: retry routed request, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) -4> 2016-03-12 07:47:41.485148 7f91e8233700 10 mon.a@1(leader) e1 requeue for self tid 221 -3> 2016-03-12 07:47:41.485152 7f91e8233700 5 -- op tracker -- seq: 5075, time: 2016-03-12 07:47:41.485152, event: retry routed request, op: log(1 entries from seq 235 at 2016-03-12 07:47:14.072681) -2> 2016-03-12 07:47:41.485164 7f91e8233700 5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485164, event: callback finished, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smit hi014-18648-73"} v 0) -1> 2016-03-12 07:47:41.485171 7f91e8233700 5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485171, event: mon:dispatch_op, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi 014-18648-73"} v 0) 0> 2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183
the session of the routed request should have been removed in Monitor::ms_handle_reset(Connection *con)
, just like before:
2016-03-12 07:43:07.956767 7f91e8a34700 10 mon.a@1(peon) e1 ms_handle_reset 0x7f91fd14f800 172.21.15.14:0/4107930439 2016-03-12 07:43:07.956777 7f91e8a34700 10 mon.a@1(peon) e1 reset/close on session client.4472 172.21.15.14:0/4107930439 2016-03-12 07:43:07.956785 7f91e8a34700 10 mon.a@1(peon) e1 remove_session 0x7f91fd141640 client.4472 172.21.15.14:0/4107930439
but what we have is:
2016-03-12 07:47:41.423231 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24 2016-03-12 07:47:41.423263 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed 2016-03-12 07:47:41.423270 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed 2016-03-12 07:47:41.423287 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing 2016-03-12 07:47:41.423294 7f91e9a36700 1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop 2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439
so the session was not removed, that's why the request was handled after the connection is reset. this is a race condition:
- ______________________ SafeTimer::timer_thread(), with mon_lock:
- ______________________ elector: in win_election(), it resend_routed_requests(), and collects the routed requests
- msgr: in ms_handle_reset(), it reset the session
- msgr: it waits for the lock
- ______________________ elector: in win_election(), it handle_command(), but the session is reset, hence it panics.
- msgr: remove session, and erase related requests from Monitor::routed_requests.
Updated by Yuri Weinstein about 8 years ago
Run: http://pulpito.ceph.com/teuthology-2016-03-12_22:00:02-rados-jewel-distro-basic-smithi/
Job: 57270
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-03-12_22:00:02-rados-jewel-distro-basic-smithi/57270/teuthology.log
quorum_status' 2016-03-13T19:11:52.745 INFO:tasks.ceph.mon.a.smithi026.stderr:mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f08eb7d0700 time 2016-03-14 02:11:52.740438 2016-03-13T19:11:52.746 INFO:tasks.ceph.mon.a.smithi026.stderr:mon/Monitor.cc: 2601: FAILED assert(session) 2016-03-13T19:11:52.746 INFO:tasks.ceph.mon.a.smithi026.stderr: ceph version 10.0.5-2235-g41cf560 (41cf56062ba83f36fda5959bddd7697423b24751) 2016-03-13T19:11:52.746 INFO:tasks.ceph.mon.a.smithi026.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f08f42711e5] 2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 2: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x10cf) [0x7f08f3edc3df] 2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 3: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7f08f3ee8a3b] 2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 4: (Monitor::C_Command::_finish(int)+0x33d) [0x7f08f3f7af8d] 2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 5: (C_MonOp::finish(int)+0x79) [0x7f08f3eef029] 2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 6: (Context::complete(int)+0x9) [0x7f08f3eee0e9] 2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7f08f3ef461c] 2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 8: (PaxosService::restart()+0xdb) [0x7f08f3f1f0cb] 2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 9: (Monitor::_reset()+0x1dc) [0x7f08f3ebf76c] 2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 10: (Monitor::bootstrap()+0x2a5) [0x7f08f3ee19c5] 2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 11: (Paxos::accept_timeout()+0x18b) [0x7f08f3f0acab] 2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 12: (Context::complete(int)+0x9) [0x7f08f3eee0e9] 2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 13: (SafeTimer::timer_thread()+0x104) [0x7f08f4261af4] 2016-03-13T19:11:52.750 INFO:tasks.ceph.mon.a.smithi026.stderr: 14: (SafeTimerThread::entry()+0xd) [0x7f08f426352d] 2016-03-13T19:11:52.750 INFO:tasks.ceph.mon.a.smithi026.stderr: 15: (()+0x7dc5) [0x7f08f27bfdc5] 2016-03-13T19:11:52.750 INFO:tasks.ceph.mon.a.smithi026.stderr: 16: (clone()+0x6d) [0x7f08f108821d] 2016-03-13T19:11:52.751 INFO:tasks.ceph.mon.a.smithi026.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret
Updated by Kefu Chai about 8 years ago
- Status changed from 12 to Fix Under Review
Updated by Kefu Chai about 8 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport changed from jewel to infernalis
Updated by Nathan Cutler about 8 years ago
- Copied to Backport #15227: infernalis: assert(session) in Monitor::handle_command() if client resets the conn added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved
Actions