Project

General

Profile

Actions

Bug #15113

closed

assert(session) in Monitor::handle_command() if client resets the conn

Added by Kefu Chai about 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Monitor
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
infernalis
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

see /a/kchai-2016-03-11_19:18:43-rados-wip-kefu-testing---basic-smithi/54873/remote/smithi029/

2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183
mon/Monitor.cc: 2599: FAILED assert(session)

 ceph version 10.0.4-2138-g45b885e (45b885ec9e3458bd6af5d84089194cb91a26568f)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f91f14f0e65]
 2: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x10cf) [0x7f91f116fcff]
 3: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7f91f117c22b]
 4: (Monitor::C_RetryMessage::_finish(int)+0x74) [0x7f91f119e1d4]
 5: (C_MonOp::finish(int)+0x79) [0x7f91f1183019]
 6: (Context::complete(int)+0x9) [0x7f91f11820d9]
 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7f91f11885cc]
 8: (Monitor::resend_routed_requests()+0x993) [0x7f91f11741e3]
 9: (Monitor::finish_election()+0x7b) [0x7f91f117446b]
 10: (Monitor::win_election(unsigned int, std::set<int, std::less<int>, std::allocator<int> >&, unsigned long, MonCommand const*, int, std::set<int, std::less<int>, std::allocator<int> > const*)+0x285) [0x7f91f1174ad5]
 11: (Elector::victory()+0x40e) [0x7f91f12993be]
 12: (Elector::expire()+0x89) [0x7f91f129a6f9]
 13: (Context::complete(int)+0x9) [0x7f91f11820d9]
 14: (SafeTimer::timer_thread()+0x104) [0x7f91f14e1774]
 15: (SafeTimerThread::entry()+0xd) [0x7f91f14e31ad]
 16: (()+0x7dc5) [0x7f91efa5fdc5]
 17: (clone()+0x6d) [0x7f91ee32821d]
  -800> 2016-03-12 07:47:14.072707 7f91e9a36700 10 mon.a@1(peon).paxosservice(osdmap 207..387) dispatch 0x7f91fc9b2800 mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0) v1 from client.4472 172.21.15.14:0/4107930439 con 0x7f91fdc63800
...
  -771> 2016-03-12 07:47:14.072932 7f91e9a36700 10 mon.a@1(peon) e1 forward_request 221 request log(1 entries from seq 235 at 2016-03-12 07:47:14.072681) v1 features 0
...
   -83> 2016-03-12 07:47:41.423231 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24
   -82> 2016-03-12 07:47:41.423263 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed
   -81> 2016-03-12 07:47:41.423270 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed
   -80> 2016-03-12 07:47:41.423287 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing
   -79> 2016-03-12 07:47:41.423294 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop

...
   -72> 2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439
   -71> 2016-03-12 07:47:41.423529 7f91e9235700  1 Event(0x7f91fc977448 owner=140264666785536 nevent=5000 time_id=64).wakeup
   -70> 2016-03-12 07:47:41.423575 7f91e9235700  1 Event(0x7f91fc976f48 owner=140264650000128 nevent=5000 time_id=43).wakeup
   -69> 2016-03-12 07:47:41.458297 7f91e8233700 10 mon.a@1(electing) e1 join_election
...
    -7> 2016-03-12 07:47:41.485128 7f91e8233700 10 mon.a@1(leader) e1 resend_routed_requests
    -6> 2016-03-12 07:47:41.485131 7f91e8233700 10 mon.a@1(leader) e1  requeue for self tid 220
    -5> 2016-03-12 07:47:41.485135 7f91e8233700  5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485135, event: retry routed request, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi014-18648-73"} v 0)
    -4> 2016-03-12 07:47:41.485148 7f91e8233700 10 mon.a@1(leader) e1  requeue for self tid 221
    -3> 2016-03-12 07:47:41.485152 7f91e8233700  5 -- op tracker -- seq: 5075, time: 2016-03-12 07:47:41.485152, event: retry routed request, op: log(1 entries from seq 235 at 2016-03-12 07:47:14.072681)
    -2> 2016-03-12 07:47:41.485164 7f91e8233700  5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485164, event: callback finished, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smit
hi014-18648-73"} v 0)
    -1> 2016-03-12 07:47:41.485171 7f91e8233700  5 -- op tracker -- seq: 5074, time: 2016-03-12 07:47:41.485171, event: mon:dispatch_op, op: mon_command({"prefix": "osd tier set-overlay", "pool": "test-rados-api-smithi014-18648-56", "overlaypool": "test-rados-api-smithi
014-18648-73"} v 0)
     0> 2016-03-12 07:47:41.489053 7f91e8233700 -1 mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f91e8233700 time 2016-03-12 07:47:41.485183

the session of the routed request should have been removed in Monitor::ms_handle_reset(Connection *con), just like before:

2016-03-12 07:43:07.956767 7f91e8a34700 10 mon.a@1(peon) e1 ms_handle_reset 0x7f91fd14f800 172.21.15.14:0/4107930439
2016-03-12 07:43:07.956777 7f91e8a34700 10 mon.a@1(peon) e1 reset/close on session client.4472 172.21.15.14:0/4107930439
2016-03-12 07:43:07.956785 7f91e8a34700 10 mon.a@1(peon) e1 remove_session 0x7f91fd141640 client.4472 172.21.15.14:0/4107930439

but what we have is:

2016-03-12 07:47:41.423231 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_bulk peer close file descriptor 24
2016-03-12 07:47:41.423263 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).read_until read failed
2016-03-12 07:47:41.423270 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).process read tag failed
2016-03-12 07:47:41.423287 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1).fault on lossy channel, failing
2016-03-12 07:47:41.423294 7f91e9a36700  1 -- 172.21.15.29:6789/0 >> 172.21.15.14:0/4107930439 conn(0x7f91fdc63800 sd=24 :6789 s=STATE_OPEN pgs=21 cs=1 l=1)._stop
2016-03-12 07:47:41.423503 7f91e9a36700 10 mon.a@1(electing) e1 ms_handle_reset 0x7f91fdc63800 172.21.15.14:0/4107930439

so the session was not removed, that's why the request was handled after the connection is reset. this is a race condition:

  1. ______________________ SafeTimer::timer_thread(), with mon_lock:
  2. ______________________ elector: in win_election(), it resend_routed_requests(), and collects the routed requests
  3. msgr: in ms_handle_reset(), it reset the session
  4. msgr: it waits for the lock
  5. ______________________ elector: in win_election(), it handle_command(), but the session is reset, hence it panics.
  6. msgr: remove session, and erase related requests from Monitor::routed_requests.

Related issues 1 (0 open1 closed)

Copied to Ceph - Backport #15227: infernalis: assert(session) in Monitor::handle_command() if client resets the connRejectedActions
Actions #1

Updated by Kefu Chai about 8 years ago

  • Description updated (diff)
Actions #2

Updated by Kefu Chai about 8 years ago

  • Description updated (diff)
Actions #3

Updated by Yuri Weinstein about 8 years ago

Run: http://pulpito.ceph.com/teuthology-2016-03-12_22:00:02-rados-jewel-distro-basic-smithi/
Job: 57270
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2016-03-12_22:00:02-rados-jewel-distro-basic-smithi/57270/teuthology.log

quorum_status'
2016-03-13T19:11:52.745 INFO:tasks.ceph.mon.a.smithi026.stderr:mon/Monitor.cc: In function 'void Monitor::handle_command(MonOpRequestRef)' thread 7f08eb7d0700 time 2016-03-14 02:11:52.740438
2016-03-13T19:11:52.746 INFO:tasks.ceph.mon.a.smithi026.stderr:mon/Monitor.cc: 2601: FAILED assert(session)
2016-03-13T19:11:52.746 INFO:tasks.ceph.mon.a.smithi026.stderr: ceph version 10.0.5-2235-g41cf560 (41cf56062ba83f36fda5959bddd7697423b24751)
2016-03-13T19:11:52.746 INFO:tasks.ceph.mon.a.smithi026.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f08f42711e5]
2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 2: (Monitor::handle_command(std::shared_ptr<MonOpRequest>)+0x10cf) [0x7f08f3edc3df]
2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 3: (Monitor::dispatch_op(std::shared_ptr<MonOpRequest>)+0x33b) [0x7f08f3ee8a3b]
2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 4: (Monitor::C_Command::_finish(int)+0x33d) [0x7f08f3f7af8d]
2016-03-13T19:11:52.747 INFO:tasks.ceph.mon.a.smithi026.stderr: 5: (C_MonOp::finish(int)+0x79) [0x7f08f3eef029]
2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 6: (Context::complete(int)+0x9) [0x7f08f3eee0e9]
2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 7: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7f08f3ef461c]
2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 8: (PaxosService::restart()+0xdb) [0x7f08f3f1f0cb]
2016-03-13T19:11:52.748 INFO:tasks.ceph.mon.a.smithi026.stderr: 9: (Monitor::_reset()+0x1dc) [0x7f08f3ebf76c]
2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 10: (Monitor::bootstrap()+0x2a5) [0x7f08f3ee19c5]
2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 11: (Paxos::accept_timeout()+0x18b) [0x7f08f3f0acab]
2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 12: (Context::complete(int)+0x9) [0x7f08f3eee0e9]
2016-03-13T19:11:52.749 INFO:tasks.ceph.mon.a.smithi026.stderr: 13: (SafeTimer::timer_thread()+0x104) [0x7f08f4261af4]
2016-03-13T19:11:52.750 INFO:tasks.ceph.mon.a.smithi026.stderr: 14: (SafeTimerThread::entry()+0xd) [0x7f08f426352d]
2016-03-13T19:11:52.750 INFO:tasks.ceph.mon.a.smithi026.stderr: 15: (()+0x7dc5) [0x7f08f27bfdc5]
2016-03-13T19:11:52.750 INFO:tasks.ceph.mon.a.smithi026.stderr: 16: (clone()+0x6d) [0x7f08f108821d]
2016-03-13T19:11:52.751 INFO:tasks.ceph.mon.a.smithi026.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret 
Actions #4

Updated by Kefu Chai about 8 years ago

  • Backport set to jewel
Actions #5

Updated by Kefu Chai about 8 years ago

  • Status changed from 12 to Fix Under Review
Actions #6

Updated by Kefu Chai about 8 years ago

  • Description updated (diff)
Actions #7

Updated by Kefu Chai about 8 years ago

  • Status changed from Fix Under Review to Pending Backport
  • Backport changed from jewel to infernalis
Actions #8

Updated by Nathan Cutler about 8 years ago

  • Copied to Backport #15227: infernalis: assert(session) in Monitor::handle_command() if client resets the conn added
Actions #9

Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved
Actions

Also available in: Atom PDF