Bug #38705

closed

mgr: segv in module thread, PyArg_ParseTuple

Added by Sage Weil about 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

     0> 2019-03-12 15:45:08.094 7f18522b4700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f18522b4700 thread_name:prometheus

 ceph version 14.1.0-589-g96939c1 (96939c10eb6b3296161d2009da58061072d2a704) nautilus (rc)
 1: (()+0x12890) [0x7f18691e1890]
 2: (()+0x1cfca2) [0x7f186982dca2]
 3: (()+0x1d2125) [0x7f1869830125]
 4: (PyArg_ParseTuple()+0x86) [0x7f18698305d6]
 5: (()+0x14e994) [0x55f689b12994]
 6: (PyEval_EvalFrameEx()+0x8010) [0x7f186970d1d0]
 7: (PyEval_EvalCodeEx()+0x7d8) [0x7f186983d278]
 8: (PyEval_EvalFrameEx()+0x5bf6) [0x7f186970adb6]
 9: (PyEval_EvalCodeEx()+0x7d8) [0x7f186983d278]
 10: (()+0x1645f9) [0x7f18697c25f9]
 11: (PyObject_Call()+0x43) [0x7f18696b2333]
 12: (()+0x1abd1c) [0x7f1869809d1c]
 13: (PyObject_Call()+0x43) [0x7f18696b2333]
 14: (PyObject_CallMethod()+0xc8) [0x7f18697d6c78]
 15: (PyModuleRunner::serve()+0x62) [0x55f689b957f2]
 16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1cf) [0x55f689b95e9f]
 17: (()+0x76db) [0x7f18691d66db]
 18: (clone()+0x3f) [0x7f18683b788f]

/a/sage-2019-03-12_15:01:18-rados-wip-sage3-testing-2019-03-12-0708-distro-basic-smithi/3713081
Actions #1

Updated by Sage Weil about 5 years ago

Lots of these failures. The module varies (I've seen dashboard and prometheus so far).

Actions #2

Updated by Sage Weil about 5 years ago

These appear to happen during standby. Also, I see an ignored monmap message:

   -50> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr handle_mgr_map active in map: 0 active is 6600
   -49> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] Starting modules in standby mode
   -48> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'balancer' because it does not implement a standby mode
   -47> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'crash' because it does not implement a standby mode
   -46> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'devicehealth' because it does not implement a standby mode
   -45> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'orchestrator_cli' because it does not implement a standby mode
   -44> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'progress' because it does not implement a standby mode
   -43> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] starting module prometheus
   -42> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'restful' because it does not implement a standby mode
   -41> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'selftest' because it does not implement a standby mode
   -40> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'status' because it does not implement a standby mode
   -39> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr[py] skipping module 'volumes' because it does not implement a standby mode
   -38> 2019-03-12 15:45:08.094 7f185f87c700 20 mgr Gil Switched to new thread state 0x55f68fc0e000
   -37> 2019-03-12 15:45:08.094 7f185f07b700  4 mgrc handle_mgr_map Got map version 123
   -36> 2019-03-12 15:45:08.094 7f185f07b700  4 mgrc handle_mgr_map Active mgr is now [v2:172.21.15.201:6800/14793,v1:172.21.15.201:6801/14793]
   -35> 2019-03-12 15:45:08.094 7f185f07b700  4 mgrc reconnect Starting new session with [v2:172.21.15.201:6800/14793,v1:172.21.15.201:6801/14793]
   -34> 2019-03-12 15:45:08.094 7f185f07b700  1 --2- 172.21.15.201:0/14794 >> [v2:172.21.15.201:6800/14793,v1:172.21.15.201:6801/14793] conn(0x55f68fc46000 0x55f68fc4e000 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
   -33> 2019-03-12 15:45:08.094 7f1863083700  1 -- 172.21.15.201:0/14794 >> [v2:172.21.15.201:6800/14793,v1:172.21.15.201:6801/14793] conn(0x55f68fc46000 msgr2=0x55f68fc4e000 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v2:172.21.15.201:6800/14793
   -32> 2019-03-12 15:45:08.094 7f1863083700  1 --2- 172.21.15.201:0/14794 >> [v2:172.21.15.201:6800/14793,v1:172.21.15.201:6801/14793] conn(0x55f68fc46000 0x55f68fc4e000 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0)._fault waiting 0.200000
   -31> 2019-03-12 15:45:08.094 7f185f87c700  1 mgr load Constructed class from module: prometheus
   -30> 2019-03-12 15:45:08.094 7f185f87c700 20 mgr ~Gil Destroying new thread state 0x55f68fc0e000
   -29> 2019-03-12 15:45:08.094 7f185f87c700  4 mgr operator() Starting thread for prometheus
   -28> 2019-03-12 15:45:08.094 7f18522b4700  4 mgr entry Entering thread for prometheus
   -27> 2019-03-12 15:45:08.094 7f18522b4700 20 mgr Gil Switched to new thread state 0x55f68fc0e0b0
   -26> 2019-03-12 15:45:08.094 7f185f07b700  1 -- 172.21.15.201:0/14794 --> [v2:172.21.15.201:6800/14793,v1:172.21.15.201:6801/14793] -- mgropen(unknown.z) v3 -- 0x55f68fc56000 con 0x55f68fc46000
   -25> 2019-03-12 15:45:08.094 7f185f07b700  1 client.0 ms_handle_refused on v2:172.21.15.201:6800/14793
   -24> 2019-03-12 15:45:08.094 7f185f07b700  1 client.0 ms_handle_refused on v2:172.21.15.201:6800/14793
   -23> 2019-03-12 15:45:08.094 7f185f07b700  1 -- 172.21.15.201:0/14794 <== mon.0 v2:172.21.15.17:3300/0 4 ==== mon_map magic: 0 v1 ==== 377+0+0 (crc 0 0 0) 0x55f68b678600 con 0x55f68c204900
   -22> 2019-03-12 15:45:08.094 7f185f07b700 10 monclient: handle_monmap mon_map magic: 0 v1
   -21> 2019-03-12 15:45:08.094 7f185f07b700 10 monclient:  got monmap 1 from mon.a (according to old e1)
   -20> 2019-03-12 15:45:08.094 7f185f07b700 10 monclient: dump:
epoch 1
fsid a419c130-4869-44c6-a9a2-9aafcae98e38
last_changed 2019-03-12 15:38:31.242540
created 2019-03-12 15:38:31.242540
min_mon_release 14 (nautilus)
0: [v2:172.21.15.17:3300/0,v1:172.21.15.17:6789/0] mon.a
1: [v2:172.21.15.201:3300/0,v1:172.21.15.201:6789/0] mon.b
2: [v2:172.21.15.17:3301/0,v1:172.21.15.17:6790/0] mon.c

   -19> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr ms_dispatch standby mon_map magic: 0 v1
   -18> 2019-03-12 15:45:08.094 7f185f07b700  0 ms_deliver_dispatch: unhandled message 0x55f68b678600 mon_map magic: 0 v1 from mon.0 v2:172.21.15.17:3300/0
   -17> 2019-03-12 15:45:08.094 7f185f07b700  1 -- 172.21.15.201:0/14794 <== mon.0 v2:172.21.15.17:3300/0 5 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (crc 0 0 0) 0x55f68b48af40 con 0x55f68c204900
   -16> 2019-03-12 15:45:08.094 7f185f07b700 10 cephx client: 0x55f68b4c2b60 handle_response ret = 0
   -15> 2019-03-12 15:45:08.094 7f185f07b700 10 cephx client:  get_rotating_key
   -14> 2019-03-12 15:45:08.094 7f185f07b700 10 auth: dump_rotating:
   -13> 2019-03-12 15:45:08.094 7f185f07b700 10 auth:  id 1 AQCM0odcw+vbLRAA1czv1D9JvAG55JVaKaLblg== expires 2019-03-12 16:38:52.769386
   -12> 2019-03-12 15:45:08.094 7f185f07b700 10 auth:  id 2 AQCM0odc5vrbLRAAyiUW3Qfbcdtng/rlyzJqOA== expires 2019-03-12 17:38:52.769386
   -11> 2019-03-12 15:45:08.094 7f185f07b700 10 auth:  id 3 AQCM0odc7wTcLRAAdP17TON2ZXs/fHP2KqVuEw== expires 2019-03-12 18:38:52.769386
   -10> 2019-03-12 15:45:08.094 7f185f07b700 10 monclient: _finish_auth 0
    -9> 2019-03-12 15:45:08.094 7f185f07b700 10 cephx: validate_tickets want 55 have 55 need 0
    -8> 2019-03-12 15:45:08.094 7f185f07b700 20 cephx client: need_tickets: want=55 have=55 need=0
    -7> 2019-03-12 15:45:08.094 7f185f07b700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-03-12 15:44:38.098447)
    -6> 2019-03-12 15:45:08.094 7f185f07b700 10 auth: dump_rotating:
    -5> 2019-03-12 15:45:08.094 7f185f07b700 10 auth:  id 1 AQCM0odcw+vbLRAA1czv1D9JvAG55JVaKaLblg== expires 2019-03-12 16:38:52.769386
    -4> 2019-03-12 15:45:08.094 7f185f07b700 10 auth:  id 2 AQCM0odc5vrbLRAAyiUW3Qfbcdtng/rlyzJqOA== expires 2019-03-12 17:38:52.769386
    -3> 2019-03-12 15:45:08.094 7f185f07b700 10 auth:  id 3 AQCM0odc7wTcLRAAdP17TON2ZXs/fHP2KqVuEw== expires 2019-03-12 18:38:52.769386
    -2> 2019-03-12 15:45:08.094 7f185f07b700  1 -- 172.21.15.201:0/14794 <== mon.0 v2:172.21.15.17:3300/0 6 ==== osd_map(19..19 src has 1..19) v4 ==== 3829+0+0 (crc 0 0 0) 0x55f68c16af00 con 0x55f68c204900
    -1> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr ms_dispatch standby osd_map(19..19 src has 1..19) v4
     0> 2019-03-12 15:45:08.094 7f18522b4700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f18522b4700 thread_name:prometheus

Another one, this time in the dashboard module:

   -43> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr handle_mgr_map active in map: 0 active is 4861
   -42> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] Starting modules in standby mode
   -41> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'balancer' because it does not implement a standby mode
   -40> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'crash' because it does not implement a standby mode
   -39> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] starting module dashboard
   -38> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'devicehealth' because it does not implement a standby mode
   -37> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'orchestrator_cli' because it does not implement a standby mode
   -36> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'progress' because it does not implement a standby mode
   -35> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'restful' because it does not implement a standby mode
   -34> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'status' because it does not implement a standby mode
   -33> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr[py] skipping module 'volumes' because it does not implement a standby mode
   -32> 2019-03-12 16:29:27.847 7fa2aead2700 20 mgr Gil Switched to new thread state 0x5b24000
   -31> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgrc handle_mgr_map Got map version 21
   -30> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgrc handle_mgr_map Active mgr is now 
   -29> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgrc reconnect No active mgr available yet
   -28> 2019-03-12 16:29:27.847 7fa2ae2d1700  1 -- 172.21.15.3:0/16587 <== mon.0 v2:172.21.15.3:3300/0 4 ==== mon_map magic: 0 v1 ==== 377+0+0 (crc 0 0 0) 0x2ea4400 con 0x2e8ad80
   -27> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 monclient: handle_monmap mon_map magic: 0 v1
   -26> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 monclient:  got monmap 1 from mon.a (according to old e1)
   -25> 2019-03-12 16:29:27.847 7fa2aead2700  1 mgr load Constructed class from module: dashboard
   -24> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 monclient: dump:
epoch 1
fsid c7e84a6a-22ca-4bdc-bdaf-076e4bf9bbce
last_changed 2019-03-12 16:27:13.323545
created 2019-03-12 16:27:13.323545
min_mon_release 14 (nautilus)
0: [v2:172.21.15.3:3300/0,v1:172.21.15.3:6789/0] mon.a
1: [v2:172.21.15.90:3300/0,v1:172.21.15.90:6789/0] mon.b
2: [v2:172.21.15.3:3301/0,v1:172.21.15.3:6790/0] mon.c

   -23> 2019-03-12 16:29:27.847 7fa2aead2700 20 mgr ~Gil Destroying new thread state 0x5b24000
   -22> 2019-03-12 16:29:27.847 7fa2aead2700  4 mgr operator() Starting thread for dashboard
   -21> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr ms_dispatch standby mon_map magic: 0 v1
   -20> 2019-03-12 16:29:27.847 7fa2ae2d1700  0 ms_deliver_dispatch: unhandled message 0x2ea4400 mon_map magic: 0 v1 from mon.0 v2:172.21.15.3:3300/0
   -19> 2019-03-12 16:29:27.847 7fa2ae2d1700  1 -- 172.21.15.3:0/16587 <== mon.0 v2:172.21.15.3:3300/0 5 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 194+0+0 (crc 0 0 0) 0x215a1c0 con 0x2e8ad80
   -18> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 cephx client: 0x214cb60 handle_response ret = 0
   -17> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 cephx client:  get_rotating_key
   -16> 2019-03-12 16:29:27.847 7fa2a279a700  4 mgr entry Entering thread for dashboard
   -15> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth: dump_rotating:
   -14> 2019-03-12 16:29:27.847 7fa2a279a700 20 mgr Gil Switched to new thread state 0x5b240b0
   -13> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth:  id 1 AQD23YdcLp7iDhAAGCNvwQxZcHbXFyhrX1lP/Q== expires 2019-03-12 17:27:34.249727
   -12> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth:  id 2 AQD23Ydc4bviDhAA3m+5Go4DxPfHndQvL6dMeQ== expires 2019-03-12 18:27:34.249727
   -11> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth:  id 3 AQD23YdcQtriDhAAKTUdF5Q+sOkln5vFycVIeQ== expires 2019-03-12 19:27:34.249727
   -10> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 monclient: _finish_auth 0
    -9> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 cephx: validate_tickets want 55 have 55 need 0
    -8> 2019-03-12 16:29:27.847 7fa2ae2d1700 20 cephx client: need_tickets: want=55 have=55 need=0
    -7> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2019-03-12 16:28:57.851350)
    -6> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth: dump_rotating:
    -5> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth:  id 1 AQD23YdcLp7iDhAAGCNvwQxZcHbXFyhrX1lP/Q== expires 2019-03-12 17:27:34.249727
    -4> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth:  id 2 AQD23Ydc4bviDhAA3m+5Go4DxPfHndQvL6dMeQ== expires 2019-03-12 18:27:34.249727
    -3> 2019-03-12 16:29:27.847 7fa2ae2d1700 10 auth:  id 3 AQD23YdcQtriDhAAKTUdF5Q+sOkln5vFycVIeQ== expires 2019-03-12 19:27:34.249727
    -2> 2019-03-12 16:29:27.847 7fa2ae2d1700  1 -- 172.21.15.3:0/16587 <== mon.0 v2:172.21.15.3:3300/0 6 ==== osd_map(29..29 src has 1..29) v4 ==== 6037+0+0 (crc 0 0 0) 0x2df0780 con 0x2e8ad80
    -1> 2019-03-12 16:29:27.847 7fa2ae2d1700  4 mgr ms_dispatch standby osd_map(29..29 src has 1..29) v4
     0> 2019-03-12 16:29:27.847 7fa2a279a700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa2a279a700 thread_name:dashboard
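
In both dumps above, the crashing thread logs only two events before the signal: entering the module thread and switching to a new GIL thread state. A minimal sketch for pulling out just the crashed thread's events from a pasted recent-events dump follows. It is not part of Ceph; the function name `crashed_thread_events`, the regular expressions, and the `SAMPLE` excerpt are assumptions made for illustration, and the line layout is inferred only from the dumps in this ticket.

```python
import re

# Assumed layout of a recent-events line, inferred from the dumps in this ticket:
#   "   -28> 2019-03-12 15:45:08.094 7f18522b4700  4 mgr entry Entering thread ..."
LINE_RE = re.compile(
    r"^\s*(-?\d+)>\s+(\d{4}-\d{2}-\d{2} [\d:.]+)\s+([0-9a-f]+)\s+(-?\d+)\s+(.*)$"
)
# The line that follows the signal report and names the crashed thread.
CRASH_RE = re.compile(r"^\s*in thread ([0-9a-f]+) thread_name:(\S+)")


def crashed_thread_events(log_text):
    """Return (thread_name, [(event_index, message), ...]) for the crashed thread.

    Hypothetical helper: returns (None, []) if no "in thread ..." line is found.
    """
    events = []
    crashed = None
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m:
            # Keep the event index, the thread id, and the message text.
            events.append((int(m.group(1)), m.group(3), m.group(5)))
            continue
        c = CRASH_RE.match(line)
        if c:
            crashed = (c.group(1), c.group(2))
    if crashed is None:
        return None, []
    thread_id, thread_name = crashed
    return thread_name, [(idx, msg) for idx, tid, msg in events if tid == thread_id]


# Excerpt copied from the first dump in this ticket.
SAMPLE = """\
   -29> 2019-03-12 15:45:08.094 7f185f87c700  4 mgr operator() Starting thread for prometheus
   -28> 2019-03-12 15:45:08.094 7f18522b4700  4 mgr entry Entering thread for prometheus
   -27> 2019-03-12 15:45:08.094 7f18522b4700 20 mgr Gil Switched to new thread state 0x55f68fc0e0b0
    -1> 2019-03-12 15:45:08.094 7f185f07b700  4 mgr ms_dispatch standby osd_map(19..19 src has 1..19) v4
     0> 2019-03-12 15:45:08.094 7f18522b4700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f18522b4700 thread_name:prometheus
"""

if __name__ == "__main__":
    name, thread_events = crashed_thread_events(SAMPLE)
    print("crashed module thread:", name)
    for idx, msg in thread_events:
        print(idx, msg)
```

Filtering by thread id makes it easier to see that the module thread does nothing else between starting and crashing, while the interleaved `ms_dispatch standby` lines come from a different thread.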

Actions #3

Updated by Sage Weil about 5 years ago

  • Status changed from 12 to Fix Under Review
Actions #4

Updated by Sage Weil about 5 years ago

  • Status changed from Fix Under Review to Resolved