Project

General

Profile

Fix #8308

Long running remote operations timed out prematurely

Added by John Spray over 7 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
Backend (services)
Target version:
% Done:

0%

Source:
other
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

Two things weird here:

  • Why is a pool delete taking so long?
  • We don't seem to be correctly recognising that a long running JID is still running.
2014-05-05 15:21:16,599 - DEBUG - cthulhu RpcInterface >> delete(('61723e0f-992b-466e-9c09-9914931bc584', 'pool', 14), {})
2014-05-05 15:21:16,600 - DEBUG - cthulhu.OsdMapModifyingRequest Request._submit: mira122.front.sepia.ceph.com/ceph/[('osd pool delete', {'pool2': 'test', 'sure': '--yes-i-really-really-mean-it', 'pool': 'test'})]
2014-05-05 15:21:16,651 - INFO - cthulhu.OsdMapModifyingRequest Request 1ec232ff-25f4-48ee-91f8-276247337c0d started job 20140505152116648632
2014-05-05 15:21:16,677 - INFO - cthulhu Eventer._emit: 2014-05-05 22:21:16.677399+00:00/INFO/Started: Deleting pool 'test'
2014-05-05 15:21:16,677 - DEBUG - cthulhu RpcInterface << delete

...

2014-05-05 15:22:31,424 - DEBUG - cthulhu RequestCollection.tick: 1 JIDs underway
2014-05-05 15:22:31,425 - ERROR - cthulhu Request 1ec232ff-25f4-48ee-91f8-276247337c0d JID 20140505152116648632 stale: now=2014-05-05 22:22:31.425014+00:00, alive_at=2014-05-05 22:21:16.651791+00:00
2014-05-05 15:22:31,425 - INFO - cthulhu.OsdMapModifyingRequest Request 1ec232ff-25f4-48ee-91f8-276247337c0d completed with error=True (Lost contact)
2014-05-05 15:22:32,011 - DEBUG - cthulhu _run.ev: mira122.front.sepia.ceph.com/tag=20140505152116648632
2014-05-05 15:22:32,012 - DEBUG - cthulhu _run.ev: mira122.front.sepia.ceph.com/tag=salt/job/20140505152116648632/ret/mira122.front.sepia.ceph.com
2014-05-05 15:22:32,012 - DEBUG - cthulhu on_completion: jid=20140505152116648632 data={'fun_args': ['61723e0f-992b-466e-9c09-9914931bc584', 'ceph', [['osd pool delete', {'pool2': 'test', 'sure': '--yes-i-really-really-mean-it', 'pool': 'test', 'format': 'json'}]]], 'jid': '20140505152116648632', 'return': {'versions': {'osd_map': 1918, 'pg_summary': 'd8409ecfd83bcc2d87d082f35ec46620', 'mds_map': 1, 'mon_status': 16, 'health': '82f135baf8b089e517773899ffaa64e8', 'mon_map': 1, 'config': '3a961e6f7015443076446f6a1abf593d'}, 'results': [None], 'error': False, 'err_outbuf': '', 'fsid': '61723e0f-992b-466e-9c09-9914931bc584', 'err_outs': ''}, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2014-05-05_15:22:32.010850', 'fun': 'ceph.rados_commands', 'id': 'mira122.front.sepia.ceph.com'}
2014-05-05 15:22:32,012 - ERROR - cthulhu Exception handling message with tag salt/job/20140505152116648632/ret/mira122.front.sepia.ceph.com
Traceback (most recent call last):
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 290, in _run
    self.on_completion(data)
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
    return func(*args, **kwargs)
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 428, in on_completion
    self._requests.on_completion(data)
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/request_collection.py", line 178, in on_completion
    request = self.get_by_jid(jid)
  File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/request_collection.py", line 40, in get_by_jid
    return self._by_jid[jid]
KeyError: '20140505152116648632'
2014-05-05 15:22:32,013 - DEBUG - cthulhu Message content: {'fun_args': ['61723e0f-992b-466e-9c09-9914931bc584', 'ceph', [['osd pool delete', {'pool2': 'test', 'sure': '--yes-i-really-really-mean-it', 'pool': 'test', 'format': 'json'}]]], 'jid': '20140505152116648632', 'return': {'versions': {'osd_map': 1918, 'pg_summary': 'd8409ecfd83bcc2d87d082f35ec46620', 'mds_map': 1, 'mon_status': 16, 'health': '82f135baf8b089e517773899ffaa64e8', 'mon_map': 1, 'config': '3a961e6f7015443076446f6a1abf593d'}, 'results': [None], 'error': False, 'err_outbuf': '', 'fsid': '61723e0f-992b-466e-9c09-9914931bc584', 'err_outs': ''}, 'retcode': 0, 'success': True, 'cmd': '_return', '_stamp': '2014-05-05_15:22:32.010850', 'fun': 'ceph.rados_commands', 'id': 'mira122.front.sepia.ceph.com'}

Associated revisions

Revision 17441362 (diff)
Added by John Spray over 7 years ago

cthulhu: Fix long running job handling

This was broken in a couple of ways: * The saltutil.running message was never getting sent * The branch for handling the responses was never
getting called because it came after the check
for FSID.

Fixes: #8308

History

#1 Updated by John Spray over 7 years ago

  • Target version changed from v1.2-dev9 to v1.2-dev10

#2 Updated by John Spray over 7 years ago

  • Status changed from New to In Progress

#3 Updated by John Spray over 7 years ago

  • Status changed from In Progress to Fix Under Review

#4 Updated by Christina Meno over 7 years ago

  • Status changed from Fix Under Review to Resolved

Also available in: Atom PDF