Fix #7800

cthulhu becomes unresponsive to RPCs

Added by John Spray about 10 years ago. Updated almost 10 years ago.

Status: Resolved
Priority: Normal
Assignee:
Category: Backend (services)
Target version:
% Done: 0%
Source: other
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

# tcpdump -i lo port 5050
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on lo, link-type EN10MB (Ethernet), capture size 65535 bytes
09:46:31.353201 IP localhost.38780 > localhost.mmcc: Flags [S], seq 472801759, win 43690, options [mss 65495,sackOK,TS val 643105537 ecr 0,nop,wscale 7], length 0
09:46:31.353213 IP localhost.mmcc > localhost.38780: Flags [S.], seq 1364088870, ack 472801760, win 43690, options [mss 65495,sackOK,TS val 643105537 ecr 643105537,nop,wscale 7], length 0
09:46:31.353224 IP localhost.38780 > localhost.mmcc: Flags [.], ack 1, win 342, options [nop,nop,TS val 643105537 ecr 643105537], length 0
09:46:31.353345 IP localhost.mmcc > localhost.38780: Flags [P.], seq 1:11, ack 1, win 342, options [nop,nop,TS val 643105537 ecr 643105537], length 10
09:46:31.353357 IP localhost.38780 > localhost.mmcc: Flags [.], ack 11, win 342, options [nop,nop,TS val 643105537 ecr 643105537], length 0
09:46:31.353630 IP localhost.38780 > localhost.mmcc: Flags [P.], seq 1:13, ack 11, win 342, options [nop,nop,TS val 643105538 ecr 643105537], length 12
09:46:31.353639 IP localhost.mmcc > localhost.38780: Flags [.], ack 13, win 342, options [nop,nop,TS val 643105538 ecr 643105538], length 0
09:46:31.353707 IP localhost.mmcc > localhost.38780: Flags [P.], seq 11:13, ack 13, win 342, options [nop,nop,TS val 643105538 ecr 643105538], length 2
09:46:31.353743 IP localhost.mmcc > localhost.38780: Flags [P.], seq 13:15, ack 13, win 342, options [nop,nop,TS val 643105538 ecr 643105538], length 2
09:46:31.353790 IP localhost.38780 > localhost.mmcc: Flags [.], ack 15, win 342, options [nop,nop,TS val 643105538 ecr 643105538], length 0
09:46:31.353814 IP localhost.38780 > localhost.mmcc: Flags [P.], seq 13:89, ack 15, win 342, options [nop,nop,TS val 643105538 ecr 643105538], length 76
09:46:31.393687 IP localhost.mmcc > localhost.38780: Flags [.], ack 89, win 342, options [nop,nop,TS val 643105548 ecr 643105538], length 0

09:47:01.366974 IP localhost.38780 > localhost.mmcc: Flags [F.], seq 89, ack 15, win 342, options [nop,nop,TS val 643113041 ecr 643105548], length 0
09:47:01.367132 IP localhost.mmcc > localhost.38780: Flags [F.], seq 15, ack 90, win 342, options [nop,nop,TS val 643113041 ecr 643113041], length 0
09:47:01.367162 IP localhost.38780 > localhost.mmcc: Flags [.], ack 16, win 342, options [nop,nop,TS val 643113041 ecr 643113041], length 0

History

#1 Updated by John Spray about 10 years ago

This is what a successful RPC looks like after a clean restart:

read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(5, "2014-03-20 10:01:46,437 - DEBUG "..., 80) = 80
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(5, "2014-03-20 10:01:47,037 - DEBUG "..., 118) = 118
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(5, "2014-03-20 10:01:47,037 - DEBUG "..., 70) = 70
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(5, "2014-03-20 10:01:47,037 - DEBUG "..., 72) = 72
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8
write(10, "\1\0\0\0\0\0\0\0", 8)        = 8
read(12, "\1\0\0\0\0\0\0\0", 8)         = 8

#2 Updated by John Spray about 10 years ago

I think this may well be postgres, because we're not using a gevent-aware interface to it on that server.

#3 Updated by Dan Mick about 10 years ago

I was thinking that installing psycogreen was enough, and thought I did; maybe one or the other isn't true. I guess you have to call the monkeypatch routine from somewhere.

#4 Updated by John Spray about 10 years ago

Installing it doesn't do anything; you have to call patch_psycopg somewhere (https://pypi.python.org/pypi/psycogreen/1.0).
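For illustration, a minimal sketch of what that call could look like, assuming psycogreen's gevent module (psycogreen.gevent.patch_psycopg). The wrapper name enable_green_psycopg and the idea of calling it once at service startup are hypothetical, not taken from the cthulhu code. It falls back gracefully if psycogreen, psycopg2 or gevent is not importable.

```python
def enable_green_psycopg():
    """Try to make psycopg2 yield to gevent while waiting on the database.

    Hypothetical helper: returns True if the patch was applied, False if
    psycogreen (or psycopg2/gevent, which it imports) is not available.
    """
    try:
        # Importing the module alone changes nothing; the patch function
        # has to be called explicitly.
        from psycogreen.gevent import patch_psycopg
    except ImportError:
        return False
    # Registers a gevent-based wait callback with psycopg2, so a slow
    # query blocks only the calling greenlet rather than the whole process.
    patch_psycopg()
    return True


if __name__ == "__main__":
    # Assumption: this runs early in startup, before any DB connection is opened.
    print("psycopg2 patched for gevent: %s" % enable_green_psycopg())
```

If this returns False on the server, DB calls would still be blocking, which would fit the symptom of RPCs going unanswered while a query is in flight.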

#5 Updated by John Spray about 10 years ago

As with other stuff, I'm loath to spend a huge amount of time on this until #7088 is done: the database I/O is fairly critical to identifying what "normal" looks like.

#6 Updated by John Spray about 10 years ago

  • Target version changed from v1.2-dev6 to v1.2 Backlog

In "wait and see" mode on this issue since updating 0MQ, fixing the memory leak, and migrating to the final DB config.

#7 Updated by Yan-Fa Li about 10 years ago

Not sure it's related, but I'm getting random 500s.
This is from httpd_error.log:
[Thu Mar 27 17:28:58 2014] [error] Traceback (most recent call last):
[Thu Mar 27 17:28:58 2014] [error] File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
[Thu Mar 27 17:28:58 2014] [error] result = self._run(*self.args, **self.kwargs)
[Thu Mar 27 17:28:58 2014] [error] AttributeError: 'Greenlet' object has no attribute '_run'
[Thu Mar 27 17:28:58 2014] [error] <Greenlet at 0x7f5f624de910> failed with AttributeError
[Thu Mar 27 17:28:58 2014] [error]

cthulhu.log:

2014-03-27 14:20:40,496 - ERROR - cthulhu Exception handling message with tag salt/job/20140327142040306778/ret/mira110.front.sepia.ceph.com
Traceback (most recent call last):
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 286, in _run
self.on_completion(data)
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
return func(*args, **kwargs)
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 423, in on_completion
self._requests.on_completion(data)
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/request_collection.py", line 232, in on_completion
self._eventer.on_user_request_complete(request)
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/eventer.py", line 90, in on_user_request_complete
self._emit(INFO, "Succeeded: %s" % request.headline, **request.assocations)
AttributeError: 'UserRequest' object has no attribute 'assocations'

calamari.log:

2014-03-27 17:29:04,229 - ERROR - django.request Internal Server Error: /api/v2/cluster/61723e0f-992b-466e-9c09-9914931bc584/server
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
return self.dispatch(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 72, in dispatch
return super(RPCView, self).dispatch(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/django/views/decorators/csrf.py", line 77, in wrapped_view
return view_func(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 399, in dispatch
response = self.handle_exception(exc)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 89, in handle_exception
return super(RPCView, self).handle_exception(exc)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/views.py", line 396, in dispatch
response = handler(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py", line 562, in list
[DataObject(s) for s in servers], many=True).data)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/serializers.py", line 571, in data
self._data = [self.to_native(item) for item in obj]
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/serializers.py", line 349, in to_native
value = field.field_to_native(obj, field_name)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/fields.py", line 329, in field_to_native
return super(WritableField, self).field_to_native(obj, field_name)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/fields.py", line 198, in field_to_native
value = get_component(value, component)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/fields.py", line 56, in get_component
val = getattr(obj, attr_name)
AttributeError: 'DataObject' object has no attribute 'frontend_iface'

#8 Updated by John Spray about 10 years ago

Those aren't related to this. I happen to know what those other things are; you can safely ignore them.

#9 Updated by John Spray almost 10 years ago

  • Status changed from New to Resolved

No evidence that this continues to be a problem; mira106 has stayed up for several days without issue. Reopen if it happens again.

#10 Updated by John Spray almost 10 years ago

  • Target version changed from v1.2 Backlog to v1.2-dev6
