Bug #10630
failed to decode json output from thrashosds
Status: Closed
Description
ubuntu@teuthology:/a/samuelj-2015-01-23_06:57:45-rados-master-wip-testing-vanilla-fixes-basic-multi/719316
2015-01-23T14:59:22.306 INFO:tasks.thrashosds.thrasher:in_osds: [4, 5, 2, 3] out_osds: [1, 0] dead_osds: [] live_osds: [0, 2, 4, 3, 1, 5]
2015-01-23T14:59:22.307 INFO:tasks.thrashosds.thrasher:choose_action: min_in 4 min_out 0 min_live 2 min_dead 0
2015-01-23T14:59:22.307 INFO:tasks.thrashosds.thrasher:Setting osd 4 primary_affinity to 0.000000
2015-01-23T14:59:22.308 INFO:teuthology.orchestra.run.plana41:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd primary-affinity 4 0'
2015-01-23T14:59:22.907 INFO:teuthology.orchestra.run.plana41.stderr:set osd.4 primary-affinity to 0 (802)
2015-01-23T14:59:27.922 INFO:tasks.thrashosds.thrasher:in_osds: [4, 5, 2, 3] out_osds: [1, 0] dead_osds: [] live_osds: [0, 2, 4, 3, 1, 5]
...
2015-01-23T15:05:28.605 INFO:tasks.thrashosds:joining thrashosds
2015-01-23T15:05:28.605 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 416, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
ValueError: No JSON object could be decoded
Updated by Sage Weil about 9 years ago
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2015-01-31_06:26:17-rados-wip-sage-testing-distro-basic-multi/733423
Updated by Kefu Chai about 9 years ago
json is the general format used to get the status of the cluster, and the workhorse of the thrash test, the Thrasher class in ceph_manager.py, fetches that status from the cluster at several different places. among other things:
- get_osd_dump_json() // osd dump --format=json
- get_pg_stats() // pg dump --format=json
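for reference, this is the kind of failure json.loads() produces when one of those commands returns an empty or truncated reply; the parse_json_or_log() helper below is a made-up sketch for illustration, not code from ceph_manager.py:

```python
import json
import logging

log = logging.getLogger(__name__)

def parse_json_or_log(raw, what="command output"):
    """Parse JSON, keeping the raw text around if decoding fails.

    If a monitor command returns an empty or truncated string,
    json.loads() raises ValueError ("No JSON object could be decoded"
    on Python 2), which is the error seen in the teuthology log.
    """
    try:
        return json.loads(raw)
    except ValueError:
        # Log the offending payload so we can tell which command broke.
        log.error("failed to decode %s: %r", what, raw[:200])
        raise
```

a wrapper like this around the `osd dump` / `pg dump` calls would at least reveal which command produced the unparseable string.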
so i am not sure which command printed the string that the json parser could not read. ideally, we should have the backtrace from where the exception was thrown. looking at the implementation of the Greenlet class,
def _report_error(self, exc_info):
    # ...
    self.parent.handle_error(self, *exc_info)
it does pass the exception info to its parent greenlet, and the parent greenlet is the return value of get_hub(), see Greenlet.__init__(). and in the Hub class:
def handle_error(self, context, type, value, tb):
    if not issubclass(type, self.NOT_ERROR):
        self.print_exception(context, type, value, tb)
    if context is None or issubclass(type, self.SYSTEM_ERROR):
        self.handle_system_error(type, value)
in the same class:

def print_exception(self, context, type, value, tb):
    traceback.print_exception(type, value, tb)
    # ...
but i cannot find any backtrace info other than the one printed by the rethrown exception:
ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2015-01-31_06:26:17-rados-wip-sage-testing-distro-basic-multi/733423$ grep -A 10 Traceback teuthology.log
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 416, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
--
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
    yield vars
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph.py", line 1294, in task
    yield
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
--
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
    yield vars
  File "/home/teuthworker/src/teuthology_master/teuthology/task/install.py", line 1269, in task
    yield
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
because i need to hack greenlet to root-cause why the original backtrace does not show up in the log file, and i cannot mess around with the lab machine (or it would be a little bit tricky to do so), i will use Loïc's container hack to get started.
Updated by Sage Weil about 9 years ago
ubuntu@teuthology:/a/sage-2015-02-17_22:30:59-rados-wip-sage-testing-distro-basic-multi/763975
fwiw it looks a bit to me like the admin socket dump_ops_in_flight command after starting the osd is hung. perhaps enabling the client-side debug logs, or --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monclient 20 on that ceph command, would help.. then we'd see it all in the teuthology log. or some extra logging around that code in ceph_manager.py?
Updated by Kefu Chai about 9 years ago
thanks sage, i was on holiday last week and will look at this ticket tomorrow. it seems it needs more effort to set up a testbed for running teuthology, even with loïc's hack.
Sage Weil wrote:
ubuntu@teuthology:/a/sage-2015-02-17_22:30:59-rados-wip-sage-testing-distro-basic-multi/763975
fwiw it looks a bit to me like the admin socket dump_ops_in_flight command after starting the osd is hung. perhaps enabling the client-side debug logs, or --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monclient 20 on that ceph command, would help.. then we'd see it all in the teuthology log.
but if the osd is hung, the asok command will time out and throw an exception of "timed out waiting for admin_socket to appear after osd.{id} restart". will try out your ideas tomorrow.
or some extra logging around that code in ceph_manager.py?
yeah, will print out more info on seeing an exception, if greenlet fails to do this for us.
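a minimal sketch of that idea, logging the backtrace ourselves before re-raising instead of relying on gevent; the wrap() decorator below is illustrative, not the actual ceph_manager.py code:

```python
import logging

log = logging.getLogger(__name__)

def wrap(fn):
    """Run fn(), but log the full backtrace before re-raising.

    gevent only prints the backtrace of a failed greenlet to stderr,
    so logging it explicitly keeps it in teuthology.log as well.
    """
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # log.exception() records the current backtrace at ERROR level.
            log.exception("exception in %s", fn.__name__)
            raise
    return wrapper
```

the thrasher's main loop body could be wrapped this way (or equivalently use an explicit try/except around the greenlet target).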
Updated by Kefu Chai about 9 years ago
traceback.print_exception() prints to stderr, while stderr is not logged AFAICS. might want to find a way to (monkey) patch traceback.print_exception(), or change the behaviour of gevent.Hub.print_exception().
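a rough sketch of the monkey-patch option, routing the text that would otherwise go to stderr into the logging module instead; the logger name and patched function below are illustrative, and not what any eventual fix necessarily looks like:

```python
import logging
import traceback

log = logging.getLogger("gevent-errors")

# keep the original so it can be restored later
_orig_print_exception = traceback.print_exception

def _logging_print_exception(etype, value, tb, *args, **kwargs):
    """Send the formatted backtrace to the logging module.

    gevent's Hub.print_exception() ends up calling
    traceback.print_exception(), which writes to stderr; patching it
    sends the same text to the log handlers instead.
    """
    log.error("".join(traceback.format_exception(etype, value, tb)))

traceback.print_exception = _logging_print_exception
# restore when done:
# traceback.print_exception = _orig_print_exception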
Updated by Kefu Chai about 9 years ago
a PR is posted at https://github.com/ceph/ceph-qa-suite/pull/344 adding logging for uncaught exceptions thrown by the thrasher. pending review.
Updated by Kefu Chai about 9 years ago
Sage Weil wrote:
ubuntu@teuthology:/a/sage-2015-02-17_22:30:59-rados-wip-sage-testing-distro-basic-multi/763975
fwiw it looks a bit to me like the admin socket dump_ops_in_flight command after starting the osd is hung. perhaps enabling the client-side debug logs, or --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monclient 20 on that ceph command, would help.. then we'd see it all in the teuthology log.
sage, so i will run

ceph osd tell 0 injectargs '--log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20'

before

ceph --admin-daemon ./out/osd.1.asok dump_ops_in_flight

and revert these debug settings after this command.
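the inject-then-revert pairing can be sketched as a context manager; run() here is a stand-in for however the harness executes a ceph command (e.g. a subprocess wrapper), not a real teuthology API:

```python
from contextlib import contextmanager

@contextmanager
def debug_flags(run, osd_id, flags):
    """Inject debug settings for one osd, then revert them.

    `run` is a hypothetical callable that executes a ceph CLI command;
    the finally block guarantees the revert happens even if the
    wrapped command fails.
    """
    run("ceph osd tell %d injectargs '%s'" % (osd_id, flags))
    try:
        yield
    finally:
        # revert to quiet defaults so the rest of the run is unaffected
        run("ceph osd tell %d injectargs "
            "'--debug-ms 0 --debug-objecter 0 --debug-monc 0'" % osd_id)
```

usage would be along the lines of: `with debug_flags(run, 0, '--log-to-stderr --debug-ms 1'): run('ceph --admin-daemon ./out/osd.1.asok dump_ops_in_flight')`.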
Updated by Kefu Chai about 9 years ago
perhaps enabling the client-side debug logs,
sorry sage, i am not sure i am following you; i am not sure it works in this case, as the asok command is not a native ceph command. the ceph script talks to the daemon over the UDS, so we are not likely to get a client-side log. and on the server side, the OSD logger sends the log to stderr when log-to-stderr is enabled. also, i think the OSD daemon started by teuthology is daemonized without the -d or -f option, so teuthology.log can hardly capture the log messages for the dump_ops_in_flight command.
Updated by Kefu Chai about 9 years ago
- Assignee deleted (Kefu Chai)
no tangible progress for a while, i am resigning from this ticket. :-(
Updated by Sage Weil about 9 years ago
Kefu Chai wrote:
perhaps enabling the client-side debug logs,
sorry sage, i am not sure i am following you; i am not sure it works in this case, as the asok command is not a native ceph command. the ceph script talks to the daemon over the UDS, so we are not likely to get a client-side log. and on the server side, the OSD logger sends the log to stderr when log-to-stderr is enabled. also, i think the OSD daemon started by teuthology is daemonized without the -d or -f option, so teuthology.log can hardly capture the log messages for the dump_ops_in_flight command.
oh, you're right. hrm. maybe what we need is better debugging on the server side for the admin socket handling code? there is a debug_asok setting that we could bump up..
Updated by Sage Weil almost 9 years ago
- Status changed from New to Resolved
- Regression set to No