Bug #10630

closed

failed to decode json output from thrashosds

Added by Sage Weil over 9 years ago. Updated almost 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

ubuntu@teuthology:/a/samuelj-2015-01-23_06:57:45-rados-master-wip-testing-vanilla-fixes-basic-multi/719316

2015-01-23T14:59:22.306 INFO:tasks.thrashosds.thrasher:in_osds:  [4, 5, 2, 3] out_osds:  [1, 0] dead_osds:  [] live_osds:  [0, 2, 4, 3, 1, 5]
2015-01-23T14:59:22.307 INFO:tasks.thrashosds.thrasher:choose_action: min_in 4 min_out 0 min_live 2 min_dead 0
2015-01-23T14:59:22.307 INFO:tasks.thrashosds.thrasher:Setting osd 4 primary_affinity to 0.000000
2015-01-23T14:59:22.308 INFO:teuthology.orchestra.run.plana41:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph osd primary-affinity 4 0'
2015-01-23T14:59:22.907 INFO:teuthology.orchestra.run.plana41.stderr:set osd.4 primary-affinity to 0 (802)
2015-01-23T14:59:27.922 INFO:tasks.thrashosds.thrasher:in_osds:  [4, 5, 2, 3] out_osds:  [1, 0] dead_osds:  [] live_osds:  [0, 2, 4, 3, 1, 5]
...
2015-01-23T15:05:28.605 INFO:tasks.thrashosds:joining thrashosds
2015-01-23T15:05:28.605 ERROR:teuthology.run_tasks:Manager failed: thrashosds
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 416, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
ValueError: No JSON object could be decoded


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #10601: thrashosd shutdown: No JSON object could be decoded (Duplicate, 01/21/2015)

Actions #1

Updated by Sage Weil about 9 years ago

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2015-01-31_06:26:17-rados-wip-sage-testing-distro-basic-multi/733423

Actions #2

Updated by Kefu Chai about 9 years ago

  • Assignee set to Kefu Chai
Actions #3

Updated by Kefu Chai about 9 years ago

the json format is used as a general way to get the status of the cluster, and the workhorse of the thrash test, Thrasher in ceph_manager.py, tries to get that status from the cluster in several different places. among other things:

  • get_osd_dump_json() // osd dump --format=json
  • get_pg_stats() // pg dump --format=json
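one way to narrow it down would be to wrap the decode so the raw output gets logged on failure. a minimal sketch (the helper name is made up, not something in ceph_manager.py):

```python
import json
import logging

log = logging.getLogger(__name__)

def decode_json_or_log(raw, source):
    """Parse JSON from a cluster command, logging the raw output on failure.

    `source` is a label like 'osd dump' or 'pg dump', so the log
    identifies which command produced the unparsable string.
    """
    try:
        return json.loads(raw)
    except ValueError:
        # keep the dump short; the first bytes are usually enough to see
        # whether the output was empty, truncated, or an error message
        log.error('failed to decode json from %r: %r', source, raw[:500])
        raise
```

each call site would pass its own label, so a single grep of teuthology.log would answer the "which command" question.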

so i am not sure which command printed the string that is not readable by the json parser. ideally, we should have the backtrace from where the exception was thrown. looking at the implementation of the Greenlet class,

    def _report_error(self, exc_info):
        # ....
        self.parent.handle_error(self, *exc_info)

it does pass the exception info to its parent greenlet, and the parent greenlet is the return value of get_hub() (see Greenlet.__init__()). and in the Hub class:

    def handle_error(self, context, type, value, tb):
        if not issubclass(type, self.NOT_ERROR):
            self.print_exception(context, type, value, tb)
        if context is None or issubclass(type, self.SYSTEM_ERROR):
            self.handle_system_error(type, value)

in the same class

    def print_exception(self, context, type, value, tb):
        traceback.print_exception(type, value, tb)
        # ......

but i cannot find any backtrace info other than the one printed by the rethrown exception:

ubuntu@teuthology:/var/lib/teuthworker/archive/sage-2015-01-31_06:26:17-rados-wip-sage-testing-distro-basic-multi/733423$ grep -A 10 Traceback teuthology.log
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph_manager.py", line 416, in do_join
    self.thread.get()
  File "/usr/lib/python2.7/dist-packages/gevent/greenlet.py", line 308, in get
    raise self._exception
--
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
    yield vars
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/ceph.py", line 1294, in task
    yield
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()
--
Traceback (most recent call last):
  File "/home/teuthworker/src/teuthology_master/teuthology/contextutil.py", line 30, in nested
    yield vars
  File "/home/teuthworker/src/teuthology_master/teuthology/task/install.py", line 1269, in task
    yield
  File "/home/teuthworker/src/teuthology_master/teuthology/run_tasks.py", line 122, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/var/lib/teuthworker/src/ceph-qa-suite_master/tasks/thrashosds.py", line 174, in task
    thrash_proc.do_join()

because i need to hack on greenlet to root-cause why the original backtrace does not show up in the log file, and i cannot mess around with the lab machines (or at least it would be a bit tricky to do so), i will start from Loïc's container hack.

Actions #4

Updated by Sage Weil about 9 years ago

ubuntu@teuthology:/a/sage-2015-02-17_22:30:59-rados-wip-sage-testing-distro-basic-multi/763975

fwiw it looks a bit to me like the admin socket dump_ops_in_flight command after starting the osd is hung. perhaps enabling the client-side debug logs, or --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monclient 20 on that ceph command, would help... then we'd see it all in the teuthology log. or some extra logging around that code in ceph_manager.py?

Actions #5

Updated by Kefu Chai about 9 years ago

thanks sage, i was on holiday last week. will look at this ticket tomorrow. seems it needs more effort to set up a testbed for running teuthology, even with loïc's hack.

Sage Weil wrote:

ubuntu@teuthology:/a/sage-2015-02-17_22:30:59-rados-wip-sage-testing-distro-basic-multi/763975

fwiw it looks a bit to me like the admin socket dump_ops_in_flight command after starting the osd is hung. perhaps enabling the client-side debug logs, or --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monclient 20 on that ceph command, would help... then we'd see it all in the teuthology log.

but if the osd is hung, the asok command will time out and throw an exception of 'timed out waiting for admin_socket to appear after osd.{id} restart'. will try out your ideas tomorrow.

or some extra logging around that code in ceph_manager.py?

yeah, will print out more info on seeing an exception, if greenlet fails to do this for us.

Actions #6

Updated by Kefu Chai about 9 years ago

traceback.print_exception() prints to stderr, but stderr is not captured in the log AFAICS. might want to find a way to (monkey-)patch traceback.print_exception(), or change the behaviour of gevent.Hub.print_exception().
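the monkey-patch option could look like this — a minimal sketch (logger name and function names are made up): mirror whatever traceback.print_exception() would write into the logging module, so greenlet tracebacks survive even when stderr is not captured:

```python
import logging
import traceback

log = logging.getLogger('gevent.hub')
_orig_print_exception = traceback.print_exception

def _logging_print_exception(etype, value, tb, limit=None, file=None):
    """Log the formatted traceback, then delegate to the original print."""
    log.error('uncaught exception:\n%s',
              ''.join(traceback.format_exception(etype, value, tb)))
    _orig_print_exception(etype, value, tb, limit=limit, file=file)

# install the patch; gevent's Hub.print_exception() calls
# traceback.print_exception() and will pick this up
traceback.print_exception = _logging_print_exception
```

since gevent's Hub looks up traceback.print_exception at call time, patching the module attribute is enough; no change to gevent itself is needed.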

Actions #7

Updated by Kefu Chai about 9 years ago

posted a PR at https://github.com/ceph/ceph-qa-suite/pull/344 adding logging for uncaught exceptions thrown by the thrasher. pending review.

Actions #8

Updated by Kefu Chai about 9 years ago

Sage Weil wrote:

ubuntu@teuthology:/a/sage-2015-02-17_22:30:59-rados-wip-sage-testing-distro-basic-multi/763975

fwiw it looks a bit to me like the admin socket dump_ops_in_flight command after starting the osd is hung. perhaps enabling the client-side debug logs, or --log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monclient 20 on that ceph command, would help... then we'd see it all in the teuthology log.

sage, so i will

ceph osd tell 0 injectargs '--log-to-stderr --debug-ms 1 --debug-objecter 20 --debug-monc 20'

before

ceph --admin-daemon ./out/osd.1.asok dump_ops_in_flight

and revert these debug settings after this command.

Actions #9

Updated by Kefu Chai about 9 years ago

perhaps enabling the client-side debug logs,

sorry sage, i am not sure i am following you, and i am not sure it works in this case, as the asok command is not a native ceph command. the ceph script talks to the daemon over a unix domain socket, so we are not likely to get a client-side log. and on the server side, the OSD logger sends the log to stderr only when log-to-stderr is enabled; i think the OSD daemon started by teuthology is daemonized without the -d or -f option, so teuthology.log can hardly capture the log messages for the dump_ops_in_flight command.
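for reference, what the ceph script does for an asok command can be reproduced directly from python — a minimal sketch, assuming the classic admin socket wire format (NUL-terminated request string, reply framed as a 4-byte big-endian length plus payload); the socket path and command below are illustrative:

```python
import json
import socket
import struct

def asok_command(path, cmd):
    """Send a command to a ceph admin socket and return the parsed reply.

    Assumes the classic wire format: the request is the command string
    followed by a NUL byte; the reply is a 4-byte big-endian length
    prefix followed by that many bytes of (usually JSON) payload.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        sock.sendall(cmd.encode() + b'\0')
        length = struct.unpack('>I', sock.recv(4))[0]
        buf = b''
        while len(buf) < length:
            chunk = sock.recv(length - len(buf))
            if not chunk:
                break
            buf += chunk
        return json.loads(buf.decode())
    finally:
        sock.close()
```

this makes the point above concrete: everything interesting happens inside the daemon, so only server-side logging (or the reply itself) can tell us why dump_ops_in_flight hangs.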

Actions #10

Updated by Kefu Chai about 9 years ago

  • Assignee deleted (Kefu Chai)

no tangible progress for a while, i am resigning from this ticket. :-(

Actions #11

Updated by Sage Weil about 9 years ago

Kefu Chai wrote:

perhaps enabling the client-side debug logs,

sorry sage, i am not sure i am following you, and i am not sure it works in this case, as the asok command is not a native ceph command. the ceph script talks to the daemon over a unix domain socket, so we are not likely to get a client-side log. and on the server side, the OSD logger sends the log to stderr only when log-to-stderr is enabled; i think the OSD daemon started by teuthology is daemonized without the -d or -f option, so teuthology.log can hardly capture the log messages for the dump_ops_in_flight command.

oh, you're right. hrm. maybe what we need is better debugging on the server side for the admin socket handling code? there is a debug_asok setting that we could bump up...
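for the record, that bump could go straight into the job's ceph.conf overrides — a sketch (the value 20 is illustrative; any nonzero level increases admin socket verbosity):

```ini
[osd]
        debug asok = 20
```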

Actions #12

Updated by Sage Weil almost 9 years ago

  • Status changed from New to Resolved
  • Regression set to No