Project

General

Profile

Actions

Bug #15300

closed

gevent.corecext.loop.handle_error hides error cause with CRITICAL:root:IOError

Added by Loïc Dachary about 8 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

and more

2016-03-26T23:12:13.468 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454247 7f1d6ae6d700 10 client.6584.objecter  successfully canceled tick
2016-03-26T23:12:13.470 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454760 7f1d6ae6d700  1 -- 172.21.15.39:0/468413173 mark_down 0x7f1d64093d40 -- 0x7f1d64098590
2016-03-26T23:12:13.470 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454808 7f1d6ae6d700  1 -- 172.21.15.39:0/468413173 mark_down_all
2016-03-26T23:12:13.470 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454977 7f1d6ae6d700  1 -- 172.21.15.39:0/468413173 shutdown complete.
2016-03-26T23:12:13.476 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now
2016-03-26T23:12:13.478 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-03-26T23:12:13.479 CRITICAL:root:IOError
Actions #1

Updated by Loïc Dachary about 8 years ago

http://pulpito.ceph.com/loic-2016-03-24_01:22:33-rados-hammer-backports---basic-smithi/83615/

2016-03-26T21:37:45.252 INFO:tasks.rados.rados.0.smithi046.stderr:err -5
2016-03-26T21:37:45.253 INFO:tasks.rados.rados.0.smithi046.stderr:./test/osd/RadosModel.h: In function 'virtual void RollbackOp::_finish(TestOp::CallbackInfo*)' thread 7fd9377fe700 time 2016-03-27 04:37:45.253211
2016-03-26T21:37:45.253 INFO:tasks.rados.rados.0.smithi046.stderr:./test/osd/RadosModel.h: 1485: FAILED assert(0)
2016-03-26T21:37:45.253 INFO:tasks.rados.rados.0.smithi046.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3)
2016-03-26T21:37:45.254 INFO:tasks.rados.rados.0.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b]
2016-03-26T21:37:45.254 INFO:tasks.rados.rados.0.smithi046.stderr: 2: (RollbackOp::_finish(TestOp::CallbackInfo*)+0xf7) [0x4c8097]
2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d8619]
2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 4: (()+0x950ed) [0x7fd94bd6c0ed]
2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 5: (()+0x70439) [0x7fd94bd47439]
2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 6: (()+0x134438) [0x7fd94be0b438]
2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr: 7: (()+0x8182) [0x7fd94b8ab182]
2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr: 8: (clone()+0x6d) [0x7fd94a23047d]
2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2016-03-26T21:37:46.101 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-03-26T21:37:46.102 CRITICAL:root:IOError
Actions #2

Updated by Loïc Dachary about 8 years ago

  • Status changed from New to 12
  • Priority changed from Normal to Urgent

promoting to urgent and moving to sepia as it looks like some smithi are failing with I/O error

Actions #3

Updated by Loïc Dachary about 8 years ago

  • Project changed from teuthology to sepia
Actions #4

Updated by Zack Cerza about 8 years ago

Hrm. Isn't it just trying to print to stdout though?

Actions #5

Updated by Loïc Dachary about 8 years ago

@Zack http://tracker.ceph.com/issues/15300#note-1 suggests this is only the tip of the iceberg (-5 at the begining EIO and crash).

Actions #6

Updated by David Galloway about 8 years ago

This may be a dumb question but is the implication here possible hardware failure? If not, I'm not sure how to go about trying to reproduce the problem or if it's even something I should troubleshoot.

Actions #7

Updated by Zack Cerza about 8 years ago

Ah, I see. Well, apparently there's a firmware update for the NVMe cards. We should try applying the update to some of them to see if it magically fixes this :)

Actions #8

Updated by David Galloway about 8 years ago

  • Assignee set to David Galloway

Zack Cerza wrote:

Ah, I see. Well, apparently there's a firmware update for the NVMe cards. We should try applying the update to some of them to see if it magically fixes this :)

Any idea how I can intentionally reproduce the error, by chance?

Actions #9

Updated by Nathan Cutler about 8 years ago

  • Related to Support #15340: Update smithi P3700 NVMe card firmware added
Actions #10

Updated by Loïc Dachary about 8 years ago

2016-03-31T12:37:15.824 INFO:tasks.rados.rados.0.smithi037.stdout:6602: delete oid 4089 current snap is 0
2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr:6601: Error: oid 4266 read returned error code -2
2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7f42837fe700 time 2016-03-31 19:37:15.817732
2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: 1109: FAILED assert(0)
2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3)
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b]
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4d13ac]
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 3: (()+0x9522d) [0x7f4297bfa22d]
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 4: (()+0x70439) [0x7f4297bd5439]
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 5: (()+0x134438) [0x7f4297c99438]
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 6: (()+0x8182) [0x7f4297739182]
2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 7: (clone()+0x6d) [0x7f42960be47d]
2016-03-31T12:37:15.827 INFO:tasks.rados.rados.0.smithi037.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-03-31T12:37:15.827 INFO:tasks.rados.rados.0.smithi037.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2016-03-31T12:37:15.924 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-03-31T12:37:15.924 CRITICAL:root:IOError
Actions #11

Updated by Loïc Dachary about 8 years ago

These come from two completely different tests, it's bound to be something environmental. And I doubt it's just gevent misbehaving.

2016-03-31T12:31:02.473 INFO:tasks.rados.rados.0.smithi058.stdout:570:  finishing copy_from to smithi05822912-26
2016-03-31T12:31:02.473 INFO:tasks.rados.rados.0.smithi058.stderr:Error: oid 26 copy_from 50 returned error code -5
2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr:./test/osd/RadosModel.h: In function 'virtual void CopyFromOp::_finish(TestOp::CallbackInfo*)' thread 7f79e0ff9700 time 2016-03-31 19:31:02.474007
2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr:./test/osd/RadosModel.h: 1594: FAILED assert(0)
2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3)
2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b]
2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: 2: (CopyFromOp::_finish(TestOp::CallbackInfo*)+0x1f5) [0x4c82a5]
2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d8619]
2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 4: (()+0x950ed) [0x7f79ed9720ed]
2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 5: (()+0x70439) [0x7f79ed94d439]
2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 6: (()+0x134438) [0x7f79eda11438]
2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 7: (()+0x8182) [0x7f79ed4b1182]
2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 8: (clone()+0x6d) [0x7f79ebe3647d]
2016-03-31T12:31:02.476 INFO:tasks.rados.rados.0.smithi058.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-03-31T12:31:02.476 INFO:tasks.rados.rados.0.smithi058.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2016-03-31T12:31:02.516 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-03-31T12:31:02.516 CRITICAL:root:IOError

Actions #12

Updated by Loïc Dachary about 8 years ago

  • Project changed from sepia to teuthology
  • Assignee changed from David Galloway to Zack Cerza
  • Priority changed from Urgent to High

Running the rados suite on hammer and OpenStack shows the same kind of error. Only it's not obscured by gevent (see below). The gevent version used by teuthology on OpenStack is gevent==1.0.2

2016-04-03T23:14:34.423 INFO:tasks.rados.rados.0.target167114243099.stdout:3455: done (15 left)
2016-04-03T23:14:34.423 INFO:tasks.rados.rados.0.target167114243099.stdout:3456: delete oid 7 current snap is 357
2016-04-03T23:14:34.577 INFO:tasks.rados.rados.0.target167114243099.stderr:err -5
2016-04-03T23:14:34.580 INFO:tasks.rados.rados.0.target167114243099.stderr:./test/osd/RadosModel.h: In function 'virtual void SetAttrsOp::_finish(TestOp::CallbackInfo*)' thread 7fdc3bfff700 time 2016-04-03 23:14:34.562701
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr:./test/osd/RadosModel.h: 687: FAILED assert(0)
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3)
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b]
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 2: (SetAttrsOp::_finish(TestOp::CallbackInfo*)+0x218) [0x4c7f88]
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d8619]
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 4: (()+0x950ed) [0x7fdc54d9d0ed]
2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 5: (()+0x70439) [0x7fdc54d78439]
2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: 6: (()+0x134438) [0x7fdc54e3c438]
2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: 7: (()+0x8182) [0x7fdc548dc182]
2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: 8: (clone()+0x6d) [0x7fdc5326147d]
2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2016-04-03T23:14:34.583 INFO:tasks.rados.rados.0.target167114243099.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion'
2016-04-03T23:14:34.850 ERROR:teuthology.run_tasks:Manager failed: rados
Traceback (most recent call last):
  File "/home/ubuntu/teuthology/teuthology/run_tasks.py", line 139, in run_tasks
    suppress = manager.__exit__(*exc_info)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/home/ubuntu/src/ceph-qa-suite_hammer/tasks/rados.py", line 196, in task
    running.get()
  File "/home/ubuntu/teuthology/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 274, in get
    raise self._exception
CommandCrashedError: Command crashed: 'CEPH_CLIENT_ID=0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph_test_rados --ec-pool --op read 100 --op write 0 --op delete 50 --max-ops 4000 --objects 50 --max-in-flight 16 --size 4000000 --min-stride-size 400000 --max-stride-size 800000 --max-seconds 0 --op snap_create 50 --op snap_remove 50 --op rollback 50 --op setattr 25 --op rmattr 25 --op copy_from 50 --op append 100 --pool unique_pool_0'
2016-04-03T23:14:34.850 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds
Actions #13

Updated by Loïc Dachary about 8 years ago

  • Subject changed from gevent.corecext.loop.handle_error CRITICAL:root:IOError to gevent.corecext.loop.handle_error hides error cause with CRITICAL:root:IOError
Actions #14

Updated by Nathan Cutler almost 8 years ago

Seeing this in hammer integration testing:

The following crash is reported during thrashosds:

2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr:ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed.
2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr:*** Caught signal (Aborted) **
2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr: in thread 7f48a21d5700
2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr:ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed.
2016-07-03T18:12:07.349 INFO:tasks.ceph.osd.5.smithi068.stderr:daemon-helper: command crashed with signal 6

but thrashosds chugs on for 2+ minutes more, when this happens:

2016-07-03T18:14:18.135 INFO:tasks.thrashosds.thrasher:in_osds:  [1, 2, 3, 4, 5, 0] out_osds:  [] dead_osds:  [] live_osds:  [1, 0, 3, 2, 5, 4]
2016-07-03T18:14:18.136 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2016-07-03T18:14:18.136 INFO:tasks.thrashosds.thrasher:Killing osd 5, live_osds are [1, 0, 3, 2, 5, 4]
2016-07-03T18:14:18.136 DEBUG:tasks.ceph.osd.5:waiting for process to exit
2016-07-03T18:14:18.136 INFO:teuthology.orchestra.run:waiting for 300
2016-07-03T18:14:18.140 CRITICAL:root:  File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error
    self.print_exception(context, type, value, tb)
  File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception
    traceback.print_exception(type, value, tb)
  File "/usr/lib/python2.7/traceback.py", line 124, in print_exception
    _print(file, 'Traceback (most recent call last):')
  File "/usr/lib/python2.7/traceback.py", line 13, in _print
    file.write(str+terminator)

2016-07-03T18:14:18.140 CRITICAL:root:IOError
Actions #15

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #16

Updated by Josh Durgin over 2 years ago

  • Status changed from New to Resolved

looks like it was hardware issues that have been fixed

Actions

Also available in: Atom PDF