Bug #15300
closedgevent.corecext.loop.handle_error hides error cause with CRITICAL:root:IOError
0%
Description
- http://pulpito.ceph.com/loic-2016-03-24_01:22:33-rados-hammer-backports---basic-smithi/83728/
- http://pulpito.ceph.com/loic-2016-03-24_01:22:33-rados-hammer-backports---basic-smithi/83661/
and more
2016-03-26T23:12:13.468 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454247 7f1d6ae6d700 10 client.6584.objecter successfully canceled tick 2016-03-26T23:12:13.470 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454760 7f1d6ae6d700 1 -- 172.21.15.39:0/468413173 mark_down 0x7f1d64093d40 -- 0x7f1d64098590 2016-03-26T23:12:13.470 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454808 7f1d6ae6d700 1 -- 172.21.15.39:0/468413173 mark_down_all 2016-03-26T23:12:13.470 INFO:teuthology.orchestra.run.smithi039.stderr:2016-03-27 06:12:13.454977 7f1d6ae6d700 1 -- 172.21.15.39:0/468413173 shutdown complete. 2016-03-26T23:12:13.476 INFO:tasks.ceph.ceph_manager:no progress seen, keeping timeout for now 2016-03-26T23:12:13.478 CRITICAL:root: File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error self.print_exception(context, type, value, tb) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception traceback.print_exception(type, value, tb) File "/usr/lib/python2.7/traceback.py", line 124, in print_exception _print(file, 'Traceback (most recent call last):') File "/usr/lib/python2.7/traceback.py", line 13, in _print file.write(str+terminator) 2016-03-26T23:12:13.479 CRITICAL:root:IOError
Updated by Loïc Dachary about 8 years ago
http://pulpito.ceph.com/loic-2016-03-24_01:22:33-rados-hammer-backports---basic-smithi/83615/
2016-03-26T21:37:45.252 INFO:tasks.rados.rados.0.smithi046.stderr:err -5 2016-03-26T21:37:45.253 INFO:tasks.rados.rados.0.smithi046.stderr:./test/osd/RadosModel.h: In function 'virtual void RollbackOp::_finish(TestOp::CallbackInfo*)' thread 7fd9377fe700 time 2016-03-27 04:37:45.253211 2016-03-26T21:37:45.253 INFO:tasks.rados.rados.0.smithi046.stderr:./test/osd/RadosModel.h: 1485: FAILED assert(0) 2016-03-26T21:37:45.253 INFO:tasks.rados.rados.0.smithi046.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3) 2016-03-26T21:37:45.254 INFO:tasks.rados.rados.0.smithi046.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b] 2016-03-26T21:37:45.254 INFO:tasks.rados.rados.0.smithi046.stderr: 2: (RollbackOp::_finish(TestOp::CallbackInfo*)+0xf7) [0x4c8097] 2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d8619] 2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 4: (()+0x950ed) [0x7fd94bd6c0ed] 2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 5: (()+0x70439) [0x7fd94bd47439] 2016-03-26T21:37:45.255 INFO:tasks.rados.rados.0.smithi046.stderr: 6: (()+0x134438) [0x7fd94be0b438] 2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr: 7: (()+0x8182) [0x7fd94b8ab182] 2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr: 8: (clone()+0x6d) [0x7fd94a23047d] 2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-03-26T21:37:45.256 INFO:tasks.rados.rados.0.smithi046.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion' 2016-03-26T21:37:46.101 CRITICAL:root: File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error self.print_exception(context, type, value, tb) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception traceback.print_exception(type, value, tb) File "/usr/lib/python2.7/traceback.py", line 124, in print_exception _print(file, 'Traceback (most recent call last):') File "/usr/lib/python2.7/traceback.py", line 13, in _print file.write(str+terminator) 2016-03-26T21:37:46.102 CRITICAL:root:IOError
Updated by Loïc Dachary about 8 years ago
- Status changed from New to 12
- Priority changed from Normal to Urgent
promoting to urgent and moving to sepia as it looks like some smithi are failing with I/O error
Updated by Loïc Dachary about 8 years ago
- Project changed from teuthology to sepia
Updated by Zack Cerza about 8 years ago
Hrm. Isn't it just trying to print to stdout though?
Updated by Loïc Dachary about 8 years ago
@Zack http://tracker.ceph.com/issues/15300#note-1 suggests this is only the tip of the iceberg (-5 at the begining EIO and crash).
Updated by David Galloway about 8 years ago
This may be a dumb question but is the implication here possible hardware failure? If not, I'm not sure how to go about trying to reproduce the problem or if it's even something I should troubleshoot.
Updated by Zack Cerza about 8 years ago
Ah, I see. Well, apparently there's a firmware update for the NVMe cards. We should try applying the update to some of them to see if it magically fixes this :)
Updated by David Galloway about 8 years ago
- Assignee set to David Galloway
Zack Cerza wrote:
Ah, I see. Well, apparently there's a firmware update for the NVMe cards. We should try applying the update to some of them to see if it magically fixes this :)
Any idea how I can intentionally reproduce the error, by chance?
Updated by Nathan Cutler about 8 years ago
- Related to Support #15340: Update smithi P3700 NVMe card firmware added
Updated by Loïc Dachary about 8 years ago
2016-03-31T12:37:15.824 INFO:tasks.rados.rados.0.smithi037.stdout:6602: delete oid 4089 current snap is 0 2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr:6601: Error: oid 4266 read returned error code -2 2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7f42837fe700 time 2016-03-31 19:37:15.817732 2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr:./test/osd/RadosModel.h: 1109: FAILED assert(0) 2016-03-31T12:37:15.825 INFO:tasks.rados.rados.0.smithi037.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3) 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b] 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0xec) [0x4d13ac] 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 3: (()+0x9522d) [0x7f4297bfa22d] 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 4: (()+0x70439) [0x7f4297bd5439] 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 5: (()+0x134438) [0x7f4297c99438] 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 6: (()+0x8182) [0x7f4297739182] 2016-03-31T12:37:15.826 INFO:tasks.rados.rados.0.smithi037.stderr: 7: (clone()+0x6d) [0x7f42960be47d] 2016-03-31T12:37:15.827 INFO:tasks.rados.rados.0.smithi037.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-03-31T12:37:15.827 INFO:tasks.rados.rados.0.smithi037.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion' 2016-03-31T12:37:15.924 CRITICAL:root: File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error self.print_exception(context, type, value, tb) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception traceback.print_exception(type, value, tb) File "/usr/lib/python2.7/traceback.py", line 124, in print_exception _print(file, 'Traceback (most recent call last):') File "/usr/lib/python2.7/traceback.py", line 13, in _print file.write(str+terminator) 2016-03-31T12:37:15.924 CRITICAL:root:IOError
Updated by Loïc Dachary about 8 years ago
These come from two completely different tests, it's bound to be something environmental. And I doubt it's just gevent misbehaving.
2016-03-31T12:31:02.473 INFO:tasks.rados.rados.0.smithi058.stdout:570: finishing copy_from to smithi05822912-26 2016-03-31T12:31:02.473 INFO:tasks.rados.rados.0.smithi058.stderr:Error: oid 26 copy_from 50 returned error code -5 2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr:./test/osd/RadosModel.h: In function 'virtual void CopyFromOp::_finish(TestOp::CallbackInfo*)' thread 7f79e0ff9700 time 2016-03-31 19:31:02.474007 2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr:./test/osd/RadosModel.h: 1594: FAILED assert(0) 2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3) 2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b] 2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: 2: (CopyFromOp::_finish(TestOp::CallbackInfo*)+0x1f5) [0x4c82a5] 2016-03-31T12:31:02.474 INFO:tasks.rados.rados.0.smithi058.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d8619] 2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 4: (()+0x950ed) [0x7f79ed9720ed] 2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 5: (()+0x70439) [0x7f79ed94d439] 2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 6: (()+0x134438) [0x7f79eda11438] 2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 7: (()+0x8182) [0x7f79ed4b1182] 2016-03-31T12:31:02.475 INFO:tasks.rados.rados.0.smithi058.stderr: 8: (clone()+0x6d) [0x7f79ebe3647d] 2016-03-31T12:31:02.476 INFO:tasks.rados.rados.0.smithi058.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-03-31T12:31:02.476 INFO:tasks.rados.rados.0.smithi058.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion' 2016-03-31T12:31:02.516 CRITICAL:root: File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error self.print_exception(context, type, value, tb) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception traceback.print_exception(type, value, tb) File "/usr/lib/python2.7/traceback.py", line 124, in print_exception _print(file, 'Traceback (most recent call last):') File "/usr/lib/python2.7/traceback.py", line 13, in _print file.write(str+terminator) 2016-03-31T12:31:02.516 CRITICAL:root:IOError
Updated by Loïc Dachary about 8 years ago
- Project changed from sepia to teuthology
- Assignee changed from David Galloway to Zack Cerza
- Priority changed from Urgent to High
Running the rados suite on hammer and OpenStack shows the same kind of error. Only it's not obscured by gevent (see below). The gevent version used by teuthology on OpenStack is gevent==1.0.2
2016-04-03T23:14:34.423 INFO:tasks.rados.rados.0.target167114243099.stdout:3455: done (15 left) 2016-04-03T23:14:34.423 INFO:tasks.rados.rados.0.target167114243099.stdout:3456: delete oid 7 current snap is 357 2016-04-03T23:14:34.577 INFO:tasks.rados.rados.0.target167114243099.stderr:err -5 2016-04-03T23:14:34.580 INFO:tasks.rados.rados.0.target167114243099.stderr:./test/osd/RadosModel.h: In function 'virtual void SetAttrsOp::_finish(TestOp::CallbackInfo*)' thread 7fdc3bfff700 time 2016-04-03 23:14:34.562701 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr:./test/osd/RadosModel.h: 687: FAILED assert(0) 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: ceph version 0.94.6-259-g8cc324d (8cc324df4caf0d043208d9abe88a0e217e861dc3) 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x4e101b] 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 2: (SetAttrsOp::_finish(TestOp::CallbackInfo*)+0x218) [0x4c7f88] 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 3: (write_callback(void*, void*)+0x19) [0x4d8619] 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 4: (()+0x950ed) [0x7fdc54d9d0ed] 2016-04-03T23:14:34.581 INFO:tasks.rados.rados.0.target167114243099.stderr: 5: (()+0x70439) [0x7fdc54d78439] 2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: 6: (()+0x134438) [0x7fdc54e3c438] 2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: 7: (()+0x8182) [0x7fdc548dc182] 2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: 8: (clone()+0x6d) [0x7fdc5326147d] 2016-04-03T23:14:34.582 INFO:tasks.rados.rados.0.target167114243099.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. 2016-04-03T23:14:34.583 INFO:tasks.rados.rados.0.target167114243099.stderr:terminate called after throwing an instance of 'ceph::FailedAssertion' 2016-04-03T23:14:34.850 ERROR:teuthology.run_tasks:Manager failed: rados Traceback (most recent call last): File "/home/ubuntu/teuthology/teuthology/run_tasks.py", line 139, in run_tasks suppress = manager.__exit__(*exc_info) File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__ self.gen.next() File "/home/ubuntu/src/ceph-qa-suite_hammer/tasks/rados.py", line 196, in task running.get() File "/home/ubuntu/teuthology/virtualenv/local/lib/python2.7/site-packages/gevent/greenlet.py", line 274, in get raise self._exception CommandCrashedError: Command crashed: 'CEPH_CLIENT_ID=0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph_test_rados --ec-pool --op read 100 --op write 0 --op delete 50 --max-ops 4000 --objects 50 --max-in-flight 16 --size 4000000 --min-stride-size 400000 --max-stride-size 800000 --max-seconds 0 --op snap_create 50 --op snap_remove 50 --op rollback 50 --op setattr 25 --op rmattr 25 --op copy_from 50 --op append 100 --pool unique_pool_0' 2016-04-03T23:14:34.850 DEBUG:teuthology.run_tasks:Unwinding manager thrashosds
Updated by Loïc Dachary about 8 years ago
- Subject changed from gevent.corecext.loop.handle_error CRITICAL:root:IOError to gevent.corecext.loop.handle_error hides error cause with CRITICAL:root:IOError
Updated by Nathan Cutler almost 8 years ago
- http://pulpito.ceph.com/smithfarm-2016-07-02_01:06:27-rados-hammer-backports---basic-smithi/287964
- http://pulpito.ceph.com/smithfarm-2016-07-02_01:06:27-rados-hammer-backports---basic-smithi/287998
The following crash is reported during thrashosds:
2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr:ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed. 2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr:*** Caught signal (Aborted) ** 2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr: in thread 7f48a21d5700 2016-07-03T18:12:06.967 INFO:tasks.ceph.osd.5.smithi068.stderr:ceph-osd: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:162: T* boost::intrusive_ptr<T>::operator->() const [with T = Connection]: Assertion `px != 0' failed. 2016-07-03T18:12:07.349 INFO:tasks.ceph.osd.5.smithi068.stderr:daemon-helper: command crashed with signal 6
but thrashosds chugs on for 2+ minutes more, when this happens:
2016-07-03T18:14:18.135 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5, 0] out_osds: [] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4] 2016-07-03T18:14:18.136 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0 2016-07-03T18:14:18.136 INFO:tasks.thrashosds.thrasher:Killing osd 5, live_osds are [1, 0, 3, 2, 5, 4] 2016-07-03T18:14:18.136 DEBUG:tasks.ceph.osd.5:waiting for process to exit 2016-07-03T18:14:18.136 INFO:teuthology.orchestra.run:waiting for 300 2016-07-03T18:14:18.140 CRITICAL:root: File "gevent/corecext.pyx", line 360, in gevent.corecext.loop.handle_error (gevent/gevent.corecext.c:6344) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 563, in handle_error self.print_exception(context, type, value, tb) File "/home/teuthworker/src/teuthology_master/virtualenv/local/lib/python2.7/site-packages/gevent/hub.py", line 591, in print_exception traceback.print_exception(type, value, tb) File "/usr/lib/python2.7/traceback.py", line 124, in print_exception _print(file, 'Traceback (most recent call last):') File "/usr/lib/python2.7/traceback.py", line 13, in _print file.write(str+terminator) 2016-07-03T18:14:18.140 CRITICAL:root:IOError
Updated by Josh Durgin over 2 years ago
- Status changed from New to Resolved
looks like it was hardware issues that have been fixed