Bug #20981 (closed): ./run_seed_to_range.sh errored out

Added by Sage Weil over 6 years ago. Updated over 6 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2017-08-10T21:53:13.678 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150437 7f65edd72e00  5 ceph_test_objectstore_state give collection hint, number of objects per collection: 0
2017-08-10T21:53:13.698 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150440 7f65edd72e00  5 filestore(b.00) queue_transactions(2167): new 0x56288b527520 osr(0.16_head 0x56288b40bc08)
2017-08-10T21:53:13.701 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150463 7f65edd72e00  5 filestore(b.00) queue_transactions(2215): (writeahead) 19 [Transaction(0x56288b527380)]
2017-08-10T21:53:13.705 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150468 7f65edd72e00  5 ceph_test_objectstore_state init create collection 1.11_head meta #-1:50a5dd44:::pglog_0.17_head:head#
2017-08-10T21:53:13.743 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150472 7f65edd72e00  5 ceph_test_objectstore_state give collection hint, number of objects per collection: 0
2017-08-10T21:53:13.760 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150476 7f65edd72e00  5 filestore(b.00) queue_transactions(2167): new 0x56288b527860 osr(0.17_head 0x56288b40bd88)
2017-08-10T21:53:13.806 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150497 7f65edd72e00  5 filestore(b.00) queue_transactions(2215): (writeahead) 20 [Transaction(0x56288b5276c0)]
2017-08-10T21:53:13.824 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150503 7f65edd72e00  5 ceph_test_objectstore_state init create collection 1.12_head meta #-1:5de4d004:::pglog_0.18_head:head#
2017-08-10T21:53:13.867 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150507 7f65edd72e00  5 ceph_test_objectstore_state give collection hint, number of objects per collection: 0
2017-08-10T21:53:13.875 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150511 7f65edd72e00  5 filestore(b.00) queue_transactions(2167): new 0x56288b527ba0 osr(0.18_head 0x56288b40bf08)
2017-08-10T21:53:13.882 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150522 7f65edd72e00  5 filestore(b.00) queue_transactions(2215): (writeahead) 21 [Transaction(0x56288b527a00)]
2017-08-10T21:53:13.889 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150526 7f65edd72e00  5 ceph_test_objectstore_state init create collection 1.13_head meta #-1:54555bf4:::pglog_0.19_head:head#
2017-08-10T21:53:13.945 INFO:teuthology.orchestra.run.smithi198.stderr:*** Caught signal (Aborted) **
2017-08-10T21:53:14.032 INFO:teuthology.orchestra.run.smithi198.stderr: in thread 7f65df2a2700 thread_name:tp_fstore_op
2017-08-10T21:53:14.039 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 21:36:29.150530 7f65edd72e00  5 ceph_test_objectstore_state give collection hint, number of objects per collection: 0
2017-08-10T21:53:14.047 INFO:teuthology.orchestra.run.smithi198.stderr:./run_seed_to.sh: line 288: 31178 Aborted                 $v ceph_test_filestore_idempotent_sequence run-sequence-to $stop_at $tmp_name_b $tmp_name_b/journal --test-seed $seed --osd-journal-size 100 --log-file $tmp_name_b.clean --debug-filestore 20 $tmp_opts_b
2017-08-10T21:53:14.080 INFO:teuthology.orchestra.run.smithi198.stdout:OK
2017-08-10T21:53:14.090 INFO:teuthology.orchestra.run.smithi198.stdout:run #2
2017-08-10T21:53:14.118 INFO:teuthology.orchestra.run.smithi198.stdout:stopped at 0
2017-08-10T21:53:14.150 INFO:teuthology.orchestra.run.smithi198.stdout:OK
2017-08-10T21:53:14.157 INFO:teuthology.orchestra.run.smithi198.stdout:run #3

/a/sage-2017-08-10_19:22:47-rados-wip-sage-testing-20170810a-distro-basic-smithi/1507719
Actions #1

Updated by Sage Weil over 6 years ago

Super weird.. looks like a race between heartbeat timeout and a failure injection maybe?

2017-08-10T20:58:20.040 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:58.852322 7f240d533700 10 filestore(a.00) _set_replay_guard(2453): 3.0.0
2017-08-10T20:58:20.043 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:58.852325 7f240d533700  5 filestore(a.00) _inject_failure(5461): 52 -> 51
2017-08-10T20:58:20.045 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:58.885570 7f240d533700 20 filestore dbobjectmap: seq is 1
...
2017-08-10T20:58:20.170 INFO:teuthology.orchestra.run.smithi198.stderr:*** Caught signal (Aborted) **
2017-08-10T20:58:20.173 INFO:teuthology.orchestra.run.smithi198.stderr: in thread 7f240d533700 thread_name:tp_fstore_op
2017-08-10T20:58:20.180 INFO:teuthology.orchestra.run.smithi198.stderr:/build/ceph-12.1.2-648-g2a98a4e/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f240e535700 time 2017-08-10 20:28:03.897264
2017-08-10T20:58:20.183 INFO:teuthology.orchestra.run.smithi198.stderr:/build/ceph-12.1.2-648-g2a98a4e/src/common/HeartbeatMap.cc: 84: FAILED assert(0 == "hit suicide timeout")
...
2017-08-10T20:58:20.387 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:58.935792 7f240d533700  5 filestore(a.00) _inject_failure(5461): 49 -> 48

as in that thread continues...
...
2017-08-10T20:58:21.411 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:59.556080 7f240d533700  5 filestore(a.00) _inject_failure(5461): 2 -> 1
2017-08-10T20:58:21.413 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:59.556090 7f240d533700 10 filestore(a.00) _set_replay_guard(2485): 7.0.0 done
2017-08-10T20:58:21.415 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:59.556112 7f240d533700 10 filestore(a.00) _set_replay_guard(2453): 7.0.0
2017-08-10T20:58:21.418 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:59.556115 7f240d533700  5 filestore(a.00) _inject_failure(5461): 1 -> 0
2017-08-10T20:58:21.420 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:24:59.556117 7f240d533700 -1 filestore(a.00) _inject_failure(5463): KILLING
...
2017-08-10T20:58:21.442 INFO:teuthology.orchestra.run.smithi198.stderr: ceph version 12.1.2-648-g2a98a4e (2a98a4e4154bff1d5b670c0b0d1b5294048537b7) luminous (rc)
2017-08-10T20:58:21.445 INFO:teuthology.orchestra.run.smithi198.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f24128a75a2]
2017-08-10T20:58:21.448 INFO:teuthology.orchestra.run.smithi198.stderr: 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x46b) [0x7f2412b1939b]
2017-08-10T20:58:21.452 INFO:teuthology.orchestra.run.smithi198.stderr: 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0x7f2412b19a26]
2017-08-10T20:58:21.458 INFO:teuthology.orchestra.run.smithi198.stderr: 4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f2412b1a1bc]
2017-08-10T20:58:21.465 INFO:teuthology.orchestra.run.smithi198.stderr: 5: (CephContextServiceThread::entry()+0x167) [0x7f2412b04427]
2017-08-10T20:58:21.472 INFO:teuthology.orchestra.run.smithi198.stderr: 6: (()+0x76ba) [0x7f241b9e56ba]
2017-08-10T20:58:21.478 INFO:teuthology.orchestra.run.smithi198.stderr: 7: (clone()+0x6d) [0x7f241156c82d]
2017-08-10T20:58:21.485 INFO:teuthology.orchestra.run.smithi198.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

then other threads continue but stall out, until
2017-08-10T20:58:22.687 INFO:teuthology.orchestra.run.smithi198.stderr:2017-08-10 20:25:00.745805 7f240dd34700 15 filestore(a.00) _touch(3484): meta/#-1:5ccce178:::pglog_0.10_head:head#
2017-08-10T20:58:22.690 INFO:teuthology.orchestra.run.smithi198.stderr: ceph version 12.1.2-648-g2a98a4e (2a98a4e4154bff1d5b670c0b0d1b5294048537b7) luminous (rc)
2017-08-10T20:58:22.692 INFO:teuthology.orchestra.run.smithi198.stderr: 1: (()+0x402e24) [0x55e694e33e24]
2017-08-10T20:58:22.695 INFO:teuthology.orchestra.run.smithi198.stderr: 2: (()+0x11390) [0x7f241b9ef390]
2017-08-10T20:58:22.697 INFO:teuthology.orchestra.run.smithi198.stderr: 3: (()+0x1026d) [0x7f241b9ee26d]
2017-08-10T20:58:22.699 INFO:teuthology.orchestra.run.smithi198.stderr: 4: (pthread_mutex_lock()+0x7d) [0x7f241b9e7dbd]
2017-08-10T20:58:22.702 INFO:teuthology.orchestra.run.smithi198.stderr: 5: (ceph::logging::Log::flush()+0x2d) [0x7f24128cba2d]
2017-08-10T20:58:22.704 INFO:teuthology.orchestra.run.smithi198.stderr: 6: (FileStore::_inject_failure()+0x349) [0x55e694bf01e9]
2017-08-10T20:58:22.706 INFO:teuthology.orchestra.run.smithi198.stderr: 7: (FileStore::_set_replay_guard(int, SequencerPosition const&, ghobject_t const*, bool)+0x9e) [0x55e694bf5e4e]
2017-08-10T20:58:22.709 INFO:teuthology.orchestra.run.smithi198.stderr: 8: (FileStore::_set_replay_guard(coll_t const&, SequencerPosition const&, bool)+0x6b) [0x55e694bf650b]
2017-08-10T20:58:22.711 INFO:teuthology.orchestra.run.smithi198.stderr: 9: (FileStore::_create_collection(coll_t const&, int, SequencerPosition const&)+0x586) [0x55e694bf6d26]
2017-08-10T20:58:22.714 INFO:teuthology.orchestra.run.smithi198.stderr: 10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0x1184) [0x55e694c20a04]
2017-08-10T20:58:22.722 INFO:teuthology.orchestra.run.smithi198.stderr: 11: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55e694c2620b]
2017-08-10T20:58:22.724 INFO:teuthology.orchestra.run.smithi198.stderr: 12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x671) [0x55e694c268b1]

so.. maybe the log flush is crashing because it raced against the heartbeat timeout, which was trying to crash? :/
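
To make the suspected interaction concrete, here is a minimal standalone sketch of that kind of race (invented names and timings, not Ceph code): one thread is busy inside a slow, mutex-protected log flush, and the thread that wants to crash has to take the same mutex to get its message out before aborting, so the abort is delayed while the rest of the process keeps running and logging.

// Minimal sketch of the suspected hazard (invented names, not Ceph code):
// the crash path needs the logging mutex, another thread holds it in a slow
// flush, so the abort is delayed while unrelated threads keep making progress.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <thread>

std::mutex log_mutex;               // stands in for the log queue lock

void slow_flush() {                 // stands in for a slow log flush
    std::lock_guard<std::mutex> l(log_mutex);
    std::this_thread::sleep_for(std::chrono::seconds(10));  // slow disk write
    std::puts("flush done");
}

void watchdog() {                   // stands in for the suicide-timeout path
    std::this_thread::sleep_for(std::chrono::seconds(1));
    std::puts("watchdog fired, wants to log and abort");
    std::lock_guard<std::mutex> l(log_mutex);   // blocks behind slow_flush()
    std::puts("got the lock, aborting now");
    std::abort();                   // only fires ~9s after it "decided" to die
}

int main() {
    std::thread t1(slow_flush);     // e.g. the op_tp thread logging _inject_failure
    std::thread t2(watchdog);       // e.g. the heartbeat/service thread
    t1.join();
    t2.join();                      // not reached once abort() runs
}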

Actions #2

Updated by Kefu Chai over 6 years ago

/a/kchai-2017-08-18_03:03:28-rados-master-distro-basic-mira/1537335


2017-08-18T03:18:37.115 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.190012 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e8400 seq 3 osr(0.0_head 0x7f4f044c0660)/0x7f4f044c0660 start

2017-08-18T03:18:37.119 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.190037 7f4ef298a700 10 filestore(b.00) _do_transaction(2622): on 0x7f4f046ea780

2017-08-18T03:18:38.068 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.194498 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e8280 seq 5 osr(0.2_head 0x7f4f044c0d20)/0x7f4f044c0d20 start

2017-08-18T03:18:38.744 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.198513 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e8100 seq 7 osr(0.4_head 0x7f4f044c0f60)/0x7f4f044c0f60 start

2017-08-18T03:18:39.074 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.202595 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e7f80 seq 9 osr(0.6_head 0x7f4f044c11a0)/0x7f4f044c11a0 start

2017-08-18T03:18:39.637 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.206256 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e7e00 seq 11 osr(0.8_head 0x7f4f044c13e0)/0x7f4f044c13e0 start

2017-08-18T03:18:40.028 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.210118 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e89a0 seq 14 osr(0.11_head 0x7f4f044c1740)/0x7f4f044c1740 start

2017-08-18T03:18:43.758 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.236944 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e9360 seq 27 osr(0.24_head 0x7f4f044c3d80)/0x7f4f044c3d80 start

2017-08-18T03:18:44.067 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.241679 7f4ef298a700  5 filestore(b.00) _do_op(2081): 0x7f4f046e8640 seq 29 osr(0.26_head 0x7f4f048440c0)/0x7f4f048440c0 start
2017-08-18T03:18:44.070 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:13:05.241684 7f4ef298a700 10 filestore(b.00) _do_transaction(2622): on 0x7f4f046ece80

2017-08-18T03:18:44.411 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:14:07.926513 7f4ef298a700 10 filestore(b.00) _set_replay_guard(2485): 29.0.1 done

2017-08-18T03:18:44.414 INFO:teuthology.orchestra.run.mira111.stderr:2017-08-18 03:14:07.926537 7f4ef298a700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7f4ef298a700' had timed out after 60

The weird thing is that thread 0x7f4ef298a700 should have reset "h->timeout" when it was about to handle new ops, but it turns out it didn't.
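
For reference, this is the pattern being talked about, as a simplified model with invented names (not the real HeartbeatMap/WorkQueue API): each op_tp worker is supposed to push its grace and suicide deadlines forward before handling every op, so a deadline can only end up far in the past if that reset is skipped or the thread stalls between ops.

// Simplified model of the reset pattern (invented names, not the real API).
#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

struct HeartbeatHandle {
    Clock::time_point grace_deadline{};    // the "h->timeout" mentioned above
    Clock::time_point suicide_deadline{};
};

// Corresponds to the "reset_timeout ... grace 60 suicide 180" log lines.
void reset_timeout(HeartbeatHandle& h,
                   std::chrono::seconds grace,
                   std::chrono::seconds suicide_grace) {
    auto now = Clock::now();
    if (h.grace_deadline != Clock::time_point{} && now > h.grace_deadline)
        std::puts("had timed out");        // the warning seen in the log above
    h.grace_deadline = now + grace;
    h.suicide_deadline = now + suicide_grace;
}

void worker_loop(HeartbeatHandle& h) {
    for (int i = 0; i < 3; ++i) {
        reset_timeout(h, std::chrono::seconds(60), std::chrono::seconds(180));
        // ... handle one queued op (_do_op / _do_transaction) ...
        // If this op, or the wait for the next one, takes longer than the
        // grace period, the next reset_timeout() notices the stale deadline.
    }
}

int main() {
    HeartbeatHandle h;
    worker_loop(h);
}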

Actions #3

Updated by Kefu Chai over 6 years ago

/a//kchai-2017-08-21_01:51:35-rados-master-distro-basic-smithi/1545907/teuthology.log has debug heartbeatmap = 20.

 -6> 2017-08-21 02:17:56.536847 7fb7a95c9700 10 filestore(b) _set_replay_guard(2485): 23.0.1 done
 -5> 2017-08-21 02:17:57.182800 7fb7a4dc0700 20 filestore(b) sync_entry(3982): woke after 5.000160
 -4> 2017-08-21 02:18:01.114250 7fb7aa5cb700 20 heartbeat_map is_healthy = healthy, total workers: 2, number of unhealthy: 0
 -3> 2017-08-21 02:37:33.744776 7fb7a95c9700 20 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7fb7a95c9700' grace 60 suicide 180
 -2> 2017-08-21 02:37:33.744790 7fb7a95c9700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7fb7a95c9700' had timed out after 60
 -1> 2017-08-21 02:37:33.744794 7fb7a95c9700  1 heartbeat_map reset_timeout 'FileStore::op_tp thread 0x7fb7a95c9700' had suicide timed out after 180

In this log, thread 7fb7a95c9700 checked the suicide timeout only after 21 minutes, and during those 21 minutes nothing happened at all... weird enough...
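
The arithmetic once the deadlines go stale is simple: the check runs as part of the next reset (the core dump in note #6 below shows the assert firing from inside reset_timeout via _check), and it compares the current time against deadlines that were last pushed forward around 02:18, so both the 60 s grace and the 180 s suicide deadlines are long gone. A simplified model with invented names, not the real _check() code:

// Simplified model of the check that fires here (invented names).
#include <cassert>
#include <cstdio>
#include <ctime>

struct Deadlines {
    std::time_t grace_deadline;    // last reset + 60
    std::time_t suicide_deadline;  // last reset + 180
};

// Warn on the grace deadline, assert on the suicide deadline -- the same two
// messages that appear back to back in the log dump above.
void check(const Deadlines& d, std::time_t now) {
    if (d.grace_deadline && now > d.grace_deadline)
        std::puts("had timed out after 60");
    if (d.suicide_deadline && now > d.suicide_deadline) {
        std::puts("had suicide timed out after 180");
        assert(0 == "hit suicide timeout");
    }
}

int main() {
    std::time_t last_reset = 0;                  // around 02:18 in the log
    Deadlines d{last_reset + 60, last_reset + 180};
    check(d, last_reset + 20 * 60);              // 02:37:33 is far past both deadlines
}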

Actions #4

Updated by Josh Durgin over 6 years ago

  • Assignee set to David Zafman

David, can you take a look? This seems to be showing up pretty consistently in rados runs.

Actions #6

Updated by David Zafman over 6 years ago

The stack trace from the core dump doesn't include any thread with _inject_failure() in it.

For core dump in /a/kchai-2017-08-18_03:03:28-rados-master-distro-basic-mira/1537335

(gdb) thread apply all bt

Thread 12 (Thread 0x7f512c10e700 (LWP 30792)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513222c671 in Wait (mutex=..., this=0x7f513e75e480) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  Finisher::finisher_thread_entry (this=0x7f513e75e420) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Finisher.cc:101
#3  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f5128907700 (LWP 30786)):
#0  0x00007f513bb871bd in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f513bb82d02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f513bb82c08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f51322010f8 in Mutex::Lock (this=this@entry=0x7f513ea20b28, no_lockdep=no_lockdep@entry=false) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Mutex.cc:107
#4  0x00007f5132230b4f in ThreadPool::pause (this=this@entry=0x7f513ea20b00) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.cc:226
#5  0x00007f513c9c4350 in FileStore::sync_entry (this=0x7f513ea20000) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileStore.cc:3997
#6  0x00007f513c9f4cdd in FileStore::SyncThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileStore.h:211
#7  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f5129108700 (LWP 30787)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513c99b2ee in Cond::Wait (this=0x7f513ea52148, mutex=...) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  0x00007f513cb4e7a5 in FileJournal::write_thread_entry (this=0x7f513ea52000) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileJournal.cc:1210
#3  0x00007f513c9e923d in FileJournal::Writer::entry (this=<optimized out>) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileJournal.h:374
#4  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f5129909700 (LWP 30788)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513222c671 in Wait (mutex=..., this=0x7f513ea20080) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  Finisher::finisher_thread_entry (this=0x7f513ea20020) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Finisher.cc:101
#3  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f512e112700 (LWP 30752)):
#0  0x00007f513bb84a82 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513245c9fa in WaitUntil (when=..., mutex=..., this=0x7f513e766150) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:64
#2  WaitInterval (interval=..., mutex=..., this=0x7f513e766150) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:73
#3  CephContextServiceThread::entry (this=0x7f513e7660d0) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/ceph_context.cc:135
#4  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f5128106700 (LWP 30785)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513ca0ee7f in Wait (mutex=..., this=0x7f513ea20830) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  WBThrottle::get_next_should_flush (this=this@entry=0x7f513ea20708, next=next@entry=0x7f5128104830) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/WBThrottle.cc:139
#3  0x00007f513ca0f53a in WBThrottle::entry (this=0x7f513ea20708) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/WBThrottle.cc:157
#4  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f512d911700 (LWP 30789)):
#0  0x00007f5130290ce2 in _dl_addr () from /lib64/libc.so.6
#1  0x00007f513026a862 in backtrace_symbols () from /lib64/libc.so.6
#2  0x00007f513cbb546e in BackTrace (s=0, this=0x7f512d90e2e0) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/BackTrace.h:24
#3  handle_fatal_signal (signum=6) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/global/signal_handler.cc:117
#4  <signal handler called>
#5  0x00007f513bb85741 in pthread_kill () from /lib64/libpthread.so.0
#6  0x00007f5132471398 in ceph::HeartbeatMap::_check (this=this@entry=0x7f513e7b3800, h=h@entry=0x7f513ea563c0, who=who@entry=0x7f51326b3636 "reset_timeout", now=now@entry=1503027560) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/HeartbeatMap.cc:82
#7  0x00007f51324717f5 in ceph::HeartbeatMap::reset_timeout (this=0x7f513e7b3800, h=0x7f513ea563c0, grace=60, suicide_grace=180) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/HeartbeatMap.cc:94
#8  0x00007f51322307f8 in ThreadPool::TPHandle::reset_tp_timeout (this=this@entry=0x7f512d90f660) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.cc:56
#9  0x00007f5132236ca1 in ThreadPool::worker (this=0x7f513ea20b00, wt=0x7f513ea56360) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.cc:118
#10 0x00007f5132237be0 in ThreadPool::WorkThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.h:448
#11 0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f512b90d700 (LWP 30793)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f5132229daf in Wait (mutex=..., this=0x7f513ea205e8) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  SafeTimer::timer_thread (this=0x7f513ea205d8) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Timer.cc:108
#3  0x00007f513222b3dd in SafeTimerThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Timer.cc:30
#4  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f513c7d0d00 (LWP 30750)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513c99b4cc in Wait (mutex=..., this=0x7ffe52674d60) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  TestObjectStoreState::wait_for_done (this=this@entry=0x7ffe52674c98) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/test/objectstore/TestObjectStoreState.h:81
#3  0x00007f513c99ab5b in TestObjectStoreState::init (this=this@entry=0x7ffe52674c98, colls=30, objs=0) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/test/objectstore/TestObjectStoreState.cc:81
#4  0x00007f513c98ee04 in run_sequence_to (val=<optimized out>, filestore_path="b.01", journal_path="b.01/journal") at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/test/objectstore/test_idempotent_sequence.cc:160
#5  0x00007f513c98f01d in run_command (command="run-sequence-to", args=std::vector of length 3, capacity 4 = {...}) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/test/objectstore/test_idempotent_sequence.cc:190
#6  0x00007f513c8fbd86 in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/test/objectstore/test_idempotent_sequence.cc:247

Thread 3 (Thread 0x7f512c90f700 (LWP 30791)):
#0  0x00007f513bb846d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f513222c671 in Wait (mutex=..., this=0x7f513e75e320) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Cond.h:48
#2  Finisher::finisher_thread_entry (this=0x7f513e75e2c0) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/Finisher.cc:101
#3  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f512f085700 (LWP 30751)):
#0  0x00007f513bb8743d in write () from /lib64/libpthread.so.0
#1  0x00007f51325d4430 in safe_write (fd=3, buf=0x7f512f083720, count=110) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/safe_io.c:59
#2  0x00007f5132250ce0 in ceph::logging::Log::_flush (this=this@entry=0x7f513e9bc000, t=t@entry=0x7f512f0838d0, requeue=requeue@entry=0x7f513e9bc110, crash=crash@entry=false) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/log/Log.cc:334
#3  0x00007f5132251059 in ceph::logging::Log::flush (this=this@entry=0x7f513e9bc000) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/log/Log.cc:274
#4  0x00007f51322512ce in ceph::logging::Log::entry (this=0x7f513e9bc000) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/log/Log.cc:448
#5  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f513025676d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f512d110700 (LWP 30790)):
#0  0x00007f513bb85741 in pthread_kill () from /lib64/libpthread.so.0
#1  0x00007f5132471398 in ceph::HeartbeatMap::_check (this=this@entry=0x7f513e7b3800, h=h@entry=0x7f513ea56240, who=who@entry=0x7f51326b3636 "reset_timeout", now=now@entry=1503027560) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/HeartbeatMap.cc:82
#2  0x00007f51324717f5 in ceph::HeartbeatMap::reset_timeout (this=0x7f513e7b3800, h=0x7f513ea56240, grace=60, suicide_grace=180) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/HeartbeatMap.cc:94
#3  0x00007f51322307f8 in ThreadPool::TPHandle::reset_tp_timeout (this=this@entry=0x7f512d10e660) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.cc:56
#4  0x00007f513c9df29a in FileStore::_do_transaction (this=this@entry=0x7f513ea20000, t=..., op_seq=op_seq@entry=26, trans_num=trans_num@entry=0, handle=handle@entry=0x7f512d10e660) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileStore.cc:2633
#5  0x00007f513c9e4bbb in FileStore::_do_transactions (this=this@entry=0x7f513ea20000, tls=std::vector of length 1, capacity 1 = {...}, op_seq=26, handle=handle@entry=0x7f512d10e660) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileStore.cc:2340
#6  0x00007f513c9e4fea in FileStore::_do_op (this=0x7f513ea20000, osr=0x7f513e761440, handle=...) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/os/filestore/FileStore.cc:2083
#7  0x00007f5132236cbe in ThreadPool::worker (this=0x7f513ea20b00, wt=0x7f513ea56db0) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.cc:120
#8  0x00007f5132237be0 in ThreadPool::WorkThread::entry (this=<optimized out>) at /usr/src/debug/ceph-12.1.2-840-g1b21476/src/common/WorkQueue.h:448
#9  0x00007f513bb80dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007f513025676d in clone () from /lib64/libc.so.6
Actions #7

Updated by Josh Durgin over 6 years ago

This is occurring in the current jewel branch now too:

https://github.com/ceph/ceph/pull/17317#issuecomment-325580432

Actions #8

Updated by David Zafman over 6 years ago

I reverted https://github.com/ceph/ceph/pull/15947 to see if that would fix it and it did NOT.

Actions #9

Updated by Nathan Cutler over 6 years ago

David, the jewel failure started occurring in the integration branch that included the following PRs: http://tracker.ceph.com/issues/20613#note-2

To me that's a strong indication the culprit (in jewel) is one of the PRs listed there.

Actions #11

Updated by David Zafman over 6 years ago

I looked at https://github.com/ceph/ceph/pull/15050 and don't see anything that would cause this issue.

Actions #12

Updated by David Zafman over 6 years ago

Are we sure it isn't http://tracker.ceph.com/issues/20613#note-24? Because the dead runs here http://pulpito.ceph.com/smithfarm-2017-08-28_13:25:20-rados-wip-jewel-backports-distro-basic-smithi/ have sha1 fe8b99d8aec719b5dd9aacaf126db1854f34cdc8.

Actions #13

Updated by Nathan Cutler over 6 years ago

David, the first dead job to appear was http://pulpito.ceph.com/smithfarm-2017-08-21_19:38:42-rados-wip-jewel-backports-distro-basic-smithi/1547855/

The dead jobs are in the run you cited, too, but that's (presumably) because the culprit PR was already merged into jewel (that's why they appear on all jewel integration runs now).

Actions #15

Updated by David Zafman over 6 years ago

This bug was filed because the ceph_test_filestore_idempotent_sequence wasn't completing the _exit() in _inject_failure().

2017-08-10T20:58:27.052 INFO:teuthology.orchestra.run.smithi198.stderr:     0> 2017-08-10 20:31:07.705941 7f240d533700 -1 *** Caught signal (Aborted) **
2017-08-10T20:58:27.740 INFO:teuthology.orchestra.run.smithi198.stderr:./run_seed_to.sh: line 288: 20464 Aborted                 $v ceph_test_filestore_idempotent_sequence run-sequence-to $to $tmp_name_a $tmp_name_a/journal --test-seed $seed --osd-journal-size 100 --filestore-kill-at $killat $tmp_opts_a --log-file $tmp_name_a.fail --debug-filestore 20

One thing that has confused the issue is that the test can run and pass but still have asserts in the log:

For example, these asserts happen, but the program doesn't terminate with a signal.

http://qa-proxy.ceph.com/teuthology/dzafman-2017-09-08_20:45:27-rados-wip-zafman-cmnt-24-distro-basic-smithi/1612686/teuthology.log

$ grep assert /a/dzafman-2017-09-08_20:45:27-rados-wip-zafman-cmnt-24-distro-basic-smithi/1612686/teuthology.log
2017-09-09T07:27:36.598 INFO:teuthology.orchestra.run.smithi051.stderr:common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
2017-09-09T07:27:36.630 INFO:teuthology.orchestra.run.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f5498679d1b]
2017-09-09T07:27:40.617 INFO:teuthology.orchestra.run.smithi051.stderr:common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
2017-09-09T07:27:40.623 INFO:teuthology.orchestra.run.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f5498679d1b]
2017-09-09T07:27:42.331 INFO:teuthology.orchestra.run.smithi051.stderr:common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
2017-09-09T07:27:42.338 INFO:teuthology.orchestra.run.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7f5498679d1b]
2017-09-09T12:37:16.262 INFO:teuthology.orchestra.run.smithi051.stderr:common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
2017-09-09T12:37:16.278 INFO:teuthology.orchestra.run.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fc08e8b0d1b]
2017-09-09T12:37:21.179 INFO:teuthology.orchestra.run.smithi051.stderr:common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
2017-09-09T12:37:21.187 INFO:teuthology.orchestra.run.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fc08e8b0d1b]
2017-09-09T12:37:24.233 INFO:teuthology.orchestra.run.smithi051.stderr:common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
2017-09-09T12:37:24.242 INFO:teuthology.orchestra.run.smithi051.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x7fc08e8b0d1b]

$ tail -10 /a/dzafman-2017-09-08_20:45:27-rados-wip-zafman-cmnt-24-distro-basic-smithi/1612686/teuthology.log
2017-09-09T14:01:38.193 DEBUG:teuthology.run_tasks:Unwinding manager console_log
2017-09-09T14:01:38.215 DEBUG:teuthology.run_tasks:Unwinding manager internal.lock_machines
2017-09-09T14:01:38.236 INFO:teuthology.task.internal.lock_machines:Unlocking machines...
2017-09-09T14:01:38.278 INFO:teuthology.lock.ops:unlocked smithi051.front.sepia.ceph.com
2017-09-09T14:01:38.459 INFO:teuthology.run:Summary data:
{description: rados/objectstore/filestore-idempotent-aio-journal.yaml, duration: 36599.97275495529,
  flavor: basic, owner: scheduled_dzafman@TrustyTahr, success: true}

2017-09-09T14:01:38.463 DEBUG:teuthology.report:Pushing job info to http://paddles.front.sepia.ceph.com/
2017-09-09T14:01:38.512 INFO:teuthology.run:pass

Some of the test runs assumed to be caused by this tracker are actually ones that ran too long and have only the expected assert(0 == "hit suicide timeout") in the log. They do not have the "Caught signal (Aborted)" with a final failure indication.

$ tail -4 /a/smithfarm-2017-09-06_16:01:34-rados-wip-jewel-backports-distro-basic-smithi/1601722/teuthology.log
2017-09-07T06:25:19.105 INFO:teuthology.orchestra.run.smithi143.stderr:2017-09-07 06:10:09.885513 7ff0b77484c0 10 filestore(a.01) _check_replay_guard object has 22.0.0 < current pos 22.0.1, in past, will replay
2017-09-07T06:25:19.218 DEBUG:teuthology.exit:Got signal 15; running 1 handler...
2017-09-07T06:25:19.248 DEBUG:teuthology.task.console_log:Killing console logger for smithi143
2017-09-07T06:25:19.284 DEBUG:teuthology.exit:Finished running handlers

Is there a longstanding race between _exit() and the assert(0 == "hit suicide timeout")?
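
As a toy illustration of that race (invented code, not the real FileStore/HeartbeatMap implementation): one thread performs the injected kill with _exit(), another hits the watchdog assert and raises SIGABRT; whichever wins decides whether run_seed_to.sh sees a normal exit status or "Aborted", even though the injected kill was going to end the run anyway.

// Toy race between an injected _exit() and a watchdog assert (invented code).
#include <cassert>
#include <chrono>
#include <thread>
#include <unistd.h>     // _exit

int main() {
    // Stand-in for the injected failure: the kill-at counter reached zero,
    // so terminate immediately without running any cleanup.
    std::thread injected_kill([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        _exit(0);
    });

    // Stand-in for the suicide-timeout path: an assert that raises SIGABRT.
    // If this runs first, the driving script sees "Aborted" instead of the
    // exit status it was expecting from the injected kill.
    std::thread watchdog([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        assert(0 == "hit suicide timeout");
    });

    injected_kill.join();
    watchdog.join();    // not reached if _exit() or the assert fires first
}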

Actions #16

Updated by David Zafman over 6 years ago

  • Priority changed from Urgent to High

I'm lowering the priority because it appears to me to be a test code issue. The test doesn't detect the failure it is looking for.

Actions #17

Updated by David Zafman over 6 years ago

I was never able to reproduce this with the following command line test.

rm -rf /tmp/td td ; mkdir /tmp/td td ; cd td ; PATH=$PATH:../bin ../../src/test/objectstore/run_seed_to_range.sh 56 50 300 /tmp/td 2>&1 | tee my.log ; echo $? ; cd ..

Actions #18

Updated by Sage Weil over 6 years ago

  • Status changed from 12 to Can't reproduce

this failure seems to have gone away..
