Project

General

Profile

Actions

Bug #23947

open

ceph_test_rados dumped core, Error: finished tid 1 when last_acked_tid was 6

Added by David Zafman almost 6 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):


Related issues 2 (0 open2 closed)

Related to RADOS - Bug #23857: flush (manifest) vs async recovery causes out of order opCan't reproduce

Actions
Related to RADOS - Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade testDuplicate04/09/2018

Actions
Actions #1

Updated by David Zafman almost 6 years ago

Stack trace from core:


Thread 1 (Thread 0x7efc09ffb700 (LWP 15989)):
#0  0x00007efc2c322428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007efc2c32402a in __GI_abort () at abort.c:89
#2  0x00005560047aabce in WriteOp::_finish (this=0x55600584adf0, info=<optimized out>) at /build/ceph-13.0.2-1912-ge31d02d/src/test/osd/RadosModel.h:906
#3  0x00005560047c6369 in TestOp::finish (info=0x5560058080d0, this=<optimized out>) at /build/ceph-13.0.2-1912-ge31d02d/src/test/osd/RadosModel.cc:20
#4  write_callback (comp=<optimized out>, arg=0x55600587cb70) at /build/ceph-13.0.2-1912-ge31d02d/src/test/osd/RadosModel.cc:33
#5  0x00007efc35716ae6 in librados::C_AioComplete::finish(int) () from /usr/lib/librados.so.2
#6  0x00007efc356f5949 in Context::complete(int) () from /usr/lib/librados.so.2
#7  0x00007efc2c999c1e in Finisher::finisher_thread_entry() () from /usr/lib/ceph/libceph-common.so.0
#8  0x00007efc3545b6ba in start_thread (arg=0x7efc09ffb700) at pthread_create.c:333
#9  0x00007efc2c3f441d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

2018-04-29T06:55:45.486 INFO:tasks.rados.rados.0.smithi146.stderr:Error: finished tid 1 when last_acked_tid was 6

Line 906 is this ceph_abort()

  void _finish(CallbackInfo *info) override
  {
    assert(info);
    context->state_lock.Lock();
    uint64_t tid = info->id;

    cout << num << ":  finishing write tid " << tid << " to " << context->prefix + oid << std::endl;

    if (tid <= last_acked_tid) {
      cerr << "Error: finished tid " << tid
           << " when last_acked_tid was " << last_acked_tid << std::endl;
      ceph_abort();
    }
Actions #2

Updated by David Zafman almost 6 years ago

(gdb) print *this
$4 = {<TestOp> = {_vptr.TestOp = 0x5560049d8088 <vtable for WriteOp+16>, num = 2487, context = 0x7fff72564bb0, stat = 0x7fff725648b0, done = false}, oid = "110", cont = {objnum = 1532, cursnap = 0, seqnum = 1532, prefix = "smithi14615956-OID: 110 snap 0\n", oid = ""}, waiting = std::set with 4 elements = {[0] = 0x5560057c9490, [1] = 0x556005887ca0,
    [2] = 0x5560058ad980, [3] = 0x55600607aa90}, rcompletion = 0x556005826aa0, waiting_on = 1, last_acked_tid = 6, read_op = {<librados::ObjectOperation> = {<No data fields>}, <No data fields>}, write_op = {<librados::ObjectOperation> = {<No data fields>}, unused = 0x0}, rbuffer = {
    _buffers = {<std::__cxx11::_List_base<ceph::buffer::ptr, std::allocator<ceph::buffer::ptr> >> = {_M_impl = {<std::allocator<std::_List_node<ceph::buffer::ptr> >> = {<__gnu_cxx::new_allocator<std::_List_node<ceph::buffer::ptr> >> = {<No data fields>}, <No data fields>}, _M_node = {<std::__detail::_List_node_base> = {_M_next = 0x7efc1c001f90,
              _M_prev = 0x7efc1c001f90}, _M_storage = {_M_storage = "\001\000\000\000\000\000\000"}}}}, <No data fields>}, _len = 1, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {<ceph::buffer::list::iterator_impl<false>> = {<std::iterator<std::forward_iterator_tag, char, long, char*, char&>> = {<No data fields>},
        bl = 0x55600584aef8, ls = 0x55600584aef8, Python Exception <class 'gdb.error'> There is no member or method named _M_data.:
off = 0, p = , p_off = 0}, <No data fields>}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}, do_append = false, do_excl = false}
$ grep "smithi14615956-110" teuthology.log
2018-04-29T06:53:05.103 INFO:tasks.rados.rados.0.smithi146.stdout:110:  writing smithi14615956-110 from 631963 to 1263046 tid 1
2018-04-29T06:53:05.110 INFO:tasks.rados.rados.0.smithi146.stdout:110:  writing smithi14615956-110 from 2061949 to 2621968 tid 2
2018-04-29T06:53:05.116 INFO:tasks.rados.rados.0.smithi146.stdout:110:  writing smithi14615956-110 from 3243812 to 3748579 tid 3
2018-04-29T06:53:05.356 INFO:tasks.rados.rados.0.smithi146.stdout:110:  finishing write tid 1 to smithi14615956-110
2018-04-29T06:53:05.356 INFO:tasks.rados.rados.0.smithi146.stdout:110:  finishing write tid 2 to smithi14615956-110
2018-04-29T06:53:05.356 INFO:tasks.rados.rados.0.smithi146.stdout:110:  finishing write tid 3 to smithi14615956-110
2018-04-29T06:53:05.356 INFO:tasks.rados.rados.0.smithi146.stdout:110:  finishing write tid 5 to smithi14615956-110
2018-04-29T06:53:05.357 INFO:tasks.rados.rados.0.smithi146.stdout:110:  finishing write tid 6 to smithi14615956-110
2018-04-29T06:55:36.190 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  writing smithi14615956-110 from 548324 to 1216745 tid 1
2018-04-29T06:55:36.196 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  writing smithi14615956-110 from 1749002 to 2312864 tid 2
2018-04-29T06:55:36.209 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  writing smithi14615956-110 from 2937672 to 3584631 tid 3
2018-04-29T06:55:37.554 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  finishing write tid 2 to smithi14615956-110
2018-04-29T06:55:37.647 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  finishing write tid 3 to smithi14615956-110
2018-04-29T06:55:37.647 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  finishing write tid 5 to smithi14615956-110
2018-04-29T06:55:37.647 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  finishing write tid 6 to smithi14615956-110
2018-04-29T06:55:45.486 INFO:tasks.rados.rados.0.smithi146.stdout:2487:  finishing write tid 1 to smithi14615956-110
2018-04-29T06:55:45.486 INFO:tasks.rados.rados.0.smithi146.stderr:Error: finished tid 1 when last_acked_tid was 6
Actions #3

Updated by David Zafman almost 6 years ago

  • Subject changed from ceph_test_rados dumped core to ceph_test_rados dumped core, Error: finished tid 1 when last_acked_tid was 6
Actions #4

Updated by David Zafman almost 6 years ago

A 2018-04-29T06:55:36.190 INFO:tasks.rados.rados.0.smithi146.stdout:2487: writing smithi14615956-110 from 548324 to 1216745 tid 1
B 2018-04-29T06:55:36.196 INFO:tasks.rados.rados.0.smithi146.stdout:2487: writing smithi14615956-110 from 1749002 to 2312864 tid 2
C 2018-04-29T06:55:36.209 INFO:tasks.rados.rados.0.smithi146.stdout:2487: writing smithi14615956-110 from 2937672 to 3584631 tid 3

The write at 548324 "A" was processed 8 seconds later with RETRY=2.

B 2018-04-29 06:55:36.978 7fec604e3700 10 osd.3 pg_epoch: 97 pg[2.13( v 88'300 (0'0,88'300] local-lis/les=96/97 n=14 ec=96/17 lis/c 96/17 les/c/f 97/18/0 17/96/17) [3,2,4] r=0 lpr=96 pi=[17,96)/1 crt=88'300 lcod 0'0 mlcod 0'0 active mbc={}] new_repop rep_tid 2077 on osd_op(client.4339.0:9440 2.13 2:c958b3e9:::smithi14615956-110:head [write 1749002~563862 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected e96) v8

C 2018-04-29 06:55:36.978 7fec604e3700 10 osd.3 pg_epoch: 97 pg[2.13( v 97'301 (0'0,97'301] local-lis/les=96/97 n=14 ec=96/17 lis/c 96/96 les/c/f 97/97/0 17/96/17) [3,2,4] r=0 lpr=96 luod=88'300 crt=97'301 lcod 0'0 mlcod 0'0 active+clean] new_repop rep_tid 2078 on osd_op(client.4339.0:9441 2.13 2:c958b3e9:::smithi14615956-110:head [write 2937672~646959 [fadvise_dontneed]] snapc 0=[] ondisk+write+known_if_redirected e96) v8

A 2018-04-29 06:55:44.650 7fec604e3700 10 osd.3 pg_epoch: 104 pg[2.13( v 97'307 (0'0,97'307] local-lis/les=102/103 n=14 ec=96/17 lis/c 102/96 les/c/f 103/97/0 101/102/17) [3,2,1]/[3,2] async=[1] r=0 lpr=102 pi=[96,102)/1 rops=2 crt=97'307 lcod 97'306 mlcod 0'0 active+recovering+undersized+remapped mbc={255={(2+1)=8}}] new_repop rep_tid 2133 on osd_op(client.4339.0:9439 2.13 2:c958b3e9:::smithi14615956-110:head [write 548324~668421 [fadvise_dontneed]] snapc 0=[] RETRY=2 ondisk+retry+write+known_if_redirected e104) v8

Actions #5

Updated by David Zafman almost 6 years ago

  • Related to Bug #23857: flush (manifest) vs async recovery causes out of order op added
Actions #6

Updated by Kefu Chai almost 6 years ago

  • Related to Bug #23598: hammer->jewel: ceph_test_rados crashes during radosbench task in jewel rados upgrade test added
Actions #7

Updated by Kefu Chai about 5 years ago

/a/kchai-2019-01-29_02:56:52-rados-wip-kefu2-testing-2019-01-28-1728-distro-basic-mira/3521986

/a/kchai-2019-01-29_02:56:52-rados-wip-kefu2-testing-2019-01-28-1728-distro-basic-mira/3521988

Actions #8

Updated by Kefu Chai about 5 years ago

/a/kchai-2019-02-16_11:36:29-rados-wip-sage-testing-2019-02-16-1748-distro-basic-smithi/3601272

/a/kchai-2019-05-27_10:18:11-rados-wip-kefu-testing-2019-05-27-1206-distro-basic-smithi/3983050

Actions #9

Updated by Kefu Chai about 4 years ago

2020-03-24T16:51:52.628 INFO:tasks.rados.rados.0.smithi167.stdout:5179:  finishing write tid 5 to smithi16730471-110
2020-03-24T16:51:52.628 INFO:tasks.rados.rados.0.smithi167.stdout:5179:  finishing write tid 2 to smithi16730471-110
2020-03-24T16:51:52.628 INFO:tasks.rados.rados.0.smithi167.stderr:Error: finished tid 2 when last_acked_tid was 5
2020-03-24T16:51:52.628 INFO:tasks.rados.rados.0.smithi167.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/
gigantic/release/15.1.1-346-g983ae03/rpm/el8/BUILD/ceph-15.1.1-346-g983ae03/src/test/osd/RadosModel.h: In function 'virtual void WriteOp::_finish(TestOp::CallbackInfo*)' thread 7fabea7fc700 time 2020-03-2
4T16:51:52.620939+0000
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/
gigantic/release/15.1.1-346-g983ae03/rpm/el8/BUILD/ceph-15.1.1-346-g983ae03/src/test/osd/RadosModel.h: 907: ceph_abort_msg("abort() called")
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr: ceph version 15.1.1-346-g983ae03 (983ae031b114222849ba5d8e81e4e984a5f8d8f8) octopus (rc)
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr: 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > con
st&)+0xe5) [0x7fac05b9e310]
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr: 2: (WriteOp::_finish(TestOp::CallbackInfo*)+0x666) [0x557f6dbd9da6]
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr: 3: (write_callback(void*, void*)+0x1d) [0x557f6dbfa29d]
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr: 4: (()+0x9e44e) [0x7fac0f0b444e]
2020-03-24T16:51:52.629 INFO:tasks.rados.rados.0.smithi167.stderr: 5: (()+0x619ed) [0x7fac0f0779ed]
2020-03-24T16:51:52.630 INFO:tasks.rados.rados.0.smithi167.stderr: 6: (Finisher::finisher_thread_entry()+0x1a5) [0x7fac05c2e6d5]
2020-03-24T16:51:52.630 INFO:tasks.rados.rados.0.smithi167.stderr: 7: (()+0x82de) [0x7fac0522d2de]

/a/kchai-2020-03-24_03:31:20-rados-wip-kefu-testing-2020-03-23-2313-distro-basic-smithi/4886049

Actions

Also available in: Atom PDF