Bug #50119
Closed
Invalid read of size 4 in ceph::logging::Log::dump_recent()
Description
The test does the two writes, creates the digest, deletes the pool, and submits the final op, but that op hangs.
/a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373
client log is ./remote/smithi122/log/ceph-client.admin.39306.log.gz
osd.2 gets as far as:
2021-04-02T17:35:02.897+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_op: op osd_op(client.14672.0:7610 212.c 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e655) v8
2021-04-02T17:35:02.897+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] op_has_sufficient_caps session=0x203eee70 pool=212 (test-rados-api-smithi122-39306-83 test-rados-api-smithi122-39306-84) pool_app_metadata={rados={}} need_read_cap=0 need_write_cap=0 classes=[] -> yes
2021-04-02T17:35:02.897+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_op osd_op(client.14672.0:7610 212.c 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e655) v8 may_cache -> write-ordered flags ondisk+read+ignore_cache+known_if_redirected
2021-04-02T17:35:02.897+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: found obc in cache: 0x20b7fff0
2021-04-02T17:35:02.898+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: 0x20b7fff0 212:34762567:test-rados-api-smithi122-39306-84::foo:head rwstate(none n=0 w=0) oi: 212:34762567:test-rados-api-smithi122-39306-84::foo:head(653'2 client.14672.0:7608 dirty|data_digest|manifest s 8192 uv 1 dd ff44a93c alloc_hint [0 0 0] manifest(chunked {0=(len: 2 oid: 78:08bc4ac3:test-rados-api-smithi122-39306-84::bar:0 offset: 0 flags: has_reference)})) exists: 1 ssc: 0x22971230 snapset: 0=[]:{}
2021-04-02T17:35:02.898+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] find_object_context 212:34762567:test-rados-api-smithi122-39306-84::foo:head @head oi=212:34762567:test-rados-api-smithi122-39306-84::foo:head(653'2 client.14672.0:7608 dirty|data_digest|manifest s 8192 uv 1 dd ff44a93c alloc_hint [0 0 0] manifest(chunked {0=(len: 2 oid: 78:08bc4ac3:test-rados-api-smithi122-39306-84::bar:0 offset: 0 flags: has_reference)}))
2021-04-02T17:35:02.899+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_op obc obc(212:34762567:test-rados-api-smithi122-39306-84::foo:head rwstate(write n=1 w=0))
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] execute_ctx 0x47642e30
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] execute_ctx 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush] ov 653'2 av 655'3 snapc 0=[] snapset 0=[]:{}
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_osd_op 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush]
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_osd_op tier-flush
2021-04-02T17:35:02.900+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] start_flush 212:34762567:test-rados-api-smithi122-39306-84::foo:head v653'2 uv1 blocking
2021-04-02T17:35:02.901+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] snapset 0=[]:{}
2021-04-02T17:35:02.901+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] start_flush no older clones
...
2021-04-02T17:35:02.938+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_cdc oid: 212:34762567:test-rados-api-smithi122-39306-84::foo:head len: 8192 oi.size: 8192 chunk_size: 1024
2021-04-02T17:35:02.986+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] refcount_manifest Start refcount from 212:34762567:test-rados-api-smithi122-39306-84::foo:head to 0:00000000:test-rados-api-smithi122-39306-84::0b0692abe73bf7a995bf44f5e9951ddb037acf28:0
2021-04-02T17:35:02.989+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: found obc in cache: 0x20b7fff0
2021-04-02T17:35:02.989+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: 0x20b7fff0 212:34762567:test-rados-api-smithi122-39306-84::foo:head rwstate(write n=1 w=0) oi: 212:34762567:test-rados-api-smithi122-39306-84::foo:head(653'2 client.14672.0:7608 dirty|data_digest|manifest s 8192 uv 1 dd ff44a93c alloc_hint [0 0 0] manifest(chunked {0=(len: 2 oid: 78:08bc4ac3:test-rados-api-smithi122-39306-84::bar:0 offset: 0 flags: has_reference)})) exists: 1 ssc: 0x22971230 snapset: 0=[]:{}
2021-04-02T17:35:02.990+0000 3b13d700 10 monclient: _renew_subs
2021-04-02T17:35:02.990+0000 3b13d700 10 monclient: _send_mon_message to mon.b at v2:172.21.15.133:3300/0
2021-04-02T17:35:02.990+0000 3b13d700 1 -- [v2:172.21.15.122:6826/34231,v1:172.21.15.122:6827/34231] --> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- mon_subscribe({osdmap=656}) v3 -- 0x20155d20 con 0xf29eae0
Updated by Neha Ojha about 3 years ago
- Assignee set to Myoungwon Oh
- Backport set to pacific
Myoungwon Oh: can you please help debug this?
Updated by Myoungwon Oh about 3 years ago
Hm, I think this is not related to tier_flush.
According to /a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373/remote/smithi122/log/ceph-osd.2.log,
there is an assert due to a pgref leak, as shown in the following log. You can see it at the bottom of the log.
-1> 2021-04-02T23:37:51.057+0000 14c6e700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14c6e700 time 2021-04-02T23:37:51.019763+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called")
ceph version 16.2.0-22-g34c6cd86 (34c6cd86610ee2c7210741ae5596df281599c00a) pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x671438]
2: (OSD::shutdown()+0x131e) [0x786fce]
3: (OSD::handle_signal(int)+0x124) [0x7873c4]
4: (SignalHandler::entry()+0x5f8) [0xdb3f78]
5: /lib64/libpthread.so.0(+0x82de) [0xca292de]
6: clone()
Updated by Neha Ojha about 3 years ago
- Subject changed from TierFlushDuringFlush hangs to Invalid read of size 4 in ceph::logging::Log::dump_recent()
- Assignee deleted (Myoungwon Oh)
<error>
  <unique>0x11bca84</unique>
  <tid>8</tid>
  <threadname>signal_handler</threadname>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 4</what>
  <stack>
    <frame>
      <ip>0xCA35038</ip>
      <obj>/usr/lib64/libpthread-2.28.so</obj>
      <fn>pthread_getname_np</fn>
    </frame>
    <frame>
      <ip>0xFB4222</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ceph::logging::Log::dump_recent()</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/log</dir>
      <file>Log.cc</file>
      <line>384</line>
    </frame>
    <frame>
      <ip>0xDB2D6A</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>handle_fatal_signal(int)</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/global</dir>
      <file>signal_handler.cc</file>
      <line>316</line>
    </frame>
    <frame>
      <ip>0xCA33DBF</ip>
      <obj>/usr/lib64/libpthread-2.28.so</obj>
    </frame>
    <frame>
      <ip>0xDC6D8DE</ip>
      <obj>/usr/lib64/libc-2.28.so</obj>
      <fn>raise</fn>
    </frame>
    <frame>
      <ip>0xDC57CF4</ip>
      <obj>/usr/lib64/libc-2.28.so</obj>
      <fn>abort</fn>
    </frame>
    <frame>
      <ip>0x671508</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/common</dir>
      <file>assert.cc</file>
      <line>198</line>
    </frame>
    <frame>
      <ip>0x786FCD</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>OSD::shutdown()</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/osd</dir>
      <file>OSD.cc</file>
      <line>4292</line>
/a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373/remote/smithi122/log/valgrind/osd.2.log.gz
Updated by Neha Ojha about 3 years ago
Myoungwon Oh wrote:
Hm, I think this is not related to tier_flush.
According to /a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373/remote/smithi122/log/ceph-osd.2.log,
there is an assert due to a pgref leak, as shown in the following log. You can see it at the bottom of the log.
-1> 2021-04-02T23:37:51.057+0000 14c6e700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14c6e700 time 2021-04-02T23:37:51.019763+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called")
ceph version 16.2.0-22-g34c6cd86 (34c6cd86610ee2c7210741ae5596df281599c00a) pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x671438]
2: (OSD::shutdown()+0x131e) [0x786fce]
3: (OSD::handle_signal(int)+0x124) [0x7873c4]
4: (SignalHandler::entry()+0x5f8) [0xdb3f78]
5: /lib64/libpthread.so.0(+0x82de) [0xca292de]
6: clone()
Right, thanks!
Updated by Neha Ojha about 2 years ago
- Status changed from New to Can't reproduce