Bug #50119

closed

Invalid read of size 4 in ceph::logging::Log::dump_recent()

Added by Sage Weil about 3 years ago. Updated about 2 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The test performs the two writes, creates the digest, deletes the pool, and submits the final op, but that op hangs.

/a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373

client log is ./remote/smithi122/log/ceph-client.admin.39306.log.gz

osd.2 gets as far as


2021-04-02T17:35:02.897+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_op: op osd_op(client.14672.0:7610 212.c 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e655) v8
2021-04-02T17:35:02.897+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] op_has_sufficient_caps session=0x203eee70 pool=212 (test-rados-api-smithi122-39306-83 test-rados-api-smithi122-39306-84) pool_app_metadata={rados={}} need_read_cap=0 need_write_cap=0 classes=[] -> yes
2021-04-02T17:35:02.897+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_op osd_op(client.14672.0:7610 212.c 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush] snapc 0=[] ondisk+read+ignore_cache+known_if_redirected e655) v8 may_cache -> write-ordered flags ondisk+read+ignore_cache+known_if_redirected
2021-04-02T17:35:02.897+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: found obc in cache: 0x20b7fff0
2021-04-02T17:35:02.898+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: 0x20b7fff0 212:34762567:test-rados-api-smithi122-39306-84::foo:head rwstate(none n=0 w=0) oi: 212:34762567:test-rados-api-smithi122-39306-84::foo:head(653'2 client.14672.0:7608 dirty|data_digest|manifest s 8192 uv 1 dd ff44a93c alloc_hint [0 0 0] manifest(chunked {0=(len: 2 oid: 78:08bc4ac3:test-rados-api-smithi122-39306-84::bar:0 offset: 0 flags: has_reference)})) exists: 1 ssc: 0x22971230 snapset: 0=[]:{}
2021-04-02T17:35:02.898+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] find_object_context 212:34762567:test-rados-api-smithi122-39306-84::foo:head @head oi=212:34762567:test-rados-api-smithi122-39306-84::foo:head(653'2 client.14672.0:7608 dirty|data_digest|manifest s 8192 uv 1 dd ff44a93c alloc_hint [0 0 0] manifest(chunked {0=(len: 2 oid: 78:08bc4ac3:test-rados-api-smithi122-39306-84::bar:0 offset: 0 flags: has_reference)}))
2021-04-02T17:35:02.899+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_op obc obc(212:34762567:test-rados-api-smithi122-39306-84::foo:head rwstate(write n=1 w=0))
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] execute_ctx 0x47642e30
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] execute_ctx 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush] ov 653'2 av 655'3 snapc 0=[] snapset 0=[]:{}
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_osd_op 212:34762567:test-rados-api-smithi122-39306-84::foo:head [tier-flush]
2021-04-02T17:35:02.899+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_osd_op  tier-flush
2021-04-02T17:35:02.900+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] start_flush 212:34762567:test-rados-api-smithi122-39306-84::foo:head v653'2 uv1 blocking
2021-04-02T17:35:02.901+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean]  snapset 0=[]:{}
2021-04-02T17:35:02.901+0000 3b13d700 20 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] start_flush no older clones
...
2021-04-02T17:35:02.938+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] do_cdc oid: 212:34762567:test-rados-api-smithi122-39306-84::foo:head len: 8192 oi.size: 8192 chunk_size: 1024
2021-04-02T17:35:02.986+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] refcount_manifest Start refcount from 212:34762567:test-rados-api-smithi122-39306-84::foo:head to 0:00000000:test-rados-api-smithi122-39306-84::0b0692abe73bf7a995bf44f5e9951ddb037acf28:0
2021-04-02T17:35:02.989+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: found obc in cache: 0x20b7fff0
2021-04-02T17:35:02.989+0000 3b13d700 10 osd.2 pg_epoch: 655 pg[212.c( v 653'2 (0'0,653'2] local-lis/les=648/649 n=1 ec=648/648 lis/c=648/648 les/c/f=649/649/0 sis=648) [2,0] r=0 lpr=648 crt=653'2 lcod 653'1 mlcod 653'1 active+clean] get_object_context: 0x20b7fff0 212:34762567:test-rados-api-smithi122-39306-84::foo:head rwstate(write n=1 w=0) oi: 212:34762567:test-rados-api-smithi122-39306-84::foo:head(653'2 client.14672.0:7608 dirty|data_digest|manifest s 8192 uv 1 dd ff44a93c alloc_hint [0 0 0] manifest(chunked {0=(len: 2 oid: 78:08bc4ac3:test-rados-api-smithi122-39306-84::bar:0 offset: 0 flags: has_reference)})) exists: 1 ssc: 0x22971230 snapset: 0=[]:{}
2021-04-02T17:35:02.990+0000 3b13d700 10 monclient: _renew_subs
2021-04-02T17:35:02.990+0000 3b13d700 10 monclient: _send_mon_message to mon.b at v2:172.21.15.133:3300/0
2021-04-02T17:35:02.990+0000 3b13d700  1 -- [v2:172.21.15.122:6826/34231,v1:172.21.15.122:6827/34231] --> [v2:172.21.15.133:3300/0,v1:172.21.15.133:6789/0] -- mon_subscribe({osdmap=656}) v3 -- 0x20155d20 con 0xf29eae0

Actions #1

Updated by Neha Ojha about 3 years ago

  • Assignee set to Myoungwon Oh
  • Backport set to pacific

Myoungwon Oh: can you please help debug this?

Actions #2

Updated by Myoungwon Oh about 3 years ago

Hmm, I don't think this is related to tier_flush.
According to /a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373/remote/smithi122/log/ceph-osd.2.log,
there is an abort due to a pgref leak, shown in the following log excerpt (visible at the bottom of the log file).

-1> 2021-04-02T23:37:51.057+0000 14c6e700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14c6e700 time 2021-04-02T23:37:51.019763+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called")
ceph version 16.2.0-22-g34c6cd86 (34c6cd86610ee2c7210741ae5596df281599c00a) pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x671438]
2: (OSD::shutdown()+0x131e) [0x786fce]
3: (OSD::handle_signal(int)+0x124) [0x7873c4]
4: (SignalHandler::entry()+0x5f8) [0xdb3f78]
5: /lib64/libpthread.so.0(+0x82de) [0xca292de]
6: clone()
Actions #3

Updated by Neha Ojha about 3 years ago

  • Subject changed from TierFlushDuringFlush hangs to Invalid read of size 4 in ceph::logging::Log::dump_recent()
  • Assignee deleted (Myoungwon Oh)
<error>
  <unique>0x11bca84</unique>
  <tid>8</tid>
  <threadname>signal_handler</threadname>
  <kind>InvalidRead</kind>
  <what>Invalid read of size 4</what>
  <stack>
    <frame>
      <ip>0xCA35038</ip>
      <obj>/usr/lib64/libpthread-2.28.so</obj>
      <fn>pthread_getname_np</fn>
    </frame>
    <frame>
      <ip>0xFB4222</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ceph::logging::Log::dump_recent()</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/log</dir>
      <file>Log.cc</file>
      <line>384</line>
    </frame>
    <frame>
      <ip>0xDB2D6A</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>handle_fatal_signal(int)</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/global</dir>
      <file>signal_handler.cc</file>
      <line>316</line>
    </frame>
    <frame>
      <ip>0xCA33DBF</ip>
      <obj>/usr/lib64/libpthread-2.28.so</obj>
    </frame>
    <frame>
      <ip>0xDC6D8DE</ip>
      <obj>/usr/lib64/libc-2.28.so</obj>
      <fn>raise</fn>
    </frame>
    <frame>
      <ip>0xDC57CF4</ip>
      <obj>/usr/lib64/libc-2.28.so</obj>
      <fn>abort</fn>
    </frame>
    <frame>
      <ip>0x671508</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string&lt;char, std::char_traits&lt;char&gt;, std::allocator&lt;char&gt; &gt; const&amp;)</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/common</dir>
      <file>assert.cc</file>
      <line>198</line>
    </frame>
    <frame>
      <ip>0x786FCD</ip>
      <obj>/usr/bin/ceph-osd</obj>
      <fn>OSD::shutdown()</fn>
      <dir>/usr/src/debug/ceph-16.2.0-22.g34c6cd86.el8.x86_64/src/osd</dir>
      <file>OSD.cc</file>
      <line>4292</line>
    </frame>
  </stack>
</error>

/a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373/remote/smithi122/log/valgrind/osd.2.log.gz

Actions #4

Updated by Neha Ojha about 3 years ago

Myoungwon Oh wrote:

Hmm, I don't think this is related to tier_flush.
According to /a/sage-2021-04-02_14:05:32-rados-wip-sage2-testing-2021-04-01-1417-pacific-distro-basic-smithi/6017373/remote/smithi122/log/ceph-osd.2.log,
there is an abort due to a pgref leak, shown in the following log excerpt (visible at the bottom of the log file).

-1> 2021-04-02T23:37:51.057+0000 14c6e700 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: In function 'int OSD::shutdown()' thread 14c6e700 time 2021-04-02T23:37:51.019763+0000
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.0-22-g34c6cd86/rpm/el8/BUILD/ceph-16.2.0-22-g34c6cd86/src/osd/OSD.cc: 4292: ceph_abort_msg("abort() called")

ceph version 16.2.0-22-g34c6cd86 (34c6cd86610ee2c7210741ae5596df281599c00a) pacific (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xe5) [0x671438]
2: (OSD::shutdown()+0x131e) [0x786fce]
3: (OSD::handle_signal(int)+0x124) [0x7873c4]
4: (SignalHandler::entry()+0x5f8) [0xdb3f78]
5: /lib64/libpthread.so.0(+0x82de) [0xca292de]
6: clone()

Right, thanks!

Actions #5

Updated by Neha Ojha about 2 years ago

  • Status changed from New to Can't reproduce