Bug #23492

Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh

Added by David Zafman almost 6 years ago. Updated over 5 years ago.

Status:
Resolved
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:
0%

Source:
Development
Tags:
Backport:
mimic,luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

dzafman-2018-03-28_15:20:23-rados:standalone-wip-zafman-testing-distro-basic-smithi/2331804

In TEST_rados_get_bad_size_shard_0, a ceph-osd startup of osd.3 resulted in the following crash. The startup was incidental: osd.3 had been taken down to corrupt an object using ceph-objectstore-tool.


2018-03-28T22:47:50.571 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 42 2674 bytes
2018-03-28T22:47:50.571 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 41 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 42 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 42 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 43 - loading and decoding 0x556d1fae3b00
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 43 2674 bytes
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 42 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 43 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 43 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 44 - loading and decoding 0x556d1fae3f80
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 44 2674 bytes
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 43 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 44 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 44 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 45 - loading and decoding 0x556d1fae4400
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 45 2674 bytes
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 44 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 45 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 45 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 46 - loading and decoding 0x556d1fae4880
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 46 2674 bytes
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 45 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 46 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 46 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 47 - loading and decoding 0x556d1fae4d00
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 47 2674 bytes
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 46 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 47 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 47 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 48 - loading and decoding 0x556d1fae5180
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 48 2674 bytes
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.027 7f35e22e2700 -1 *** Caught signal (Aborted) **
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: in thread 7f35e22e2700 thread_name:tp_osd_tp
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: ceph version 13.0.1-3362-gab967e9 (ab967e9551853fdb3fcd2ddb071a2bb55e80e6d1) mimic (dev)
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: 1: (()+0x8fef70) [0x556d1d380f70]
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: 2: (()+0x11390) [0x7f3600f1a390]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 3: (gsignal()+0x38) [0x7f3600667428]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 4: (abort()+0x16a) [0x7f360066902a]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f3602938965]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 6: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f36028a0ee6]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 7: (()+0x71df31) [0x7f36028a0f31]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 8: (()+0x71d464) [0x7f36028a0464]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1901) [0x7f36025cd961]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f36025cecc1]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 11: (OSDService::try_get_map(unsigned int)+0x6bc) [0x556d1ce4b5fc]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x197) [0x556d1ce53ec7]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 13: (OSD::dequeue_peering_evt(PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd4) [0x556d1ce54714]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 14: (PGPeeringItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4d) [0x556d1d0afc3d]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1092) [0x556d1ce49b32]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x452) [0x7f3602457ef2]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f3602459ef0]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 18: (()+0x76ba) [0x7f3600f106ba]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 19: (clone()+0x6d) [0x7f360073941d]


Related issues

  • Duplicated by RADOS - Bug #24145: osdmap decode error in rados/standalone/* (Duplicate, 05/16/2018)
  • Duplicated by RADOS - Bug #25106: Ceph-osd coredumps on launch (Duplicate, 07/25/2018)
  • Copied to RADOS - Backport #24864: luminous: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh (Resolved)
  • Copied to RADOS - Backport #24865: mimic: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh (Resolved)

History

#1 Updated by David Zafman almost 6 years ago

  • Description updated (diff)
  • Source set to Development

#2 Updated by Josh Durgin almost 6 years ago

Let's see if this happens again now that sage's fast peering branch is merged.

#4 Updated by Casey Bodley almost 6 years ago

I saw a similar abort, except that it came from OSD::init() during qa/standalone/scrub/osd-scrub-repair.sh:

2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout:    -1> 2018-05-10 06:11:31.368 7f5dba062200 10 osd.1 0 add_map_bl 84 2695 bytes
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout:     0> 2018-05-10 06:11:31.372 7f5dba062200 -1 *** Caught signal (Aborted) **
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout: in thread 7f5dba062200 thread_name:ceph-osd
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout:
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout: ceph version 13.0.2-2185-gd928490 (d9284902e1b2e292595696caf11cdead18acec96) mimic (dev)
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 1: (()+0x915ad0) [0x55abdfa57ad0]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 2: (()+0x11390) [0x7f5dafe55390]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 3: (gsignal()+0x38) [0x7f5daf5a2428]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 4: (abort()+0x16a) [0x7f5daf5a402a]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f5db188b9f5]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 6: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f5db17f31f6]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 7: (()+0x735241) [0x7f5db17f3241]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 8: (()+0x736934) [0x7f5db17f4934]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1915) [0x7f5db15208c5]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f5db1521c21]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 11: (OSDService::try_get_map(unsigned int)+0x508) [0x55abdf50e938]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 12: (OSDService::get_map(unsigned int)+0x1e) [0x55abdf56db5e]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 13: (OSD::init()+0x1d6f) [0x55abdf51bc0f]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 14: (main()+0x383a) [0x55abdf405b0a]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 15: (__libc_start_main()+0xf0) [0x7f5daf58d830]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 16: (_start()+0x29) [0x55abdf4d0ae9]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

http://qa-proxy.ceph.com/teuthology/cbodley-2018-05-09_21:16:26-rados-wip-rgw-req-state-time-distro-basic-smithi/2511172/teuthology.log

#5 Updated by Kefu Chai almost 6 years ago

/a/kchai-2018-05-28_09:21:54-rados-wip-kefu-testing-2018-05-28-1113-distro-basic-smithi/2601187

On the mimic branch.

2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.056 7f4f5c3a2700 10 osd.3 73 OSD::ms_get_authorizer type=mgr
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4df56700 10 osd.3 73 ms_handle_connect con 0x55fd6be32000
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 _preboot _preboot mon has osdmaps 1..77
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 20 osd.3 73 update_osd_stat osd_stat(51 GiB used, 866 GiB avail, 917 GiB total, peers [] op hist [])
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 20 osd.3 73 check_full_status cur ratio 0.0553537. nearfull_ratio 0.85. backfillfull_ratio 0.99, full_ratio 0.99, failsafe_ratio 0.99, new state none
2018-05-28T10:34:50.036 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700  5 osd.3 73 heartbeat: osd_stat(51 GiB used, 866 GiB avail, 917 GiB total, peers [] op hist [])
2018-05-28T10:34:50.036 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 _send_boot
2018-05-28T10:34:50.036 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73  new session (outgoing) 0x55fd6be1ef00 con=0x55fd6bc6a700 addr=127.0.0.1:6814/81640
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73  assuming cluster_addr ip matches client_addr
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73  assuming hb_back_addr ip matches cluster_addr
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73  assuming hb_front_addr ip matches client_addr
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73  client_addr 127.0.0.1:6813/81640, cluster_addr 127.0.0.1:6814/81640, hb_back_addr 127.0.0.1:6815/81640, hb_front_addr 127.0.0.1:6816/81640
2018-05-28T10:34:50.038 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.064 7f4f3eb13700 -1 *** Caught signal (Aborted) **
2018-05-28T10:34:50.038 INFO:tasks.workunit.client.0.smithi176.stdout: in thread 7f4f3eb13700 thread_name:tp_osd_tp
2018-05-28T10:34:50.038 INFO:tasks.workunit.client.0.smithi176.stdout:
2018-05-28T10:34:50.039 INFO:tasks.workunit.client.0.smithi176.stdout: ceph version 13.1.1-125-gbdaa11d (bdaa11d2bffdd71191dce15fef4655169a85cbc0) mimic (rc)
2018-05-28T10:34:50.039 INFO:tasks.workunit.client.0.smithi176.stdout: 1: (()+0x914200) [0x55fd6a13b200]
2018-05-28T10:34:50.039 INFO:tasks.workunit.client.0.smithi176.stdout: 2: (()+0x11390) [0x7f4f62190390]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 3: (gsignal()+0x38) [0x7f4f616c3428]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 4: (abort()+0x16a) [0x7f4f616c502a]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f4f63fe5665]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 6: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f4f63f4ce66]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 7: (()+0x733eb1) [0x7f4f63f4ceb1]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 8: (()+0x7355a4) [0x7f4f63f4e5a4]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1915) [0x7f4f63c79855]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f63c7abb1]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 11: (OSDService::try_get_map(unsigned int)+0x508) [0x55fd69bef638]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x19d) [0x55fd69bf359d]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1a1) [0x55fd69bf3d61]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 14: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55fd69e5d0f0]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x55fd69c03560]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7f4f63affabe]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f4f63b01b40]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 18: (()+0x76ba) [0x7f4f621866ba]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 19: (clone()+0x6d) [0x7f4f6179541d]
2018-05-28T10:34:50.044 INFO:tasks.workunit.client.0.smithi176.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-05-28T10:34:50.044 INFO:tasks.workunit.client.0.smithi176.stdout:

#6 Updated by Kefu Chai almost 6 years ago

  • Duplicated by Bug #24145: osdmap decode error in rados/standalone/* added

#7 Updated by David Zafman almost 6 years ago

This might be caused by 52dd99e3011bfc787042fe105e02c11b28867c4c, which was included in https://github.com/ceph/ceph/pull/20177. That commit reverts cleanly, without interacting with any of the other commits in that pull request. I need to look into it further.

#8 Updated by David Zafman almost 6 years ago

  • Assignee set to David Zafman

#9 Updated by David Zafman almost 6 years ago

I did a run based on d9284902e1b2e292595696caf11cdead18acec96, which is a branch off of master.

http://pulpito.ceph.com/dzafman-2018-06-18_18:03:17-rados:standalone-wip-zafman-testing-distro-basic-smithi/

4 (of 30) failures occurred, one of which was osd/PGLog.h: 1294: FAILED assert(st.st_size == 0)

This means this is NOT a decoding issue. The object that had data written to it is the pgmeta_oid, which is supposed to be empty. In another case the decode of an OSDMap failed because of a struct_compat that should have been 8 but was instead 48 decimal, i.e. ASCII '0'. In another case a decode failed with a segmentation fault instead of a thrown exception.

These test-erasure-eio.sh tests perform many objectstore_tool operations, which stop and start ceph-osds repeatedly. The failures can happen very early in start-up or not long after.
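For context on the struct_compat symptom above, a minimal sketch modeled on (but not copied from) Ceph's DECODE_START-style framing, showing why a single stray ASCII byte is enough to make the decoder throw:

#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>

// The encoder records the oldest code version that can still understand
// this encoding; a decoder must bail out rather than misread the bytes
// that follow. (Sketch of the check, not the actual macro.)
static void check_struct_compat(uint8_t struct_compat, uint8_t my_version) {
  if (my_version < struct_compat)
    throw std::runtime_error("malformed input: struct_compat " +
                             std::to_string(struct_compat) +
                             " > my version " + std::to_string(my_version));
}

int main() {
  try {
    check_struct_compat('0', 8);  // '0' == 48: a stray ASCII byte, not 8
  } catch (const std::exception &e) {
    // In the OSD the equivalent exception escapes OSDMap::decode() and
    // ends in std::terminate/abort, as in the backtraces above.
    std::cout << e.what() << "\n";
  }
}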

#10 Updated by David Zafman almost 6 years ago

I was able to reproduce this by running a single test case from qa/standalone/erasure-code/test-erasure-eio.sh in a loop:

function TEST_rados_get_bad_size_shard_1() {
    local dir=$1
    setup_osds 4 || return 1

    local poolname=pool-jerasure
    create_erasure_coded_pool $poolname 2 1 || return 1
    # Set incorrect size into replicas OSD (1) and OSD (2)
    local shard_id=1
    while true; do
        rados_get_data_bad_size $dir $shard_id 10 || return 1
        rados_get_data_bad_size $dir $shard_id 0 || return 1
        rados_get_data_bad_size $dir $shard_id 256 add || return 1
    done
    # NOTREACHED
    delete_erasure_coded_pool $poolname
}

osd.1 corrupted its epoch 594 map, but the copies on OSDs 0, 2, and 3 match each other and don't have this ASCII text in them:
td/loop/1/current/meta/DIR_5/DIR_2/osdmap.594__0_AC95D125__none

2018-06-19 17:17:26.726 7fca91bcd700 -1 log_channel(cluster) log [ERR] : Corruption detected: object 2:ac4962be:::obj-size-3890-1-0:head is missing hash_info
2018-06-19 17:17:27.306 7fcaacc62700 -1 received  signal: Terminated from bash ../qa/standalone/special/loop.sh  (PID: 90555) UID: 1000
2018-06-19 17:17:27.306 7fcaacc62700 -1 osd.1 594 *** Got signal Terminated ***
2018-06-19 17:17:27.370 7fcaacc62700 -1 osd.1 594 shutdown

osd.1 is doing a strange combination of EC read handling AND dealing with map 594:

2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process 2.0s1 to_process <> waiting <> waiting_peering {}
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process OpQueueItem(2.0s1 PGOpItem(op=MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3) prio 127 cost 0 e594) queued
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process 2.0s1 to_process <OpQueueItem(2.0s1 PGOpItem(op=MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3) prio 127 cost 0 e594)> waiting <> waiting_peering {}
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process OpQueueItem(2.0s1 PGOpItem(op=MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3) prio 127 cost 0 e594) pg 0x55ac6bb18000
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 dequeue_op 0x55ac6c62e840 prio 127 cost 0 latency 0.000281 MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3 pg pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}]
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 594 share_map osd.3 127.0.0.1:6814/5123 594
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 594 should_share_map osd.3 127.0.0.1:6814/5123 594
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] _handle_message: MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: Getting attr on 2:ac4962be:::obj-size-3890-1-0:head
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: not in cache 2:ac4962be:::obj-size-3890-1-0:head
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: found on disk, size 0
2018-06-19 17:17:26.726 7fca91bcd700  0 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: Mismatch of total_chunk_size 4096
2018-06-19 17:17:26.726 7fca91bcd700 -1 log_channel(cluster) log [ERR] : Corruption detected: object 2:ac4962be:::obj-size-3890-1-0:head is missing hash_info
2018-06-19 17:17:26.726 7fca91bcd700  5 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] handle_sub_read: No hinfo for 2:ac4962be:::obj-size-3890-1-0:head
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 send_incremental_map 593 -> 594 to 0x55ac6bbe6a00 127.0.0.1:6814/5123
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 add_map_inc_bl 594 212 bytes
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 note_peer_epoch osd.3 has 594
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 dequeue_op 0x55ac6c62e840 finish
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process empty q, waiting

#11 Updated by David Zafman almost 6 years ago

Now that I've looked at the code, there is nothing surprising about the map handling: dequeue_op() calls maybe_share_map(), and send_message_osd_cluster() calls share_map_peer().
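For illustration, a toy model of the piggybacked map sharing described above (hypothetical simplified code, not the actual Ceph functions), showing why map traffic naturally interleaves with EC read handling, as in the osd.1 log:

#include <iostream>
#include <map>

struct Peer { int known_epoch; };

static int my_epoch = 594;
static std::map<int, Peer> peers = {{3, {593}}};

// Stand-in for maybe_share_map(): if the sender is behind on maps,
// send it the missing incremental epochs (the send_incremental_map /
// add_map_inc_bl lines in the log above).
void maybe_share_map(int peer_id) {
  Peer &p = peers[peer_id];
  while (p.known_epoch < my_epoch) {
    std::cout << "send_incremental_map " << p.known_epoch << " -> "
              << p.known_epoch + 1 << " to osd." << peer_id << "\n";
    ++p.known_epoch;
  }
}

void dequeue_op(int from_peer) {
  std::cout << "handle EC sub-read\n";  // _handle_message in the log
  maybe_share_map(from_peer);           // then share newer maps back
}

int main() { dequeue_op(3); }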

#12 Updated by Josh Durgin almost 6 years ago

Possibly related luminous run: http://pulpito.ceph.com/yuriw-2018-06-11_16:27:32-rados-wip-yuri3-testing-2018-06-11-1421-luminous-distro-basic-smithi/2654727/

backtrace is:

#27 0x0000557d6c9e9930 in OSDMap::decode (this=this@entry=0x557d7e5f0e00, bl=...) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSDMap.cc:2832
#28 0x0000557d6c9ebc1e in OSDMap::decode (this=this@entry=0x557d7e5f0e00, bl=...) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSDMap.cc:2641
#29 0x0000557d6c3036ac in OSDService::try_get_map (this=this@entry=0x557d77e1fdc8, epoch=epoch@entry=34) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.cc:1616
#30 0x0000557d6c30627e in OSD::advance_pg (this=this@entry=0x557d77e1e000, osd_epoch=<optimized out>, pg=pg@entry=0x557d7d239400, handle=..., rctx=rctx@entry=0x7fa514b673d0, new_pgs=new_pgs@entry=0x7fa514b673a0)
    at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.cc:8371
#31 0x0000557d6c3072e4 in OSD::process_peering_events (this=0x557d77e1e000, pgs=..., handle=...) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.cc:9880
#32 0x0000557d6c37b867 in OSD::PeeringWQ::_process (handle=..., pgs=..., this=<optimized out>) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.h:1871
#33 ThreadPool::BatchWorkQueue<PG>::_void_process (this=<optimized out>, p=0x557d78b365a0, handle=...) at /build/ceph-12.2.5-461-geccc8e6/src/common/WorkQueue.h:120
#34 0x0000557d6c910df9 in ThreadPool::worker (this=0x557d77e1e9b8, wt=0x557d77e2f3e0) at /build/ceph-12.2.5-461-geccc8e6/src/common/WorkQueue.cc:120
#35 0x0000557d6c911f00 in ThreadPool::WorkThread::entry (this=<optimized out>) at /build/ceph-12.2.5-461-geccc8e6/src/common/WorkQueue.h:448
#36 0x00007fa536cd86ba in start_thread (arg=0x7fa514b69700) at pthread_create.c:333
#37 0x00007fa535d4f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

PRs being tested in that branch: https://trello.com/c/oqjsU62O/326-wip-yuri3-testing-2018-06-11-1421-luminous - do any of those seem like possible culprits, David?

#13 Updated by David Zafman almost 6 years ago

  • Status changed from New to In Progress

#14 Updated by David Zafman almost 6 years ago

  • Priority changed from Normal to High

#15 Updated by David Zafman over 5 years ago

Final bisect results:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
2e47633ffa499b41255aed1ba8ea12e14ef556ba
85f6873e0420405ce1682abfc2c2884fa102a239
d55a66c56e106c347dabb150318add73667cab52
We cannot bisect more!

These are the first 3 commits of https://github.com/ceph/ceph/pull/20681

We could revert the entire pull request to fix this bug. Not sure what is causing the corruption.

#16 Updated by David Zafman over 5 years ago

sage APP [11:05 AM]
davidzlap: i'm suspicious of 056d02b0649d28200170f729dc0d625209206cfb ... can you try reverting just that commit?

davidzlap APP [11:13 AM]
@sage: ok

davidzlap APP [11:20 AM]
@sage: On closer examination the bisect found that even without that sha1 it failed.
There are 3 possible commits which can't be tested individually.  See previous comment.

#17 Updated by Sage Weil over 5 years ago

The problem is that global_init_shutdown_stderr(CephContext *cct) is not being run at a point in the process lifecycle where no other threads are opening and closing fds:

A: global_init_shutdown_stderr closes stderr
B: opens a new fd for something (e.g. filestore opens an osdmap file)
A: global_init_shutdown_stderr opens /dev/null, assuming it will be the same stderr fd (but it's not)
...
logging code writes to stderr and it ends up in the osdmap file
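To make the interleaving concrete, a minimal standalone sketch of the fd-reuse race (illustrative only, not Ceph code; the file name is made up and the two "threads" are inlined in the losing order):

#include <cstring>
#include <fcntl.h>
#include <unistd.h>

int main() {
  // A: close stderr, releasing fd 2 back to the kernel
  close(STDERR_FILENO);

  // B: races in and opens a file; open() returns the lowest free
  //    descriptor, which is now 2
  int map_fd = open("osdmap.bin", O_CREAT | O_WRONLY | O_TRUNC, 0644);

  // A: reopens /dev/null expecting to get fd 2 back, but fd 2 is taken,
  //    so /dev/null lands on the next free descriptor instead
  int null_fd = open("/dev/null", O_WRONLY);

  // Any later logging to "stderr" (fd 2) is now appended to the osdmap
  // file, which is how ASCII log text ends up inside a binary map
  const char msg[] = "log line that corrupts the map file\n";
  write(STDERR_FILENO, msg, strlen(msg));

  close(map_fd);
  close(null_fd);
  return 0;
}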

#19 Updated by Nathan Cutler over 5 years ago

  • Backport set to mimic,luminous

#20 Updated by Sage Weil over 5 years ago

  • Status changed from In Progress to Pending Backport

#21 Updated by Patrick Donnelly over 5 years ago

  • Copied to Backport #24864: luminous: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh added

#22 Updated by Patrick Donnelly over 5 years ago

  • Copied to Backport #24865: mimic: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh added

#23 Updated by Josh Durgin over 5 years ago

  • Duplicated by Bug #25106: Ceph-osd coredumps on launch added

#24 Updated by David Zafman over 5 years ago

  • Status changed from Pending Backport to Resolved
