Bug #23492
Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh
Description
dzafman-2018-03-28_15:20:23-rados:standalone-wip-zafman-testing-distro-basic-smithi/2331804
In TEST_rados_get_bad_size_shard_0, a ceph-osd startup of osd.3 resulted in the following crash. The restart itself was incidental: osd.3 had been taken down to corrupt an object using ceph-objectstore-tool.
2018-03-28T22:47:50.571 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 42 2674 bytes
2018-03-28T22:47:50.571 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 41 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 42 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 42 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 43 - loading and decoding 0x556d1fae3b00
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 43 2674 bytes
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 42 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 43 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 43 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 44 - loading and decoding 0x556d1fae3f80
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 44 2674 bytes
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 43 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.572 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 44 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 44 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 45 - loading and decoding 0x556d1fae4400
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 45 2674 bytes
2018-03-28T22:47:50.573 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 44 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 45 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 45 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 46 - loading and decoding 0x556d1fae4880
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 46 2674 bytes
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 45 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 46 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 46 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 47 - loading and decoding 0x556d1fae4d00
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 47 2674 bytes
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 46 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] handle_advance_map [3,0,2]/[3,0,2] -- 3/3
2018-03-28T22:47:50.574 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 47 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] state<Reset>: Reset advmap
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 pg_epoch: 47 pg[1.1( empty local-lis/les=38/39 n=0 ec=2/2 lis/c 38/38 les/c/f 39/39/0 38/38/38) [3,0,2] r=0 lpr=39 crt=0'0 mlcod 0'0 unknown mbc={}] check_recovery_sources no source osds () went down
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 20 osd.3 52 get_map 48 - loading and decoding 0x556d1fae5180
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.023 7f35e22e2700 10 osd.3 52 add_map_bl 48 2674 bytes
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:2018-03-28 22:40:05.027 7f35e22e2700 -1 *** Caught signal (Aborted) **
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: in thread 7f35e22e2700 thread_name:tp_osd_tp
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout:
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: ceph version 13.0.1-3362-gab967e9 (ab967e9551853fdb3fcd2ddb071a2bb55e80e6d1) mimic (dev)
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: 1: (()+0x8fef70) [0x556d1d380f70]
2018-03-28T22:47:50.575 INFO:tasks.workunit.client.0.smithi132.stdout: 2: (()+0x11390) [0x7f3600f1a390]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 3: (gsignal()+0x38) [0x7f3600667428]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 4: (abort()+0x16a) [0x7f360066902a]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f3602938965]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 6: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f36028a0ee6]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 7: (()+0x71df31) [0x7f36028a0f31]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 8: (()+0x71d464) [0x7f36028a0464]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1901) [0x7f36025cd961]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f36025cecc1]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 11: (OSDService::try_get_map(unsigned int)+0x6bc) [0x556d1ce4b5fc]
2018-03-28T22:47:50.576 INFO:tasks.workunit.client.0.smithi132.stdout: 12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x197) [0x556d1ce53ec7]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 13: (OSD::dequeue_peering_evt(PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xd4) [0x556d1ce54714]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 14: (PGPeeringItem::run(OSD*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4d) [0x556d1d0afc3d]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x1092) [0x556d1ce49b32]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x452) [0x7f3602457ef2]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f3602459ef0]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 18: (()+0x76ba) [0x7f3600f106ba]
2018-03-28T22:47:50.577 INFO:tasks.workunit.client.0.smithi132.stdout: 19: (clone()+0x6d) [0x7f360073941d]
Related issues
History
#1 Updated by David Zafman almost 6 years ago
- Description updated (diff)
- Source set to Development
#2 Updated by Josh Durgin almost 6 years ago
Let's see if this happens again now that sage's fast peering branch is merged.
#3 Updated by David Zafman almost 6 years ago
#4 Updated by Casey Bodley almost 6 years ago
i saw a similar abort, except that it came from OSD::init() during qa/standalone/scrub/osd-scrub-repair.sh:
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout: -1> 2018-05-10 06:11:31.368 7f5dba062200 10 osd.1 0 add_map_bl 84 2695 bytes
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout: 0> 2018-05-10 06:11:31.372 7f5dba062200 -1 *** Caught signal (Aborted) **
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout: in thread 7f5dba062200 thread_name:ceph-osd
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout:
2018-05-10T06:12:07.165 INFO:tasks.workunit.client.0.smithi022.stdout: ceph version 13.0.2-2185-gd928490 (d9284902e1b2e292595696caf11cdead18acec96) mimic (dev)
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 1: (()+0x915ad0) [0x55abdfa57ad0]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 2: (()+0x11390) [0x7f5dafe55390]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 3: (gsignal()+0x38) [0x7f5daf5a2428]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 4: (abort()+0x16a) [0x7f5daf5a402a]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f5db188b9f5]
2018-05-10T06:12:07.166 INFO:tasks.workunit.client.0.smithi022.stdout: 6: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f5db17f31f6]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 7: (()+0x735241) [0x7f5db17f3241]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 8: (()+0x736934) [0x7f5db17f4934]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1915) [0x7f5db15208c5]
2018-05-10T06:12:07.167 INFO:tasks.workunit.client.0.smithi022.stdout: 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f5db1521c21]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 11: (OSDService::try_get_map(unsigned int)+0x508) [0x55abdf50e938]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 12: (OSDService::get_map(unsigned int)+0x1e) [0x55abdf56db5e]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 13: (OSD::init()+0x1d6f) [0x55abdf51bc0f]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 14: (main()+0x383a) [0x55abdf405b0a]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 15: (__libc_start_main()+0xf0) [0x7f5daf58d830]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: 16: (_start()+0x29) [0x55abdf4d0ae9]
2018-05-10T06:12:07.168 INFO:tasks.workunit.client.0.smithi022.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
#5 Updated by Kefu Chai almost 6 years ago
/a/kchai-2018-05-28_09:21:54-rados-wip-kefu-testing-2018-05-28-1113-distro-basic-smithi/2601187
on mimic branch.
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.056 7f4f5c3a2700 10 osd.3 73 OSD::ms_get_authorizer type=mgr
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4df56700 10 osd.3 73 ms_handle_connect con 0x55fd6be32000
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 _preboot _preboot mon has osdmaps 1..77
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 20 osd.3 73 update_osd_stat osd_stat(51 GiB used, 866 GiB avail, 917 GiB total, peers [] op hist [])
2018-05-28T10:34:50.035 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 20 osd.3 73 check_full_status cur ratio 0.0553537. nearfull_ratio 0.85. backfillfull_ratio 0.99, full_ratio 0.99, failsafe_ratio 0.99, new state none
2018-05-28T10:34:50.036 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 5 osd.3 73 heartbeat: osd_stat(51 GiB used, 866 GiB avail, 917 GiB total, peers [] op hist [])
2018-05-28T10:34:50.036 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 _send_boot
2018-05-28T10:34:50.036 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 new session (outgoing) 0x55fd6be1ef00 con=0x55fd6bc6a700 addr=127.0.0.1:6814/81640
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 assuming cluster_addr ip matches client_addr
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 assuming hb_back_addr ip matches cluster_addr
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 assuming hb_front_addr ip matches client_addr
2018-05-28T10:34:50.037 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.060 7f4f4431e700 10 osd.3 73 client_addr 127.0.0.1:6813/81640, cluster_addr 127.0.0.1:6814/81640, hb_back_addr 127.0.0.1:6815/81640, hb_front_addr 127.0.0.1:6816/81640
2018-05-28T10:34:50.038 INFO:tasks.workunit.client.0.smithi176.stdout:2018-05-28 10:26:21.064 7f4f3eb13700 -1 *** Caught signal (Aborted) **
2018-05-28T10:34:50.038 INFO:tasks.workunit.client.0.smithi176.stdout: in thread 7f4f3eb13700 thread_name:tp_osd_tp
2018-05-28T10:34:50.038 INFO:tasks.workunit.client.0.smithi176.stdout:
2018-05-28T10:34:50.039 INFO:tasks.workunit.client.0.smithi176.stdout: ceph version 13.1.1-125-gbdaa11d (bdaa11d2bffdd71191dce15fef4655169a85cbc0) mimic (rc)
2018-05-28T10:34:50.039 INFO:tasks.workunit.client.0.smithi176.stdout: 1: (()+0x914200) [0x55fd6a13b200]
2018-05-28T10:34:50.039 INFO:tasks.workunit.client.0.smithi176.stdout: 2: (()+0x11390) [0x7f4f62190390]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 3: (gsignal()+0x38) [0x7f4f616c3428]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 4: (abort()+0x16a) [0x7f4f616c502a]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 5: (__gnu_cxx::__verbose_terminate_handler()+0x135) [0x7f4f63fe5665]
2018-05-28T10:34:50.040 INFO:tasks.workunit.client.0.smithi176.stdout: 6: (__cxxabiv1::__terminate(void (*)())+0x6) [0x7f4f63f4ce66]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 7: (()+0x733eb1) [0x7f4f63f4ceb1]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 8: (()+0x7355a4) [0x7f4f63f4e5a4]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x1915) [0x7f4f63c79855]
2018-05-28T10:34:50.041 INFO:tasks.workunit.client.0.smithi176.stdout: 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f4f63c7abb1]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 11: (OSDService::try_get_map(unsigned int)+0x508) [0x55fd69bef638]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 12: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*)+0x19d) [0x55fd69bf359d]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x1a1) [0x55fd69bf3d61]
2018-05-28T10:34:50.042 INFO:tasks.workunit.client.0.smithi176.stdout: 14: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55fd69e5d0f0]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x55fd69c03560]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7f4f63affabe]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f4f63b01b40]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 18: (()+0x76ba) [0x7f4f621866ba]
2018-05-28T10:34:50.043 INFO:tasks.workunit.client.0.smithi176.stdout: 19: (clone()+0x6d) [0x7f4f6179541d]
2018-05-28T10:34:50.044 INFO:tasks.workunit.client.0.smithi176.stdout: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2018-05-28T10:34:50.044 INFO:tasks.workunit.client.0.smithi176.stdout:
#6 Updated by Kefu Chai almost 6 years ago
- Duplicated by Bug #24145: osdmap decode error in rados/standalone/* added
#7 Updated by David Zafman almost 6 years ago
This might be caused by 52dd99e3011bfc787042fe105e02c11b28867c4c, which was included in https://github.com/ceph/ceph/pull/20177. That commit can be reverted cleanly, without interacting with any of the other commits in that pull request. I need to look into it further.
#8 Updated by David Zafman almost 6 years ago
- Assignee set to David Zafman
#9 Updated by David Zafman almost 6 years ago
I did a run based on d9284902e1b2e292595696caf11cdead18acec96 which is a branch off of master.
4 (of 30) failures occurred, one of which was osd/PGLog.h: 1294: FAILED assert(st.st_size == 0)
This means this is NOT a decoding issue. The object that had data written to it is the pgmeta_oid, which is supposed to be empty. In another case the decode of an OSDMap failed because a struct_compat that should have been 8 was instead 48 decimal, i.e. ASCII '0'. In another case a decode failed with a segmentation fault instead of a thrown exception.
The test-erasure-eio.sh tests perform lots of objectstore_tool operations, which stop and start ceph-osds frequently. The failures can happen very early in start-up or not long after.
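The struct_compat symptom can be sketched outside Ceph. This is a hypothetical decoder (decode_header and SUPPORTED_COMPAT are illustrative names, not Ceph's real DECODE_START machinery): a versioned encoding starts with struct_v/struct_compat bytes, and decode refuses a compat version newer than it understands. If ASCII log text overwrites the encoded map, the compat byte becomes a digit character, and ord('0') is exactly 48:

```python
import struct

SUPPORTED_COMPAT = 8  # hypothetical: this decoder understands compat <= 8

def decode_header(buf: bytes):
    # Hypothetical header layout: one byte struct_v, one byte struct_compat.
    struct_v, struct_compat = struct.unpack_from("BB", buf, 0)
    if struct_compat > SUPPORTED_COMPAT:
        raise ValueError(
            f"malformed input: struct_compat {struct_compat} > {SUPPORTED_COMPAT}")
    return struct_v, struct_compat

# A well-formed header decodes fine.
print(decode_header(bytes([8, 8]) + b"\x00" * 16))  # (8, 8)

# ASCII log text overwriting the on-disk map turns the compat byte into a
# digit character: the '0' in "2018-..." decodes as 48, matching the report.
corrupted = b"2018-06-19 17:17:26.726 7fca91bcd700 ..."
try:
    decode_header(corrupted)
except ValueError as e:
    print(e)  # malformed input: struct_compat 48 > 8
```

A decoder that surfaces this as an exception aborts cleanly; if the garbage instead passes the header check and corrupts a length field, the decode can wander off the end of the buffer, which would explain the segfault variant.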
#10 Updated by David Zafman almost 6 years ago
I was able to reproduce by running a loop of a single test case in qa/standalone/erasure-code/test-erasure-eio.sh
function TEST_rados_get_bad_size_shard_1() {
    local dir=$1
    setup_osds 4 || return 1

    local poolname=pool-jerasure
    create_erasure_coded_pool $poolname 2 1 || return 1
    # Set incorrect size into replicas OSD (1) and OSD (2)
    local shard_id=1
    while(true)
    do
        rados_get_data_bad_size $dir $shard_id 10 || return 1
        rados_get_data_bad_size $dir $shard_id 0 || return 1
        rados_get_data_bad_size $dir $shard_id 256 add || return 1
    done
    # NOTREACHED
    delete_erasure_coded_pool $poolname
}
osd.1 corrupted its copy of map 594, but OSDs 0, 2, and 3 match each other and don't have this ASCII text in them:
td/loop/1/current/meta/DIR_5/DIR_2/osdmap.594__0_AC95D125__none
2018-06-19 17:17:26.726 7fca91bcd700 -1 log_channel(cluster) log [ERR] : Corruption detected: object 2:ac4962be:::obj-size-3890-1-0:head is missing hash_info
2018-06-19 17:17:27.306 7fcaacc62700 -1 received signal: Terminated from bash ../qa/standalone/special/loop.sh (PID: 90555) UID: 1000
2018-06-19 17:17:27.306 7fcaacc62700 -1 osd.1 594 *** Got signal Terminated ***
2018-06-19 17:17:27.370 7fcaacc62700 -1 osd.1 594 shutdown
Osd 1 is doing a strange combination of EC read handling AND dealing with map 594:
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process 2.0s1 to_process <> waiting <> waiting_peering {}
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process OpQueueItem(2.0s1 PGOpItem(op=MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3) prio 127 cost 0 e594) queued
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process 2.0s1 to_process <OpQueueItem(2.0s1 PGOpItem(op=MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3) prio 127 cost 0 e594)> waiting <> waiting_peering {}
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process OpQueueItem(2.0s1 PGOpItem(op=MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3) prio 127 cost 0 e594) pg 0x55ac6bb18000
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 dequeue_op 0x55ac6c62e840 prio 127 cost 0 latency 0.000281 MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3 pg pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}]
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 594 share_map osd.3 127.0.0.1:6814/5123 594
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 594 should_share_map osd.3 127.0.0.1:6814/5123 594
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] _handle_message: MOSDECSubOpRead(2.0s1 594/592 ECSubRead(tid=151, to_read={2:ac4962be:::obj-size-3890-1-0:head=0,4096,0}, subchunks={2:ac4962be:::obj-size-3890-1-0:head=[0,1]}, attrs_to_read=)) v3
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: Getting attr on 2:ac4962be:::obj-size-3890-1-0:head
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: not in cache 2:ac4962be:::obj-size-3890-1-0:head
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: found on disk, size 0
2018-06-19 17:17:26.726 7fca91bcd700 0 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] get_hash_info: Mismatch of total_chunk_size 4096
2018-06-19 17:17:26.726 7fca91bcd700 -1 log_channel(cluster) log [ERR] : Corruption detected: object 2:ac4962be:::obj-size-3890-1-0:head is missing hash_info
2018-06-19 17:17:26.726 7fca91bcd700 5 osd.1 pg_epoch: 594 pg[2.0s1( v 581'38 (0'0,581'38] local-lis/les=592/593 n=3 ec=28/28 lis/c 592/592 les/c/f 593/593/0 592/592/28) [3,1,2]p3(0) r=1 lpr=592 luod=0'0 crt=581'38 lcod 0'0 active mbc={}] handle_sub_read: No hinfo for 2:ac4962be:::obj-size-3890-1-0:head
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 send_incremental_map 593 -> 594 to 0x55ac6bbe6a00 127.0.0.1:6814/5123
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 add_map_inc_bl 594 212 bytes
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 note_peer_epoch osd.3 has 594
2018-06-19 17:17:26.726 7fca91bcd700 10 osd.1 594 dequeue_op 0x55ac6c62e840 finish
2018-06-19 17:17:26.726 7fca91bcd700 20 osd.1 op_wq(0) _process empty q, waiting
#11 Updated by David Zafman almost 6 years ago
Now that I've looked at the code, there is nothing surprising about the map handling: there is code in dequeue_op() that calls maybe_share_map(), and send_message_osd_cluster() calls share_map_peer().
#12 Updated by Josh Durgin almost 6 years ago
possibly related luminous run: http://pulpito.ceph.com/yuriw-2018-06-11_16:27:32-rados-wip-yuri3-testing-2018-06-11-1421-luminous-distro-basic-smithi/2654727/
backtrace is:
#27 0x0000557d6c9e9930 in OSDMap::decode (this=this@entry=0x557d7e5f0e00, bl=...) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSDMap.cc:2832
#28 0x0000557d6c9ebc1e in OSDMap::decode (this=this@entry=0x557d7e5f0e00, bl=...) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSDMap.cc:2641
#29 0x0000557d6c3036ac in OSDService::try_get_map (this=this@entry=0x557d77e1fdc8, epoch=epoch@entry=34) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.cc:1616
#30 0x0000557d6c30627e in OSD::advance_pg (this=this@entry=0x557d77e1e000, osd_epoch=<optimized out>, pg=pg@entry=0x557d7d239400, handle=..., rctx=rctx@entry=0x7fa514b673d0, new_pgs=new_pgs@entry=0x7fa514b673a0) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.cc:8371
#31 0x0000557d6c3072e4 in OSD::process_peering_events (this=0x557d77e1e000, pgs=..., handle=...) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.cc:9880
#32 0x0000557d6c37b867 in OSD::PeeringWQ::_process (handle=..., pgs=..., this=<optimized out>) at /build/ceph-12.2.5-461-geccc8e6/src/osd/OSD.h:1871
#33 ThreadPool::BatchWorkQueue<PG>::_void_process (this=<optimized out>, p=0x557d78b365a0, handle=...) at /build/ceph-12.2.5-461-geccc8e6/src/common/WorkQueue.h:120
#34 0x0000557d6c910df9 in ThreadPool::worker (this=0x557d77e1e9b8, wt=0x557d77e2f3e0) at /build/ceph-12.2.5-461-geccc8e6/src/common/WorkQueue.cc:120
#35 0x0000557d6c911f00 in ThreadPool::WorkThread::entry (this=<optimized out>) at /build/ceph-12.2.5-461-geccc8e6/src/common/WorkQueue.h:448
#36 0x00007fa536cd86ba in start_thread (arg=0x7fa514b69700) at pthread_create.c:333
#37 0x00007fa535d4f41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
PRs being tested in that branch: https://trello.com/c/oqjsU62O/326-wip-yuri3-testing-2018-06-11-1421-luminous - any of those seem like possible culprits David?
#13 Updated by David Zafman almost 6 years ago
- Status changed from New to In Progress
#14 Updated by David Zafman almost 6 years ago
- Priority changed from Normal to High
#15 Updated by David Zafman over 5 years ago
Final bisect results:
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
2e47633ffa499b41255aed1ba8ea12e14ef556ba
85f6873e0420405ce1682abfc2c2884fa102a239
d55a66c56e106c347dabb150318add73667cab52
We cannot bisect more!
These are the first 3 commits of https://github.com/ceph/ceph/pull/20681
We could revert the entire pull request to fix this bug. Not sure what is causing the corruption.
#16 Updated by David Zafman over 5 years ago
sage APP [11:05 AM]
davidzlap: i'm suspicious of 056d02b0649d28200170f729dc0d625209206cfb ... can you try reverting just that commit?
davidzlap APP [11:13 AM]
@sage: ok
davidzlap APP [11:20 AM]
@sage: On closer examination the bisect found that even without that sha1 it failed. There are 3 possible commits which can't be tested individually. See previous comment.
#17 Updated by Sage Weil over 5 years ago
The problem is that int global_init_shutdown_stderr(CephContext *cct) is being run at a point in the process lifecycle where other threads are still opening and closing fds:
A: global_init_shutdown_stderr closes stderr
B: opens a new fd for something (e.g. filestore opens an osdmap file)
A: global_init_shutdown_stderr opens /dev/null, assuming it will be the same stderr fd (but it's not)
...
logging code writes to stderr and it ends up in the osdmap file
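This race can be sketched with a minimal, single-threaded Python stand-in (for determinism; in the real bug the open in step B happens on another thread, and the vacated descriptor is stderr's fd 2). The file name is illustrative; the point is only POSIX's lowest-free-descriptor rule:

```python
import os
import tempfile

# Stand-in for the fd that stderr occupies (fd 2 in the real bug).
stderr_fd = os.open("/dev/null", os.O_WRONLY)

# A: global_init_shutdown_stderr closes stderr.
os.close(stderr_fd)

# B: another thread opens a file (e.g. filestore opening an osdmap file).
# The kernel hands out the lowest free descriptor: the one just vacated.
map_path = os.path.join(tempfile.gettempdir(), "osdmap_demo.bin")
map_fd = os.open(map_path, os.O_CREAT | os.O_RDWR | os.O_TRUNC, 0o600)

# A: reopens /dev/null assuming it gets the old stderr fd back -- it doesn't.
null_fd = os.open("/dev/null", os.O_WRONLY)

print(map_fd == stderr_fd)   # True: the "osdmap" file stole stderr's fd
print(null_fd == stderr_fd)  # False: /dev/null landed on a different fd

# Logging code still writes to the old stderr fd, corrupting the "map".
os.write(stderr_fd, b"2018-06-19 17:17:26.726 log line\n")
os.lseek(map_fd, 0, os.SEEK_SET)
print(os.read(map_fd, 64))   # the log text is now in the file
```

This matches the symptom above: binary osdmap files containing ASCII log text. The usual fix for this class of bug is to redirect in place with dup2(null_fd, 2) rather than close-then-reopen, since dup2 never releases fd 2 for other threads to grab.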
#18 Updated by Sage Weil over 5 years ago
#19 Updated by Nathan Cutler over 5 years ago
- Backport set to mimic,luminous
#20 Updated by Sage Weil over 5 years ago
- Status changed from In Progress to Pending Backport
#21 Updated by Patrick Donnelly over 5 years ago
- Copied to Backport #24864: luminous: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh added
#22 Updated by Patrick Donnelly over 5 years ago
- Copied to Backport #24865: mimic: Abort in OSDMap::decode() during qa/standalone/erasure-code/test-erasure-eio.sh added
#23 Updated by Josh Durgin over 5 years ago
- Duplicated by Bug #25106: Ceph-osd coredumps on launch added
#24 Updated by David Zafman over 5 years ago
- Status changed from Pending Backport to Resolved