Bug #18637
closed"BlueStore.cc: 4967: FAILED assert(allow_eio || r != -5)" in powercycle-kraken-testing-basic-mira
0%
Description
Run: http://pulpito.ceph.com/teuthology-2017-01-22_17:42:58-powercycle-kraken-testing-basic-mira/
Jobs: 738648
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2017-01-22_17:42:58-powercycle-kraken-testing-basic-mira/738648/teuthology.log
Assertion: /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0-2-g61b1bee/rpm/el7/BUILD/ceph-11.2.0-2-g61b1bee/src/os/bluestore/BlueStore.cc: 4967: FAILED assert(allow_eio || r != -5) ceph version 11.2.0-2-g61b1bee (61b1beef1dc4802c32367fc71968101a09042c15) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x55db188d3b55] 2: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0x825) [0x55db1863c505] 3: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*, bool)+0x26b) [0x55db184e43db] 4: (ReplicatedBackend::handle_pull(pg_shard_t, PullOp&, PushOp*)+0xd2) [0x55db184e5952] 5: (ReplicatedBackend::do_pull(std::shared_ptr<OpRequest>)+0x1cc) [0x55db184e78ec] 6: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x363) [0x55db184ee0d3] 7: (PrimaryLogPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x55db183af6d0] 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x55db1825800d] 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x6d) [0x55db1825825d] 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7e2) [0x55db1827f632] 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947) [0x55db188d97c7] 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55db188db920] 13: (()+0x7dc5) [0x7f3318f48dc5] 14: (clone()+0x6d) [0x7f3317e3073d]
Updated by Sage Weil over 7 years ago
- Assignee set to Sage Weil
- Priority changed from Normal to Immediate
Updated by Sage Weil about 7 years ago
- Status changed from New to Can't reproduce
- Priority changed from Immediate to Urgent
unable to reproduce this. best guess is another bad mira disk.
Updated by Matthew Via almost 7 years ago
I currently have an OSD that I can't get to stay started with this exact trace:
-19> 2017-05-09 14:53:51.482714 7f6fed94a700 5 osd.1 pg_epoch: 120018 pg[1.182( v 107846'9228 (52099'6228,107846'9228] local-les=120018 n=218 ec=1 les/c/f 120018/118233/0 120016/ 120017/120017) [1,3,8]/[1,3] r=0 lpr=120017 pi=118227-120016/168 bft=8 crt=107846'9228 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped] target shard 8 from MIN -18> 2017-05-09 14:53:51.482814 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.85:6805/2805 -- pg_info(1 pgs e120018:0.148) v4 -- 0x7f7053604b20 con 0 -17> 2017-05-09 14:53:51.482859 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -16> 2017-05-09 14:53:51.483397 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.84:6801/30027 -- pg_info(1 pgs e120018:0.148) v4 -- 0x7f7053605a20 con 0 -15> 2017-05-09 14:53:51.483426 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -14> 2017-05-09 14:53:51.483715 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.85:6805/2805 -- pg_info(1 pgs e120018:12.6) v4 -- 0x7f7053602d20 con 0 -13> 2017-05-09 14:53:51.483739 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -12> 2017-05-09 14:53:51.484307 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.84:6801/30027 -- pg_info(1 pgs e120018:12.6) v4 -- 0x7f70606bd680 con 0 -11> 2017-05-09 14:53:51.484336 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -10> 2017-05-09 14:53:51.484555 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.85:6805/2805 -- pg_info(1 pgs e120018:2.117) v4 -- 0x7f7043dc30c0 con 0 -9> 2017-05-09 14:53:51.484575 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -8> 2017-05-09 14:53:51.485109 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.84:6801/30027 -- pg_info(1 pgs e120018:2.117) v4 -- 0x7f7043dc0f00 con 0 -7> 2017-05-09 14:53:51.485140 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -6> 2017-05-09 14:53:51.485407 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.85:6805/2805 -- pg_info(1 pgs e120018:0.161) v4 -- 0x7f70936f65a0 con 0 -5> 2017-05-09 14:53:51.485430 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup -4> 2017-05-09 14:53:51.485977 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.84:6809/30071 -- pg_info(1 pgs e120018:0.161) v4 -- 0x7f7052482b40 con 0 -3> 2017-05-09 14:53:51.486012 7f6fed94a700 2 Event(0x7f7017af1280 nevent=5000 time_id=12).wakeup -2> 2017-05-09 14:53:51.486253 7f6fed94a700 1 -- 192.168.2.85:6813/3484 --> 192.168.2.85:6805/2805 -- pg_info(1 pgs e120018:1.160) v4 -- 0x7f7052485840 con 0 -1> 2017-05-09 14:53:51.486274 7f6fed94a700 2 Event(0x7f7017af0e80 nevent=5000 time_id=12).wakeup 0> 2017-05-09 14:53:51.486451 7f6fe9942700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/h uge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/os/bluestore/BlueStore.cc: In function 'virtual int BlueStore::read(ObjectStore::CollectionHandle&, const ghobject_t&, uint64_t, size _t, ceph::bufferlist&, uint32_t, bool)' thread 7f6fe9942700 time 2017-05-09 14:53:51.467416 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/os /bluestore/BlueStore.cc: 4967: FAILED assert(allow_eio || r != -5) ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f700e016b35] 2: (BlueStore::read(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0x825) [0x7f700dd7f 4e5] 3: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*, bool)+0x26b) [0x7f700dc273bb] 4: (ReplicatedBackend::handle_pull(pg_shard_t, PullOp&, PushOp*)+0xd2) [0x7f700dc28932] 5: (ReplicatedBackend::do_pull(std::shared_ptr<OpRequest>)+0x1cc) [0x7f700dc2a8cc] 6: (ReplicatedBackend::handle_message(std::shared_ptr<OpRequest>)+0x363) [0x7f700dc310b3] 7: (PrimaryLogPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x100) [0x7f700daf26d0] 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x41d) [0x7f700d99b00d] 9: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest> const&)+0x6d) [0x7f700d99b25d] 10: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x7e2) [0x7f700d9c2632] 11: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x947) [0x7f700e01c7a7] 12: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f700e01e900] 13: (()+0x7dc5) [0x7f700af98dc5] 14: (clone()+0x6d) [0x7f7009e8073d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
This happens every time I start the OSD. Anything useful I can give you?
Updated by Matthew Via almost 7 years ago
A few lines above that may be relevent, so also adding that:
-6> 2017-05-10 08:49:33.029553 7f67f1776700 5 -- op tracker -- seq: 125, time: 2017-05-10 08:49:33.029552, event: done, op: osd pg remove(epoch 122045; pg2.ba; ) -5> 2017-05-10 08:49:33.030371 7f68024be700 5 -- 192.168.2.85:6805/16374 >> 192.168.2.85:6809/16133 conn(0x7f6886d08000 :6805 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=7 cs=1 l=0). rx osd.3 seq 7 0x7f68862c9a00 MOSDPGPull(2.a7 122045 [PullOp(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head, recovery_info: ObjectRecoveryInfo(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head@118184'411570, size: 18446744073709551615, copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 -4> 2017-05-10 08:49:33.030424 7f68024be700 1 -- 192.168.2.85:6805/16374 <== osd.3 192.168.2.85:6809/16133 7 ==== MOSDPGPull(2.a7 122045 [PullOp(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head, recovery_info: ObjectRecoveryInfo(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head@118184'411570, size: 18446744073709551615, copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) v2 ==== 527+0+0 (201856345 0 0) 0x7f68862c9a00 con 0x7f6886d08000 -3> 2017-05-10 08:49:33.030472 7f68024be700 5 -- op tracker -- seq: 126, time: 2017-05-10 08:49:33.030471, event: queued_for_pg, op: MOSDPGPull(2.a7 122045 [PullOp(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head, recovery_info: ObjectRecoveryInfo(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head@118184'411570, size: 18446744073709551615, copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) -2> 2017-05-10 08:49:33.030578 7f67e6ede700 5 -- op tracker -- seq: 126, time: 2017-05-10 08:49:33.030565, event: reached_pg, op: MOSDPGPull(2.a7 122045 [PullOp(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head, recovery_info: ObjectRecoveryInfo(2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head@118184'411570, size: 18446744073709551615, copy_subset: [0~18446744073709551615], clone_subset: {}), recovery_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:false))]) -1> 2017-05-10 08:49:33.057645 7f67e6ede700 -1 bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x9000, got 0x8f1c103f, expected 0xea38c31f, device location [0xb5d3009000~1000], object #2:e515e721:::rb.0.3f969.2ae8944a.000000000404:head# 0> 2017-05-10 08:49:33.062909 7f67e6ede700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/os/bluestore/BlueStore.cc: In function 'virtual int BlueStore::read(ObjectStore::CollectionHandle&, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f67e6ede700 time 2017-05-10 08:49:33.057693 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/os/bluestore/BlueStore.cc: 4967: FAILED assert(allow_eio || r != -5)
Updated by Matthew Via over 6 years ago
This happened to me for months, and eventually I deleted and recreated the osd. However now I'm triggering the bug again, preventing another OSD from starting this. This is reproduced every single time I start the OSD. Is there any chance of getting some traction on this bug?
Some bluefs debug log immediately prior:
-41> 2017-08-28 09:40:47.838526 7f46622c6700 20 bluefs _read_random read buffered 0x21eddb2~f4b of 2:0x4c600000+4200000 -40> 2017-08-28 09:40:47.838612 7f46622c6700 20 bluefs _read_random got 3915 -39> 2017-08-28 09:40:47.838645 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 38984 < min_flush_size 65536 -38> 2017-08-28 09:40:47.838678 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21eecfd~f1b from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -37> 2017-08-28 09:40:47.838691 7f46622c6700 20 bluefs _read_random read buffered 0x21eecfd~f1b of 2:0x4c600000+4200000 -36> 2017-08-28 09:40:47.925350 7f46622c6700 20 bluefs _read_random got 3867 -35> 2017-08-28 09:40:47.925450 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 42899 < min_flush_size 65536 -34> 2017-08-28 09:40:47.925498 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21efc18~f1a from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -33> 2017-08-28 09:40:47.925520 7f46622c6700 20 bluefs _read_random read buffered 0x21efc18~f1a of 2:0x4c600000+4200000 -32> 2017-08-28 09:40:47.925849 7f46622c6700 20 bluefs _read_random got 3866 -31> 2017-08-28 09:40:47.925905 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 46766 < min_flush_size 65536 -30> 2017-08-28 09:40:47.925948 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21f0b32~1025 from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000 ]) -29> 2017-08-28 09:40:47.925966 7f46622c6700 20 bluefs _read_random read buffered 0x21f0b32~1025 of 2:0x4c600000+4200000 -28> 2017-08-28 09:40:47.926137 7f46622c6700 20 bluefs _read_random got 4133 -27> 2017-08-28 09:40:47.926172 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 50632 < min_flush_size 65536 -26> 2017-08-28 09:40:47.926207 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21f1b57~f1b from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -25> 2017-08-28 09:40:47.926221 7f46622c6700 20 bluefs _read_random read buffered 0x21f1b57~f1b of 2:0x4c600000+4200000 -24> 2017-08-28 09:40:47.926500 7f46622c6700 20 bluefs _read_random got 3867 -23> 2017-08-28 09:40:47.926553 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 54765 < min_flush_size 65536 -22> 2017-08-28 09:40:47.926592 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21f2a72~f4d from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -21> 2017-08-28 09:40:47.926608 7f46622c6700 20 bluefs _read_random read buffered 0x21f2a72~f4d of 2:0x4c600000+4200000 -20> 2017-08-28 09:40:47.926792 7f46622c6700 20 bluefs _read_random got 3917 -19> 2017-08-28 09:40:47.926827 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 58632 < min_flush_size 65536 -18> 2017-08-28 09:40:47.926860 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21f39bf~edd from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -17> 2017-08-28 09:40:47.926878 7f46622c6700 20 bluefs _read_random read buffered 0x21f39bf~edd of 2:0x4c600000+4200000 -16> 2017-08-28 09:40:47.927611 7f46622c6700 20 bluefs _read_random got 3805 -15> 2017-08-28 09:40:47.927677 7f46622c6700 10 bluefs _flush 0xa4501e6f00 ignoring, length 62549 < min_flush_size 65536 -14> 2017-08-28 09:40:47.927719 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21f489c~f1b from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -13> 2017-08-28 09:40:47.927736 7f46622c6700 20 bluefs _read_random read buffered 0x21f489c~f1b of 2:0x4c600000+4200000 -12> 2017-08-28 09:40:47.927921 7f46622c6700 20 bluefs _read_random got 3867 -11> 2017-08-28 09:40:47.927958 7f46622c6700 10 bluefs _flush 0xa4501e6f00 0xfb2a20~10332 to file(ino 13815 size 0xfb2a20 mtime 2017-08-28 09:40:47.835276 bdev 2 extents [2:0xf5100000+1000000]) -10> 2017-08-28 09:40:47.927971 7f46622c6700 10 bluefs _flush_range 0xa4501e6f00 pos 0xfb2a20 0xfb2a20~10332 to file(ino 13815 size 0xfb2a20 mtime 2017-08-28 09:40:47.835276 bdev 2 extents [2:0xf510000 0+1000000]) -9> 2017-08-28 09:40:47.927982 7f46622c6700 20 bluefs _flush_range dirty_seq = 3761672 (unchanged, do nothing) -8> 2017-08-28 09:40:47.927986 7f46622c6700 20 bluefs _flush_range file now file(ino 13815 size 0xfc2d52 mtime 2017-08-28 09:40:47.927982 bdev 2 extents [2:0xf5100000+1000000]) -7> 2017-08-28 09:40:47.927993 7f46622c6700 20 bluefs _flush_range in 2:0xf5100000+1000000 x_off 0xfb2a20 -6> 2017-08-28 09:40:47.927997 7f46622c6700 20 bluefs _flush_range using partial tail 0xa20 -5> 2017-08-28 09:40:47.928001 7f46622c6700 20 bluefs _flush_range waiting for previous aio to complete -4> 2017-08-28 09:40:47.928014 7f46622c6700 20 bluefs _flush_range caching tail of 0xd52 and padding block with 0x2ae -3> 2017-08-28 09:40:47.928103 7f46622c6700 20 bluefs _flush_range h 0xa4501e6f00 pos now 0xfc2d52 -2> 2017-08-28 09:40:47.928145 7f46622c6700 10 bluefs _read_random h 0xa444550d80 0x21f57b7~f1b from file(ino 5722 size 0x417a99d mtime 2017-08-26 04:01:44.042656 bdev 2 extents [2:0x4c600000+4200000] ) -1> 2017-08-28 09:40:47.928161 7f46622c6700 20 bluefs _read_random read buffered 0x21f57b7~f1b of 2:0x4c600000+4200000 0> 2017-08-28 09:40:47.934158 7f465c2ba700 -1 *** Caught signal (Aborted) ** in thread 7f465c2ba700 thread_name:tp_osd_tp ceph version 11.2.1 (e0354f9d3b1eea1d75a7dd487ba8098311be38a7) 1: (()+0x9323ca) [0xa4398953ca] 2: (()+0xf370) [0x7f467b915370] 3: (gsignal()+0x37) [0x7f467a7331d7]