Bug #7423
erasure code: ./rados get retrieves a larger object
100%
Description
Steps to reproduce:
rm -fr dev out ; mkdir -p dev ; LC_ALL=C MON=1 OSD=10 bash -x ./vstart.sh -d -n -X -l mon osd ./ceph osd crush rule create-erasure ecruleset erasure-code-k=2 erasure-code-m=1 erasure-code-ruleset-failure-domain=osd ./ceph osd pool create ecpool 12 12 erasure crush_ruleset=ecruleset erasure-code-k=2 erasure-code-m=1 erasure-code-ruleset-failure-domain=osd ./rados --pool ecpool put SOMETHING /etc/group ./rados --pool ecpool get SOMETHING /tmp/SOMETHING
Associated revisions
osd: do not attempt to read past the object size
When reading from a replicated pool, trying to read more than the object
size results in a short read that does not go beyond the object size. In
erasure coded pools, objects are padded and the read will return more
bytes than the object actually contains.
http://tracker.ceph.com/issues/7423 fixes #7423
Signed-off-by: Loic Dachary <loic@dachary.org>
erasure-code: test rados put and get
Check that rados put immediately followed by rados get retrieves exactly
the same content.
http://tracker.ceph.com/issues/7423 refs #7423
Signed-off-by: Loic Dachary <loic@dachary.org>
History
#1 Updated by Loïc Dachary over 9 years ago
- Description updated (diff)
Loic Dachary wrote:
While playing with ECBackend pull request
[...]
Steps to reproduce after compiling:
[...]
#2 Updated by Loïc Dachary over 9 years ago
On ecbackend-8 it fails in the same way ( the crash does not happen immediately )
loic@fold:~/software/ceph/ceph/src$ ./rados --debug-objecter 20 --pool ecpool --conf ceph.conf put SOMETHING /etc/group 2014-02-14 11:02:01.537837 7f27db8d6700 10 client.?.objecter ms_handle_connect 0x15ed550 2014-02-14 11:02:01.537929 7f27db8d6700 10 client.?.objecter resend_mon_ops 2014-02-14 11:02:01.539322 7f27e032f7c0 10 client.4142.objecter maybe_request_map subscribing (onetime) to next osd map 2014-02-14 11:02:01.540422 7f27db8d6700 3 client.4142.objecter handle_osd_map got epochs [26,26] > 0 2014-02-14 11:02:01.540446 7f27db8d6700 3 client.4142.objecter handle_osd_map decoding full epoch 26 2014-02-14 11:02:01.541092 7f27db8d6700 20 client.4142.objecter dump_active .. 0 homeless 2014-02-14 11:02:01.541195 7f27db8d6700 3 client.4142.objecter handle_osd_map ignoring epochs [26,26] <= 26 2014-02-14 11:02:01.541200 7f27db8d6700 20 client.4142.objecter dump_active .. 0 homeless 2014-02-14 11:02:01.541593 7f27e032f7c0 10 client.4142.objecter recalc_op_target tid 1 pgid 3.847441d7 acting [4,6,9] 2014-02-14 11:02:01.541695 7f27e032f7c0 20 client.4142.objecter note: not requesting ack 2014-02-14 11:02:01.541723 7f27e032f7c0 10 client.4142.objecter op_submit oid SOMETHING @3 @3 [writefull 0~1040] tid 1 osd.4 2014-02-14 11:02:01.541739 7f27e032f7c0 15 client.4142.objecter send_op 1 to osd.4 2014-02-14 11:02:01.541784 7f27e032f7c0 5 client.4142.objecter 0 unacked, 1 uncommitted 2014-02-14 11:02:01.542145 7f27db8d6700 10 client.4142.objecter ms_handle_connect 0x15f0680 2014-02-14 11:02:01.551304 7f27db8d6700 10 client.4142.objecter in handle_osd_op_reply 2014-02-14 11:02:01.551324 7f27db8d6700 7 client.4142.objecter handle_osd_op_reply 1 ondisk v 26'1 uv 1 in 3.847441d7 attempt 0 2014-02-14 11:02:01.551350 7f27db8d6700 10 client.4142.objecter op 0 rval 0 len 0 2014-02-14 11:02:01.551361 7f27db8d6700 15 client.4142.objecter handle_osd_op_reply safe 2014-02-14 11:02:01.551365 7f27db8d6700 15 client.4142.objecter handle_osd_op_reply completed tid 1 2014-02-14 11:02:01.551372 7f27db8d6700 15 client.4142.objecter finish_op 1 2014-02-14 11:02:01.551519 7f27db8d6700 5 client.4142.objecter 0 unacked, 0 uncommitted 2014-02-14 11:02:01.551659 7f27e032f7c0 10 client.4142.objecter close_session for osd.4 loic@fold:~/software/ceph/ceph/src$ ./rados --debug-objecter 20 --pool ecpool --conf ceph.conf get SOMETHING /tmp/SOMETHING 2014-02-14 11:02:30.740102 7f96fa05d700 10 client.?.objecter ms_handle_connect 0x2278550 2014-02-14 11:02:30.740195 7f96fa05d700 10 client.?.objecter resend_mon_ops 2014-02-14 11:02:30.742537 7f96feab67c0 10 client.4143.objecter maybe_request_map subscribing (onetime) to next osd map 2014-02-14 11:02:30.743476 7f96fa05d700 3 client.4143.objecter handle_osd_map got epochs [26,26] > 0 2014-02-14 11:02:30.743492 7f96fa05d700 3 client.4143.objecter handle_osd_map decoding full epoch 26 2014-02-14 11:02:30.744126 7f96fa05d700 20 client.4143.objecter dump_active .. 0 homeless 2014-02-14 11:02:30.744220 7f96fa05d700 3 client.4143.objecter handle_osd_map ignoring epochs [26,26] <= 26 2014-02-14 11:02:30.744223 7f96fa05d700 20 client.4143.objecter dump_active .. 0 homeless 2014-02-14 11:02:30.744631 7f96feab67c0 10 client.4143.objecter recalc_op_target tid 1 pgid 3.847441d7 acting [4,6,9] 2014-02-14 11:02:30.744733 7f96feab67c0 20 client.4143.objecter note: not requesting commit 2014-02-14 11:02:30.744758 7f96feab67c0 10 client.4143.objecter op_submit oid SOMETHING @3 @3 [read 0~4194304] tid 1 osd.4 2014-02-14 11:02:30.744773 7f96feab67c0 15 client.4143.objecter send_op 1 to osd.4 2014-02-14 11:02:30.744834 7f96feab67c0 5 client.4143.objecter 1 unacked, 0 uncommitted 2014-02-14 11:02:30.745494 7f96fa05d700 10 client.4143.objecter ms_handle_connect 0x227b6d0 2014-02-14 11:02:30.921451 7f96fa05d700 1 client.4143.objecter ms_handle_reset on osd.4 2014-02-14 11:02:30.921478 7f96fa05d700 10 client.4143.objecter reopen_session osd.4 session, addr now osd.4 127.0.0.1:6820/17681 2014-02-14 11:02:30.921578 7f96fa05d700 10 client.4143.objecter kick_requests for osd.4 2014-02-14 11:02:30.921604 7f96fa05d700 15 client.4143.objecter send_op 1 to osd.4 2014-02-14 11:02:30.921658 7f96f6b74700 0 -- 127.0.0.1:0/1019577 >> 127.0.0.1:6820/17681 pipe(0x7f96e8006ca0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f96e8006f00).fault 2014-02-14 11:02:30.921846 7f96fa05d700 10 client.4143.objecter maybe_request_map subscribing (onetime) to next osd map 2014-02-14 11:02:35.742672 7f96f7e18700 10 client.4143.objecter tick
#3 Updated by Loïc Dachary over 9 years ago
crashes on master as of today
#4 Updated by Loïc Dachary over 9 years ago
- Category set to OSD
- Status changed from New to In Progress
- Assignee set to Loïc Dachary
- Priority changed from Normal to Urgent
- Target version set to 0.78
#5 Updated by Loïc Dachary over 9 years ago
#0 0x00007f98a58b7f77 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56 #1 0x00007f98a58bb5e8 in __GI_abort () at abort.c:90 #2 0x00007f98a61c36e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #3 0x00007f98a61c1856 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #4 0x00007f98a61c1883 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #5 0x00007f98a61c1aae in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #6 0x000000000188a911 in ceph::buffer::list::substr_of (this=0x7f98600f9878, other=..., off=0, len=4194304) at common/buffer.cc:1334 #7 0x0000000001702280 in CallClientContexts::finish (this=0x7f9860053570, in=...) at osd/ECBackend.cc:1539 #8 0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7f9860053570, t=...) at ./include/Context.h:45 #9 0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7f98440b7a40, rop=..., m=0x7f9870b902f0) at osd/ECBackend.cc:978 #10 0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7f98440b7a40, from=..., op=..., m=0x7f9870b902f0) at osd/ECBackend.cc:963 #11 0x00000000016f4bb9 in ECBackend::handle_message (this=0x7f98440b7a40, _op=...) at osd/ECBackend.cc:669 #12 0x00000000015c4290 in ReplicatedPG::do_request (this=0x7f9844039010, op=..., handle=...) at osd/ReplicatedPG.cc:1081 #13 0x00000000012af612 in OSD::dequeue_op (this=0x431b0c0, pg=..., op=..., handle=...) at osd/OSD.cc:7470 #14 0x00000000012af1a9 in OSD::OpWQ::_process (this=0x431bee8, pg=..., handle=...) at osd/OSD.cc:7440 #15 0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process ( this=0x431bee8, handle=...) at ./common/WorkQueue.h:190 #16 0x0000000001877b06 in ThreadPool::worker (this=0x431b530, wt=0x43321d0) at common/WorkQueue.cc:125 #17 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x43321d0) at common/WorkQueue.h:317 #18 0x000000000187024d in Thread::_entry_func (arg=0x43321d0) at common/Thread.cc:41 #19 0x00007f98a71d7f6e in start_thread (arg=0x7f9870b91700) at pthread_create.c:311 #20 0x00007f98a597b9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
#6 Updated by Loïc Dachary over 9 years ago
- File osd.log.gz added
- Tracker changed from Cleanup to Bug
#7 Updated by Loïc Dachary over 9 years ago
The last call will end with abort()
Breakpoint 1, ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06bdcec10) at osd/ECUtil.cc:13 13 uint64_t total_chunk_size = to_decode.begin()->second.length(); (gdb) bt #0 ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06bdcec10) at osd/ECUtil.cc:13 #1 0x00000000017021b9 in CallClientContexts::finish (this=0x7fa058019950, in=...) at osd/ECBackend.cc:1533 #2 0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7fa058019950, t=...) at ./include/Context.h:45 #3 0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7fa03c0cd890, rop=..., m=0x7fa06bdcf2f0) at osd/ECBackend.cc:978 #4 0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7fa03c0cd890, from=..., op=..., m=0x7fa06bdcf2f0) at osd/ECBackend.cc:963 #5 0x00000000016f4bb9 in ECBackend::handle_message (this=0x7fa03c0cd890, _op=...) at osd/ECBackend.cc:669 #6 0x00000000015c4290 in ReplicatedPG::do_request (this=0x7fa03c03f010, op=..., handle=...) at osd/ReplicatedPG.cc:1081 #7 0x00000000012af612 in OSD::dequeue_op (this=0x28c60c0, pg=..., op=..., handle=...) at osd/OSD.cc:7470 #8 0x00000000012af1a9 in OSD::OpWQ::_process (this=0x28c6ee8, pg=..., handle=...) at osd/OSD.cc:7440 #9 0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x28c6ee8, handle=...) at ./common/WorkQueue.h:190 #10 0x0000000001877b06 in ThreadPool::worker (this=0x28c6530, wt=0x28dd1a0) at common/WorkQueue.cc:125 #11 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x28dd1a0) at common/WorkQueue.h:317 #12 0x000000000187024d in Thread::_entry_func (arg=0x28dd1a0) at common/Thread.cc:41 #13 0x00007fa0a21fbf6e in start_thread (arg=0x7fa06bdd0700) at pthread_create.c:311 #14 0x00007fa0a099f9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 (gdb) c Continuing. [Switching to Thread 0x7fa06c621700 (LWP 13216)] Breakpoint 1, ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10) at osd/ECUtil.cc:13 13 uint64_t total_chunk_size = to_decode.begin()->second.length(); (gdb) bt #0 ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10) at osd/ECUtil.cc:13 #1 0x00000000017021b9 in CallClientContexts::finish (this=0x7fa058076ef0, in=...) at osd/ECBackend.cc:1533 #2 0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7fa058076ef0, t=...) at ./include/Context.h:45 #3 0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7fa03c0cd890, rop=..., m=0x7fa06c6202f0) at osd/ECBackend.cc:978 #4 0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7fa03c0cd890, from=..., op=..., m=0x7fa06c6202f0) at osd/ECBackend.cc:963 #5 0x00000000016f4bb9 in ECBackend::handle_message (this=0x7fa03c0cd890, _op=...) at osd/ECBackend.cc:669 #6 0x00000000015c4290 in ReplicatedPG::do_request (this=0x7fa03c03f010, op=..., handle=...) at osd/ReplicatedPG.cc:1081 #7 0x00000000012af612 in OSD::dequeue_op (this=0x28c60c0, pg=..., op=..., handle=...) at osd/OSD.cc:7470 #8 0x00000000012af1a9 in OSD::OpWQ::_process (this=0x28c6ee8, pg=..., handle=...) at osd/OSD.cc:7440 #9 0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x28c6ee8, handle=...) at ./common/WorkQueue.h:190 #10 0x0000000001877b06 in ThreadPool::worker (this=0x28c6530, wt=0x28dd010) at common/WorkQueue.cc:125 #11 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x28dd010) at common/WorkQueue.h:317 #12 0x000000000187024d in Thread::_entry_func (arg=0x28dd010) at common/Thread.cc:41 #13 0x00007fa0a21fbf6e in start_thread (arg=0x7fa06c621700) at pthread_create.c:311 #14 0x00007fa0a099f9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 (gdb) c Continuing. Breakpoint 1, ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10) at osd/ECUtil.cc:13 13 uint64_t total_chunk_size = to_decode.begin()->second.length(); (gdb) bt #0 ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10) at osd/ECUtil.cc:13 #1 0x00000000017021b9 in CallClientContexts::finish (this=0x7fa0640aeb30, in=...) at osd/ECBackend.cc:1533 #2 0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7fa0640aeb30, t=...) at ./include/Context.h:45 #3 0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7fa03c0cd890, rop=..., m=0x7fa06c6202f0) at osd/ECBackend.cc:978 #4 0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7fa03c0cd890, from=..., op=..., m=0x7fa06c6202f0) at osd/ECBackend.cc:963 #5 0x00000000016f4bb9 in ECBackend::handle_message (this=0x7fa03c0cd890, _op=...) at osd/ECBackend.cc:669 #6 0x00000000015c4290 in ReplicatedPG::do_request (this=0x7fa03c03f010, op=..., handle=...) at osd/ReplicatedPG.cc:1081 #7 0x00000000012af612 in OSD::dequeue_op (this=0x28c60c0, pg=..., op=..., handle=...) at osd/OSD.cc:7470 #8 0x00000000012af1a9 in OSD::OpWQ::_process (this=0x28c6ee8, pg=..., handle=...) at osd/OSD.cc:7440 #9 0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x28c6ee8, handle=...) at ./common/WorkQueue.h:190 #10 0x0000000001877b06 in ThreadPool::worker (this=0x28c6530, wt=0x28dd010) at common/WorkQueue.cc:125 #11 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x28dd010) at common/WorkQueue.h:317 #12 0x000000000187024d in Thread::_entry_func (arg=0x28dd010) at common/Thread.cc:41 #13 0x00007fa0a21fbf6e in start_thread (arg=0x7fa06c621700) at pthread_create.c:311 #14 0x00007fa0a099f9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
#8 Updated by Loïc Dachary over 9 years ago
- Description updated (diff)
#9 Updated by Loïc Dachary over 9 years ago
As of today instead of crashing it retrieves an object that is exactly 4096 bytes instead of the original one that was 1040 bytes.
#10 Updated by Loïc Dachary over 9 years ago
- Subject changed from erasure code: get object crashes to erasure code: ./rados get retrieves a larger object
#11 Updated by Loïc Dachary over 9 years ago
- Description updated (diff)
#12 Updated by Loïc Dachary over 9 years ago
- Description updated (diff)
- Status changed from In Progress to Fix Under Review
#13 Updated by Loïc Dachary over 9 years ago
- % Done changed from 0 to 70
It is likely not the proper way to fix it. It seems to work but it may be better to make it so ECBackend.cc does not return data beyond the object size but I did not figure out how to read the object size when in the ECBackend.cc read path.
#14 Updated by Sage Weil over 9 years ago
- Status changed from Fix Under Review to Resolved
#15 Updated by Loïc Dachary over 9 years ago
- % Done changed from 70 to 100