Project

General

Profile

Actions

Bug #7423

closed

erasure code: ./rados get retrieves a larger object

Added by Loïc Dachary about 10 years ago. Updated about 10 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
% Done:

100%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

work in progress

Steps to reproduce:

rm -fr dev out ;  mkdir -p dev ; LC_ALL=C MON=1 OSD=10 bash -x ./vstart.sh -d -n -X -l mon osd
./ceph osd crush rule create-erasure ecruleset erasure-code-k=2 erasure-code-m=1 erasure-code-ruleset-failure-domain=osd
./ceph osd pool create ecpool 12 12 erasure crush_ruleset=ecruleset erasure-code-k=2 erasure-code-m=1 erasure-code-ruleset-failure-domain=osd
./rados --pool ecpool put SOMETHING /etc/group
./rados --pool ecpool get SOMETHING /tmp/SOMETHING


Files

osd.log.gz (553 KB) osd.log.gz full log of crashed OSD Loïc Dachary, 02/19/2014 03:32 PM
Actions #1

Updated by Loïc Dachary about 10 years ago

  • Description updated (diff)

Loic Dachary wrote:

While playing with ECBackend pull request
[...]
Steps to reproduce after compiling:
[...]

Actions #2

Updated by Loïc Dachary about 10 years ago

On ecbackend-8 it fails in the same way ( the crash does not happen immediately )

loic@fold:~/software/ceph/ceph/src$ ./rados --debug-objecter 20 --pool ecpool --conf ceph.conf put SOMETHING /etc/group
2014-02-14 11:02:01.537837 7f27db8d6700 10 client.?.objecter ms_handle_connect 0x15ed550
2014-02-14 11:02:01.537929 7f27db8d6700 10 client.?.objecter resend_mon_ops
2014-02-14 11:02:01.539322 7f27e032f7c0 10 client.4142.objecter maybe_request_map subscribing (onetime) to next osd map
2014-02-14 11:02:01.540422 7f27db8d6700  3 client.4142.objecter handle_osd_map got epochs [26,26] > 0
2014-02-14 11:02:01.540446 7f27db8d6700  3 client.4142.objecter handle_osd_map decoding full epoch 26
2014-02-14 11:02:01.541092 7f27db8d6700 20 client.4142.objecter dump_active .. 0 homeless
2014-02-14 11:02:01.541195 7f27db8d6700  3 client.4142.objecter handle_osd_map ignoring epochs [26,26] <= 26
2014-02-14 11:02:01.541200 7f27db8d6700 20 client.4142.objecter dump_active .. 0 homeless
2014-02-14 11:02:01.541593 7f27e032f7c0 10 client.4142.objecter recalc_op_target tid 1 pgid 3.847441d7 acting [4,6,9]
2014-02-14 11:02:01.541695 7f27e032f7c0 20 client.4142.objecter  note: not requesting ack
2014-02-14 11:02:01.541723 7f27e032f7c0 10 client.4142.objecter op_submit oid SOMETHING @3 @3 [writefull 0~1040] tid 1 osd.4
2014-02-14 11:02:01.541739 7f27e032f7c0 15 client.4142.objecter send_op 1 to osd.4
2014-02-14 11:02:01.541784 7f27e032f7c0  5 client.4142.objecter 0 unacked, 1 uncommitted
2014-02-14 11:02:01.542145 7f27db8d6700 10 client.4142.objecter ms_handle_connect 0x15f0680
2014-02-14 11:02:01.551304 7f27db8d6700 10 client.4142.objecter in handle_osd_op_reply
2014-02-14 11:02:01.551324 7f27db8d6700  7 client.4142.objecter handle_osd_op_reply 1 ondisk v 26'1 uv 1 in 3.847441d7 attempt 0
2014-02-14 11:02:01.551350 7f27db8d6700 10 client.4142.objecter  op 0 rval 0 len 0
2014-02-14 11:02:01.551361 7f27db8d6700 15 client.4142.objecter handle_osd_op_reply safe
2014-02-14 11:02:01.551365 7f27db8d6700 15 client.4142.objecter handle_osd_op_reply completed tid 1
2014-02-14 11:02:01.551372 7f27db8d6700 15 client.4142.objecter finish_op 1
2014-02-14 11:02:01.551519 7f27db8d6700  5 client.4142.objecter 0 unacked, 0 uncommitted
2014-02-14 11:02:01.551659 7f27e032f7c0 10 client.4142.objecter close_session for osd.4
loic@fold:~/software/ceph/ceph/src$ ./rados --debug-objecter 20 --pool ecpool --conf ceph.conf get SOMETHING /tmp/SOMETHING
2014-02-14 11:02:30.740102 7f96fa05d700 10 client.?.objecter ms_handle_connect 0x2278550
2014-02-14 11:02:30.740195 7f96fa05d700 10 client.?.objecter resend_mon_ops
2014-02-14 11:02:30.742537 7f96feab67c0 10 client.4143.objecter maybe_request_map subscribing (onetime) to next osd map
2014-02-14 11:02:30.743476 7f96fa05d700  3 client.4143.objecter handle_osd_map got epochs [26,26] > 0
2014-02-14 11:02:30.743492 7f96fa05d700  3 client.4143.objecter handle_osd_map decoding full epoch 26
2014-02-14 11:02:30.744126 7f96fa05d700 20 client.4143.objecter dump_active .. 0 homeless
2014-02-14 11:02:30.744220 7f96fa05d700  3 client.4143.objecter handle_osd_map ignoring epochs [26,26] <= 26
2014-02-14 11:02:30.744223 7f96fa05d700 20 client.4143.objecter dump_active .. 0 homeless
2014-02-14 11:02:30.744631 7f96feab67c0 10 client.4143.objecter recalc_op_target tid 1 pgid 3.847441d7 acting [4,6,9]
2014-02-14 11:02:30.744733 7f96feab67c0 20 client.4143.objecter  note: not requesting commit
2014-02-14 11:02:30.744758 7f96feab67c0 10 client.4143.objecter op_submit oid SOMETHING @3 @3 [read 0~4194304] tid 1 osd.4
2014-02-14 11:02:30.744773 7f96feab67c0 15 client.4143.objecter send_op 1 to osd.4
2014-02-14 11:02:30.744834 7f96feab67c0  5 client.4143.objecter 1 unacked, 0 uncommitted
2014-02-14 11:02:30.745494 7f96fa05d700 10 client.4143.objecter ms_handle_connect 0x227b6d0
2014-02-14 11:02:30.921451 7f96fa05d700  1 client.4143.objecter ms_handle_reset on osd.4
2014-02-14 11:02:30.921478 7f96fa05d700 10 client.4143.objecter reopen_session osd.4 session, addr now osd.4 127.0.0.1:6820/17681
2014-02-14 11:02:30.921578 7f96fa05d700 10 client.4143.objecter kick_requests for osd.4
2014-02-14 11:02:30.921604 7f96fa05d700 15 client.4143.objecter send_op 1 to osd.4
2014-02-14 11:02:30.921658 7f96f6b74700  0 -- 127.0.0.1:0/1019577 >> 127.0.0.1:6820/17681 pipe(0x7f96e8006ca0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f96e8006f00).fault
2014-02-14 11:02:30.921846 7f96fa05d700 10 client.4143.objecter maybe_request_map subscribing (onetime) to next osd map
2014-02-14 11:02:35.742672 7f96f7e18700 10 client.4143.objecter tick

Actions #3

Updated by Loïc Dachary about 10 years ago

crashes on master as of today

Actions #4

Updated by Loïc Dachary about 10 years ago

  • Category set to OSD
  • Status changed from New to In Progress
  • Assignee set to Loïc Dachary
  • Priority changed from Normal to Urgent
  • Target version set to 0.78
Actions #5

Updated by Loïc Dachary about 10 years ago

#0  0x00007f98a58b7f77 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f98a58bb5e8 in __GI_abort () at abort.c:90
#2  0x00007f98a61c36e5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f98a61c1856 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f98a61c1883 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f98a61c1aae in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x000000000188a911 in ceph::buffer::list::substr_of (this=0x7f98600f9878, other=..., off=0, len=4194304) at common/buffer.cc:1334
#7  0x0000000001702280 in CallClientContexts::finish (this=0x7f9860053570, in=...) at osd/ECBackend.cc:1539
#8  0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7f9860053570, t=...) at ./include/Context.h:45
#9  0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7f98440b7a40, rop=..., m=0x7f9870b902f0) at osd/ECBackend.cc:978
#10 0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7f98440b7a40, from=..., op=..., m=0x7f9870b902f0) at osd/ECBackend.cc:963
#11 0x00000000016f4bb9 in ECBackend::handle_message (this=0x7f98440b7a40, _op=...) at osd/ECBackend.cc:669
#12 0x00000000015c4290 in ReplicatedPG::do_request (this=0x7f9844039010, op=..., handle=...) at osd/ReplicatedPG.cc:1081
#13 0x00000000012af612 in OSD::dequeue_op (this=0x431b0c0, pg=..., op=..., handle=...) at osd/OSD.cc:7470
#14 0x00000000012af1a9 in OSD::OpWQ::_process (this=0x431bee8, pg=..., handle=...) at osd/OSD.cc:7440
#15 0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (
    this=0x431bee8, handle=...) at ./common/WorkQueue.h:190
#16 0x0000000001877b06 in ThreadPool::worker (this=0x431b530, wt=0x43321d0) at common/WorkQueue.cc:125
#17 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x43321d0) at common/WorkQueue.h:317
#18 0x000000000187024d in Thread::_entry_func (arg=0x43321d0) at common/Thread.cc:41
#19 0x00007f98a71d7f6e in start_thread (arg=0x7f9870b91700) at pthread_create.c:311
#20 0x00007f98a597b9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
Actions #6

Updated by Loïc Dachary about 10 years ago

Actions #7

Updated by Loïc Dachary about 10 years ago

The last call will end with abort()


Breakpoint 1, ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06bdcec10)
    at osd/ECUtil.cc:13
13      uint64_t total_chunk_size = to_decode.begin()->second.length();
(gdb) bt
#0  ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06bdcec10) at osd/ECUtil.cc:13
#1  0x00000000017021b9 in CallClientContexts::finish (this=0x7fa058019950, in=...)
    at osd/ECBackend.cc:1533
#2  0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7fa058019950, t=...) at ./include/Context.h:45
#3  0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7fa03c0cd890, rop=..., m=0x7fa06bdcf2f0)
    at osd/ECBackend.cc:978
#4  0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7fa03c0cd890, from=..., op=..., 
    m=0x7fa06bdcf2f0) at osd/ECBackend.cc:963
#5  0x00000000016f4bb9 in ECBackend::handle_message (this=0x7fa03c0cd890, _op=...)
    at osd/ECBackend.cc:669
#6  0x00000000015c4290 in ReplicatedPG::do_request (this=0x7fa03c03f010, op=..., handle=...)
    at osd/ReplicatedPG.cc:1081
#7  0x00000000012af612 in OSD::dequeue_op (this=0x28c60c0, pg=..., op=..., handle=...)
    at osd/OSD.cc:7470
#8  0x00000000012af1a9 in OSD::OpWQ::_process (this=0x28c6ee8, pg=..., handle=...) at osd/OSD.cc:7440
#9  0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x28c6ee8, handle=...)
    at ./common/WorkQueue.h:190
#10 0x0000000001877b06 in ThreadPool::worker (this=0x28c6530, wt=0x28dd1a0) at common/WorkQueue.cc:125
#11 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x28dd1a0) at common/WorkQueue.h:317
#12 0x000000000187024d in Thread::_entry_func (arg=0x28dd1a0) at common/Thread.cc:41
#13 0x00007fa0a21fbf6e in start_thread (arg=0x7fa06bdd0700) at pthread_create.c:311
#14 0x00007fa0a099f9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) c
Continuing.
[Switching to Thread 0x7fa06c621700 (LWP 13216)]

Breakpoint 1, ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10)
    at osd/ECUtil.cc:13
13      uint64_t total_chunk_size = to_decode.begin()->second.length();
(gdb) bt
#0  ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10) at osd/ECUtil.cc:13
#1  0x00000000017021b9 in CallClientContexts::finish (this=0x7fa058076ef0, in=...)
    at osd/ECBackend.cc:1533
#2  0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7fa058076ef0, t=...) at ./include/Context.h:45
#3  0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7fa03c0cd890, rop=..., m=0x7fa06c6202f0)
    at osd/ECBackend.cc:978
#4  0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7fa03c0cd890, from=..., op=..., 
    m=0x7fa06c6202f0) at osd/ECBackend.cc:963
#5  0x00000000016f4bb9 in ECBackend::handle_message (this=0x7fa03c0cd890, _op=...)
    at osd/ECBackend.cc:669
#6  0x00000000015c4290 in ReplicatedPG::do_request (this=0x7fa03c03f010, op=..., handle=...)
    at osd/ReplicatedPG.cc:1081
#7  0x00000000012af612 in OSD::dequeue_op (this=0x28c60c0, pg=..., op=..., handle=...)
    at osd/OSD.cc:7470
#8  0x00000000012af1a9 in OSD::OpWQ::_process (this=0x28c6ee8, pg=..., handle=...) at osd/OSD.cc:7440
#9  0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x28c6ee8, handle=...)
    at ./common/WorkQueue.h:190
#10 0x0000000001877b06 in ThreadPool::worker (this=0x28c6530, wt=0x28dd010) at common/WorkQueue.cc:125
#11 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x28dd010) at common/WorkQueue.h:317
#12 0x000000000187024d in Thread::_entry_func (arg=0x28dd010) at common/Thread.cc:41
#13 0x00007fa0a21fbf6e in start_thread (arg=0x7fa06c621700) at pthread_create.c:311
#14 0x00007fa0a099f9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
(gdb) c
Continuing.

Breakpoint 1, ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10)
    at osd/ECUtil.cc:13
13      uint64_t total_chunk_size = to_decode.begin()->second.length();
(gdb) bt
#0  ECUtil::decode (sinfo=..., ec_impl=..., to_decode=..., out=0x7fa06c61fc10) at osd/ECUtil.cc:13
#1  0x00000000017021b9 in CallClientContexts::finish (this=0x7fa0640aeb30, in=...)
    at osd/ECBackend.cc:1533
#2  0x0000000001707146 in GenContext<std::pair<RecoveryMessages*, ECBackend::read_result_t&>&>::complete (this=0x7fa0640aeb30, t=...) at ./include/Context.h:45
#3  0x00000000016f70f2 in ECBackend::complete_read_op (this=0x7fa03c0cd890, rop=..., m=0x7fa06c6202f0)
    at osd/ECBackend.cc:978
#4  0x00000000016f6f69 in ECBackend::handle_sub_read_reply (this=0x7fa03c0cd890, from=..., op=..., 
    m=0x7fa06c6202f0) at osd/ECBackend.cc:963
#5  0x00000000016f4bb9 in ECBackend::handle_message (this=0x7fa03c0cd890, _op=...)
    at osd/ECBackend.cc:669
#6  0x00000000015c4290 in ReplicatedPG::do_request (this=0x7fa03c03f010, op=..., handle=...)
    at osd/ReplicatedPG.cc:1081
#7  0x00000000012af612 in OSD::dequeue_op (this=0x28c60c0, pg=..., op=..., handle=...)
    at osd/OSD.cc:7470
#8  0x00000000012af1a9 in OSD::OpWQ::_process (this=0x28c6ee8, pg=..., handle=...) at osd/OSD.cc:7440
#9  0x0000000001353d80 in ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process (this=0x28c6ee8, handle=...)
    at ./common/WorkQueue.h:190
#10 0x0000000001877b06 in ThreadPool::worker (this=0x28c6530, wt=0x28dd010) at common/WorkQueue.cc:125
#11 0x000000000187997d in ThreadPool::WorkThread::entry (this=0x28dd010) at common/WorkQueue.h:317
#12 0x000000000187024d in Thread::_entry_func (arg=0x28dd010) at common/Thread.cc:41
#13 0x00007fa0a21fbf6e in start_thread (arg=0x7fa06c621700) at pthread_create.c:311
#14 0x00007fa0a099f9cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Actions #8

Updated by Loïc Dachary about 10 years ago

  • Description updated (diff)
Actions #9

Updated by Loïc Dachary about 10 years ago

As of today instead of crashing it retrieves an object that is exactly 4096 bytes instead of the original one that was 1040 bytes.

Actions #10

Updated by Loïc Dachary about 10 years ago

  • Subject changed from erasure code: get object crashes to erasure code: ./rados get retrieves a larger object
Actions #11

Updated by Loïc Dachary about 10 years ago

  • Description updated (diff)
Actions #12

Updated by Loïc Dachary about 10 years ago

  • Description updated (diff)
  • Status changed from In Progress to Fix Under Review
Actions #13

Updated by Loïc Dachary about 10 years ago

  • % Done changed from 0 to 70

It is likely not the proper way to fix it. It seems to work but it may be better to make it so ECBackend.cc does not return data beyond the object size but I did not figure out how to read the object size when in the ECBackend.cc read path.

Actions #14

Updated by Sage Weil about 10 years ago

  • Status changed from Fix Under Review to Resolved
Actions #15

Updated by Loïc Dachary about 10 years ago

  • % Done changed from 70 to 100
Actions

Also available in: Atom PDF