Fix #8914

osd crashed at assert ReplicatedBackend::build_push_op

Added by Sahana Lokeshappa over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
OSD
Target version:
% Done:

100%

Source:
Community (dev)
Tags:
Backport:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Steps to reproduce

./stop.sh
rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
# sleep 60 # comment this out and the problem does not show
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary

It crashes in build_push_op because get_omap_iterator returns a null iterator when the file has been removed.

Original description

OSD crashed with an assert in ReplicatedBackend::build_push_op.
Steps Followed:

sudo ceph pg map 3.151
osdmap e1274 pg 3.151 (3.151) -> up [2,9,20] acting [2,9,20]
I removed object file1 (an object inserted using rados) with rm -f on /var/lib/ceph/osd/ceph-9/current/3.151/file1* and /var/lib/ceph/osd/ceph-2/current/3.151/file1*

Checked for scrub errors using:
ceph pg scrub 3.151
ceph -w showed scrub errors on 2 and 9
Ran command:
ceph osd repair 2
Got a segfault in osd.2:
2014-06-16 10:33:19.906324 7fb9e9543700 0 log [ERR] : 3.151 shard 2 missing a086551/file1/head//3
2014-06-16 10:33:19.906330 7fb9e9543700 0 log [ERR] : 3.151 shard 9 missing a086551/file1/head//3
2014-06-16 10:33:19.906362 7fb9e9543700 0 log [ERR] : 3.151 repair 1 missing, 0 inconsistent objects
2014-06-16 10:33:19.906378 7fb9e9543700 0 log [ERR] : 3.151 repair 2 errors, 2 fixed
2014-06-16 10:33:19.924977 7fb9e9d44700 -1 *** Caught signal (Segmentation fault) **
in thread 7fb9e9d44700

1: /usr/bin/ceph-osd() [0x974a1f]
2: (()+0x10340) [0x7fba089de340]
3: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0xc1c) [0x7d209c]
4: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x3d8) [0x7d2b48]
5: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0x3af) [0x7d6b8f]
6: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x1af) [0x7d9c6f]
7: (C_ReplicatedBackend_OnPullComplete::finish(ThreadPool::TPHandle&)+0x143) [0x84b083]
8: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x9) [0x661a09]
9: (ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish(ThreadPool::TPHandle&)+0x95) [0x824f65]
10: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x9) [0x661a09]
11: (ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>, GenContext<ThreadPool::TPHandle&>>::_void_process(void*, ThreadPool::TPHandle&)+0x62) [0x6697d2]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa4b351]
13: (ThreadPool::WorkThread::entry()+0x10) [0xa4c440]
14: (()+0x8182) [0x7fba089d6182]
15: (clone()+0x6d) [0x7fba06d7730d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Files

GOOD-1-removed-primary.txt (1.88 MB) - primary osd log after removed file from primary + scrub is fine - Loïc Dachary, 08/22/2014 06:03 AM
BAD-removed-primary.txt (1.32 MB) - primary osd log after removed file from primary + scrub crashes - Loïc Dachary, 08/22/2014 06:03 AM

Related issues 2 (1 open, 1 closed)

Related to Ceph - Bug #9114: osd: segv in build_push_op (Duplicate) 08/14/2014

Related to RADOS - Feature #9328: osd: generalize the scrub workflow (New)

Actions #1

Updated by Dhiraj Kamble over 9 years ago

This SEGV is caused by a NULL omap iterator.
This can happen if the object belonging to the primary OSD is missing from the underlying filesystem or if the underlying filesystem itself is corrupted.

In this case a lookup (::stat()) on the object fails and returns ENOENT, which causes get_omap_iterator to return a NULL iterator.

Fix:
assert if the object map iterator returned is NULL, so that I/Os can be served from the other OSDs.
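
A toy, self-contained C++ sketch of what that check amounts to (hypothetical names and types, not the actual Ceph code or the merged patch): the lookup hands back a null shared pointer once the object's file is gone, and the caller asserts on it instead of dereferencing NULL.

#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-in for ObjectMap::ObjectMapIterator.
struct OmapIter {
  std::map<std::string, std::string>::const_iterator it, end;
  bool valid() const { return it != end; }
  void next() { ++it; }
};

// Stand-in for the object store: an object removed from disk has no entry.
std::map<std::string, std::map<std::string, std::string>> store;

std::shared_ptr<OmapIter> get_omap_iterator(const std::string& oid) {
  auto o = store.find(oid);
  if (o == store.end())
    return nullptr;  // lookup fails (ENOENT) -> null iterator
  return std::make_shared<OmapIter>(OmapIter{o->second.begin(), o->second.end()});
}

int main() {
  // "SOMETHING" was rm'ed from the primary's filestore, so the lookup fails.
  auto iter = get_omap_iterator("SOMETHING");
  // The proposed fix: assert instead of segfaulting on iter->valid(), so the
  // OSD dies in a well-defined place and the remaining OSDs serve the I/O.
  assert(iter);
  for (; iter->valid(); iter->next()) { /* copy omap entries into the push op */ }
  return 0;
}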

Actions #2

Updated by Dhiraj Kamble over 9 years ago

Dhiraj Kamble wrote:

This SEGV is caused by a NULL omap iterator.
This can happen if the object belonging to the primary OSD is missing from the underlying filesystem or if the underlying filesystem itself is corrupted.

In this case a lookup (::stat()) on the object fails and returns ENOENT, which causes get_omap_iterator to return a NULL iterator.

Fix:
assert if the object map iterator returned is NULL, so that I/Os can be served from the other OSDs.

Actions #3

Updated by Greg Farnum over 9 years ago

  • Tracker changed from Bug to Fix
  • Assignee deleted (Greg Farnum)
  • Priority changed from High to Normal
  • Source changed from other to Community (dev)

Did you mean to add some comments on that second post, Dhiraj?

Anyway, this looks like a kind of disk corruption which we can handle more gracefully, but it'll need to get scheduled. It looks like we just need to pay attention to some return codes in build_push_op(), but if you could upload a full crash log that would be more certain, Sahana.
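
As a rough illustration of what "paying attention to return codes" could look like (toy code with made-up names, not a patch against the real build_push_op): the push path surfaces the error to its caller instead of asserting.

#include <cerrno>
#include <cstdio>
#include <memory>

struct OmapIter {};  // stand-in for ObjectMap::ObjectMapIterator
struct PushOp {};    // stand-in for the real PushOp message

// Hypothetical build_push_op: report a vanished object through the return
// value rather than crashing on the null iterator.
int build_push_op(const std::shared_ptr<OmapIter>& iter, PushOp* op) {
  if (!iter)
    return -ENOENT;     // object is missing from the local store
  // ... fill *op from the iterator ...
  return 0;
}

int main() {
  PushOp op;
  int r = build_push_op(nullptr, &op);   // simulate the removed file
  if (r < 0) {
    // The caller (prep_push currently does assert(r == 0)) could instead
    // mark the object missing and let recovery fetch it from a replica.
    std::fprintf(stderr, "build_push_op failed: %d\n", r);
    return 1;
  }
  return 0;
}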

Actions #4

Updated by Dhiraj Kamble over 9 years ago

Hi Greg,

No, I did not intend to add any comments.

The reason I thought we should assert is so that we can serve I/Os from the other OSDs, in case the object itself has been explicitly removed or is corrupted.

Also, I see that we expect build_push_op to return success in ReplicatedBackend::prep_push; otherwise we assert anyway.

void ReplicatedBackend::prep_push(
...
..

  int r = build_push_op(pi.recovery_info,
                        pi.recovery_progress,
                        &new_progress,
                        pop,
                        &(pi.stat));
  assert(r == 0); // <<< expect a success; else we assert.
  pi.recovery_progress = new_progress;
}

So I thought it would be good to assert if the omap iterator is NULL in build_push_op, or do you think it would be better to reset the recovery info and then assert? In any case I don't see any changes we could make on a failure return value from build_push_op.

Let me know your thoughts on how we can handle this gracefully.

cheers,
Dhiraj

Actions #5

Updated by Dhiraj Kamble over 9 years ago

BTW, the steps to reproduce this issue are mentioned by Sahana above, and it can be reproduced on a single node too.

For a given object, remove its corresponding file from the filesystem of the primary OSD (../primary_osd/current/pg_dir/).
Initiate a scrub on the PG to which the object belongs.
Initiate a repair on the primary OSD.

You'll hit the assert.

Actions #6

Updated by Samuel Just over 9 years ago

  • Assignee set to Samuel Just
  • Priority changed from Normal to High
Actions #7

Updated by Samuel Just over 9 years ago

Hmm, most likely a bug in repair. We should start by creating a teuthology task which reproduces the bug. Once we have that, we can fix it and add the test to the test suite.

Actions #8

Updated by Ian Colle over 9 years ago

  • Target version set to 0.86
Actions #9

Updated by Loïc Dachary over 9 years ago

  • Assignee changed from Samuel Just to Loïc Dachary
Actions #10

Updated by Loïc Dachary over 9 years ago

  • Status changed from New to Need More Info

I'm not able to reproduce the problem on ceph version 0.84-343-g92b227e (92b227e1c0b1533c359e74c81de58140b483ee8e). Could you please confirm that you can and let me know which version of ceph you are using?

$ rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
...
$ ./rados --pool rbd put SOMETHING /etc/group
$ ceph osd map rbd SOMETHING
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osdmap e9 pool 'rbd' (0) object 'SOMETHING' -> pg 0.847441d7 (0.7) -> up ([1,0,2], p1) acting ([1,0,2], p1)
$ rm dev/osd1/current/0.7_head/SOMETHING__head_847441D7__0 
$ ceph pg scrub 0.7
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
instructing pg 0.7 on osd.1 to scrub
$ ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster d074b6b4-2163-4a3a-b8fe-efdc44a57d97
     health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; too few pgs per osd (2 < min 10); mon.a low disk space
     monmap e1: 1 mons at {a=127.0.0.1:6789/0}, election epoch 2, quorum 0 a
     osdmap e9: 3 osds: 3 up, 3 in
      pgmap v15: 8 pgs, 1 pools, 1190 bytes data, 1 objects
            415 GB used, 131 GB / 547 GB avail
                   7 active+clean
                   1 active+clean+inconsistent
$ ceph osd repair 1
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.1 instructed to repair
$ ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster d074b6b4-2163-4a3a-b8fe-efdc44a57d97
     health HEALTH_WARN too few pgs per osd (2 < min 10); mon.a low disk space
     monmap e1: 1 mons at {a=127.0.0.1:6789/0}, election epoch 2, quorum 0 a
     osdmap e9: 3 osds: 3 up, 3 in
      pgmap v18: 8 pgs, 1 pools, 1190 bytes data, 1 objects
            415 GB used, 131 GB / 547 GB avail
                   8 active+clean
$ ceph osd tree
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
# id    weight    type name    up/down    reweight
-1    3    root default
-2    3        host fold
0    1            osd.0    up    1    
1    1            osd.1    up    1    
2    1            osd.2    up    1    
$ grep -F '[ERR] 0.7' out/osd.1.log
2014-08-19 23:10:31.476732 7f5697ee3700 10  will send 2014-08-19 23:10:24.514431 osd.1 127.0.0.1:6810/21923 1 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:10:31.476845 7f5697ee3700 10  will send 2014-08-19 23:10:24.514649 osd.1 127.0.0.1:6810/21923 2 : [ERR] 0.7 scrub 1 missing, 0 inconsistent objects
2014-08-19 23:10:31.476908 7f5697ee3700 10  will send 2014-08-19 23:10:24.514657 osd.1 127.0.0.1:6810/21923 3 : [ERR] 0.7 scrub 1 errors
2014-08-19 23:10:31.594350 7f56a0c44700 10  logged 2014-08-19 23:10:24.514431 osd.1 127.0.0.1:6810/21923 1 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:10:31.594376 7f56a0c44700 10  logged 2014-08-19 23:10:24.514649 osd.1 127.0.0.1:6810/21923 2 : [ERR] 0.7 scrub 1 missing, 0 inconsistent objects
2014-08-19 23:10:31.594394 7f56a0c44700 10  logged 2014-08-19 23:10:24.514657 osd.1 127.0.0.1:6810/21923 3 : [ERR] 0.7 scrub 1 errors
2014-08-19 23:11:11.478549 7f5697ee3700 10  will send 2014-08-19 23:11:01.531118 osd.1 127.0.0.1:6810/21923 4 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:11:11.478723 7f5697ee3700 10  will send 2014-08-19 23:11:01.531379 osd.1 127.0.0.1:6810/21923 5 : [ERR] 0.7 repair 1 missing, 0 inconsistent objects
2014-08-19 23:11:11.478779 7f5697ee3700 10  will send 2014-08-19 23:11:01.531481 osd.1 127.0.0.1:6810/21923 6 : [ERR] 0.7 repair 1 errors, 1 fixed
2014-08-19 23:11:11.548100 7f56a0c44700 10  logged 2014-08-19 23:11:01.531118 osd.1 127.0.0.1:6810/21923 4 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:11:11.548131 7f56a0c44700 10  logged 2014-08-19 23:11:01.531379 osd.1 127.0.0.1:6810/21923 5 : [ERR] 0.7 repair 1 missing, 0 inconsistent objects
2014-08-19 23:11:11.548151 7f56a0c44700 10  logged 2014-08-19 23:11:01.531481 osd.1 127.0.0.1:6810/21923 6 : [ERR] 0.7 repair 1 errors, 1 fixed
$ 

Actions #11

Updated by Dhiraj Kamble over 9 years ago

Hi Loic,

Please find below the steps to reproduce the issue.

#:/build/ceph-firefly84/src# ./ceph -v
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)

#:/build/ceph-firefly84/src# ./ceph osd tree
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
# id    weight    type name    up/down    reweight
-1    3    root default
-3    3        rack localrack
-2    3            host localhost
0    1                osd.0    up    1
1    1                osd.1    up    1
2    1                osd.2    up    1

Steps:

#:/build/ceph-firefly84/src# ./rbd create img --size 1024 --image-format 2
#:/build/ceph-firefly84/src# ./rbd map img
/dev/rbd1

#:/build/ceph-firefly84/src# ./rados lspools
rbd

#:/build/ceph-firefly84/src# dd if=/dev/urandom of=/dev/rbd1 bs=4M oflag=direct count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.457051 s, 9.2 MB/s

#:/build/ceph-firefly84/src# ./ceph pg dump | egrep "^0.20"
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
dumped all in format plain
0.20 1 0 0 0 0 4194304 8 8 active+clean 2014-08-21 15:14:21.859898 11'8 11:103 [1,2,0] 1 [1,2,0] 1 0'0 2014-08-21 15:13:53.645589 0'0 2014-08-21 15:13:53.645589

#:/build/ceph-firefly84/src# ls ./dev/osd0/current/0.20_head/
rbd\udata.100d6b8b4567.0000000000000000__head_567EC760__0

#:/build/ceph-firefly84/src# ls ./dev/osd1/current/0.20_head/
rbd\udata.100d6b8b4567.0000000000000000__head_567EC760__0

#:/build/ceph-firefly84/src# ls ./dev/osd2/current/0.20_head/
rbd\udata.100d6b8b4567.0000000000000000__head_567EC760__0

#:/build/ceph-firefly84/src# ./ceph pg map 0.20
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osdmap e11 pg 0.20 (0.20) -> up [1,2,0] acting [1,2,0]

#:/build/ceph-firefly84/src# rm ./dev/osd1/current/0.20_head/rbd\\udata.100d6b8b4567.0000000000000000__head_567EC760__0
#:/build/ceph-firefly84/src# rm ./dev/osd2/current/0.20_head/rbd\\udata.100d6b8b4567.0000000000000000__head_567EC760__0

#:/build/ceph-firefly84/src# ./ceph osd repair 1
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.1 instructed to repair


Primary osd - osd.1.log:

2014-08-21 15:32:20.483804 7fa6a6847700 -1 *** Caught signal (Segmentation fault) ** in thread 7fa6a6847700

ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x1676821]
2: ./ceph-osd() [0x179eab2]
3: (()+0x10340) [0x7fa6c51d0340]
4: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0x5ff) [0x159cce3]
5: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x1a5) [0x1599147]
6: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0xc6a) [0x1598cb0]
7: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x2c1) [0x15aa959]
8: (C_ReplicatedBackend_OnPullComplete::finish(ThreadPool::TPHandle&)+0x103) [0x15d3ddf]
9: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x2a) [0x131a0c8]
10: (ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish(ThreadPool::TPHandle&)+0xa7) [0x1626677]
11: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x2a) [0x131a0c8]
12: (GenContextWQ::_process(GenContext<ThreadPool::TPHandle&>*, ThreadPool::TPHandle&)+0x32) [0x13004f8]
13: (ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>*, GenContext<ThreadPool::TPHandle&>*>::_void_process(void*, ThreadPool::TPHandle&)+0xb2) [0x13a091a]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x18da2d6]
15: (ThreadPool::WorkThread::entry()+0x23) [0x18de2c7]
16: (Thread::entry_wrapper()+0x79) [0x18d2a47]
17: (Thread::_entry_func(void*)+0x18) [0x18d29c4]
18: (()+0x8182) [0x7fa6c51c8182]
19: (clone()+0x6d) [0x7fa6c37b630d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Actions #12

Updated by Dhiraj Kamble over 9 years ago

I missed mentioning a step:
before I did a repair on the primary OSD, I also did a scrub:

#:/build/ceph-firefly84/src# ./ceph pg scrub 0.20
#:/build/ceph-firefly84/src# ./ceph osd repair 1

Note: I deleted the files from the primary as well as another replica (OSDs 1 and 2).

Actions #13

Updated by Loïc Dachary over 9 years ago

  • Status changed from Need More Info to In Progress

Thanks for the update, will try again :-)

Actions #14

Updated by Loïc Dachary over 9 years ago

The following reproduces it reliably on my laptop:

ulimit -c 300000
./stop.sh
rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rbd create img --size 1024 --image-format 2
sudo ./rbd map img
sudo dd if=/dev/urandom of=/dev/rbd1 bs=4M oflag=direct count=1
ceph osd tree
rm dev/osd[01]/current/*/*rbd*data*
ceph pg scrub 0.2
ceph osd repair 0
sleep 60
ceph osd tree

and gdb on the core shows the following backtrace:
#1  0x000000000179ecbc in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  0x000000000179ef8f in handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x000000000159cd71 in ReplicatedBackend::build_push_op (this=0x3878b40, recovery_info=..., progress=..., out_progress=0x7ffc6c7a8100, out_op=0x39c0000, stat=0x389f448) at osd/ReplicatedPG.cc:8760
#5  0x00000000015991d5 in ReplicatedBackend::prep_push (this=0x3878b40, obc=..., soid=..., peer=..., version=..., data_subset=..., clone_subsets=..., pop=0x39c0000) at osd/ReplicatedPG.cc:8369
#6  0x0000000001598d3e in ReplicatedBackend::prep_push_to_replica (this=0x3878b40, obc=..., soid=..., peer=..., pop=0x39c0000) at osd/ReplicatedPG.cc:8324
#7  0x00000000015aa9e7 in ReplicatedBackend::start_pushes (this=0x3878b40, soid=..., obc=..., h=0x3b17bf0) at osd/ReplicatedPG.cc:10202
#8  0x00000000015d3e31 in C_ReplicatedBackend_OnPullComplete::finish (this=0x39e3bd0, handle=...) at osd/ReplicatedPG.cc:2218
#9  0x000000000131a0dc in GenContext<ThreadPool::TPHandle&>::complete (this=0x39e3bd0, t=...) at ./include/Context.h:45
#10 0x00000000016266c9 in ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish (this=0x3d35420, t=...) at osd/ReplicatedPG.h:262
#11 0x000000000131a0dc in GenContext<ThreadPool::TPHandle&>::complete (this=0x3d35420, t=...) at ./include/Context.h:45
#12 0x00000000013004e8 in GenContextWQ::_process (this=0x38ed900, c=0x3d35420, tp=...) at ./common/WorkQueue.h:421
#13 0x00000000013a092e in ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>*, GenContext<ThreadPool::TPHandle&>*>::_void_process (this=0x38ed900, handle=...) at ./common/WorkQueue.h:191
#14 0x00000000018da69a in ThreadPool::worker (this=0x38ec730, wt=0x38a4d30) at common/WorkQueue.cc:128
#15 0x00000000018de68b in ThreadPool::WorkThread::entry (this=0x38a4d30) at common/WorkQueue.h:318
#16 0x00000000018d2e0b in Thread::entry_wrapper (this=0x38a4d30) at common/Thread.cc:60
#17 0x00000000018d2d88 in Thread::_entry_func (arg=0x38a4d30) at common/Thread.cc:45
#18 0x00007ffc8c1b3182 in start_thread (arg=0x7ffc6c7a9700) at pthread_create.c:312
#19 0x00007ffc8a7a130d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Actions #15

Updated by Loïc Dachary over 9 years ago

Even simpler, and it does not require root privileges:

rm -f core
ulimit -c 300000
./stop.sh
rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
ceph pg scrub 0.7
sleep 60
ceph osd tree

Actions #16

Updated by Loïc Dachary over 9 years ago

For the problem to show, the file being removed has to be on the primary.

Actions #17

Updated by Loïc Dachary over 9 years ago

The problem does not show if we wait after the object is inserted: it is a race condition.

rm -f core
ulimit -c 300000
./stop.sh
rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
# sleep 60 # comment this out and the problem does not show
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
ceph pg scrub 0.7
sleep 60
ceph osd tree

Actions #19

Updated by Loïc Dachary over 9 years ago

The stack trace created by the minimal script is different from the one reported above, but it fails at the same point.

#0  0x00007ffce2d7520b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000000000179ecbc in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  0x000000000179ef8f in handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x000000000159cd71 in ReplicatedBackend::build_push_op (this=0x4872780, recovery_info=..., progress=..., out_progress=0x7ffcc3361270, out_op=0x4846d00, stat=0x48989c8) at osd/ReplicatedPG.cc:8760
#5  0x00000000015991d5 in ReplicatedBackend::prep_push (this=0x4872780, obc=..., soid=..., peer=..., version=..., data_subset=..., clone_subsets=..., pop=0x4846d00) at osd/ReplicatedPG.cc:8369
#6  0x0000000001598d3e in ReplicatedBackend::prep_push_to_replica (this=0x4872780, obc=..., soid=..., peer=..., pop=0x4846d00) at osd/ReplicatedPG.cc:8324
#7  0x00000000015aa9e7 in ReplicatedBackend::start_pushes (this=0x4872780, soid=..., obc=..., h=0x4bae5a0) at osd/ReplicatedPG.cc:10202
#8  0x00000000017019e6 in ReplicatedBackend::recover_object (this=0x4872780, hoid=..., v=..., head=..., obc=..., _h=0x4bae5a0) at osd/ReplicatedBackend.cc:75
#9  0x00000000015aa5d5 in ReplicatedPG::prep_object_replica_pushes (this=0x497c800, soid=..., v=..., h=0x4bae5a0) at osd/ReplicatedPG.cc:10175
#10 0x00000000015abc1b in ReplicatedPG::recover_replicas (this=0x497c800, max=5, handle=...) at osd/ReplicatedPG.cc:10278
#11 0x00000000015a6f0c in ReplicatedPG::start_recovery_ops (this=0x497c800, max=5, prctx=0x7ffcc3362860, handle=..., ops_started=0x7ffcc33627b8) at osd/ReplicatedPG.cc:9851
#12 0x00000000012ed9d8 in OSD::do_recovery (this=0x48e6000, pg=0x497c800, handle=...) at osd/OSD.cc:7892
#13 0x000000000130cd50 in OSD::RecoveryWQ::_process (this=0x48e7438, pg=0x497c800, handle=...) at osd/OSD.h:2045
#14 0x000000000139f2df in ThreadPool::WorkQueue<PG>::_void_process (this=0x48e7438, p=0x497c800, handle=...) at ./common/WorkQueue.h:256
#15 0x00000000018da69a in ThreadPool::worker (this=0x48e6730, wt=0x48dd540) at common/WorkQueue.cc:128
#16 0x00000000018de68b in ThreadPool::WorkThread::entry (this=0x48dd540) at common/WorkQueue.h:318
#17 0x00000000018d2e0b in Thread::entry_wrapper (this=0x48dd540) at common/Thread.cc:60
#18 0x00000000018d2d88 in Thread::_entry_func (arg=0x48dd540) at common/Thread.cc:45
#19 0x00007ffce2d6d182 in start_thread (arg=0x7ffcc3363700) at pthread_create.c:312
#20 0x00007ffce135b30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Actions #20

Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)
Actions #21

Updated by Loïc Dachary over 9 years ago

  • Description updated (diff)
Actions #22

Updated by Loïc Dachary over 9 years ago

<sjusthm> the next step is to write a failing teuthology test to reproduce
<sjusthm> and then adapt build_push_op to do what scrub does when it discovers an inconsistent object
<sjusthm> ideally, you want to make that path fairly general so we can hook any other places where the osd sees something inconsistent into the same thing
<sjusthm> the basic concept is you will add the object to the primary's missing set
<sjusthm> and then run recovery on it
<sjusthm> making sure to update the missing_loc machinery
<sjusthm> you can look at the repair code for how that currently works
<sjusthm> after that, we will want to give the replicas a way to propagate "oh crap, I don't have this object I'm supposed to have" back to the primary where the primary can then mark its local copy of the replica missing set to be missing that object and run recovery
<sjusthm> the long term goal is for the primary and replicas whenever they find something obviously wrong with an object to be able to behave as if the inconsistency was discovered through scrub/repair
<sjusthm> instead of just crashing
<sjusthm> the path should also spam the central log since it might be an indication of a flaky disk
<sjusthm> the bug here is that build_push_op can't deal with an object which, per the missing set, should be there but isn't there
<sjusthm> the missing set does not contain that object
<sjusthm> so the recovery machinery chose to simply push it from the primary to the replica
<sjusthm> and build_push_op goes to look for it
<sjusthm> the ObjectStore doesn't find it and gives back a NULL iterator
<sjusthm> so that path needs to handle an erroneously missing object (which would only happen if someone is maliciously messing with the ondisk store or if the filesystem is messed up)
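
Purely as an illustration of the flow described above (toy containers, not the real pg_missing_t/missing_loc machinery): when a push trips over a locally vanished object, add it to the primary's missing set and let a recovery pass pull it back from a replica.

#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
  // Toy model: object name -> payload, per OSD.
  std::map<std::string, std::string> primary, replica;
  replica["foo"] = "payload";            // the replica copy is intact
  std::set<std::string> primary_missing; // stands in for the missing set

  // Push path: the primary discovers its copy of "foo" is gone (the file was
  // removed underneath it), so behave as scrub/repair would: mark it missing
  // and schedule recovery instead of crashing.
  if (primary.count("foo") == 0) {
    primary_missing.insert("foo");
    std::cout << "foo marked missing on primary, recovery scheduled\n";
  }

  // Recovery pass: pull each missing object from a replica that has it
  // (the missing_loc bookkeeping stays implicit in this toy model).
  for (const auto& oid : primary_missing) {
    auto it = replica.find(oid);
    if (it != replica.end())
      primary[oid] = it->second;
  }
  primary_missing.clear();

  std::cout << "primary has foo again: " << primary.count("foo") << "\n";
  return 0;
}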
Actions #23

Updated by Loïc Dachary over 9 years ago

When scrub scans objects, it updates the corresponding entry in the map given as an argument.

The result is used by the next state of the scrub process.

When build_push_op unexpectedly finds a non-existent object while recovering, the context is quite different. Since the scrub process builds its own set of information from scratch whenever it is run, it is unclear how build_push_op could contribute back the information: "hey, I discovered this object is missing!"
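
To make the contrast concrete, here is a toy version of that shape (hypothetical names, not the real scrub code): the scrub-style scan fills in an entry of a results map supplied by its caller, which is exactly the context build_push_op does not have when it stumbles on a missing object in the middle of recovery.

#include <map>
#include <string>

// Toy stand-in for the per-object result a scrub scan accumulates.
struct ScrubEntry {
  bool missing = false;
};

// Scrub-style scan: updates the corresponding entry in the map given as an
// argument; the next state of the scrub process consumes that map.
void scan_object(const std::string& oid,
                 const std::map<std::string, std::string>& on_disk,
                 std::map<std::string, ScrubEntry>& results) {
  results[oid].missing = (on_disk.count(oid) == 0);
}

int main() {
  std::map<std::string, std::string> on_disk;  // the object file was rm'ed
  std::map<std::string, ScrubEntry> results;   // owned by the scrub machinery
  scan_object("SOMETHING", on_disk, results);
  // build_push_op runs during recovery with no such results map in scope,
  // so it has no obvious place to report "this object is missing".
  return results["SOMETHING"].missing ? 0 : 1;
}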

Actions #24

Updated by Loïc Dachary over 9 years ago

An idea, which is not a fix but may help. When the following sequence is done:

./stop.sh
rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
ceph pg repair 0.7 

It will pull a copy of the missing object from a replica. However, because the FDCache already has the missing object open, it will write to the (now unlinked) former object instead of creating a new one.
The on_local_recover_start method could add to the transaction an operation that checks whether the open file descriptor is still associated with an actual file, or just re-open it to avoid the problem, assuming that using the FDCache in this context does not matter.
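
One possible shape for that check (a standalone POSIX sketch, not Ceph's FDCache code): fstat() the already-open descriptor and look at st_nlink, which drops to 0 once the file has been unlinked behind the cache's back.

#include <cstdio>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

// True if the open descriptor still refers to a linked file; false once the
// file has been unlinked (st_nlink == 0) or fstat() itself fails.
static bool fd_still_linked(int fd) {
  struct stat st;
  if (fstat(fd, &st) < 0)
    return false;
  return st.st_nlink > 0;
}

int main() {
  int fd = open("/tmp/fdcache-demo", O_CREAT | O_RDWR, 0600);
  if (fd < 0)
    return 1;
  unlink("/tmp/fdcache-demo");  // simulate the object file being removed
  // A cached descriptor like this one still accepts writes, but they land in
  // an unlinked inode; detecting the situation (or simply re-opening) avoids
  // writing to the now-unlinked former object described above.
  std::printf("fd still linked: %s\n", fd_still_linked(fd) ? "yes" : "no");
  close(fd);
  return 0;
}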

Actions #25

Updated by Loïc Dachary over 9 years ago

A tentative solution to a related problem, although it does not address the main issue: https://github.com/ceph/ceph/pull/2379. Low-hanging fruit to make a little progress ;-)

Actions #26

Updated by Loïc Dachary over 9 years ago

remove_snap_mapped_object removes the object but apparently does not clear the FDCache because the underlying file does not exist. This should be fixed.

The scrub logic should be abstracted so that information can be contributed to it from various contexts, not just when scrubbing.

Actions #27

Updated by Samuel Just over 9 years ago

  • Target version changed from 0.86 to 0.88
Actions #28

Updated by Loïc Dachary over 9 years ago

  • Status changed from In Progress to Fix Under Review
  • % Done changed from 0 to 80
Actions #29

Updated by Loïc Dachary over 9 years ago

The proposed fix greatly reduces the window of opportunity for this problem to happen. It can, however, still happen, and another issue (#9328) was created for the more general solution.

Actions #30

Updated by Loïc Dachary over 9 years ago

Running the rados thrasher on the proposed fix branch.

Actions #31

Updated by Loïc Dachary over 9 years ago

The following teuthology job passes:

- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    op_weights:
      read: 100
      write: 0
      append: 100
      delete: 50
      snap_create: 50
      snap_remove: 50
      rollback: 50
      copy_from: 50
      setattr: 25
      rmattr: 25
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    op_weights:
      read: 100
      write: 0
      append: 100
      delete: 50
      snap_create: 50
      snap_remove: 50
      rollback: 50
      copy_from: 50
      setattr: 25
      rmattr: 25

Actions #32

Updated by Loïc Dachary over 9 years ago

  • Status changed from Fix Under Review to Resolved
  • % Done changed from 80 to 100