Fix #8914
osd crashed at assert ReplicatedBackend::build_push_op
% Done: 100%
Description
Steps to reproduce
./stop.sh
rm -fr dev out ; mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
# sleep 60 # comment this out and the problem does not show
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
It crashes in build_push_op because get_omap_iterator returned a null iterator after the file was removed.
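As an illustration of the crash and of the kind of check discussed further down, here is a minimal, self-contained sketch; OmapIterator and get_omap_iterator below are stand-in types, not Ceph's actual ObjectStore/ObjectMap API:
@
// Stand-in sketch: the store returns an empty shared_ptr when the object's
// file is gone, and the original code dereferenced it without checking.
#include <iostream>
#include <memory>

struct OmapIterator {                                // stand-in iterator pointee
  bool valid() const { return false; }
};
using OmapIteratorRef = std::shared_ptr<OmapIterator>;

// Stand-in for an object-store lookup: returns a null ref when the file is missing.
OmapIteratorRef get_omap_iterator(bool object_exists) {
  return object_exists ? std::make_shared<OmapIterator>() : OmapIteratorRef();
}

int main() {
  OmapIteratorRef iter = get_omap_iterator(false);   // file removed under the OSD
  // build_push_op dereferenced the iterator here without a check -> SIGSEGV.
  // A defensive caller either asserts (clean crash) or returns an error instead:
  if (!iter) {
    std::cerr << "object unexpectedly missing from the object store" << std::endl;
    return 1;
  }
  return iter->valid() ? 0 : 1;
}
@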
Original description
OSD crashed with an assert in ReplicatedBackend::build_push_op.
Steps Followed:
sudo ceph pg map 3.151
osdmap e1274 pg 3.151 (3.151) -> up [2,9,20] acting [2,9,20]
I removed object file1 (which had been inserted using rados) with rm -f on /var/lib/ceph/osd/ceph-9/current/3.151/file1* and /var/lib/ceph/osd/ceph-2/current/3.151/file1*.
Checked for scrub errors using:
ceph pg scrub 3.151
ceph -w showed scrub errors on 2 and 9
Ran command:
ceph osd repair 2
Got Seg fault in osd.2:
2014-06-16 10:33:19.906324 7fb9e9543700 0 log [ERR] : 3.151 shard 2 missing a086551/file1/head//3
2014-06-16 10:33:19.906330 7fb9e9543700 0 log [ERR] : 3.151 shard 9 missing a086551/file1/head//3
2014-06-16 10:33:19.906362 7fb9e9543700 0 log [ERR] : 3.151 repair 1 missing, 0 inconsistent objects
2014-06-16 10:33:19.906378 7fb9e9543700 0 log [ERR] : 3.151 repair 2 errors, 2 fixed
2014-06-16 10:33:19.924977 7fb9e9d44700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb9e9d44700
1: /usr/bin/ceph-osd() [0x974a1f]
2: (()+0x10340) [0x7fba089de340]
3: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0xc1c) [0x7d209c]
4: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x3d8) [0x7d2b48]
5: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0x3af) [0x7d6b8f]
6: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x1af) [0x7d9c6f]
7: (C_ReplicatedBackend_OnPullComplete::finish(ThreadPool::TPHandle&)+0x143) [0x84b083]
8: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x9) [0x661a09]
9: (ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish(ThreadPool::TPHandle&)+0x95) [0x824f65]
10: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x9) [0x661a09]
11: (ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>, GenContext<ThreadPool::TPHandle&>>::_void_process(void*, ThreadPool::TPHandle&)+0x62) [0x6697d2]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa4b351]
13: (ThreadPool::WorkThread::entry()+0x10) [0xa4c440]
14: (()+0x8182) [0x7fba089d6182]
15: (clone()+0x6d) [0x7fba06d7730d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Associated revisions
os: FileStore::lfn_unlink always clears FDCache
Otherwise the FDCache will keep a file descriptor to a file that was
removed from the file system. This may create various types of errors
because an OSD checking the FDCache will assume that the file containing
the information for an object exists although it does not. For instance, in
the following:
- rados put object file
- rm file from the primary
- repair the pg to which the object is mapped
if the FDCache is not cleared, repair will incorrectly pull a copy from
a replica and write it to the now unlinked file. Later on, it will
assume the file exists on the primary and only be partially correct:
the data can still be accessed via the file descriptor, but any operation
using the path name will fail.
http://tracker.ceph.com/issues/8914 Fixes: #8914
Signed-off-by: Loic Dachary <loic-201408@dachary.org>
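For illustration only, here is a small demo of the behaviour the commit message describes, written with plain POSIX calls rather than FileStore code (the temp path is made up): a cached descriptor keeps "working" after the file is removed while every path-based operation fails.
@
// Hedged demo of a stale cached file descriptor: writes through the fd still
// succeed after unlink(), but the data is unreachable through the path name.
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  const char* path = "/tmp/stale-fd-demo";                // hypothetical path
  int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);  // the "FDCache" entry
  assert(fd >= 0);

  unlink(path);                      // file removed from the filesystem

  // Writing through the cached fd still succeeds ...
  const char data[] = "recovered object data";
  assert(write(fd, data, sizeof(data)) == (ssize_t)sizeof(data));

  // ... but any path-based access fails: the recovered data is effectively lost.
  struct stat st;
  assert(stat(path, &st) == -1 && errno == ENOENT);

  close(fd);
  return 0;
}
@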
History
#1 Updated by Dhiraj Kamble over 8 years ago
This SEGV is caused by a NULL omap iterator.
This can happen if the object belonging to the primary OSD is missing from the underlying filesystem or if the underlying filesystem itself is corrupted.
In this case a lookup (::stat()) on the object fails and returns ENOENT, which causes get_omap_iterator to return a NULL iterator.
Fix:
Assert if the object map iterator returned is NULL, so that I/Os can be served from other OSDs.
#2 Updated by Dhiraj Kamble over 8 years ago
Dhiraj Kamble wrote:
This SEGV is caused by a NULL omap iterator.
This can happen if the object belonging to the primary OSD is missing from the underlying filesystem or if the underlying filesystem itself is corrupted. In this case a lookup (::stat()) on the object fails and returns ENOENT, which causes get_omap_iterator to return a NULL iterator.
Fix:
Assert if the object map iterator returned is NULL, so that I/Os can be served from other OSDs.
#3 Updated by Greg Farnum over 8 years ago
- Tracker changed from Bug to Fix
- Assignee deleted (Greg Farnum)
- Priority changed from High to Normal
- Source changed from other to Community (dev)
Did you mean to add some comments on that second post, Dhiraj?
Anyway, this looks like a kind of disk corruption which we can handle more gracefully, but it'll need to get scheduled. It looks like we just need to pay attention to some return codes in build_push_op(), but if you could upload a full crash log that would be more certain, Sahana.
#4 Updated by Dhiraj Kamble over 8 years ago
Hi Greg,
No, I did not intend to add any comments.
The reason I thought we should assert is so that we can serve I/Os from other OSDs in case the object itself has been explicitly removed or is corrupted.
Also, I see that we expect build_push_op to return success in ReplicatedBackend::prep_push; otherwise we assert anyway.
@
void ReplicatedBackend::prep_push(
  ...
  ...
  int r = build_push_op(pi.recovery_info,
                        pi.recovery_progress,
                        &new_progress,
                        pop,
                        &(pi.stat));
  assert(r == 0);   // <<< expect success; else we assert.
  pi.recovery_progress = new_progress;
}
@
So I thought it would be good to assert if the omap iterator is NULL in build_push_op, or do you think it would be better to reset the recovery info and then assert? Anyway, I don't see any changes that we could make on a failure return value from build_push_op.
Let me know your thoughts and how we can handle this gracefully.
cheers,
Dhiraj
#5 Updated by Dhiraj Kamble over 8 years ago
BTW, the steps to reproduce this issue are mentioned by Sahana above, and it can be reproduced on a single node too.
For a given object, remove its corresponding file from the filesystem of the primary OSD (../primary_osd/current/pg_dir/).
Initiate a scrub on the PG to which the object belongs.
Initiate a repair on the primary OSD.
You'll hit the assert.
#6 Updated by Samuel Just over 8 years ago
- Assignee set to Samuel Just
- Priority changed from Normal to High
#7 Updated by Samuel Just over 8 years ago
Hmm, most likely a bug in repair. We should start by creating a teuthology task which reproduces the bug. Once we have that, we can fix it and add the test to the test suite.
#8 Updated by Ian Colle over 8 years ago
- Target version set to 0.86
#9 Updated by Loïc Dachary over 8 years ago
- Assignee changed from Samuel Just to Loïc Dachary
#10 Updated by Loïc Dachary over 8 years ago
- Status changed from New to Need More Info
I'm not able to reproduce the problem on ceph version 0.84-343-g92b227e (92b227e1c0b1533c359e74c81de58140b483ee8e). Could you please confirm that you can and let me know which version of ceph you are using?
$ rm -fr dev out ; mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
...
$ ./rados --pool rbd put SOMETHING /etc/group
$ ceph osd map rbd SOMETHING
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osdmap e9 pool 'rbd' (0) object 'SOMETHING' -> pg 0.847441d7 (0.7) -> up ([1,0,2], p1) acting ([1,0,2], p1)
$ rm dev/osd1/current/0.7_head/SOMETHING__head_847441D7__0
$ ceph pg scrub 0.7
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
instructing pg 0.7 on osd.1 to scrub
$ ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster d074b6b4-2163-4a3a-b8fe-efdc44a57d97
     health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors; too few pgs per osd (2 < min 10); mon.a low disk space
     monmap e1: 1 mons at {a=127.0.0.1:6789/0}, election epoch 2, quorum 0 a
     osdmap e9: 3 osds: 3 up, 3 in
      pgmap v15: 8 pgs, 1 pools, 1190 bytes data, 1 objects
            415 GB used, 131 GB / 547 GB avail
                   7 active+clean
                   1 active+clean+inconsistent
$ ceph osd repair 1
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.1 instructed to repair
$ ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
    cluster d074b6b4-2163-4a3a-b8fe-efdc44a57d97
     health HEALTH_WARN too few pgs per osd (2 < min 10); mon.a low disk space
     monmap e1: 1 mons at {a=127.0.0.1:6789/0}, election epoch 2, quorum 0 a
     osdmap e9: 3 osds: 3 up, 3 in
      pgmap v18: 8 pgs, 1 pools, 1190 bytes data, 1 objects
            415 GB used, 131 GB / 547 GB avail
                   8 active+clean
$ ceph osd tree
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
# id  weight  type name      up/down  reweight
-1    3       root default
-2    3         host fold
0     1           osd.0      up       1
1     1           osd.1      up       1
2     1           osd.2      up       1
$ grep -F '[ERR] 0.7' out/osd.1.log
2014-08-19 23:10:31.476732 7f5697ee3700 10 will send 2014-08-19 23:10:24.514431 osd.1 127.0.0.1:6810/21923 1 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:10:31.476845 7f5697ee3700 10 will send 2014-08-19 23:10:24.514649 osd.1 127.0.0.1:6810/21923 2 : [ERR] 0.7 scrub 1 missing, 0 inconsistent objects
2014-08-19 23:10:31.476908 7f5697ee3700 10 will send 2014-08-19 23:10:24.514657 osd.1 127.0.0.1:6810/21923 3 : [ERR] 0.7 scrub 1 errors
2014-08-19 23:10:31.594350 7f56a0c44700 10 logged 2014-08-19 23:10:24.514431 osd.1 127.0.0.1:6810/21923 1 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:10:31.594376 7f56a0c44700 10 logged 2014-08-19 23:10:24.514649 osd.1 127.0.0.1:6810/21923 2 : [ERR] 0.7 scrub 1 missing, 0 inconsistent objects
2014-08-19 23:10:31.594394 7f56a0c44700 10 logged 2014-08-19 23:10:24.514657 osd.1 127.0.0.1:6810/21923 3 : [ERR] 0.7 scrub 1 errors
2014-08-19 23:11:11.478549 7f5697ee3700 10 will send 2014-08-19 23:11:01.531118 osd.1 127.0.0.1:6810/21923 4 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:11:11.478723 7f5697ee3700 10 will send 2014-08-19 23:11:01.531379 osd.1 127.0.0.1:6810/21923 5 : [ERR] 0.7 repair 1 missing, 0 inconsistent objects
2014-08-19 23:11:11.478779 7f5697ee3700 10 will send 2014-08-19 23:11:01.531481 osd.1 127.0.0.1:6810/21923 6 : [ERR] 0.7 repair 1 errors, 1 fixed
2014-08-19 23:11:11.548100 7f56a0c44700 10 logged 2014-08-19 23:11:01.531118 osd.1 127.0.0.1:6810/21923 4 : [ERR] 0.7 shard 1 missing 847441d7/SOMETHING/head//0
2014-08-19 23:11:11.548131 7f56a0c44700 10 logged 2014-08-19 23:11:01.531379 osd.1 127.0.0.1:6810/21923 5 : [ERR] 0.7 repair 1 missing, 0 inconsistent objects
2014-08-19 23:11:11.548151 7f56a0c44700 10 logged 2014-08-19 23:11:01.531481 osd.1 127.0.0.1:6810/21923 6 : [ERR] 0.7 repair 1 errors, 1 fixed
$
#11 Updated by Dhiraj Kamble over 8 years ago
Hi Loic,
please find below the steps to reproduce the issue.
@
#:/build/ceph-firefly84/src# ./ceph -v
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
@
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
# id  weight  type name          up/down  reweight
-1    3       root default
-3    3         rack localrack
-2    3           host localhost
0     1             osd.0        up       1
1     1             osd.1        up       1
2     1             osd.2        up       1
Steps:
@
#:/build/ceph-firefly84/src# ./rbd create img --size 1024 --image-format 2
#:/build/ceph-firefly84/src# ./rbd map img
/dev/rbd1
#:/build/ceph-firefly84/src# ./rados lspools
rbd
#:/build/ceph-firefly84/src# dd if=/dev/urandom of=/dev/rbd1 bs=4M oflag=direct count=1
1+0 records in
1+0 records out
4194304 bytes (4.2 MB) copied, 0.457051 s, 9.2 MB/s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
dumped all in format plain
0.20 1 0 0 0 0 4194304 8 8 active+clean 2014-08-21 15:14:21.859898 11'8 11:103 [1,2,0] 1 [1,2,0] 1 0'0 2014-08-21 15:13:53.645589 0'0 2014-08-21 15:13:53.645589
#:/build/ceph-firefly84/src# ls ./dev/osd0/current/0.20_head/
rbd\udata.100d6b8b4567.0000000000000000__head_567EC760__0
#:/build/ceph-firefly84/src# ls ./dev/osd1/current/0.20_head/
rbd\udata.100d6b8b4567.0000000000000000__head_567EC760__0
#:/build/ceph-firefly84/src# ls ./dev/osd2/current/0.20_head/
rbd\udata.100d6b8b4567.0000000000000000__head_567EC760__0
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osdmap e11 pg 0.20 (0.20) -> up [1,2,0] acting [1,2,0]
#:/build/ceph-firefly84/src# rm ./dev/osd1/current/0.20_head/rbd\\udata.100d6b8b4567.0000000000000000__head_567EC760__0
#:/build/ceph-firefly84/src# rm ./dev/osd2/current/0.20_head/rbd\\udata.100d6b8b4567.0000000000000000__head_567EC760__0
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
osd.1 instructed to repair
@
Primary osd - osd.1.log:
@
2014-08-21 15:32:20.483804 7fa6a6847700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fa6a6847700
ceph version 0.84-372-gb0aa846 (b0aa846b3f81225a779de00100e15334fb8156b3)
1: (ceph::BackTrace::BackTrace(int)+0x2d) [0x1676821]
2: ./ceph-osd() [0x179eab2]
3: (()+0x10340) [0x7fa6c51d0340]
4: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0x5ff) [0x159cce3]
5: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x1a5) [0x1599147]
6: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0xc6a) [0x1598cb0]
7: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x2c1) [0x15aa959]
8: (C_ReplicatedBackend_OnPullComplete::finish(ThreadPool::TPHandle&)+0x103) [0x15d3ddf]
9: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x2a) [0x131a0c8]
10: (ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish(ThreadPool::TPHandle&)+0xa7) [0x1626677]
11: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x2a) [0x131a0c8]
12: (GenContextWQ::_process(GenContext<ThreadPool::TPHandle&>*, ThreadPool::TPHandle&)+0x32) [0x13004f8]
13: (ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>*, GenContext<ThreadPool::TPHandle&>*>::_void_process(void*, ThreadPool::TPHandle&)+0xb2) [0x13a091a]
14: (ThreadPool::worker(ThreadPool::WorkThread*)+0x734) [0x18da2d6]
15: (ThreadPool::WorkThread::entry()+0x23) [0x18de2c7]
16: (Thread::entry_wrapper()+0x79) [0x18d2a47]
17: (Thread::_entry_func(void*)+0x18) [0x18d29c4]
18: (()+0x8182) [0x7fa6c51c8182]
19: (clone()+0x6d) [0x7fa6c37b630d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
@
#12 Updated by Dhiraj Kamble over 8 years ago
Missed a step to mention.
Before I did a repair on the primary OSD, I also did a scrub:
#:/build/ceph-firefly84/src# ./ceph pg scrub 0.20
#:/build/ceph-firefly84/src# ./ceph osd repair 1
Note: I deleted the files from the primary as well as another replica (OSDs 1 & 2).
#13 Updated by Loïc Dachary over 8 years ago
- Status changed from Need More Info to In Progress
Thanks for the update, will try again :-)
#14 Updated by Loïc Dachary over 8 years ago
The following reproduces it reliably on my laptop:
ulimit -c 300000
./stop.sh
rm -fr dev out ; mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rbd create img --size 1024 --image-format 2
sudo ./rbd map img
sudo dd if=/dev/urandom of=/dev/rbd1 bs=4M oflag=direct count=1
ceph osd tree
rm dev/osd[01]/current/*/*rbd*data*
ceph pg scrub 0.2
ceph osd repair 0
sleep 60
ceph osd tree
and gdb on the core shows the following backtrace
#1  0x000000000179ecbc in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  0x000000000179ef8f in handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x000000000159cd71 in ReplicatedBackend::build_push_op (this=0x3878b40, recovery_info=..., progress=..., out_progress=0x7ffc6c7a8100, out_op=0x39c0000, stat=0x389f448) at osd/ReplicatedPG.cc:8760
#5  0x00000000015991d5 in ReplicatedBackend::prep_push (this=0x3878b40, obc=..., soid=..., peer=..., version=..., data_subset=..., clone_subsets=..., pop=0x39c0000) at osd/ReplicatedPG.cc:8369
#6  0x0000000001598d3e in ReplicatedBackend::prep_push_to_replica (this=0x3878b40, obc=..., soid=..., peer=..., pop=0x39c0000) at osd/ReplicatedPG.cc:8324
#7  0x00000000015aa9e7 in ReplicatedBackend::start_pushes (this=0x3878b40, soid=..., obc=..., h=0x3b17bf0) at osd/ReplicatedPG.cc:10202
#8  0x00000000015d3e31 in C_ReplicatedBackend_OnPullComplete::finish (this=0x39e3bd0, handle=...) at osd/ReplicatedPG.cc:2218
#9  0x000000000131a0dc in GenContext<ThreadPool::TPHandle&>::complete (this=0x39e3bd0, t=...) at ./include/Context.h:45
#10 0x00000000016266c9 in ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish (this=0x3d35420, t=...) at osd/ReplicatedPG.h:262
#11 0x000000000131a0dc in GenContext<ThreadPool::TPHandle&>::complete (this=0x3d35420, t=...) at ./include/Context.h:45
#12 0x00000000013004e8 in GenContextWQ::_process (this=0x38ed900, c=0x3d35420, tp=...) at ./common/WorkQueue.h:421
#13 0x00000000013a092e in ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>*, GenContext<ThreadPool::TPHandle&>*>::_void_process (this=0x38ed900, handle=...) at ./common/WorkQueue.h:191
#14 0x00000000018da69a in ThreadPool::worker (this=0x38ec730, wt=0x38a4d30) at common/WorkQueue.cc:128
#15 0x00000000018de68b in ThreadPool::WorkThread::entry (this=0x38a4d30) at common/WorkQueue.h:318
#16 0x00000000018d2e0b in Thread::entry_wrapper (this=0x38a4d30) at common/Thread.cc:60
#17 0x00000000018d2d88 in Thread::_entry_func (arg=0x38a4d30) at common/Thread.cc:45
#18 0x00007ffc8c1b3182 in start_thread (arg=0x7ffc6c7a9700) at pthread_create.c:312
#19 0x00007ffc8a7a130d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
#15 Updated by Loïc Dachary over 8 years ago
Even simpler, and it does not require root privileges:
rm -f core
ulimit -c 300000
./stop.sh
rm -fr dev out ; mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
ceph pg scrub 0.7
sleep 60
ceph osd tree
#16 Updated by Loïc Dachary over 8 years ago
For the problem to show, the file being removed has to be on the primary.
#17 Updated by Loïc Dachary over 8 years ago
The problem does not show if we wait after the object is inserted; it is a race condition.
rm -f core
ulimit -c 300000
./stop.sh
rm -fr dev out ; mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
# sleep 60 # comment this out and the problem does not show
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
ceph pg scrub 0.7
sleep 60
ceph osd tree
#18 Updated by Loïc Dachary over 8 years ago
- File GOOD-1-removed-primary.txt added
- File BAD-removed-primary.txt added
#19 Updated by Loïc Dachary over 8 years ago
The stack trace created by the minimal script is different from the one reported above, but it fails at the same point.
#0  0x00007ffce2d7520b in raise (sig=11) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000000000179ecbc in reraise_fatal (signum=11) at global/signal_handler.cc:59
#2  0x000000000179ef8f in handle_fatal_signal (signum=11) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x000000000159cd71 in ReplicatedBackend::build_push_op (this=0x4872780, recovery_info=..., progress=..., out_progress=0x7ffcc3361270, out_op=0x4846d00, stat=0x48989c8) at osd/ReplicatedPG.cc:8760
#5  0x00000000015991d5 in ReplicatedBackend::prep_push (this=0x4872780, obc=..., soid=..., peer=..., version=..., data_subset=..., clone_subsets=..., pop=0x4846d00) at osd/ReplicatedPG.cc:8369
#6  0x0000000001598d3e in ReplicatedBackend::prep_push_to_replica (this=0x4872780, obc=..., soid=..., peer=..., pop=0x4846d00) at osd/ReplicatedPG.cc:8324
#7  0x00000000015aa9e7 in ReplicatedBackend::start_pushes (this=0x4872780, soid=..., obc=..., h=0x4bae5a0) at osd/ReplicatedPG.cc:10202
#8  0x00000000017019e6 in ReplicatedBackend::recover_object (this=0x4872780, hoid=..., v=..., head=..., obc=..., _h=0x4bae5a0) at osd/ReplicatedBackend.cc:75
#9  0x00000000015aa5d5 in ReplicatedPG::prep_object_replica_pushes (this=0x497c800, soid=..., v=..., h=0x4bae5a0) at osd/ReplicatedPG.cc:10175
#10 0x00000000015abc1b in ReplicatedPG::recover_replicas (this=0x497c800, max=5, handle=...) at osd/ReplicatedPG.cc:10278
#11 0x00000000015a6f0c in ReplicatedPG::start_recovery_ops (this=0x497c800, max=5, prctx=0x7ffcc3362860, handle=..., ops_started=0x7ffcc33627b8) at osd/ReplicatedPG.cc:9851
#12 0x00000000012ed9d8 in OSD::do_recovery (this=0x48e6000, pg=0x497c800, handle=...) at osd/OSD.cc:7892
#13 0x000000000130cd50 in OSD::RecoveryWQ::_process (this=0x48e7438, pg=0x497c800, handle=...) at osd/OSD.h:2045
#14 0x000000000139f2df in ThreadPool::WorkQueue<PG>::_void_process (this=0x48e7438, p=0x497c800, handle=...) at ./common/WorkQueue.h:256
#15 0x00000000018da69a in ThreadPool::worker (this=0x48e6730, wt=0x48dd540) at common/WorkQueue.cc:128
#16 0x00000000018de68b in ThreadPool::WorkThread::entry (this=0x48dd540) at common/WorkQueue.h:318
#17 0x00000000018d2e0b in Thread::entry_wrapper (this=0x48dd540) at common/Thread.cc:60
#18 0x00000000018d2d88 in Thread::_entry_func (arg=0x48dd540) at common/Thread.cc:45
#19 0x00007ffce2d6d182 in start_thread (arg=0x7ffcc3363700) at pthread_create.c:312
#20 0x00007ffce135b30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
#20 Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
#21 Updated by Loïc Dachary over 8 years ago
- Description updated (diff)
#22 Updated by Loïc Dachary over 8 years ago
<sjusthm> the next step is to write a failing teuthology test to reproduce
<sjusthm> and then adapt build_push_op to do what scrub does when it discovers an inconsistent object
<sjusthm> ideally, you want to make that path fairly general so we can hook any other places where the osd sees something inconsistent into the same thing
<sjusthm> the basic concept is you will add the object to the primary's missing set
<sjusthm> and then run recovery on it
<sjusthm> making sure to update the missing_loc machinery
<sjusthm> you can look at the repair code for how that currently works
<sjusthm> after that, we will want to give the replicas a way to propagate "oh crap, I don't have this object I'm supposed to have" back to the primary where the primary can then mark its local copy of the replica missing set to be missing that object and run recovery
<sjusthm> the long term goal is for the primary and replicas whenever they find something obviously wrong with an object to be able to behave as if the inconsistency was discovered through scrub/repair
<sjusthm> instead of just crashing
<sjusthm> the path should also spam the central log since it might be an indication of a flaky disk
<sjusthm> the bug here is that build_push_op can't deal with an object which the missing set should be there isn't there
<sjusthm> the missing set does not contain that object
<sjusthm> so the recovery machinery chose to simply push it from the primary to the replica
<sjusthm> and build_push_op goes to look for it
<sjusthm> the ObjectStore doesn't find it and gives back a NULL iterator
<sjusthm> so that path needs to handle an erroneously missing object (which would only happen if someone is maliciously messing with the ondisk store or if the filesystem is messed up)
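A hedged sketch of the flow described above; Hobject, MissingSet and missing_loc below are simplified stand-ins, not Ceph's real pg_missing_t/missing_loc structures:
@
// Instead of crashing, flag the unexpectedly absent object as missing and let
// the recovery machinery pull it again from a replica.
#include <cassert>
#include <iostream>
#include <map>
#include <set>
#include <string>

struct Hobject {
  std::string name;
  bool operator<(const Hobject& o) const { return name < o.name; }
};

struct MissingSet {                                  // stand-in for pg_missing_t
  std::set<Hobject> missing;
  void add(const Hobject& oid) { missing.insert(oid); }
  bool is_missing(const Hobject& oid) const { return missing.count(oid) > 0; }
};

struct PrimaryState {
  MissingSet local_missing;                          // primary's own missing set
  std::map<Hobject, std::set<int>> missing_loc;      // shards that still hold a copy
};

// Pretend object-store lookup: returns false when the file is gone from disk.
bool object_exists_on_disk(const Hobject&) { return false; }

// Sketch of a build_push_op that handles an erroneously missing object.
bool build_push_op_sketch(PrimaryState& pg, const Hobject& oid,
                          const std::set<int>& replicas_with_copy) {
  if (!object_exists_on_disk(oid)) {
    std::cerr << "ERR: " << oid.name << " unexpectedly missing on primary\n";
    pg.local_missing.add(oid);                       // behave as if repair had found it
    pg.missing_loc[oid] = replicas_with_copy;        // where a copy can be pulled from
    return false;                                    // caller requeues recovery, no push
  }
  // ... normal push-op construction would go here ...
  return true;
}

int main() {
  PrimaryState pg;
  Hobject oid{"SOMETHING"};
  bool pushed = build_push_op_sketch(pg, oid, {0, 2});
  assert(!pushed && pg.local_missing.is_missing(oid));
  return 0;
}
@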
#23 Updated by Loïc Dachary over 8 years ago
When scrub scans objects, it updates the corresponding entry in the map given as an argument.
The result is used by the next state of the scrub process.
When build_push_op unexpectedly finds a non-existent object while recovering, the context is quite different. Since the scrub process builds its own set of information from scratch whenever it is run, it is unclear how build_push_op could contribute the information back: "hey, I discovered this object is missing!"
#24 Updated by Loïc Dachary over 8 years ago
An idea, which is not a fix but may help. When the following sequence is done:
./stop.sh
rm -fr dev out ; mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary
ceph pg repair 0.7
It will pull a copy of the missing object from a replica. However, because the FDCache already has the missing object open, it will write to the (now unlinked) former object instead of creating a new one.
The on_local_recover_start method could add to the transaction an operation that checks whether the open file descriptor is still associated with an actual file. Or just re-open the file to avoid the problem, assuming that not going through the FDCache in this context does not matter.
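A minimal sketch of such a check, with plain POSIX calls rather than the FDCache/on_local_recover_start code (the temp path is made up): fstat() on the cached descriptor still succeeds after the file is unlinked, but st_nlink drops to zero, which is enough to detect that the descriptor needs to be re-opened.
@
// Detecting "the file behind this fd was unlinked" via the link count.
#include <cassert>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

bool fd_still_linked(int fd) {
  struct stat st;
  if (fstat(fd, &st) < 0)
    return false;
  return st.st_nlink > 0;          // 0 links: the path was removed, re-open/re-create
}

int main() {
  const char* path = "/tmp/fdcache-check-demo";      // hypothetical path
  int fd = open(path, O_CREAT | O_RDWR, 0600);       // simulate an FDCache entry
  assert(fd >= 0);
  assert(fd_still_linked(fd));

  unlink(path);                    // simulate "rm dev/osd1/current/.../SOMETHING..."
  assert(!fd_still_linked(fd));    // the cached fd now points at an orphaned inode

  close(fd);
  return 0;
}
@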
#25 Updated by Loïc Dachary over 8 years ago
Tentative solution to a related problem, although it does not address the main issue: https://github.com/ceph/ceph/pull/2379 . Low-hanging fruit to make a little progress ;-)
#26 Updated by Loïc Dachary over 8 years ago
remove_snap_mapped_object removes the object but apparently does not clear the FDCache because the underlying file does not exist. This should be fixed.
The scrub logic should be abstracted so that information can be contributed to it from various contexts, not just when scrubbing.
#27 Updated by Samuel Just over 8 years ago
- Target version changed from 0.86 to 0.88
#28 Updated by Loïc Dachary over 8 years ago
- Status changed from In Progress to Fix Under Review
- % Done changed from 0 to 80
#29 Updated by Loïc Dachary over 8 years ago
The proposed fix greatly reduces the window of opportunity for this problem to happen. It can, however, still happen, and another issue (#9328) was created for the more general solution.
#30 Updated by Loïc Dachary over 8 years ago
Running the rados thrasher on the proposed fix branch.
#31 Updated by Loïc Dachary over 8 years ago
The following teuthology job passes:
- thrashosds:
    chance_pgnum_grow: 1
    chance_pgpnum_fix: 1
    timeout: 1200
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    ec_pool: true
    op_weights:
      read: 100
      write: 0
      append: 100
      delete: 50
      snap_create: 50
      snap_remove: 50
      rollback: 50
      copy_from: 50
      setattr: 25
      rmattr: 25
- rados:
    clients: [client.0]
    ops: 4000
    objects: 500
    op_weights:
      read: 100
      write: 0
      append: 100
      delete: 50
      snap_create: 50
      snap_remove: 50
      rollback: 50
      copy_from: 50
      setattr: 25
      rmattr: 25
#32 Updated by Loïc Dachary over 8 years ago
- Status changed from Fix Under Review to Resolved
- % Done changed from 80 to 100