Fix #8914

closed

osd crashed at assert ReplicatedBackend::build_push_op

Added by Sahana Lokeshappa almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: High
Category: OSD
% Done: 100%
Source: Community (dev)

Description

Steps to reproduce

./stop.sh
rm -fr dev out ;  mkdir -p dev ; CEPH_NUM_MON=1 CEPH_NUM_OSD=3 ./vstart.sh -d -n -X -l mon osd
./rados --pool rbd put SOMETHING /etc/group
# sleep 60 # comment this out and the problem does not show
rm dev/osd1/current/*/*SOMETHING* # osd.1 is the primary

The OSD crashes in build_push_op because get_omap_iterator returns a null iterator once the object file has been removed.
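
The failure mode described above can be sketched in isolation. This is a minimal illustration, not Ceph's actual code: `ToyStore`, `OmapIterator`, and `build_push_op_sketch` are hypothetical names, and returning `-ENOENT` for a missing object is an assumed way of handling the case, not necessarily the fix that was applied for this issue.

```cpp
#include <cassert>
#include <cerrno>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration only; these are NOT Ceph's real types.
struct OmapIterator {
    std::map<std::string, std::string> entries;
};

// Toy object store: hands back a null iterator when the object is missing,
// mimicking the behaviour this report describes for get_omap_iterator.
struct ToyStore {
    std::map<std::string, std::map<std::string, std::string>> objects;

    std::shared_ptr<OmapIterator> get_omap_iterator(const std::string& oid) const {
        auto it = objects.find(oid);
        if (it == objects.end()) {
            return nullptr;  // the backing file was removed
        }
        return std::make_shared<OmapIterator>(OmapIterator{it->second});
    }
};

// Hypothetical push builder: returns the number of omap entries it would copy,
// or -ENOENT when the object is gone, instead of dereferencing a null iterator.
int build_push_op_sketch(const ToyStore& store, const std::string& oid) {
    std::shared_ptr<OmapIterator> iter = store.get_omap_iterator(oid);
    if (!iter) {
        return -ENOENT;  // assumed error handling; an unchecked dereference here would crash
    }
    return static_cast<int>(iter->entries.size());
}
```

Calling `build_push_op_sketch` on an object that exists returns its omap entry count; after the object is erased from `ToyStore::objects`, the same call returns `-ENOENT` rather than crashing.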

Original description

The OSD crashed with an assert in ReplicatedBackend::build_push_op.
Steps followed:

sudo ceph pg map 3.151
osdmap e1274 pg 3.151 (3.151) -> up [2,9,20] acting [2,9,20]
I removed the object file1 (which had been inserted using rados) by running rm -f on /var/lib/ceph/osd/ceph-9/current/3.151/file1* and /var/lib/ceph/osd/ceph-2/current/3.151/file1*.

Checked for scrub errors using:
ceph pg scrub 3.151
ceph -w showed scrub errors on OSDs 2 and 9.
Then ran:
ceph osd repair 2
Got a segmentation fault in osd.2:
2014-06-16 10:33:19.906324 7fb9e9543700 0 log [ERR] : 3.151 shard 2 missing a086551/file1/head//3
2014-06-16 10:33:19.906330 7fb9e9543700 0 log [ERR] : 3.151 shard 9 missing a086551/file1/head//3
2014-06-16 10:33:19.906362 7fb9e9543700 0 log [ERR] : 3.151 repair 1 missing, 0 inconsistent objects
2014-06-16 10:33:19.906378 7fb9e9543700 0 log [ERR] : 3.151 repair 2 errors, 2 fixed
2014-06-16 10:33:19.924977 7fb9e9d44700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb9e9d44700

1: /usr/bin/ceph-osd() [0x974a1f]
2: (()+0x10340) [0x7fba089de340]
3: (ReplicatedBackend::build_push_op(ObjectRecoveryInfo const&, ObjectRecoveryProgress const&, ObjectRecoveryProgress*, PushOp*, object_stat_sum_t*)+0xc1c) [0x7d209c]
4: (ReplicatedBackend::prep_push(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&, PushOp*)+0x3d8) [0x7d2b48]
5: (ReplicatedBackend::prep_push_to_replica(std::tr1::shared_ptr<ObjectContext>, hobject_t const&, pg_shard_t, PushOp*)+0x3af) [0x7d6b8f]
6: (ReplicatedBackend::start_pushes(hobject_t const&, std::tr1::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x1af) [0x7d9c6f]
7: (C_ReplicatedBackend_OnPullComplete::finish(ThreadPool::TPHandle&)+0x143) [0x84b083]
8: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x9) [0x661a09]
9: (ReplicatedPG::BlessedGenContext<ThreadPool::TPHandle&>::finish(ThreadPool::TPHandle&)+0x95) [0x824f65]
10: (GenContext<ThreadPool::TPHandle&>::complete(ThreadPool::TPHandle&)+0x9) [0x661a09]
11: (ThreadPool::WorkQueueVal<GenContext<ThreadPool::TPHandle&>, GenContext<ThreadPool::TPHandle&>>::_void_process(void*, ThreadPool::TPHandle&)+0x62) [0x6697d2]
12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xaf1) [0xa4b351]
13: (ThreadPool::WorkThread::entry()+0x10) [0xa4c440]
14: (()+0x8182) [0x7fba089d6182]
15: (clone()+0x6d) [0x7fba06d7730d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Files

GOOD-1-removed-primary.txt (1.88 MB): primary OSD log after removing the file from the primary; scrub is fine. Loïc Dachary, 08/22/2014 06:03 AM
BAD-removed-primary.txt (1.32 MB): primary OSD log after removing the file from the primary; scrub crashes. Loïc Dachary, 08/22/2014 06:03 AM

Related issues 2 (1 open, 1 closed)

Related to Ceph - Bug #9114: osd: segv in build_push_op (Duplicate, 08/14/2014)

Related to RADOS - Feature #9328: osd: generalize the scrub workflow (New)
