Bug #23875 (closed)

Removal of snapshot with corrupt replica crashes osd

Added by David Zafman about 6 years ago. Updated over 3 years ago.

Status: Resolved
Priority: Normal
Assignee: David Zafman
Category: -
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

This may be a completely legitimate crash due to the corruption.

See pending test case TEST_scrub_snaps_replica in osd-scrub-snaps.sh:

2018-04-25 14:50:14.797 7f06b8852700 20 osd.0 op_wq(0) _process OpQueueItem(1.0 PGOpItem(op=osd_repop(osd.1.0:0 1.0 e26/23) v2) prio 127 cost 1041 e26) queued
2018-04-25 14:50:14.797 7f06b8852700 20 osd.0 op_wq(0) _process 1.0 to_process <OpQueueItem(1.0 PGOpItem(op=osd_repop(osd.1.0:0 1.0 e26/23) v2) prio 127 cost 1041 e26)> waiting <> waiting_peering {26=<OpQueueItem(1.0 PGPeeringEvent(epoch_sent: 26 epoch_requested: 26 MInfoRec from 1 info: 1.0( v 19'53 (0'0,19'53] local-lis/les=23/24 n=34 ec=10/10 lis/c 23/23 les/c/f 24/24/0 23/23/23)) prio 255 cost 10 e26)>}
2018-04-25 14:50:14.797 7f06b8852700 20 osd.0 op_wq(0) _process OpQueueItem(1.0 PGOpItem(op=osd_repop(osd.1.0:0 1.0 e26/23) v2) prio 127 cost 1041 e26) pg 0x557a76acd400
2018-04-25 14:50:14.797 7f06b8852700 10 osd.0 25 dequeue_op 0x557a77272a80 prio 127 cost 1041 latency 0.000145 osd_repop(osd.1.0:0 1.0 e26/23) v2 pg pg[1.0( v 19'53 (0'0,19'53] local-lis/les=23/24 n=34 ec=10/10 lis/c 23/23 les/c/f 24/24/0 23/23/23) [1,0] r=1 lpr=23 luod=0'0 crt=19'53 lcod 0'0 active mbc={}]
2018-04-25 14:50:14.797 7f06b8852700 20 osd.0 25 share_map osd.1 127.0.0.1:6806/570 26
2018-04-25 14:50:14.797 7f06b8852700 20 osd.0 25 should_share_map osd.1 127.0.0.1:6806/570 26
2018-04-25 14:50:14.797 7f06b8852700 10 osd.0 pg_epoch: 25 pg[1.0( v 19'53 (0'0,19'53] local-lis/les=23/24 n=34 ec=10/10 lis/c 23/23 les/c/f 24/24/0 23/23/23) [1,0] r=1 lpr=23 luod=0'0 crt=19'53 lcod 0'0 active mbc={}] _handle_message: 0x557a77272a80
2018-04-25 14:50:14.797 7f06b8852700 10 osd.0 pg_epoch: 25 pg[1.0( v 19'53 (0'0,19'53] local-lis/les=23/24 n=34 ec=10/10 lis/c 23/23 les/c/f 24/24/0 23/23/23) [1,0] r=1 lpr=23 luod=0'0 crt=19'53 lcod 0'0 active mbc={}] do_repop 1:ee9ae150:::obj4:7 v 26'55 (transaction) 328
2018-04-25 14:50:14.797 7f06b8852700 20 snap_mapper.update_snaps 1:ee9ae150:::obj4:7 3,4,5,6,7 was
2018-04-25 14:50:14.797 7f06b8852700 20 snap_mapper.get_snaps 1:ee9ae150:::obj4:7 got.empty()
2018-04-25 14:50:14.797 7f06b8852700 -1 /home/dzafman/ceph/src/osd/PG.cc: In function 'void PG::update_snap_map(const std::vector<pg_log_entry_t>&, ObjectStore::Transaction&)' thread 7f06b8852700 time 2018-04-25 14:50:14.800145
/home/dzafman/ceph/src/osd/PG.cc: 3851: FAILED assert(r == 0)
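
The snap_mapper lines above show the failure mode: update_snaps is asked to write the snap set for clone 1:ee9ae150:::obj4:7, but get_snaps finds no existing mapping (got.empty()), so update_snaps returns an error and the assert(r == 0) at PG.cc:3851 fires. A minimal sketch of that control flow, using a hypothetical ToySnapMapper stand-in rather than the real SnapMapper/PG code:

// Simplified model of the failing path; hypothetical types, not Ceph source.
#include <cassert>
#include <cstdio>
#include <map>
#include <set>
#include <string>

// Stand-in for SnapMapper: maps a clone object to the snaps it serves.
struct ToySnapMapper {
  std::map<std::string, std::set<unsigned>> snaps_by_obj;

  // Mirrors the update_snaps semantics visible in the log: read the
  // existing mapping first, and fail if there is none.
  int update_snaps(const std::string& oid, const std::set<unsigned>& new_snaps) {
    auto it = snaps_by_obj.find(oid);
    if (it == snaps_by_obj.end())
      return -2;  // -ENOENT: the "got.empty()" case in the log
    it->second = new_snaps;
    return 0;
  }
};

int main() {
  ToySnapMapper mapper;
  // On the corrupt replica the mapping for obj4:7 is missing, so the
  // lookup comes back empty ...
  int r = mapper.update_snaps("1:ee9ae150:::obj4:7", {3, 4, 5, 6, 7});
  printf("update_snaps returned %d\n", r);
  // ... and the caller treats any nonzero return as fatal, matching
  // "PG.cc: 3851: FAILED assert(r == 0)". The abort takes the OSD down.
  assert(r == 0);
  return 0;
}

The backtrace from the resulting abort: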

#0  0x00007f2b59bfa269 in raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/pt-raise.c:35
#1  0x000055c638a3db1e in reraise_fatal (signum=6) at /home/dzafman/ceph/src/global/signal_handler.cc:74
#2  handle_fatal_signal (signum=6) at /home/dzafman/ceph/src/global/signal_handler.cc:138
#3  <signal handler called>
#4  0x00007f2b58da9428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#5  0x00007f2b58dab02a in __GI_abort () at abort.c:89
#6  0x00007f2b5b0fae2b in ceph::__ceph_assert_fail (assertion=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>)
    at /home/dzafman/ceph/src/common/assert.cc:66
#7  0x00007f2b5b0fae97 in ceph::__ceph_assert_fail (ctx=...) at /home/dzafman/ceph/src/common/assert.cc:71
#8  0x000055c63859e33b in PG::update_snap_map (this=0x55c63ae75400, log_entries=std::vector of length 2, capacity 2 = {...}, t=...) at /home/dzafman/ceph/src/osd/PG.cc:3851
#9  0x000055c6385c4b71 in PG::append_log (this=0x55c63ae75400, logv=std::vector of length 2, capacity 2 = {...}, trim_to=..., roll_forward_to=..., t=..., transaction_applied=true)
    at /home/dzafman/ceph/src/osd/PG.cc:3604
#10 0x000055c6386b7403 in non-virtual thunk to PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, bool, ObjectStore::Transaction&) ()
#11 0x000055c6387d4a79 in ReplicatedBackend::do_repop (this=this@entry=0x55c63a99f680, op=...) at /home/dzafman/ceph/src/osd/ReplicatedBackend.cc:1065
#12 0x000055c6387d7327 in ReplicatedBackend::_handle_message (this=0x55c63a99f680, op=...) at /home/dzafman/ceph/src/osd/ReplicatedBackend.cc:203
#13 0x000055c6386ebf87 in PGBackend::handle_message (this=<optimized out>, op=...) at /home/dzafman/ceph/src/osd/PGBackend.cc:114
#14 0x000055c63869b8ed in PrimaryLogPG::do_request (this=0x55c63ae75400, op=..., handle=...) at /home/dzafman/ceph/src/osd/PrimaryLogPG.cc:1794
#15 0x000055c6384fef08 in OSD::dequeue_op (this=this@entry=0x55c63ae72000, pg=..., op=..., handle=...) at /home/dzafman/ceph/src/osd/OSD.cc:8905
#16 0x000055c63876fc22 in PGOpItem::run (this=<optimized out>, osd=0x55c63ae72000, sdata=<optimized out>, pg=..., handle=...) at /home/dzafman/ceph/src/osd/OpQueueItem.cc:24
#17 0x000055c63851c2a4 in OpQueueItem::run (handle=..., pg=..., sdata=<optimized out>, osd=<optimized out>, this=0x7f2b3afa80e0) at /home/dzafman/ceph/src/osd/OpQueueItem.h:134
#18 OSD::ShardedOpWQ::_process (this=<optimized out>, thread_index=<optimized out>, hb=<optimized out>) at /home/dzafman/ceph/src/osd/OSD.cc:9909
#19 0x00007f2b5b0ffc7e in ShardedThreadPool::shardedthreadpool_worker (this=0x55c63ae729c8, thread_index=0) at /home/dzafman/ceph/src/common/WorkQueue.cc:339
#20 0x00007f2b5b101d00 in ShardedThreadPool::WorkThreadSharded::entry (this=<optimized out>) at /home/dzafman/ceph/src/common/WorkQueue.h:690
#21 0x00007f2b59bf06ba in start_thread (arg=0x7f2b3afad700) at pthread_create.c:333
#22 0x00007f2b58e7a82d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
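
The backtrace confirms this happens on the replica side: the OSD is applying an osd_repop from the primary (ReplicatedBackend::do_repop), appends the log entries via PG::append_log, and dies inside PG::update_snap_map. Since the description allows that this may be a legitimate crash, the open question is whether a corrupt replica should abort the whole OSD or degrade gracefully. A purely illustrative, hypothetical error-tolerant variant of the assert site (not Ceph's actual fix):

// Hypothetical alternative to the hard assert; illustrative only.
#include <cstdio>
#include <string>

enum class SnapMapResult { Ok, NoMapping };

// Instead of asserting on a missing mapping, record the inconsistency so
// scrub/repair can handle the object later.
bool update_snap_map_tolerant(SnapMapResult r, const std::string& oid) {
  if (r == SnapMapResult::Ok)
    return true;
  // A corrupt replica may simply lack the snap mapping; aborting here
  // turns one damaged object into a whole-OSD outage.
  fprintf(stderr, "update_snap_map: missing snap mapping for %s; "
                  "flagging object for repair instead of asserting\n",
          oid.c_str());
  return false;  // caller would mark the object inconsistent
}

int main() {
  update_snap_map_tolerant(SnapMapResult::NoMapping, "1:ee9ae150:::obj4:7");
  return 0;
}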

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #24396: osd crashes in on_local_recover due to stray clone (Resolved, 06/04/2018)
