Bug #41517

Missing head object at primary with snapshots crashes primary

Added by David Zafman over 4 years ago. Updated over 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The script below crashes osd.1 when it tries to recover to osd.3 after osd.2 is marked out. When the primary sees that "object" is missing locally, it pulls it from osd.0 and hits an assert: the code compares the snapset seq that arrived from the replica against the local information, but without a local object the local snapset doesn't really exist. Perhaps the code should check whether pi.obc->ssc->exists == false and, if so, simply accept the replica's snapset.

#! /bin/sh

../src/stop.sh
MDS=0 MON=1 MGR=1 OSD=4 ../src/vstart.sh -l -d -n

sleep 5
bin/ceph osd pool create test 1 1
bin/rados -p test put object /etc/passwd
bin/rados -p test mksnap snap1
bin/rados -p test put object /etc/fstab

../src/stop.sh
bin/ceph-objectstore-tool --no-mon-config --data-path dev/osd1 --force '["1.0",{"oid":"object","key":"","snapid":-2,"hash":1460548130,"max":0,"pool":1,"namespace":"","max":0}]' remove

MDS=0 MON=1 MGR=1 OSD=4 ../src/vstart.sh -l -d

bin/ceph osd out osd.2
    -8> 2019-08-27T03:06:09.839+0000 7f3f817d8700 10 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] _handle_message: 0x55728ad611e0
    -7> 2019-08-27T03:06:09.839+0000 7f3f817d8700 10 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] handle_pull_response ObjectRecoveryInfo(1:444c70ea:::object:head@21'3, size: 89, copy_subset: [0~89], clone_subset: {}, snapset: 0=[]:{}, object_exist: 0)ObjectRecoveryProgress(!first, data_recovered_to:89, data_complete:true, omap_recovered_to:, omap_complete:true, error:false) data.size() is 89 data_included: [0~89]
    -6> 2019-08-27T03:06:09.839+0000 7f3f817d8700 10 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] get_object_context: obc NOT found in cache: 1:444c70ea:::object:head
    -5> 2019-08-27T03:06:09.839+0000 7f3f817d8700 10 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] populate_obc_watchers 1:444c70ea:::object:head
    -4> 2019-08-27T03:06:09.839+0000 7f3f817d8700 20 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] PrimaryLogPG::check_blacklisted_obc_watchers for obc 1:444c70ea:::object:head
    -3> 2019-08-27T03:06:09.839+0000 7f3f817d8700 10 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] get_object_context: creating obc from disk: 0x55728b0422c0
    -2> 2019-08-27T03:06:09.839+0000 7f3f817d8700 10 osd.1 pg_epoch: 32 pg[1.0( v 21'3 (0'0,21'3] local-lis/les=31/32 n=2 ec=19/19 lis/c=31/28 les/c/f=32/29/0 sis=31) [1,0,3] r=0 lpr=31 pi=[28,31)/1 crt=21'3 lcod 0'0 mlcod 0'0 active+recovering rops=1 m=1 mbc={255={(1+1)=1}}] get_object_context: 0x55728b0422c0 1:444c70ea:::object:head rwstate(none n=0 w=0) oi: 1:444c70ea:::object:head(21'3 client.4147.0:1 dirty|data_digest s 89 uv 3 dd b5eba9a6 alloc_hint [0 0 0]) exists: 1 ssc: 0x557289945fa0 snapset: 0=[]:{}
#0  raise (sig=<optimized out>) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x000055727dd76546 in reraise_fatal (signum=6) at /home/dzafman/ceph/src/global/signal_handler.cc:81
#2  0x000055727dd7760b in handle_fatal_signal (signum=6) at /home/dzafman/ceph/src/global/signal_handler.cc:326
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#5  0x00007f3fa2800801 in __GI_abort () at abort.c:79
#6  0x00007f3fa27f039a in __assert_fail_base (fmt=0x7f3fa29777d8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x55727e960390 "ss.seq == pi.obc->ssc->snapset.seq", file=file@entry=0x55727e95e840 "/home/dzafman/ceph/src/osd/ReplicatedBackend.cc",
    line=line@entry=1794,
    function=function@entry=0x55727e9602a8 "bool ReplicatedBackend::handle_pull_response(pg_shard_t, const PushOp&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info>*, ObjectStore::Transaction*)") at assert.c:92
#7  0x00007f3fa27f0412 in __GI___assert_fail (assertion=0x55727e960390 "ss.seq == pi.obc->ssc->snapset.seq", file=0x55727e95e840 "/home/dzafman/ceph/src/osd/ReplicatedBackend.cc",
    line=1794,
    function=0x55727e9602a8 "bool ReplicatedBackend::handle_pull_response(pg_shard_t, const PushOp&, PullOp*, std::__cxx11::list<ReplicatedBackend::pull_complete_info>*, ObjectStore::Transaction*)") at assert.c:101
#8  0x000055727d9d23ef in ReplicatedBackend::handle_pull_response (this=0x557289941680, from=..., pop=..., response=0x55728afc3a80, to_continue=0x7f3f817d43f0, t=0x7f3f817d4410)
    at /home/dzafman/ceph/src/osd/ReplicatedBackend.cc:1794
#9  0x000055727d9c7e9f in ReplicatedBackend::_do_pull_response (this=0x557289941680, op=...) at /home/dzafman/ceph/src/osd/ReplicatedBackend.cc:831
#10 0x000055727d9dc58c in ReplicatedBackend::do_push (this=0x557289941680, op=...) at /home/dzafman/ceph/src/osd/ReplicatedBackend.h:227
#11 0x000055727d9c2f18 in ReplicatedBackend::_handle_message (this=0x557289941680, op=...) at /home/dzafman/ceph/src/osd/ReplicatedBackend.cc:193
#12 0x000055727d729200 in PGBackend::handle_message (this=0x557289941680, op=...) at /home/dzafman/ceph/src/osd/PGBackend.cc:114
#13 0x000055727d5ba774 in PrimaryLogPG::do_request (this=0x55728a6e1400, op=..., handle=...) at /home/dzafman/ceph/src/osd/PrimaryLogPG.cc:1577
#14 0x000055727d3cd81c in OSD::dequeue_op (this=0x55728a6de000, pg=..., op=..., handle=...) at /home/dzafman/ceph/src/osd/OSD.cc:9918
#15 0x000055727d87585c in PGOpItem::run (this=0x55728a995590, osd=0x55728a6de000, sdata=0x557289a00000, pg=..., handle=...) at /home/dzafman/ceph/src/osd/OpQueueItem.cc:24
#16 0x000055727d3fbd19 in OpQueueItem::run (this=0x7f3f817d5190, osd=0x55728a6de000, sdata=0x557289a00000, pg=..., handle=...) at /home/dzafman/ceph/src/osd/OpQueueItem.h:134
#17 0x000055727d3db271 in OSD::ShardedOpWQ::_process (this=0x55728a6df2d8, thread_index=0, hb=0x55728ad2e6e0) at /home/dzafman/ceph/src/osd/OSD.cc:11112
#18 0x000055727de1a932 in ShardedThreadPool::shardedthreadpool_worker (this=0x55728a6dead0, thread_index=0) at /home/dzafman/ceph/src/common/WorkQueue.cc:311
#19 0x000055727de1c315 in ShardedThreadPool::WorkThreadSharded::entry (this=0x55728a997fb0) at /home/dzafman/ceph/src/common/WorkQueue.h:705
#20 0x000055727de06bd6 in Thread::entry_wrapper (this=0x55728a997fb0) at /home/dzafman/ceph/src/common/Thread.cc:84
#21 0x000055727de06b54 in Thread::_entry_func (arg=0x55728a997fb0) at /home/dzafman/ceph/src/common/Thread.cc:71
#22 0x00007f3fa3b986db in start_thread (arg=0x7f3f817d8700) at pthread_create.c:463
#23 0x00007f3fa28e188f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #39286: primary recovery local missing object did not update obc (Resolved, 04/13/2019)

Actions #1

Updated by David Zafman over 4 years ago

  • Status changed from 12 to In Progress
  • Pull request ID set to 29941
Actions #2

Updated by David Zafman over 4 years ago

  • Backport set to luminous, mimic, nautilus
Actions #3

Updated by David Zafman over 4 years ago

Actions #4

Updated by Nathan Cutler over 4 years ago

  • Related to Bug #39286: primary recovery local missing object did not update obc added
Actions #5

Updated by Nathan Cutler over 4 years ago

Backporting note:

Cherry-pick https://github.com/ceph/ceph/pull/27575 first, then https://github.com/ceph/ceph/pull/29941 on top of it.

Actions #6

Updated by David Zafman over 4 years ago

  • Status changed from In Progress to Resolved
  • Backport deleted (luminous, mimic, nautilus)