Bug #12990

osd: we do not ignore notify from down osds

Added by Sage Weil about 7 years ago. Updated almost 7 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

 -3711> 2015-09-07 16:08:15.169156 7f51dcf24700 -1 osd/ReplicatedBackend.cc: In function 'void ReplicatedBackend::prepare_pull(eversion_t, const hobject_t&, ObjectContextRef, ReplicatedBackend::RPGHandle*)' thread 7f51dcf24700 time 2015-09-07 16:08:15.146635
osd/ReplicatedBackend.cc: 1463: FAILED assert(get_osdmap()->is_up(p->osd))

 ceph version 9.0.3-1470-g39777a0 (39777a0453c326e49b3dc4c424250b060c53776d)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f51f0b90da5]
 2: (ReplicatedBackend::prepare_pull(eversion_t, hobject_t const&, std::shared_ptr<ObjectContext>, ReplicatedBackend::RPGHandle*)+0x27f) [0x7f51f09b188f]
 3: (ReplicatedBackend::recover_object(hobject_t const&, eversion_t, std::shared_ptr<ObjectContext>, std::shared_ptr<ObjectContext>, PGBackend::RecoveryHandle*)+0x256) [0x7f51f09b6e06]
 4: (ReplicatedPG::recover_missing(hobject_t const&, eversion_t, int, PGBackend::RecoveryHandle*)+0x5b6) [0x7f51f07b3b56]
 5: (ReplicatedPG::maybe_kick_recovery(hobject_t const&)+0x378) [0x7f51f07b7ff8]
 6: (ReplicatedPG::wait_for_unreadable_object(hobject_t const&, std::shared_ptr<OpRequest>)+0x44) [0x7f51f07b81b4]
 7: (ReplicatedPG::do_op(std::shared_ptr<OpRequest>&)+0x528) [0x7f51f07dc898]
 8: (ReplicatedPG::do_request(std::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x6f8) [0x7f51f0778d58]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x7f51f05e077d]
 10: (PGQueueable::RunVis::operator()(std::shared_ptr<OpRequest>&)+0x6d) [0x7f51f05e09bd]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x693) [0x7f51f0606043]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0x7f51f0b812cf]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f51f0b83210]
 14: (()+0x7df5) [0x7f51eec35df5]
 15: (clone()+0x6d) [0x7f51ed4de1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

/a/sage-2015-09-07_06:22:35-rados-wip-sage-testing---basic-multi/1045898

- osd.0 is stray, sends notify to primary in epoch 1065
- osd.3 receives it
2015-09-07 16:08:12.744499 7f51e8ea5700 20 osd.3 1065 _dispatch 0x7f51f9f6f620 pg_notify(1.13c(2),0.117(1),1.72(1),0.73(2),0.164(1),1.165(1),1.fa(2),1.f8(2),0.107(2),0.29(2),1.115(2),0.31(1),1.30(2),1.1ad(4),1.12(2) epoch 1065) v5
- pg queues 1066 null event
2015-09-07 16:08:13.931565 7f51e8ea5700 10 osd.3 pg_epoch: 1065 pg[1.f8( v 872'163 lc 83'135 (0'0,872'163] local-les=1064 n=2 ec=6 les/c 1054/1034 1059/1059/1059) [3,2] r=0 lpr=1059 pi=1031-1058/2 crt=872'163 lcod 0'0 mlcod 0'0 activating+degraded m=2 u=2] null
- pg advances to 1066
2015-09-07 16:08:14.345773 7f51def28700 10 osd.3 pg_epoch: 1065 pg[1.f8( v 872'163 lc 83'135 (0'0,872'163] local-les=1064 n=2 ec=6 les/c 1054/1034 1059/1059/1059) [3,2] r=0 lpr=1059 pi=1031-1058/2 crt=872'163 lcod 0'0 mlcod 0'0 activating+degraded m=2 u=2] handle_advance_map [3,2]/[3,2] -- 3/3
- later the pg gets the queued notify
2015-09-07 16:08:14.904536 7f51df729700 10 osd.3 pg_epoch: 1067 pg[1.f8( v 872'163 lc 83'135 (0'0,872'163] local-les=1064 n=2 ec=6 les/c 1054/1034 1059/1059/1059) [3,2] r=0 lpr=1059 pi=1031-1058/2 crt=872'163 lcod 0'0 mlcod 0'0 activating+degraded m=2 u=2] handle_peering_event: epoch_sent: 1064 epoch_requested: 1064 MNotifyRec from 0 notify: (query_epoch:1064, epoch_sent:1064, info:1.f8( v 872'163 (0'0,872'163] local-les=1054 n=2 ec=6 les/c 1054/1034 1059/1059/1059)) features: 0x7fffffffffffff
2015-09-07 16:08:14.904562 7f51df729700 10 osd.3 pg_epoch: 1067 pg[1.f8( v 872'163 lc 83'135 (0'0,872'163] local-les=1064 n=2 ec=6 les/c 1054/1034 1059/1059/1059) [3,2] r=0 lpr=1059 pi=1031-1058/2 crt=872'163 lcod 0'0 mlcod 0'0 activating+degraded m=2 u=2] state<Started/Primary/Active>: Active: got notify from 0, calling proc_replica_info and discover_all_missing
2015-09-07 16:08:14.904585 7f51df729700 10 osd.3 pg_epoch: 1067 pg[1.f8( v 872'163 lc 83'135 (0'0,872'163] local-les=1064 n=2 ec=6 les/c 1054/1034 1059/1059/1059) [3,2] r=0 lpr=1059 pi=1031-1058/2 crt=872'163 lcod 0'0 mlcod 0'0 activating+degraded m=2 u=2] got osd.0 1.f8( v 872'163 (0'0,872'163] local-les=1054 n=2 ec=6 les/c 1054/1034 1059/1059/1059)
2015-09-07 16:08:14.904617 7f51df729700 10 osd.3 pg_epoch: 1067 pg[1.f8( v 872'163 lc 83'135 (0'0,872'163] local-les=1064 n=2 ec=6 les/c 1054/1034 1059/1059/1059) [3,2] r=0 lpr=1059 pi=1031-1058/2 crt=872'163 lcod 0'0 mlcod 0'0 activating+degraded m=2 u=2] osd.0 has stray content: 1.f8( v 872'163 (0'0,872'163] local-les=1054 n=2 ec=6 les/c 1054/1034 1059/1059/1059)

The map advance isn't ordered with respect to the peering queue, so the notify is processed after it has already become stale.


Related issues

Copied to Ceph - Backport #13336: osd: we do not ignore notify from down osds Resolved

Associated revisions

Revision 930d8eb1 (diff)
Added by Samuel Just about 7 years ago

PG: ignore info from down osd

Fixes: #12990
Backport: firefly, hammer
Signed-off-by: Samuel Just <>

Revision 7161a2c9 (diff)
Added by Samuel Just about 7 years ago

PG: ignore info from down osd

Fixes: #12990
Backport: firefly, hammer
Signed-off-by: Samuel Just <>
(cherry picked from commit 930d8eb1fb0daacd692d2e80b5ff473448bd4e8d)

History

#1 Updated by Sage Weil about 7 years ago

/a/sage-2015-09-12_17:44:29-rados-wip-sage-testing---basic-multi/1052905

#2 Updated by Sage Weil about 7 years ago

  • Status changed from New to 12
  • Assignee set to Samuel Just

#3 Updated by Samuel Just about 7 years ago

  • Status changed from 12 to 7

#4 Updated by Sage Weil about 7 years ago

  • Status changed from 7 to Pending Backport
  • Backport set to hammer

#5 Updated by Loïc Dachary about 7 years ago

What is the commit to be backported?

#7 Updated by Loïc Dachary almost 7 years ago

  • Status changed from Pending Backport to Resolved
