Bug #9835
osd: bug in misdirected op checks (firefly)
Description
ubuntu@teuthology:/var/lib/teuthworker/archive/teuthology-2014-10-18_19:22:02-upgrade:firefly-x-giant-distro-basic-multi/556732
pg 3.cs0 maps to [-1,3,2] momentarily, and we get an op. firefly says
2014-10-20 04:02:08.246434 7f0d9f7f7700 7 osd.3 74 hit non-existent pg 3.cs0
2014-10-20 04:02:08.246441 7f0d9f7f7700 7 osd.3 74 we are valid target for op, waiting
because
if (osdmap->get_pg_acting_role(pgid.pgid, whoami) >= 0) {
  dout(7) << "we are valid target for op, waiting" << dendl;
  waiting_for_pg[pgid].push_back(op);
  op->mark_delayed("waiting for pg to exist locally");
  return;
}
but then the op never goes away, because the pg never gets created on this node. This code is
all refactored in giant, so the same problem doesn't exist there.
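A minimal sketch of why the check passes even though the op can never be served, assuming that for an EC pool the position in the acting set is also the shard index. The names acting_role, firefly_accepts and holds_shard are hypothetical stand-ins for illustration, not the actual OSDMap or OSD API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical marker for "no OSD mapped to this slot" (the -1 in [-1,3,2]).
constexpr int kNoOsd = -1;

// Position of `whoami` in the acting set, or -1 if it is not a member.
// Assumption: for an EC pool this position is also the shard index.
int acting_role(const std::vector<int>& acting, int whoami) {
  if (whoami == kNoOsd) return -1;  // an empty slot is never "us"
  for (std::size_t i = 0; i < acting.size(); ++i) {
    if (acting[i] == whoami) return static_cast<int>(i);
  }
  return -1;
}

// Shape of the check quoted above: any role >= 0 counts as a valid target.
bool firefly_accepts(const std::vector<int>& acting, int whoami) {
  return acting_role(acting, whoami) >= 0;
}

// True only if `whoami` holds the specific shard the op names
// (shard 0 for pg 3.cs0).
bool holds_shard(const std::vector<int>& acting, int whoami, std::size_t shard) {
  return shard < acting.size() && acting[shard] == whoami;
}
```

With the acting set {-1, 3, 2} and whoami 3, acting_role returns 1, so firefly_accepts is true, while holds_shard for shard 0 is false. That is the situation in the log: osd.3 queues the op for 3.cs0, but would only ever hold shard 1.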
Associated revisions
osd/OSD: use OSDMap helper to determine if we are correct op target
Use the new helper. This fixes our behavior for EC pools where targeting
a different shard is not correct, while for replicated pools it may be. In
the EC case, it leaves the op hanging indefinitely in the OpTracker because
the pgid exists but as a different shard.
Fixes: #9835
Signed-off-by: Sage Weil <sage@redhat.com>
osd: discard rank > 0 ops on erasure pools
Erasure pools do not support read from replica, so we should drop
any rank > 0 requests.
This fixes a bug where an erasure pool maps to [1,2,3], temporarily maps
to [-1,2,3], sends a request to osd.2, and then remaps back to [1,2,3].
Because the 0 shard never appears on osd.2, the request sits in the
waiting_for_pg map indefinitely and causes slow request warnings.
This problem does not come up on replicated pools because all instances of
the PG are created equal.
Fix by only considering role == 0 for erasure pools as a correct mapping.
Fixes: #9835
Signed-off-by: Sage Weil <sage@redhat.com>
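The rule described in this commit message can be sketched roughly as follows. PoolType and is_valid_op_target are hypothetical names chosen for illustration, and this is not the code from the commit itself:

```cpp
// Hypothetical pool kinds, for illustration only.
enum class PoolType { Replicated, Erasure };

// `role` is this OSD's position in the acting set, or -1 if it is absent.
// Erasure pools: only role 0 is a correct mapping, since ops sent to a
// role > 0 shard are dropped.
// Replicated pools: any member of the acting set may be a valid target.
bool is_valid_op_target(PoolType type, int role) {
  if (type == PoolType::Erasure) {
    return role == 0;
  }
  return role >= 0;
}
```

In the [-1,2,3] example, osd.2 has role 1, so under this rule the request is treated as misdirected and discarded instead of being parked in waiting_for_pg.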
osd: use OSDMap helper to tell if ops are misdirected
calc_pg_role doesn't actually take into account primary affinity.
Fixes: #9835
Signed-off-by: Samuel Just <sam.just@inktank.com>
osd/OSD: use OSDMap helper to determine if we are correct op target
Use the new helper. This fixes our behavior for EC pools where targeting
a different shard is not correct, while for replicated pools it may be. In
the EC case, it leaves the op hanging indefinitely in the OpTracker because
the pgid exists but as a different shard.
Fixes: #9835
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit 9e05ba086a36ae9a04b347153b685c2b8adac2c3)
History
#1 Updated by Sage Weil over 9 years ago
- Description updated (diff)
#2 Updated by Greg Farnum over 9 years ago
Maybe we need to adjust how we're handling waiting_for_pg, but I don't think that this particular check is a bug — this is an op that can be legitimately targeted at us (we're a replica, and we allow replica ops), but we don't have the PG. So we have to wait for it.
#3 Updated by Sage Weil over 9 years ago
- Subject changed from osd: bug in misdirected op checks to osd: bug in misdirected op checks (firefly)
- Status changed from New to Fix Under Review
#4 Updated by Sage Weil over 9 years ago
- Status changed from Fix Under Review to Resolved