Bug #13116
osd: pg stuck in replay
Description
"description": "osd_op(mds.0.172:22092466 1000fb5427c.00000000 [create 0~0,setxattr parent (347)] 0.e463883e RETRY=6 ondisk+retry+write+known_if_redirected e648294)",
"initiated_at": "2015-09-15 19:42:36.227613",
"age": 35796.821774,
"duration": 210.344416,
"type_data": [
"delayed",
[
{
"time": "2015-09-15 19:42:36.227613",
"event": "initiated"
},
{
"time": "2015-09-15 19:42:36.377110",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:42:37.338774",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:42:38.260745",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:44:08.373928",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:44:08.541626",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:44:08.635922",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:44:53.262739",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:45:04.467969",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:46:06.571989",
"event": "reached_pg"
},
{
"time": "2015-09-15 19:46:06.572029",
"event": "waiting for replay end"
}
]
]
It is now Wed Sep 16 05:40:28 PDT 2015, more than 10 hours later, and the pg still appears to be stuck in replay.
Related issues
Associated revisions
osd: fix requeue of replay requests during activating
If the replay period expires while we are still in the activating
state, we can simply insert our list of requests at the front of
the waiting_for_active list.
Fixes: #13116
Signed-off-by: Sage Weil <sage@redhat.com>
osd: fix requeue of replay requests during activating
If the replay period expires while we are still in the activating
state, we can simply insert our list of requests at the front of
the waiting_for_active list.
Fixes: #13116
Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit d18cf51d9419819cdda3782b188b010969288911)
History
#1 Updated by Sage Weil over 8 years ago
- Subject changed from hammer: pg stuck in replay to osd: pg stuck in replay
- Status changed from In Progress to 12
- Assignee deleted (Sage Weil)
2015-09-15 7ff7eb936700 10 osd.79 pg_epoch: 648335 pg[0.83e( v 648179'693979 (623242'690950,648179'693979] local-les=633480 n=7438 ec=1 les/c 633480/636597 648320/648320/648294) [79,101,3] r=0 lpr=648320 pi=633391-648319/10 crt=648049'693976 lcod 0'0 mlcod 0'0 inactive] activate starting replay interval for 45 until 2015-09-15 19:45:49.466019
...
2015-09-15 19:45:50.351316 7ff80498a700 10 osd.79 648335 check_replay_queue pg[0.83e( v 648179'693979 (623242'690950,648179'693979] local-les=648335 n=7438 ec=1 les/c 633480/636597 648320/648320/648294) [79,101,3] r=0 lpr=648320 pi=633391-648319/10 crt=648049'693976 lcod 0'0 mlcod 0'0 activating+replay+degraded]
the pg isn't active, so it fails this test:
dout(10) << "check_replay_queue " << *pg << dendl;
if (pg->is_active() &&
    pg->is_replay() &&
    pg->is_primary() &&
    pg->replay_until == p->second) {
  pg->replay_queued_ops();
}
we need to requeue everything on waiting_for_active, probably?
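A minimal sketch of that idea, for illustration only and not the actual Ceph patch: if the replay deadline passes while the pg is still activating, the held ops are moved to the front of waiting_for_active instead of being left behind. ToyPG, on_replay_expired, the ready list, and the boolean flags are hypothetical stand-ins for the real PG state; clearing the replay flag on expiry is also an assumption.

```cpp
#include <cassert>
#include <list>
#include <string>

// Hypothetical, simplified stand-in for a placement group; not Ceph's PG class.
struct ToyPG {
  bool primary = true;
  bool active = false;      // fully active
  bool activating = false;  // peered, but activation not yet complete
  bool replay = true;       // replay interval in progress
  std::list<std::string> replay_queue;        // ops held during the replay interval
  std::list<std::string> waiting_for_active;  // ops held until the pg becomes active
  std::list<std::string> ready;               // ops released for processing
};

// Called once the replay deadline has passed.
// Returns true if the held ops were moved somewhere.
bool on_replay_expired(ToyPG& pg) {
  if (!pg.primary || !pg.replay) {
    return false;
  }
  if (pg.active) {
    // Normal case: release the replayed ops for processing, in order.
    pg.ready.splice(pg.ready.end(), pg.replay_queue);
  } else if (pg.activating) {
    // Case from this bug: not active yet, so park the ops at the FRONT of
    // waiting_for_active so they run before ops that arrived later.
    pg.waiting_for_active.splice(pg.waiting_for_active.begin(), pg.replay_queue);
  } else {
    return false;
  }
  assert(pg.replay_queue.empty());  // splice leaves the source list empty
  pg.replay = false;                // assumption: replay state ends here
  return true;
}
```

Splicing at the front rather than the back keeps the replayed ops ahead of anything that queued up while the pg was activating, which is the ordering the commit message describes.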
#2 Updated by Sage Weil over 8 years ago
To get good coverage of this case we should set the replay interval for test pools to something short (5 or 10 seconds).
#3 Updated by Sage Weil over 8 years ago
- Status changed from 12 to Fix Under Review
#4 Updated by Sage Weil over 8 years ago
- Status changed from Fix Under Review to 7
#5 Updated by Samuel Just over 8 years ago
- Status changed from 7 to Resolved
#6 Updated by Greg Farnum over 8 years ago
- Status changed from Resolved to Pending Backport
- Priority changed from Urgent to Normal
Can we get this put in hammer as well when it's convenient?
#7 Updated by Nathan Cutler over 8 years ago
- Backport set to hammer
#8 Updated by Nathan Cutler over 8 years ago
- Copied to Backport #13620: osd: pg stuck in replay added
#9 Updated by Loïc Dachary over 8 years ago
- Status changed from Pending Backport to Resolved