Actions
Bug #23847
closedosd stuck recovery
% Done:
0%
Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN 1.7 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:06:47.267682 0'0 551:533 [6,3,7] 6 [6,3,7] 6 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0 1.6 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:07:57.304165 0'0 551:549 [6,0,3] 6 [6,0,3] 6 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0 3.3 58 0 33 0 0 160075511 293 293 active+recovering+undersized+degraded+remapped 2018-04-24 20:07:58.720378 56'293 549:1027 [4,0,6] 4 [4,6] 4 0'0 2018-04-24 20:06:59.306887 0'0 2018-04-24 20:06:59.306887 0 2.2 117 0 0 0 0 324376158 562 562 active+clean 2018-04-24 20:07:17.249546 55'562 550:1664 [5,1,6] 5 [5,1,6] 5 0'0 2018-04-24 20:06:57.290080 0'0 2018-04-24 20:06:57.290080 0 1.1 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:07:57.303446 0'0 550:530 [5,0,2] 5 [5,0,2] 5 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0 3.2 79 0 0 0 0 184647675 362 362 active+clean 2018-04-24 20:07:45.339762 55'362 550:1265 [5,3,6] 5 [5,3,6] 5 0'0 2018-04-24 20:06:59.306887 0'0 2018-04-24 20:06:59.306887 0 2.3 114 0 0 0 0 315464473 618 618 active+clean 2018-04-24 20:07:17.247905 56'618 550:1730 [5,2,7] 5 [5,2,7] 5 0'0 2018-04-24 20:06:57.290080 0'0 2018-04-24 20:06:57.290080 0 1.0 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:07:57.307186 0'0 551:550 [7,0,6] 7 [7,0,6] 7 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0 3.0 80 0 0 0 0 187773501 319 319 active+clean 2018-04-24 20:07:00.329366 56'319 550:1125 [1,2,6] 1 [1,2,6] 1 0'0 2018-04-24 20:06:59.306887 0'0 2018-04-24 20:06:59.306887 0 2.1 99 0 0 0 0 268803379 527 527 active+clean 2018-04-24 20:07:59.800901 55'527 550:1582 [2,3,0] 2 [2,3,0] 2 0'0 2018-04-24 20:06:57.290080 0'0 2018-04-24 20:06:57.290080 0 1.2 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:07:57.304541 0'0 551:551 [7,0,4] 7 [7,0,4] 7 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0 3.1 53 0 36 0 0 127669239 230 230 active+recovering+undersized+degraded+remapped 2018-04-24 20:07:58.322796 55'230 463:579 [0,4,3] 0 [4,3] 4 0'0 2018-04-24 20:06:59.306887 0'0 2018-04-24 20:06:59.306887 0 2.0 103 0 66 0 0 288678194 522 522 active+recovery_wait+undersized+degraded+remapped 2018-04-24 20:07:58.321745 56'522 551:1448 [7,1,0] 7 [7,1] 7 0'0 2018-04-24 20:06:57.290080 0'0 2018-04-24 20:06:57.290080 0 1.3 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:06:47.275325 0'0 550:536 [1,6,4] 1 [1,6,4] 1 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0 1.4 0 0 0 0 0 0 0 0 active+clean 2018-04-24 20:07:45.338578 0'0 550:535 [5,3,7] 5 [5,3,7] 5 0'0 2018-04-24 20:06:46.250035 0'0 2018-04-24 20:06:46.250035 0
after lots of elapsed time.
/a/sage-2018-04-24_16:58:54-rados-wip-sage3-testing-2018-04-24-1010-distro-basic-smithi/2434862
saw a similar failure yesterday.
Updated by Sage Weil almost 6 years ago
recovery on 3.3 stalls out here
2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 rops=1 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] finish_recovery_op 3:ca231196:::smithi00216430-270:head 2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 56 finish_recovery_op pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] 3:ca231196:::smithi00216430-270:head dequeue=0 (10/10 rops) 2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 56 _maybe_queue_recovery starting 1, recovery_ops_reserved 0 -> 1 2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] queue_recovery -- queuing 2018-04-24 20:07:58.919 7f53419c0700 15 osd.4 56 _recover_now active 9 + reserved 1 >= max 10 2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] release_backoffs [3:ca231196:::smithi00216430-270:head,3:ca231196:::smithi00216430-270:head) 2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] finish_degraded_object 3:ca231196:::smithi00216430-270:head
Updated by Sage Weil almost 6 years ago
- Related to Bug #23664: cache-try-flush hits wrlock, busy loops added
Updated by Sage Weil almost 6 years ago
Recovery is starved by #23664, a cache tiering infinite loop.
Actions