Project

General

Profile

Bug #23847

osd stuck recovery

Added by Sage Weil almost 6 years ago. Updated almost 6 years ago.

Status:
Duplicate
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES     LOG DISK_LOG STATE                                             STATE_STAMP                VERSION REPORTED UP      UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN 
1.7           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:06:47.267682     0'0  551:533 [6,3,7]          6 [6,3,7]              6        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 
1.6           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:07:57.304165     0'0  551:549 [6,0,3]          6 [6,0,3]              6        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 
3.3          58                  0       33         0       0 160075511 293      293    active+recovering+undersized+degraded+remapped 2018-04-24 20:07:58.720378  56'293 549:1027 [4,0,6]          4   [4,6]              4        0'0 2018-04-24 20:06:59.306887             0'0 2018-04-24 20:06:59.306887             0 
2.2         117                  0        0         0       0 324376158 562      562                                      active+clean 2018-04-24 20:07:17.249546  55'562 550:1664 [5,1,6]          5 [5,1,6]              5        0'0 2018-04-24 20:06:57.290080             0'0 2018-04-24 20:06:57.290080             0 
1.1           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:07:57.303446     0'0  550:530 [5,0,2]          5 [5,0,2]              5        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 
3.2          79                  0        0         0       0 184647675 362      362                                      active+clean 2018-04-24 20:07:45.339762  55'362 550:1265 [5,3,6]          5 [5,3,6]              5        0'0 2018-04-24 20:06:59.306887             0'0 2018-04-24 20:06:59.306887             0 
2.3         114                  0        0         0       0 315464473 618      618                                      active+clean 2018-04-24 20:07:17.247905  56'618 550:1730 [5,2,7]          5 [5,2,7]              5        0'0 2018-04-24 20:06:57.290080             0'0 2018-04-24 20:06:57.290080             0 
1.0           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:07:57.307186     0'0  551:550 [7,0,6]          7 [7,0,6]              7        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 
3.0          80                  0        0         0       0 187773501 319      319                                      active+clean 2018-04-24 20:07:00.329366  56'319 550:1125 [1,2,6]          1 [1,2,6]              1        0'0 2018-04-24 20:06:59.306887             0'0 2018-04-24 20:06:59.306887             0 
2.1          99                  0        0         0       0 268803379 527      527                                      active+clean 2018-04-24 20:07:59.800901  55'527 550:1582 [2,3,0]          2 [2,3,0]              2        0'0 2018-04-24 20:06:57.290080             0'0 2018-04-24 20:06:57.290080             0 
1.2           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:07:57.304541     0'0  551:551 [7,0,4]          7 [7,0,4]              7        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 
3.1          53                  0       36         0       0 127669239 230      230    active+recovering+undersized+degraded+remapped 2018-04-24 20:07:58.322796  55'230  463:579 [0,4,3]          0   [4,3]              4        0'0 2018-04-24 20:06:59.306887             0'0 2018-04-24 20:06:59.306887             0 
2.0         103                  0       66         0       0 288678194 522      522 active+recovery_wait+undersized+degraded+remapped 2018-04-24 20:07:58.321745  56'522 551:1448 [7,1,0]          7   [7,1]              7        0'0 2018-04-24 20:06:57.290080             0'0 2018-04-24 20:06:57.290080             0 
1.3           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:06:47.275325     0'0  550:536 [1,6,4]          1 [1,6,4]              1        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 
1.4           0                  0        0         0       0         0   0        0                                      active+clean 2018-04-24 20:07:45.338578     0'0  550:535 [5,3,7]          5 [5,3,7]              5        0'0 2018-04-24 20:06:46.250035             0'0 2018-04-24 20:06:46.250035             0 

after lots of elapsed time.

/a/sage-2018-04-24_16:58:54-rados-wip-sage3-testing-2018-04-24-1010-distro-basic-smithi/2434862

saw a similar failure yesterday.


Related issues

Related to RADOS - Bug #23664: cache-try-flush hits wrlock, busy loops Resolved 04/11/2018

History

#1 Updated by Sage Weil almost 6 years ago

recovery on 3.3 stalls out here

2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 rops=1 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] finish_recovery_op 3:ca231196:::smithi00216430-270:head
2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 56 finish_recovery_op pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] 3:ca231196:::smithi00216430-270:head dequeue=0 (10/10 rops)
2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 56 _maybe_queue_recovery starting 1, recovery_ops_reserved 0 -> 1
2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] queue_recovery -- queuing
2018-04-24 20:07:58.919 7f53419c0700 15 osd.4 56 _recover_now active 9 + reserved 1 >= max 10
2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] release_backoffs [3:ca231196:::smithi00216430-270:head,3:ca231196:::smithi00216430-270:head)
2018-04-24 20:07:58.919 7f53419c0700 10 osd.4 pg_epoch: 56 pg[3.3( v 56'276 (0'0,56'276] local-lis/les=54/55 n=56 ec=19/19 lis/c 54/19 les/c/f 55/20/0 53/54/19) [4,0,6]/[4,6] async=[0] r=0 lpr=54 pi=[19,54)/1 luod=55'275 crt=56'276 lcod 55'274 mlcod 36'96 active+recovering+undersized+degraded+remapped mbc={255={(2+0)=33}}] finish_degraded_object 3:ca231196:::smithi00216430-270:head

#2 Updated by Sage Weil almost 6 years ago

  • Related to Bug #23664: cache-try-flush hits wrlock, busy loops added

#3 Updated by Sage Weil almost 6 years ago

Recovery is starved by #23664, a cache tiering infinite loop.

#4 Updated by Sage Weil almost 6 years ago

  • Status changed from 12 to Duplicate

Also available in: Atom PDF