Project

General

Profile

Bug #22145

PG stuck in recovery_unfound

Added by Sage Weil about 1 year ago. Updated 12 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
11/16/2017
Due date:
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:

Description

2017-11-14 00:39:44.466 7ff634832700 10 osd.1 pg_epoch: 625 pg[2.b( v 612'499 lc 584'495 (0'0,612'499] local-lis/les=592/593 n=7 ec=19/19 lis/c 592/565 les/c/f 593/577/0 592/592/592) [1,3,6] r=0 lpr=592 pi=[565,592)/1 crt=612'499 mlcod 547'475 active+recovery_unfound+degraded m=2 u=2 rsq/ps=23/41] state<Started/Primary/Active>: Active: got notify from 2, calling proc_replica_info and discover_all_missing
2017-11-14 00:39:44.466 7ff634832700 20 osd.1 pg_epoch: 625 pg[2.b( v 612'499 lc 584'495 (0'0,612'499] local-lis/les=592/593 n=7 ec=19/19 lis/c 592/565 les/c/f 593/577/0 592/592/592) [1,3,6] r=0 lpr=592 pi=[565,592)/1 crt=612'499 mlcod 547'475 active+recovery_unfound+degraded m=2 u=2 rsq/ps=23/41] discover_all_missing: osd.2: in peer_missing_requested
2017-11-14 00:39:44.466 7ff634832700 20 osd.1 pg_epoch: 625 pg[2.b( v 612'499 lc 584'495 (0'0,612'499] local-lis/les=592/593 n=7 ec=19/19 lis/c 592/565 les/c/f 593/577/0 592/592/592) [1,3,6] r=0 lpr=592 pi=[565,592)/1 crt=612'499 mlcod 547'475 active+recovery_unfound+degraded m=2 u=2 rsq/ps=23/41] discover_all_missing: osd.3: we already have pg_missing_t
2017-11-14 00:39:44.466 7ff634832700 20 osd.1 pg_epoch: 625 pg[2.b( v 612'499 lc 584'495 (0'0,612'499] local-lis/les=592/593 n=7 ec=19/19 lis/c 592/565 les/c/f 593/577/0 592/592/592) [1,3,6] r=0 lpr=592 pi=[565,592)/1 crt=612'499 mlcod 547'475 active+recovery_unfound+degraded m=2 u=2 rsq/ps=23/41] discover_all_missing: osd.6: we already have pg_missing_t
2017-11-14 00:39:44.466 7ff63f0de700  1 -- 172.21.15.4:6809/30792 <== osd.2 172.21.15.4:6817/30430 3 ==== pg_notify((query:616 sent:616 2.b( v 591'497 (0'0,591'497] local-lis/les=565/566 n=6 ec=19/19 lis/c 565/565 les/c/f 566/577/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.1b( v 591'455 (0'0,591'455] local-lis/les=565/566 n=5 ec=417/19 lis/c 565/565 les/c/f 566/579/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.2b( v 591'487 (0'0,591'487] local-lis/les=565/566 n=1 ec=446/19 lis/c 565/565 les/c/f 566/570/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.2b( v 591'487 (0'0,591'487] local-lis/les=565/566 n=1 ec=446/19 lis/c 565/565 les/c/f 566/570/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:625 sent:625 2.18( v 589'181 (0'0,589'181] local-lis/les=565/566 n=3 ec=54/19 lis/c 565/565 les/c/f 566/572/0 592/592/565))=([565,591] intervals=([565,591] acting 1,2,3)) epoch 625) v6 ==== 7060+0+0 (4186841310 0 0) 0x5608198d14a0 con 0x560819d99700
2017-11-14 00:39:44.466 7ff63f0de700 20 osd.1 625 _dispatch 0x5608198d14a0 pg_notify((query:616 sent:616 2.b( v 591'497 (0'0,591'497] local-lis/les=565/566 n=6 ec=19/19 lis/c 565/565 les/c/f 566/577/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.1b( v 591'455 (0'0,591'455] local-lis/les=565/566 n=5 ec=417/19 lis/c 565/565 les/c/f 566/579/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.2b( v 591'487 (0'0,591'487] local-lis/les=565/566 n=1 ec=446/19 lis/c 565/565 les/c/f 566/570/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.2b( v 591'487 (0'0,591'487] local-lis/les=565/566 n=1 ec=446/19 lis/c 565/565 les/c/f 566/570/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:625 sent:625 2.18( v 589'181 (0'0,589'181] local-lis/les=565/566 n=3 ec=54/19 lis/c 565/565 les/c/f 566/572/0 592/592/565))=([565,591] intervals=([565,591] acting 1,2,3)) epoch 625) v6
2017-11-14 00:39:44.466 7ff63f0de700 20 osd.1 625 OSD::ms_dispatch: pg_notify((query:616 sent:616 2.b( v 591'497 (0'0,591'497] local-lis/les=565/566 n=6 ec=19/19 lis/c 565/565 les/c/f 566/577/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.1b( v 591'455 (0'0,591'455] local-lis/les=565/566 n=5 ec=417/19 lis/c 565/565 les/c/f 566/579/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.2b( v 591'487 (0'0,591'487] local-lis/les=565/566 n=1 ec=446/19 lis/c 565/565 les/c/f 566/570/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:616 sent:616 2.2b( v 591'487 (0'0,591'487] local-lis/les=565/566 n=1 ec=446/19 lis/c 565/565 les/c/f 566/570/0 592/592/592))=([565,591] intervals=([565,591] acting 1,2,3)) (query:625 sent:625 2.18( v 589'181 (0'0,589'181] local-lis/les=565/566 n=3 ec=54/19 lis/c 565/565 les/c/f 566/572/0 592/592/565))=([565,591] intervals=([565,591] acting 1,2,3)) epoch 625) v6
2017-11-14 00:39:44.469 7ff63f0de700  1 -- 172.21.15.4:6809/30792 <== osd.2 172.21.15.4:6817/30430 4 ==== pg_log(2.b epoch 625 log log((0'0,591'497], crt=591'497) pi ([0,0] intervals=) query_epoch 625) v5 ==== 83488+0+0 (3797587334 0 0) 0x56081a569c00 con 0x560819d99700
2017-11-14 00:39:44.469 7ff63f0de700 20 osd.1 625 _dispatch 0x56081a569c00 pg_log(2.b epoch 625 log log((0'0,591'497], crt=591'497) pi ([0,0] intervals=) query_epoch 625) v5
2017-11-14 00:39:44.469 7ff63f0de700 20 osd.1 625 OSD::ms_dispatch: pg_log(2.b epoch 625 log log((0'0,591'497], crt=591'497) pi ([0,0] intervals=) query_epoch 625) v5
2017-11-14 00:39:44.470 7ff634832700 10 osd.1 pg_epoch: 625 pg[2.b( v 612'499 lc 584'495 (0'0,612'499] local-lis/les=592/593 n=7 ec=19/19 lis/c 592/565 les/c/f 593/577/0 592/592/592) [1,3,6] r=0 lpr=592 pi=[565,592)/1 crt=612'499 mlcod 547'475 active+recovery_unfound+degraded m=2 rsq/ps=23/41] queue_recovery -- already queued
...
2017-11-14 00:39:44.557 7ff63282e700 10 osd.1 pg_epoch: 625 pg[2.b( v 612'499 lc 584'495 (0'0,612'499] local-lis/les=592/593 n=7 ec=19/19 lis/c 592/565 les/c/f 593/577/0 592/592/592) [1,3,6] r=0 lpr=592 pi=[565,592)/1 crt=612'499 mlcod 547'475 active+recovery_unfound+degraded m=2 rsq/ps=23/41] recovery raced and were queued twice, ignoring!

/a/sage-2017-11-13_22:25:20-rados-wip-sage-testing-2017-11-13-0823-distro-basic-smithi/1844925

PG is in NotRecovering, and backfilling|recovering bits not set, so when recovery runs it bails out in start_recovery_ops.

MLogRec handler is in Active state and just does queue_recovery. Probably needs to queue DoRecovery event or similar.


Related issues

Related to RADOS - Bug #19377: mark_unfound_lost revert won't actually recover the objects unless there are some found objects to recover as well Duplicate 03/24/2017
Related to RADOS - Bug #20059: miscounting degraded objects Resolved 05/23/2017
Copied to RADOS - Backport #22387: luminous: PG stuck in recovery_unfound Resolved

History

#2 Updated by Sage Weil about 1 year ago

  • Related to Bug #19377: mark_unfound_lost revert won't actually recover the objects unless there are some found objects to recover as well added

#3 Updated by Sage Weil about 1 year ago

  • Status changed from In Progress to Pending Backport
  • Backport set to luminous

#4 Updated by Nathan Cutler about 1 year ago

#5 Updated by David Zafman 12 months ago

  • Related to Bug #20059: miscounting degraded objects added

#6 Updated by David Zafman 12 months ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF