Bug #2044
Closed
osd: pg stuck in active+backfill
% Done: 0%
Description
jmlowe ran into this on his cluster several times. The primary doing backfill failed to requeue the pg for recovery.
This was the last time the recovery thread ran:
2012-02-07 20:51:24.938132 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] removing peer b7b82dc/rb.0.13.000000009306/head
2012-02-07 20:51:24.938176 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] send_remove_op 7b8182dc/rb.0.9.000000000e4d/head from osd.0 tid 2
2012-02-07 20:51:24.938228 7f0b4a5a6700 -- 149.165.228.11:6814/32218 --> osd.0 149.165.228.10:6802/12583 -- osd_sub_op(osd.10.0:2 0.2dc 7b8182dc/rb.0.9.000000000e4d/head [delete] v 219'18557 snapset=0=[]:[] snapc=0=[]) v1 -- ?+0 0x42a7b00
2012-02-07 20:51:24.938284 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] send_remove_op dbe182dc/rb.0.1a.000000000c57/head from osd.0 tid 3
2012-02-07 20:51:24.938361 7f0b4a5a6700 -- 149.165.228.11:6814/32218 --> osd.0 149.165.228.10:6802/12583 -- osd_sub_op(osd.10.0:3 0.2dc dbe182dc/rb.0.1a.000000000c57/head [delete] v 155'15389 snapset=0=[]:[] snapc=0=[]) v1 -- ?+0 0x42a8600
2012-02-07 20:51:24.938402 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] send_remove_op 295282dc/rb.0.18.000000002208/head from osd.0 tid 4
2012-02-07 20:51:24.938443 7f0b4a5a6700 -- 149.165.228.11:6814/32218 --> osd.0 149.165.228.10:6802/12583 -- osd_sub_op(osd.10.0:4 0.2dc 295282dc/rb.0.18.000000002208/head [delete] v 110'6343 snapset=0=[]:[] snapc=0=[]) v1 -- ?+0 0x42a8080
2012-02-07 20:51:24.938487 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] send_remove_op f4a982dc/rb.0.1a.0000000163fa/head from osd.0 tid 5
2012-02-07 20:51:24.938555 7f0b4a5a6700 -- 149.165.228.11:6814/32218 --> osd.0 149.165.228.10:6802/12583 -- osd_sub_op(osd.10.0:5 0.2dc f4a982dc/rb.0.1a.0000000163fa/head [delete] v 155'16264 snapset=0=[]:[] snapc=0=[]) v1 -- ?+0 0x42a9680
2012-02-07 20:51:24.938596 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] send_remove_op b7b82dc/rb.0.13.000000009306/head from osd.0 tid 6
2012-02-07 20:51:24.938636 7f0b4a5a6700 -- 149.165.228.11:6814/32218 --> osd.0 149.165.228.10:6802/12583 -- osd_sub_op(osd.10.0:6 0.2dc b7b82dc/rb.0.13.000000009306/head [delete] v 108'4426 snapset=0=[]:[] snapc=0=[]) v1 -- ?+0 0x42a9100
2012-02-07 20:51:24.938678 7f0b4a5a6700 -- 149.165.228.11:6814/32218 --> osd.0 149.165.228.10:6802/12583 -- pg_backfill(progress 0.2dc e 2387/2387 lb afa092dc/rb.0.19.000000007971/head) v1 -- ?+0 0x4b93b40
2012-02-07 20:51:24.938743 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] peer num_objects now 0 / 1
2012-02-07 20:51:24.938784 7f0b4a5a6700 osd.10 2387 pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill] started 5
2012-02-07 20:51:24.938823 7f0b4a5a6700 osd.10 2387 do_recovery started 5 (0/5 rops) on pg[0.2dc( v 2025'2662 (1119'1662,2025'2662] n=1 ec=1 les/c 2387/2174 2384/2386/2386) [10,0]/[10,0,5] r=0 lpr=2386 bft=0 lcod 0'0 mlcod 0'0 active+backfill]
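For anyone checking a similar log, a minimal sketch of pulling the send_remove_op entries out of OSD debug output like the log above. The regex and the `removals` helper are illustrative assumptions based only on the message format seen here, not part of any Ceph tooling, and the sample lines are abbreviated with "..." in place of the pg prefix.

```python
import re

# Assumed pattern, derived from the "send_remove_op <object> from osd.<N> tid <T>"
# messages in the log above; not an official Ceph log format definition.
REMOVE_OP = re.compile(r"send_remove_op (\S+) from osd\.(\d+) tid (\d+)")


def removals(log_text):
    """Return (object, peer_osd, tid) for every send_remove_op entry in log_text."""
    return [(obj, int(osd), int(tid)) for obj, osd, tid in REMOVE_OP.findall(log_text)]


# Two entries from the log above, with the timestamp and pg prefix abbreviated.
sample = (
    "... active+backfill] send_remove_op 7b8182dc/rb.0.9.000000000e4d/head from osd.0 tid 2\n"
    "... active+backfill] send_remove_op dbe182dc/rb.0.1a.000000000c57/head from osd.0 tid 3\n"
)

for obj, osd, tid in removals(sample):
    print(f"tid {tid}: delete {obj} on osd.{osd}")
```

Run against the full log, this should list the removals for tids 2 through 6, which lines up with the "started 5" count reported by do_recovery.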
Updated by Josh Durgin about 12 years ago
- Status changed from New to 7
This should be fixed by f0334673ab8547807b961aae19a8e53531585e3f.