Bug #847
closedOSD: Recovery does not complete, PG's stay active
Description
After I've expanded my cluster from 4 to 8 OSD's the recovery would not complete.
The state right now:
pg v112366: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3126/134690 degraded (2.321%)
First it were 64 PG's which stayed active, after some OSD restarts it got back to 48.
3.7c9 46 46 46 92 188416 192937984 4704 4704 active 316'242 1381'458 [4,6] [4,6] 316'242 2011-02-22 07:29:12.919625
That PG for example, that one stays "active".
Browsing through the logs I found:
Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] clear_recovery_state Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] up [4] -> [4,6], acting [4] -> [4,6], role -1 -> 1 Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] on_role_change Feb 25 22:27:23 atom osd.6[18705]: 7f81df605700 osd6 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1186 1258/1258/1248) [4,6] r=1 crashed+stray m=46] cancel_generate_backlog Feb 25 22:27:30 atom osd.4[18838]: 7f3a0d7b3700 osd4 1258 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1185 1258/1258/1248) [4,6] r=0 mlcod 0'0 crashed+replay+degraded m=46 u=46] cancel_recovery
This process keeps repeating for this PG, the recovery gets cancelled?
In about 20 min a lot of these lines are printed, which then ends with:
Feb 25 22:47:44 atom osd.4[21301]: 7faa5870f700 osd4 1379 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1186 1377/1377/1377) [4,6] r=0 mlcod 0'0 crashed+replay m=46 u=46] update_stats 1377'437 Feb 25 22:48:29 noisy osd.1[27292]: 7f31fb7a9700 osd1 1380 pg[3.7c9( v 316'242 (316'238,316'242]+backlog n=46 ec=2 les=1185 1 380/1380/1380) [6] r=-1 lcod 0'0 stray] up [4,6] -> [6], acting [4,6] -> [6], role -1 -> -1 Feb 25 22:48:29 noisy osd.1[27292]: 7f31fb7a9700 osd1 1380 pg[3.7c9( v 316'242 (316'238,316'242]+backlog n=46 ec=2 les=1185 1 380/1380/1380) [6] r=-1 lcod 0'0 stray] on_change Feb 25 22:48:29 noisy osd.1[27292]: 7f31fb7a9700 osd1 1380 pg[3.7c9( v 316'242 (316'238,316'242]+backlog n=46 ec=2 les=1185 1 380/1380/1380) [6] r=-1 lcod 0'0 stray] [4,6] -> [6], acting primary 4 -> 6
After that, it starts printing lines like, after which the PG seems "idle":
Feb 26 00:24:11 noisy osd.1[27292]: 7f3201fb6700 osd1 1389 on 2011-02-22 07:29:12.919625 3.7c9 Feb 26 00:24:12 noisy osd.1[27292]: 7f3201fb6700 osd1 1389 on 2011-02-22 07:29:12.919625 3.7c9 Feb 26 00:24:14 noisy osd.1[27292]: 7f3201fb6700 osd1 1389 on 2011-02-22 07:29:12.919625 3.7c9
What is osd1 doing with this PG? (This keeps repeating for thousands of loglines)
I do see this PG being scrubbed multiple times, but other then that, I see the "osd1" lines repeating.
I just restarted osd4 again and saw:
Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.000000002dd0_head '_' Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=40 u=40] read_log missing 316'78 (316'77) b rb.0.1.000000002dd0/head by client4304.0:64776 2011-02-21 12:31:58.473405 Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.00000000117f_head '_' = -2 Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.0000000010be_head '_' = -2 Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=44 u=44] read_log missing 316'59 (316'58) b rb.0.1.0000000010be/head by client4304.0:32191 2011-02-21 11:18:13.436974 Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 filestore(/var/lib/ceph/osd.4) getattr /var/lib/ceph/osd.4/current/3.7c9_head/rb.0.1.00000000a638_head '_' Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=45 u=45] read_log missing 316'55 (316'54) b rb.0.1.00000000a638/head by client4304.0:21256 2011-02-21 11:00:31.259864 Feb 28 20:56:39 atom osd.4[24320]: 7f434613b720 osd4 1415 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1381/1381/1381) [] r=0 mlcod 0'0 inactive m=46 u=46] read_log done Feb 28 20:56:42 atom osd.4[24320]: 7f433b388700 osd4 1416 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1389 1416/1416/1416) [6] r=-1 inactive m=46] cancel_recovery
The recovery still seems to get cancelled? Due to the missing rb.* file?
The recovery stuff is still a bit new to me, so I'm not sure what to look at.
The logs are available on logger.ceph.widodh.nl:/srv/ceph/remote-syslog when you will find "ceph-osd.log", 2.gz and 3.gz are the ones where I got this data from.
On this cluster I'm running: c78b29a47d7211a4b8b1585112ac22b8435a82c7
Updated by Wido den Hollander about 13 years ago
osd4 started printing some more lines after it's restart:
Feb 28 20:59:41 atom osd.4[24320]: 7f4341495700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] activate - snap_trimq [] Feb 28 20:59:41 atom osd.4[24320]: 7f4341495700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] write_info info 309 Feb 28 20:59:41 atom osd.4[24320]: 7f4341495700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] activate - starting recovery Feb 28 20:59:41 atom osd.4[24320]: 7f4341495700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] activate peer osd6 sending log(0'0,0'0] missing(0) Feb 28 20:59:41 atom osd.4[24320]: 7f4341495700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] is_all_uptodate: primary has 46 Feb 28 20:59:41 atom osd.4[24320]: 7f4341495700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les= 1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] is_all_uptodate: osd6 has 46 missing ... ... ... 
1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] recover_replicas: rb.0.1.00000000d1b0/head still unfound Feb 28 20:59:56 atom osd.4[24320]: 7f4338382700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] recover_replicas: rb.0.1.00000000d4aa/head still unfound Feb 28 20:59:56 atom osd.4[24320]: 7f4338382700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] recover_replicas: rb.0.1.00000000f6d2/head still unfound Feb 28 20:59:56 atom osd.4[24320]: 7f4338382700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] recover_replicas: rb.0.1.000000010293/head still unfound Feb 28 20:59:56 atom osd.4[24320]: 7f4338382700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] is_all_uptodate: osd6 has 46 missing Feb 28 20:59:56 atom osd.4[24320]: 7f4338382700 osd4 1436 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1436 1434/1434/1434) [4,6] r=0 mlcod 0'0 active m=46 u=46] start_recovery_ops: some OSDs are not up-to-date yet. Recovery isn't finished yet.
Updated by Sage Weil about 13 years ago
I believe this will fix it: 19bcd747b452b79c80744620e635f036bfc1b738
Updated by Wido den Hollander about 13 years ago
I've tried your patch, but unfortunately it did not work out.
Did a little digging and out:
Mar 1 13:53:33 atom osd.4[29656]: 7f9e891f4700 osd4 1528 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1513 1504/1504/1504) [4,6] r=0 mlcod 0'0 active m=46] search_for_missing missing {rb.0.1.000000000446/head=316'242,rb.0.1.0000000010be/head=316'59,rb.0.1.00000000117f/head=316'63,rb.0.1.000000001397/head=316'67,rb.0.1.0000000020c9/head=316'75,rb.0.1.000000002dd0/head=316'78,rb.0.1.000000002f5a/head=316'82,rb.0.1.000000003122/head=316'86,rb.0.1.00000000395a/head=316'90,rb.0.1.000000003fb2/head=316'95,rb.0.1.000000004188/head=316'99,rb.0.1.00000000482a/head=316'103,rb.0.1.000000005cd5/head=316'107,rb.0.1.0000000061a9/head=316'112,rb.0.1.000000006408/head=316'116,rb.0.1.00000000648c/head=316'120,rb.0.1.000000006ff5/head=316'125,rb.0.1.000000007386/head=316'129,rb.0.1.0000000074f8/head=316'133,rb.0.1.00000000787d/head=316'137,rb.0.1.000000007bda/head=316'141,rb.0.1.000000008411/head=316'145,rb.0.1.0000000089b7/head=316'149,rb.0.1.00000000a27f/head=316'153,rb.0.1.00000000a638/head=316'55,rb.0.1.00000000a96c/head=316'155,rb.0.1.00000000adcc/head=316'159,rb.0.1.00000000aee6/head=316'165,rb.0.1.00000000b12e/head=316'169,rb.0.1.00000000b36b/head=316'173,rb.0.1.00000000b6e8/head=316'177,rb.0.1.00000000bc0d/head=316'181,rb.0.1.00000000c1c9/head=316'186,rb.0.1.00000000c3c2/head=316'190,rb.0.1.00000000c4c5/head=316'194,rb.0.1.00000000c929/head=316'199,rb.0.1.00000000d1b0/head=316'203,rb.0.1.00000000d4aa/head=316'207,rb.0.1.00000000e17d/head=316'211,rb.0.1.00000000e2ce/head=316'215,rb.0.1.00000000e353/head=316'219,rb.0.1.00000000f6c0/head=316'223,rb.0.1.00000000f6d2/head=316'227,rb.0.1.000000010293/head=316'231,rb.0.1.00000001046a/head=316'235,rb.0.1.000000010669/head=316'239} Mar 1 13:53:33 atom osd.4[29656]: 7f9e891f4700 osd4 1528 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1513 1504/1504/1504) [4,6] r=0 mlcod 0'0 active m=46] ignoring osd1 log, pg is already active Mar 1 13:53:33 atom osd.4[29656]: 7f9e899f5700 osd4 1529 pg[3.7c9( 
v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1513 1504/1504/1504) [4,6] r=0 mlcod 0'0 active m=46] unaffected with [4,6]/[4,6] up/acting Mar 1 13:53:37 atom osd.4[29656]: 7f9e891f4700 osd4 1529 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1513 1504/1504/1504) [4,6] r=0 mlcod 0'0 active m=46] got dup osd2 info 3.7c9( v 316'242 (316'238,316'242] n=46 ec=2 les=514 1504/1504/1504), identical to ours ... ... ... ... Mar 1 13:53:42 atom osd.4[29656]: 7f9e891f4700 osd4 1530 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1513 1504/1504/1504) [4,6] r=0 mlcod 0'0 active m=46] update_stats 1504'499 Mar 1 13:53:43 atom osd.4[29656]: 7f9e891f4700 osd4 1530 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1513 1504/1504/1504) [4,6] r=0 mlcod 0'0 active m=46] got dup osd2 info 3.7c9( v 316'242 (316'238,316'242] n=46 ec=2 les=514 1504/1504/1504), identical to ours Mar 1 13:55:25 atom osd.4[29656]: 7f9e8fb02700 osd4 1530 on 2011-02-22 07:29:12.919625 3.7c9
osd4 keeps saying that osd6 is missing objects, that's what I see repeating, over and over.
Updated by Sage Weil about 13 years ago
Oh, I see the problem:
2011-03-01 22:39:59.563426 log 2011-03-01 22:27:18.519821 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/4281 34 : [ERR] osd4 tried to pull rb.0.1.000000001277/head in 3.c9 but got error 2: No such file or directory
and osd recovery max active = 1. FBOFW the recovery slot isn't freed up when these errors are detected.
Updated by Sage Weil about 13 years ago
Pushed an osd_recovery branch that should continuing recovering, skipping over the missing objects. If there are other available replicas, it will try to use those to recover.
Updated by Wido den Hollander about 13 years ago
I tried the new branch, but it isn't working as expected yet.
When following the progress I saw:
2011-03-03 11:47:27.448864 pg v133244: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3106/134690 degraded (2.306%) 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.404287 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 204 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.0000000020c9/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.407369 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 205 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000002dd0/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.414558 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 206 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000002f5a/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.420450 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 207 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000003122/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.429385 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 208 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000395a/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.436475 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 209 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000003fb2/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.448904 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 210 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000004188/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.455656 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 211 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000482a/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.468122 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 212 : 
[ERR] 3.7c9 osd4 tried to pull rb.0.1.000000005cd5/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.472866 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 213 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.0000000061a9/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.484156 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 214 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000006408/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.497592 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 215 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000648c/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.506712 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 216 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000006ff5/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.526531 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 217 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000007386/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.535226 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 218 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.0000000074f8/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.560339 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 219 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000787d/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.572708 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 220 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000007bda/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.574649 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 221 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000008411/head but got 
error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.581441 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 222 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.0000000089b7/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.601985 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 223 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000a27f/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.604099 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 224 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000a96c/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.629478 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 225 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000adcc/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.643160 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 226 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000aee6/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.662869 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 227 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000b12e/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.676948 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 228 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000b36b/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.697511 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 229 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000b6e8/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.712442 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 230 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000bc0d/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 
log 2011-03-03 11:47:04.730804 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 231 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000c1c9/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.742646 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 232 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000c3c2/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.747028 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 233 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000c4c5/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.752920 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 234 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000c929/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.757412 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 235 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000d1b0/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.763600 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 236 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000d4aa/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.768695 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 237 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000e17d/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.774378 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 238 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000e2ce/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.779276 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 239 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000e353/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.785502 osd1 
[2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 240 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000f6c0/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.786110 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 241 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000000f6d2/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.793151 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 242 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.000000010293/head but got error 2: No such file or directory 2011-03-03 11:47:27.956132 log 2011-03-03 11:47:04.793747 osd1 [2a00:f10:113:1:230:48ff:fe8d:a21e]:6804/29286 243 : [ERR] 3.7c9 osd4 tried to pull rb.0.1.00000001046a/head but got error 2: No such file or directory 2011-03-03 11:47:32.452360 pg v133245: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%) 2011-03-03 11:48:46.140553 pg v133246: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%) 2011-03-03 11:48:46.636801 pg v133247: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%) 2011-03-03 11:48:46.925668 pg v133248: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%) 2011-03-03 11:48:49.292291 pg v133249: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%) 2011-03-03 11:48:51.341869 pg v133250: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%) 2011-03-03 11:49:32.560627 pg v133251: 8240 pgs: 48 active, 8192 active+clean; 260 GB data, 523 GB used, 14364 GB / 14904 GB avail; 3102/134690 degraded (2.303%)
It did went a bit further, but than stopped again.
In the logs I saw:
Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 do_recovery starting 2 (0/2 rops) on pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] peer osd6 missing {rb.0.1.0000000010be/head=316'59,rb.0.1.00000000117f/head=316'63,rb.0.1.000000001397/head=316'67,rb.0.1.0000000020c9/head=316'75,rb.0.1.000000002dd0/head=316'78,rb.0.1.000000002f5a/head=316'82,rb.0.1.000000003122/head=316'86,rb.0.1.00000000395a/head=316'90,rb.0.1.000000003fb2/head=316'95,rb.0.1.000000004188/head=316'99,rb.0.1.00000000482a/head=316'103,rb.0.1.000000005cd5/head=316'107,rb.0.1.0000000061a9/head=316'112,rb.0.1.000000006408/head=316'116,rb.0.1.00000000648c/head=316'120,rb.0.1.000000006ff5/head=316'125,rb.0.1.000000007386/head=316'129,rb.0.1.0000000074f8/head=316'133,rb.0.1.00000000787d/head=316'137,rb.0.1.000000007bda/head=316'141,rb.0.1.000000008411/head=316'145,rb.0.1.0000000089b7/head=316'149,rb.0.1.00000000a27f/head=316'153,rb.0.1.00000000a638/head=316'55,rb.0.1.00000000a96c/head=316'155,rb.0.1.00000000adcc/head=316'159,rb.0.1.00000000aee6/head=316'165,rb.0.1.00000000b12e/head=316'169,rb.0.1.00000000b36b/head=316'173,rb.0.1.00000000b6e8/head=316'177,rb.0.1.00000000bc0d/head=316'181,rb.0.1.00000000c1c9/head=316'186,rb.0.1.00000000c3c2/head=316'190,rb.0.1.00000000c4c5/head=316'194,rb.0.1.00000000c929/head=316'199,rb.0.1.00000000d1b0/head=316'203,rb.0.1.00000000d4aa/head=316'207,rb.0.1.00000000e17d/head=316'211,rb.0.1.00000000e2ce/head=316'215,rb.0.1.00000000e353/head=316'219,rb.0.1.00000000f6c0/head=316'223,rb.0.1.00000000f6d2/head=316'227,rb.0.1.000000010293/head=316'231,rb.0.1.00000001046a/head=316'235} Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 
1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.000000001397/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.0000000020c9/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.000000004188/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.000000006ff5/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.000000007386/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.0000000089b7/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000a27f/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000b36b/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: 
rb.0.1.00000000b6e8/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000c4c5/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000c929/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000e353/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000f6c0/head still unfound Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44] start_recovery_ops: some OSDs are not up-to-date yet. Recovery isn't finished yet. Mar 3 11:55:02 atom osd.4[7377]: 7f19de10a700 osd4 1597 do_recovery started 0 (0/2 rops) on pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=44 u=44]
I've set 'osd recovery max active = 2', but that wouldn't help either.
Updated by Sage Weil about 13 years ago
Can you look manually on the OSDs for the objects that it's showing as 'still unfound'? rb.0.1.00000000f6c0/head becomes rb.0.1.00000000f6c0_head. I think some objects got lost somewhere. Also, can you change your syslog so it retains logs for longer? When I was looking at this the other day they didn't go far back enough to find out when/how some of the objects got lost.
Also, is the picture similar for the other non-clean PGs? (ceph pg dump -o - | grep -v clean, then grep for that pgid in the logs and see if there are similar no such file or directory errors).
Updated by Wido den Hollander about 13 years ago
I searched for the object:
root@noisy:~# find /var/lib/ceph -name 'rb.0.1.00000000f6c0_head' /var/lib/ceph/osd.2/snap_623655/3.7c9_head/rb.0.1.00000000f6c0_head /var/lib/ceph/osd.2/current/3.7c9_head/rb.0.1.00000000f6c0_head /var/lib/ceph/osd.2/snap_630809/3.7c9_head/rb.0.1.00000000f6c0_head /var/lib/ceph/osd.2/snap_630833/3.7c9_head/rb.0.1.00000000f6c0_head root@noisy:~#
In all the logs I saw osd1, osd4 and osd6 involved in this object. It seems to be on osd2, but it isn't searching there?
I checked other PG's, like 3.6a1, but those are also missing objects:
Mar 3 11:54:46 atom osd.4[7377]: 7f19e0a10700 osd4 1597 pg[3.6a1( v 316'153 lc 0'0 (316'151,316'153]+backlog n=37 ec=2 les=1597 1594/1594/1594) [4,5] r=0 mlcod 0'0 active m=37 u=37] search_for_missing missing {rb.0.1.000000000499/head=172'5,rb.0.1.0000000008ee/head=242'9,rb.0.1.000000001060/head=316'15,rb.0.1.0000000010f2/head=316'19,rb.0.1.000000001bd8/head=316'23,rb.0.1.000000002b4f/head=316'27,rb.0.1.000000002e76/head=316'31,rb.0.1.00000000354e/head=316'35,rb.0.1.000000004285/head=316'40,rb.0.1.00000000493d/head=316'44,rb.0.1.000000004b1f/head=316'48,rb.0.1.000000004c72/head=316'52,rb.0.1.000000004e67/head=316'56,rb.0.1.000000005391/head=316'60,rb.0.1.000000005b97/head=316'65,rb.0.1.000000005c3b/head=316'13,rb.0.1.000000005d72/head=316'69,rb.0.1.00000000607a/head=316'73,rb.0.1.00000000635c/head=316'78,rb.0.1.000000008138/head=316'82,rb.0.1.0000000088f7/head=316'86,rb.0.1.0000000098bf/head=316'91,rb.0.1.000000009c79/head=316'95,rb.0.1.000000009ff4/head=316'99,rb.0.1.00000000b1c1/head=316'104,rb.0.1.00000000b3c0/head=316'109,rb.0.1.00000000bcb4/head=316'113,rb.0.1.00000000c6d0/head=316'117,rb.0.1.00000000d3ea/head=316'121,rb.0.1.00000000df4b/head=316'125,rb.0.1.00000000ea84/head=316'129,rb.0.1.00000000ebca/head=316'133,rb.0.1.00000000f20b/head=316'137,rb.0.1.00000000f575/head=316'141,rb.0.1.000000010020/head=316'145,rb.0.1.0000000105da/head=316'149,rb.0.1.000000010b30/head=316'153}
Searching for that objects shows:
root@noisy:~# find /var/lib/ceph -name 'rb.0.1.000000000499_head' /var/lib/ceph/osd.1/snap_754352/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.1/current/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.1/snap_762081/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.1/snap_762090/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.2/snap_623655/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.2/current/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.2/snap_630809/3.6a1_head/rb.0.1.000000000499_head /var/lib/ceph/osd.2/snap_630833/3.6a1_head/rb.0.1.000000000499_head root@noisy:~#
3.6a1 37 37 37 74 151552 155189248 3724 3724 active 316'153 1599'528 [4,5] [4,5] 316'153 2011-02-22 07:09:20.128641
For another PG I found:
Mar 3 11:54:46 atom osd.4[7377]: 7f19e0a10700 osd4 1597 pg[3.5c9( v 316'123 lc 0'0 (316'117,316'123]+backlog n=30 ec=2 les=1597 1594/1594/1594) [4,6] r=0 mlcod 0'0 active m=28 u=7] search_for_missing rb.0.1.00000000cf1d/head 316'94 is on osd1
But searching for the object shows:
root@noisy:~# find /var/lib/ceph -name 'rb.0.1.00000000cf1d_head' /var/lib/ceph/osd.2/snap_623655/3.5c9_head/rb.0.1.00000000cf1d_head /var/lib/ceph/osd.2/current/3.5c9_head/rb.0.1.00000000cf1d_head /var/lib/ceph/osd.2/snap_630809/3.5c9_head/rb.0.1.00000000cf1d_head /var/lib/ceph/osd.2/snap_630833/3.5c9_head/rb.0.1.00000000cf1d_head root@noisy:~#
It isn't on osd1, but on osd2.
I'll try to fix the syslog, my logrotate configuration says 14 days, no idea why it rotates that fast.
Updated by Wido den Hollander about 13 years ago
Checked another PG (just to make sure), 3.22b in this case:
Mar 3 11:54:45 atom osd.4[7377]: 7f19e0a10700 osd4 1597 pg[3.22b( v 316'138 lc 0'0 (316'135,316'138]+backlog n=33 ec=2 les=1592 1594/1594/1594) [4,1] r=0 mlcod 0'0 peering m=33 u=33] search_for_missing rb.0.1.00000000db84/head 316'117 also missing on osd1
Searching for this object shows:
root@noisy:~# find /var/lib/ceph -name 'rb.0.1.00000000db84_head'
/var/lib/ceph/osd.2/snap_623655/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.2/current/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.2/snap_630897/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.2/snap_630898/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.0/snap_784865/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.0/current/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.0/snap_796258/3.22b_head/rb.0.1.00000000db84_head
/var/lib/ceph/osd.0/snap_796259/3.22b_head/rb.0.1.00000000db84_head
root@noisy:~#
Updated by Sage Weil about 13 years ago
Started looking through your logs to debug this, but it really looks like they are incomplete; either syslog is dropping something or the new logging code is. Can we go back to logging to a file (over NFS?) for these nodes, at least temporarily?
Thanks!
Updated by Wido den Hollander about 13 years ago
I think it was syslog. The logger machine is not at the same location; there could be a bandwidth issue which caused the UDP packets to be dropped.
I set up an NFS server at the same location where the cluster is running. The machines are now logging over NFS, and all OSDs are at 'debug osd = 20'.
Updated by Wido den Hollander about 13 years ago
I've searched the logs again for rb.0.1.00000000a638/head and I found:
2011-03-04 10:34:31.649578 7fee49878700 osd4 1712 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1712 1702/1702/1699) [4,6] r=0 mlcod 0'0 active m=44 u=44] recover_replicas: rb.0.1.00000000a638/head still unfound
2011-03-04 10:39:55.904877 16'55 (316'54) b rb.0.1.00000000a638/head by client4304.0:21256 2011-02-21 11:00:31.259864 indexed
2011-03-04 10:39:55.907236 7fee4c17e700 osd4 1728 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1712 1702/1702/1699) [4,6] r=0 mlcod 0'0 active m=44 u=21] search_for_missing rb.0.1.00000000a638/head 316'55 is on osd1
It thinks this object is on osd1, but a find tells me:
root@noisy:~# find /var/lib/ceph -type f -name 'rb.0.1.00000000a638_head'
/var/lib/ceph/osd.2/snap_634414/3.7c9_head/rb.0.1.00000000a638_head
/var/lib/ceph/osd.2/current/3.7c9_head/rb.0.1.00000000a638_head
/var/lib/ceph/osd.2/snap_640162/3.7c9_head/rb.0.1.00000000a638_head
/var/lib/ceph/osd.2/snap_640165/3.7c9_head/rb.0.1.00000000a638_head
root@noisy:~#
Looking on osd1 in 3.7c9 I found:
root@noisy:~# ls /var/lib/ceph/osd.1/current/3.7c9_head/
rb.0.1.000000000446_head  rb.0.1.000000010669_head
root@noisy:~#
Looking on osd2:
root@noisy:~# ls /var/lib/ceph/osd.2/current/3.7c9_head/
rb.0.1.000000000446_head  rb.0.1.000000005cd5_head  rb.0.1.00000000a638_head  rb.0.1.00000000d1b0_head
rb.0.1.0000000010be_head  rb.0.1.0000000061a9_head  rb.0.1.00000000a96c_head  rb.0.1.00000000d4aa_head
rb.0.1.00000000117f_head  rb.0.1.000000006408_head  rb.0.1.00000000adcc_head  rb.0.1.00000000e17d_head
rb.0.1.000000001397_head  rb.0.1.00000000648c_head  rb.0.1.00000000aee6_head  rb.0.1.00000000e2ce_head
rb.0.1.0000000020c9_head  rb.0.1.000000006ff5_head  rb.0.1.00000000b12e_head  rb.0.1.00000000e353_head
rb.0.1.000000002dd0_head  rb.0.1.000000007386_head  rb.0.1.00000000b36b_head  rb.0.1.00000000f6c0_head
rb.0.1.000000002f5a_head  rb.0.1.0000000074f8_head  rb.0.1.00000000b6e8_head  rb.0.1.00000000f6d2_head
rb.0.1.000000003122_head  rb.0.1.00000000787d_head  rb.0.1.00000000bc0d_head  rb.0.1.000000010293_head
rb.0.1.00000000395a_head  rb.0.1.000000007bda_head  rb.0.1.00000000c1c9_head  rb.0.1.00000001046a_head
rb.0.1.000000003fb2_head  rb.0.1.000000008411_head  rb.0.1.00000000c3c2_head  rb.0.1.000000010669_head
rb.0.1.000000004188_head  rb.0.1.0000000089b7_head  rb.0.1.00000000c4c5_head
rb.0.1.00000000482a_head  rb.0.1.00000000a27f_head  rb.0.1.00000000c929_head
root@noisy:~#
If I understand it correctly, the PG (3.7c9) should be on osd1, osd4 and osd6, but it isn't: most of the data of this PG is on osd2, while osd4 and osd6 are still missing it:
root@atom:/var/log/ceph# ls /var/lib/ceph/osd.4/current/3.7c9_head/
rb.0.1.000000000446_head  rb.0.1.000000010669_head
root@atom:/var/log/ceph# ls /var/lib/ceph/osd.6/current/3.7c9_head/
rb.0.1.000000000446_head  rb.0.1.000000010669_head
root@atom:/var/log/ceph#
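A quick way to see exactly which objects one copy of the PG has and another lacks is to diff the two directory listings. This is a sketch (bash, because of the process substitution); `pg_missing` is a hypothetical helper, not a Ceph tool:

```shell
# Hypothetical helper, not a Ceph tool: print objects present in the
# PG directory on the first OSD but absent from the same PG directory
# on the second. Requires bash for the <(...) process substitution.
# Usage: pg_missing <pg_dir_with_data> <pg_dir_missing_data>
pg_missing() {
    # comm -23 keeps lines unique to the first (sorted) input
    comm -23 <(ls "$1" | sort) <(ls "$2" | sort)
}
```

On this cluster, `pg_missing /var/lib/ceph/osd.2/current/3.7c9_head /var/lib/ceph/osd.4/current/3.7c9_head` (run with osd.2's copy mounted or reachable) should list the objects osd4 still lacks, which would line up with the `m=44` in the log above.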
I'm just trying to understand what is going on here. Is the recovery code not looking for the objects on osd2 because it thinks they are not there?
But in the logs I see:
2011-03-04 11:01:43.784979 7fee4c17e700 osd4 1731 pg[3.7c9( v 316'242 lc 0'0 (316'238,316'242]+backlog n=46 ec=2 les=1712 1702/1702/1699) [4,6] r=0 mlcod 0'0 active m=44] got dup osd2 info 3.7c9( v 316'242 (316'238,316'242] n=46 ec=2 les=514 1702/1702/1699), identical to ours
Updated by Sage Weil about 13 years ago
- Status changed from 7 to Resolved
phew! all pgs are now active+clean. hopefully scrub won't turn anything up :)