Bug #15130

OSD stuck in pre-boot

Added by Xiaoxi Chen over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: High
Assignee:
Category: -
Target version: -
Start date: 03/15/2016
Due date:
% Done: 0%
Source: Community (dev)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Full log attached.

Basically, I tried to upgrade from hammer (0.94.6) to jewel (10.0.5). The OSD was stuck in the pre-boot stage for a very long time, but after restarting the OSD several times it eventually went active.

Digging into it, it seems the incremental map always gets a CRC error, so the OSD requests full maps for [1311, 1411]. Note that this is 101 epochs (1411 included), but mon/OSDMonitor.cc:preprocess_get_osdmap caps the reply at g_conf->osd_map_message_max (default 100), so the monitor only returns [1311, 1410].

Because of the logic in OSD::request_full_map, the OSD then always skips the request for e1411, since it believes that epoch has already been requested.
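The interaction described above can be reproduced with a small toy model. This is only a sketch: the `ToyOSD` class, `monitor_reply`, and the exact bookkeeping rules are assumptions inferred from the description and the log lines in this report, not a copy of OSD.cc or OSDMonitor.cc.

```python
OSD_MAP_MESSAGE_MAX = 100  # default cap quoted in the report


def monitor_reply(first, last, cap=OSD_MAP_MESSAGE_MAX):
    """Epochs a capped monitor returns for the inclusive request [first, last]."""
    return list(range(first, min(last, first + cap - 1) + 1))


class ToyOSD:
    """Simplified model of the request bookkeeping; an assumption, not OSD.cc."""

    def __init__(self):
        self.requested_first = 0
        self.requested_last = 0
        self.sent = []  # ranges actually sent to the monitor

    def request_full_map(self, first, last):
        if self.requested_first == 0:
            # Nothing outstanding: remember and send the whole range.
            self.requested_first, self.requested_last = first, last
        elif last <= self.requested_last:
            # Whole range is believed to be in flight already: send nothing.
            return
        else:
            # Only ask for the part beyond what is believed to be in flight.
            first = self.requested_last + 1
            self.requested_last = last
        self.sent.append((first, last))

    def got_full_map(self, epoch):
        if epoch >= self.requested_last:
            self.requested_first = self.requested_last = 0  # request satisfied
        else:
            self.requested_first = epoch + 1  # still need more


def simulate():
    osd = ToyOSD()
    osd.request_full_map(1311, 1411)    # 101 epochs requested
    reply = monitor_reply(1311, 1411)   # capped to 1311..1410
    for epoch in reply:
        osd.got_full_map(epoch)
    osd.request_full_map(1411, 1511)    # 1411 is treated as already requested
    osd.request_full_map(1411, 1511)    # now a complete duplicate
    return osd, reply


osd, reply = simulate()
print(len(reply), reply[-1])   # 100 1410
print(osd.sent)                # [(1311, 1411), (1412, 1511)]
```

In this model no request sent after the capped reply ever covers epoch 1411, which is the behaviour reported here.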


LM-SHC-16501223:~ xiaoxchen$ cat jewel_osd_bug.log |grep full_map
2016-03-14 22:57:05.499771 7fe577f31700 10 osd.120 1310 request_full_map 1311..1411, previously requested 0..0
2016-03-14 22:57:05.636903 7fe577f31700 10 osd.120 1310 got_full_map 1311, requested 1311..1411, still need more
2016-03-14 22:57:05.637173 7fe577f31700 10 osd.120 1310 got_full_map 1312, requested 1312..1411, still need more
2016-03-14 22:57:05.637416 7fe577f31700 10 osd.120 1310 got_full_map 1313, requested 1313..1411, still need more
2016-03-14 22:57:05.637664 7fe577f31700 10 osd.120 1310 got_full_map 1314, requested 1314..1411, still need more
2016-03-14 22:57:05.637906 7fe577f31700 10 osd.120 1310 got_full_map 1315, requested 1315..1411, still need more
2016-03-14 22:57:05.638146 7fe577f31700 10 osd.120 1310 got_full_map 1316, requested 1316..1411, still need more
2016-03-14 22:57:05.638391 7fe577f31700 10 osd.120 1310 got_full_map 1317, requested 1317..1411, still need more
2016-03-14 22:57:05.638632 7fe577f31700 10 osd.120 1310 got_full_map 1318, requested 1318..1411, still need more
2016-03-14 22:57:05.638874 7fe577f31700 10 osd.120 1310 got_full_map 1319, requested 1319..1411, still need more
2016-03-14 22:57:05.639119 7fe577f31700 10 osd.120 1310 got_full_map 1320, requested 1320..1411, still need more
2016-03-14 22:57:05.639365 7fe577f31700 10 osd.120 1310 got_full_map 1321, requested 1321..1411, still need more
2016-03-14 22:57:05.639604 7fe577f31700 10 osd.120 1310 got_full_map 1322, requested 1322..1411, still need more
2016-03-14 22:57:05.639847 7fe577f31700 10 osd.120 1310 got_full_map 1323, requested 1323..1411, still need more
2016-03-14 22:57:05.640092 7fe577f31700 10 osd.120 1310 got_full_map 1324, requested 1324..1411, still need more
2016-03-14 22:57:05.640348 7fe577f31700 10 osd.120 1310 got_full_map 1325, requested 1325..1411, still need more
2016-03-14 22:57:05.640606 7fe577f31700 10 osd.120 1310 got_full_map 1326, requested 1326..1411, still need more
2016-03-14 22:57:05.640854 7fe577f31700 10 osd.120 1310 got_full_map 1327, requested 1327..1411, still need more
2016-03-14 22:57:05.641091 7fe577f31700 10 osd.120 1310 got_full_map 1328, requested 1328..1411, still need more
2016-03-14 22:57:05.641332 7fe577f31700 10 osd.120 1310 got_full_map 1329, requested 1329..1411, still need more
2016-03-14 22:57:05.641570 7fe577f31700 10 osd.120 1310 got_full_map 1330, requested 1330..1411, still need more
2016-03-14 22:57:05.641806 7fe577f31700 10 osd.120 1310 got_full_map 1331, requested 1331..1411, still need more
2016-03-14 22:57:05.642088 7fe577f31700 10 osd.120 1310 got_full_map 1332, requested 1332..1411, still need more
2016-03-14 22:57:05.642345 7fe577f31700 10 osd.120 1310 got_full_map 1333, requested 1333..1411, still need more
2016-03-14 22:57:05.642595 7fe577f31700 10 osd.120 1310 got_full_map 1334, requested 1334..1411, still need more
2016-03-14 22:57:05.642837 7fe577f31700 10 osd.120 1310 got_full_map 1335, requested 1335..1411, still need more
2016-03-14 22:57:05.643077 7fe577f31700 10 osd.120 1310 got_full_map 1336, requested 1336..1411, still need more
2016-03-14 22:57:05.643311 7fe577f31700 10 osd.120 1310 got_full_map 1337, requested 1337..1411, still need more
2016-03-14 22:57:05.643583 7fe577f31700 10 osd.120 1310 got_full_map 1338, requested 1338..1411, still need more
2016-03-14 22:57:05.643853 7fe577f31700 10 osd.120 1310 got_full_map 1339, requested 1339..1411, still need more
2016-03-14 22:57:05.644096 7fe577f31700 10 osd.120 1310 got_full_map 1340, requested 1340..1411, still need more
2016-03-14 22:57:05.644357 7fe577f31700 10 osd.120 1310 got_full_map 1341, requested 1341..1411, still need more
2016-03-14 22:57:05.644619 7fe577f31700 10 osd.120 1310 got_full_map 1342, requested 1342..1411, still need more
2016-03-14 22:57:05.644861 7fe577f31700 10 osd.120 1310 got_full_map 1343, requested 1343..1411, still need more
2016-03-14 22:57:05.645098 7fe577f31700 10 osd.120 1310 got_full_map 1344, requested 1344..1411, still need more
2016-03-14 22:57:05.645332 7fe577f31700 10 osd.120 1310 got_full_map 1345, requested 1345..1411, still need more
2016-03-14 22:57:05.645567 7fe577f31700 10 osd.120 1310 got_full_map 1346, requested 1346..1411, still need more
2016-03-14 22:57:05.645802 7fe577f31700 10 osd.120 1310 got_full_map 1347, requested 1347..1411, still need more
2016-03-14 22:57:05.646030 7fe577f31700 10 osd.120 1310 got_full_map 1348, requested 1348..1411, still need more
2016-03-14 22:57:05.646272 7fe577f31700 10 osd.120 1310 got_full_map 1349, requested 1349..1411, still need more
2016-03-14 22:57:05.646499 7fe577f31700 10 osd.120 1310 got_full_map 1350, requested 1350..1411, still need more
2016-03-14 22:57:05.646730 7fe577f31700 10 osd.120 1310 got_full_map 1351, requested 1351..1411, still need more
2016-03-14 22:57:05.646960 7fe577f31700 10 osd.120 1310 got_full_map 1352, requested 1352..1411, still need more
2016-03-14 22:57:05.647190 7fe577f31700 10 osd.120 1310 got_full_map 1353, requested 1353..1411, still need more
2016-03-14 22:57:05.647447 7fe577f31700 10 osd.120 1310 got_full_map 1354, requested 1354..1411, still need more
2016-03-14 22:57:05.647674 7fe577f31700 10 osd.120 1310 got_full_map 1355, requested 1355..1411, still need more
2016-03-14 22:57:05.647903 7fe577f31700 10 osd.120 1310 got_full_map 1356, requested 1356..1411, still need more
2016-03-14 22:57:05.648139 7fe577f31700 10 osd.120 1310 got_full_map 1357, requested 1357..1411, still need more
2016-03-14 22:57:05.648365 7fe577f31700 10 osd.120 1310 got_full_map 1358, requested 1358..1411, still need more
2016-03-14 22:57:05.648609 7fe577f31700 10 osd.120 1310 got_full_map 1359, requested 1359..1411, still need more
2016-03-14 22:57:05.648830 7fe577f31700 10 osd.120 1310 got_full_map 1360, requested 1360..1411, still need more
2016-03-14 22:57:05.649071 7fe577f31700 10 osd.120 1310 got_full_map 1361, requested 1361..1411, still need more
2016-03-14 22:57:05.649289 7fe577f31700 10 osd.120 1310 got_full_map 1362, requested 1362..1411, still need more
2016-03-14 22:57:05.649536 7fe577f31700 10 osd.120 1310 got_full_map 1363, requested 1363..1411, still need more
2016-03-14 22:57:05.649775 7fe577f31700 10 osd.120 1310 got_full_map 1364, requested 1364..1411, still need more
2016-03-14 22:57:05.650016 7fe577f31700 10 osd.120 1310 got_full_map 1365, requested 1365..1411, still need more
2016-03-14 22:57:05.650253 7fe577f31700 10 osd.120 1310 got_full_map 1366, requested 1366..1411, still need more
2016-03-14 22:57:05.650510 7fe577f31700 10 osd.120 1310 got_full_map 1367, requested 1367..1411, still need more
2016-03-14 22:57:05.650748 7fe577f31700 10 osd.120 1310 got_full_map 1368, requested 1368..1411, still need more
2016-03-14 22:57:05.650997 7fe577f31700 10 osd.120 1310 got_full_map 1369, requested 1369..1411, still need more
2016-03-14 22:57:05.651224 7fe577f31700 10 osd.120 1310 got_full_map 1370, requested 1370..1411, still need more
2016-03-14 22:57:05.651477 7fe577f31700 10 osd.120 1310 got_full_map 1371, requested 1371..1411, still need more
2016-03-14 22:57:05.651708 7fe577f31700 10 osd.120 1310 got_full_map 1372, requested 1372..1411, still need more
2016-03-14 22:57:05.651957 7fe577f31700 10 osd.120 1310 got_full_map 1373, requested 1373..1411, still need more
2016-03-14 22:57:05.652203 7fe577f31700 10 osd.120 1310 got_full_map 1374, requested 1374..1411, still need more
2016-03-14 22:57:05.652491 7fe577f31700 10 osd.120 1310 got_full_map 1375, requested 1375..1411, still need more
2016-03-14 22:57:05.652734 7fe577f31700 10 osd.120 1310 got_full_map 1376, requested 1376..1411, still need more
2016-03-14 22:57:05.652950 7fe577f31700 10 osd.120 1310 got_full_map 1377, requested 1377..1411, still need more
2016-03-14 22:57:05.653148 7fe577f31700 10 osd.120 1310 got_full_map 1378, requested 1378..1411, still need more
2016-03-14 22:57:05.653365 7fe577f31700 10 osd.120 1310 got_full_map 1379, requested 1379..1411, still need more
2016-03-14 22:57:05.653560 7fe577f31700 10 osd.120 1310 got_full_map 1380, requested 1380..1411, still need more
2016-03-14 22:57:05.653766 7fe577f31700 10 osd.120 1310 got_full_map 1381, requested 1381..1411, still need more
2016-03-14 22:57:05.653975 7fe577f31700 10 osd.120 1310 got_full_map 1382, requested 1382..1411, still need more
2016-03-14 22:57:05.654180 7fe577f31700 10 osd.120 1310 got_full_map 1383, requested 1383..1411, still need more
2016-03-14 22:57:05.654392 7fe577f31700 10 osd.120 1310 got_full_map 1384, requested 1384..1411, still need more
2016-03-14 22:57:05.654623 7fe577f31700 10 osd.120 1310 got_full_map 1385, requested 1385..1411, still need more
2016-03-14 22:57:05.654828 7fe577f31700 10 osd.120 1310 got_full_map 1386, requested 1386..1411, still need more
2016-03-14 22:57:05.655034 7fe577f31700 10 osd.120 1310 got_full_map 1387, requested 1387..1411, still need more
2016-03-14 22:57:05.655232 7fe577f31700 10 osd.120 1310 got_full_map 1388, requested 1388..1411, still need more
2016-03-14 22:57:05.655466 7fe577f31700 10 osd.120 1310 got_full_map 1389, requested 1389..1411, still need more
2016-03-14 22:57:05.655667 7fe577f31700 10 osd.120 1310 got_full_map 1390, requested 1390..1411, still need more
2016-03-14 22:57:05.655872 7fe577f31700 10 osd.120 1310 got_full_map 1391, requested 1391..1411, still need more
2016-03-14 22:57:05.656086 7fe577f31700 10 osd.120 1310 got_full_map 1392, requested 1392..1411, still need more
2016-03-14 22:57:05.656334 7fe577f31700 10 osd.120 1310 got_full_map 1393, requested 1393..1411, still need more
2016-03-14 22:57:05.656560 7fe577f31700 10 osd.120 1310 got_full_map 1394, requested 1394..1411, still need more
2016-03-14 22:57:05.656848 7fe577f31700 10 osd.120 1310 got_full_map 1395, requested 1395..1411, still need more
2016-03-14 22:57:05.657091 7fe577f31700 10 osd.120 1310 got_full_map 1396, requested 1396..1411, still need more
2016-03-14 22:57:05.657305 7fe577f31700 10 osd.120 1310 got_full_map 1397, requested 1397..1411, still need more
2016-03-14 22:57:05.657521 7fe577f31700 10 osd.120 1310 got_full_map 1398, requested 1398..1411, still need more
2016-03-14 22:57:05.657734 7fe577f31700 10 osd.120 1310 got_full_map 1399, requested 1399..1411, still need more
2016-03-14 22:57:05.657957 7fe577f31700 10 osd.120 1310 got_full_map 1400, requested 1400..1411, still need more
2016-03-14 22:57:05.658176 7fe577f31700 10 osd.120 1310 got_full_map 1401, requested 1401..1411, still need more
2016-03-14 22:57:05.658389 7fe577f31700 10 osd.120 1310 got_full_map 1402, requested 1402..1411, still need more
2016-03-14 22:57:05.658616 7fe577f31700 10 osd.120 1310 got_full_map 1403, requested 1403..1411, still need more
2016-03-14 22:57:05.658845 7fe577f31700 10 osd.120 1310 got_full_map 1404, requested 1404..1411, still need more
2016-03-14 22:57:05.659065 7fe577f31700 10 osd.120 1310 got_full_map 1405, requested 1405..1411, still need more
2016-03-14 22:57:05.659289 7fe577f31700 10 osd.120 1310 got_full_map 1406, requested 1406..1411, still need more
2016-03-14 22:57:05.659502 7fe577f31700 10 osd.120 1310 got_full_map 1407, requested 1407..1411, still need more
2016-03-14 22:57:05.659712 7fe577f31700 10 osd.120 1310 got_full_map 1408, requested 1408..1411, still need more
2016-03-14 22:57:05.659949 7fe577f31700 10 osd.120 1310 got_full_map 1409, requested 1409..1411, still need more
2016-03-14 22:57:05.660184 7fe577f31700 10 osd.120 1310 got_full_map 1410, requested 1410..1411, still need more
2016-03-14 22:57:05.674553 7fe577f31700 10 osd.120 1410 request_full_map 1411..1511, previously requested 1411..1411 //dropped as dup
2016-03-14 22:57:05.701652 7fe577f31700 10 osd.120 1410 request_full_map 1411..1511, previously requested 1411..1511 // never get 1411

log.tar.gz (280 KB) Xiaoxi Chen, 03/15/2016 02:04 PM

Associated revisions

Revision c804416d (diff)
Added by Xiaoxi Chen about 3 years ago

osd/OSD.cc: finish full_map_request every MOSDMap message.

We remember the range of requested full map in requested_full_first/last
and prevent sending duplicate requests.

But the monitor caps the reply at osd_map_message_max maps. For example, if the
OSD requests [100, 200] and the monitor only returns [100, 149], the previous code
treats [150, 200] as a dup and prevents the OSD from sending the request, which is wrong.

Fix this by clearing the requested_full_first/last fields at the end of handle_osd_map.

Fixes: #15130

Signed-off-by: Xiaoxi Chen <>
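The effect of the change can be sketched with the [100, 200] example from the commit message. This is a toy model under stated assumptions: the `run` helper, the `clear_after_message` flag, and the cap of 50 (implied by the [100, 149] reply in the example) are hypothetical and only mimic the bookkeeping described; it is not the actual patch.

```python
def run(clear_after_message, cap=50):
    """Return the ranges a toy OSD sends to the monitor (model only)."""
    requested_first = requested_last = 0
    sent = []

    def request_full_map(first, last):
        nonlocal requested_first, requested_last
        if requested_first == 0:
            requested_first, requested_last = first, last
        elif last <= requested_last:
            return  # treated as a duplicate, nothing is sent
        else:
            first = requested_last + 1
            requested_last = last
        sent.append((first, last))

    def handle_osd_map(epochs):
        nonlocal requested_first, requested_last
        for epoch in epochs:
            if epoch >= requested_last:
                requested_first = requested_last = 0
            else:
                requested_first = epoch + 1
        if clear_after_message:
            # Modeled fix: forget the outstanding range after every message.
            requested_first = requested_last = 0

    request_full_map(100, 200)
    handle_osd_map(range(100, 100 + cap))  # monitor returns only 100..149
    request_full_map(150, 200)             # OSD asks for the remainder
    return sent


print(run(clear_after_message=False))  # [(100, 200)]
print(run(clear_after_message=True))   # [(100, 200), (150, 200)]
```

Without the reset, the follow-up request for [150, 200] is dropped as a duplicate; with it, the request goes out.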

History

#1 Updated by Xiaoxi Chen over 3 years ago

  • File log.tar.gz added

#2 Updated by Xiaoxi Chen over 3 years ago

See how stop/start handles the case:

root@slc5b03c-6ncp:~# ceph daemon --cluster slc07_ceph_02 osd.102 status
{
"cluster_fsid": "1ab722a9-96c8-4048-bebd-35511bdf09d9",
"osd_fsid": "1fbf33b3-3a7a-43d0-89f6-8ae3105a4a1f",
"whoami": 102,
"state": "preboot",
"oldest_map": 1010,
"newest_map": 1210,
"num_pgs": 0
}

root@slc5b03c-6ncp:~# ceph daemon --cluster slc07_ceph_02 osd.102 status
{
"cluster_fsid": "1ab722a9-96c8-4048-bebd-35511bdf09d9",
"osd_fsid": "1fbf33b3-3a7a-43d0-89f6-8ae3105a4a1f",
"whoami": 102,
"state": "preboot",
"oldest_map": 1010,
"newest_map": 1210,
"num_pgs": 0
}

root@slc5b03c-6ncp:~# ceph daemon --cluster slc07_ceph_02 osd.102 status
{
"cluster_fsid": "1ab722a9-96c8-4048-bebd-35511bdf09d9",
"osd_fsid": "1fbf33b3-3a7a-43d0-89f6-8ae3105a4a1f",
"whoami": 102,
"state": "preboot",
"oldest_map": 1010,
"newest_map": 1210,
"num_pgs": 0
}

root@slc5b03c-6ncp:~# stop ceph-osd cluster=slc07_ceph_02 id=102
ceph-osd stop/waiting
You have new mail in /var/mail/root
root@slc5b03c-6ncp:~# start ceph-osd cluster=slc07_ceph_02 id=102
ceph-osd (slc07_ceph_02/102) start/running, process 11674
root@slc5b03c-6ncp:~# ceph daemon --cluster slc07_ceph_02 osd.102 status
{
"cluster_fsid": "1ab722a9-96c8-4048-bebd-35511bdf09d9",
"osd_fsid": "1fbf33b3-3a7a-43d0-89f6-8ae3105a4a1f",
"whoami": 102,
"state": "preboot",
"oldest_map": 1010,
"newest_map": 1310,
"num_pgs": 0
}

root@slc5b03c-6ncp:~# stop ceph-osd cluster=slc07_ceph_02 id=102
ceph-osd stop/waiting
root@slc5b03c-6ncp:~# start ceph-osd cluster=slc07_ceph_02 id=102
ceph-osd (slc07_ceph_02/102) start/running, process 11911
root@slc5b03c-6ncp:~# ceph daemon --cluster slc07_ceph_02 osd.102 status
{
"cluster_fsid": "1ab722a9-96c8-4048-bebd-35511bdf09d9",
"osd_fsid": "1fbf33b3-3a7a-43d0-89f6-8ae3105a4a1f",
"whoami": 102,
"state": "preboot",
"oldest_map": 1010,
"newest_map": 1410,
"num_pgs": 0
}

root@slc5b03c-6ncp:~# stop ceph-osd cluster=slc07_ceph_02 id=102
ceph-osd stop/waiting
root@slc5b03c-6ncp:~# start ceph-osd cluster=slc07_ceph_02 id=102
ceph-osd (slc07_ceph_02/102) start/running, process 12064
root@slc5b03c-6ncp:~# ceph daemon --cluster slc07_ceph_02 osd.102 status
{
"cluster_fsid": "1ab722a9-96c8-4048-bebd-35511bdf09d9",
"osd_fsid": "1fbf33b3-3a7a-43d0-89f6-8ae3105a4a1f",
"whoami": 102,
"state": "active",
"oldest_map": 1010,
"newest_map": 1606,
"num_pgs": 0
}
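To make the progression in the transcript above easier to read, the sketch below tabulates `newest_map` across the status outputs. The JSON strings are abbreviated to the two fields used here, and the `progress` helper is hypothetical. The first two restarts each advance `newest_map` by 100, which happens to equal the default osd_map_message_max mentioned in the description, although the report does not state that link explicitly.

```python
import json

# Abbreviated status outputs: one before the first restart, then one per restart.
STATUS_OUTPUTS = [
    '{"state": "preboot", "newest_map": 1210}',
    '{"state": "preboot", "newest_map": 1310}',
    '{"state": "preboot", "newest_map": 1410}',
    '{"state": "active", "newest_map": 1606}',
]


def progress(outputs):
    """Return (state, newest_map, delta since previous output) for each output."""
    rows = []
    previous = None
    for raw in outputs:
        status = json.loads(raw)
        newest = status["newest_map"]
        delta = None if previous is None else newest - previous
        rows.append((status["state"], newest, delta))
        previous = newest
    return rows


for state, newest, delta in progress(STATUS_OUTPUTS):
    print(state, newest, delta)
```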

#3 Updated by Kefu Chai over 3 years ago

  • Status changed from New to Need Review
  • Assignee set to Xiaoxi Chen
  • Priority changed from Normal to High
  • Source changed from other to Community (dev)

#4 Updated by Xiaoxi Chen over 3 years ago

  • File deleted (log.tar.gz)

#5 Updated by Xiaoxi Chen over 3 years ago

#6 Updated by Xiaoxi Chen about 3 years ago

https://github.com/ceph/ceph/pull/8147 was merged to master.

#7 Updated by Xiaoxi Chen about 3 years ago

  • Status changed from Need Review to Resolved
