Actions
Bug #21328
closedPerformance: Slow OSD startup, heavy LevelDB activity
% Done:
0%
Source:
Community (user)
Tags:
performance
Backport:
jewel, luminous
Regression:
Yes
Severity:
2 - major
Reviewed:
Description
Many of our OSDs are very slow to start, with a ~2 minute gap between some of the debugging messages.
This can be seen in both regular OSD start and ceph-objectstore-tool usage.
Doing strace on the process shows lots of LevelDB activity. davidz thinks it might be an upgrade that isn't marking it self as having been completed, and re-running on every startup.
08-15:59:57 < davidz> The "repair that resurrects" isn't a pg repair, but rather something that happens once on start up. Hmm….maybe that is the 2 minute startup. It only needs to happen once on first OSD start-up after upgrade. 08-16:00:29 < robbat2> all of these OSDs have been restarted since 10.2.9, some multiple times 08-16:00:44 < robbat2> the 2 minute startup happens every time 08-16:01:28 < davidz> robbat2: The code doesn't record that it has performed that special start-up "repair"
The omap directory for this OSD is ~2.3GB.
ceph-objectstore-tool --data-path $datapath --pgid $pg --journal /srv/ceph/devices/osd.${osd}.journal --debug $obj get-omaphdr | tee /tmp/omaphdr 2017-09-08 22:36:26.163881 7f4cecdd9980 0 filestore(/srv/ceph/osd/166/) backend xfs (magic 0x58465342) 2017-09-08 22:36:26.164422 7f4cecdd9980 0 genericfilestorebackend(/srv/ceph/osd/166/) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2017-09-08 22:36:26.164433 7f4cecdd9980 0 genericfilestorebackend(/srv/ceph/osd/166/) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option 2017-09-08 22:36:26.164453 7f4cecdd9980 0 genericfilestorebackend(/srv/ceph/osd/166/) detect_features: splice is supported 2017-09-08 22:36:26.180020 7f4cecdd9980 0 genericfilestorebackend(/srv/ceph/osd/166/) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2017-09-08 22:36:26.180146 7f4cecdd9980 0 xfsfilestorebackend(/srv/ceph/osd/166/) detect_feature: extsize is disabled by conf ############################# #### LARGE TIME GAP HERE #### ############################# 2017-09-08 22:38:49.192232 7f4cecdd9980 0 filestore(/srv/ceph/osd/166/) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 28 00 00 00 00 20 00 00 00 00 00 00 85 01 03 0a 00 00 00 00 00 00 00 00 00 SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 28 00 00 00 00 20 00 00 00 00 00 00 85 01 03 0a 00 00 00 00 00 00 00 00 00 HDIO_DRIVE_CMD(identify) failed: Input/output error 2017-09-08 22:38:49.226694 7f4cecdd9980 1 journal _open /srv/ceph/devices/osd.166.journal fd 14: 10736369664 bytes, block size 4096 bytes, directio = 1, aio = 1 SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 28 00 00 00 00 20 00 00 00 00 00 00 85 01 03 0a 00 00 00 00 00 00 00 00 00 SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 28 00 00 00 00 20 00 00 00 00 00 00 85 01 03 0a 00 00 00 00 00 00 00 00 00 HDIO_DRIVE_CMD(identify) failed: Input/output error 2017-09-08 22:38:49.281060 7f4cecdd9980 1 journal _open /srv/ceph/devices/osd.166.journal fd 14: 10736369664 bytes, block size 4096 bytes, directio = 1, aio = 1 2017-09-08 22:38:49.283207 7f4cecdd9980 1 filestore(/srv/ceph/osd/166/) upgrade Cluster fsid=31b8be2f-ac05-4e56-96b7-e702df166e29 Supported features: compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,11=sharded objects,12=transaction hints,13=pg meta object} On-disk features: compat={},rocompat={},incompat={1=initial feature set(~v.18),2=pginfo object,3=object locator,4=last_epoch_clean,5=categories,6=hobjectpool,7=biginfo,8=leveldbinfo,9=leveldblog,10=snapmapper,12=transaction hints,13=pg meta object} 200 pgs to scan ...
Updated by David Zafman over 6 years ago
- Status changed from New to Fix Under Review
Updated by David Zafman over 6 years ago
- Subject changed from Slow OSD startup, heavy LevelDB activity to Performance: Slow OSD startup, heavy LevelDB activity
- Target version set to v12.2.1
- Source set to Community (user)
- Tags set to performance
- Regression changed from No to Yes
- Release deleted (
jewel) - Release set to luminous
- Affected Versions v12.2.0 added
- Affected Versions deleted (
v10.2.9)
Updated by David Zafman over 6 years ago
- Backport changed from luminous to jewel, luminous
Updated by Josh Durgin over 6 years ago
The PR is https://github.com/ceph/ceph/pull/17705
Updated by Kefu Chai over 6 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21439: luminous: Performance: Slow OSD startup, heavy LevelDB activity added
Updated by Nathan Cutler over 6 years ago
- Copied to Backport #21440: jewel: Performance: Slow OSD startup, heavy LevelDB activity added
Updated by Nathan Cutler over 6 years ago
- Related to Bug #21907: On pg repair the primary is not favored as was intended added
Updated by David Zafman about 6 years ago
- Status changed from Pending Backport to Resolved
Actions