Bug #37839 (closed)

Compression not working, and when applied OSD disks are failing randomly

Added by Greg Smith over 5 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Assignee: Igor Fedotov
% Done: 0%
Backport: mimic, luminous
Regression: No
Severity: 1 - critical
ceph-qa-suite: fs, upgrade/mimic-x
Pull request ID: 25891

Description

We have a few Ceph Mimic production environments, versions 13.2.2 to 13.2.4.
When we enable compression, either on the pool or via bluestore compression (together and separately; we tried both the forced and aggressive modes, and both the snappy and lz4 algorithms), OSDs start failing randomly: they come back up and then fail again. No specific OSDs are affected, it happens at random.
When we apply both compression methods at once, the whole Ceph cluster goes down, with OSDs failing, coming back up and going down again at random.
When we apply the compression methods separately, it happens less frequently.
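
For reference, compression was enabled along these lines (illustrative only; the exact mode and algorithm varied between tests, pool name as in our setup):

ceph osd pool set cephfs_data compression_mode aggressive
ceph osd pool set cephfs_data compression_algorithm lz4

and/or, in ceph.conf on the OSD nodes:

bluestore_compression_mode = aggressive
bluestore_compression_algorithm = lz4
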
mon: ceph.log
Debug 0:
2019-01-09 06:19:38.762734 mon.mon01 mon.0 192.168.31.21:6789/0 470697 : cluster [INF] osd.14 failed (root=default,host=osd1204) (connection refused reported by osd.11)
2019-01-09 06:19:38.803930 mon.mon01 mon.0 192.168.31.21:6789/0 470730 : cluster [INF] osd.4 failed (root=default,host=osd1202) (connection refused reported by osd.1)
2019-01-09 06:19:38.905222 mon.mon01 mon.0 192.168.31.21:6789/0 470763 : cluster [INF] osd.6 failed (root=default,host=osd1202) (connection refused reported by osd.2)
2019-01-09 06:19:39.208453 mon.mon01 mon.0 192.168.31.21:6789/0 470829 : cluster [WRN] Health check failed: 3 osds down (OSD_DOWN)
2019-01-09 06:19:41.263681 mon.mon01 mon.0 192.168.31.21:6789/0 470834 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive (PG_AVAILABILITY)
2019-01-09 06:19:41.263702 mon.mon01 mon.0 192.168.31.21:6789/0 470835 : cluster [WRN] Health check failed: Degraded data redundancy: 123081/8859170 objects degraded (1.389%), 1 pg degraded (PG_DEGRADED)
2019-01-09 06:19:47.350430 mon.mon01 mon.0 192.168.31.21:6789/0 470845 : cluster [WRN] Health check update: Reduced data availability: 3 pgs inactive (PG_AVAILABILITY)
2019-01-09 06:19:47.350464 mon.mon01 mon.0 192.168.31.21:6789/0 470846 : cluster [WRN] Health check update: Degraded data redundancy: 1655628/8861834 objects degraded (18.683%), 421 pgs degraded (PG_DEGRADED)
2019-01-09 06:19:53.487677 mon.mon01 mon.0 192.168.31.21:6789/0 470859 : cluster [WRN] Health check update: Degraded data redundancy: 1656297/8865413 objects degraded (18.683%), 421 pgs degraded (PG_DEGRADED)
2019-01-09 06:19:58.582334 mon.mon01 mon.0 192.168.31.21:6789/0 470871 : cluster [WRN] Health check update: Degraded data redundancy: 1657027/8869309 objects degraded (18.683%), 421 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:04.056659 mon.mon01 mon.0 192.168.31.21:6789/0 470885 : cluster [WRN] Health check update: Degraded data redundancy: 1657504/8871781 objects degraded (18.683%), 421 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:09.057037 mon.mon01 mon.0 192.168.31.21:6789/0 470893 : cluster [WRN] Health check update: Degraded data redundancy: 1658215/8875645 objects degraded (18.683%), 421 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:11.062027 mon.mon01 mon.0 192.168.31.21:6789/0 470900 : cluster [WRN] Health check update: 2 osds down (OSD_DOWN)
2019-01-09 06:20:11.079604 mon.mon01 mon.0 192.168.31.21:6789/0 470901 : cluster [INF] osd.6 192.168.31.202:6812/3037601 boot
2019-01-09 06:20:12.095299 mon.mon01 mon.0 192.168.31.21:6789/0 470907 : cluster [INF] osd.14 192.168.31.204:6805/3037895 boot
2019-01-09 06:20:13.085150 mon.mon01 mon.0 192.168.31.21:6789/0 470912 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-01-09 06:20:13.104798 mon.mon01 mon.0 192.168.31.21:6789/0 470913 : cluster [INF] osd.4 192.168.31.202:6804/3037583 boot
2019-01-09 06:20:14.057327 mon.mon01 mon.0 192.168.31.21:6789/0 470916 : cluster [WRN] Health check update: Degraded data redundancy: 1499363/8877876 objects degraded (16.889%), 383 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:17.178687 mon.mon01 mon.0 192.168.31.21:6789/0 470926 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive (PG_AVAILABILITY)
2019-01-09 06:20:19.057628 mon.mon01 mon.0 192.168.31.21:6789/0 470932 : cluster [WRN] Health check update: Degraded data redundancy: 238606/8879223 objects degraded (2.687%), 303 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:19.213015 mon.mon01 mon.0 192.168.31.21:6789/0 470933 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive)
2019-01-09 06:20:24.057965 mon.mon01 mon.0 192.168.31.21:6789/0 470942 : cluster [WRN] Health check update: Degraded data redundancy: 3923/8882203 objects degraded (0.044%), 366 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:29.058406 mon.mon01 mon.0 192.168.31.21:6789/0 470951 : cluster [WRN] Health check update: Degraded data redundancy: 3903/8885980 objects degraded (0.044%), 364 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:34.058800 mon.mon01 mon.0 192.168.31.21:6789/0 470959 : cluster [WRN] Health check update: Degraded data redundancy: 3857/8888131 objects degraded (0.043%), 361 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:34.636590 mon.mon01 mon.0 192.168.31.21:6789/0 470963 : cluster [INF] osd.6 failed (root=default,host=osd1202) (connection refused reported by osd.2)
2019-01-09 06:20:34.690979 mon.mon01 mon.0 192.168.31.21:6789/0 470993 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2019-01-09 06:20:39.059173 mon.mon01 mon.0 192.168.31.21:6789/0 471004 : cluster [WRN] Health check update: Degraded data redundancy: 370629/8891325 objects degraded (4.168%), 366 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:44.059551 mon.mon01 mon.0 192.168.31.21:6789/0 471014 : cluster [WRN] Health check update: Degraded data redundancy: 589578/8893338 objects degraded (6.629%), 386 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:49.059880 mon.mon01 mon.0 192.168.31.21:6789/0 471023 : cluster [WRN] Health check update: Degraded data redundancy: 589685/8895766 objects degraded (6.629%), 381 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:54.060222 mon.mon01 mon.0 192.168.31.21:6789/0 471031 : cluster [WRN] Health check update: Degraded data redundancy: 589872/8899146 objects degraded (6.628%), 379 pgs degraded (PG_DEGRADED)
2019-01-09 06:20:59.060564 mon.mon01 mon.0 192.168.31.21:6789/0 471041 : cluster [WRN] Health check update: Degraded data redundancy: 590074/8903097 objects degraded (6.628%), 373 pgs degraded (PG_DEGRADED)

Debug 20:
2019-01-09 06:41:53.623945 mon.mon01 mon.0 192.168.31.21:6789/0 473901 : cluster [INF] osd.8 failed (root=default,host=osd1203) (connection refused reported by osd.7)
2019-01-09 06:41:53.637700 mon.mon01 mon.0 192.168.31.21:6789/0 473917 : cluster [INF] osd.11 failed (root=default,host=osd1203) (connection refused reported by osd.7)
2019-01-09 06:41:53.680703 mon.mon01 mon.0 192.168.31.21:6789/0 473947 : cluster [WRN] Health check failed: 2 osds down (OSD_DOWN)
2019-01-09 06:41:53.701781 mon.mon01 mon.0 192.168.31.21:6789/0 473950 : cluster [INF] osd.2 failed (root=default,host=osd1201) (connection refused reported by osd.6)
2019-01-09 06:41:53.721279 mon.mon01 mon.0 192.168.31.21:6789/0 473974 : cluster [INF] osd.13 failed (root=default,host=osd1204) (connection refused reported by osd.3)
2019-01-09 06:41:53.722198 mon.mon01 mon.0 192.168.31.21:6789/0 473976 : cluster [INF] osd.5 failed (root=default,host=osd1202) (connection refused reported by osd.10)
2019-01-09 06:41:53.722389 mon.mon01 mon.0 192.168.31.21:6789/0 473978 : cluster [INF] osd.1 failed (root=default,host=osd1201) (connection refused reported by osd.14)
2019-01-09 06:41:53.748354 mon.mon01 mon.0 192.168.31.21:6789/0 474003 : cluster [INF] osd.15 failed (root=default,host=osd1204) (connection refused reported by osd.4)
2019-01-09 06:41:55.704801 mon.mon01 mon.0 192.168.31.21:6789/0 474185 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 33 pgs down, 13 pgs peering (PG_AVAILABILITY)
2019-01-09 06:41:55.704819 mon.mon01 mon.0 192.168.31.21:6789/0 474186 : cluster [WRN] Health check failed: Degraded data redundancy: 51809/9633318 objects degraded (0.538%), 17 pgs degraded (PG_DEGRADED)
2019-01-09 06:41:56.358352 mon.mon01 mon.0 192.168.31.21:6789/0 474190 : cluster [INF] osd.7 failed (root=default,host=osd1202) (connection refused reported by osd.3)
2019-01-09 06:41:59.184629 mon.mon01 mon.0 192.168.31.21:6789/0 474216 : cluster [WRN] Health check update: 8 osds down (OSD_DOWN)
2019-01-09 06:42:01.814323 mon.mon01 mon.0 192.168.31.21:6789/0 474221 : cluster [WRN] Health check update: Reduced data availability: 5 pgs inactive, 157 pgs down, 2 pgs peering (PG_AVAILABILITY)
2019-01-09 06:42:01.814360 mon.mon01 mon.0 192.168.31.21:6789/0 474222 : cluster [WRN] Health check update: Degraded data redundancy: 2410795/9634227 objects degraded (25.023%), 439 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:08.246731 mon.mon01 mon.0 192.168.31.21:6789/0 474235 : cluster [WRN] Health check update: Degraded data redundancy: 2472269/9634299 objects degraded (25.661%), 452 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:09.194577 mon.mon01 mon.0 192.168.31.21:6789/0 474237 : cluster [WRN] Health check update: Reduced data availability: 4 pgs inactive, 157 pgs down, 1 pg peering (PG_AVAILABILITY)
2019-01-09 06:42:25.536251 mon.mon01 mon.0 192.168.31.21:6789/0 474269 : cluster [WRN] Health check update: 7 osds down (OSD_DOWN)
2019-01-09 06:42:25.563997 mon.mon01 mon.0 192.168.31.21:6789/0 474270 : cluster [INF] osd.15 192.168.31.204:6800/3044308 boot
2019-01-09 06:42:26.593924 mon.mon01 mon.0 192.168.31.21:6789/0 474281 : cluster [INF] osd.8 192.168.31.203:6800/1677717 boot
2019-01-09 06:42:26.594062 mon.mon01 mon.0 192.168.31.21:6789/0 474282 : cluster [INF] osd.1 192.168.31.201:6812/3034346 boot
2019-01-09 06:42:26.594219 mon.mon01 mon.0 192.168.31.21:6789/0 474283 : cluster [INF] osd.2 192.168.31.201:6808/3034328 boot
2019-01-09 06:42:27.578917 mon.mon01 mon.0 192.168.31.21:6789/0 474291 : cluster [WRN] Health check update: Reduced data availability: 4 pgs inactive, 143 pgs down, 1 pg peering (PG_AVAILABILITY)
2019-01-09 06:42:27.578943 mon.mon01 mon.0 192.168.31.21:6789/0 474292 : cluster [WRN] Health check update: Degraded data redundancy: 2260708/9634307 objects degraded (23.465%), 416 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:27.617366 mon.mon01 mon.0 192.168.31.21:6789/0 474294 : cluster [INF] osd.5 192.168.31.202:6800/3044271 boot
2019-01-09 06:42:27.617496 mon.mon01 mon.0 192.168.31.21:6789/0 474295 : cluster [INF] osd.13 192.168.31.204:6812/3044309 boot
2019-01-09 06:42:28.605060 mon.mon01 mon.0 192.168.31.21:6789/0 474297 : cluster [WRN] Health check failed: 1 MDSs report slow metadata IOs (MDS_SLOW_METADATA_IO)
2019-01-09 06:42:28.605084 mon.mon01 mon.0 192.168.31.21:6789/0 474298 : cluster [WRN] Health check failed: 1 MDSs report slow requests (MDS_SLOW_REQUEST)
2019-01-09 06:42:29.666787 mon.mon01 mon.0 192.168.31.21:6789/0 474307 : cluster [INF] osd.11 192.168.31.203:6804/1677715 boot
2019-01-09 06:42:31.699526 mon.mon01 mon.0 192.168.31.21:6789/0 474314 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-01-09 06:42:31.727461 mon.mon01 mon.0 192.168.31.21:6789/0 474316 : cluster [INF] osd.7 192.168.31.202:6808/3044340 boot
2019-01-09 06:42:31.976347 mon.mon01 mon.0 192.168.31.21:6789/0 474318 : cluster [INF] MDS health message cleared (mds.0): 58 slow requests are blocked > 30 secs
2019-01-09 06:42:31.976374 mon.mon01 mon.0 192.168.31.21:6789/0 474319 : cluster [INF] MDS health message cleared (mds.0): 96 slow metadata IOs are blocked > 30 secs, oldest blocked for 33 secs
2019-01-09 06:42:32.717031 mon.mon01 mon.0 192.168.31.21:6789/0 474321 : cluster [INF] Health check cleared: MDS_SLOW_METADATA_IO (was: 1 MDSs report slow metadata IOs)
2019-01-09 06:42:32.717049 mon.mon01 mon.0 192.168.31.21:6789/0 474322 : cluster [INF] Health check cleared: MDS_SLOW_REQUEST (was: 1 MDSs report slow requests)
2019-01-09 06:42:33.746188 mon.mon01 mon.0 192.168.31.21:6789/0 474327 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 63 pgs down (PG_AVAILABILITY)
2019-01-09 06:42:33.746216 mon.mon01 mon.0 192.168.31.21:6789/0 474328 : cluster [WRN] Health check update: Degraded data redundancy: 399509/9634810 objects degraded (4.147%), 208 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:34.007401 mon.mon01 mon.0 192.168.31.21:6789/0 474331 : cluster [INF] osd.9 failed (root=default,host=osd1203) (connection refused reported by osd.15)
2019-01-09 06:42:34.031384 mon.mon01 mon.0 192.168.31.21:6789/0 474361 : cluster [INF] osd.10 failed (root=default,host=osd1203) (connection refused reported by osd.13)
2019-01-09 06:42:34.122505 mon.mon01 mon.0 192.168.31.21:6789/0 474393 : cluster [INF] osd.6 failed (root=default,host=osd1202) (connection refused reported by osd.2)
2019-01-09 06:42:34.142497 mon.mon01 mon.0 192.168.31.21:6789/0 474422 : cluster [INF] osd.3 failed (root=default,host=osd1201) (connection refused reported by osd.8)
2019-01-09 06:42:26.419966 mds.mon01 mds.0 192.168.31.21:6800/4117728989 33 : cluster [WRN] 58 slow requests, 5 included below; oldest blocked for > 33.180825 secs
2019-01-09 06:42:26.419970 mds.mon01 mds.0 192.168.31.21:6800/4117728989 34 : cluster [WRN] slow request 33.179826 seconds old, received at 2019-01-09 06:41:53.240026: client_request(client.21117692:200638 unlink #0x1000002f669/.lck-97fc030000010000 2019-01-09 06:41:53.232981 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
2019-01-09 06:42:26.419974 mds.mon01 mds.0 192.168.31.21:6800/4117728989 35 : cluster [WRN] slow request 30.643034 seconds old, received at 2019-01-09 06:41:55.776819: client_request(client.21117692:200670 mknod #0x10000034c06/.lck-e0fc030000010000 2019-01-09 06:41:55.773064 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
2019-01-09 06:42:26.419978 mds.mon01 mds.0 192.168.31.21:6800/4117728989 36 : cluster [WRN] slow request 33.175806 seconds old, received at 2019-01-09 06:41:53.244046: client_request(client.21117692:200639 mknod #0x1000002f669/win7test30.vmx 2019-01-09 06:41:53.236981 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
2019-01-09 06:42:26.419982 mds.mon01 mds.0 192.168.31.21:6800/4117728989 37 : cluster [WRN] slow request 30.642486 seconds old, received at 2019-01-09 06:41:55.777366: client_request(client.21117692:200671 unlink #0x10000034c06/ca-tr-bphost-vm1203.cloudwm.com.vmx.lck 2019-01-09 06:41:55.773064 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
2019-01-09 06:42:26.419985 mds.mon01 mds.0 192.168.31.21:6800/4117728989 38 : cluster [WRN] slow request 32.759245 seconds old, received at 2019-01-09 06:41:53.660607: client_request(client.21117692:200640 mknod #0x1000002f669/.lck-19fc030000010000 2019-01-09 06:41:53.656995 caller_uid=0, caller_gid=0{}) currently submit entry: journal_and_reply
2019-01-09 06:42:34.761176 mon.mon01 mon.0 192.168.31.21:6789/0 474564 : cluster [WRN] Health check failed: 4 osds down (OSD_DOWN)
2019-01-09 06:42:37.804984 mon.mon01 mon.0 192.168.31.21:6789/0 474571 : cluster [WRN] Health check failed: 57/2408922 objects unfound (0.002%) (OBJECT_UNFOUND)
2019-01-09 06:42:39.196786 mon.mon01 mon.0 192.168.31.21:6789/0 474575 : cluster [WRN] Health check update: Reduced data availability: 4 pgs inactive, 12 pgs down, 6 pgs peering (PG_AVAILABILITY)
2019-01-09 06:42:39.196806 mon.mon01 mon.0 192.168.31.21:6789/0 474576 : cluster [WRN] Health check update: Degraded data redundancy: 1368290/9635262 objects degraded (14.201%), 362 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:43.890826 mon.mon01 mon.0 192.168.31.21:6789/0 474585 : cluster [WRN] Health check update: 98/2409041 objects unfound (0.004%) (OBJECT_UNFOUND)
2019-01-09 06:42:44.197164 mon.mon01 mon.0 192.168.31.21:6789/0 474586 : cluster [WRN] Health check update: Reduced data availability: 3 pgs inactive, 15 pgs down (PG_AVAILABILITY)
2019-01-09 06:42:44.197189 mon.mon01 mon.0 192.168.31.21:6789/0 474587 : cluster [WRN] Health check update: Degraded data redundancy: 2262337/9635738 objects degraded (23.479%), 530 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:49.197542 mon.mon01 mon.0 192.168.31.21:6789/0 474600 : cluster [WRN] Health check update: 98/2409282 objects unfound (0.004%) (OBJECT_UNFOUND)
2019-01-09 06:42:49.197564 mon.mon01 mon.0 192.168.31.21:6789/0 474601 : cluster [WRN] Health check update: Degraded data redundancy: 2262551/9636702 objects degraded (23.478%), 530 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:54.197893 mon.mon01 mon.0 192.168.31.21:6789/0 474609 : cluster [WRN] Health check update: Degraded data redundancy: 2262546/9636702 objects degraded (23.478%), 530 pgs degraded (PG_DEGRADED)
2019-01-09 06:42:56.097889 mon.mon01 mon.0 192.168.31.21:6789/0 474613 : cluster [WRN] Health check update: 98/2409281 objects unfound (0.004%) (OBJECT_UNFOUND)
2019-01-09 06:42:59.198295 mon.mon01 mon.0 192.168.31.21:6789/0 474618 : cluster [WRN] Health check update: Degraded data redundancy: 2262546/9636694 objects degraded (23.478%), 530 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:02.184175 mon.mon01 mon.0 192.168.31.21:6789/0 474625 : cluster [WRN] Health check update: 98/2409273 objects unfound (0.004%) (OBJECT_UNFOUND)
2019-01-09 06:43:04.198683 mon.mon01 mon.0 192.168.31.21:6789/0 474629 : cluster [WRN] Health check update: Degraded data redundancy: 2262538/9636666 objects degraded (23.478%), 530 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:05.250864 mon.mon01 mon.0 192.168.31.21:6789/0 474635 : cluster [WRN] Health check update: 3 osds down (OSD_DOWN)
2019-01-09 06:43:05.268234 mon.mon01 mon.0 192.168.31.21:6789/0 474636 : cluster [INF] osd.3 192.168.31.201:6800/3034707 boot
2019-01-09 06:43:06.284262 mon.mon01 mon.0 192.168.31.21:6789/0 474641 : cluster [INF] osd.6 192.168.31.202:6812/3044608 boot
2019-01-09 06:43:06.292100 mon.mon01 mon.0 192.168.31.21:6789/0 474643 : cluster [WRN] Health check update: Reduced data availability: 3 pgs inactive, 11 pgs down, 1 pg peering (PG_AVAILABILITY)
2019-01-09 06:43:08.326630 mon.mon01 mon.0 192.168.31.21:6789/0 474653 : cluster [WRN] Health check update: 70/2409280 objects unfound (0.003%) (OBJECT_UNFOUND)
2019-01-09 06:43:08.353101 mon.mon01 mon.0 192.168.31.21:6789/0 474654 : cluster [INF] osd.10 192.168.31.203:6805/1678244 boot
2019-01-09 06:43:09.199077 mon.mon01 mon.0 192.168.31.21:6789/0 474658 : cluster [WRN] Health check update: Degraded data redundancy: 1708091/9636694 objects degraded (17.725%), 449 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:10.359006 mon.mon01 mon.0 192.168.31.21:6789/0 474662 : cluster [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2019-01-09 06:43:10.376429 mon.mon01 mon.0 192.168.31.21:6789/0 474663 : cluster [INF] osd.9 192.168.31.203:6810/1678245 boot
2019-01-09 06:43:12.420687 mon.mon01 mon.0 192.168.31.21:6789/0 474670 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 1 pg peering (PG_AVAILABILITY)
2019-01-09 06:43:14.199445 mon.mon01 mon.0 192.168.31.21:6789/0 474676 : cluster [WRN] Health check update: 20/2409449 objects unfound (0.001%) (OBJECT_UNFOUND)
2019-01-09 06:43:14.199480 mon.mon01 mon.0 192.168.31.21:6789/0 474677 : cluster [WRN] Health check update: Degraded data redundancy: 719403/9637370 objects degraded (7.465%), 390 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:14.437872 mon.mon01 mon.0 192.168.31.21:6789/0 474678 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 1 pg peering)
2019-01-09 06:43:16.472600 mon.mon01 mon.0 192.168.31.21:6789/0 474682 : cluster [INF] Health check cleared: OBJECT_UNFOUND (was: 8/2409543 objects unfound (0.000%))
2019-01-09 06:43:19.199828 mon.mon01 mon.0 192.168.31.21:6789/0 474687 : cluster [WRN] Health check update: Degraded data redundancy: 877/9639866 objects degraded (0.009%), 364 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:24.200196 mon.mon01 mon.0 192.168.31.21:6789/0 474696 : cluster [WRN] Health check update: Degraded data redundancy: 844/9643350 objects degraded (0.009%), 356 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:29.200569 mon.mon01 mon.0 192.168.31.21:6789/0 474706 : cluster [WRN] Health check update: Degraded data redundancy: 839/9645870 objects degraded (0.009%), 355 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:34.201084 mon.mon01 mon.0 192.168.31.21:6789/0 474716 : cluster [WRN] Health check update: Degraded data redundancy: 837/9649106 objects degraded (0.009%), 355 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:39.201589 mon.mon01 mon.0 192.168.31.21:6789/0 474725 : cluster [WRN] Health check update: Degraded data redundancy: 831/9651930 objects degraded (0.009%), 353 pgs degraded (PG_DEGRADED)
2019-01-09 06:43:44.202388 mon.mon01 mon.0 192.168.31.21:6789/0 474737 : cluster [WRN] Health check update: Degraded data redundancy: 826/9654681 objects degraded (0.009%), 353 pgs degraded (PG_DEGRADED)

OSD log:
2019-01-09 06:42:23.802 7fb20ce22700 10 osd.6 2353 dequeue_op 0x561f1e550140 finish
2019-01-09 06:42:23.802 7fb20ce22700 20 osd.6 op_wq(3) _process empty q, waiting
2019-01-09 06:42:23.802 7fb20d623700 20 osd.6 pg_epoch: 2353 pg[1.1d3s0( v 2353'29578 (2152'26500,2353'29578] local-lis/les=2350/2351 n=4821 ec=78/78 lis/c 2350/2338 les/c/f 2351/2343/0 2350/2350/2317) [6,2147483647,12,2147483647]p6(0) r=0 lpr=2350 pi=[2338,2350)/1 crt=2353'29578 lcod 2348'29577 mlcod 2348'29577 active+undersized+degraded mbc={}] try_finish_rmw: ExtentCache(
2019-01-09 06:42:23.802 7fb20d623700 20 osd.6 pg_epoch: 2353 pg[1.1d3s0( v 2353'29578 (2152'26500,2353'29578] local-lis/les=2350/2351 n=4821 ec=78/78 lis/c 2350/2338 les/c/f 2351/2343/0 2350/2350/2317) [6,2147483647,12,2147483647]p6(0) r=0 lpr=2350 pi=[2338,2350)/1 crt=2353'29578 lcod 2348'29577 mlcod 2348'29577 active+undersized+degraded mbc={}] try_finish_rmw: clearing pipeline_state CACHE_VALID
2019-01-09 06:42:23.802 7fb20d623700 10 osd.6 2353 dequeue_op 0x561f1e550680 finish
2019-01-09 06:42:23.802 7fb20d623700 20 osd.6 op_wq(2) _process empty q, waiting
2019-01-09 06:42:24.046 7fb226655700 10 osd.6 2353 tick
2019-01-09 06:42:24.046 7fb226655700 10 osd.6 2353 do_waiters -- start
2019-01-09 06:42:24.046 7fb226655700 10 osd.6 2353 do_waiters -- finish
2019-01-09 06:42:24.171 7fb20b61f700 20 osd.6 2353 update_osd_stat osd_stat(1.2 TiB used, 9.7 TiB avail, 11 TiB total, peers [0,3,4,9,10,12,14] op hist [])
2019-01-09 06:42:24.171 7fb20b61f700 20 osd.6 2353 check_full_status cur ratio 0.112159. nearfull_ratio 0.92. backfillfull_ratio 0.93, full_ratio 0.97, failsafe_ratio 0.97, new state none
2019-01-09 06:42:24.171 7fb20b61f700 5 osd.6 2353 heartbeat: osd_stat(1.2 TiB used, 9.7 TiB avail, 11 TiB total, peers [0,3,4,9,10,12,14] op hist [])
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.201:6806/3028713 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f186cfe00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.201:6807/3028713 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f19dfae00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.201:6802/3028659 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f19dfca00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.201:6803/3028659 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f19dfb200 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.202:6806/3039099 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f1fe4f000 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.202:6807/3039099 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f20b4ae00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.203:6809/1667078 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013ea00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.203:6810/1667078 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013dc00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.203:6814/1667202 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013da00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.203:6815/1667202 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013de00 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.204:6806/3038474 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013e000 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.204:6807/3038474 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013e200 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.204:6810/3038601 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013e400 con 0
2019-01-09 06:42:24.171 7fb20b61f700 1 -- 192.168.31.202:0/3038018 --> 192.168.31.204:6811/3038601 -- osd_ping(ping e2353 stamp 2019-01-09 06:42:24.174554) v4 -- 0x561f2013e600 con 0
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.4 192.168.31.202:6806/3039099 8 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f1865ae00
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f0e90c000 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.4 192.168.31.202:6807/3039099 8 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f1865ca00
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f0e90c000 already has epoch 2353
2019-01-09 06:42:24.171 7fb22a897700 1 -- 192.168.31.202:0/3038018 <== osd.0 192.168.31.201:6806/3028713 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f1f760200 con 0x561f184c3c00
2019-01-09 06:42:24.171 7fb22b899700 1 -- 192.168.31.202:0/3038018 <== osd.3 192.168.31.201:6803/3028659 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f24b1c400 con 0x561f17928300
2019-01-09 06:42:24.171 7fb22a897700 20 osd.6 2353 share_map_peer 0x561f1790b500 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b899700 20 osd.6 2353 share_map_peer 0x561f181ece00 already has epoch 2353
2019-01-09 06:42:24.171 7fb22a897700 1 -- 192.168.31.202:0/3038018 <== osd.0 192.168.31.201:6807/3028713 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f1f760200 con 0x561f184c3500
2019-01-09 06:42:24.171 7fb22a897700 20 osd.6 2353 share_map_peer 0x561f1790b500 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b899700 1 -- 192.168.31.202:0/3038018 <== osd.14 192.168.31.204:6810/3038601 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f24b1c400 con 0x561f184c2e00
2019-01-09 06:42:24.171 7fb22b899700 20 osd.6 2353 share_map_peer 0x561f181eea00 already has epoch 2353
2019-01-09 06:42:24.171 7fb22a897700 1 -- 192.168.31.202:0/3038018 <== osd.3 192.168.31.201:6802/3028659 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f1f760200 con 0x561f17927c00
2019-01-09 06:42:24.171 7fb22a897700 20 osd.6 2353 share_map_peer 0x561f181ece00 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.12 192.168.31.204:6806/3038474 360 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f0e90f800
2019-01-09 06:42:24.171 7fb22b899700 1 -- 192.168.31.202:0/3038018 <== osd.9 192.168.31.203:6809/1667078 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f24b1c400 con 0x561f1772c300
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f181ec000 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b899700 20 osd.6 2353 share_map_peer 0x561f181ee300 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.10 192.168.31.203:6815/1667202 360 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f178fa000
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f184c2000 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b899700 1 -- 192.168.31.202:0/3038018 <== osd.12 192.168.31.204:6807/3038474 360 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f24b1c400 con 0x561f0e90f100
2019-01-09 06:42:24.171 7fb22b899700 20 osd.6 2353 share_map_peer 0x561f181ec000 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.14 192.168.31.204:6811/3038601 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f184c5100
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f181eea00 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.9 192.168.31.203:6810/1667078 361 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f12b8a000
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f181ee300 already has epoch 2353
2019-01-09 06:42:24.171 7fb22b098700 1 -- 192.168.31.202:0/3038018 <== osd.10 192.168.31.203:6814/1667202 360 ==== osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.174554) v4 ==== 2004+0+0 (476572518 0 0) 0x561f19f00600 con 0x561f178fd800
2019-01-09 06:42:24.171 7fb22b098700 20 osd.6 2353 share_map_peer 0x561f184c2000 already has epoch 2353
2019-01-09 06:42:24.187 7fb22b899700 1 -- 192.168.31.202:6812/3038018 <== mds.0 192.168.31.21:6800/4117728989 1713 ==== ping magic: 0 v1 ==== 0+0+0 (0 0 0) 0x561f17dfd2c0 con 0x561f12b8a700
2019-01-09 06:42:24.187 7fb22b899700 10 osd.6 2353 ping from mds.0
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 tick_without_osd_lock
2019-01-09 06:42:24.215 7fb225e54700 20 osd.6 2353 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2019-01-09 06:42:24.215 7fb225e54700 20 osd.6 2353 scrub_time_permit should run between 0 - 24 now 6 = yes
2019-01-09 06:42:24.215 7fb225e54700 20 osd.6 2353 scrub_load_below_threshold loadavg per cpu 0.25 < max 0.5 = yes
2019-01-09 06:42:24.215 7fb225e54700 20 osd.6 2353 sched_scrub load_is_low=1
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 sched_scrub scrubbing 1.1ebs0 at 2019-01-09 00:58:39.739994, load_is_low
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 sched_scrub scrubbing 1.fes0 at 2019-01-09 04:56:55.323104, load_is_low
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 sched_scrub scrubbing 1.6es2 at 2019-01-09 06:39:54.015435, load_is_low
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 sched_scrub 1.aas0 scheduled at 2019-01-09 07:35:14.776786 > 2019-01-09 06:42:24.218285
2019-01-09 06:42:24.215 7fb225e54700 20 osd.6 2353 sched_scrub done
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target 25 obj/sec or 5 MiB/sec
2019-01-09 06:42:24.215 7fb225e54700 20 osd.6 2353 promote_throttle_recalibrate new_prob 1000
2019-01-09 06:42:24.215 7fb225e54700 10 osd.6 2353 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2019-01-09 06:42:24.215 7fb225e54700 15 osd.6 2353 _recover_now active 0 + reserved 0 >= max 0
2019-01-09 06:42:24.971 7fb22a897700 1 -- 192.168.31.202:6814/3038018 <== osd.3 192.168.31.201:0/3028659 390 ==== osd_ping(ping e2353 stamp 2019-01-09 06:42:24.970273) v4 ==== 2004+0+0 (2437370177 0 0) 0x561f1f760200 con 0x561f1865bc00
2019-01-09 06:42:24.971 7fb22a897700 1 -- 192.168.31.202:6814/3038018 --> 192.168.31.201:0/3028659 -- osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.970273) v4 -- 0x561f19dfca00 con 0
2019-01-09 06:42:24.971 7fb22b899700 1 -- 192.168.31.202:6815/3038018 <== osd.3 192.168.31.201:0/3028659 390 ==== osd_ping(ping e2353 stamp 2019-01-09 06:42:24.970273) v4 ==== 2004+0+0 (2437370177 0 0) 0x561f24b1c400 con 0x561f1790d800
2019-01-09 06:42:24.971 7fb22a897700 20 osd.6 2353 share_map_peer 0x561f181ece00 already has epoch 2353
2019-01-09 06:42:24.971 7fb22b899700 1 -- 192.168.31.202:6815/3038018 --> 192.168.31.201:0/3028659 -- osd_ping(ping_reply e2353 stamp 2019-01-09 06:42:24.970273) v4 -- 0x561f2013e400 con 0
2019-01-09 06:42:24.971 7fb22b899700 20 osd.6 2353 share_map_peer 0x561f181ece00 already has epoch 2353
2019-01-09 06:42:25.015 7fb226655700 10 osd.6 2353 tick
2019-01-09 06:42:25.015 7fb226655700 10 osd.6 2353 do_waiters -- start
2019-01-09 06:42:25.015 7fb226655700 10 osd.6 2353 do_waiters -- finish
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 tick_without_osd_lock
2019-01-09 06:42:25.211 7fb225e54700 20 osd.6 2353 can_inc_scrubs_pending 0 -> 1 (max 1, active 0)
2019-01-09 06:42:25.211 7fb225e54700 20 osd.6 2353 scrub_time_permit should run between 0 - 24 now 6 = yes
2019-01-09 06:42:25.211 7fb225e54700 20 osd.6 2353 scrub_load_below_threshold loadavg per cpu 0.25 < max 0.5 = yes
2019-01-09 06:42:25.211 7fb225e54700 20 osd.6 2353 sched_scrub load_is_low=1
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 sched_scrub scrubbing 1.1ebs0 at 2019-01-09 00:58:39.739994, load_is_low
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 sched_scrub scrubbing 1.fes0 at 2019-01-09 04:56:55.323104, load_is_low
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 sched_scrub scrubbing 1.6es2 at 2019-01-09 06:39:54.015435, load_is_low
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 sched_scrub 1.aas0 scheduled at 2019-01-09 07:35:14.776786 > 2019-01-09 06:42:25.214729
2019-01-09 06:42:25.211 7fb225e54700 20 osd.6 2353 sched_scrub done
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 promote_throttle_recalibrate 0 attempts, promoted 0 objects and 0 B; target 25 obj/sec or 5 MiB/sec
2019-01-09 06:42:25.211 7fb225e54700 20 osd.6 2353 promote_throttle_recalibrate new_prob 1000
2019-01-09 06:42:25.211 7fb225e54700 10 osd.6 2353 promote_throttle_recalibrate actual 0, actual/prob ratio 1, adjusted new_prob 1000, prob 1000 -> 1000
2019-01-09 06:42:25.211 7fb225e54700 15 osd.6 2353 _recover_now active 0 + reserved 0 >= max 0


Files

ceph-osd.6.log.tar.gz (176 Bytes), Greg Smith, 01/09/2019 12:31 PM
ceph-osd.06.log.gz (678 KB), Greg Smith, 01/09/2019 02:19 PM
ceph-osd.06.log.bz2 (800 KB), Greg Smith, 01/09/2019 03:16 PM

Related issues 2 (0 open, 2 closed)

Copied to bluestore - Backport #37990: mimic: Compression not working, and when applied OSD disks are failing randomly (Resolved, Igor Fedotov)
Copied to bluestore - Backport #37991: luminous: Compression not working, and when applied OSD disks are failing randomly (Resolved, Igor Fedotov)
Actions #1

Updated by Igor Fedotov over 5 years ago

The attached OSD log archive seems to be broken. Could you please upload it, or any other valid one, again?

Actions #3

Updated by Igor Fedotov over 5 years ago

From the mon log I can see a connection termination report for osd.6 at 06:42:34:
2019-01-09 06:42:34.122505 mon.mon01 mon.0 192.168.31.21:6789/0 474393 : cluster [INF] osd.6 failed (root=default,host=osd1202) (connection refused reported by osd.2)

and boot at 06:43:06:
2019-01-09 06:43:06.284262 mon.mon01 mon.0 192.168.31.21:6789/0 474641 : cluster [INF] osd.6 192.168.31.202:6812/3044608 boot

The latest OSD log starts at 06:41:53 and ends at 06:42:28, so it does not cover the failure period above:
2019-01-09 06:41:53.009 7fb22a897700 1 -- 192.168.31.202:6813/3038018 <== osd.0 192.168.31.201:6805/3028713 16643 ==== MOSDECSubOpWrite(1.79s1 2348/2338 ECSubWrite(tid=31034, reqid=client.21117692.0:3534428, at_version=2348'29537, trim_to=2152'26500, roll_forward_to=2348'29536)) v2 ==== 2099044+0+0 (2979773302 0 0) 0x561f1f680000 con 0x561f1790b500
...
2019-01-09 06:42:28.935 7fb20c621700 20 osd.6 pg_epoch: 2357 pg[1.1c6s2( v 2357'29190 (2152'26100,2357'29190] local-lis/les=2355/2356 n=4678 ec=78/78 lis/c 2355/2334 les/c/f 2356/2343/0 2355/2355/2332) [10,2,6,14]p10(0) r=2 lpr=2355 pi=[2334,2355)/1 luod=0'0 crt=2357'29190 active mbc={}] rollforward: entry=2357'29190 (0'0) modify 1:63d59ef3:::1000003fced.00000000:head by client.21117692.0:3534807 2019-01-09 06:41:55.797065 0

Could you please share the OSD log covering that "broken state" period (06:42:34 - 06:43:06)?

Actions #4

Updated by Igor Fedotov over 5 years ago

It would be great if you could collect the "broken state" OSD log with debug bluestore set to 20.
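
For example, one way to raise the log level at runtime (OSD id chosen only as an illustration; alternatively set it in ceph.conf and restart the OSD):

ceph tell osd.6 injectargs '--debug_bluestore 20/20'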

Actions #6

Updated by Igor Fedotov over 5 years ago

Great, thanks! Do you remember which compression method was configured in this specific case?

Actions #7

Updated by Greg Smith over 5 years ago

aggressive on the pool alone: ceph osd pool set cephfs_data compression_mode aggressive

Actions #8

Updated by Igor Fedotov over 5 years ago

Sorry, I meant: which compression algorithm?

Actions #9

Updated by Igor Fedotov over 5 years ago

I bet it's lz4, not snappy.
Could you please switch everything to snappy and check whether the issue is still present?
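
For example (pool name as used elsewhere in this ticket; adjust to your setup):

ceph osd pool set cephfs_data compression_algorithm snappy

and, if bluestore-level compression is enabled, in ceph.conf on the OSD nodes:

bluestore_compression_algorithm = snappy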

Actions #10

Updated by Igor Fedotov over 5 years ago

  • Assignee set to Igor Fedotov
Actions #11

Updated by Igor Fedotov over 5 years ago

  • Status changed from New to 12
Actions #12

Updated by Greg Smith over 5 years ago

In this specific case it's lz4, but it also happened with snappy.

Actions #13

Updated by Igor Fedotov over 5 years ago

osd.6 asserts on a non-zero result returned from the compress method:

2019-01-09 06:42:33.599 7fb20e625700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: In function 'int BlueStore::_do_alloc_write(BlueStore::TransContext*, BlueStore::CollectionRef, BlueStore::OnodeRef, BlueStore::WriteContext*)' thread 7fb20e625700 time 2019-01-09 06:42:33.598421
/build/ceph-13.2.2/src/os/bluestore/BlueStore.cc: 10559: FAILED assert(r == 0)

The Snappy compressor plugin (the compress() method in SnappyCompressor.h) never returns a non-zero error code, while the LZ4 one (LZ4Compressor.h) does.
Hence Snappy cannot trigger the above assertion.
So either you are observing two different issues, or the earlier report of failures with Snappy is incorrect.

Could you please double-check the Snappy compressor case (perhaps by disabling compression for all OSDs except one specific OSD where snappy is enabled) and collect the corresponding logs if that OSD fails.
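
One possible way to set that up (OSD id chosen only as an example, with pool-level compression left disabled during the test):

ceph tell osd.* injectargs '--bluestore_compression_mode none'
ceph tell osd.6 injectargs '--bluestore_compression_mode aggressive --bluestore_compression_algorithm snappy'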

Meanwhile I'm going to fix the handling of the error code returned by the compress call, which is what triggers the assertion above.

Actions #14

Updated by Greg Smith over 5 years ago

Hi Igor,
I meant that we had the same issue with snappy.
We'll try to reproduce it and send logs.
Thank you.

Actions #15

Updated by Igor Fedotov over 5 years ago

  • Pull request ID set to 25891
Actions #16

Updated by Greg Smith over 5 years ago

Hi Igor,
A question: what is the best way to implement compression, or what do you recommend, when the algorithm is set to snappy (as you recommended earlier)?
ceph osd pool set cephfs_data compression_mode force/aggressive
together with
bluestore_compression_mode = force/aggressive
or only one of them (and if so, which one)?

Actions #17

Updated by Igor Fedotov over 5 years ago

I suggest altering only the OSD-level compression settings, not the pool ones, i.e. the bluestore_compression_mode and bluestore_compression_algorithm parameters.
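
For example, in ceph.conf on the OSD nodes (values shown only to illustrate the two settings named above):

[osd]
bluestore_compression_mode = aggressive
bluestore_compression_algorithm = snappy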

Actions #22

Updated by Kefu Chai about 5 years ago

  • Status changed from 12 to Pending Backport
  • Backport set to mimic, luminous
Actions #23

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #37990: mimic: Compression not working, and when applied OSD disks are failing randomly added
Actions #24

Updated by Nathan Cutler about 5 years ago

  • Copied to Backport #37991: luminous: Compression not working, and when applied OSD disks are failing randomly added
Actions #25

Updated by Nathan Cutler about 5 years ago

@Igor Gajowiak - the command for creating backport issues is src/script/backport-create-issue (see the comments at the top of the script to get it working in your environment).

It takes a single argument: the number of the master issue. The master issue must already be in "Pending Backport" status.
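
For this issue, for example, that would be:

src/script/backport-create-issue 37839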

Actions #26

Updated by Igor Fedotov about 5 years ago

  • Private changed from No to Yes

@Nathan Weinberg - good to know, thanks.

Actions #27

Updated by Igor Fedotov about 5 years ago

  • Private changed from Yes to No
Actions #29

Updated by Igor Fedotov about 5 years ago

  • Status changed from Pending Backport to Resolved