Bug #21028

Ceph random slow request

Added by Artem Karamyshev over 6 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hello! We have a strange issue in our Ceph installation on 3 storage nodes with 26 disks each. Periodically I see slow requests on random OSDs; there isn't any correlation between the OSD number and the slow requests. For example:

2017-08-17 18:55:10.968544 mon.controller1 mon.0 172.28.1.5:6789/0 45257 : cluster [WRN] Health check update: 95 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:01.953912 osd.54 osd.54 172.28.6.131:6833/15178 392 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.186866 secs
2017-08-17 18:55:01.953920 osd.54 osd.54 172.28.6.131:6833/15178 393 : cluster [WRN] slow request 30.186866 seconds old, received at 2017-08-17 18:54:31.766962: osd_op(client.126396492.0:5548060 5.408020f7 5:ef040102:::rbd_data.6dc5615f16fd4e.00000000000000b8:head [stat,write 794624~4096] snapc 29=[29] ack+ondisk+write+known_if_redirected e28759) currently op_applied
2017-08-17 18:55:03.644976 osd.44 osd.44 172.28.5.2:6835/23838 355 : cluster [WRN] 6 slow requests, 1 included below; oldest blocked for > 32.158929 secs
2017-08-17 18:55:03.644983 osd.44 osd.44 172.28.5.2:6835/23838 356 : cluster [WRN] slow request 30.381824 seconds old, received at 2017-08-17 18:54:33.263073: osd_repop(client.102143459.0:1443395 5.9c6 e28759/28746 5:63902efc:::rbd_data.106f27327cfe90.0000000000000643:head v 28759'5894821) currently started
2017-08-17 18:55:05.645371 osd.44 osd.44 172.28.5.2:6835/23838 357 : cluster [WRN] 10 slow requests, 4 included below; oldest blocked for > 34.159267 secs
2017-08-17 18:55:05.645378 osd.44 osd.44 172.28.5.2:6835/23838 358 : cluster [WRN] slow request 30.122020 seconds old, received at 2017-08-17 18:54:35.523216: osd_repop(client.86267632.0:26917592 5.f6d e28759/28681 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head v 28759'2859480) currently started
2017-08-17 18:55:05.645384 osd.44 osd.44 172.28.5.2:6835/23838 359 : cluster [WRN] slow request 30.121414 seconds old, received at 2017-08-17 18:54:35.523822: osd_repop(client.86267632.0:26917593 5.f6d e28759/28681 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head v 28759'2859481) currently started
2017-08-17 18:55:05.645389 osd.44 osd.44 172.28.5.2:6835/23838 360 : cluster [WRN] slow request 30.120861 seconds old, received at 2017-08-17 18:54:35.524375: osd_repop(client.86267632.0:26917594 5.f6d e28759/28681 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head v 28759'2859482) currently started
2017-08-17 18:55:05.645402 osd.44 osd.44 172.28.5.2:6835/23838 361 : cluster [WRN] slow request 30.120522 seconds old, received at 2017-08-17 18:54:35.524714: osd_repop(client.86267632.0:26917595 5.f6d e28759/28681 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head v 28759'2859483) currently started
2017-08-17 18:55:06.645633 osd.44 osd.44 172.28.5.2:6835/23838 362 : cluster [WRN] 12 slow requests, 2 included below; oldest blocked for > 35.159568 secs
2017-08-17 18:55:06.645640 osd.44 osd.44 172.28.5.2:6835/23838 363 : cluster [WRN] slow request 30.947984 seconds old, received at 2017-08-17 18:54:35.697553: osd_repop(client.86267632.0:26917596 5.f6d e28759/28681) currently queued_for_pg
2017-08-17 18:55:06.645645 osd.44 osd.44 172.28.5.2:6835/23838 364 : cluster [WRN] slow request 30.827234 seconds old, received at 2017-08-17 18:54:35.818302: osd_op(client.107876915.0:443224 5.d12 5:48b84e9c:::rbd_data.69e68d46c3679f.000000000000080f:head [write 1015808~8192] snapc 0=[] ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 99
2017-08-17 18:55:08.645922 osd.44 osd.44 172.28.5.2:6835/23838 365 : cluster [WRN] 14 slow requests, 2 included below; oldest blocked for > 37.159864 secs
2017-08-17 18:55:08.645950 osd.44 osd.44 172.28.5.2:6835/23838 366 : cluster [WRN] slow request 30.563850 seconds old, received at 2017-08-17 18:54:38.081983: osd_repop(client.102558248.0:738319 5.1d4 e28759/28681 5:2b8ffc30:::rbd_data.1ce9837e3e5cd1.0000000000000210:head v 28759'20917834) currently started
2017-08-17 18:55:08.645953 osd.44 osd.44 172.28.5.2:6835/23838 367 : cluster [WRN] slow request 30.522532 seconds old, received at 2017-08-17 18:54:38.123300: osd_repop(client.109509827.0:2065753 5.7c8 e28759/28681 5:13ebd96a:::rbd_data.5fbcdbee3dc891.0000000000000249:head v 28759'2693053) currently started
2017-08-17 18:55:08.954980 osd.54 osd.54 172.28.6.131:6833/15178 394 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 37.187922 secs
2017-08-17 18:55:08.954987 osd.54 osd.54 172.28.6.131:6833/15178 395 : cluster [WRN] slow request 30.597113 seconds old, received at 2017-08-17 18:54:38.357771: osd_op(client.126396492.0:5548079 5.408020f7 5:ef040102:::rbd_data.6dc5615f16fd4e.00000000000000b8:head [stat,write 794624~4096] snapc 29=[29] ack+ondisk+write+known_if_redirected e28759) currently waiting for rw locks
2017-08-17 18:55:09.646158 osd.44 osd.44 172.28.5.2:6835/23838 368 : cluster [WRN] 15 slow requests, 1 included below; oldest blocked for > 38.160122 secs
2017-08-17 18:55:09.646166 osd.44 osd.44 172.28.5.2:6835/23838 369 : cluster [WRN] slow request 30.304649 seconds old, received at 2017-08-17 18:54:39.341441: osd_repop(client.61379588.0:9056160 3.266 e28759/28681) currently queued_for_pg
2017-08-17 18:55:10.646365 osd.44 osd.44 172.28.5.2:6835/23838 370 : cluster [WRN] 16 slow requests, 1 included below; oldest blocked for > 39.160294 secs
2017-08-17 18:55:10.646381 osd.44 osd.44 172.28.5.2:6835/23838 371 : cluster [WRN] slow request 30.371119 seconds old, received at 2017-08-17 18:54:40.275143: osd_op(client.102663270.0:753674 5.dd8 5.eb917dd8 (undecoded) ondisk+write+known_if_redirected e28759) currently queued_for_pg
2017-08-17 18:55:13.150381 mon.controller1 mon.0 172.28.1.5:6789/0 45259 : cluster [WRN] Health check update: 103 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:05.741361 osd.94 osd.94 172.28.6.132:6815/28309 297 : cluster [WRN] 5 slow requests, 5 included below; oldest blocked for > 30.219177 secs
2017-08-17 18:55:05.741373 osd.94 osd.94 172.28.6.132:6815/28309 298 : cluster [WRN] slow request 30.219177 seconds old, received at 2017-08-17 18:54:35.522046: osd_op(client.86267632.0:26917592 5.d6bc7f6d 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head [write 1114112~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 46
2017-08-17 18:55:05.741381 osd.94 osd.94 172.28.6.132:6815/28309 299 : cluster [WRN] slow request 30.218997 seconds old, received at 2017-08-17 18:54:35.522225: osd_op(client.86267632.0:26917593 5.d6bc7f6d 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head [write 3350528~8192] snapc 0=[] ack+ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 46
2017-08-17 18:55:05.741388 osd.94 osd.94 172.28.6.132:6815/28309 300 : cluster [WRN] slow request 30.218558 seconds old, received at 2017-08-17 18:54:35.522664: osd_op(client.86267632.0:26917594 5.d6bc7f6d 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head [write 3833856~8192] snapc 0=[] ack+ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 46
2017-08-17 18:55:05.741394 osd.94 osd.94 172.28.6.132:6815/28309 301 : cluster [WRN] slow request 30.218288 seconds old, received at 2017-08-17 18:54:35.522934: osd_op(client.86267632.0:26917595 5.d6bc7f6d 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head [write 3883008~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e28759) currently op_applied
2017-08-17 18:55:05.741407 osd.94 osd.94 172.28.6.132:6815/28309 302 : cluster [WRN] slow request 30.218113 seconds old, received at 2017-08-17 18:54:35.523109: osd_op(client.86267632.0:26917596 5.d6bc7f6d 5:b6fe3d6b:::rbd_data.886e6e7a21c4c4.0000000000000240:head [write 3891200~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e28759) currently op_applied
2017-08-17 18:55:08.450147 osd.62 osd.62 172.28.6.131:6842/15662 571 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 36.683877 secs
2017-08-17 18:55:08.450153 osd.62 osd.62 172.28.6.131:6842/15662 572 : cluster [WRN] slow request 30.092954 seconds old, received at 2017-08-17 18:54:38.357121: osd_op(client.126396492.0:5548070 5.9d90d118 5:188b09b9:::rbd_data.6dc5615f16fd4e.000000000000011e:head [stat,write 2600960~8192] snapc 29=[29] ack+ondisk+write+known_if_redirected e28759) currently waiting for rw locks
2017-08-17 18:55:14.451115 osd.62 osd.62 172.28.6.131:6842/15662 573 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 42.684841 secs
2017-08-17 18:55:14.451124 osd.62 osd.62 172.28.6.131:6842/15662 574 : cluster [WRN] slow request 30.091586 seconds old, received at 2017-08-17 18:54:44.359453: osd_op(client.126396492.0:5548084 5.9d90d118 5:188b09b9:::rbd_data.6dc5615f16fd4e.000000000000011e:head [stat,write 2605056~8192] snapc 29=[29] ack+ondisk+write+known_if_redirected e28759) currently waiting for rw locks
2017-08-17 18:55:15.650515 mon.controller1 mon.0 172.28.1.5:6789/0 45261 : cluster [WRN] Health check update: 104 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:10.055545 osd.60 osd.60 172.28.6.131:6810/16349 228 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.716770 secs
2017-08-17 18:55:10.055571 osd.60 osd.60 172.28.6.131:6810/16349 229 : cluster [WRN] slow request 30.716770 seconds old, received at 2017-08-17 18:54:39.338709: osd_op(client.61379588.0:9056160 3.8a8d7266 3:664eb151:::rbd_data.162c90d3b761c43.0000000000002090:head [write 2465792~65536] snapc 4f=[] ack+ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 87
2017-08-17 18:55:17.318085 mon.controller1 mon.0 172.28.1.5:6789/0 45262 : cluster [WRN] Health check update: 103 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:14.682470 osd.68 osd.68 172.28.6.131:6804/15295 267 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.549232 secs
2017-08-17 18:55:14.682480 osd.68 osd.68 172.28.6.131:6804/15295 268 : cluster [WRN] slow request 30.549232 seconds old, received at 2017-08-17 18:54:44.133119: osd_op(client.102678393.0:2104391 5.994 5:299b8b3c:::rbd_data.1da7137e2e8611.0000000000000201:head [stat,write 2228224~4096] snapc 0=[] ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 95
2017-08-17 18:55:19.801253 mon.controller1 mon.0 172.28.1.5:6789/0 45263 : cluster [WRN] Health check update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:21.312444 mon.controller1 mon.0 172.28.1.5:6789/0 45264 : cluster [WRN] Health check update: 6 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:13.646911 osd.44 osd.44 172.28.5.2:6835/23838 372 : cluster [WRN] 20 slow requests, 4 included below; oldest blocked for > 42.160828 secs
2017-08-17 18:55:13.646926 osd.44 osd.44 172.28.5.2:6835/23838 373 : cluster [WRN] slow request 30.134656 seconds old, received at 2017-08-17 18:54:43.512141: osd_op(client.122117334.0:305054 5.59d 5.fd5de59d (undecoded) ondisk+write+known_if_redirected e28759) currently queued_for_pg
2017-08-17 18:55:13.646930 osd.44 osd.44 172.28.5.2:6835/23838 374 : cluster [WRN] slow request 30.097782 seconds old, received at 2017-08-17 18:54:43.549014: osd_repop_reply(client.122117334.0:305053 5.59d e28759/28681) currently queued_for_pg
2017-08-17 18:55:13.646935 osd.44 osd.44 172.28.5.2:6835/23838 375 : cluster [WRN] slow request 30.089324 seconds old, received at 2017-08-17 18:54:43.557472: osd_repop_reply(client.122117334.0:305053 5.59d e28759/28681) currently queued_for_pg
2017-08-17 18:55:13.646940 osd.44 osd.44 172.28.5.2:6835/23838 376 : cluster [WRN] slow request 30.134812 seconds old, received at 2017-08-17 18:54:43.511985: osd_op(client.122117334.0:305053 5.59d 5:b9a7babf:::rbd_data.2fbcb64b793a20.0000000000000000:head [stat,write 1049088~512] snapc 0=[] ondisk+write+known_if_redirected e28759) currently waiting for subops from 50,86
2017-08-17 18:55:14.647179 osd.44 osd.44 172.28.5.2:6835/23838 377 : cluster [WRN] 16 slow requests, 4 included below; oldest blocked for > 39.123872 secs
2017-08-17 18:55:14.647185 osd.44 osd.44 172.28.5.2:6835/23838 378 : cluster [WRN] slow request 30.513369 seconds old, received at 2017-08-17 18:54:44.133718: osd_repop(client.102678393.0:2104386 5.a08 e28759/28739 5:10585d96:::rbd_data.1da7137e2e8611.0000000000000000:head v 28759'2680676) currently started
2017-08-17 18:55:14.647189 osd.44 osd.44 172.28.5.2:6835/23838 379 : cluster [WRN] slow request 30.512976 seconds old, received at 2017-08-17 18:54:44.134112: osd_repop(client.102678393.0:2104391 5.994 e28759/28681 5:299b8b3c:::rbd_data.1da7137e2e8611.0000000000000201:head v 28759'20910252) currently started
2017-08-17 18:55:14.647193 osd.44 osd.44 172.28.5.2:6835/23838 380 : cluster [WRN] slow request 30.513069 seconds old, received at 2017-08-17 18:54:44.134019: osd_repop(client.102678393.0:2104389 5.183 e28759/28681 5:c18d7abf:::rbd_data.1da7137e2e8611.0000000000000071:head v 28759'1834682) currently started
2017-08-17 18:55:14.647197 osd.44 osd.44 172.28.5.2:6835/23838 381 : cluster [WRN] slow request 30.466500 seconds old, received at 2017-08-17 18:54:44.180587: osd_op(client.102666564.0:498462 5.9d0 5:0b934a8e:::rbd_data.1e8c9a65420bc0.0000000000000223:head [stat,write 3162112~12288] snapc 0=[] ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 63
2017-08-17 18:55:14.245916 osd.69 osd.69 172.28.6.131:6803/15073 261 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.113067 secs
2017-08-17 18:55:14.245924 osd.69 osd.69 172.28.6.131:6803/15073 262 : cluster [WRN] slow request 30.113067 seconds old, received at 2017-08-17 18:54:44.132765: osd_op(client.102678393.0:2104386 5.a08 5:10585d96:::rbd_data.1da7137e2e8611.0000000000000000:head [stat,write 1049600~512] snapc 0=[] ondisk+write+known_if_redirected e28759) currently op_applied
2017-08-17 18:55:23.603223 mon.controller1 mon.0 172.28.1.5:6789/0 45265 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 6 slow requests are blocked > 32 sec)
2017-08-17 18:55:23.603312 mon.controller1 mon.0 172.28.1.5:6789/0 45266 : cluster [INF] Cluster is now healthy
2017-08-17 18:55:53.522736 mon.controller1 mon.0 172.28.1.5:6789/0 45267 : cluster [WRN] Health check failed: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:55:59.870285 mon.controller1 mon.0 172.28.1.5:6789/0 45268 : cluster [WRN] Health check update: 13 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:56:03.487251 mon.controller1 mon.0 172.28.1.5:6789/0 45269 : cluster [WRN] Health check update: 26 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:56:07.264360 mon.controller1 mon.0 172.28.1.5:6789/0 45270 : cluster [WRN] Health check update: 30 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:56:09.442625 mon.controller1 mon.0 172.28.1.5:6789/0 45271 : cluster [WRN] Health check update: 47 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:56:03.196903 osd.45 osd.45 172.28.6.131:6850/15777 385 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.676124 secs
2017-08-17 18:56:03.196919 osd.45 osd.45 172.28.6.131:6850/15777 386 : cluster [WRN] slow request 30.676124 seconds old, received at 2017-08-17 18:55:32.520643: osd_op(client.86267632.0:26918130 5.9dd55ecd 5:b37aabb9:::rbd_data.886e6e7a21c4c4.0000000000000b64:head [write 1204224~4096] snapc 0=[] ack+ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 111
2017-08-17 18:56:06.197475 osd.45 osd.45 172.28.6.131:6850/15777 387 : cluster [WRN] 2 slow requests, 1 included below; oldest blocked for > 33.676745 secs
2017-08-17 18:56:06.197482 osd.45 osd.45 172.28.6.131:6850/15777 388 : cluster [WRN] slow request 30.042934 seconds old, received at 2017-08-17 18:55:36.154454: osd_op(client.123160221.0:1653394 5.1d7 5:eb8051c8:::rbd_data.4e24fe2910bfc8.0000000000003785:head [write 3432448~4096] snapc 0=[] ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 41
2017-08-17 18:56:07.197683 osd.45 osd.45 172.28.6.131:6850/15777 389 : cluster [WRN] 3 slow requests, 1 included below; oldest blocked for > 34.676957 secs
2017-08-17 18:56:07.197695 osd.45 osd.45 172.28.6.131:6850/15777 390 : cluster [WRN] slow request 30.151070 seconds old, received at 2017-08-17 18:55:37.046530: osd_repop(client.126396492.0:5548396 5.837 e28759/28681 5:ec14da05:::rbd_data.6dc5615f16fd4e.0000000000000e05:head v 28759'3437812) currently started
2017-08-17 18:56:07.437482 osd.34 osd.34 172.28.5.2:6845/24323 332 : cluster [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.397031 secs
2017-08-17 18:56:07.437490 osd.34 osd.34 172.28.5.2:6845/24323 333 : cluster [WRN] slow request 30.397031 seconds old, received at 2017-08-17 18:55:37.040348: osd_op(client.126396492.0:5548396 5.a05b2837 5:ec14da05:::rbd_data.6dc5615f16fd4e.0000000000000e05:head [write 2473984~409600] snapc 29=[29] ack+ondisk+write+known_if_redirected e28759) currently sub_op_commit_rec from 111
2017-08-17 18:56:11.925449 mon.controller1 mon.0 172.28.1.5:6789/0 45274 : cluster [WRN] Health check update: 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:56:13.636603 mon.controller1 mon.0 172.28.1.5:6789/0 45275 : cluster [WRN] Health check update: 1 slow requests are blocked > 32 sec (REQUEST_SLOW)
2017-08-17 18:56:15.510045 mon.controller1 mon.0 172.28.1.5:6789/0 45276 : cluster [INF] Health check cleared: REQUEST_SLOW (was: 1 slow requests are blocked > 32 sec)
2017-08-17 18:56:15.510146 mon.controller1 mon.0 172.28.1.5:6789/0 45277 : cluster [INF] Cluster is now healthy
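To back up the "random OSDs" observation, the pasted log can be tallied per OSD and per op state. A minimal sketch (the `LOG` string here holds just two trimmed sample lines in the same format as above; in practice, feed it the real cluster log):

```python
import re
from collections import Counter

# Two abbreviated sample lines in the cluster-log format pasted above;
# replace with the contents of the real ceph.log.
LOG = """\
2017-08-17 18:55:01.953920 osd.54 osd.54 172.28.6.131:6833/15178 393 : cluster [WRN] slow request 30.186866 seconds old, received at 2017-08-17 18:54:31.766962: osd_op(...) currently op_applied
2017-08-17 18:55:03.644983 osd.44 osd.44 172.28.5.2:6835/23838 356 : cluster [WRN] slow request 30.381824 seconds old, received at 2017-08-17 18:54:33.263073: osd_repop(...) currently started
"""

# Capture the reporting OSD and the op state after "currently".
pattern = re.compile(r"(osd\.\d+).*slow request [\d.]+ seconds old.*currently (.+)$")

osds, states = Counter(), Counter()
for line in LOG.splitlines():
    m = pattern.search(line)
    if m:
        osds[m.group(1)] += 1
        states[m.group(2)] += 1

print(osds.most_common())    # which OSDs report slow requests
print(states.most_common())  # which op states the requests are stuck in
```

If the counts cluster on a few OSDs or on one state (e.g. `waiting for rw locks` vs. `sub_op_commit_rec`), that points at a specific disk or at replication latency respectively.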

I've followed the instructions at
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
to try to find out what's happening, but as far as the hardware is concerned everything looks fine: no SMART errors logged, iostat shows some activity but nothing pegged at 100%, nothing in dmesg, and CPU usage tops out at around 25%.

I also checked the network and found no packet loss, errors, etc.

Could anybody suggest some more steps I can take to narrow down where this bottleneck lies?
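Beyond the hardware checks above, the blocked ops themselves can be inspected through the OSD admin socket. A sketch of the usual next steps (commands as of Luminous; `osd.44` is taken from the log above; they need a live cluster, so this sketch only lists them):

```shell
# Diagnostics to run on the node hosting the affected OSD.
# Stored in a variable and echoed here, since they require a running cluster.
cmds='ceph daemon osd.44 dump_ops_in_flight   # ops currently blocked, with per-event timestamps
ceph daemon osd.44 dump_historic_ops    # recently completed slow ops and where they spent time
ceph osd perf                           # per-OSD commit/apply latency, to spot a slow disk
ceph -s                                 # overall cluster state while the warning fires'
echo "$cmds"
```

The event timeline in `dump_historic_ops` usually shows which stage (journal commit, sub-op replies from peers, PG lock) accounts for the 30+ seconds.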

Thanks in advance!


Files

ceph.conf (733 Bytes) ceph.conf Artem Karamyshev, 08/17/2017 05:56 PM
ceph-osd.45.log (17.6 KB) ceph-osd.45.log Artem Karamyshev, 08/17/2017 06:00 PM
ceph-osd.34.log (4.98 KB) ceph-osd.34.log Artem Karamyshev, 08/17/2017 06:00 PM
ceph-osd.69.log (7.14 KB) ceph-osd.69.log Artem Karamyshev, 08/17/2017 06:01 PM
Actions #1

Updated by Artem Karamyshev over 6 years ago

Version 12.1.4 luminous

Actions #2

Updated by Sage Weil almost 3 years ago

  • Status changed from New to Closed