Bug #5401

closed

cuttlefish osd recovery slow

Added by Stefan Priebe almost 11 years ago. Updated almost 11 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
cuttlefish
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

While peering is fine now (Bug #5232, latest upstream/cuttlefish), even without wip_cuttlefish_compact_on_startup, the recovery itself is not.

While the osd is recovering i'm still seeing a lot of SLOW requests and also stuck I/O on the virtual machines. This also happens at night when there is no real load on the ceph storage (ceph -w shows only 60-200 ops).

I also tested these settings:
osd recovery max active = 2
osd_recovery_op_priority = 1
osd_min_pg_log_entries = 300

but they do not help either. All OSDs have their journal on SSDs.

just a very small portion of the osd log:
2013-06-20 00:22:04.762379 7f7216fb8700 0 log [WRN] : 6 slow requests, 6 included below; oldest blocked for > 30.646148 secs
2013-06-20 00:22:04.762389 7f7216fb8700 0 log [WRN] : slow request 30.646148 seconds old, received at 2013-06-20 00:21:34.115930: osd_op(client.12181171.0:501117 rbd_data.63aa616b8b4567.00000000000097b1 [write 655360~4096] 4.f700780f snapc 603e=[] e95953) v4 currently commit sent
2013-06-20 00:22:04.762392 7f7216fb8700 0 log [WRN] : slow request 30.646131 seconds old, received at 2013-06-20 00:21:34.115947: osd_op(client.12181171.0:501118 rbd_data.63aa616b8b4567.00000000000097b1 [write 663552~4096] 4.f700780f snapc 603e=[] e95953) v4 currently commit sent
2013-06-20 00:22:04.762394 7f7216fb8700 0 log [WRN] : slow request 30.646117 seconds old, received at 2013-06-20 00:21:34.115961: osd_op(client.12181171.0:501119 rbd_data.63aa616b8b4567.00000000000097b1 [write 671744~4096] 4.f700780f snapc 603e=[] e95953) v4 currently commit sent
2013-06-20 00:22:04.762396 7f7216fb8700 0 log [WRN] : slow request 30.646104 seconds old, received at 2013-06-20 00:21:34.115974: osd_op(client.12181171.0:501120 rbd_data.63aa616b8b4567.00000000000097b1 [write 692224~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:04.762400 7f7216fb8700 0 log [WRN] : slow request 30.646079 seconds old, received at 2013-06-20 00:21:34.115999: osd_op(client.12181171.0:501121 rbd_data.63aa616b8b4567.00000000000097b1 [write 708608~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:05.762538 7f7216fb8700 0 log [WRN] : 10 slow requests, 6 included below; oldest blocked for > 31.646542 secs
2013-06-20 00:22:05.762544 7f7216fb8700 0 log [WRN] : slow request 31.646476 seconds old, received at 2013-06-20 00:21:34.116013: osd_op(client.12181171.0:501122 rbd_data.63aa616b8b4567.00000000000097b1 [write 716800~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:05.762546 7f7216fb8700 0 log [WRN] : slow request 31.646458 seconds old, received at 2013-06-20 00:21:34.116031: osd_op(client.12181171.0:501123 rbd_data.63aa616b8b4567.00000000000097b1 [write 724992~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:05.762548 7f7216fb8700 0 log [WRN] : slow request 31.646439 seconds old, received at 2013-06-20 00:21:34.116050: osd_op(client.12181171.0:501124 rbd_data.63aa616b8b4567.00000000000097b1 [write 733184~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:05.762551 7f7216fb8700 0 log [WRN] : slow request 31.646410 seconds old, received at 2013-06-20 00:21:34.116079: osd_op(client.12181171.0:501125 rbd_data.63aa616b8b4567.00000000000097b1 [write 741376~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:05.762553 7f7216fb8700 0 log [WRN] : slow request 31.646398 seconds old, received at 2013-06-20 00:21:34.116091: osd_op(client.12181171.0:501126 rbd_data.63aa616b8b4567.00000000000097b1 [write 749568~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:06.762687 7f7216fb8700 0 log [WRN] : 14 slow requests, 6 included below; oldest blocked for > 32.646670 secs
2013-06-20 00:22:06.762693 7f7216fb8700 0 log [WRN] : slow request 32.646530 seconds old, received at 2013-06-20 00:21:34.116101: osd_op(client.12181171.0:501127 rbd_data.63aa616b8b4567.00000000000097b1 [write 757760~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached
2013-06-20 00:22:06.762696 7f7216fb8700 0 log [WRN] : slow request 32.646521 seconds old, received at 2013-06-20 00:21:34.116110: osd_op(client.12181171.0:501128 rbd_data.63aa616b8b4567.00000000000097b1 [write 765952~4096] 4.f700780f snapc 603e=[] e95953) v4 currently no flag points reached

Actions #1

Updated by Stefan Priebe almost 11 years ago

Lowering osd recovery max active makes it even worse, as the overall recovery takes longer. So it's not the I/O load that is the problem.

It just seems that the osd does not serve specific requests while recovering. Might it be that it already gets requests for PGs that are not yet recovered and then can't serve them?

Actions #2

Updated by Stefan Priebe almost 11 years ago

Some more information: while recovering i see nearly no CPU load. If i look at the disk activity i see a HUGE amount of reads and just a low amount of writes. I thought that this would be the other way round.

Actions #3

Updated by Stefan Priebe almost 11 years ago

Not sure if this helps:

/etc/init.d/ceph stop osd.24; sleep 15; /etc/init.d/ceph start osd.24; sleep 10; inotifywatch -t 10 -r /ceph/osd.24/

=== osd.24 ===
Stopping Ceph osd.24 on 1264...kill 413...done
=== osd.24 ===
create-or-move updated item id 24 name 'osd.24' weight 0.91 at location {host=1264,root=default} to crush map
Starting Ceph osd.24 on 1264...
starting osd.24 at :/0 osd_data /ceph/osd.24/ /dev/disk/by-partlabel/journalosd24
Establishing watches...
Finished establishing watches, now collecting statistics.
total access modify attrib close_write close_nowrite open create delete filename
1171 175 180 170 158 163 321 2 2 /ceph/osd.24/current/4.1d0_head/
1081 0 120 240 121 240 360 0 0 /ceph/osd.24/current/4.a82_head/
1017 0 113 226 113 226 339 0 0 /ceph/osd.24/current/4.db8_head/
612 0 68 136 68 136 204 0 0 /ceph/osd.24/current/4.e9e_head/
515 1 57 115 57 113 170 1 1 /ceph/osd.24/current/4.1e5_head/
515 0 2 307 0 92 68 46 0 /ceph/osd.24/current/
483 0 151 14 152 5 157 2 2 /ceph/osd.24/current/4.1d0_TEMP/
457 0 49 98 49 106 155 0 0 /ceph/osd.24/current/4.649_head/
315 0 35 70 35 70 105 0 0 /ceph/osd.24/current/4.a32_head/
296 66 70 36 28 32 60 2 2 /ceph/osd.24/current/4.b0c_head/
288 40 41 43 40 41 81 1 1 /ceph/osd.24/current/4.671_head/
286 0 30 60 30 68 98 0 0 /ceph/osd.24/current/4.231_head/
283 40 41 42 39 40 79 1 1 /ceph/osd.24/current/4.40e_head/
274 44 47 39 34 37 71 1 1 /ceph/osd.24/current/4.cbd_head/
244 47 51 32 26 30 56 1 1 /ceph/osd.24/current/4.12c_head/
237 37 38 34 31 32 63 1 1 /ceph/osd.24/current/4.f1c_head/
225 46 47 28 25 26 51 1 1 /ceph/osd.24/current/4.56c_head/
218 0 22 44 22 54 76 0 0 /ceph/osd.24/current/4.cb_head/
213 5 26 46 23 44 67 1 1 /ceph/osd.24/current/4.f51_head/
205 0 21 42 21 50 71 0 0 /ceph/osd.24/current/4.cc3_head/
182 39 44 23 16 21 37 1 1 /ceph/osd.24/current/4.6bf_head/
179 39 42 22 17 20 37 1 1 /ceph/osd.24/current/4.2bb_head/
174 0 18 36 18 42 60 0 0 /ceph/osd.24/current/4.7be_head/
171 40 43 20 15 18 33 1 1 /ceph/osd.24/current/4.bb4_head/
170 41 42 19 16 17 33 1 1 /ceph/osd.24/current/4.eac_head/
165 0 17 34 17 40 57 0 0 /ceph/osd.24/current/4.d21_head/
152 37 38 17 14 15 29 1 1 /ceph/osd.24/current/4.d75_head/
151 36 41 18 11 16 27 1 1 /ceph/osd.24/current/4.341_head/
141 0 13 26 13 38 51 0 0 /ceph/osd.24/current/4.862_head/
129 36 36 13 10 10 22 1 1 /ceph/osd.24/current/4.df_head/
129 0 38 7 38 3 41 1 1 /ceph/osd.24/current/4.40e_TEMP/
122 0 39 7 29 3 42 1 1 /ceph/osd.24/current/4.671_TEMP/
121 33 37 13 7 11 18 1 1 /ceph/osd.24/current/4.64_head/
114 34 37 11 6 9 15 1 1 /ceph/osd.24/current/4.e86_head/
113 35 36 10 7 8 15 1 1 /ceph/osd.24/current/4.738_head/
102 0 30 7 27 3 33 1 1 /ceph/osd.24/current/4.f1c_TEMP/
102 0 30 7 27 3 33 1 1 /ceph/osd.24/current/4.cbd_TEMP/
101 34 35 8 5 6 11 1 1 /ceph/osd.24/current/4.ace_head/
97 0 23 13 23 5 29 2 2 /ceph/osd.24/current/4.b0c_TEMP/
96 34 35 7 4 5 9 1 1 /ceph/osd.24/current/4.fd9_head/
96 34 35 7 4 5 9 1 1 /ceph/osd.24/current/4.b35_head/
90 9 16 17 8 15 23 1 1 /ceph/osd.24/current/4.fbe_head/
88 28 29 7 6 7 11 0 0 /ceph/osd.24/current/4.de2_head/
88 0 24 7 25 3 27 1 1 /ceph/osd.24/current/4.56c_TEMP/
87 34 34 5 3 3 6 1 1 /ceph/osd.24/current/4.bd1_head/
85 0 9 20 9 18 27 1 1 /ceph/osd.24/current/4.78e_head/
84 26 28 8 4 6 10 1 1 /ceph/osd.24/current/4.2ab_head/
80 22 25 9 4 7 11 1 1 /ceph/osd.24/current/4.858_head/
80 0 8 18 8 18 26 1 1 /ceph/osd.24/current/4.353_head/
79 0 21 7 22 3 24 1 1 /ceph/osd.24/current/4.12c_TEMP/
72 0 8 16 8 16 24 0 0 /ceph/osd.24/current/4.308_head/
72 0 8 16 8 16 24 0 0 /ceph/osd.24/current/4.760_head/
66 0 7 15 8 15 21 0 0 /ceph/osd.24/current/4.5ab_head/
62 0 6 14 6 14 20 1 1 /ceph/osd.24/current/4.f65_head/
59 18 19 6 3 4 7 1 1 /ceph/osd.24/current/4.a4_head/
58 0 6 14 6 12 18 1 1 /ceph/osd.24/current/4.323_head/
58 11 14 9 4 7 11 1 1 /ceph/osd.24/current/4.216_head/
58 0 14 7 15 3 17 1 1 /ceph/osd.24/current/4.eac_TEMP/
55 23 23 3 1 1 2 1 1 /ceph/osd.24/current/4.5c9_head/
55 14 16 7 3 5 8 1 1 /ceph/osd.24/current/4.988_head/
55 0 13 7 14 3 16 1 1 /ceph/osd.24/current/4.2bb_TEMP/
55 0 13 7 14 3 16 1 1 /ceph/osd.24/current/4.d75_TEMP/
54 0 6 12 6 12 18 0 0 /ceph/osd.24/current/4.f6c_head/
53 0 5 10 5 14 19 0 0 /ceph/osd.24/current/4.1eb_head/
52 0 12 7 13 3 15 1 1 /ceph/osd.24/current/4.bb4_TEMP/
52 0 12 7 13 3 15 1 1 /ceph/osd.24/current/4.df_TEMP/
46 0 10 7 11 3 13 1 1 /ceph/osd.24/current/4.6bf_TEMP/
45 0 5 10 5 10 15 0 0 /ceph/osd.24/current/4.f1d_head/
44 0 4 8 4 12 16 0 0 /ceph/osd.24/current/4.50c_head/
44 0 4 8 4 12 16 0 0 /ceph/osd.24/current/4.684_head/
39 0 3 6 3 12 15 0 0 /ceph/osd.24/current/4.afe_head/
35 0 3 6 3 10 13 0 0 /ceph/osd.24/current/4.86b_head/
35 0 8 4 9 2 11 1 0 /ceph/osd.24/current/4.847_TEMP/
34 0 6 7 7 3 9 1 1 /ceph/osd.24/current/4.738_TEMP/
31 0 3 8 3 6 9 1 1 /ceph/osd.24/current/4.c11_head/
31 0 3 8 3 6 9 1 1 /ceph/osd.24/current/4.f19_head/
31 0 5 7 6 3 8 1 1 /ceph/osd.24/current/4.341_TEMP/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.c89_head/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.a1_head/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.9dc_head/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.f3e_head/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.5_head/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.90c_head/
27 0 3 6 3 6 9 0 0 /ceph/osd.24/current/4.137_head/
26 0 2 6 2 6 8 1 1 /ceph/osd.24/current/4.119_head/
26 0 2 4 2 8 10 0 0 /ceph/osd.24/current/4.66a_head/
26 0 2 4 2 8 10 0 0 /ceph/osd.24/current/4.29c_head/
26 0 2 4 2 8 10 0 0 /ceph/osd.24/current/4.990_head/
25 0 3 7 4 3 6 1 1 /ceph/osd.24/current/4.64_TEMP/
25 0 3 7 4 3 6 1 1 /ceph/osd.24/current/4.ace_TEMP/
24 0 3 6 4 3 6 1 1 /ceph/osd.24/current/4.f51_TEMP/
22 0 2 4 2 6 8 0 0 /ceph/osd.24/current/4.418_head/
22 0 2 4 2 6 8 0 0 /ceph/osd.24/current/4.a71_head/
22 0 2 6 2 4 6 1 1 /ceph/osd.24/current/4.685_head/
22 0 2 4 2 6 8 0 0 /ceph/osd.24/current/4.9f9_head/
22 0 2 6 2 4 6 1 1 /ceph/osd.24/current/4.518_head/
22 0 2 7 3 3 5 1 1 /ceph/osd.24/current/4.e86_TEMP/
22 0 1 9 2 3 5 1 1 /ceph/osd.24/current/4.72f_TEMP/
22 0 2 7 3 3 5 1 1 /ceph/osd.24/current/4.b35_TEMP/
22 0 2 7 3 3 5 1 1 /ceph/osd.24/current/4.fd9_TEMP/
22 0 2 7 3 3 5 1 1 /ceph/osd.24/current/4.a4_TEMP/
22 0 2 7 3 3 5 1 1 /ceph/osd.24/current/4.bd1_TEMP/
21 0 2 6 3 3 5 1 1 /ceph/osd.24/current/4.1e5_TEMP/
21 0 1 8 2 3 5 1 1 /ceph/osd.24/current/4.216_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.c11_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.2ab_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.78e_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.988_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.858_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.fbe_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.119_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.323_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.f19_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.aad_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.281_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.6f5_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.353_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.eec_TEMP/
19 0 1 7 2 3 4 1 1 /ceph/osd.24/current/4.518_TEMP/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.900_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.a24_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.a58_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.960_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.d12_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.68b_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.2b_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.896_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.fab_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.e0f_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.c0c_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.a9f_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.ba7_head/
18 0 2 4 2 4 6 0 0 /ceph/osd.24/current/4.76d_head/
18 0 1 6 2 3 4 1 1 /ceph/osd.24/current/4.5c9_TEMP/
18 0 1 6 2 3 4 1 1 /ceph/osd.24/current/4.f65_TEMP/
18 0 1 6 2 3 4 1 1 /ceph/osd.24/current/4.145_TEMP/
17 0 1 2 1 6 7 0 0 /ceph/osd.24/current/4.8bc_head/
17 0 1 6 2 2 4 1 1 /ceph/osd.24/current/4.685_TEMP/
17 0 1 2 1 6 7 0 0 /ceph/osd.24/current/4.475_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.ee_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.d22_head/
13 0 1 4 1 2 3 1 1 /ceph/osd.24/current/4.145_head/
13 0 1 4 1 2 3 1 1 /ceph/osd.24/current/4.281_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.7c_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.583_head/
13 0 1 4 1 2 3 1 1 /ceph/osd.24/current/4.aad_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.e82_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.eab_head/
13 0 1 4 1 2 3 1 1 /ceph/osd.24/current/4.6f5_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.7d5_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.f55_head/
13 0 1 4 1 2 3 1 1 /ceph/osd.24/current/4.72f_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.ad4_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.ed8_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.626_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.672_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.e2b_head/
13 0 1 2 1 4 5 0 0 /ceph/osd.24/current/4.340_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.e0c_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.c83_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.972_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.74_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.f88_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.183_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.3d5_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.7c3_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.372_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.8ff_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.d91_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.fe8_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.0_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.5af_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.41_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.5a_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.ca4_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.502_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.383_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.cee_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.ccc_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.274_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.7ab_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.7da_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.c51_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.66b_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.e1_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.d79_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.edb_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.cdd_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.18c_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.235_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.9d_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.cc8_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.cbc_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.212_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.b06_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.5d6_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.209_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.425_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.9af_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.a17_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.f42_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.d07_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.a2f_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.e6_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.e39_head/
9 0 1 2 1 2 3 0 0 /ceph/osd.24/current/4.a49_head/
8 0 0 2 0 2 2 1 1 /ceph/osd.24/current/4.eec_head/
5 0 2 0 0 0 3 0 0 /ceph/osd.24/current/omap/
5 1 0 0 0 2 2 0 0 /ceph/osd.24/current/4.182_head/
5 0 0 2 1 1 0 0 1 /ceph/osd.24/current/4.5ab_TEMP/
5 1 0 0 0 2 2 0 0 /ceph/osd.24/current/4.d4c_head/
5 1 0 0 0 2 2 0 0 /ceph/osd.24/current/4.32e_head/
4 0 0 1 1 1 0 0 1 /ceph/osd.24/current/4.de2_TEMP/
4 0 0 0 0 2 2 0 0 /ceph/osd.24/current/4.74a_head/
4 0 0 0 0 2 2 0 0 /ceph/osd.24/current/4.96_head/
2 0 0 1 0 0 0 0 1 /ceph/osd.24/current/4.847_head/
1 0 0 0 1 0 0 0 0 /ceph/osd.24/current/4.372_TEMP/
1 0 0 0 1 0 0 0 0 /ceph/osd.24/current/4.235_TEMP/
1 0 0 0 1 0 0 0 0 /ceph/osd.24/current/4.a2f_TEMP/
1 0 0 0 1 0 0 0 0 /ceph/osd.24/current/4.649_TEMP/
1 0 0 0 1 0 0 0 0 /ceph/osd.24/current/4.e_head/

Actions #4

Updated by Stefan Priebe almost 11 years ago

Full backtrace (while recovering):
http://pastebin.com/raw.php?i=DWGHiNP6

2nd full backtrace:
http://pastebin.com/raw.php?i=KvPrXdaz

Actions #5

Updated by Stefan Priebe almost 11 years ago

I could hide this problem by raising "osd recovery delay start" to 120, but then the overall recovery time is very high - even after 120s? Did i understand this option wrong?

Actions #6

Updated by Stefan Priebe almost 11 years ago

It did defer the recovery again and again. Some example log output:

2013-06-22 15:07:20.187878 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:21.188003 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:21.188084 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:22.188136 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:22.188180 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:23.188329 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:23.188371 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:07:23.862535
2013-06-22 15:07:24.054136 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:24.054131
2013-06-22 15:07:24.188494 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:24.054131
2013-06-22 15:07:24.188551 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:24.054131
2013-06-22 15:07:24.355476 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:24.355439
2013-06-22 15:07:25.045116 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:25.188681 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:25.188725 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:26.188829 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:26.188872 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:27.188850 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:27.188861 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:28.189089 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:28.189139 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:29.189286 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:29.189327 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:30.189455 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:30.189504 7f8c3ec9b700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075
2013-06-22 15:07:31.189642 7f8c3f49c700 15 osd.44 96832 _recover_now defer until 2013-06-22 15:09:25.045075

Actions #7

Updated by Stefan Priebe almost 11 years ago

g_conf->osd_recovery_max_active is 5 but on a freshly started osd i'm seeing log messages like these ...2013-06-22 ... dequeue=0 (30/5 rops)

log file uploaded to cephdrop folder 5401/ceph-osd.44.log.gz

Actions #8

Updated by Stefan Priebe almost 11 years ago

It seems locking and unlocking the mutex isn't working correctly, so multiple threads seem to raise recovery_ops_active while others are still calculating max.

void OSD::do_recovery(PG *pg)
{
  // see how many we should try to start.  note that this is a bit racy.
  recovery_wq.lock();
  int max = g_conf->osd_recovery_max_active - recovery_ops_active;
  recovery_wq.unlock();

Actions #9

Updated by Stefan Priebe almost 11 years ago

max may also be negative, but there is only a check for == 0 in OSD.cc.

Actions #10

Updated by Stefan Priebe almost 11 years ago

This one seems to "hide" the problem:
commit 9fe5611fdd7374654ad58185fa3988e216c52f08
Author: Stefan Priebe <>
Date: Sat Jun 22 23:01:17 2013 +0200

max may be negative

diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 7ed62c7..9f2711d 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -6049,7 +6049,7 @@ void OSD::do_recovery(PG *pg)
   recovery_wq.lock();
   int max = g_conf->osd_recovery_max_active - recovery_ops_active;
   recovery_wq.unlock();
-  if (max == 0) {
+  if (max <= 0) {
     dout(10) << "do_recovery raced and failed to start anything; requeuing " << *pg << dendl;
     recovery_wq.queue(pg);
   } else {

Actions #11

Updated by Sage Weil almost 11 years ago

  • Assignee set to Sage Weil
Actions #12

Updated by Sage Weil almost 11 years ago

  • Status changed from New to In Progress

that fix looks correct, but i'm not sure it is the problem here.. if i'm reading this right, blowing past the max will mean that we continue to start recovery ops even when we shouldn't. that should make things go faster, not slower. looking more closely at the logs now..

Actions #13

Updated by Stefan Priebe almost 11 years ago

Yes generally this check needs to be fixed. But it does not solve the problem.

To be a bit more specific.

First, the slow requests and VM hangs are the result of a heavily stressed disk. I see a MASSIVE amount of concurrent READ and WRITE requests to the disk from the osd daemon. But it's definitely not normal client I/O; all other disks are barely in use. I've no idea what the OSD daemon is doing. This stops after 2-3 minutes and then the recovery starts to be fast and i don't see any slow requests. This is around the same time i'm starting to see only a max of 5/5 rops. Even with my small fix the recovery_ops_active counter is still racy.

void OSD::do_recovery(PG *pg)
{
  // see how many we should try to start.  note that this is a bit racy.
  recovery_wq.lock();

The threads wait at this lock and then proceed to calculate a max value. If this one is > 0 they all proceed, possibly within a very short time frame, but the recovery_ops_active counter is only raised later, in OSD::start_recovery_op.

So i think to fix this correctly the counter needs to be increased under the same lock as the max check, and in case of failures or other conditions it needs to get decremented again. It's safer to have fewer rops in progress than to exceed the maximum.

Actions #14

Updated by Greg Farnum almost 11 years ago

Stefan Priebe wrote:

First, the slow requests and VM hangs are the result of a heavily stressed disk. I see a MASSIVE amount of concurrent READ and WRITE requests to the disk from the osd daemon. But it's definitely not normal client I/O; all other disks are barely in use. I've no idea what the OSD daemon is doing. This stops after 2-3 minutes and then the recovery starts to be fast and i don't see any slow requests.

Can you expand on this part?

Actions #15

Updated by Stefan Priebe almost 11 years ago

Maybe this is what's missing: the backtraces i provided and the logs are from that timeframe. Also it's the timeframe where i'm seeing 20-40 rops instead of 5.

Actions #16

Updated by Greg Farnum almost 11 years ago

Did you put the logs in cephdrop?
I'm mostly curious to see if we can figure out where the disk thrashing is coming from — what you're describing sounds to me like the log compaction stuff we were running into with peering, but you say peering has already completed, so I'm not sure what it could be. (Unless it's way behind on OSDMaps or something...maybe?)

Actions #17

Updated by Sage Weil almost 11 years ago

i've pushed wip-5021 which makes a small optimization that will improve recovery for small objects. it's not well tested, so only try this on test clusters pls.

looking further, tho, i think the real problem is with the rops code, which looks.. broken. playing with it now.

Actions #18

Updated by Stefan Priebe almost 11 years ago

Yes see comment #7 (log file uploaded to cephdrop folder 5401/ceph-osd.44.log.gz).

Yes peering is already fully completed.

Actions #19

Updated by Stefan Priebe almost 11 years ago

thanks sage, i tested the wip-5401 branch - no change ;-(

Actions #20

Updated by Sage Weil almost 11 years ago

Stefan, can you capture an osd log that covers the entire period from before recovery starts up until when it slows/stops?

Thanks!

Actions #21

Updated by Sage Weil almost 11 years ago

pushed another wip-5401 that resolves the race more thoroughly. tested lightly and it behaves. but i haven't been able to reproduce what you're seeing...

Actions #22

Updated by Stefan Priebe almost 11 years ago

Will test soon. Can also provide a complete log. Just debug osd 20? everything else can be 0?

Actions #23

Updated by Sage Weil almost 11 years ago

debug osd = 20
debug ms = 1

thanks!

Actions #24

Updated by Stefan Priebe almost 11 years ago

OK, this is a complete log from starting a stopped OSD, letting it recover and then shutting it down again. This one was done using the wip-5401 branch (a1a930b24c417ea3f5388855b3a137cc9c21c15d).

I've monitored the whole process using iostat. Disk I/O gets reduced at around 00:09:01.

Logfile is uploaded to
cephdrop: 5401/ceph-osd.3.log.gz

Actions #25

Updated by Sage Weil almost 11 years ago

hmm, i don't see anything awry here. the rops count does go over the max because we get client io requests to missing objects, which triggers immediate recovery. which is good. it may be that the nodes we are pulling from are also busy...

i spot checked a few slow object recoveries and they are slow just because client io has priority, and there are only 5 ops in flight.

can you try

- increasing osd recovery max active (to something > 5)
- increase osd recovery op priority (to something > 10 and < 63, the client request priority)

Actions #26

Updated by Stefan Priebe almost 11 years ago

Hi sage, very strange. I tried your suggestions, but setting recovery max active to 20 and priority to 60 does not help either.

Wouldn't it generally make sense to redirect client I/O for not-yet-recovered objects instead of trying to recover them immediately?

The other osds are definitely not so busy that they can't deliver.

Just an idea: what would happen if there are 100 frequently used pgs out of the 300 each osd has? Wouldn't this mean that this osd has to recover 100 pgs in < 30s? This seems unrealistic to me. So again my question is why the osd tries to recover these pgs immediately.

Also I'm still seeing a lot of I/O in the first minutes after start.

Actions #27

Updated by Stefan Priebe almost 11 years ago

mikedawson told me on irc that he has the same problem - since upgrading from bobtail to cuttlefish

Actions #28

Updated by Stefan Priebe almost 11 years ago

While recovering i'm seeing two threads of the OSD doing 10-15MB/s of reading and 10-20MB/s of writing in parallel. The heavy reads in parallel are the problem. What does the OSD do here?

Actions #29

Updated by Stefan Priebe almost 11 years ago

Generally i think it would be best to optionally block/redirect client I/O on a recovering OSD, so it can recover "peacefully" and then get back to work.

Actions #30

Updated by Sage Weil almost 11 years ago

Stefan Priebe wrote:

While recovering i'm seeing two threads of the OSD doing 10-15MB/s of reading and 10-20MB/s of writing in parallel. The heavy reads in parallel are the problem. What does the OSD do here?

The reads are coming out of osd_op_threads, which defaults to 2. You might try adjusting that. Writes happen from filestore_op_threads (also defaults 2), although they have to pass through the osd_op_threads first (generally without doing io).

Actions #31

Updated by Stefan Priebe almost 11 years ago

thanks sage - i already played with the threading but that's not the problem. The problem is 2x 10-15MB/s of reading while also having 2x 10-20MB/s of writing. That's too much for a disk. Normal disk load is 0-10MB/s of writing, not more, and no reading at all (everything is in the linux cache). While recovering it is as much as the disk can handle in reading AND writing. That this results in SLOW requests is clear.

What about my idea to redirect / block client I/O for a recovering OSD? Is this too complex?

Actions #32

Updated by Sage Weil almost 11 years ago

It is a good idea. I have to think a bit more about how hard it would be to implement. Currently the pg_temp remapping is triggered deterministically based on which peers have what pg data, but this would require that we introduce an outside factor in deciding what to do... probably we'll need an osdmap flag that means something like 'avoid me for primary, please' so that everyone agrees. We've talked about doing this in other contexts as well as a general means of balancing load (by effectively moving the read workload around without touching writes).

If this is the issue we're narrowing in on, it's worth noting that this is no different than what bobtail (or even argonaut) do.

Actions #33

Updated by Stefan Priebe almost 11 years ago

I'm not really sure that this is the problem, because it was working fine under bobtail. But i can't rule out that the load was lower under bobtail and i just did not hit this one.

What i still don't understand is the read workload. Where does it come from? The files themselves should still be in the linux cache. Is there any chance to get an idea which files ceph reads / writes and at which speed?

Actions #34

Updated by Sage Weil almost 11 years ago

Stefan Priebe wrote:

I'm not really sure that this is the problem, because it was working fine under bobtail. But i can't rule out that the load was lower under bobtail and i just did not hit this one.

What i still don't understand is the read workload. Where does it come from? The files themselves should still be in the linux cache. Is there any chance to get an idea which files ceph reads / writes and at which speed?

If you set debug filestore = 20 you can see exactly what is being read or written. I'm interested in seeing what it is doing...

Actions #35

Updated by Stefan Priebe almost 11 years ago

OK Log uploaded to:
/home/cephdrop/5401/debug_filestore_20_ceph-osd.3.log.gz

The high read rates started at 20:22:22.

Actions #36

Updated by Stefan Priebe almost 11 years ago

Just to note: the log was done with filestore_queue_max_ops = 50 and filestore_queue_committing_max_ops = 50. I had tried this for testing and forgot to remove it, but as the effect is the same i hope the log is still ok.

Actions #37

Updated by Sage Weil almost 11 years ago

all of the read activity is from using the clone when recovering snapped objects. can you see how things perform with 'osd recovery clone overlap = false'? this will result in more data copied over the network, but will avoid slower read/write/fsync cycles on the recovering node.

Actions #38

Updated by Stefan Priebe almost 11 years ago

new files here - still a lot of reading goes on /home/cephdrop/5401/overlap_false_ceph-osd.3.log.gz

Actions #39

Updated by Stefan Priebe almost 11 years ago

Log still contains:
2013-06-28 20:51:29.928755 7fe5c33ba700 20 filestore(/ceph/osd.3/) _do_clone_range 3260416~8192 to 3260416
2013-06-28 20:51:29.928758 7fe5c33ba700 20 filestore(/ceph/osd.3/) _do_clone_range using copy
2013-06-28 20:51:29.928759 7fe5c33ba700 20 filestore(/ceph/osd.3/) _do_copy_range 3260416~8192 to 3260416
2013-06-28 20:51:29.928768 7fe5c33ba700 20 filestore(/ceph/osd.3/) _do_copy_range 3260416~8192 to 3260416 = 8192

Or what did you mean by "using the clone when recovering snapped objects"?

Actions #40

Updated by Stefan Priebe almost 11 years ago

Oh, i forgot to say that in the last log i activated debug_filestore later on, at 20:51:02. It was 0 before.

Actions #41

Updated by Sage Weil almost 11 years ago

oops, it's 'osd recover clone overlap = false' (no y)

Actions #42

Updated by Stefan Priebe almost 11 years ago

mhm, retested. The log looks different but the result is still the same. Strangely, after processing around 100 pgs, recovering the rest works fine with no slow requests.

Actions #43

Updated by Stefan Priebe almost 11 years ago

Just another point regarding my idea. If we have an osd that receives writes to an average of 50 PGs/min and it was, for example, 5 min down, then ceph now wants this osd to recover all of them (up to 50 x 5 = 250 PGs) in < 30s? This seems unrealistic to me. Or did i get it wrong?

Actions #44

Updated by Stefan Priebe almost 11 years ago

The following seems to help:
- osd recovery max active = 2
- filestore min sync interval = 1
- filestore max sync interval = 15
- filestore op threads = 1
- filestore queue max ops = 1000
- filestore_queue_committing_max_ops = 10000

This does not fix it, but it minimizes the effect.

Actions #45

Updated by Stefan Priebe almost 11 years ago

Sage - any news / ideas?

Actions #46

Updated by Samuel Just almost 11 years ago

What prompted you to set 'filestore min sync interval' and 'filestore max sync interval'? The defaults are 0.01 and 5 respectively. Changing especially the min could have surprising consequences.

Actions #47

Updated by Stefan Priebe almost 11 years ago

Nobody - i was just helpless and tried to find a solution myself.

Actions #48

Updated by Stefan Priebe almost 11 years ago

Sorry, solution is the wrong word - it minimizes the effect (it's still there, but not as much as before).

Actions #49

Updated by Stefan Priebe almost 11 years ago

OK reverting them to the defaults seems to work fine too.

Actions #50

Updated by Stefan Priebe almost 11 years ago

Still seeing those SLOW requests and client crashes on cuttlefish. Any idea what to try? i've now expanded the cluster to have lower I/O per OSD, but this does not help either.

6 hosts 24 OSDs upgraded to 9 hosts 36 OSDs

Actions #51

Updated by Sage Weil almost 11 years ago

  • Status changed from In Progress to 12
  • Assignee deleted (Sage Weil)
  • Priority changed from Urgent to High

sorry, just can't prioritize this right now. not getting similar reports from other users, and need to focus on making sure these problems are all resolved for the dumpling release.

the current thread on ceph-devel suggests the peering may be part of the slow request problem, btw... which i think is unrelated to the slow recovery rate.

also note that we are spending significant effort to tune recovery rates better for dumpling.

Actions #52

Updated by Sage Weil almost 11 years ago

  • Status changed from 12 to Can't reproduce