Bug #5084: osd: slow peering after osd restart (bobtail)

Added by Sage Weil almost 11 years ago. Updated over 10 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Files

osd-tree (3.83 KB) Faidon Liambotis, 05/16/2013 11:26 AM
osd-dump (32.4 KB) Faidon Liambotis, 05/16/2013 11:26 AM
ceph.log (280 KB) Faidon Liambotis, 05/16/2013 11:26 AM
ceph.log-2 (152 KB) Faidon Liambotis, 05/16/2013 01:56 PM
recovery-wait.txt (458 Bytes) Faidon Liambotis, 05/16/2013 01:56 PM
peering.txt (162 KB) Faidon Liambotis, 05/16/2013 01:56 PM
ceph-osd.6.log.bz2 (358 KB) Faidon Liambotis, 05/17/2013 10:29 AM
peering-osd6.txt (48.5 KB) Faidon Liambotis, 05/17/2013 10:29 AM
ceph.log-osd6 (58 KB) Faidon Liambotis, 05/17/2013 10:29 AM
node03-ceph-clean.log (84.3 KB) Igor Lukyanov, 05/23/2013 01:13 PM
node03-iostat-clean.log (23.5 KB) Igor Lukyanov, 05/23/2013 01:13 PM
slow-peering-cuttlefish (187 KB) Faidon Liambotis, 06/06/2013 08:21 AM
5084-jun19.log (201 KB) Faidon Liambotis, 06/18/2013 05:43 PM

Actions #1

Updated by Faidon Liambotis almost 11 years ago

Thanks for opening this. Attached are the osd dump, osd tree and ceph.log from right after I did "restart ceph-osd id=0". The mons are 0.56.6; most OSDs, including osd.0, are at 0.56.4.

Actions #2

Updated by Sage Weil almost 11 years ago

Two theories:

- Deep scrub is slowing things down. Can you try 'ceph osd set nodeep-scrub' and/or 'ceph osd set noscrub', wait for any in-progress scrubs to finish, and then repeat?

- One or a handful of OSDs are slow to peer. Can you run something like

while sleep 5 ; do date ; ceph pg dump | grep peering ; done | tee peering.txt

while an OSD is restarted, stopping only after everything is healthy again?

Actions #3

Updated by Faidon Liambotis almost 11 years ago

Did all that; attached are the logs and peering.txt. Peering took two minutes, with recovery_wait taking another two, so I pg dump'ed the recovery_wait pgs too.

Actions #4

Updated by Faidon Liambotis almost 11 years ago

Here's the same with osd.6, but with --debug-ms 1 as requested. Besides peering being slow, the recovery_wait phase also seems to produce a lot of slow requests for a while. Is that normal?

Actions #5

Updated by Faidon Liambotis almost 11 years ago

One interesting observation: when I restarted an OSD a few minutes after it had already been restarted and the cluster had settled, peering finished in seconds. I then tried restarting osd.0, which I had restarted yesterday (so it was already running 0.56.6); this resulted in slow peering again.

Another observation: a few times I've seen "active" (not active+clean) pgs right after peering, persisting for a few minutes. These came with slow requests, so I'm not sure whether that's normal either.

Actions #6

Updated by Sage Weil almost 11 years ago

  • Assignee changed from Sage Weil to Samuel Just
Actions #7

Updated by Samuel Just almost 11 years ago

This could be a result of writing out pg epochs on each map change. I have a branch which should greatly reduce the problem.

wip_bobtail_pg_epoch 2c5a9f0e178843e7ed514708bab137def840ab89

I just pushed that version, so verify the sha1 before using (the gitbuilder might take a little while to get to it).
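
One way to verify it (a minimal sketch; it assumes the gitbuilder package is already installed and just checks the "ceph version <ver> (<sha1>)" string that ceph-osd prints):

    #!/usr/bin/env python
    # Sketch: confirm the installed ceph-osd was built from the expected commit
    # before restarting any OSDs with it.
    import subprocess

    EXPECTED = "2c5a9f0e178843e7ed514708bab137def840ab89"

    out = subprocess.check_output(["ceph-osd", "--version"]).decode()
    if EXPECTED not in out:
        raise SystemExit("unexpected build: %s" % out.strip())
    print("ok: %s" % out.strip())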

It's based on current bobtail. Let us know if it helps.
-Sam

Actions #8

Updated by Faidon Liambotis almost 11 years ago

Something that's not clear to me: does this need to be on all peers to have an effect? In other words, would I need to upgrade all of my OSDs to a version containing this change to get faster peering?

Also, this doesn't affect Cuttlefish at all, right?

Actions #9

Updated by Igor Lukyanov almost 11 years ago

We can reproduce the same bug on both Bobtail and Cuttlefish deployments simply by marking an OSD in/out or reweighting it.
Peering completely freezes IO for a few minutes, until it finishes.
On an unpatched cluster we observe nearly 100% disk utilization during that whole period.
The provided patch made things better in the sense that disk load dropped to 50-60%, but the problem of completely stalled client IO still remains. It also seems that the patch does not reduce total peering time: peering still takes about 2 minutes on our cluster.

As far as we can tell, client ops are waiting for a new OSD map that is issued only after peering finishes.
It seems that a reasonable way to fix the problem would be to speed up issuing the new OSD map; is that possible? Thank you.

P.S. ceph and iostat logs are attached; 2 OSDs (on sda and sdb) are marked 'out' during the test.

Actions #10

Updated by Igor Lukyanov almost 11 years ago

After some testing, I want to make a small correction to my last point (that client ops wait for peering to finish completely).
Actually, ops to a specific PG start to be processed once that PG becomes active, not after peering completely finishes. Assuming the entire peering process takes T (with T on the order of minutes), the average latency of a single read op would be T/2 and the average latency of a single write op would be 3T/4 (due to waiting for the replicas). So typical latency during peering is tens of seconds, which is unacceptable.
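
A quick Monte Carlo check of those numbers (a sketch of the implied model only, not of Ceph internals: assume 3 replicas, each becoming usable at an independent time uniform in [0, T]; a read needs only the primary, a write needs all three):

    import random

    T = 120.0   # hypothetical total peering time, in seconds
    N = 100000  # number of simulated ops

    # read waits for the primary only; write waits for all 3 replicas
    read_wait = sum(random.uniform(0, T) for _ in range(N)) / N
    write_wait = sum(max(random.uniform(0, T) for _ in range(3))
                     for _ in range(N)) / N

    print("avg read wait  ~ %.1f s (T/2  = %.1f)" % (read_wait, T / 2))
    print("avg write wait ~ %.1f s (3T/4 = %.1f)" % (write_wait, 3 * T / 4))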

I guess it's unlikely that peering can be sped up by an order of magnitude, but maybe it's OK to prioritize peering of the PGs that clients are waiting on? Though the docs say "peering messages must not be reordered"...

Actions #11

Updated by Igor Lukyanov almost 11 years ago

We wrote a test that sequentially reads 1 MB blocks spaced at 64 MB offsets (i.e. from different placement groups) and prints each op's latency, and we started it at the same time as peering. Here are the timings we got:
2.384s
9.328s
2.034s
0.001s
1.179s
0.001s
0.797s
56.005s
23.488s
2.903s
38.236s
1.930s
7.875s
0.001s
3.760s
0.001s
4.768s
0.001s

In brief, peering took 140 seconds, and during that time the cluster completed about 20 sequential client ops with an average latency of 7 seconds.
We tried to fix that behavior by prioritizing peering of the placement groups that clients are waiting on (see the patch at http://pastebin.com/PR8DmAR1), but for reasons unknown to us it completely broke peering and caused all placement groups to be marked stale. We would be very grateful for any suggestions for fixing the peering problem.
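
For reference, a minimal sketch of the kind of sequential-read probe described above, using the python-rbd bindings (not our exact test; the pool name 'rbd' and the image name 'test' are placeholders):

    import time
    import rados
    import rbd

    MB = 1024 * 1024

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')      # placeholder pool name
    image = rbd.Image(ioctx, 'test')       # placeholder image name
    try:
        size = image.size()
        offset = 0
        while offset + MB <= size:
            start = time.time()
            image.read(offset, MB)         # synchronous 1 MB read
            print("%.3fs" % (time.time() - start))
            offset += 64 * MB              # jump to a different object/PG
    finally:
        image.close()
        ioctx.close()
        cluster.shutdown()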

Actions #12

Updated by Samuel Just almost 11 years ago

Igor, one main problem is that we are writing out the pg epoch to the filestore when we don't need to. The second problem is that peering involves talking to peers: prioritizing peering messages for pgs with waiting clients has the side effect of delaying messages for stray pgs, which may be blocking the primary on another OSD from going active.

Because your comparator reads mutable state that is protected by the qlock without actually holding the qlock, it may reorder the list in surprising ways.

Also, that list is only the batch to be processed in a single _process() call; the larger peering_wq is what you would need to prioritize to see a difference.

I think the best approach, though, is to decrease the overall cost of peering.
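
To illustrate the general hazard (a generic Python sketch, not the actual Ceph code): if the sort key reads state that other threads mutate under the qlock, and the sort itself does not hold that lock, the ordering is computed from a racy mix of old and new values.

    import random
    import threading

    waiting = dict((pg, 0) for pg in range(1000))  # hypothetical pg -> waiting clients
    qlock = threading.Lock()

    def clients():
        for _ in range(100000):
            with qlock:
                waiting[random.randrange(1000)] += 1

    t = threading.Thread(target=clients)
    t.start()
    # BUG: the key reads `waiting` without taking qlock, so the values change
    # underneath the sort and the result reflects no consistent snapshot.
    queue = sorted(range(1000), key=lambda pg: -waiting[pg])
    t.join()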

Actions #13

Updated by Samuel Just almost 11 years ago

Faidon, it shouldn't affect cuttlefish as much, though that is not entirely clear. That patch would need to be installed on all OSDs. It's not well tested yet, so you probably want to use it only on a test cluster at this time.

I updated the branch to sha1: 99831601474eddd15da72aaff421be83c02eaf46.

Actions #14

Updated by John Nielsen almost 11 years ago

I just want to add that I am definitely seeing this behavior on Cuttlefish. We run a number of VMs atop RBD, and any time peering is triggered it stalls I/O on all of them. Just now I restarted a single OSD (out of 16) and had I/O stall for over 6 minutes:

[WRN] 7 slow requests, 1 included below; oldest blocked for > 401.472296 secs

Once peering is (mostly?) complete, I/O to the VMs continues as normal. Recovery of degraded objects doesn't have nearly the same impact; it proceeds without problems once the big peering delay is over.

This makes the RBD-backed VMs seem fragile, since any time we do maintenance they go out to lunch. I would love to see a solution for this in a future Cuttlefish release.

Actions #15

Updated by John Nielsen almost 11 years ago

Some more details about my setup:
Hosts are CentOS 6.4 + elrepo kernel-ml. Ceph is cuttlefish (0.61.2) from the official packages. OSDs are on rotating disks using btrfs, with journals on LVM volumes on SSD.
The majority of traffic is from VM servers using libvirt + qemu + KVM + librbd.
This cluster was built with bobtail originally, and I recall seeing similar behavior before the upgrade to cuttlefish, just not as pronounced.
Anecdotally, the delays seem to grow with the amount of data in the cluster or pool.
Let me know what other information I can provide that would be helpful.

Actions #16

Updated by Faidon Liambotis almost 11 years ago

So, I've upgraded my whole cluster to cuttlefish git (7d549cb), mainly to address this issue. The tree I've installed includes a5f6d89, "PG: don't write out pg map epoch every handle_activate_map".

Unfortunately, it doesn't seem to have made any difference for this issue. Attached is the mon log after a "restart ceph-osd id=1".

This is especially disappointing, since the combination of this bug with the Cuttlefish PG upgrade (which takes about an hour per OSD) resulted in multiple independent upgrade cycles that took the cluster effectively down for the better part of two days.

Actions #17

Updated by Faidon Liambotis almost 11 years ago

I've uploaded slowpeer-ceph-osd.2.log.bz2 (--debug-ms=1 --debug-filestore=5 --debug-osd=20) and slowpeer-osd2-ceph.log from an OSD restart to cephdrop. The peering phase didn't last long this time (30s), but there were slow requests with pgs in "active" (not active+clean) and active+recovery_wait.
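
(For reference, those debug levels correspond to a ceph.conf fragment along these lines, assuming they were set for that daemon before the restart:)

    [osd.2]
        debug ms = 1
        debug filestore = 5
        debug osd = 20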

Actions #18

Updated by John Nielsen almost 11 years ago

I updated my cluster from 0.61.2 to 0.61.3 and can see a noticeable improvement. There are still some I/O stalls while PGs are peering, but peering doesn't take nearly as long, so the impact is much smaller. At least I'm not afraid to restart hosts or daemons any more. Further improvements to peering time and/or to the ability to process I/O (especially writes) while peering would still be welcome.

Actions #19

Updated by Faidon Liambotis almost 11 years ago

Just for the record:

We did a troubleshooting/log-collecting session with Sam last week. It seems that the issue shows up when an OSD has been running for a while, i.e. the first restart always results in slow peering (ranging from 40s to 2min+), while subsequent restarts peer instantaneously. We tested a few ideas and the outcome (to my understanding) is that it is a local issue and not pagecache-related. Sam had a theory about the logs, which would be addressed by a refactoring -- I'm unsure whether this is still the prevailing theory.

Actions #20

Updated by Faidon Liambotis almost 11 years ago

I tried wip_cuttlefish_compact_on_startup today. First, I upgraded one box to 0.61.3-47-g47f1bed-1precise.

Then, without modifying the config file, I stopped osd.0 (as a control) and waited: peering took 6s. Then I started osd.0 again, and peering took a little less than 4 minutes. The weird thing is that monitor elections were called once on the stop and three times on the start, all while peering (there are no other elections in today's logs, so this isn't a coincidence). I didn't observe that in later runs; perhaps it is something to do with the monitor's pagecache?

I then set "osd compact leveldb on mount = true" in my config, stopped osd.2 and waited; peering took 30s, without starting an OSD. No monitor elections this time. I then started osd.2; according to the logs, compacting leveldb took 4m20s. After that, peering took 14s. I did hit the slow recovery bug here though, with slow requests coming up until all pgs settled (~2 minutes in total).

I then repeated the same test with osd.1. No slow peering on stop at all here; compacting took 3m52s, peering took 25s, and complete recovery took 1m48s.

All tests were done on a very idle cluster, with everything but the test box running 0.61.3. I'm also attaching the monitor log covering the above events.
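
For reference, the compaction option mentioned above as a ceph.conf fragment (a sketch; the [osd] section is the usual place for it):

    [osd]
        osd compact leveldb on mount = true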

Actions #21

Updated by Samuel Just over 10 years ago

  • Status changed from Need More Info to Resolved

278c7b59228f614addf830cb0afff4988c9bc8cb probably took care of this?
