Bug #5084: osd: slow peering after osd restart (bobtail)

Added by Sage Weil almost 11 years ago. Updated over 10 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Severity: 3 - minor

Files

osd-tree (3.83 KB) Faidon Liambotis, 05/16/2013 11:26 AM
osd-dump (32.4 KB) Faidon Liambotis, 05/16/2013 11:26 AM
ceph.log (280 KB) Faidon Liambotis, 05/16/2013 11:26 AM
ceph.log-2 (152 KB) Faidon Liambotis, 05/16/2013 01:56 PM
recovery-wait.txt (458 Bytes) Faidon Liambotis, 05/16/2013 01:56 PM
peering.txt (162 KB) Faidon Liambotis, 05/16/2013 01:56 PM
ceph-osd.6.log.bz2 (358 KB) Faidon Liambotis, 05/17/2013 10:29 AM
peering-osd6.txt (48.5 KB) Faidon Liambotis, 05/17/2013 10:29 AM
ceph.log-osd6 (58 KB) Faidon Liambotis, 05/17/2013 10:29 AM
node03-ceph-clean.log (84.3 KB) Igor Lukyanov, 05/23/2013 01:13 PM
node03-iostat-clean.log (23.5 KB) Igor Lukyanov, 05/23/2013 01:13 PM
slow-peering-cuttlefish (187 KB) Faidon Liambotis, 06/06/2013 08:21 AM
5084-jun19.log (201 KB) Faidon Liambotis, 06/18/2013 05:43 PM

Actions #1

Updated by Faidon Liambotis almost 11 years ago

Thanks for opening this. Attached are the osd dump, osd tree and ceph.log from right after I did "restart ceph-osd id=0". The mons are 0.56.6; most OSDs, including osd.0, are at 0.56.4.

Actions #2

Updated by Sage Weil almost 11 years ago

Two theories:

- Deep scrub is slowing things down. Can you try 'ceph osd set nodeep-scrub' and/or 'ceph osd set noscrub', wait for any in-progress scrubs to finish, and then repeat?

- One or a handful of OSDs are slow to peer. Can you run something like

while sleep 5 ; do date ; ceph pg dump | grep peering ; done | tee peering.txt

while an OSD is restarted, stopping only after everything is healthy again?

Actions #3

Updated by Faidon Liambotis almost 11 years ago

Did all that; attached are the logs and peering.txt. Peering took two minutes, with recovery_wait taking another two, so I pg dump'ed the recovery_wait pgs too.

Actions #4

Updated by Faidon Liambotis almost 11 years ago

Here's the same with osd.6, but with --debug-ms 1 as requested. Besides peering being slow, the recovery_wait phase also seems to produce a lot of slow requests for a while. Is that normal?

Actions #5

Updated by Faidon Liambotis almost 11 years ago

One interesting observation: when I restarted an OSD a few minutes after it had already been restarted and the cluster had settled, peering finished in seconds. I then tried restarting osd.0, which I had restarted yesterday (so it was already running 0.56.6); this resulted in slow peering again.

Another observation: a few times I've seen "active" (not active+clean) pgs right after peering, persisting for a few minutes. These came with slow requests, so I'm not sure whether that's normal either.

Actions #6

Updated by Sage Weil almost 11 years ago

  • Assignee changed from Sage Weil to Samuel Just
Actions #7

Updated by Samuel Just almost 11 years ago

This could be a result of writing out pg epochs on each map change. I have a branch which should greatly reduce the problem.

wip_bobtail_pg_epoch 2c5a9f0e178843e7ed514708bab137def840ab89

I just pushed that version, so verify the sha1 before using (the gitbuilder might take a little while to get to it).
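
One way to verify it (a minimal sketch; it assumes the gitbuilder package is already installed and just checks the "ceph version <ver> (<sha1>)" string that ceph-osd prints):

    #!/usr/bin/env python
    # Sketch: confirm the installed ceph-osd was built from the expected commit
    # before restarting any OSDs with it.
    import subprocess

    EXPECTED = "2c5a9f0e178843e7ed514708bab137def840ab89"

    out = subprocess.check_output(["ceph-osd", "--version"]).decode()
    if EXPECTED not in out:
        raise SystemExit("unexpected build: %s" % out.strip())
    print("ok: %s" % out.strip())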

It's based on current bobtail. Let us know if it helps.
-Sam

Actions #8

Updated by Faidon Liambotis almost 11 years ago

Something that's not clear to me: does this need to be on all peers to have an effect? In other words, would I need to upgrade all of my OSDs to a version containing this change to get faster peering?

Also, this doesn't affect Cuttlefish at all, right?

Actions #9

Updated by Igor Lukyanov almost 11 years ago

We can reproduce the same bug on both Bobtail and Cuttlefish deployments simply by marking an OSD in/out or reweighting it.
Peering completely freezes IO for a few minutes, until it finishes.
On an unpatched cluster we observe nearly 100% disk utilization during that whole period.
The provided patch made things better in the sense that disk load dropped to 50-60%, but the problem of completely stalled client IO still remains. It also seems that the patch does not reduce total peering time: peering still takes about 2 minutes on our cluster.

As far as we can tell, client ops are waiting for a new OSD map that is issued only after peering finishes.
It seems that a reasonable way to fix the problem would be to speed up issuing the new OSD map; is that possible? Thank you.

P.S. ceph and iostat logs are attached; 2 OSDs (on sda and sdb) are marked 'out' during the test.

Actions #10

Updated by Igor Lukyanov almost 11 years ago

After some testing, I want to make a small correction to my last point (that client ops wait for peering to finish completely).
Actually, ops to a specific PG start to be processed once that PG becomes active, not after peering completely finishes. Assuming the entire peering process takes T (with T on the order of minutes), the average latency of a single read op would be T/2 and the average latency of a single write op would be 3T/4 (due to waiting for the replicas). So typical latency during peering is tens of seconds, which is unacceptable.
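
A quick Monte Carlo check of those numbers (a sketch of the implied model only, not of Ceph internals: assume 3 replicas, each becoming usable at an independent time uniform in [0, T]; a read needs only the primary, a write needs all three):

    import random

    T = 120.0   # hypothetical total peering time, in seconds
    N = 100000  # number of simulated ops

    # read waits for the primary only; write waits for all 3 replicas
    read_wait = sum(random.uniform(0, T) for _ in range(N)) / N
    write_wait = sum(max(random.uniform(0, T) for _ in range(3))
                     for _ in range(N)) / N

    print("avg read wait  ~ %.1f s (T/2  = %.1f)" % (read_wait, T / 2))
    print("avg write wait ~ %.1f s (3T/4 = %.1f)" % (write_wait, 3 * T / 4))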

I guess it's unlikely that peering can be sped up by an order of magnitude, but maybe it's OK to prioritize peering of the PGs that clients are waiting on? Though the docs say "peering messages must not be reordered"...

Actions #11

Updated by Igor Lukyanov almost 11 years ago

We wrote a test that sequentially reads 1 MB blocks spaced at 64 MB offsets (i.e. from different placement groups) and prints each op's latency, and we started it at the same time as peering. Here are the timings we got:
2.384s
9.328s
2.034s
0.001s
1.179s
0.001s
0.797s
56.005s
23.488s
2.903s
38.236s
1.930s
7.875s
0.001s
3.760s
0.001s
4.768s
0.001s

In brief, peering took 140 seconds, and during that time the cluster completed about 20 sequential client ops with an average latency of 7 seconds.
We tried to fix that behavior by prioritizing peering of the placement groups that clients are waiting on (see the patch at http://pastebin.com/PR8DmAR1), but for reasons unknown to us it completely broke peering and caused all placement groups to be marked stale. We would be very grateful for any suggestions for fixing the peering problem.
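
For reference, a minimal sketch of the kind of sequential-read probe described above, using the python-rbd bindings (not our exact test; the pool name 'rbd' and the image name 'test' are placeholders):

    import time
    import rados
    import rbd

    MB = 1024 * 1024

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')      # placeholder pool name
    image = rbd.Image(ioctx, 'test')       # placeholder image name
    try:
        size = image.size()
        offset = 0
        while offset + MB <= size:
            start = time.time()
            image.read(offset, MB)         # synchronous 1 MB read
            print("%.3fs" % (time.time() - start))
            offset += 64 * MB              # jump to a different object/PG
    finally:
        image.close()
        ioctx.close()
        cluster.shutdown()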

Actions #12

Updated by Samuel Just almost 11 years ago

Igor, one main problem is that we are writing out the pg epoch to the filestore when we don't need to. The second problem is that peering involves talking to peers: prioritizing peering messages for pgs with waiting clients has the side effect of delaying messages for stray pgs, which may be blocking the primary on another OSD from going active.

Because your comparator reads mutable state that is protected by the qlock without actually holding the qlock, it may reorder the list in surprising ways.

Also, that list is only the batch to be processed in a single _process() call; the larger peering_wq is what you would need to prioritize to see a difference.

I think the best approach, though, is to decrease the overall cost of peering.
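
To illustrate the general hazard (a generic Python sketch, not the actual Ceph code): if the sort key reads state that other threads mutate under the qlock, and the sort itself does not hold that lock, the ordering is computed from a racy mix of old and new values.

    import random
    import threading

    waiting = dict((pg, 0) for pg in range(1000))  # hypothetical pg -> waiting clients
    qlock = threading.Lock()

    def clients():
        for _ in range(100000):
            with qlock:
                waiting[random.randrange(1000)] += 1

    t = threading.Thread(target=clients)
    t.start()
    # BUG: the key reads `waiting` without taking qlock, so the values change
    # underneath the sort and the result reflects no consistent snapshot.
    queue = sorted(range(1000), key=lambda pg: -waiting[pg])
    t.join()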

Actions #13

Updated by Samuel Just almost 11 years ago

Faidon, it shouldn't affect cuttlefish as much, though that is not entirely clear. That patch would need to be installed on all OSDs. It's not well tested yet, so you probably want to use it only on a test cluster at this time.

I updated the branch to sha1: 99831601474eddd15da72aaff421be83c02eaf46.

Actions #14

Updated by John Nielsen almost 11 years ago

I just want to add that I am definitely seeing this behavior on Cuttlefish. We run a number of VMs atop RBD, and any time peering is triggered it stalls I/O on all of them. Just now I restarted a single OSD (out of 16) and had I/O stall for over 6 minutes:

[WRN] 7 slow requests, 1 included below; oldest blocked for > 401.472296 secs

Once peering is (mostly?) complete, I/O to the VMs continues as normal. Recovery of degraded objects doesn't have nearly the same impact; it proceeds without problems once the big peering delay is over.

This makes the RBD-backed VMs seem fragile, since any time we do maintenance they go out to lunch. I would love to see a solution for this in a future Cuttlefish release.

Actions #15

Updated by John Nielsen almost 11 years ago

Some more details about my setup:
Hosts are CentOS 6.4 + elrepo kernel-ml. Ceph is cuttlefish (0.61.2) from the official packages. OSDs are on rotating disks using btrfs, with journals on LVM volumes on SSD.
The majority of traffic is from VM servers using libvirt + qemu + KVM + librbd.
This cluster was built with bobtail originally, and I recall seeing similar behavior before the upgrade to cuttlefish, just not as pronounced.
Anecdotally, the delays seem to grow with the amount of data in the cluster or pool.
Let me know what other information I can provide that would be helpful.

Actions #16

Updated by Faidon Liambotis almost 11 years ago

So, I've upgraded my whole cluster to cuttlefish git (7d549cb), mainly to address this issue. The tree I've installed includes a5f6d89, "PG: don't write out pg map epoch every handle_activate_map".

Unfortunately, it doesn't seem to have made any difference for this issue. Attached is the mon log after a "restart ceph-osd id=1".

This is especially disappointing, since the combination of this bug with the Cuttlefish PG upgrade (which takes about an hour per OSD) resulted in multiple independent upgrade cycles that took the cluster effectively down for the better part of two days.

Actions #17

Updated by Faidon Liambotis almost 11 years ago

I've uploaded slowpeer-ceph-osd.2.log.bz2 (--debug-ms=1 --debug-filestore=5 --debug-osd=20) and slowpeer-osd2-ceph.log from an OSD restart to cephdrop. The peering phase didn't last long this time (30s), but there were slow requests with pgs in "active" (not active+clean) and active+recovery_wait.
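
(For reference, those debug levels correspond to a ceph.conf fragment along these lines, assuming they were set for that daemon before the restart:)

    [osd.2]
        debug ms = 1
        debug filestore = 5
        debug osd = 20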

Actions #18

Updated by John Nielsen almost 11 years ago

I updated my cluster from 0.61.2 to 0.61.3 and can see a noticeable improvement. There are still some I/O stalls while PGs are peering, but peering doesn't take nearly as long, so the impact is much smaller. At least I'm not afraid to restart hosts or daemons any more. Further improvements to peering time and/or to the ability to process I/O (especially writes) while peering would still be welcome.

Actions #19

Updated by Faidon Liambotis almost 11 years ago

Just for the record:

We did a troubleshooting/log-collecting session with Sam last week. It seems that the issue shows up when an OSD has been running for a while, i.e. the first restart always results in slow peering (ranging from 40s to 2min+), while subsequent restarts peer instantaneously. We tested a few ideas and the outcome (to my understanding) is that it is a local issue and not pagecache-related. Sam had a theory about the logs, which would be addressed by a refactoring -- I'm unsure whether this is still the prevailing theory.

Actions #20

Updated by Faidon Liambotis almost 11 years ago

I tried wip_cuttlefish_compact_on_startup today. First, I upgraded one box to 0.61.3-47-g47f1bed-1precise.

Then, without modifying the config file, I stopped osd.0 (as a control) and waited: peering took 6s. Then I started osd.0 again, and peering took a little less than 4 minutes. The weird thing is that monitor elections were called once on the stop and three times on the start, all while peering (there are no other elections in today's logs, so this isn't a coincidence). I didn't observe that in later runs; perhaps it is something to do with the monitor's pagecache?

I then set "osd compact leveldb on mount = true" in my config, stopped osd.2 and waited; peering took 30s, without starting an OSD. No monitor elections this time. I then started osd.2; according to the logs, compacting leveldb took 4m20s. After that, peering took 14s. I did hit the slow recovery bug here though, with slow requests coming up until all pgs settled (~2 minutes in total).

I then repeated the same test with osd.1. No slow peering on stop at all here; compacting took 3m52s, peering took 25s, and complete recovery took 1m48s.

All tests were done on a very idle cluster, with everything but the test box running 0.61.3. I'm also attaching the monitor log covering the above events.
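
For reference, the compaction option mentioned above as a ceph.conf fragment (a sketch; the [osd] section is the usual place for it):

    [osd]
        osd compact leveldb on mount = true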

Actions #21

Updated by Samuel Just over 10 years ago

  • Status changed from Need More Info to Resolved

278c7b59228f614addf830cb0afff4988c9bc8cb probably took care of this?
