Bug #4579

kclient + ffsb workload makes osds mark themselves down

Added by Sage Weil about 6 years ago. Updated almost 6 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
03/29/2013
Due date:
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

The ffsb workload hammers the server side with small requests hard enough that the OSDs get marked down. I'm guessing it is because there are too many small IO requests and the OSD gets way behind, but that is surprising.

Figure out why. Decide if the fix is on the client side or server side.


Related issues

Related to Ceph - Feature #4782: osd: build writeback model to replace async flusher Resolved 04/22/2013

Associated revisions

Revision aca0aea1 (diff)
Added by Sage Weil about 6 years ago

osd: throttle client messages by count, not just by bytes

This lets us put a cap on outstanding client IOs. This is particularly
important for clients issuing lots of small IOs.

Fixes: #4579
Signed-off-by: Sage Weil <>

Revision bbcba292 (diff)
Added by Sage Weil about 6 years ago

set 'filestore flush min = 0' for all ffsb jobs

Until we fix #4579

History

#1 Updated by Sage Weil about 6 years ago

  • Project changed from fs to Ceph

#2 Updated by Sage Weil about 6 years ago

  • Assignee set to Sage Weil

#3 Updated by Sage Weil about 6 years ago

  • Status changed from New to In Progress

#4 Updated by Ian Colle about 6 years ago

  • Target version set to v0.61 - Cuttlefish

#5 Updated by Sage Weil about 6 years ago

head explodes:

{ "num_ops": 89315,

on the osds. The OSD should throttle the number of messages, not just the number of bytes.

Maybe libceph should limit IOs in flight... maybe. I'm thinking no, since doing it on the server side is safer, and lets the client exploit very large clusters without tuning.
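The server-side fix proposed here can be sketched as a throttle with two budgets, one for op count and one for bytes, where admission requires room in both. This is an illustrative Python sketch, not Ceph's actual Throttle class; the names and limits are hypothetical:

```python
import threading

class DualThrottle:
    """Admit work only while both an op-count budget and a byte budget
    have room; block callers otherwise (mirroring how a messenger-side
    throttle can hold up reading further client messages)."""

    def __init__(self, max_ops, max_bytes):
        self.max_ops = max_ops
        self.max_bytes = max_bytes
        self.ops = 0
        self.bytes = 0
        self.cond = threading.Condition()

    def acquire(self, size):
        with self.cond:
            # Wait until *both* budgets have room: many tiny requests
            # exhaust the op-count budget long before the byte budget.
            while (self.ops + 1 > self.max_ops
                   or self.bytes + size > self.max_bytes):
                self.cond.wait()
            self.ops += 1
            self.bytes += size

    def release(self, size):
        with self.cond:
            self.ops -= 1
            self.bytes -= size
            self.cond.notify_all()
```

With a bytes-only throttle, 89,315 tiny ops fit comfortably under a byte cap; adding the count budget bounds them regardless of size.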

#6 Updated by Greg Farnum about 6 years ago

What are the actual things being measured here? The OSDs do throttle the number of ops at least at the journal level — but perhaps they don't throttle in any layers above that and so end up suiciding when the ops above get stuck.

#7 Updated by Sage Weil about 6 years ago

Yeah. The only throttling above the filestore queues is the messenger's limit on client bytes, but these IOs are small enough that we get tens of thousands of them.

#8 Updated by Sage Weil about 6 years ago

  • Status changed from In Progress to Need Review

#9 Updated by Sage Weil about 6 years ago

  • Assignee changed from Sage Weil to Greg Farnum

#10 Updated by Greg Farnum about 6 years ago

  • Status changed from Need Review to Testing

We need to figure out good defaults for this value before putting it in. 100 is unlikely to be a good default, especially compared with our other defaults (which are permissive).

#11 Updated by Greg Farnum about 6 years ago

  • Status changed from Testing to Resolved

Okay, Sam says 100 should be fine since the filestore_queue_max_ops is set at 50. I've merged this into next in 95374c628b2df020b9236b6373440d99545e4b1e.

However, I'm concerned that this improves the situation but doesn't eliminate it: even if we only let in 100 ops at a time, the journal can handle those in 40ms or less, and the filesystem can do an in-memory apply of them very quickly, but they still all need to get flushed out to disk. Testing on my local node I can get it to crash on a liveness assert even with these patches just by running "rados bench 60 write -t 200 -b 4096". We might need to add a counter for unsynced-to-backing-store as well... :/
I didn't observe any serious performance impacts in these local tests from the addition; it mostly just reduced latencies. So I guess it's not hurting anything at least on hard drives.

#12 Updated by Samuel Just about 6 years ago

  • Status changed from Resolved to Verified
  • Assignee changed from Greg Farnum to Samuel Just

Hmm. The OSDs aren't suiciding any more, but we still see 30s+ requests. We release filestore throttles when we hit applied and committed, and we release the OSD messenger throttles when a request is applied and committed at all replicas. However, we have deferred the cost of actually writing the operation out to the fs until sync time, allowing us to build up a large IO debt by the time the journal forces a sync, resulting in the 60s+ sync we see below. Perhaps the journal needs to throttle the number of unsynced IOs in addition to the size of unsynced IOs?

2013-04-16 16:22:57.347190 7f19d6fe3700 0 log [WRN] : slow request 39.947319 seconds old, received at 2013-04-16 16:22:17.399814: osd_sub_op(client.4128.0:45218 0.d 7b0ad5ed/1000000016d.00000016/head//0 [] v 11'2443 snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
2013-04-16 16:22:57.347194 7f19d6fe3700 0 log [WRN] : slow request 39.763946 seconds old, received at 2013-04-16 16:22:17.583187: osd_sub_op(client.4128.0:45230 0.3 81cbedb3/100000001cc.00000001/head//0 [] v 11'2687 snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
2013-04-16 16:22:57.347197 7f19d6fe3700 0 log [WRN] : slow request 39.412822 seconds old, received at 2013-04-16 16:22:17.934311: osd_sub_op(client.4128.0:45236 0.d 696cd76d/10000000158.00000001/head//0 [] v 11'2444 snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
2013-04-16 16:22:57.439687 7f19bc6f7700 0 -- 10.214.134.32:6807/61254 >> 10.214.134.34:6804/24428 pipe(0x7f19980c32d0 sd=26 :41120 s=2 pgs=713 cs=1 l=0).fault with nothing to send, going to standby
2013-04-16 16:22:58.347315 7f19d6fe3700 5 osd.1 11 tick
2013-04-16 16:22:58.347357 7f19d6fe3700 20 osd.1 11 scrub_should_schedule loadavg 4.08 < max 5 = yes
2013-04-16 16:22:58.347365 7f19d6fe3700 20 osd.1 11 sched_scrub load_is_low=1
2013-04-16 16:22:58.347367 7f19d6fe3700 10 osd.1 11 0.2 at 2013-04-16 16:06:36.023315 > min 2013-04-15 16:22:58.347367 (86400 seconds ago)
2013-04-16 16:22:58.347376 7f19d6fe3700 20 osd.1 11 sched_scrub done
2013-04-16 16:22:58.347379 7f19d6fe3700 10 osd.1 11 do_waiters -- start
2013-04-16 16:22:58.347380 7f19d6fe3700 10 osd.1 11 do_waiters -- finish
2013-04-16 16:22:58.934962 7f19d87e6700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f19a4ff9700' had timed out after 15
2013-04-16 16:22:58.934982 7f19d87e6700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f19a47f8700' had timed out after 15
2013-04-16 16:22:59.293570 7f19d57e0700 10 filestore(/var/lib/ceph/osd/ceph-1) sync_entry commit took 68.999479, interval was 77.510400
2013-04-16 16:22:59.293590 7f19d57e0700 10 journal commit_finish thru 19960
2013-04-16 16:22:59.293592 7f19d57e0700 5 journal committed_thru 19960 (last_committed_seq 18948)
2013-04-16 16:22:59.293639 7f19d57e0700 10 journal header: block_size 4096 alignment 4096 max_size 104857600
2013-04-16 16:22:59.293641 7f19d57e0700 10 journal header: start 71376896
2013-04-16 16:22:59.293643 7f19d57e0700 10 journal write_pos 35807232
2013-04-16 16:22:59.293646 7f19d57e0700 10 journal committed_thru done
2013-04-16 16:22:59.293648 7f19d57e0700 15 filestore(/var/lib/ceph/osd/ceph-1) sync_entry committed to op_seq 19960
2013-04-16 16:22:59.293652 7f19d57e0700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry waiting for max_interval 5.000000
2013-04-16 16:22:59.293652 7f19d4fdf700 20 journal write_thread_entry woke up
2013-04-16 16:22:59.293663 7f19d4fdf700 10 journal room 35569663 max_size 104857600 pos 35807232 header.start 71376896 top 4096
2013-04-16 16:22:59.293667 7f19d4fdf700 10 journal check_for_full at 35807232 : 8192 < 35569663
2013-04-16 16:22:59.293669 7f19d4fdf700 15 journal prepare_single_write 1 will write 35807232 : seq 25731 len 5586 -> 8192 (head 40 pre_pad 0 ebl 5586 post_pad 2526 tail 40) (ebl alignment -1)
2013-04-16 16:22:59.293721 7f19d4fdf700 10 journal room 35561471 max_size 104857600 pos 35815424 header.start 71376896 top 4096
2013-04-16 16:22:59.293723 7f19d4fdf700 10 journal check_for_full at 35815424 : 8192 < 35561471
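The hypothesis above — bounded journal and messenger throttles but unbounded unsynced IO debt — can be modeled in a few lines. This is a toy Python sketch with hypothetical names, not FileStore code: capping the count of ops deferred since the last sync bounds the debt any single sync must flush.

```python
class JournaledQueue:
    """Toy model of the failure mode in this ticket: the journal admits
    ops cheaply, but the backing filesystem only pays for them at sync
    time.  Capping the *count* of unsynced ops (not just their bytes)
    bounds the IO debt a single sync has to flush."""

    def __init__(self, max_unsynced_ops):
        self.max_unsynced_ops = max_unsynced_ops
        self.unsynced_ops = 0
        self.unsynced_bytes = 0
        self.synced_total = 0

    def submit(self, op_size):
        if self.unsynced_ops >= self.max_unsynced_ops:
            # In the real system the submitter would block until the sync
            # thread drains the debt; the toy model just syncs inline.
            self.sync()
        self.unsynced_ops += 1
        self.unsynced_bytes += op_size

    def sync(self):
        # Flush everything deferred since the last sync.  Its cost is
        # proportional to the accumulated debt, which is now bounded.
        self.synced_total += self.unsynced_ops
        self.unsynced_ops = 0
        self.unsynced_bytes = 0
```

Without the cap, nothing stops unsynced_ops from growing until a forced sync takes the 60s+ seen in the log above.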

#13 Updated by Sage Weil about 6 years ago

Note that the message recv stamp (which is used by the slow req warnings) is the time stamp before Pipe::read_message() waits on the throttler. What are you doing to reproduce this? It might be that just moving the stamp to be when we receive the entire message, or post-throttle, is sufficient to make some of the warnings go away.

Won't help with the 60s sync, though. Hrm.

#14 Updated by Greg Farnum about 6 years ago

We depend on those stamps in other places in order to be able to count throttle time (thus the distinction between recv_stamp and dispatch_stamp). We can redefine meanings if we want, but throttled time is something we should count so be cognizant of it if any changes do occur.

#15 Updated by Samuel Just about 6 years ago

Passes with filestore_flush_min = 0.
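For reference, the workaround from the associated revision ("set 'filestore flush min = 0' for all ffsb jobs") would look like this in ceph.conf; the section placement is illustrative, the option name is taken from the commit above:

```ini
[osd]
    ; Work around #4579: disable the async flusher's minimum flush
    ; threshold until the writeback model (#4782) replaces it.
    filestore flush min = 0
```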

#16 Updated by Sage Weil about 6 years ago

  • Assignee deleted (Samuel Just)
  • Priority changed from Urgent to High

This is not new behavior, so it is not a Cuttlefish blocker. Feature #4782 is the current proposed fix.

#17 Updated by Sage Weil about 6 years ago

ubuntu@teuthology:/a/teuthology-2013-05-11_01:00:38-fs-next-testing-basic/11289

#18 Updated by Samuel Just about 6 years ago

  • Assignee set to Samuel Just

#19 Updated by Samuel Just almost 6 years ago

  • Status changed from Verified to Resolved

e21f8df1eb0c459d12911785c69f7427d1ad5689
