Bug #4579: kclient + ffsb workload makes osds mark themselves down - Ceph - Ceph

Actions

Copy link

Bug #4579

closed

kclient + ffsb workload makes osds mark themselves down

Added by Sage Weil about 11 years ago. Updated almost 11 years ago.

Status:

Resolved

Priority:

High

Assignee:

Samuel Just

Category:

Target version:

v0.61 - Cuttlefish

% Done:

Source:

Q/A

Tags:

Backport:

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

The ffsb workload hammers the server side with small requests sufficiently hard that the osds get marked down. I'm guessing it is because htere are too many small io requests and it gets way behind, but that is surprising.

Figure out why. Decide if the fix is on the client side or server side.

Related issues 1 (0 open — 1 closed)

Actions

Copy link

Updated by Sage Weil about 11 years ago

Project changed from CephFS to Ceph

Actions

Copy link

Updated by Sage Weil about 11 years ago

Assignee set to Sage Weil

Actions

Copy link

Updated by Sage Weil about 11 years ago

Status changed from New to In Progress

Actions

Copy link

Updated by Ian Colle about 11 years ago

Target version set to v0.61 - Cuttlefish

Actions

Copy link

Updated by Sage Weil about 11 years ago

head explodes:

{ "num_ops": 89315,

on the osds. osd should throttle number of messages.. not just # of bytes.

maybe libceph should limit ios in flight... maybe. I'm thinking no, since doing it on the server side is safer, and lets the client exploit very very large clusters without tuning.

Actions

Copy link

Updated by Greg Farnum about 11 years ago

What are the actual things being measured here? The OSDs do throttle the number of ops at least at the journal level — but perhaps they don't throttle in any layers above that and so end up suiciding when the ops above get stuck.

Actions

Copy link

Updated by Sage Weil about 11 years ago

Yeah. The only throttling above the filestore queues is the messenger's limit on client bytes.. but these ios are small enough that we get tens of thousands.

Actions

Copy link

Updated by Sage Weil about 11 years ago

Status changed from In Progress to Fix Under Review

Actions

Copy link

Updated by Sage Weil about 11 years ago

Assignee changed from Sage Weil to Greg Farnum

Actions

Copy link

#10

Updated by Greg Farnum about 11 years ago

Status changed from Fix Under Review to 7

We need to figure out how to set good defaults for this value before putting it in. 100 is likely to not be a very good default, especially compared with the other defaults we have (which are permissive).

Actions

Copy link

#11

Updated by Greg Farnum about 11 years ago

Status changed from 7 to Resolved

Okay, Sam says 100 should be fine since the filestore_queue_max_ops is set at 50. I've merged this into next in 95374c628b2df020b9236b6373440d99545e4b1e.

However, I'm concerned that this improves the situation but doesn't remove it, because even if we only let in 100 ops at a time the journal can handle those in 40ms or less, and the filesystem can do an in-memory apply of them very quickly, but they still all need to get flushed out to disk. Testing on my local node I can get it to crash on a liveness assert even with these patches just by running "rados bench 60 write -t 200 -b 4096". We might need to add a counter for unsynced-to-backing-store as well... :/
I didn't observe any serious performance impacts in these local tests from the addition; it mostly just reduced latencies. So I guess it's not hurting anything at least on hard drives.

Actions

Copy link

#12

Updated by Samuel Just about 11 years ago

Status changed from Resolved to 12
Assignee changed from Greg Farnum to Samuel Just

Hmm. The osds aren't suiciding any more, but we still see 30s+ requests. We release filestore throttles when we hit applied and committed. We release the osd messenger throttles when a request is applied and committed at all replicas. However, we have deferred the cost of actually writing out the operation to the fs until sync time allowing us to build up a large io debt by the time the journal forces a sync resulting in the 60s+ sync we see below. Perhaps the journal needs to throttle the number of unsynced ios in addition to the size of unsynced ios?

2013-04-16 16:22:57.347190 7f19d6fe3700 0 log [WRN] : slow request 39.947319 seconds old, received at 2013-04-16 16:22:17.399814: osd_sub_op(client.4128.0:45218 0.d 7b0ad5ed/1000000016d.00000016/head//0 [] v 11'2443 snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
2013-04-16 16:22:57.347194 7f19d6fe3700 0 log [WRN] : slow request 39.763946 seconds old, received at 2013-04-16 16:22:17.583187: osd_sub_op(client.4128.0:45230 0.3 81cbedb3/100000001cc.00000001/head//0 [] v 11'2687 snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
2013-04-16 16:22:57.347197 7f19d6fe3700 0 log [WRN] : slow request 39.412822 seconds old, received at 2013-04-16 16:22:17.934311: osd_sub_op(client.4128.0:45236 0.d 696cd76d/10000000158.00000001/head//0 [] v 11'2444 snapset=0=[]:[] snapc=0=[]) v7 currently no flag points reached
2013-04-16 16:22:57.439687 7f19bc6f7700 0 -- 10.214.134.32:6807/61254 >> 10.214.134.34:6804/24428 pipe(0x7f19980c32d0 sd=26 :41120 s=2 pgs=713 cs=1 l=0).fault with nothing to send, going to standby
2013-04-16 16:22:58.347315 7f19d6fe3700 5 osd.1 11 tick
2013-04-16 16:22:58.347357 7f19d6fe3700 20 osd.1 11 scrub_should_schedule loadavg 4.08 < max 5 = yes
2013-04-16 16:22:58.347365 7f19d6fe3700 20 osd.1 11 sched_scrub load_is_low=1
2013-04-16 16:22:58.347367 7f19d6fe3700 10 osd.1 11 0.2 at 2013-04-16 16:06:36.023315 > min 2013-04-15 16:22:58.347367 (86400 seconds ago)
2013-04-16 16:22:58.347376 7f19d6fe3700 20 osd.1 11 sched_scrub done
2013-04-16 16:22:58.347379 7f19d6fe3700 10 osd.1 11 do_waiters -- start
2013-04-16 16:22:58.347380 7f19d6fe3700 10 osd.1 11 do_waiters -- finish
2013-04-16 16:22:58.934962 7f19d87e6700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f19a4ff9700' had timed out after 15
2013-04-16 16:22:58.934982 7f19d87e6700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f19a47f8700' had timed out after 15
2013-04-16 16:22:59.293570 7f19d57e0700 10 filestore(/var/lib/ceph/osd/ceph-1) sync_entry commit took 68.999479, interval was 77.510400
2013-04-16 16:22:59.293590 7f19d57e0700 10 journal commit_finish thru 19960
2013-04-16 16:22:59.293592 7f19d57e0700 5 journal committed_thru 19960 (last_committed_seq 18948)
2013-04-16 16:22:59.293639 7f19d57e0700 10 journal header: block_size 4096 alignment 4096 max_size 104857600
2013-04-16 16:22:59.293641 7f19d57e0700 10 journal header: start 71376896
2013-04-16 16:22:59.293643 7f19d57e0700 10 journal write_pos 35807232
2013-04-16 16:22:59.293646 7f19d57e0700 10 journal committed_thru done
2013-04-16 16:22:59.293648 7f19d57e0700 15 filestore(/var/lib/ceph/osd/ceph-1) sync_entry committed to op_seq 19960
2013-04-16 16:22:59.293652 7f19d57e0700 20 filestore(/var/lib/ceph/osd/ceph-1) sync_entry waiting for max_interval 5.000000
2013-04-16 16:22:59.293652 7f19d4fdf700 20 journal write_thread_entry woke up
2013-04-16 16:22:59.293663 7f19d4fdf700 10 journal room 35569663 max_size 104857600 pos 35807232 header.start 71376896 top 4096
2013-04-16 16:22:59.293667 7f19d4fdf700 10 journal check_for_full at 35807232 : 8192 < 35569663
2013-04-16 16:22:59.293669 7f19d4fdf700 15 journal prepare_single_write 1 will write 35807232 : seq 25731 len 5586 -> 8192 (head 40 pre_pad 0 ebl 5586 post_pad 2526 tail 40) (ebl alignment -1)
2013-04-16 16:22:59.293721 7f19d4fdf700 10 journal room 35561471 max_size 104857600 pos 35815424 header.start 71376896 top 4096
2013-04-16 16:22:59.293723 7f19d4fdf700 10 journal check_for_full at 35815424 : 8192 < 35561471

Actions

Copy link

#13

Updated by Sage Weil about 11 years ago

Note that the message recv stamp (which is used by the slow req warnings) is the time stamp before Pipe::read_message() waits on the throttler. What are you doing to reproduce this? It might be that just moving the stamp to be when we receive the entire message, or post-throttle, is sufficient to make some of the warnings go away.

Won't help with the 60s sync, though. Hrm.

Actions

Copy link

#14

Updated by Greg Farnum about 11 years ago

We depend on those stamps in other places in order to be able to count throttle time (thus the distinction between recv_stamp and dispatch_stamp). We can redefine meanings if we want, but throttled time is something we should count so be cognizant of it if any changes do occur.

Actions

Copy link

#15