Bug #44357
Improve the way we lock nodes for jobs with 5 nodes required to run
Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
% Done:
0%
Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
I am dumping the IRC chatter here for someone who will be working on it.
(08:23:18 AM) yuriw: hey sage neha - we are talking about the possibility of changing the locking logic so we can easily schedule a suite with 5 nodes
(08:23:29 AM) yuriw: kyr ^
(08:24:20 AM) kyr: not sure about changing the locking logic
(08:24:24 AM) kyr: but
(08:24:36 AM) kyr: I'm looking at the pulpito queue
(08:24:47 AM) kyr: it is a pretty long list
(08:25:05 AM) yuriw: kyr was asking the same question I was asking - why is it that pausing the queue and then scheduling a suite is what actually locks nodes?!
(08:25:36 AM) yuriw: zackc ^
(08:26:24 AM) kyr: the locking mechanism is such that if a suite requests many nodes, the suites that request fewer than the minimum allowed will run faster
(08:28:31 AM) kyr: I mean, if a pool has N free nodes, and this number is less than (reserved number) + (required nodes), the job will wait in the queue until other jobs with less appetite get served
(08:28:32 AM) sage: right. if you have jobs that are trying to get (many) n nodes, just pause the queue for a bit and they will be able to get nodes and start
(08:28:42 AM) sage: you have to wait until they are in the blue waiting state though
(08:29:06 AM) kyr: or you can set a higher priority
(08:30:07 AM) kyr: if it does not work then there is probably a bug :-)
(08:30:40 AM) yuriw: I guess the question is - how to change this, so we don't have to jump through hoops
(08:30:52 AM) kyr: because jobs with fewer nodes and lower priority should wait until higher-priority ones are served
(08:32:39 AM) neha: kyr: Is there a way to guarantee that jobs with a 5-node requirement run, no matter what the priority and without pausing the queue, with the existing locking mechanism? Is there a config parameter that gives up at 5?
(08:32:43 AM) kyr: it means running jobs with a higher number of nodes always at a higher priority?
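The gating condition kyr describes — a job waits while the pool's free nodes are fewer than (reserved number) + (required nodes) — can be sketched as follows. This is a minimal illustration; `can_start` and its parameter names are hypothetical, not teuthology's actual API:

```python
def can_start(free_nodes: int, reserved: int, required: int) -> bool:
    """A job may start only if granting it `required` nodes still
    leaves `reserved` nodes spare in the pool; otherwise it waits
    in the queue until smaller jobs are served."""
    return free_nodes >= reserved + required

# A pool with 8 free nodes and 5 reserved:
can_start(8, reserved=5, required=2)   # True: 8 >= 5 + 2
can_start(8, reserved=5, required=5)   # False: 8 < 5 + 5, job waits
```

This is why large jobs starve: smaller jobs keep passing the check and eating into the free pool before the large requirement is ever satisfiable.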
(08:33:48 AM) kyr: @neha, there is a parameter that drops `reserved_nodes` to 0, but it is worker-wide
(08:33:53 AM) yuriw: kyr: "always with higher priority" does not work, we schedule regular nightly runs with 70 and it's not enough
(08:33:58 AM) kyr: and I don't think it will help
(08:34:34 AM) kyr: it will just increase the number of jobs with a smaller number of nodes being run in parallel
(08:35:15 AM) kyr: I guess we just don't have enough free nodes in the lab?
(08:45:35 AM) yuriw: of course if we had 30 free nodes it'd help, but in reality more nodes won't help and the locking logic does not favor jobs with many nodes
(08:58:36 AM) kyr: <yuriw> kyr: "always with higher priority" does not work, we schedule regular nightly runs with 70 and it's not enough
(08:59:13 AM) kyr: it is not the highest, I've spent some time monitoring which priorities engineers are using, and many people just ignore the rule of 70
(09:00:57 AM) kyr: running all the jobs with priority X would not help; what I am saying is that jobs with bigger cluster requirements should have a higher priority than 70
(09:01:21 AM) kyr: anyway I can't say more until I look at the locker history
(09:03:25 AM) kyr: for example this run http://pulpito.ceph.com/sage-2020-02-28_14:56:50-rados-master-distro-basic-smithi/ has priority 50
(09:07:25 AM) kyr: also, at the moment there is a long list of jobs; I suppose there is not enough time to process all of them before the next portion of regular runs is added to the queue
(09:08:09 AM) kyr: I guess locking ignores creation time and just takes the priority into account
(09:08:58 AM) kyr: and if it works in LIFO order it can be a big problem
(09:14:42 AM) sage: locking has nothing to do with priority
(09:14:50 AM) sage: it's very primitive...
(09:14:57 AM) sage: priority means it dequeues first and the worker starts
(09:15:04 AM) sage: then the worker loops and tries every few seconds to lock N nodes.
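The behavior sage describes — priority only decides dequeue order, after which the worker simply polls for nodes — might be sketched like this. The `try_lock_nodes` callable is a hypothetical stand-in for teuthology's real locking call, not its actual API:

```python
import time

def acquire_nodes(job, try_lock_nodes, poll_interval=5.0):
    """Once a job is dequeued (priority mattered only for the dequeue
    order), the worker retries every few seconds until it manages to
    lock all N required nodes at once."""
    while True:
        # Hypothetical call: returns the locked node list, or None
        # if N nodes were not simultaneously free.
        nodes = try_lock_nodes(job["num_nodes"])
        if nodes is not None:
            return nodes  # all N nodes locked; the job can start
        time.sleep(poll_interval)  # otherwise poll again shortly
```

Since 10-20 such workers race against each other, a worker needing many nodes can spin in this loop indefinitely while smaller jobs snap up whatever frees.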
(09:15:26 AM) sage: if N is large, then there is an exponentially lower probability that N nodes will be free at once
(09:15:31 AM) sage: so they tend to get stuck
(09:15:53 AM) sage: pausing the queue means new jobs don't dequeue, so as runs finish the waiting workers are able to eventually find enough free nodes to start.
(09:18:04 AM) sage: there is a delicate balance with the total number of workers, which is fixed per machine type. for smithi, for example, there are probably 10-20 workers trying to lock nodes at any given point in time. so if one of them needs 6 instead of 2 it will be very hard. fewer workers would make it easier, but then there is the risk that there are a lot of 1- and 2-node jobs and there aren't enough workers to keep all the machines busy
(09:18:15 AM) sage: basically, it sucks.
(09:18:33 AM) zackc: it absolutely sucks
(09:18:33 AM) sage: neha was suggesting we make this the focus of the gsoc teuthology project
(09:18:57 AM) zackc: that sounds like a good idea
(09:18:58 AM) sage: kyr: in the meantime, to make the upgrade jobs run, wait for the workers to start (i.e., tasks are blue), then pause the queue for 20-30m
(09:19:13 AM) sage: (or however long it takes)
(09:19:23 AM) sage: you can pause with teuthology-queue -m smithi --pause $seconds
(09:19:27 AM) sage: and unpause with --pause 0
(10:04:37 AM) kyr: I was thinking about getting rid of so many workers in favor of one worker and reviewing the locking mechanism
(10:06:51 AM) jdillaman: sage: coming to this meeting?
(10:07:04 AM) sage: omw
(10:47:05 AM) kyr: @sage @yuriw do we have a ticket for this ugly lock behavior in the tracker?
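Sage's "exponentially lower probability" point can be illustrated with a toy model: if each node is free independently with probability p at a given polling instant, the chance that N specific nodes are all free at once is p**N. This is an illustration of the intuition, not a measurement of the lab:

```python
def p_all_free(p: float, n: int) -> float:
    """Toy model: probability that n nodes are simultaneously free,
    assuming each node is free independently with probability p."""
    return p ** n

# With ~30% of nodes free at any instant, compare a 2-node job
# to a 6-node job at a single polling attempt:
p_all_free(0.3, 2)   # 0.09
p_all_free(0.3, 6)   # ~0.0007, over 100x rarer
```

Pausing the queue effectively raises p over time (finished runs free nodes and no new jobs consume them), which is why the waiting workers eventually win.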
(10:56:31 AM) gregsfortytwo: kyr: I'm not sure what the best architecture/mechanism is, but "make locking better than many dozen workers racing for locks" has been on the backlog since approximately 6 months after we added queuing with beanstalkd
(10:56:59 AM) gregsfortytwo: I would love you forever, even more than I already do for your py3 and other work
(11:09:03 AM) neha: sage: I have reproduced https://tracker.ceph.com/issues/44299 with debug_ms=20 http://pulpito.ceph.com/nojha-2020-02-28_01:16:11-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4807275/
(11:10:19 AM) neha: osd.7's communication with the mgr is of interest to us
(11:30:12 AM) yuriw: kyr: not yet
(11:30:19 AM) yuriw: AFAIK
Updated by Josh Durgin over 2 years ago
- Status changed from New to Resolved
The dispatcher now handles multi-node jobs better.