Bug #44357

Improve the way we lock nodes for jobs with 5 nodes required to run

Added by Yuri Weinstein almost 2 years ago. Updated 3 months ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

I am dumping the IRC chatter here for someone who will be working on it.

(08:23:18 AM) yuriw: hey sage neha - we are talking about the possibility of changing the locking logic so we can easily schedule suites with 5 nodes
(08:23:29 AM) yuriw: kyr ^
(08:24:20 AM) kyr: not sure changing locking logic
(08:24:24 AM) kyr: but
(08:24:36 AM) kyr: I'm looking at the pulpito queue
(08:24:47 AM) kyr: it is a pretty long list
(08:25:05 AM) yuriw: kyr was asking the same question I was asking - why when sage pausing the queue and then scheduling suite actually locks nodes?!
(08:25:36 AM) yuriw: zackc ^
(08:26:24 AM) kyr: the locking mechanism works so that if a suite requests many nodes, suites that need fewer nodes (down to the minimum allowed) will run sooner
(08:28:31 AM) kyr: I mean, if a pool has N - free nodes, and this number is less than (reserved number) + (required nodes) it will wait in a queue until other jobs with less appetite got served
(08:28:32 AM) sage: right. if you have jobs that are trying to get (many) nodes, just pause the queue for a bit and they will be able to get nodes and start
(08:28:42 AM) sage: you have to wait until they are in the blue waiting state though
(08:29:06 AM) kyr: or you can set a higher priority
(08:30:07 AM) kyr: if it does not work then there is probably a bug :-)
(08:30:40 AM) yuriw: I guess the question is - how to change this, so we don't have to jump through hoops
(08:30:52 AM) kyr: because jobs with fewer nodes and lower priority should wait until higher
(08:32:39 AM) neha: kyr: Is there a way to guarantee that jobs with 5 node requirement run, no matter what the priority and without pausing the queue, with the existing locking mechanism? Is there a config parameter that gives up at 5?
(08:32:43 AM) kyr: it means run jobs with higher number of nodes always with higher priority?
(08:33:48 AM) kyr: @neha, there is a parameter that drops `reserved_nodes` to 0 but it is all worker-wide
(08:33:53 AM) yuriw: kyr: "always with higher priority" does not work, we schedule regular nightly runs with 70 and it's not enough
(08:33:58 AM) kyr: and I don't think it will help
(08:34:34 AM) kyr: it will just increase the number of jobs with fewer number of nodes to be run in parallel
(08:35:15 AM) kyr: I guess we just do not have enough free nodes in the lab?
(08:45:35 AM) yuriw: of course if we had 30 free nodes it'd help, but in reality more nodes won't help and the locking logic does not encourage jobs with many nodes
(08:58:36 AM) kyr: <yuriw> kyr: "always with higher priority" does not work, we schedule regular nightly runs with 70 and it's not enough
(08:59:13 AM) kyr: it is not the higher, I've spent some time monitoring which prio engineers are using, many people just ignore the rule of 70
(09:00:57 AM) kyr: running all the jobs with priority X should not help, what I am saying is that the jobs with bigger cluster requirements should have higher priority than 70
(09:01:21 AM) kyr: anyway I can't say more until I look at the locker history
(09:03:25 AM) kyr: for example this run http://pulpito.ceph.com/sage-2020-02-28_14:56:50-rados-master-distro-basic-smithi/ has priority 50
(09:07:25 AM) kyr: also, at the moment there is a long list of jobs, I suppose there is not enough time to process all of them before the next portion of regular runs is added to the queue
(09:08:09 AM) kyr: I guess locking ignores creation time, and just takes into account the priority
(09:08:58 AM) kyr: and if it works in lifo order it can be a big problem
(09:14:42 AM) sage: locking has nothing to do with priority
(09:14:50 AM) sage: it's very primitive...
(09:14:57 AM) sage: priority means it dequeues first and the worker starts
(09:15:04 AM) sage: then the worker loops and tries every few seconds to lock N nodes.
(09:15:26 AM) sage: if N is large, then there is an exponentially lower probability that N nodes will be free at once
(09:15:31 AM) sage: so they tend to get stuck
(09:15:53 AM) sage: pausing the queue means new jobs don't dequeue, so as runs finish the waiting workers are able to eventually find enough free nodes to start.
(09:18:04 AM) sage: there is a delicate balance between the number of total workers, which is fixed per machine type. for smithi for example there are probably 10-20 workers trying to lock nodes at any given point in time. so if one of them needs 6 instead of 2 it will be very hard. fewer workers would make it easier, but then there is the risk that there are a lot of 1- and 2-node jobs and there aren't enough workers to keep all the machines busy
(09:18:15 AM) sage: basically, it sucks.
(09:18:33 AM) zackc: it absolutely sucks
(09:18:33 AM) sage: neha was suggesting we make this the focus of the gsoc teuthology project
(09:18:57 AM) zackc: that sounds like a good idea
(09:18:58 AM) sage: kyr: in the meantime, to make the upgrade jobs run, wait for the workers to start (i.e., tasks are blue), then pause the queue for 20-30m
(09:19:13 AM) sage: (or however long it takes)
(09:19:23 AM) sage: you can pause with teuthology-queue -m smithi --pause $seconds
(09:19:27 AM) sage: and unpause with --pause 0
(10:04:37 AM) kyr: I was thinking about getting rid of so many workers in favor of one worker and review the locking mechanism
(10:06:51 AM) jdillaman: sage: coming to this meeting?
(10:07:04 AM) sage: omw
(10:47:05 AM) kyr:  @sage @yuriw do we have a ticket for this ugly lock behavior in the tracker? 
(10:56:31 AM) gregsfortytwo: kyr: I'm not sure the best architecture mechanism, but "make locking better than many dozen workers racing for locks" has been on the backlog since approximately 6 months after we added queuing with beanstalkd
(10:56:59 AM) gregsfortytwo: I would love you forever more than I already do for your py3 and other work
(11:09:03 AM) neha: sage: I have reproduced https://tracker.ceph.com/issues/44299 with debug_ms=20 http://pulpito.ceph.com/nojha-2020-02-28_01:16:11-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4807275/
(11:10:19 AM) neha: osd.7's communication with the mgr is of interest to us
(11:30:12 AM) yuriw: kyr: not yet
(11:30:19 AM) yuriw: AFAIK
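
For whoever picks this up: two points from the log above can be put into a small Python sketch. One is kyr's condition that a job waits while the free nodes are fewer than (reserved number) + (required nodes). The other is sage's remark that the chance of N nodes being free at once drops exponentially with N. This is a toy model only, not teuthology code. The function names `can_lock` and `chance_all_free` are hypothetical, and the assumption that every node is free independently with the same probability is a simplification made for illustration.

```python
def can_lock(free_nodes: int, reserved_nodes: int, required_nodes: int) -> bool:
    """Hypothetical helper: the job may proceed only if the free nodes
    cover both the reserved number and the nodes the job requires."""
    return free_nodes >= reserved_nodes + required_nodes


def chance_all_free(p_free: float, required_nodes: int) -> float:
    """Toy model (assumption): each node is free independently with
    probability p_free, so all N being free at once has chance p_free ** N."""
    return p_free ** required_nodes


if __name__ == "__main__":
    # With 9 free nodes and 5 reserved, a 5-node job waits but a 2-node job runs.
    print(can_lock(free_nodes=9, reserved_nodes=5, required_nodes=5))
    print(can_lock(free_nodes=9, reserved_nodes=5, required_nodes=2))

    # Compare a 2-node job with a 6-node job when each node is free half the time.
    for n in (2, 6):
        print(f"{n} nodes: {chance_all_free(0.5, n)}")
```

Under these assumptions a 6-node job finds all of its nodes free on a given attempt 16 times less often than a 2-node job (0.015625 versus 0.25). That is consistent with large jobs tending to get stuck until the queue is paused.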

History

#1 Updated by Yuri Weinstein almost 2 years ago

  • Description updated (diff)

#2 Updated by Josh Durgin 3 months ago

  • Status changed from New to Resolved

The dispatcher handles multi-node jobs better
