Bug #19589

greedyspill.lua: :18: attempt to index a nil value (field '?')

Added by Dan van der Ster 3 months ago. Updated about 2 months ago.

Status: Resolved
Priority: Normal
Category: multi-MDS
Target version: -
Start date: 04/12/2017
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Component(FS):
Needs Doc: No

Description

The included greedyspill.lua doesn't seem to work in a simple 3-active MDS scenario.

balancer    greedyspill.lua
292077:    128.142.158.23:6800/1018507254 'cephhalpert-mds-135c39f87d' mds.0.1592 up:active seq 92 export_targets=0,1,2
275888:    128.142.135.28:6800/608393891 'cephhalpert-mds-981001588f' mds.1.1579 up:active seq 895 export_targets=0,1,2
285379:    128.142.132.151:6800/3646166240 'cephhalpert-mds-96d8ad3ea3' mds.2.1600 up:active seq 28

But it fails at line 18:

2017-04-12 13:54:18.222378 7f34eb1db700  0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.35 > load=0.0
2017-04-12 13:54:18.222396 7f34eb1db700  0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.34 > load=0.0
2017-04-12 13:54:18.222404 7f34eb1db700  0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.27 > load=0.0
2017-04-12 13:54:18.222418 7f34eb1db700  0 lua.balancer WARNING: mantle could not execute script: [string "metrics = {"auth.meta_load", "all.meta_load",..."]:18: attempt to index a nil value (field '?')
2017-04-12 13:54:18.222449 7f34eb1db700  0 log_channel(cluster) log [WRN] : using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
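
The message points at the spill check in the script: each rank compares its own load with the next rank's, and for the highest active rank there is no next entry. A minimal sketch of the failing pattern, assuming the documented Mantle conventions (mds is the per-rank metrics table, whoami is this MDS's rank) rather than the exact shipped script:

-- each rank compares itself against the next rank up
local my_load   = mds[whoami]["load"]
-- for the last active rank, mds[whoami + 1] is nil, so indexing
-- ["load"] on it raises "attempt to index a nil value (field '?')"
local next_load = mds[whoami + 1]["load"]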

History

#1 Updated by Dan van der Ster 3 months ago

Ahh, it's even documented:

Note that if you look at the last MDS (which could be a, b, or c -- it's
random), you will see an attempt to index a nil value. This is because the
last MDS tries to check the load of its neighbor, which does not exist.

So the last MDS cannot send its load away?

#2 Updated by Mark Guz 3 months ago

I also see this error. I have two active/active MDSes. The first shows no errors; the second shows the errors above. No load balancing occurs: the first MDS remains at high CPU usage and the second sits idle.

#3 Updated by Mark Guz 3 months ago

Dan, do you see any evidence of actual load balancing?

#4 Updated by Dan van der Ster 3 months ago

Yes. For example, when I have 50 clients untarring the linux kernel into unique directories, the load is moved around.

#5 Updated by Dan van der Ster 3 months ago

BTW, you need to set debug_mds_balancer = 2 to see the balancer working.
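
For example, in ceph.conf on the MDS hosts (using the setting name above; it can also be injected at runtime):

[mds]
    debug_mds_balancer = 2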

#6 Updated by Mark Guz 3 months ago

Did you modify the greedyspill.lua script at all?

#7 Updated by Dan van der Ster 3 months ago

Nope. Is your load on mds.0? If so, and it gets heavily loaded while mds.1 has load = 0, then I expect the balancer to trigger. It works like that for me.
If not, maybe use the export_dir mds command to move the workload back to rank 0.

mds.1 will use the old default balancer, which in my (very limited) experience is less predictable than the Lua one.

#8 Updated by Mark Guz 3 months ago

I see this in the logs:

2017-04-12 10:14:40.324788 7ff1b5abf700  0 lua.balancer MDS0: < auth.meta_load=35983.556838048 all.meta_load=4760.117183648 req_rate=2660151.0 queue_len=549.0 cpu_load_avg=2.3 > load=4760.117183648
2017-04-12 10:14:40.324807 7ff1b5abf700  0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=2557470.0 queue_len=0.0 cpu_load_avg=0.05 > load=0.0
2017-04-12 10:14:40.324817 7ff1b5abf700  0 lua.balancer MDS2: < auth.meta_load=0.86185489059074 all.meta_load=0.78425692586527 req_rate=28861.0 queue_len=0.0 cpu_load_avg=0.06 > load=0.78425692586527

#9 Updated by Patrick Donnelly 3 months ago

  • Assignee set to Patrick Donnelly

This error shouldn't be an expected occurrence. I'll create a fix for this.

#10 Updated by Mark Guz 3 months ago

And the CPU usage on MDS0 stays at around 250%.

#11 Updated by Michael Sevilla 3 months ago

We could make the greedyspill.lua balancer check whether it is the last MDS and just return instead of failing (a rough sketch follows at the end of this comment). I can work on this when I have a few cycles.

@Dan: regarding your question -- "So the last MDS cannot send its load away?". Yes, that's true because the Greedy Spill algorithm waterfalls load down the MDS ranks. The last one has no neighbor to spill to -- we could configure it to spill back to MDS0, but our results show that ending the migration here has the best performance.
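
A minimal sketch of that guard, using the Mantle conventions (mds, whoami, and a targets table indexed by rank); this is illustrative, not the wording of the eventual fix:

-- the last rank has no neighbour to spill to: send no load anywhere
-- instead of indexing the missing mds[whoami + 1] entry
if mds[whoami + 1] == nil then
  local targets = {}
  for rank in pairs(mds) do targets[rank] = 0 end
  return targets
end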

#12 Updated by Dan van der Ster 3 months ago

Well, currently the last MDS falls back to the old balancer, so it can in fact shift its load back to the others according to the old rules, AFAIU. (And I observed this in practice.)
Making greedyspill.lua return instead of failing would change this behaviour.

I think the key thing here is not to send the Lua failure to the cluster log (clog) -- or at least to make that opt-in. Currently ceph.log gets these WRN messages, which is misleading:

2017-04-13 09:50:57.629475 mds.2 128.142.158.23:6800/447961479 4033 : cluster [WRN] using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
2017-04-13 09:51:57.630770 mds.2 128.142.158.23:6800/447961479 4034 : cluster [WRN] using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
...

#13 Updated by Patrick Donnelly 2 months ago

  • Status changed from New to Need Review

#14 Updated by John Spray about 2 months ago

  • Status changed from Need Review to Resolved
