Bug #19589 (closed)
greedyspill.lua: :18: attempt to index a nil value (field '?')
Description
The included greedyspill.lua doesn't seem to work in a simple 3-active MDS scenario.
balancer greedyspill.lua
292077: 128.142.158.23:6800/1018507254 'cephhalpert-mds-135c39f87d' mds.0.1592 up:active seq 92 export_targets=0,1,2
275888: 128.142.135.28:6800/608393891 'cephhalpert-mds-981001588f' mds.1.1579 up:active seq 895 export_targets=0,1,2
285379: 128.142.132.151:6800/3646166240 'cephhalpert-mds-96d8ad3ea3' mds.2.1600 up:active seq 28
But it fails on line 18?
2017-04-12 13:54:18.222378 7f34eb1db700 0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.35 > load=0.0
2017-04-12 13:54:18.222396 7f34eb1db700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.34 > load=0.0
2017-04-12 13:54:18.222404 7f34eb1db700 0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.27 > load=0.0
2017-04-12 13:54:18.222418 7f34eb1db700 0 lua.balancer WARNING: mantle could not execute script: [string "metrics = {"auth.meta_load", "all.meta_load",..."]:18: attempt to index a nil value (field '?')
2017-04-12 13:54:18.222449 7f34eb1db700 0 log_channel(cluster) log [WRN] : using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
Updated by Dan van der Ster about 7 years ago
Ahh, it's even documented:
Note that if you look at the last MDS (which could be a, b, or c -- it's random), you will see an attempt to index a nil value. This is because the last MDS tries to check the load of its neighbor, which does not exist.
So the last MDS cannot send his load away?
Updated by Mark Guz about 7 years ago
I also see this error. I have two active/active MDSs. The first shows no errors; the second shows the errors above. No load balancing occurs: the first MDS remains at high CPU use while the second sits idle.
Updated by Mark Guz about 7 years ago
Dan, do you see any evidence of actual load balancing?
Updated by Dan van der Ster about 7 years ago
Yes. For example, when I have 50 clients untarring the linux kernel into unique directories, the load is moved around.
Updated by Dan van der Ster about 7 years ago
BTW, you need to set debug_mds_balancer = 2 to see the balancer working.
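For reference, one way to raise that setting at runtime is via injectargs (the exact invocation may vary by release; adjust the daemon target to your deployment):

```shell
# Raise balancer debug logging on all MDS daemons at runtime.
ceph tell mds.* injectargs '--debug_mds_balancer 2'

# Or persist it in ceph.conf under the [mds] section:
# [mds]
#     debug mds balancer = 2
```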
Updated by Mark Guz about 7 years ago
Did you modify the greedyspill.lua script at all?
Updated by Dan van der Ster about 7 years ago
Nope. Is your load on mds.0? If yes, and it gets heavily loaded, and if mds.1 has load = 0, then I expect the balancer to trigger. It works like that for me.
If not, maybe use the export_dir mds command to move the workload back to 0.
mds.1 will use the old default balancer, which in my (very limited) experience is less predictable than the lua thing.
Updated by Mark Guz about 7 years ago
I see this in the logs:
2017-04-12 10:14:40.324788 7ff1b5abf700 0 lua.balancer MDS0: < auth.meta_load=35983.556838048 all.meta_load=4760.117183648 req_rate=2660151.0 queue_len=549.0 cpu_load_avg=2.3 > load=4760.117183648
2017-04-12 10:14:40.324807 7ff1b5abf700 0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=2557470.0 queue_len=0.0 cpu_load_avg=0.05 > load=0.0
2017-04-12 10:14:40.324817 7ff1b5abf700 0 lua.balancer MDS2: < auth.meta_load=0.86185489059074 all.meta_load=0.78425692586527 req_rate=28861.0 queue_len=0.0 cpu_load_avg=0.06 > load=0.78425692586527
Updated by Patrick Donnelly about 7 years ago
- Assignee set to Patrick Donnelly
This error shouldn't be an expected occurrence. I'll create a fix for this.
Updated by Mark Guz about 7 years ago
And the CPU use on MDS0 stays at +/- 250%.
Updated by Michael Sevilla about 7 years ago
We could make the greedyspill.lua balancer check whether it is the last MDS and simply return instead of failing. I can work on this when I have a few cycles.
@Dan: regarding your question -- "So the last MDS cannot send his load away?". Yes, that's true, because the greedy-spill algorithm waterfalls load down the MDS ranks. The last rank has no neighbor to spill to -- we could configure it to spill back to MDS0, but our results show that ending the migration here gives the best performance.
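A minimal sketch of such a guard, in the style of the shipped greedyspill.lua (which reads the per-rank metrics table `mds` and this daemon's rank `whoami` from the mantle environment). The `mds` table here is a stand-in stub for illustration; this is not the merged fix:

```lua
-- Stub of the mantle environment: 'mds' maps each rank to its load
-- metrics, 'whoami' would be this daemon's rank.
mds = { [0] = {load = 5.0}, [1] = {load = 0.0}, [2] = {load = 0.0} }

-- Decide whether to spill: bail out on the last rank instead of
-- indexing the non-existent neighbor mds[whoami+1] (the nil value
-- the tracebacks above point at).
local function when(whoami)
  if mds[whoami + 1] == nil then
    return false  -- last rank: no neighbor to spill to, do nothing
  end
  -- Greedy-spill condition: spill when I have load and my neighbor
  -- is idle.
  return mds[whoami]["load"] > 0.01 and mds[whoami + 1]["load"] < 0.01
end

assert(when(0) == true)   -- rank 0 spills to idle rank 1
assert(when(2) == false)  -- last rank no longer errors out
```

Note that, as Dan points out below, silently returning changes the current behaviour, where the last rank falls back to the old balancer.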
Updated by Dan van der Ster about 7 years ago
Well, currently the last MDS fails over to the old balancer, so it can in fact shift its load back to the others according to the old rules, AFAIU. (And I observed this in practice.)
Making greedyspill.lua return instead of failing would change this behaviour.
I think the key thing here is not to send the Lua failure to the cluster log -- or at least to make that opt-in. Currently ceph.log gets these WRN messages, which is misleading:
2017-04-13 09:50:57.629475 mds.2 128.142.158.23:6800/447961479 4033 : cluster [WRN] using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
2017-04-13 09:51:57.630770 mds.2 128.142.158.23:6800/447961479 4034 : cluster [WRN] using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
...
Updated by Patrick Donnelly about 7 years ago
- Status changed from New to Fix Under Review
Updated by John Spray almost 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Patrick Donnelly about 5 years ago
- Category deleted (90)
- Labels (FS) multimds added