Bug #19589 (closed)

greedyspill.lua: :18: attempt to index a nil value (field '?')

Added by Dan van der Ster about 7 years ago. Updated about 5 years ago.

Status: Resolved
Priority: Normal
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS): multimds
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The included greedyspill.lua doesn't seem to work in a simple 3-active MDS scenario.

balancer    greedyspill.lua
292077:    128.142.158.23:6800/1018507254 'cephhalpert-mds-135c39f87d' mds.0.1592 up:active seq 92 export_targets=0,1,2
275888:    128.142.135.28:6800/608393891 'cephhalpert-mds-981001588f' mds.1.1579 up:active seq 895 export_targets=0,1,2
285379:    128.142.132.151:6800/3646166240 'cephhalpert-mds-96d8ad3ea3' mds.2.1600 up:active seq 28

But it fails on line 18?

2017-04-12 13:54:18.222378 7f34eb1db700  0 lua.balancer MDS0: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.35 > load=0.0
2017-04-12 13:54:18.222396 7f34eb1db700  0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.34 > load=0.0
2017-04-12 13:54:18.222404 7f34eb1db700  0 lua.balancer MDS2: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=0.0 queue_len=0.0 cpu_load_avg=0.27 > load=0.0
2017-04-12 13:54:18.222418 7f34eb1db700  0 lua.balancer WARNING: mantle could not execute script: [string "metrics = {"auth.meta_load", "all.meta_load",..."]:18: attempt to index a nil value (field '?')
2017-04-12 13:54:18.222449 7f34eb1db700  0 log_channel(cluster) log [WRN] : using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
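
For context, here is a minimal Lua sketch of the check that fails, assuming the general structure of the upstream greedyspill.lua (the error string above shows the embedded script starts with the metrics table); mds and whoami are globals injected by the MDS balancer, and the exact threshold and line layout here are illustrative:

-- Sketch only: approximates the shape of greedyspill.lua, not the exact file.
-- `mds` (per-rank metrics, indexed by rank starting at 0) and `whoami`
-- (this MDS's rank) are provided to the script by the MDS balancer.
metrics = {"auth.meta_load", "all.meta_load", "req_rate", "queue_len", "cpu_load_avg"}

-- Spill when this rank has load and the next-higher rank is idle.
local function when()
  -- For the last active rank, mds[whoami + 1] is nil, so indexing
  -- ["load"] on it raises "attempt to index a nil value".
  if mds[whoami]["load"] > 0.01 and mds[whoami + 1]["load"] < 0.01 then
    return true
  end
  return false
end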
Actions #1

Updated by Dan van der Ster about 7 years ago

Ahh, it's even documented:

Note that if you look at the last MDS (which could be a, b, or c -- it's
random), you will see an attempt to index a nil value. This is because the
last MDS tries to check the load of its neighbor, which does not exist.

So the last MDS cannot send his load away?

Actions #2

Updated by Mark Guz about 7 years ago

I also see this error. I have two active/active MDSes. The first shows no errors; the second shows the errors above. No load balancing occurs: the first MDS remains at high CPU use and the second sits idle.

Actions #3

Updated by Mark Guz about 7 years ago

Dan, do you see any evidence of actual load balancing?

Actions #4

Updated by Dan van der Ster about 7 years ago

Yes. For example, when I have 50 clients untarring the Linux kernel into unique directories, the load is moved around.

Actions #5

Updated by Dan van der Ster about 7 years ago

BTW, you need to set debug_mds_balancer = 2 to see the balancer working.

Actions #6

Updated by Mark Guz about 7 years ago

Did you modify the greedyspill.lua script at all?

Actions #7

Updated by Dan van der Ster about 7 years ago

Nope. Is your load on mds.0? If so, and it gets heavily loaded, and mds.1 has load = 0, then I expect the balancer to trigger. It works like that for me.
If not, maybe use the export_dir mds command to move the workload back to rank 0.

mds.1 will use the old default balancer, which in my (very limited) experience is less predictable than the Lua one.

Actions #8

Updated by Mark Guz about 7 years ago

I see this in the logs:

2017-04-12 10:14:40.324788 7ff1b5abf700  0 lua.balancer MDS0: < auth.meta_load=35983.556838048 all.meta_load=4760.117183648 req_rate=2660151.0 queue_len=549.0 cpu_load_avg=2.3 > load=4760.117183648
2017-04-12 10:14:40.324807 7ff1b5abf700  0 lua.balancer MDS1: < auth.meta_load=0.0 all.meta_load=0.0 req_rate=2557470.0 queue_len=0.0 cpu_load_avg=0.05 > load=0.0
2017-04-12 10:14:40.324817 7ff1b5abf700  0 lua.balancer MDS2: < auth.meta_load=0.86185489059074 all.meta_load=0.78425692586527 req_rate=28861.0 queue_len=0.0 cpu_load_avg=0.06 > load=0.78425692586527

Actions #9

Updated by Patrick Donnelly about 7 years ago

  • Assignee set to Patrick Donnelly

This error shouldn't be an expected occurrence. I'll create a fix for this.

Actions #10

Updated by Mark Guz about 7 years ago

And the CPU use on MDS0 stays at +/- 250%.

Actions #11

Updated by Michael Sevilla about 7 years ago

We could make the greedyspill.lua balancer check to see if it is the last MDS. Then just return instead of failing. I can work on this when I have a few cycles.

@Dan: regarding your question -- "So the last MDS cannot send his load away?". Yes, that's true, because the Greedy Spill algorithm waterfalls load down the MDS ranks. The last one has no neighbor to spill to -- we could configure it to spill back to MDS0, but our results show that ending the migration here gives the best performance.
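
For illustration, a minimal sketch of such a guard, assuming the when()-style hook and the injected mds/whoami globals described above; the 0.01 threshold is illustrative, not the actual patch:

-- Sketch only: bail out cleanly when this rank is the last one and has no
-- neighbour to spill to, instead of indexing the nil entry.
local function when()
  if mds[whoami + 1] == nil then
    return false  -- last rank: nothing to spill to, never trigger migration
  end
  return mds[whoami]["load"] > 0.01 and mds[whoami + 1]["load"] < 0.01
end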

Actions #12

Updated by Dan van der Ster about 7 years ago

Well, currently the last MDS fails over to the old balancer, so it can in fact shift its load back to the others according to the old rules, AFAIU. (And I observed this in practice.)
Making greedyspill.lua return instead of failing would change this behaviour.

I think the key thing here is not to send the Lua failure to the cluster log (clog) -- or at least to make that opt-in. Currently ceph.log gets these WRN messages, which is misleading:

2017-04-13 09:50:57.629475 mds.2 128.142.158.23:6800/447961479 4033 : cluster [WRN] using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
2017-04-13 09:51:57.630770 mds.2 128.142.158.23:6800/447961479 4034 : cluster [WRN] using old balancer; mantle failed for balancer=greedyspill.lua : (22) Invalid argument
...
Actions #13

Updated by Patrick Donnelly almost 7 years ago

  • Status changed from New to Fix Under Review
Actions #14

Updated by John Spray almost 7 years ago

  • Status changed from Fix Under Review to Resolved
Actions #15

Updated by Patrick Donnelly about 5 years ago

  • Category deleted (90)
  • Labels (FS) multimds added