Bug #22060

RGW: ERROR: failed to distribute cache

Added by Wido den Hollander almost 2 years ago. Updated almost 2 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Target version:
-
Start date:
11/07/2017
Due date:
% Done:

0%

Source:
Tags:
rgw,cache,notify
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

As reported on the ceph-users list, I see these messages in my RGW log running Luminous:

2017-11-07 09:45:22.150273 7f985b34f700 10 cache put: name=gn1-pf.rgw.meta++.meta:bucket:XXX-mon-bucket:_iaUdq4vufCpgnMlapZCm169:1 info.flags=0x17
2017-11-07 09:45:22.150283 7f985b34f700 10 adding gn1-pf.rgw.meta++.meta:bucket:XXX-mon-bucket:_iaUdq4vufCpgnMlapZCm169:1 to cache LRU end
2017-11-07 09:45:22.150291 7f985b34f700 10 distributing notification oid=notify.1 bl.length()=407
2017-11-07 09:45:31.881703 7f985b34f700 10 cache put: name=gn1-pf.rgw.data.root++XXX-mon-bucket info.flags=0x17
2017-11-07 09:45:31.881720 7f985b34f700 10 moving gn1-pf.rgw.data.root++XXX-mon-bucket to cache LRU end
2017-11-07 09:45:31.881733 7f985b34f700 10 distributing notification oid=notify.1 bl.length()=372

I'm not sure yet what these messages mean. In addition, I see RGW stalls of up to 30 seconds.
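
For reference, the relevant debug levels can be raised at runtime through the admin socket of the running gateway; the socket path below is an assumption based on the default naming:

ceph daemon /var/run/ceph/ceph-client.rgw.ceph-rgw1.asok config set debug_rgw 10
ceph daemon /var/run/ceph/ceph-client.rgw.ceph-rgw1.asok config set debug_rados 10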

Increasing the librados debugging I see this:

2017-11-07 10:30:11.982214 7f5c24a92700 10 updating xattr: name=user.rgw.acl bl.length()=155
2017-11-07 10:30:11.982226 7f5c24a92700 10 distributing notification oid=notify.6 bl.length()=708
2017-11-07 10:30:11.982273 7f5c24a92700 10 librados: notify issued linger op 0x5639b74fd500
2017-11-07 10:30:11.983585 7f5c24a92700 10 librados: notify linger op 0x5639b74fd500 acked (0)
2017-11-07 10:30:11.983589 7f5c24a92700 10 librados: notify waiting for watch_notify finish 0x5639b74fd500
2017-11-07 10:30:21.983629 7f5c24a92700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:.bucket.meta.XXXXX-mon-bucket:eb32b1ca-807a-4867-aea5-ff43ef7647c6.21905901.1

There seems to be a 10-second stall (from :11 until :21) when trying to distribute the cache notification here, which matches the default client_notify_timeout of 10 seconds.

History

#1 Updated by Wido den Hollander almost 2 years ago

I lowered client_notify_timeout to 5 seconds and now I see:

2017-11-07 11:10:16.301202 7fa79add3700 10 librados: notify linger op 0x5651e7a20d80 acked (0)
2017-11-07 11:10:16.301206 7fa79add3700 10 librados: notify waiting for watch_notify finish 0x5651e7a20d80
2017-11-07 11:10:21.301070 7fa79add3700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:XXXX-mon-bucket

As you can see, after 5 seconds it times out.

[client.rgw.ceph-rgw1]
client_notify_timeout = 5

Checking the watchers on each notify object shows that there are 144 watchers on each one:

rados --id rgw.ceph-rgw1 -p gn1-pf.rgw.control listwatchers notify.4|wc -l

That yields 144.

All the watchers are from the 2 RGWs I have running.
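
A quick way to count the watchers on all of the control objects at once (a shell sketch, assuming the default set of 8 notify objects, notify.0 through notify.7):

for i in $(seq 0 7); do
    echo -n "notify.$i: "
    rados --id rgw.ceph-rgw1 -p gn1-pf.rgw.control listwatchers notify.$i | wc -l
done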

#2 Updated by Wido den Hollander almost 2 years ago

Digging further I see that my rgw1 sends out a notify:

2017-11-07 14:10:16.964036 7fa74051e700 10 cache put: name=gn1-pf.rgw.data.root++XXX-mon-bucket info.flags=0x17
2017-11-07 14:10:16.964042 7fa74051e700 10 moving gn1-pf.rgw.data.root++XX-mon-bucket to cache LRU end
2017-11-07 14:10:16.964053 7fa74051e700 10 distributing notification oid=notify.1 bl.length()=374
2017-11-07 14:10:16.964121 7fa74051e700 10 librados: notify issued linger op 0x5651e8b1b100
2017-11-07 14:10:16.965309 7fa74051e700 10 librados: notify linger op 0x5651e8b1b100 acked (0)
2017-11-07 14:10:16.965311 7fa74051e700 10 librados: notify waiting for watch_notify finish 0x5651e8b1b100
2017-11-07 14:10:21.965060 7fa74051e700  0 ERROR: failed to distribute cache for gn1-pf.rgw.data.root:XXXX-mon-bucket

And on rgw2 I see that at the same moment a notify is received and dealt with:


2017-11-07 14:10:16.968354 7f0d6f880700 10 RGWWatcher::handle_notify()  notify_id 799718615563302 cookie 94174036403072 notifier 21944058 bl.length()=374
2017-11-07 14:10:16.968386 7f0d6f880700 10 cache put: name=gn1-pf.rgw.data.root++XXXX-mon-bucket info.flags=0x17
2017-11-07 14:10:16.968393 7f0d6f880700 10 adding gn1-pf.rgw.data.root++XXXX-mon-bucket to cache LRU end

So it seems that rgw2 does handle the notify, but rgw1 never seems to receive the acknowledgement, causing the notify to time out.
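
As a side note, the same watch/notify path can be exercised outside of RGW by sending a manual notify to one of the control objects; if the watchers are healthy this returns almost immediately, otherwise it hangs until the timeout (a sketch using the pool and object names from above):

rados --id rgw.ceph-rgw1 -p gn1-pf.rgw.control notify notify.1 test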

#3 Updated by Yehuda Sadeh almost 2 years ago

Sounds like a leak in the watch/notify registration. How many watchers are there on the other notify objects?

#4 Updated by Wido den Hollander almost 2 years ago

I saw 144 watchers on each object. All 8 notify objects have that many watchers. Is that normal?

I tried stopping one of the RGWs, but that did not change the situation.

#5 Updated by Casey Bodley almost 2 years ago

Wido den Hollander wrote:

All the watchers are from the 2 RGWs I have running.

Are you saying that your 'listwatchers' output only shows two unique client IDs? Or was that based on the IPs?

I saw 144 watchers on each object. All 8 notify objects have that many watchers. Is that normal?

That's definitely not normal, and likely why you're seeing timeouts. Each gateway should only have one watcher on each object. Each instance of radosgw-admin will also add a watcher. Is it possible that you've spawned a lot of admin commands that are hung?
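
A simple host-side check for piled-up admin commands on the gateway machines (a hypothetical check, not taken from the original report):

ps aux | grep '[r]adosgw-admin' | wc -l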

#6 Updated by Wido den Hollander almost 2 years ago

  • Status changed from New to Closed

As it turns out, an admin on this system had created a Zabbix check which ran 'radosgw-admin usage show' every minute. When a run didn't finish in time, Zabbix spawned yet another one.

This caused a deadlock, which meant the radosgw-admin processes kept piling up.

Eventually this caused the notifies to time out on the RGWs themselves, making uploads and downloads slow.

Killing all the radosgw-admin processes made this go away.
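
For completeness, the cleanup amounted to killing the stuck admin processes and then confirming that the watcher count on the control objects dropped back to one per gateway plus any active admin command; the commands below are a sketch of that, not copied from the report:

pkill -f 'radosgw-admin usage show'
rados --id rgw.ceph-rgw1 -p gn1-pf.rgw.control listwatchers notify.4 | wc -l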
