Bug #20763
Status: Closed
Radosgw hangs after a few hours
Description
Since upgrading to 12.1, our Object Gateways hang after a few hours. I only see these messages in the log file:
2017-06-29 07:52:20.877587 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:07:20.877761 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:07:29.994979 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 08:22:20.877911 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:27:30.086119 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 08:37:20.878108 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:37:30.187696 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 08:52:20.878283 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:57:30.280881 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 09:07:20.878451 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
FYI: we do not use Keystone or Openstack.
This started after upgrading from jewel (via kraken) to luminous.
Updated by Vaibhav Bhembre almost 7 years ago
I am seeing similar issues on upgrade from 10.2.6 to 12.1.0 (dev). I am not using Keystone.
My Keystone-specific config (which should leave Keystone disabled) looks as follows:
"rgw_keystone_url": ""
"rgw_keystone_admin_token": ""
"rgw_keystone_admin_user": ""
"rgw_keystone_admin_password": ""
"rgw_keystone_admin_tenant": ""
"rgw_keystone_admin_project": ""
"rgw_keystone_admin_domain": ""
"rgw_keystone_barbican_user": ""
"rgw_keystone_barbican_password": ""
"rgw_keystone_barbican_tenant": ""
"rgw_keystone_barbican_project": ""
"rgw_keystone_barbican_domain": ""
"rgw_keystone_api_version": "2"
"rgw_keystone_accepted_roles": "Member"
"rgw_keystone_accepted_admin_roles": ""
"rgw_keystone_token_cache_size": "10000"
"rgw_keystone_revocation_interval": "900"
"rgw_keystone_verify_ssl": "true"
"rgw_keystone_implicit_tenants": "false"
"rgw_s3_auth_use_keystone": "false"
This is a blocker preventing us from moving to Luminous.
Updated by Matt Benjamin almost 7 years ago
Hi Martin,
Can you provide a log snippet taken with "-d --debug-rgw=20 --debug-ms=1", including the transition into the hang state? Also the output of "ceph -s".
There's insufficient info here to speculate about what this hang amounts to.
Matt
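The requested foreground debug run can be sketched as follows. This is an illustrative sketch, not from the thread: the client name "client.rgw.gateway1" and the output path are placeholders to adjust for your deployment, and the service instance should be stopped first so the ports are free.

```shell
# Stop the managed radosgw instance, then run it in the foreground (-d)
# with verbose RGW and messenger logging, teeing output to a file:
radosgw -d --debug-rgw=20 --debug-ms=1 -n client.rgw.gateway1 2>&1 \
    | tee /tmp/rgw-debug.log
```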
Updated by Martin Emrich almost 7 years ago
Sure.
ceph -s:
[root@ceph-kl-mon1 ~]# ceph -s
cluster:
id: cfaf0f4e-3b09-49e8-875b-4b114b0c4842
health: HEALTH_OK
services:
mon: 3 daemons, quorum ceph-kl-mon2,ceph-kl-mon3,ceph-kl-mon1
mgr: ceph-kl-mon2(active), standbys: ceph-kl-mon1, ceph-kl-mon3
osd: 3 osds: 3 up, 3 in
data:
pools: 18 pools, 3464 pgs
objects: 899k objects, 17812 MB
usage: 74381 MB used, 227 GB / 299 GB avail
pgs: 3464 active+clean
io:
client: 6469 B/s wr, 0 op/s rd, 1 op/s wr
I am running one rgw now in a screen session, collecting debug messages as you requested. I will check the result tomorrow morning and report back.
Updated by Casey Bodley almost 7 years ago
Possible that this is related to http://tracker.ceph.com/issues/20686 ?
Updated by Graham Allan almost 7 years ago
I've been seeing the same thing with Luminous 12.1.1.
I think you're right that it could be associated with the SIGHUP issue (20686) - the hang consistently corresponds to log rotation time.
In the logs, however, I don't see any clear transition to the hung state. I have a periodic transfer into radosgw that no longer works after that point, but other activities such as object expiration continue normally.
Updated by Vidushi Mishra almost 7 years ago
Observed the same issue with Luminous (RC) 12.1.1, without using OpenStack or Keystone:
2017-07-28 03:32:47.670056 7efc98597700 0 ERROR: keystone revocation processing returned error r=-22
It is resolved by restarting the radosgw service.
Updated by Martin Emrich almost 7 years ago
Sorry for the delay, had a busy week.
Sadly, the log file I created showed the error right away; now my Rados Gateway does not work at all, even after restarts.
Symptom: S3 clients get a "bucket not found" error, while I can list my buckets via radosgw-admin.
I am still on 12.1.0, will update to latest and try again...
Updated by Martin Emrich almost 7 years ago
I finally managed to reproduce it while collecting log files:
I set up a continuous torture test using rclone: upload 40,000 files, list them, and delete them again, then pause for 30 minutes. Overnight, the last successful run finished at 02:55; the next run at 03:25 failed.
Sadly, the debug log is about 300 MB uncompressed (still 24 MB gzipped), so I cannot attach it here.
So I uploaded it to my Dropbox: https://www.dropbox.com/s/9x5s69iwcm7d4ie/ceph-radosgw-debugging.log.gz?dl=0
In the log, you can see some of the DELETE calls of the last successful run; somewhere in the following 30 minutes, the bug kicks in.
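The torture test described above can be sketched as a simple loop. This is a reconstruction, not the reporter's actual script: the remote name "cephs3:", the bucket "test-bucket", and the local directory are hypothetical and assume a matching entry in rclone.conf.

```shell
# Hypothetical reproduction loop: upload ~40000 files, list them,
# delete them again, then pause for 30 minutes before repeating.
while true; do
    rclone copy ./testfiles cephs3:test-bucket      # upload the file set
    rclone ls cephs3:test-bucket > /dev/null        # list all objects
    rclone delete cephs3:test-bucket                # delete them again
    sleep 1800                                      # pause for 30 minutes
done
```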
Updated by Radoslaw Zarzynski over 6 years ago
- Related to Bug #20866: rgw: crash in system_obj_set_attrs when got SIGHUP added
Updated by Nitin Kamble over 6 years ago
From the information here, it looks like the hang occurs at log rotation time. That means it should be possible to work around the issue by disabling log rotation.
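One way to apply this workaround can be sketched as follows. This is a suggestion under assumptions, not a verified fix: it assumes the distribution ships the stock /etc/logrotate.d/ceph file (whose postrotate step signals the daemons), and the exact path may differ on your system.

```shell
# Disable Ceph's logrotate rule on the gateway host so the daemon
# never receives the rotation-time SIGHUP:
mv /etc/logrotate.d/ceph /etc/logrotate.d/ceph.disabled

# Note: the rgw log will now grow unbounded; prune or truncate it
# manually (e.g. via a cron job) until the underlying bug is fixed.
```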
Updated by Nitin Kamble over 6 years ago
I noticed the same keystone revocation processing error in the rgw log with Ceph v12.1.4. The setup does not use Keystone or OpenStack. Unlike the earlier reports in this tracker, v12.1.4 does not hit the rgw hang.
Updated by Matt Benjamin over 6 years ago
The keystone revocation thread can be disabled altogether. The issue does indeed seem to be SIGHUP-related.
Matt
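Disabling the revocation thread is commonly done by setting its interval to zero; this is a hedged sketch for ceph.conf, and the section name "client.rgw.gateway1" is a placeholder. Verify against the documentation for your exact release that a zero interval disables the thread rather than merely changing its period.

```
# ceph.conf fragment (sketch) — disable keystone revocation processing
[client.rgw.gateway1]
    rgw keystone revocation interval = 0
```

After changing the setting, restart the radosgw service for it to take effect.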