Bug #20763 (Closed): Radosgw hangs after a few hours

Added by Martin Emrich over 6 years ago. Updated over 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite: rgw
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Since upgrading to 12.1, our Object Gateways hang after a few hours. I only see these messages in the log file:

2017-06-29 07:52:20.877587 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:07:20.877761 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:07:29.994979 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 08:22:20.877911 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:27:30.086119 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 08:37:20.878108 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:37:30.187696 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 08:52:20.878283 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22
2017-06-29 08:57:30.280881 7fa8e11e7700 0 process_single_logshard: Error in get_bucket_info: (2) No such file or directory
2017-06-29 09:07:20.878451 7fa8e01e5700 0 ERROR: keystone revocation processing returned error r=-22

FYI: we do not use Keystone or OpenStack.

This started after upgrading from jewel (via kraken) to luminous.
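
For reference, the revocation errors above recur every 15 minutes, which matches the 900-second rgw_keystone_revocation_interval shown in the config dump later in this thread, so they come from the gateway's periodic revocation pass rather than from client requests. One way to see what the running gateway thinks its Keystone settings are is to query its admin socket; the socket name below is only an example and depends on the rgw instance name:

ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config show | grep rgw_keystone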


Related issues: 1 (0 open, 1 closed)

Related to rgw - Bug #20866: rgw: crash in system_obj_set_attrs when got SIGHUP (Closed)

Actions #1

Updated by Vaibhav Bhembre over 6 years ago

I am seeing similar issues on upgrade from 10.2.6 to 12.1.0 (dev). I am not using Keystone.

My Keystone-specific config (which should be disabled) looks as follows:

"rgw_keystone_url": "" 
"rgw_keystone_admin_token": "" 
"rgw_keystone_admin_user": "" 
"rgw_keystone_admin_password": "" 
"rgw_keystone_admin_tenant": "" 
"rgw_keystone_admin_project": "" 
"rgw_keystone_admin_domain": "" 
"rgw_keystone_barbican_user": "" 
"rgw_keystone_barbican_password": "" 
"rgw_keystone_barbican_tenant": "" 
"rgw_keystone_barbican_project": "" 
"rgw_keystone_barbican_domain": "" 
"rgw_keystone_api_version": "2" 
"rgw_keystone_accepted_roles": "Member
"rgw_keystone_accepted_admin_roles": "" 
"rgw_keystone_token_cache_size": "10000" 
"rgw_keystone_revocation_interval": "900" 
"rgw_keystone_verify_ssl": "true" 
"rgw_keystone_implicit_tenants": "false" 
"rgw_s3_auth_use_keystone": "false" 

This is a blocker preventing us from moving to Luminous.
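
For comparison, a minimal ceph.conf sketch matching the dump above for a deployment that does not use Keystone at all; the section name is an example, and the options shown are simply the ones from the dump left empty or off:

[client.rgw.gateway1]
# Keystone is not used for S3 auth on this gateway
rgw_s3_auth_use_keystone = false
rgw_keystone_url =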

Actions #2

Updated by Matt Benjamin over 6 years ago

Hi Martin,

Can you provide a log snippet at -d --debug-rgw=20 --debug-ms=1, including the transition to the hang state? Also the output of "ceph -s".

There's insufficient info here to speculate about what this hang amounts to.

Matt
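
For anyone collecting the requested log, a sketch of one way to run the gateway in the foreground with those debug settings; the instance name and output path are placeholders:

radosgw -d -n client.rgw.gateway1 --debug-rgw=20 --debug-ms=1 2>&1 | tee /tmp/rgw-debug.log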

Actions #3

Updated by Martin Emrich over 6 years ago

Sure.

ceph -s:

[root@ceph-kl-mon1 ~]# ceph -s
  cluster:
    id:     cfaf0f4e-3b09-49e8-875b-4b114b0c4842
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum ceph-kl-mon2,ceph-kl-mon3,ceph-kl-mon1
    mgr: ceph-kl-mon2(active), standbys: ceph-kl-mon1, ceph-kl-mon3
    osd: 3 osds: 3 up, 3 in

  data:
    pools:   18 pools, 3464 pgs
    objects: 899k objects, 17812 MB
    usage:   74381 MB used, 227 GB / 299 GB avail
    pgs:     3464 active+clean

  io:
    client:   6469 B/s wr, 0 op/s rd, 1 op/s wr

I am running one rgw now in a screen session, collecting debug messages as you requested. I will check the result tomorrow morning and report back.

Actions #4

Updated by Casey Bodley over 6 years ago

Is it possible that this is related to http://tracker.ceph.com/issues/20686?

Actions #5

Updated by Graham Allan over 6 years ago

I've been seeing the same thing with Luminous 12.1.1.

I think you're right that it could be associated with the SIGHUP issue (20686) - the hang consistently corresponds to log rotation time.

In the logs, however, I don't see any real transition to the hung state. I have a periodic transfer into radosgw that no longer works after that point, but other activities such as object expiration continue normally.
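
For context: the logrotate snippet shipped with Ceph signals the daemons, radosgw included, with SIGHUP from its postrotate step so they reopen their log files, which is why the hang lines up with rotation time. A rough sketch of what /etc/logrotate.d/ceph typically contains (exact contents vary by release and distribution):

/var/log/ceph/*.log {
    daily
    rotate 7
    compress
    sharedscripts
    postrotate
        # -1 = SIGHUP: ask the daemons (including radosgw) to reopen their logs
        killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw || true
    endscript
    missingok
    notifempty
}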

Actions #6

Updated by Vidushi Mishra over 6 years ago

Observed the same issue with Luminous (RC) 12.1.1, without using OpenStack or Keystone:

2017-07-28 03:32:47.670056 7efc98597700 0 ERROR: keystone revocation processing returned error r=-22

It is resolved by restarting the radosgw service.
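
For reference, on a systemd-based install that restart is typically something like the following; the unit instance name depends on how the gateway was deployed:

systemctl restart ceph-radosgw@rgw.gateway1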

Actions #7

Updated by Martin Emrich over 6 years ago

Sorry for the delay; it has been a busy week.
Sadly, the log file I created had the error right away, and now my RADOS Gateway does not work at all, even after restarts.

Symptom: S3 clients get a "bucket not found" error, while I can list my buckets via radosgw-admin.

I am still on 12.1.0; I will update to the latest release and try again...
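
For anyone reproducing that comparison, a sketch of the two views; the endpoint, port, and credentials are placeholders:

# what the gateway's backing pools contain
radosgw-admin bucket list

# what an S3 client sees through the (hung) gateway
aws s3 ls --endpoint-url http://ceph-rgw.example.com:7480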

Actions #8

Updated by Martin Emrich over 6 years ago

I finally managed to reproduce it while collecting log files:

I set up a continuous torture test using rclone: upload 40,000 files, list them, and delete them again, then pause for 30 minutes. Overnight, the last successful run finished at 02:55; the next run at 03:25 failed.
Sadly, the debug log is about 300 MB uncompressed (and still 24 MB gzipped), so I cannot attach it here.
So I uploaded it to my Dropbox: https://www.dropbox.com/s/9x5s69iwcm7d4ie/ceph-radosgw-debugging.log.gz?dl=0

In the log, you can see some of the DELETE calls of the last successful run; then, somewhere in the next 30 minutes, the bug kicks in.
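
A sketch of the kind of loop described above; the rclone remote, bucket name, and source directory are placeholders, not the exact test script:

while true; do
    rclone copy /data/testfiles cephs3:torture-bucket   # upload the 40,000 files
    rclone ls cephs3:torture-bucket > /dev/null         # list them
    rclone delete cephs3:torture-bucket                 # delete them again
    sleep 1800                                          # pause for 30 minutes
done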

Actions #9

Updated by Orit Wasserman over 6 years ago

  • Assignee set to Matt Benjamin
Actions #10

Updated by Radoslaw Zarzynski over 6 years ago

  • Related to Bug #20866: rgw: crash in system_obj_set_attrs when got SIGHUP added
Actions #11

Updated by Matt Benjamin over 6 years ago

  • Status changed from New to 12
Actions #12

Updated by Nitin Kamble over 6 years ago

From the information here, it looks like the hang is triggered at log rotation time. That means it should be possible to work around the issue by disabling log rotation.
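
A sketch of that workaround on a gateway host; the path is the usual packaged location, and logs will then grow unbounded until rotation is handled some other way:

# stop logrotate from picking up the Ceph config (and thus from SIGHUPing radosgw)
mv /etc/logrotate.d/ceph /root/logrotate-ceph.disabled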

Actions #13

Updated by Nitin Kamble over 6 years ago

I noticed the same keystone revocation processing error in the rgw log with Ceph release v12.1.4. The setup is not using Keystone or OpenStack. Unlike what is described in this tracker, v12.1.4 does not hit the rgw hang.

Actions #14

Updated by Matt Benjamin over 6 years ago

The Keystone revocation thread can be disabled altogether. The issue does indeed seem to me to be SIGHUP-related.

Matt
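
Presumably this refers to rgw_keystone_revocation_interval (shown as 900 in the dump in comment #1). A ceph.conf sketch, under the assumption that setting it to 0 disables the periodic revocation pass on this release; verify against the version in use:

[client.rgw.gateway1]
# assumption: 0 disables the periodic Keystone revocation pass; verify for your release
rgw_keystone_revocation_interval = 0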

Actions #15

Updated by Patrick Donnelly over 4 years ago

  • Status changed from 12 to New
Actions #16

Updated by Casey Bodley over 2 years ago

  • Status changed from New to Closed