Bug #21226 (open): Expired Keystone Tokens not removed from Cache

Added by Johannes Rudolph over 6 years ago. Updated over 2 years ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: -
Tags: keystone
Backport: -
Regression: No
Severity: 2 - major
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Note: This is my first contribution to ceph, so please bear with me if there's any missing info or incorrect usage of the bug tracker.

We spent quite a bit of effort reproducing an issue with radosgw that led users of our Swift instances to periodically see all their requests denied with a 401 Unauthorized error, while the same Keystone token continued to work fine with other OpenStack services.

From studying the radosgw code in question, I believe there are multiple bugs in the Keystone Token Cache implementation of radosgw. Let's start with a description of what happens:

For every request made with a Keystone user token, radosgw needs to validate the token with Keystone. For the Keystone UUID tokens that we use, this check involves POSTing the token to {{KEYSTONE_URL}}/v3/auth/tokens. To perform this user token validation, radosgw uses its own admin token, which it obtained by authenticating with Keystone using its configured rgw keystone admin user.
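For context, obtaining that admin token is a standard Keystone v3 password authentication; the token comes back in the X-Subject-Token response header. With the rgw keystone admin settings from the config below, the request would look roughly like this (password elided):

POST /v3/auth/tokens HTTP/1.1
Host: 10.10.30.100:35357
Content-Type: application/json

{"auth": {
  "identity": {"methods": ["password"],
    "password": {"user": {"name": "swift",
                          "domain": {"name": "default"},
                          "password": "..."}}},
  "scope": {"project": {"name": "service",
                        "domain": {"name": "default"}}}}}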

Because both calls are relatively expensive, radosgw caches the results of both. What we observe in practice and have verified using radosgw logs (at level=20) is that the admin token becomes invalid or expires, yet radosgw continues to use it for validating user tokens. The observable behavior for a user is that requests to Swift API endpoints backed by radosgw return 401 errors. radosgw does not recover from this issue until it thinks the token should actually expire and evicts it from the cache.

I believe this is caused by the following implementation issues, outlined in the call sequence below (I will refer to the source code of a git checkout of v10.2.9):

- rgw_swift.cc:504 calls get_keystone_admin_token(...) to get the admin token to use
- this does a cache lookup at rgw_swift.cc:227
- the cache lookup checks expiry of the token at rgw_keystone.cc:217 and will produce a cache miss if the cached token is expired according to rgw_keystone.h:86 (sketched below)
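For illustration, here is a condensed sketch of that lookup path. The types and names below are simplified stand-ins for what rgw_keystone.h/.cc actually define (the real code uses different identifiers and adds locking), so treat everything as approximate:

#include <ctime>
#include <map>
#include <string>

// Simplified stand-in for the cached token structure; the real
// definition in rgw_keystone.h is more involved.
struct KeystoneToken {
  std::string id;
  time_t expires;  // absolute expiry time reported by Keystone

  // Corresponds to the check at rgw_keystone.h:86: only the local
  // wall clock is compared against 'expires'.
  bool expired() const {
    return expires <= time(nullptr);
  }
};

// Simplified stand-in for the token cache consulted at rgw_keystone.cc:217.
class TokenCache {
  std::map<std::string, KeystoneToken> tokens;
public:
  // A lookup is a hit as long as the token is unexpired *according to
  // radosgw's own clock*; Keystone's opinion is never consulted here.
  bool find(const std::string& token_id, KeystoneToken& token) {
    auto it = tokens.find(token_id);
    if (it == tokens.end())
      return false;          // miss: never cached
    if (it->second.expired()) {
      tokens.erase(it);      // miss: expired locally, evict
      return false;
    }
    token = it->second;      // hit: token assumed valid, even if
    return true;             // Keystone has already invalidated it
  }

  void invalidate(const std::string& token_id) {
    tokens.erase(token_id);
  }
};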

What this code path totally misses are two edge conditions:

- 1) The token is fetched at t < expires, yet the request is made at t + 1 > expires -> Keystone returns a 401 due to the invalid admin token
- 2) Corollary: the admin token is invalidated at Keystone (e.g. by an operator or a token cleanup) -> Keystone returns a 401 due to the invalid admin token

In both cases, the code fails to handle the returned 401 response and correctly evict the token from the cache. This leads to subsequent requests reusing the same invalidated admin token and failing accordingly. What radosgw should do instead:

- 1) provide a configurable grace period so that the cache expires tokens early (optional)
- 2) handle 401 errors concerning the admin token by evicting the token from the cache and retrying the request once (required; see the sketch below)
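A minimal sketch of both fixes, reusing the simplified TokenCache/KeystoneToken types from the sketch above. The grace-period knob and the helpers fetch_admin_token() / validate_user_token() are hypothetical and merely stand in for the real Keystone round-trips; nothing below exists under these names in radosgw:

// Fix 1 (optional): expire cached tokens 'grace' seconds early so a
// token cannot be used in the window right before Keystone rejects it.
// No such configurable option exists in v10.2.9.
bool expired_with_grace(const KeystoneToken& t, time_t grace) {
  return t.expires - grace <= time(nullptr);
}

// Hypothetical stand-ins for the real HTTP round-trips to Keystone.
KeystoneToken fetch_admin_token();   // authenticate as the rgw admin user
int validate_user_token(const KeystoneToken& admin,
                        const std::string& user_token);  // returns HTTP status

// Fix 2 (required): treat a 401 from Keystone as proof that the cached
// admin token is invalid, evict it, and retry exactly once.
int validate_with_retry(TokenCache& cache, const std::string& user_token) {
  KeystoneToken admin;
  if (!cache.find("admin", admin))
    admin = fetch_admin_token();

  int status = validate_user_token(admin, user_token);
  if (status == 401) {                 // admin token rejected upstream
    cache.invalidate("admin");         // drop the stale admin token
    admin = fetch_admin_token();       // re-authenticate
    status = validate_user_token(admin, user_token);  // single retry
  }
  return status;
}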

I have not yet found out why Keystone expires the tokens before radosgw does; the clocks are properly synchronized on all Keystone and radosgw servers. In any case, Keystone owns these tokens and radosgw must play by its rules. After all:

There are only two hard things in Computer Science: cache invalidation and naming things.

-- Phil Karlton

Note: I'm also not sure why radosgw thinks it needs to maintain an admin token in the first place; it would also be possible to use the user token as both X-Auth-Token and X-Subject-Token for the token validation against {{KEYSTONE_URL}}/v3/auth/tokens.
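For reference, the documented Keystone v3 validation call is a GET to /v3/auth/tokens (a POST to the same path is authentication), with the caller's token in X-Auth-Token and the token to validate in X-Subject-Token. Using the user token for both would look roughly like the request below; whether Keystone accepts a non-admin caller here depends on its policy for identity:validate_token:

GET /v3/auth/tokens HTTP/1.1
Host: 10.10.30.100:35357
X-Auth-Token: <user token>
X-Subject-Token: <same user token>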

We were able to work around these implementation issues by disabling the token cache with rgw keystone token cache size = 0 in the config. Obviously this is not a good idea, as it will seriously hurt request performance. Since the token cache is enabled by default (size: 10000), I classify this bug as major.

For reference, here is our full config:

[global]
fsid = 2dba243e-839c-439c-9f0e-b4bbedb4e697
public_network = 10.10.16.0/22
cluster_network = 192.168.10.0/24
mon_initial_members = ceph00, ceph01, ceph02
mon_host = 10.10.16.10,10.10.16.11,10.10.16.12
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
rgw s3 auth use keystone = true
rgw keystone api version = 3
rgw keystone url = http://10.10.30.100:35357
rgw keystone admin domain = default
rgw keystone admin project = service
rgw keystone admin user = swift
rgw keystone admin password = xxx
rgw keystone accepted roles = _member_, admin
rgw keystone token cache size = 0
rgw_keystone_make_new_tenants = true
rgw dns name = swift.os.eu-de-netde.msh.host
[mon]
mon clock drift allowed = .200
[osd]
osd crush update on start = false

#1

Updated by Orit Wasserman over 6 years ago

  • Assignee set to Radoslaw Zarzynski
#2

Updated by Orit Wasserman over 6 years ago

  • Assignee changed from Radoslaw Zarzynski to Marcus Watts
#3

Updated by hoan nv over 6 years ago

Hi Johannes Rudolph,

Did you configure the radosgw integration with OpenStack successfully? I am working on the same integration and running into some errors. Could you share your setup with me? Please send me an email.

Thanks,

Hoan

#4

Updated by Casey Bodley over 2 years ago

  • Tags set to keystone