Bug #15489
open
CORS property of bucket can get out of sync with multiple RGWs, caching issue
0%
Description
This is a really hard bug to reproduce. It's happened about once a month for the last 6 months (maybe longer), each time for a single bucket.
In our case, we've got 10 RGWs (behind haproxies). The tests below were performed directly against the civetweb RGWs, bypassing haproxy.
Testcase:
- On the bad bucket:
- pass1: loop over each RGW, fetch the CORS configuration and add a CORS rule whose ID is unique to that RGW. I use the ID 's3-cors-health-$HOSTNAME'.
- sleep
- pass2: loop over all RGWs, fetch the CORS configuration again:
-- On the good RGWs, you'll get a CORS configuration that is missing the rules for the bad RGWs.
-- On the bad RGWs, you'll get some older version of the CORS configuration that never gets updated anymore (this is VERY evident when the bucket initially had no CORS configuration).
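The two passes above can be sketched roughly as follows. This is only an illustration, not the script actually used: `get_rule_ids` and `add_rule` are hypothetical callables that would wrap whatever S3 client talks to a single RGW directly, and the "own rule missing" test for a bad RGW is an assumed heuristic derived from the symptoms described.

```python
import time
from typing import Callable, Dict, Iterable, List

RULE_PREFIX = "s3-cors-health-"  # rule ID prefix taken from the test case


def run_cors_health_check(
    hosts: Iterable[str],
    get_rule_ids: Callable[[str], List[str]],  # hypothetical: CORS rule IDs as seen via one RGW
    add_rule: Callable[[str, str], None],      # hypothetical: add a rule with this ID via one RGW
    sleep_seconds: float = 0.0,
) -> Dict[str, List[str]]:
    """Return, per RGW host, the health-check rule IDs missing from its view."""
    hosts = list(hosts)

    # Pass 1: via each RGW, fetch the configuration and add a rule unique to it.
    for host in hosts:
        get_rule_ids(host)
        add_rule(host, RULE_PREFIX + host)

    time.sleep(sleep_seconds)

    # Pass 2: fetch again via each RGW and record which rules it cannot see.
    expected = {RULE_PREFIX + host for host in hosts}
    missing: Dict[str, List[str]] = {}
    for host in hosts:
        seen = set(get_rule_ids(host))
        missing[host] = sorted(expected - seen)
    return missing


def suspect_stale(missing: Dict[str, List[str]]) -> List[str]:
    """Assumed heuristic: an RGW that cannot see its own rule is serving stale data."""
    return [host for host, ids in missing.items() if RULE_PREFIX + host in ids]
```

With this sketch, a good RGW would report only the rules of the bad RGWs as missing, while a bad RGW would also be missing its own rule and so appear in `suspect_stale`.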
What I don't know is what the initial trigger condition is. The most recent occurrence is on a user bucket that is 4 days old.
I cannot reproduce it on any other bucket right now, including many of various ages (including one made a few minutes before the user bucket).
The only thing I can say is that we haven't seen the issue happen on any RGW instances with runtimes under 1 week.
Restarting the bad RGW instances resolves the problem.
Updated by Konstantin Shalygin 20 days ago
- Source changed from other to Community (user)