rgw: break up user reset-stats into multiple cls ops
Currently, when a user requests a reset of user stats via radosgw-admin, a single write op is sent to the OSD holding the user's info object. Its omap entries are read in a loop and the final result is written to the object's header.
The advantage to this technique is that it is atomic, manipulating a single object with a write operation.
The downside, though, is that on OSDs that may be bogged down for other reasons, this write operation may take a while and limit access to the PG on which this object resides.
There are a couple of ideas that might mitigate this. However, given how infrequent this operation is, these changes are likely not worth implementing, at least at this point in time. Instead, this tracker is here primarily to capture these ideas for the future.
It's important to understand that one object is read from and written to for this op.
For a user that has a lot of buckets this operation incrementally reads through their buckets, totaling the stats as it goes along, with one final write. Bucket stats are read in groups of 1000, so if a user had 100,000 buckets, this would involve 100 reads.
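As a rough illustration of the read count, the sketch below models the omap as an in-memory dict and pages through it 1000 entries at a time. The names (`total_stats`, `PAGE_SIZE`) and the data layout are hypothetical stand-ins, not the actual cls code.

```python
PAGE_SIZE = 1000  # group size taken from the description above


def total_stats(omap, page_size=PAGE_SIZE):
    """Sum per-bucket stats page by page; return (total, number_of_reads)."""
    keys = sorted(omap)
    total = 0
    reads = 0
    for start in range(0, len(keys), page_size):
        reads += 1  # one omap read per group of entries
        for key in keys[start:start + page_size]:
            total += omap[key]
    return total, reads


# Hypothetical user with 100,000 buckets, each contributing a stat of 1.
omap = {f"bucket-{i:06d}": 1 for i in range(100_000)}
total, reads = total_stats(omap)
```

In the current design all of these reads, plus the final header write, happen inside the one write op.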
So the first idea is to do the reads as one op to determine the total, and the write as a second op to update the header. Presumably other reads on the PG could take place during the read op. The primary challenge here is to make sure there were no intervening writes between the read op and the write op. A generation number and/or timestamp of the header write could be used to ensure that the write op is OK to complete. Otherwise the write op would fail with an error, possibly followed by a set of retries.
The second idea would be to go further and break the reads into multiple ops, with enough information returned from each to continue the operation with more reads, followed by a single write. The same challenge as listed above applies here, although with more opportunities for races with other write ops.
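The read-then-conditional-write scheme could look roughly like the sketch below. It is a toy model under stated assumptions: `UserInfoObject`, its `generation` counter, and `reset_stats` are hypothetical names, and the object is an in-memory stand-in rather than anything from the real cls interface.

```python
class UserInfoObject:
    """Toy stand-in for the user's info object: omap entries plus a header."""

    def __init__(self, omap):
        self.omap = dict(omap)
        self.header_total = 0
        self.generation = 0  # assumed to be bumped by every write

    def read_op(self):
        """Read-only op: total the omap and report the generation seen."""
        return sum(self.omap.values()), self.generation

    def write_op(self, total, expected_generation):
        """Write op: update the header only if no write intervened."""
        if self.generation != expected_generation:
            return False  # caller should re-read and retry
        self.header_total = total
        self.generation += 1
        return True

    def update_bucket(self, name, size):
        """Any other write to the object also bumps the generation."""
        self.omap[name] = size
        self.generation += 1


def reset_stats(obj, max_retries=3, between_ops=None):
    """Run the read op, then the guarded write op; return attempts used."""
    for attempt in range(1, max_retries + 1):
        total, gen = obj.read_op()
        if between_ops is not None:
            between_ops(attempt)  # test hook to simulate a racing writer
        if obj.write_op(total, gen):
            return attempt
    raise RuntimeError("reset-stats: too many conflicting writes")
```

If a bucket update lands between the two ops, the guarded write is rejected and the whole read is repeated, which is the retry cost the idea trades for a shorter write op.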
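A minimal sketch of reads split into multiple ops might look like this, assuming each read op returns a partial sum, a continuation marker, and a truncation flag. `read_chunk` and `total_in_chunks` are hypothetical names, and the dict again stands in for the omap.

```python
def read_chunk(omap, marker, max_entries):
    """One read op: sum up to max_entries keys sorting after `marker`.

    Returns (partial_sum, next_marker, truncated).
    """
    keys = sorted(k for k in omap if k > marker)
    page = keys[:max_entries]
    partial = sum(omap[k] for k in page)
    truncated = len(keys) > max_entries  # more entries remain after this page
    next_marker = page[-1] if page else marker
    return partial, next_marker, truncated


def total_in_chunks(omap, max_entries=1000):
    """Client-side loop: issue read ops until the listing is exhausted."""
    total, marker, ops = 0, "", 0
    truncated = True
    while truncated:
        partial, marker, truncated = read_chunk(omap, marker, max_entries)
        total += partial
        ops += 1  # other work on the PG could be scheduled between ops
    return total, ops
```

The final header write would still need a guard such as a generation check, since any write to the object between the first read op and the write would invalidate the accumulated total.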
#2 Updated by Josh Durgin over 1 year ago
There are a finite number of OSD op threads. If the 100 reads in a single op take a while, they will block one of those threads. By default there are 2 threads per shard for SSD, and 8 shards, so if these kinds of ops were more common, they could end up blocking I/O for 1/8th of the PGs.
The first idea doesn't help much, since reads to the same PG would still be blocked on each other; there's no parallelism there today. The second idea, with multiple ops, would get around this and let other work happen interspersed with these operations.