Inconsistency raised while performing multiple "image rename" operations in parallel.
It is observed that, possibly due to a weak internal locking mechanism, the image header can persist in the OSDs after the image is renamed repeatedly in parallel, even though the image no longer exists in the pool. Because of this, a new image cannot be assigned a name that is still present in a leftover header.
Image headers should be deleted properly after the image is renamed repeatedly in parallel, so that this kind of consistency problem cannot occur in the future.
Step-1 : Create an image in a pool # rbd create mypool/image --size 1024
[ Note: For testing purposes, make sure the pool contains no other images ]
Step-2 : Execute the scripts below in two different shells to perform rename operations repeatedly in parallel
<Shell-1> # ./rbd_rename_program1.sh mypool
<Shell-2> # ./rbd_rename_program2.sh mypool
[ Note:- Both scripts rename the image by fetching the current image name from the pool; the two scripts change the name in two different ways ]
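The rename scripts themselves are not attached to this report. A hypothetical sketch of what such a loop might look like is below; to keep it runnable without a cluster, `rbd ls` and `rbd rename` are stubbed with functions over a temporary directory, where in the real script these would be genuine rbd(8) calls against the pool passed as `$1`.

```shell
#!/bin/sh
# Hypothetical sketch of one rename script (the actual rbd_rename_program*.sh
# are not attached). "rbd ls" / "rbd rename" are stubbed with shell functions
# over a temp directory so the alternating-rename logic is visible without a
# cluster; the real script would call rbd(8) against the pool in "$1".
pooldir=$(mktemp -d)
touch "$pooldir/image"                              # stand-in for: rbd create

rbd_ls()     { ls "$pooldir"; }                     # stand-in for: rbd ls <pool>
rbd_rename() { mv "$pooldir/$1" "$pooldir/$2"; }    # stand-in for: rbd rename

for i in 1 2 3 4 5; do
    img=$(rbd_ls | head -n 1)                       # fetch current image name
    if [ "$img" = "image" ]; then new="image-renamed"; else new="image"; fi
    rbd_rename "$img" "$new"                        # flip between the two names
done
final=$(rbd_ls)
rm -rf "$pooldir"
```

Running two such loops concurrently, each flipping between a different pair of names, reproduces the contention described above.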
Step-3 : Repeat Step-2 until an "image already exists" error appears.
Step-4 : Find the image name for which the CLI throws the error, then check whether that image exists in the pool # rbd ls mypool
- ceph osd map mypool <image name for which the error is shown>
- ls /var/lib/ceph/osd/ceph-<osd_id>/current/<PG_num>_head
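The symptom in the steps above is consistent with a rename implemented as a non-atomic "write new header, then delete old header" sequence: between the two steps both headers exist, and if the second step is lost the stale header survives forever. A toy simulation with plain files (not rbd, which would need a running cluster) illustrates the window:

```shell
#!/bin/sh
# Toy simulation of a non-atomic rename using plain files, not rbd.
# The rename is modeled as two separate steps: write the new header object,
# then delete the old one. A concurrent observer between the two steps
# sees two headers for one image -- the stale-header window from the report.
pool=$(mktemp -d)
touch "$pool/rbd_header.image"            # header created by "rbd create"

# rename image -> image-a, step 1: write the new header
touch "$pool/rbd_header.image-a"
# --- window: a concurrent client listing the pool now sees two headers ---
during=$(ls "$pool" | grep -c '^rbd_header\.')

# step 2: delete the old header; if this step is lost, it persists forever
rm "$pool/rbd_header.image"
after=$(ls "$pool")
rm -rf "$pool"
```

If the process dies (or two renamers interleave) before step 2, the leftover `rbd_header.image` is exactly the object that later blocks reuse of the name.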
#2 Updated by Jason Dillaman 3 months ago
- Status changed from New to Need More Info
- Priority changed from Normal to Low
While this is in fact a valid issue, I am hesitant to add the complexity of two-phase commit logic for this edge condition. Is there a valid use case where you are hitting this (besides just trying to)?
#3 Updated by Debashis Mondal 3 months ago
We observed this unexpected behavior in Ceph when trying to access the same image for renaming in parallel from the client side. If the header becomes corrupted, it can cause a larger problem in production when the image is accessed by different clients.
#5 Updated by Debashis Mondal 3 months ago
We checked this scenario using the features Ceph itself provides.
Since Ceph allows parallel access to the same image from different clients, there should be a proper locking and cleanup mechanism to handle the parallelism and keep things consistent.
Looking at this behavior from a development point of view, we found that this improper handling causes an inconsistency on the Ceph server side, which should not be acceptable.
I simulated this behavior to locate the underlying bug. In my understanding, it should be fixed to maintain consistency.
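Until the server handles this atomically, clients can serialize the renames themselves. A minimal workaround sketch using flock(1) is below; it assumes all renaming clients run on the same host, and the lock file path is hypothetical:

```shell
#!/bin/sh
# Client-side workaround sketch (assumptions: all renaming clients run on
# the same host, and /tmp/rbd_rename.lock is a hypothetical agreed-upon
# lock file). An exclusive advisory lock serializes the renames so the two
# scripts' write-new/delete-old sequences can never interleave.
lockfile=/tmp/rbd_rename.lock
exec 9>"$lockfile"          # open the lock file on fd 9
flock -x 9                  # block until we hold the exclusive lock
# --- critical section: the real rename would go here, e.g. ---
#   rbd rename "$pool/$old" "$pool/$new"
result="rename done"
exec 9>&-                   # close fd 9, releasing the lock
```

This only works when every renamer cooperates and shares the host; a real fix has to live inside the rename path on the server side.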
#7 Updated by Debashis Mondal 3 months ago
In my view it must be fixed. When multiple clients have rename rights on an image, users may rename it as they see fit, and if many users issue the same operation at the same time, improper lock handling produces exactly this situation. If a client then tries to assign a name already chosen by another client, and the stale header entry left behind by the earlier error is still present on the server, the user cannot use the name of their choice: multiple header entries (including duplicates) remain in the Ceph object store, which from a user's perspective is clearly a bug and mismanagement on the server side.
This situation can occur at any time depending on user activity, so it should be fixed to rule out this possibility in the future, and the fix should also land in current releases. The Ceph server should remain reliable regardless of how users use it.