Bug #55217
pybind/mgr/volumes: Clone operation hangs
% Done: 100%
Description
The clone operation hangs while testing a clone-failure PR locally (teuthology via vstart_runner).
The hang is seen while running two successive tests sequentially. The operations performed in the
tests are as follows.
test1
-------------------
1. Create a subvolume
2. Write some I/O to the subvolume
3. Create a snapshot of the subvolume
4. Create a clone of the snapshot
5. Delete the snapshot from the back end (without using the subvolume interface) before the clone completes (the 'snapshot_clone_delay' configuration was used to achieve this)
6. Delete the clone with force
7. Delete the subvolume
test2
------------------
1. Create a subvolume
2. Write some I/O to the subvolume
3. Create a snapshot of the subvolume
4. Create a clone of the snapshot <--------------- THIS OPERATION HANGS ---------------
'test1' succeeds, but in 'test2' the clone operation hangs.
Analysis and Root Cause:
------------------------
In step 5 of 'test1', the snapshot is deleted from the back end without using the subvolume interface.
Because of this, the clone scheduled in step 4 fails. When the clone fails, we detach the
snapshot (the 'detach_snapshot' routine), which removes the tracker from the clone index located at
'/volumes/_index/clone/'. But since the snapshot is already gone from the back end, 'detach_snapshot'
fails at its existence check and never removes the index entry.
def detach_snapshot(self, snapname, track_id):
    if not snapname.encode('utf-8') in self.list_snapshots():
        raise VolumeException(-errno.ENOENT,
                              "snapshot '{0}' does not exist".format(snapname))  # <---- FAILS HERE ----
    try:
        with open_clone_index(self.fs, self.vol_spec) as index:
            index.untrack(track_id)
        self._remove_snap_clone(track_id)
    except (IndexException, MetadataMgrException) as e:
        log.warning("error delinking snapshot from clone: {0}".format(e))
        raise VolumeException(-errno.EINVAL, "error delinking snapshot from clone")
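The effect can be reduced to a runnable sketch (standalone stand-ins: 'clone_index' and 'snapshots' here are plain Python containers, not the real index or snapshot store), showing that the ENOENT check fires before the untrack step is ever reached:

```python
import errno

class VolumeException(Exception):
    def __init__(self, err, msg):
        super().__init__(msg)
        self.errno = err

# hypothetical stand-ins for the clone index and the on-disk snapshots
clone_index = {"track-0001": "/volumes/_nogroup/clone1"}
snapshots = set()  # the snapshot was removed from the back end

def detach_snapshot(snapname, track_id):
    # mirrors the real routine: the existence check runs first,
    # so the untrack step below is never reached
    if snapname not in snapshots:
        raise VolumeException(-errno.ENOENT,
                              "snapshot '{0}' does not exist".format(snapname))
    clone_index.pop(track_id)  # the untrack that never happens

try:
    detach_snapshot("snap1", "track-0001")
except VolumeException:
    pass  # the clone is marked failed, but...

assert "track-0001" in clone_index  # ...the stale tracking entry survives
```

Because the tracking entry survives, the cloner threads keep picking it up on every pass.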
So the cloner thread (from async_job.py) goes into an infinite loop trying to process this stale index entry. Each iteration takes 'self.async_job.lock'.
While holding this lock, the thread:
1. Opens the volume and reads the index to get the next job
2. Registers that job
It then executes the job outside the lock.
Check below.
def run(self):
    retries = 0
    thread_id = threading.currentThread()
    assert isinstance(thread_id, JobThread)
    thread_name = thread_id.getName()
    log.debug("thread [{0}] starting".format(thread_name))

    while retries < JobThread.MAX_RETRIES_ON_EXCEPTION:
        vol_job = None
        try:
            # fetch next job to execute
            with self.async_job.lock:
                while True:
                    if self.should_reconfigure_num_threads():
                        log.info("thread [{0}] terminating due to reconfigure".format(thread_name))
                        self.async_job.threads.remove(self)
                        return
                    vol_job = self.async_job.get_job()
                    if vol_job:
                        break
                    self.async_job.cv.wait()
                self.async_job.register_async_job(vol_job[0], vol_job[1], thread_id)

            # execute the job (outside the lock)
            self.async_job.execute_job(vol_job[0], vol_job[1], should_cancel=lambda: thread_id.should_cancel())
            ...
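The fetch-under-lock / execute-outside-lock pattern above can be sketched in standalone form (illustrative names; a plain Condition stands in for 'self.async_job.cv'). Note that in the real code, 'get_job()' also reads the clone index from the file system while the lock is held, which is exactly where the hang will occur:

```python
import threading

lock = threading.Lock()
cv = threading.Condition(lock)    # same lock guards the queue and the wait
queue = ["cephfs"]                # queued volume names
executed = []

def worker():
    # fetch the next job while holding the shared lock
    with cv:
        while not queue:
            cv.wait()
        job = queue.pop(0)        # get_job + register_async_job equivalent
    # execute outside the lock so a long clone doesn't block queue_job()
    executed.append(job)

t = threading.Thread(target=worker)
t.start()
t.join()
```

Executing outside the lock is the right design; the bug is that the job *fetch* inside the lock can itself block indefinitely.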
The new clone operation takes the same lock to queue its job:
'_prepare_clone_subvolume' calls 'self.cloner.queue_job' (async_job.py), which
takes the lock, queues the job, and notifies the cloner threads.
def queue_job(self, volname):
    """
    queue a volume for asynchronous job execution.
    """
    log.info("queuing job for volume '{0}'".format(volname))
    with lock_timeout_log(self.lock):  # <---- SAME LOCK ----
        if volname not in self.q:
            self.q.append(volname)
            self.jobs[volname] = []
        self.cv.notifyAll()
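A minimal sketch (hypothetical: a bare threading.Lock in place of the manager's lock, no logging) of how 'queue_job()' contends on that same lock; the timeout exists only so the example terminates:

```python
import threading

lock = threading.Lock()
q, jobs = [], {}

def queue_job(volname, timeout=0.2):
    # stand-in for 'with lock_timeout_log(self.lock)': try the shared
    # lock, and report if it cannot be taken in time
    if not lock.acquire(timeout=timeout):
        return False          # in the bug, this wait never ends
    try:
        if volname not in q:
            q.append(volname)
            jobs[volname] = []
        return True
    finally:
        lock.release()

# simulate a cloner thread stuck holding the lock
lock.acquire()
assert queue_job("cephfs") is False   # the new clone request blocks here
lock.release()
assert queue_job("cephfs") is True    # with the lock free, it queues fine
```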
DEADLOCK EXPLAINED
------------------
The cloner thread from 'test1' is stuck in an infinite loop of 'lock - get job - register job - unlock - schedule clone - fail'.
When 'test1' completes, the 'fs' is destroyed along with its pools. The cloner thread, while fetching the next job inside the lock,
hangs reading the index in the 'get_oldest_clone_entry' routine since the 'fs' is no longer present:
def get_oldest_clone_entry(self, exclude=[]):
    min_ctime_entry = None
    exclude_tracking_ids = [v[0] for v in exclude]
    log.debug("excluded tracking ids: {0}".format(exclude_tracking_ids))
    # <---- HANGS SOMEWHERE AFTER THIS ----
    for entry in list_one_entry_at_a_time(self.fs, self.path):
        dname = entry.d_name
        dpath = os.path.join(self.path, dname)
        st = self.fs.lstat(dpath)
        if dname not in exclude_tracking_ids and stat.S_ISLNK(st.st_mode):
            if min_ctime_entry is None or st.st_ctime < min_ctime_entry[1].st_ctime:
                min_ctime_entry = (dname, st)
    if min_ctime_entry:
        try:
            linklen = min_ctime_entry[1].st_size
            sink_path = self.fs.readlink(os.path.join(self.path, min_ctime_entry[0]), CloneIndex.PATH_MAX)
            return (min_ctime_entry[0], sink_path[:linklen])
        except cephfs.Error as e:
            raise IndexException(-e.args[0], e.args[1])
    return None
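The selection logic of the routine can be re-implemented against a local directory (using os/stat in place of the cephfs bindings, with hypothetical tracking-id names) to show what a healthy scan returns; the real hang occurs inside the listing itself once the fs backing 'self.path' is gone:

```python
import os
import stat
import tempfile

def get_oldest_clone_entry(path, exclude=()):
    """Pick the symlink with the smallest ctime, skipping excluded ids.
    Local re-implementation of the index scan for illustration only."""
    oldest = None
    for dname in os.listdir(path):
        st = os.lstat(os.path.join(path, dname))
        if dname not in exclude and stat.S_ISLNK(st.st_mode):
            if oldest is None or st.st_ctime < oldest[1].st_ctime:
                oldest = (dname, st)
    if oldest:
        return (oldest[0], os.readlink(os.path.join(path, oldest[0])))
    return None

with tempfile.TemporaryDirectory() as d:
    # index entries are symlinks from tracking id to the clone path
    os.symlink("/volumes/clone-a", os.path.join(d, "track-a"))
    os.symlink("/volumes/clone-b", os.path.join(d, "track-b"))
    # excluding 'track-a' (already being processed) leaves 'track-b'
    entry = get_oldest_clone_entry(d, exclude=("track-a",))
```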
Now the new clone operation from 'test2' waits on the same lock forever and hangs.
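Putting the two sides together, the shape of the deadlock can be reproduced standalone (illustrative names; an Event stands in for the destroyed fs, and a busy-wait stands in for the index read that never returns). The timeout on the second acquire exists only so this sketch terminates; in the bug there is none:

```python
import threading
import time

shared_lock = threading.Lock()
fs_gone = threading.Event()
fs_gone.set()    # 'test1' finished: the fs and its pools are destroyed

def cloner_thread():
    with shared_lock:             # get_job path: lock held while...
        while fs_gone.is_set():   # ...the index read never returns
            time.sleep(0.05)

def queue_job_from_test2():
    # the new clone operation needs the same lock
    return shared_lock.acquire(timeout=0.3)

t = threading.Thread(target=cloner_thread, daemon=True)
t.start()
time.sleep(0.1)                          # let the cloner grab the lock
assert queue_job_from_test2() is False   # 'test2''s clone hangs here
fs_gone.clear()                          # not possible in the real bug
t.join()
```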
Related issues
History
#1 Updated by Kotresh Hiremath Ravishankar almost 2 years ago
- Component(FS) mgr/volumes added
- Labels (FS) task(medium) added
#2 Updated by Kotresh Hiremath Ravishankar almost 2 years ago
- File logs.tar.gz added
The tests were run and the mgr logs are attached. The code was instrumented a bit to log the callers of the locks.
mgr.x.log contains the logs after 'test_clone_failure_status_failed' was run. You can verify that the
thread lock is taken and not released after the test completes. It doesn't contain logs of the hang itself, as the
second test was not run in this instance.
----------
2022-04-07T12:29:17.014+0530 7f018b12e700 0 [volumes DEBUG mgr_util] locking <unlocked _thread.lock object at 0x7f01931b04e0> with 5 timeout caller: enter <-- run <-- _bootstrap_inner
2022-04-07T12:29:17.014+0530 7f018b12e700 0 [volumes DEBUG mgr_util] locked <locked _thread.lock object at 0x7f01931b04e0> caller: enter <-- run <-- _bootstrap_inner
2022-04-07T12:29:17.014+0530 7f018b12e700 0 [volumes DEBUG volumes.fs.async_job] processing 1 volume entries
2022-04-07T12:29:17.014+0530 7f018b12e700 0 [volumes DEBUG volumes.fs.async_cloner] fetching clone entry for volume 'cephfs'
2022-04-07T12:29:17.014+0530 7f018b12e700 0 [volumes DEBUG mgr_util] [get] volume: (cephfs) connection: ([<mgr_util.CephfsConnectionPool.Connection object at 0x7f018f953f40>, <mgr_util.CephfsConnectionPool.Connection object at 0x7f018f953fd0>])
-------------------
This is the last instance of a successful lock acquisition; after this point, it is a deadlock.
#3 Updated by Kotresh Hiremath Ravishankar almost 2 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 45816
#4 Updated by Venky Shankar almost 2 years ago
- Category set to Correctness/Safety
- Assignee set to Kotresh Hiremath Ravishankar
- Target version set to v18.0.0
- Backport set to quincy, pacific
#5 Updated by Kotresh Hiremath Ravishankar almost 2 years ago
- Status changed from Fix Under Review to Pending Backport
#6 Updated by Backport Bot almost 2 years ago
- Copied to Backport #55352: pacific: pybind/mgr/volumes: Clone operation hangs added
#7 Updated by Backport Bot almost 2 years ago
- Copied to Backport #55353: quincy: pybind/mgr/volumes: Clone operation hangs added
#8 Updated by Backport Bot over 1 year ago
- Tags set to backport_processed
#9 Updated by Konstantin Shalygin over 1 year ago
- Status changed from Pending Backport to Resolved
- % Done changed from 0 to 100
- Tags deleted (backport_processed)