Bug #46832
closedclient: static dirent for readdir is not thread-safe
0%
Description
2020-08-04T01:39:29.002 INFO:tasks.ceph.mgr.y.smithi185.stderr:Exception in thread puregejob.0: 2020-08-04T01:39:29.002 INFO:tasks.ceph.mgr.y.smithi185.stderr:Traceback (most recent call last): 2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner 2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr: self.run() 2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 61, in run 2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr: self.async_job.unregister_async_job(vol_job[0], vol_job[1], thread_id) 2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 152, in unregister_async_job 2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr: log.debug("unregistering async job {0}.{1} from thread {2}".format(volname, job, thread_id)) 2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/lib64/python2.7/logging/__init__.py", line 1137, in debug 2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr: self._log(DEBUG, msg, args, **kwargs) 2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log 2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr: self.handle(record) 2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle 2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr: self.callHandlers(record) 2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers 2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr: hdlr.handle(record) 2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/lib63/python2.7/logging/__init__.py", line 749, in handle 2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr: self.emit(record) 2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr: File "/usr/share/ceph/mgr/mgr_module.py", line 65, in emit 2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr: self._module._ceph_log(ceph_level, self.format(record)) 2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr:TypeError: must be string without null bytes, not str
From: /ceph/teuthology-archive/teuthology-2020-08-04_01:12:01-fs-nautilus-distro-basic-smithi/5285246/teuthology.log
Updated by Venky Shankar over 3 years ago
Real backtrace:
020-08-04 01:39:27.841 7fe6face0700 1 mgr[volumes] traceback: Traceback (most recent call last): File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 44, in run self.async_job.execute_job(vol_job[0], vol_job[1], should_cancel=lambda: thread_id.should_cancel()) File "/usr/share/ceph/mgr/volumes/fs/purge_queue.py", line 58, in execute_job purge_trash_entry_for_volume(self.vc, volname, job, should_cancel) File "/usr/share/ceph/mgr/volumes/fs/purge_queue.py", line 37, in purge_trash_entry_for_volume trashcan.purge(purge_dir, should_cancel) File "/usr/share/ceph/mgr/volumes/fs/operations/trash.py", line 88, in purge rmtree(trashpath) File "/usr/share/ceph/mgr/volumes/fs/operations/trash.py", line 72, in rmtree rmtree(d_full) File "/usr/share/ceph/mgr/volumes/fs/operations/trash.py", line 64, in rmtree log.debug("rmtree {0}".format(root_path)) File "/usr/lib64/python2.7/logging/__init__.py", line 1137, in debug self._log(DEBUG, msg, args, **kwargs) File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log self.handle(record) File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle self.callHandlers(record) File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers hdlr.handle(record) File "/usr/lib64/python2.7/logging/__init__.py", line 749, in handle self.emit(record) File "/usr/share/ceph/mgr/mgr_module.py", line 65, in emit self._module._ceph_log(ceph_level, self.format(record)) TypeError: must be string without null bytes, not str
Looks like "root_path" variable in `trash::purge::rmtree()` had a '\0' char and python's logging module did not like it.
Updated by Venky Shankar over 3 years ago
So, this is probably a unicode+py2 thingy. One way to fix this is to log raw strings by prefixing the log message with r/R -- r'... {0} {1}".format(...).
Would we want to do this to all log messages in mgr/volumes? Or, does anyone has a better fix for this?
Updated by Patrick Donnelly over 3 years ago
Debugging branch: https://github.com/ceph/ceph/pull/36483
Updated by Venky Shankar over 3 years ago
- Has duplicate Bug #44318: nautilus: mgr/volumes: exception when logging message (in logging::log()) added
Updated by Patrick Donnelly over 3 years ago
- Subject changed from nautilus: pybind/mgr/volumes: TypeError: must be string without null bytes, not str to client: static dirent for readdir is not thread-safe
- Status changed from New to In Progress
- Assignee changed from Venky Shankar to Patrick Donnelly
- Component(FS) Client added
Another instance with debugging: https://pulpito.ceph.com/pdonnell-2020-08-06_15:29:50-fs-wip-pdonnell-testing-20200806.033244-nautilus-distro-basic-smithi/5292723/
This time in another location:
2020-08-06 17:33:25.228 7fb03c034700 1 mgr[volumes] traceback: Traceback (most recent call last): File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 41, in run self.async_job.register_async_job(vol_job[0], vol_job[1], thread_id) File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 148, in register_async_job log.debug("registering async job {0}.{1} with thread {2}".format(volname, job, thread_id)) File "/usr/lib/python2.7/logging/__init__.py", line 1155, in debug self._log(DEBUG, msg, args, **kwargs) File "/usr/lib/python2.7/logging/__init__.py", line 1286, in _log self.handle(record) File "/usr/lib/python2.7/logging/__init__.py", line 1296, in handle self.callHandlers(record) File "/usr/lib/python2.7/logging/__init__.py", line 1336, in callHandlers hdlr.handle(record) File "/usr/lib/python2.7/logging/__init__.py", line 759, in handle self.emit(record) File "/usr/share/ceph/mgr/mgr_module.py", line 65, in emit self._module._ceph_log(ceph_level, self.format(record)) TypeError: log() argument 2 must be string without null bytes, not str
From: /ceph/teuthology-archive/pdonnell-2020-08-06_15:29:50-fs-wip-pdonnell-testing-20200806.033244-nautilus-distro-basic-smithi/5292723/remote/smithi040/log/ceph-mgr.x.log.gz
The eureka moment here is the interleaved readdir operations from the purge queue threads. The problem is two libcephfs handles writing to the same dirent: https://github.com/ceph/ceph/blob/f95259251c1a51e8d60beb269ec8ecd371fc4c14/src/client/Client.cc#L8515
I'm working on a fix.
Updated by Patrick Donnelly over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID set to 36503
Updated by Patrick Donnelly over 3 years ago
- Related to Fix #46851: qa: add debugging for volumes plugin use of libcephfs added
Updated by Patrick Donnelly over 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Patrick Donnelly over 3 years ago
- Copied to Backport #46855: octopus: client: static dirent for readdir is not thread-safe added
Updated by Patrick Donnelly over 3 years ago
- Copied to Backport #46856: nautilus: client: static dirent for readdir is not thread-safe added
Updated by Nathan Cutler over 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".