Project

General

Profile

Bug #46832

client: static dirent for readdir is not thread-safe

Added by Patrick Donnelly 3 months ago. Updated 2 days ago.

Status:
Resolved
Priority:
Immediate
Category:
-
Target version:
% Done:

0%

Source:
Q/A
Tags:
Backport:
octopus,nautilus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Client, mgr/volumes
Labels (FS):
crash
Pull request ID:
Crash signature:

Description

2020-08-04T01:39:29.002 INFO:tasks.ceph.mgr.y.smithi185.stderr:Exception in thread puregejob.0:
2020-08-04T01:39:29.002 INFO:tasks.ceph.mgr.y.smithi185.stderr:Traceback (most recent call last):
2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self.run()
2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 61, in run
2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self.async_job.unregister_async_job(vol_job[0], vol_job[1], thread_id)
2020-08-04T01:39:29.003 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 152, in unregister_async_job
2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr:    log.debug("unregistering async job {0}.{1} from thread {2}".format(volname, job, thread_id))
2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/lib64/python2.7/logging/__init__.py", line 1137, in debug
2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self._log(DEBUG, msg, args, **kwargs)
2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log
2020-08-04T01:39:29.004 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self.handle(record)
2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle
2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self.callHandlers(record)
2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers
2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr:    hdlr.handle(record)
2020-08-04T01:39:29.005 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/lib63/python2.7/logging/__init__.py", line 749, in handle
2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self.emit(record)
2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr:  File "/usr/share/ceph/mgr/mgr_module.py", line 65, in emit
2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr:    self._module._ceph_log(ceph_level, self.format(record))
2020-08-04T01:39:29.006 INFO:tasks.ceph.mgr.y.smithi185.stderr:TypeError: must be string without null bytes, not str

From: /ceph/teuthology-archive/teuthology-2020-08-04_01:12:01-fs-nautilus-distro-basic-smithi/5285246/teuthology.log


Related issues

Related to fs - Fix #46851: qa: add debugging for volumes plugin use of libcephfs Resolved
Duplicated by fs - Bug #44318: nautilus: mgr/volumes: exception when logging message (in logging::log()) Duplicate
Copied to fs - Backport #46855: octopus: client: static dirent for readdir is not thread-safe Resolved
Copied to fs - Backport #46856: nautilus: client: static dirent for readdir is not thread-safe Resolved

History

#1 Updated by Venky Shankar 3 months ago

Real backtrace:

020-08-04 01:39:27.841 7fe6face0700  1 mgr[volumes] traceback: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 44, in run
    self.async_job.execute_job(vol_job[0], vol_job[1], should_cancel=lambda: thread_id.should_cancel())
  File "/usr/share/ceph/mgr/volumes/fs/purge_queue.py", line 58, in execute_job
    purge_trash_entry_for_volume(self.vc, volname, job, should_cancel)
  File "/usr/share/ceph/mgr/volumes/fs/purge_queue.py", line 37, in purge_trash_entry_for_volume
    trashcan.purge(purge_dir, should_cancel)
  File "/usr/share/ceph/mgr/volumes/fs/operations/trash.py", line 88, in purge
    rmtree(trashpath)
  File "/usr/share/ceph/mgr/volumes/fs/operations/trash.py", line 72, in rmtree
    rmtree(d_full)
  File "/usr/share/ceph/mgr/volumes/fs/operations/trash.py", line 64, in rmtree
    log.debug("rmtree {0}".format(root_path))
  File "/usr/lib64/python2.7/logging/__init__.py", line 1137, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1268, in _log
    self.handle(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1278, in handle
    self.callHandlers(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 1318, in callHandlers
    hdlr.handle(record)
  File "/usr/lib64/python2.7/logging/__init__.py", line 749, in handle
    self.emit(record)
  File "/usr/share/ceph/mgr/mgr_module.py", line 65, in emit
    self._module._ceph_log(ceph_level, self.format(record))
TypeError: must be string without null bytes, not str

Looks like "root_path" variable in `trash::purge::rmtree()` had a '\0' char and python's logging module did not like it.

#2 Updated by Venky Shankar 3 months ago

So, this is probably a unicode+py2 thingy. One way to fix this is to log raw strings by prefixing the log message with r/R -- r'... {0} {1}".format(...).

Would we want to do this to all log messages in mgr/volumes? Or, does anyone has a better fix for this?

#4 Updated by Venky Shankar 3 months ago

  • Duplicated by Bug #44318: nautilus: mgr/volumes: exception when logging message (in logging::log()) added

#5 Updated by Patrick Donnelly 3 months ago

  • Subject changed from nautilus: pybind/mgr/volumes: TypeError: must be string without null bytes, not str to client: static dirent for readdir is not thread-safe
  • Status changed from New to In Progress
  • Assignee changed from Venky Shankar to Patrick Donnelly
  • Component(FS) Client added

Another instance with debugging: https://pulpito.ceph.com/pdonnell-2020-08-06_15:29:50-fs-wip-pdonnell-testing-20200806.033244-nautilus-distro-basic-smithi/5292723/

This time in another location:

2020-08-06 17:33:25.228 7fb03c034700  1 mgr[volumes] traceback: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 41, in run
    self.async_job.register_async_job(vol_job[0], vol_job[1], thread_id)
  File "/usr/share/ceph/mgr/volumes/fs/async_job.py", line 148, in register_async_job
    log.debug("registering async job {0}.{1} with thread {2}".format(volname, job, thread_id))
  File "/usr/lib/python2.7/logging/__init__.py", line 1155, in debug
    self._log(DEBUG, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1286, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1296, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 759, in handle
    self.emit(record)
  File "/usr/share/ceph/mgr/mgr_module.py", line 65, in emit
    self._module._ceph_log(ceph_level, self.format(record))
TypeError: log() argument 2 must be string without null bytes, not str

From: /ceph/teuthology-archive/pdonnell-2020-08-06_15:29:50-fs-wip-pdonnell-testing-20200806.033244-nautilus-distro-basic-smithi/5292723/remote/smithi040/log/ceph-mgr.x.log.gz

The eureka moment here is the interleaved readdir operations from the purge queue threads. The problem is two libcephfs handles writing to the same dirent: https://github.com/ceph/ceph/blob/f95259251c1a51e8d60beb269ec8ecd371fc4c14/src/client/Client.cc#L8515

I'm working on a fix.

#6 Updated by Patrick Donnelly 3 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 36503

#7 Updated by Patrick Donnelly 3 months ago

  • Related to Fix #46851: qa: add debugging for volumes plugin use of libcephfs added

#8 Updated by Patrick Donnelly 3 months ago

  • Status changed from Fix Under Review to Pending Backport

#9 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #46855: octopus: client: static dirent for readdir is not thread-safe added

#10 Updated by Patrick Donnelly 3 months ago

  • Copied to Backport #46856: nautilus: client: static dirent for readdir is not thread-safe added

#11 Updated by Nathan Cutler 2 days ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

Also available in: Atom PDF