Bug #21512

qa: libcephfs_interface_tests: shutdown race failures

Added by Patrick Donnelly almost 2 years ago. Updated over 1 year ago.

Status: Resolved
Priority: High
Assignee:
Category: Correctness/Safety
Target version: -
Start date: 09/22/2017
Due date:
% Done: 0%
Source: Development
Tags:
Backport: luminous
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite: multimds
Component(FS): Client
Labels (FS):
Pull request ID:

Description

Failure: Command failed (workunit test libcephfs/test.sh) on smithi150 with status 134: 'mkdir -p -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && cd -- /home/ubuntu/cephtest/mnt.0/client.0/tmp && CEPH_CLI_TEST_DUP_COMMAND=1 CEPH_REF=wip-pdonnell-testing-20170921 TESTDIR="/home/ubuntu/cephtest" CEPH_ARGS="--cluster ceph" CEPH_ID="0" PATH=$PATH:/usr/sbin CEPH_BASE=/home/ubuntu/cephtest/clone.client.0 CEPH_ROOT=/home/ubuntu/cephtest/clone.client.0 adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 3h /home/ubuntu/cephtest/clone.client.0/qa/workunits/libcephfs/test.sh'
7 jobs: ['1656945', '1657163', '1657213', '1657315', '1657101', '1657267', '1657025']
suites intersection: ['clusters/fixed-2-ucephfs.yaml', 'frag_enable.yaml', 'fs/basic_workload/{begin.yaml', 'mount/fuse.yaml', 'overrides/{debug.yaml', 'tasks/libcephfs_interface_tests.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']
suites union: ['clusters/fixed-2-ucephfs.yaml', 'frag_enable.yaml', 'fs/basic_workload/{begin.yaml', 'inline/no.yaml', 'inline/yes.yaml', 'mount/fuse.yaml', 'objectstore/bluestore.yaml', 'objectstore/filestore-xfs.yaml', 'omap_limit/10.yaml', 'omap_limit/10000.yaml', 'overrides/{debug.yaml', 'tasks/libcephfs_interface_tests.yaml}', 'whitelist_health.yaml', 'whitelist_wrongly_marked_down.yaml}']

From this test run: pdonnell-2017-09-22_01:39:19-fs-wip-pdonnell-testing-20170921-testing-basic-smithi

Here's one backtrace (N.B. the other tests may have failed differently):

(gdb) bt
#0  0x00007f03bd6551f7 in raise () from /lib64/libc.so.6
#1  0x00007f03bd6568e8 in abort () from /lib64/libc.so.6
#2  0x00007f03bd694f47 in __libc_message () from /lib64/libc.so.6
#3  0x00007f03bd69c619 in _int_free () from /lib64/libc.so.6
#4  0x00007f03b4e58002 in _M_dispose (__a=..., this=<optimized out>) at /usr/include/c++/4.8.2/bits/basic_string.h:249
#5  ~basic_string (this=0x556780688280, __in_chrg=<optimized out>) at /usr/include/c++/4.8.2/bits/basic_string.h:539
#6  _Destroy<std::basic_string<char> > (__pointer=0x556780688280) at /usr/include/c++/4.8.2/bits/stl_construct.h:93
#7  __destroy<std::basic_string<char>*> (__last=<optimized out>, __first=0x556780688280) at /usr/include/c++/4.8.2/bits/stl_construct.h:103
#8  _Destroy<std::basic_string<char>*> (__last=<optimized out>, __first=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_construct.h:126
#9  _Destroy<std::basic_string<char>*, std::basic_string<char> > (__last=0x556780688290, __first=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_construct.h:151
#10 _M_erase_at_end (this=<optimized out>, __pos=<optimized out>) at /usr/include/c++/4.8.2/bits/stl_vector.h:1352
#11 clear (this=0x7f03b541ea20 <env_to_vec(std::vector<char const*, std::allocator<char const*> >&, char const*)::str_vec>) at /usr/include/c++/4.8.2/bits/stl_vector.h:1126
#12 env_to_vec (args=std::vector of length 0, capacity 0, name=<optimized out>) at /usr/src/debug/ceph-13.0.0-1081-ga4aea1b/src/common/ceph_argparse.cc:103
#13 0x00007f03bec6d3c0 in conf_parse_env (this=<optimized out>, name=<optimized out>) at /usr/src/debug/ceph-13.0.0-1081-ga4aea1b/src/libcephfs.cc:205
#14 ceph_conf_parse_env (cmount=<optimized out>, name=<optimized out>) at /usr/src/debug/ceph-13.0.0-1081-ga4aea1b/src/libcephfs.cc:374
#15 0x000055677f2cd839 in shutdown_racer_func () at /usr/src/debug/ceph-13.0.0-1081-ga4aea1b/src/test/libcephfs/test.cc:1873
#16 0x00007f03bdfb02b0 in ?? () from /lib64/libstdc++.so.6
#17 0x00007f03bea45e25 in start_thread () from /lib64/libpthread.so.0
#18 0x00007f03bd71834d in clone () from /lib64/libc.so.6

You can debug the above using docker container pdonnell/ceph-ci:wip-pdonnell-testing-20170921-centos-7 on senta02.

Attachment: 0001-client-make-the-ShutdownRacer-test-even-more-thrashy.patch (1.97 KB), Jeff Layton, 10/04/2017 04:19 PM


Related issues

Copied to fs - Backport #21874: luminous: qa: libcephfs_interface_tests: shutdown race failures Resolved

History

#1 Updated by Jeff Layton almost 2 years ago

Ouch! Looks like env_to_vec is also not threadsafe:

  static vector<string> str_vec;
  std::vector<const char*> env;
  str_vec.clear();

Not sure why a static variable is used there; could it all be done on-stack? In any case, it may be a week or so before I can get to this, if you want someone else to have a look in the meantime.

#2 Updated by Jeff Layton almost 2 years ago

Nope, can't be done all on stack. This code is pretty reliant on keeping pointers around to some static stuff. What we'll probably want to do is ensure that str_vec is only populated once, by the first thread to get there. We may need a mutex and a flag to do this properly.

This function is not usually called frequently though, so I think the danger here is relatively low. This test is somewhat abusive. We do want to fix it, but we can disable this test in the meantime if it becomes too irksome.
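The mutex-and-flag idea could look something like this (a minimal sketch with invented names, not the actual ceph_argparse code), using std::call_once so the static vector is populated exactly once by whichever thread gets there first:

```cpp
#include <cstdlib>
#include <mutex>
#include <sstream>
#include <string>
#include <vector>

// Illustrative stand-in for env_to_vec's static str_vec: populate the
// static vector exactly once, no matter how many threads race here.
static const std::vector<std::string>& env_args_once(const char* name) {
    static std::vector<std::string> str_vec;
    static std::once_flag parsed;
    std::call_once(parsed, [name] {
        if (const char* val = std::getenv(name)) {
            std::istringstream iss(val);
            std::string word;
            while (iss >> word)
                str_vec.push_back(word);
        }
    });
    return str_vec;  // safe to hand out pointers into these strings
}
```

Every later caller gets the same vector back, so pointers into the strings stay valid, unlike the current code, which clears and refills the static vector on each call.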

#3 Updated by Jeff Layton over 1 year ago

Patch to make the ShutdownRacer test even more thrashy. This has each thread do the setup and teardown in a tight loop. With this, the test seems to crash almost all the time if I have lockdep = true. With lockdep = false, it mostly survives.

I tried just setting lockdep to false in the test itself, but that's not really enough. lockdep is enabled automatically when the cct is allocated. The config handling for it is also questionable.

I've got a patch to fix up the env_to_vec races as well. I'll post that patch fairly soon. I haven't personally hit that race, but it does seem possible.

I think in the near term, we'll just have to ensure that any manila deployments (in particular) have lockdep disabled. I'd imagine that's already the case, but it'd be good to spell it out.

Longer term we may need to look at making lockdep be a global object. I don't see how we make this work otherwise.
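One conventional C++ shape for a process-global lockdep (purely a sketch under assumed names, not Ceph's actual code) is a function-local static, which is initialized exactly once, is thread-safe since C++11, and never gets torn down when an individual CephContext goes away:

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Sketch of lockdep state as a process-wide singleton. A function-local
// static is constructed on first use and lives for the whole process, so
// lock ids stay stable across context register/unregister cycles.
struct LockdepState {
    std::mutex lock;
    std::unordered_map<std::string, int> lock_ids;  // lock name -> stable id
    int next_id = 0;
};

static LockdepState& lockdep_state() {
    static LockdepState state;  // thread-safe one-time init (C++11)
    return state;
}

static int lockdep_get_id(const std::string& name) {
    LockdepState& s = lockdep_state();
    std::lock_guard<std::mutex> g(s.lock);
    auto it = s.lock_ids.find(name);
    if (it != s.lock_ids.end())
        return it->second;
    return s.lock_ids[name] = s.next_id++;
}
```

Because the state is never destroyed, there is no register/unregister window in which two contexts can disagree about what an id means.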

#4 Updated by Jeff Layton over 1 year ago

I'm now looking for ways to selectively disable lockdep for just this test. So far, I've been unable to do so:

(gdb) bt
#0  0x00007fffedb5a0f0 in lockdep_register_ceph_context(CephContext*)@plt () from /home/jlayton/git/ceph/build/lib/libceph-common.so.0
#1  0x00007fffeddf5e56 in (anonymous namespace)::LockdepObs::handle_conf_change (this=0x555555918540, conf=0x55555587b7c0, changed=...)
    at /home/jlayton/git/ceph/src/common/ceph_context.cc:62
#2  0x00007fffeded0d33 in md_config_t::_apply_changes (this=<optimized out>, oss=<optimized out>) at /home/jlayton/git/ceph/src/common/config.cc:721
#3  0x00007fffeded1070 in md_config_t::apply_changes (this=0x55555587b7c0, oss=oss@entry=0x0) at /home/jlayton/git/ceph/src/common/config.cc:662
#4  0x00007ffff7ad0520 in ceph_mount_info::conf_read_file (this=0x55555591bc40, path_list=<optimized out>) at /home/jlayton/git/ceph/src/libcephfs.cc:188
#5  ceph_conf_read_file (cmount=0x55555591bc40, path=<optimized out>) at /home/jlayton/git/ceph/src/libcephfs.cc:368
#6  0x0000555555591f69 in update_root_mode () at /home/jlayton/git/ceph/src/test/libcephfs/main.cc:25
#7  main (argc=<optimized out>, argv=0x7fffffffda68) at /home/jlayton/git/ceph/src/test/libcephfs/main.cc:40

Basically, if we call into lockdep_register_ceph_context then we've failed to turn it off properly. Here, we're doing it due to the call to update_root_mode() at the beginning of the test. We could probably fix that by just unregistering after we start the test. But later...

(gdb) bt
#0  0x00007fffedb5a0f0 in lockdep_register_ceph_context(CephContext*)@plt () from /home/jlayton/git/ceph/build/lib/libceph-common.so.0
#1  0x00007fffeddf5e56 in (anonymous namespace)::LockdepObs::handle_conf_change (this=0x7fffcc087250, conf=0x7fffcc000aa0, changed=...)
    at /home/jlayton/git/ceph/src/common/ceph_context.cc:62
#2  0x00007fffeded0d33 in md_config_t::_apply_changes (this=<optimized out>, oss=<optimized out>) at /home/jlayton/git/ceph/src/common/config.cc:721
#3  0x00007fffeded1070 in md_config_t::apply_changes (this=0x7fffcc000aa0, oss=oss@entry=0x0) at /home/jlayton/git/ceph/src/common/config.cc:662
#4  0x00007ffff7ad0520 in ceph_mount_info::conf_read_file (this=0x7fffcc08b860, path_list=path_list@entry=0x0) at /home/jlayton/git/ceph/src/libcephfs.cc:188
#5  ceph_conf_read_file (cmount=0x7fffcc08b860, path=path@entry=0x0) at /home/jlayton/git/ceph/src/libcephfs.cc:368
#6  0x00005555555b0474 in shutdown_racer_func () at /home/jlayton/git/ceph/src/test/libcephfs/test.cc:1873
#7  0x00007ffff6fad01f in ?? () from /lib64/libstdc++.so.6
#8  0x00007ffff78a636d in start_thread () from /lib64/libpthread.so.0
#9  0x00007ffff6703bbf in clone () from /lib64/libc.so.6

Here the problem is that we're calling parse_config_files and apply_changes. So there doesn't seem to be any way to override the lockdep setting without the config file's setting temporarily winning out.

#5 Updated by Jeff Layton over 1 year ago

Looking now to see if we can somehow just fix up lockdep for this. Most of the problems I have seen are false positives for recursive-lock detection. That's because this is the check used when trying to take a lock:

    if (p->first == id) {
      lockdep_dout(0) << "\n";
      *_dout << "recursive lock of " << name << " (" << id << ")\n";
      BackTrace *bt = new BackTrace(BACKTRACE_SKIP);
      bt->print(*_dout);
      if (p->second) {
        *_dout << "\npreviously locked at\n";
        p->second->print(*_dout);
      }
      delete bt;
      *_dout << dendl;
      ceph_abort();
    }

...after lockdep has been shut down and restarted, we almost certainly end up hitting reuse of the ids there. Maybe at the very least, we should make the free_ids hash be persistent across invocations of lockdep_unregister_ceph_context. That might be enough to fix this.
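The id-reuse scenario can be modelled in a few lines (a toy model with invented names, not the real lockdep code): held locks are tracked per thread by integer id, so if a restart returns ids to the free pool, a brand-new lock can be issued a stale id that some thread still "holds", tripping the recursive-lock check even though the two locks are unrelated:

```cpp
#include <set>
#include <string>
#include <unordered_map>

// Toy reproduction of the id-reuse false positive. If lockdep restarts
// and reissues an old id to a brand-new lock, the recursive-lock check
// (p->first == id) fires spuriously.
namespace toy {

std::unordered_map<std::string, int> lock_ids;  // cleared on restart
std::set<int> free_ids;                          // ids recycled on restart
int max_id = 0;
std::set<int> held;                              // ids this thread holds

int get_id(const std::string& name) {
    auto it = lock_ids.find(name);
    if (it != lock_ids.end()) return it->second;
    int id;
    if (!free_ids.empty()) {                     // pre-fix behaviour: reuse
        id = *free_ids.begin();
        free_ids.erase(free_ids.begin());
    } else {
        id = max_id++;
    }
    return lock_ids[name] = id;
}

// Pre-fix restart: registered ids go back into the free pool.
void restart() {
    for (auto& p : lock_ids) free_ids.insert(p.second);
    lock_ids.clear();
}

bool would_report_recursive(const std::string& name) {
    return held.count(get_id(name)) != 0;        // the p->first == id check
}

}  // namespace toy
```

Keeping the freed-id bookkeeping persistent (so recycled ids are never reissued while any thread might still hold them) removes exactly this failure mode.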

#6 Updated by Jeff Layton over 1 year ago

I have a PR up that seems to fix this, but it may not be what we need. env_to_vec ought to be reworked so that the caller manages the storage for the strings, rather than handing back just a vector of pointers into static storage.
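That reworked shape might look like this (signatures invented for illustration, not the actual PR): the parser returns owned strings by value, and any argv-style view of const char* is built from storage the caller keeps alive:

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Sketch: the caller owns the string storage. No static vector means
// concurrent callers cannot stomp on each other's pointers.
std::vector<std::string> env_to_strings(const char* name) {
    std::vector<std::string> out;
    if (const char* val = std::getenv(name)) {
        std::istringstream iss(val);
        std::string word;
        while (iss >> word)
            out.push_back(word);
    }
    return out;
}

// If a const char* view is needed (e.g. for argv-style parsing), build
// it from strings the caller keeps alive for the duration of use.
std::vector<const char*> to_argv(const std::vector<std::string>& args) {
    std::vector<const char*> v;
    for (const auto& s : args)
        v.push_back(s.c_str());
    return v;
}
```

The pointer view is only valid while the owning vector of strings is, which is exactly the lifetime contract the current static-vector design obscures.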

#7 Updated by Patrick Donnelly over 1 year ago

  • Status changed from New to Pending Backport

#8 Updated by Nathan Cutler over 1 year ago

  • Copied to Backport #21874: luminous: qa: libcephfs_interface_tests: shutdown race failures added

#10 Updated by Nathan Cutler over 1 year ago

  • Status changed from Pending Backport to Resolved
