https://tracker.ceph.com/https://tracker.ceph.com/favicon.ico2017-09-22T18:49:08ZCeph CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=996792017-09-22T18:49:08ZJeff Laytonjlayton@redhat.com
<ul></ul><p>Ouch! Looks like env_to_vec is also not threadsafe:</p>
<pre>
static vector<string> str_vec;
std::vector<const char*> env;
str_vec.clear();
</pre>
<p>Not sure why that uses static variable there, probably should be done all on-stack? In any case, may be a week or so before I can get to this if you want someone else to have a look in the meantime.</p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=997452017-09-23T11:48:15ZJeff Laytonjlayton@redhat.com
<ul></ul><p>Nope, can't be done all on stack. This code is pretty reliant on keeping pointers around to some static stuff. What we'll probably want to do is ensure that str_vec is only set once there, by the first thread to get there. We may need a mutex and a flag to do this properly.</p>
<p>This function is not usually called frequently though, so I think the danger here is relatively low. This test is somewhat abusive. We do want to fix it, but we can disable this test in the meantime if it becomes too irksome.</p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1002652017-10-04T16:22:38ZJeff Laytonjlayton@redhat.com
<ul><li><strong>File</strong> <a href="/attachments/download/3031/0001-client-make-the-ShutdownRacer-test-even-more-thrashy.patch">0001-client-make-the-ShutdownRacer-test-even-more-thrashy.patch</a> <a class="icon-only icon-magnifier" title="View" href="/attachments/3031/0001-client-make-the-ShutdownRacer-test-even-more-thrashy.patch">View</a> added</li></ul><p>Patch to make the ShutdownRace test even more thrashy. This has each thread do the setup and teardown in a tight loop. With this, the test seems to crash almost all the time if I have lockdep = true. With lockdep = false, it mostly survives.</p>
<p>I tried just setting lockdep to false in the test itself, but that's not really enough. lockdep is enabled automatically when the cct is allocated. The config handling for it is also questionable.</p>
<p>I've got a patch to fix up the env_to_vec races as well. I'll post that patch fairly soon. I haven't personally hit that race, but it does seem possible.</p>
<p>I think in the near term, we'll just have to ensure that any manila deployments (in particular) have lockdep disabled. I'd imagine that that is already the case, but it'd be good to spell that out.</p>
<p>Longer term we may need to look at making lockdep be a global object. I don't see how we make this work otherwise.</p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1002712017-10-04T18:42:31ZJeff Laytonjlayton@redhat.com
<ul></ul><p>I'm now looking for ways to selectively disable lockdep for just this test. So far, I've been unable to do so:</p>
<pre>
(gdb) bt
#0 0x00007fffedb5a0f0 in lockdep_register_ceph_context(CephContext*)@plt () from /home/jlayton/git/ceph/build/lib/libceph-common.so.0
#1 0x00007fffeddf5e56 in (anonymous namespace)::LockdepObs::handle_conf_change (this=0x555555918540, conf=0x55555587b7c0, changed=...)
at /home/jlayton/git/ceph/src/common/ceph_context.cc:62
#2 0x00007fffeded0d33 in md_config_t::_apply_changes (this=<optimized out>, oss=<optimized out>) at /home/jlayton/git/ceph/src/common/config.cc:721
#3 0x00007fffeded1070 in md_config_t::apply_changes (this=0x55555587b7c0, oss=oss@entry=0x0) at /home/jlayton/git/ceph/src/common/config.cc:662
#4 0x00007ffff7ad0520 in ceph_mount_info::conf_read_file (this=0x55555591bc40, path_list=<optimized out>) at /home/jlayton/git/ceph/src/libcephfs.cc:188
#5 ceph_conf_read_file (cmount=0x55555591bc40, path=<optimized out>) at /home/jlayton/git/ceph/src/libcephfs.cc:368
#6 0x0000555555591f69 in update_root_mode () at /home/jlayton/git/ceph/src/test/libcephfs/main.cc:25
#7 main (argc=<optimized out>, argv=0x7fffffffda68) at /home/jlayton/git/ceph/src/test/libcephfs/main.cc:40
</pre>
<p>Basically, if we call into lockdep_register_ceph_context then we've failed to turn it off properly. Here, we're doing it due to the call to update_root_mode() at the beginning of the test. We could probably fix that by just unregistering after we start the test. But later...</p>
<pre>
(gdb) bt
#0 0x00007fffedb5a0f0 in lockdep_register_ceph_context(CephContext*)@plt () from /home/jlayton/git/ceph/build/lib/libceph-common.so.0
#1 0x00007fffeddf5e56 in (anonymous namespace)::LockdepObs::handle_conf_change (this=0x7fffcc087250, conf=0x7fffcc000aa0, changed=...)
at /home/jlayton/git/ceph/src/common/ceph_context.cc:62
#2 0x00007fffeded0d33 in md_config_t::_apply_changes (this=<optimized out>, oss=<optimized out>) at /home/jlayton/git/ceph/src/common/config.cc:721
#3 0x00007fffeded1070 in md_config_t::apply_changes (this=0x7fffcc000aa0, oss=oss@entry=0x0) at /home/jlayton/git/ceph/src/common/config.cc:662
#4 0x00007ffff7ad0520 in ceph_mount_info::conf_read_file (this=0x7fffcc08b860, path_list=path_list@entry=0x0) at /home/jlayton/git/ceph/src/libcephfs.cc:188
#5 ceph_conf_read_file (cmount=0x7fffcc08b860, path=path@entry=0x0) at /home/jlayton/git/ceph/src/libcephfs.cc:368
#6 0x00005555555b0474 in shutdown_racer_func () at /home/jlayton/git/ceph/src/test/libcephfs/test.cc:1873
#7 0x00007ffff6fad01f in ?? () from /lib64/libstdc++.so.6
#8 0x00007ffff78a636d in start_thread () from /lib64/libpthread.so.0
#9 0x00007ffff6703bbf in clone () from /lib64/libc.so.6
</pre>
<p>Here the problem is that we're calling parse_config_files and apply_changes. So there doesn't seem to be any way to override the lockdep setting without the config file's setting temporarily winning out for a time.</p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1002902017-10-05T12:04:07ZJeff Laytonjlayton@redhat.com
<ul></ul><p>Looking now to see if we can somehow just fix up lockdep for this. Most of the problems I have seen have seen are false positives for recursive locks being detected. That's because we're using this as the check when trying to take a lock:</p>
<pre>
if (p->first == id) {
lockdep_dout(0) << "\n";
*_dout << "recursive lock of " << name << " (" << id << ")\n";
BackTrace *bt = new BackTrace(BACKTRACE_SKIP);
bt->print(*_dout);
if (p->second) {
*_dout << "\npreviously locked at\n";
p->second->print(*_dout);
}
delete bt;
*_dout << dendl;
ceph_abort();
}
</pre>
<p>...after lockdep has been shut down and restarted, we almost certainly end up hitting reuse of the ids there. Maybe at the very least, we should make the free_ids hash be persistent across invocations of lockdep_unregister_ceph_context. That might be enough to fix this.</p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1003892017-10-06T14:21:47ZJeff Laytonjlayton@redhat.com
<ul></ul><p>I have a PR up that seems to fix this, but it may not be what we need. env_to_vec seems like it ought to be reworked such that the caller manages the storage for the strings, rather than just a vec of pointers.</p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1007872017-10-14T00:16:15ZPatrick Donnellypdonnell@redhat.com
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Pending Backport</i></li></ul> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1011012017-10-20T09:30:22ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Copied to</strong> <i><a class="issue tracker-9 status-3 priority-4 priority-default closed" href="/issues/21874">Backport #21874</a>: luminous: qa: libcephfs_interface_tests: shutdown race failures</i> added</li></ul> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1011562017-10-20T15:21:54ZKen Dreyerkdreyer@redhat.com
<ul></ul><p>PR to master was <a class="external" href="https://github.com/ceph/ceph/pull/18139">https://github.com/ceph/ceph/pull/18139</a></p> CephFS - Bug #21512: qa: libcephfs_interface_tests: shutdown race failureshttps://tracker.ceph.com/issues/21512?journal_id=1058832018-01-25T19:11:21ZNathan Cutlerncutler@suse.cz
<ul><li><strong>Status</strong> changed from <i>Pending Backport</i> to <i>Resolved</i></li></ul>