Bug #24137
closedclient: segfault in trim_caps
0%
Description
ceph version 12.2.4-XXX (78f60b924802e34d44f7078029a40dbe6c0c922f) luminous (stable) 1: (()+0x6d4d94) [0x560100b8cd94] 2: (()+0x11390) [0x7f530ac3e390] 3: (Inode::get()+0x3c) [0x5601007792ec] 4: (Client::trim_caps(MetaSession*, int)+0x123) [0x56010071b493] 5: (Client::handle_client_session(MClientSession*)+0x331) [0x56010071c581] 6: (Client::ms_dispatch(Message*)+0x4cf) [0x56010074abdf] 7: (DispatchQueue::entry()+0xf4a) [0x560100b2347a] 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x56010085913d] 9: (()+0x76ba) [0x7f530ac346ba] 10: (clone()+0x6d) [0x7f5309a9c41d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Files
Updated by Patrick Donnelly almost 6 years ago
Reasonable assumption about this crash is either the inode was deleted (in which case the Cap should have been deleted too) or the inode is a nullptr. The latter should never happen AFAICT as the inode is set only when the cap is created.
I couldn't find a way for the inode/cap to be deleted while the cap remains in the session cap list. Ideas?
Updated by Zheng Yan almost 6 years ago
commit 4dda1b6ead6b1f04f996403881f61e9b7d94dba0 Author: Patrick Donnelly <pdonnell@redhat.com> Date: Sun Nov 19 15:08:18 2017 -0800 client: anchor Inode while trimming caps This prevents the Inode from being deleted until after cap trimming is finished. In particular, this prevents remove_all_caps from being called which screws up the traversal of caps in trim_caps. Fixes: http://tracker.ceph.com/issues/22157 Signed-off-by: Patrick Donnelly <pdonnell@redhat.com> (cherry picked from commit 1439337e949c9fcb7d15eb38c22d19eb57d3d0f2)
I think above commit isn't quite right. how about patch below
diff --git a/src/client/Client.cc b/src/client/Client.cc index 5bbe449974..2885a87e48 100644 --- a/src/client/Client.cc +++ b/src/client/Client.cc @@ -4113,14 +4119,30 @@ void Client::trim_caps(MetaSession *s, uint64_t max) uint64_t trimmed = 0; auto p = s->caps.begin(); - std::set<InodeRef> anchor; /* prevent put_inode from deleting all caps during traversal */ - while ((caps_size - trimmed) > max && !p.end()) { - Cap *cap = *p; - InodeRef in(cap->inode); + + InodeRef next_in; + Cap *next_cap; + if (p.end()) { + next_cap = nullptr; + } else { + next_cap = *p; + next_in = next_cap->inode; + } + + while (next_cap && (caps_size - trimmed) > max) { + Cap *cap = next_cap; + InodeRef in; + in.swap(next_in); // Increment p early because it will be invalidated if cap // is deleted inside remove_cap ++p; + if (p.end()) { + next_cap = nullptr; + } else { + next_cap = *p; + next_in = next_cap->inode; + } if (in->caps.size() > 1 && cap != in->auth_cap) { int mine = cap->issued | cap->implemented; @@ -4129,7 +4151,6 @@ void Client::trim_caps(MetaSession *s, uint64_t max) if (!(get_caps_used(in.get()) & ~oissued & mine)) { ldout(cct, 20) << " removing unused, unneeded non-auth cap on " << *in << dendl; cap = (remove_cap(cap, true), nullptr); - /* N.B. no need to push onto anchor, as we are only removing one cap */ trimmed++; } } else { @@ -4148,8 +4169,6 @@ void Client::trim_caps(MetaSession *s, uint64_t max) // the end of this function. _schedule_invalidate_dentry_callback(dn, true); } - ldout(cct, 20) << " anchoring inode: " << in->ino << dendl; - anchor.insert(in); trim_dentry(dn); } else { ldout(cct, 20) << " not expirable: " << dn->name << dendl; @@ -4162,8 +4181,6 @@ void Client::trim_caps(MetaSession *s, uint64_t max) } } } - ldout(cct, 20) << " clearing anchored inodes" << dendl; - anchor.clear(); caps_size = s->caps.size(); if (caps_size > (size_t)max)
Updated by Patrick Donnelly almost 6 years ago
Zheng Yan wrote:
[...]
I think above commit isn't quite right. how about patch below
[...]
I'm not seeing how that helps the situation. The issue my patch was trying to solve is preventing the next cap (what p points to after ++p) from being invalidated due to trim_dentry -> Client::unlink -> Dentry::put -> Inode::put -> intrusive_ptr_release -> Client::put_inode -> Client::remove_all_caps. This should be prevented by anchoring the Inode references as in the original PR?
We need a test case to definitively show the issue. I'm not willing to push any new fixes for this without a reproducer.
Updated by Zheng Yan almost 6 years ago
The problem is that anchor only pins current inode. Client::unlink() still may drop reference of its parent inode.
Updated by Patrick Donnelly almost 6 years ago
Zheng Yan wrote:
The problem is that anchor only pins current inode. Client::unlink() still may drop reference of its parent inode.
Okay that makes sense I think. We still need a test that demonstrates the problem.
Updated by Zheng Yan almost 6 years ago
- File ceph-client.27063.log ceph-client.27063.log added
- File test_trim_caps.cc test_trim_caps.cc added
compile test_trim_caps.cc with the newest libcephfs. set mds_min_caps_per_client to 1, set mds_max_ratio_caps_per_client to 0. then run test_trim_caps
[Current thread is 1 (Thread 0x7fffd97fa700 (LWP 27088))] Missing separate debuginfos, use: dnf debuginfo-install glibc-2.25-13.fc26.x86_64 libblkid-2.30.2-1.fc26.x86_64 libgcc-7.3.1-2.fc26.x86_64 libibverbs-1.2.1-4.fc26.x86_64 libnl3-3.3.0-1.fc26.x86_64 libstdc++-7.3.1-2.fc26.x86_64 libuuid-2.30.2-1.fc26.x86_64 lttng-ust-2.9.0-2.fc26.x86_64 nspr-4.19.0-1.fc26.x86_64 nss-3.36.1-1.0.fc26.x86_64 nss-softokn-3.36.1-1.0.fc26.x86_64 nss-softokn-freebl-3.36.1-1.0.fc26.x86_64 nss-util-3.36.1-1.0.fc26.x86_64 sqlite-libs-3.20.1-2.fc26.x86_64 userspace-rcu-0.9.3-2.fc26.x86_64 zlib-1.2.11-2.fc26.x86_64 (gdb) bt #0 0x00007fffd40004d8 in ?? () #1 0x00007fffee48bd9b in ceph::logging::log_clock::now (this=0x7fffd40004e8) at /home/zhyan/Ceph/ceph-2/src/log/LogClock.h:95 #2 ceph::logging::Log::create_entry (this=0x7fffd40004b8, level=level@entry=15, subsys=subsys@entry=21, expected_size=expected_size@entry=0x7ffff7dd5ed0 <Inode::get()::_log_exp_length>) at /home/zhyan/Ceph/ceph-2/src/log/Log.cc:263 #3 0x00007ffff7b82efd in Inode::get (this=0x7fffd4014820) at /home/zhyan/Ceph/ceph-2/src/client/Inode.cc:383 #4 0x00007ffff7aec2ca in intrusive_ptr_add_ref (in=<optimized out>) at /home/zhyan/Ceph/ceph-2/src/client/Client.cc:13974 #5 0x00007ffff7b2188a in boost::intrusive_ptr<Inode>::intrusive_ptr (add_ref=true, p=<optimized out>, this=0x7fffd97f75d8) at /mnt/misc/Ceph/ceph-2/build/boost/include/boost/smart_ptr/intrusive_ptr.hpp:69 #6 Client::trim_caps (this=this@entry=0x77edd0, s=s@entry=0x7958a8, max=1) at /home/zhyan/Ceph/ceph-2/src/client/Client.cc:4124 #7 0x00007ffff7b22652 in Client::handle_client_session (this=this@entry=0x77edd0, m=m@entry=0x7fffe4000f00) at /home/zhyan/Ceph/ceph-2/src/client/Client.cc:2102 #8 0x00007ffff7b5461f in Client::ms_dispatch (this=0x77edd0, m=0x7fffe4000f00) at /home/zhyan/Ceph/ceph-2/src/client/Client.cc:2544 #9 0x00007fffee4e3b4b in Messenger::ms_deliver_dispatch (m=0x7fffe4000f00, this=0x78c750) at /home/zhyan/Ceph/ceph-2/src/msg/Messenger.h:667 #10 DispatchQueue::entry (this=0x78c8d0) at /home/zhyan/Ceph/ceph-2/src/msg/DispatchQueue.cc:201 #11 0x00007fffee5862dd in DispatchQueue::DispatchThread::entry (this=<optimized out>) at /home/zhyan/Ceph/ceph-2/src/msg/DispatchQueue.h:101 #12 0x00007fffec76436d in start_thread () from /lib64/libpthread.so.0 #13 0x00007ffff6f47b4f in clone () from /lib64/libc.so.6
Updated by Patrick Donnelly almost 6 years ago
- Status changed from New to In Progress
- Assignee set to Patrick Donnelly
Updated by Patrick Donnelly almost 6 years ago
- Target version changed from v13.2.0 to v14.0.0
- Backport changed from luminous to mimic,luminous
Updated by Patrick Donnelly almost 6 years ago
- Status changed from In Progress to Pending Backport
Updated by Patrick Donnelly almost 6 years ago
- Copied to Backport #24185: luminous: client: segfault in trim_caps added
Updated by Nathan Cutler almost 6 years ago
- Copied to Backport #24186: mimic: client: segfault in trim_caps added
Updated by Nathan Cutler almost 6 years ago
- Status changed from Pending Backport to Resolved
Updated by Patrick Donnelly 11 months ago
- Related to Bug #61394: qa/quincy: cluster [WRN] evicting unresponsive client smithi152 (4298), after 303.726 seconds" in cluster log added