Bug #60629
opencrash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p != m.end())
0%
afc0135c3d4fe6a3ed1291d96d080d074a20fd2d0d0d6fa1c656dd165895de8b
c4ba364febcb504210a580b8ea77a2c6038c6e20d116ccf592f71543842437fd
Description
Assert condition: p != m.end()
Assert function: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]
Sanitized backtrace:
interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>) Server::_session_logged(Session*, unsigned long, bool, unsigned long, interval_set<inodeno_t, std::map> const&, unsigned long, interval_set<inodeno_t, std::map> const&, LogSegment*) C_MDS_session_finish::finish(int) MDSContext::complete(int) MDSIOContextBase::complete(int) MDSLogContextBase::complete(int) Finisher::finisher_thread_entry()
Crash dump sample:
{ "archived": "2023-04-13 09:43:58.669151", "assert_condition": "p != m.end()", "assert_file": "include/interval_set.h", "assert_func": "void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]", "assert_line": 567, "assert_msg": "include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f642c5d1700 time 2023-04-12T23:14:02.750565+0100\ninclude/interval_set.h: 567: FAILED ceph_assert(p != m.end())", "assert_thread_name": "MR_Finisher", "backtrace": [ "/lib64/libpthread.so.0(+0x12cf0) [0x7f643a003cf0]", "gsignal()", "abort()", "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f643b00b4e3]", "/usr/lib64/ceph/libceph-common.so.2(+0x26a64f) [0x7f643b00b64f]", "(interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2a9) [0x55abfb247b29]", "(Server::_session_logged(Session*, unsigned long, bool, unsigned long, interval_set<inodeno_t, std::map> const&, unsigned long, interval_set<inodeno_t, std::map> const&, LogSegment*)+0xf9) [0x55abfb1ece69]", "(C_MDS_session_finish::finish(int)+0x41) [0x55abfb25f711]", "(MDSContext::complete(int)+0x5f) [0x55abfb4b8fbf]", "(MDSIOContextBase::complete(int)+0x534) [0x55abfb4b9754]", "(MDSLogContextBase::complete(int)+0x5b) [0x55abfb4b9b1b]", "(Finisher::finisher_thread_entry()+0x18d) [0x7f643b0aa56d]", "/lib64/libpthread.so.0(+0x81cf) [0x7f6439ff91cf]", "clone()" ], "ceph_version": "17.2.6", "crash_id": "2023-04-12T22:14:02.752804Z_8b75c750-8b43-4c91-b120-446f93f96562", "entity_name": "mds.3177ab8603c724fe708f15f8148ba595387a8e81", "os_id": "almalinux", "os_name": "AlmaLinux", "os_version": "8.7 (Stone Smilodon)", "os_version_id": "8.7", "process_name": "ceph-mds", "stack_sig": "c4ba364febcb504210a580b8ea77a2c6038c6e20d116ccf592f71543842437fd", "timestamp": "2023-04-12T22:14:02.752804Z", "utsname_machine": "x86_64", "utsname_release": "4.18.0-425.10.1.el8_7.x86_64", "utsname_sysname": "Linux", "utsname_version": "#1 SMP Thu Jan 12 10:05:00 EST 2023" }
Updated by Milind Changire 10 months ago
- Related to Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) added
Updated by Dhairya Parmar 10 months ago
- Crash signature (v1) updated (diff)
I'm trying to think out loud and this is just a hypothesis:
Server::_session_logged() has this part of code where this crash happened:
if (client_reconnect_gather.erase(session->info.get_client())) {
dout(20) << " removing client from reconnect set" << dendl;
if (client_reconnect_gather.empty()) {
dout(7) << " client " << session->info.inst << " was last reconnect, finishing" << dendl;
reconnect_gather_finish();
}
}
if (client_reclaim_gather.erase(session->info.get_client())) {
dout(20) << " removing client from reclaim set" << dendl;
if (client_reclaim_gather.empty()) {
dout(7) << " client " << session->info.inst << " was last reclaimed, finishing" << dendl;
mds->maybe_clientreplay_done();
}
}
It seems like the client from session map doesn't exist in client_reclaim_gather or client_reconnect_gather. This might be a reason:
Server::_session_logged() is called in C_MDS_session_finish::finish(); C_MDS_session_finish is used in Server::journal_close_session(), which is used in Server::terminate_sessions() and Server::kill_session(). Server::kill_session() is used to evict clients while Server::terminate_sessions() is used in MDCache::shutdown_pass(). So there could be situation where a/some sessions got evicted; since client_reclaim_gather and/or client_reconnect_gather doesn't contain these evicted sessions (because it is only possible to add to these sets in Server::reconnect_tick() and Server::reconnect_clients() respectively but it's not possible at this time because these clients are evicted) we see "p == m.end()" and the assertion in erase() present in src/include/interval_set.h fails. I think we should first check if the sets are empty or they contain these clients in them before trying to erase them.
Updated by Venky Shankar 9 months ago
Dhairya Parmar wrote:
I'm trying to think out loud and this is just a hypothesis:
Server::_session_logged() has this part of code where this crash happened:
[...]
It seems like the client from session map doesn't exist in client_reclaim_gather or client_reconnect_gather. This might be a reason:Server::_session_logged() is called in C_MDS_session_finish::finish(); C_MDS_session_finish is used in Server::journal_close_session(), which is used in Server::terminate_sessions() and Server::kill_session(). Server::kill_session() is used to evict clients while Server::terminate_sessions() is used in MDCache::shutdown_pass(). So there could be situation where a/some sessions got evicted; since client_reclaim_gather and/or client_reconnect_gather doesn't contain these evicted sessions (because it is only possible to add to these sets in Server::reconnect_tick() and Server::reconnect_clients() respectively but it's not possible at this time because these clients are evicted) we see "p == m.end()" and the assertion in erase() present in src/include/interval_set.h fails. I think we should first check if the sets are empty or they contain these clients in them before trying to erase them.
erase() would return `true` if the passed in value (client) was removed from the container. Its fine to pass in anything and erase() should just return true if the element was present and removed, false otherwise.
Updated by Venky Shankar 9 months ago
Dhairya, the interval set operation that's asserting is possibly here:
if (!open) { if (inos_to_purge.size()){ ceph_assert(ls); session->info.prealloc_inos.subtract(inos_to_purge); <-- here ls->purging_inodes.insert(inos_to_purge); if (mds->is_clientreplay() || mds->is_active() || mds->is_stopping()) mdcache->purge_inodes(inos_to_purge, ls); } if (inos_to_free.size()) { ceph_assert(piv); ceph_assert(session->is_closing() || session->is_killing() || session->is_opening()); // re-open closing session session->info.prealloc_inos.subtract(inos_to_free); <-- or, here mds->inotable->apply_release_ids(inos_to_free); ceph_assert(mds->inotable->get_version() == piv); } session->free_prealloc_inos = session->info.prealloc_inos; session->delegated_inos.clear(); }
Updated by Venky Shankar 9 months ago
Which means, the [start, len] in `inos_to_free` and/or `inos_to_purge` are not present in prealloc_inos for the client session(map).