Project

General

Profile

Bug #60629

crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p != m.end())

Added by Telemetry Bot 10 months ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Telemetry
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):

afc0135c3d4fe6a3ed1291d96d080d074a20fd2d0d0d6fa1c656dd165895de8b
c4ba364febcb504210a580b8ea77a2c6038c6e20d116ccf592f71543842437fd


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=978159d0d2675e074cadedff0e9abe7fe7254aa08b895f82334c012c99d4b838

Assert condition: p != m.end()
Assert function: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]

Sanitized backtrace:

    interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)
    Server::_session_logged(Session*, unsigned long, bool, unsigned long, interval_set<inodeno_t, std::map> const&, unsigned long, interval_set<inodeno_t, std::map> const&, LogSegment*)
    C_MDS_session_finish::finish(int)
    MDSContext::complete(int)
    MDSIOContextBase::complete(int)
    MDSLogContextBase::complete(int)
    Finisher::finisher_thread_entry()

Crash dump sample:
{
    "archived": "2023-04-13 09:43:58.669151",
    "assert_condition": "p != m.end()",
    "assert_file": "include/interval_set.h",
    "assert_func": "void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]",
    "assert_line": 567,
    "assert_msg": "include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f642c5d1700 time 2023-04-12T23:14:02.750565+0100\ninclude/interval_set.h: 567: FAILED ceph_assert(p != m.end())",
    "assert_thread_name": "MR_Finisher",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7f643a003cf0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f643b00b4e3]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x26a64f) [0x7f643b00b64f]",
        "(interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2a9) [0x55abfb247b29]",
        "(Server::_session_logged(Session*, unsigned long, bool, unsigned long, interval_set<inodeno_t, std::map> const&, unsigned long, interval_set<inodeno_t, std::map> const&, LogSegment*)+0xf9) [0x55abfb1ece69]",
        "(C_MDS_session_finish::finish(int)+0x41) [0x55abfb25f711]",
        "(MDSContext::complete(int)+0x5f) [0x55abfb4b8fbf]",
        "(MDSIOContextBase::complete(int)+0x534) [0x55abfb4b9754]",
        "(MDSLogContextBase::complete(int)+0x5b) [0x55abfb4b9b1b]",
        "(Finisher::finisher_thread_entry()+0x18d) [0x7f643b0aa56d]",
        "/lib64/libpthread.so.0(+0x81cf) [0x7f6439ff91cf]",
        "clone()" 
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-04-12T22:14:02.752804Z_8b75c750-8b43-4c91-b120-446f93f96562",
    "entity_name": "mds.3177ab8603c724fe708f15f8148ba595387a8e81",
    "os_id": "almalinux",
    "os_name": "AlmaLinux",
    "os_version": "8.7 (Stone Smilodon)",
    "os_version_id": "8.7",
    "process_name": "ceph-mds",
    "stack_sig": "c4ba364febcb504210a580b8ea77a2c6038c6e20d116ccf592f71543842437fd",
    "timestamp": "2023-04-12T22:14:02.752804Z",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-425.10.1.el8_7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Jan 12 10:05:00 EST 2023" 
}


Related issues

Related to CephFS - Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) Fix Under Review

History

#1 Updated by Telemetry Bot 10 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.5, v17.2.6 added

#2 Updated by Milind Changire 9 months ago

  • Related to Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) added

#3 Updated by Dhairya Parmar 8 months ago

  • Crash signature (v1) updated (diff)

I'm trying to think out loud and this is just a hypothesis:

Server::_session_logged() has this part of code where this crash happened:

    if (client_reconnect_gather.erase(session->info.get_client())) {
      dout(20) << " removing client from reconnect set" << dendl;
      if (client_reconnect_gather.empty()) {
        dout(7) << " client " << session->info.inst << " was last reconnect, finishing" << dendl;
        reconnect_gather_finish();
      }
    }
    if (client_reclaim_gather.erase(session->info.get_client())) {
      dout(20) << " removing client from reclaim set" << dendl;
      if (client_reclaim_gather.empty()) {
        dout(7) << " client " << session->info.inst << " was last reclaimed, finishing" << dendl;
    mds->maybe_clientreplay_done();
      }
    }

It seems like the client from session map doesn't exist in client_reclaim_gather or client_reconnect_gather. This might be a reason:

Server::_session_logged() is called in C_MDS_session_finish::finish(); C_MDS_session_finish is used in Server::journal_close_session(), which is used in Server::terminate_sessions() and Server::kill_session(). Server::kill_session() is used to evict clients while Server::terminate_sessions() is used in MDCache::shutdown_pass(). So there could be situation where a/some sessions got evicted; since client_reclaim_gather and/or client_reconnect_gather doesn't contain these evicted sessions (because it is only possible to add to these sets in Server::reconnect_tick() and Server::reconnect_clients() respectively but it's not possible at this time because these clients are evicted) we see "p == m.end()" and the assertion in erase() present in src/include/interval_set.h fails. I think we should first check if the sets are empty or they contain these clients in them before trying to erase them.

#4 Updated by Venky Shankar 8 months ago

Dhairya Parmar wrote:

I'm trying to think out loud and this is just a hypothesis:

Server::_session_logged() has this part of code where this crash happened:

[...]
It seems like the client from session map doesn't exist in client_reclaim_gather or client_reconnect_gather. This might be a reason:

Server::_session_logged() is called in C_MDS_session_finish::finish(); C_MDS_session_finish is used in Server::journal_close_session(), which is used in Server::terminate_sessions() and Server::kill_session(). Server::kill_session() is used to evict clients while Server::terminate_sessions() is used in MDCache::shutdown_pass(). So there could be situation where a/some sessions got evicted; since client_reclaim_gather and/or client_reconnect_gather doesn't contain these evicted sessions (because it is only possible to add to these sets in Server::reconnect_tick() and Server::reconnect_clients() respectively but it's not possible at this time because these clients are evicted) we see "p == m.end()" and the assertion in erase() present in src/include/interval_set.h fails. I think we should first check if the sets are empty or they contain these clients in them before trying to erase them.

erase() would return `true` if the passed in value (client) was removed from the container. Its fine to pass in anything and erase() should just return true if the element was present and removed, false otherwise.

#5 Updated by Venky Shankar 8 months ago

Dhairya, the interval set operation that's asserting is possibly here:

  if (!open) {
    if (inos_to_purge.size()){
      ceph_assert(ls);
      session->info.prealloc_inos.subtract(inos_to_purge);  <-- here
      ls->purging_inodes.insert(inos_to_purge);
      if (mds->is_clientreplay() || mds->is_active() || mds->is_stopping())
    mdcache->purge_inodes(inos_to_purge, ls);
    }

    if (inos_to_free.size()) {
      ceph_assert(piv);
      ceph_assert(session->is_closing() || session->is_killing() ||
          session->is_opening()); // re-open closing session                                                                                                                                                                                       
      session->info.prealloc_inos.subtract(inos_to_free);  <-- or, here
      mds->inotable->apply_release_ids(inos_to_free);
      ceph_assert(mds->inotable->get_version() == piv);
    }
    session->free_prealloc_inos = session->info.prealloc_inos;
    session->delegated_inos.clear();
  }

#6 Updated by Venky Shankar 8 months ago

Which means, the [start, len] in `inos_to_free` and/or `inos_to_purge` are not present in prealloc_inos for the client session(map).

Also available in: Atom PDF