Bug #61009

crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start)

Added by Telemetry Bot 11 months ago. Updated about 1 month ago.

Status:
Fix Under Review
Priority:
Normal
Assignee:
Category:
Correctness/Safety
Target version:
% Done:

100%

Source:
Telemetry
Tags:
Backport:
reef,quincy,pacific
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):

4e206621a98d608785fc55ccbeed72a8e30bdec5abd3aba2539532c0775833cb
f49c2f445b22540a1656aca2f64257fee623114cc65ade3553fda74a5bdbe465
f9026600d57dd2f850c5c5a65c1d1e1f6c55e33d29992655feac5a68a6c0bc5d


Description

http://telemetry.front.sepia.ceph.com:4000/d/jByk5HaMz/crash-spec-x-ray?orgId=1&var-sig_v2=0ea734a1639fc4740189dcd09016bfb0c228812033468b2d3ddcbf087ed7ca0e

Assert condition: p->first <= start
Assert function: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]

Sanitized backtrace:

    interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)
    EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)
    EUpdate::replay(MDSRank*)
    MDLog::_replay_thread()
    MDLog::ReplayThread::entry()

Crash dump sample:
{
    "assert_condition": "p->first <= start",
    "assert_file": "include/interval_set.h",
    "assert_func": "void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]",
    "assert_line": 568,
    "assert_msg": "include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7fad56dac700 time 2023-04-26T14:34:47.285484+0000\ninclude/interval_set.h: 568: FAILED ceph_assert(p->first <= start)",
    "assert_thread_name": "md_log_replay",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12cf0) [0x7fad661eccf0]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7fad671f2499]",
        "/usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fad671f2605]",
        "(interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x562d23e24ce5]",
        "(EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x59cd) [0x562d2412740d]",
        "(EUpdate::replay(MDSRank*)+0x40) [0x562d241285a0]",
        "(MDLog::_replay_thread()+0x9b3) [0x562d240ae443]",
        "(MDLog::ReplayThread::entry()+0x11) [0x562d23d6ae31]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7fad661e21ca]",
        "clone()" 
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-04-26T14:34:47.292303Z_8606677b-4c0a-42e6-ac20-3d0a882cdbbf",
    "entity_name": "mds.488564f788c5cab7df94f7533a23bb23d498d7c8",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mds",
    "stack_sig": "f9026600d57dd2f850c5c5a65c1d1e1f6c55e33d29992655feac5a68a6c0bc5d",
    "timestamp": "2023-04-26T14:34:47.292303Z",
    "utsname_machine": "x86_64",
    "utsname_release": "6.2.9-200.fc37.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Mar 30 22:31:57 UTC 2023" 
}


Subtasks

Bug #54943: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) - Duplicate

Bug #62356: mds: src/include/interval_set.h: 538: FAILED ceph_assert(p->first <= start) - Duplicate (Xiubo Li)


Related issues

Related to CephFS - Bug #60241: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) - New
Related to CephFS - Bug #60629: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p != m.end()) - New
Related to CephFS - Bug #58489: mds stuck in 'up:replay' and crashed - Resolved
Related to CephFS - Bug #63103: mds: disable delegating inode ranges to clients - Rejected

History

#1 Updated by Telemetry Bot 11 months ago

  • Crash signature (v1) updated (diff)
  • Crash signature (v2) updated (diff)
  • Affected Versions v17.2.5, v17.2.6 added

#2 Updated by Milind Changire 10 months ago

  • Related to Bug #60241: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) added

#3 Updated by Milind Changire 10 months ago

  • Related to Bug #54943: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) added

#4 Updated by Milind Changire 9 months ago

  • Related to Bug #60629: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p != m.end()) added

#5 Updated by Milind Changire 9 months ago

  • Assignee set to Dhairya Parmar
  • Crash signature (v1) updated (diff)

#6 Updated by Xiubo Li 8 months ago

It aborted at the erase() call on line 1623 below. session->take_ino() may return 0 if the used_preallocated_ino doesn't exist in the session's preallocated set, and erasing ino 0 from the interval_set then trips the assert.

...
1604   if (sessionmapv) {          
1605     if (mds->sessionmap.get_version() >= sessionmapv ||
1606         unlikely(type == EVENT_UPDATE && skip_replaying_inotable)) {
1607       dout(10) << "EMetaBlob.replay sessionmap v " << sessionmapv
1608                << " <= table " << mds->sessionmap.get_version() << dendl;
1609       if (used_preallocated_ino)
1610         mds->mdcache->insert_taken_inos(used_preallocated_ino);
1611     } else {
1612       dout(10) << "EMetaBlob.replay sessionmap v " << sessionmapv
1613                << ", table " << mds->sessionmap.get_version()
1614                << " prealloc " << preallocated_inos
1615                << " used " << used_preallocated_ino
1616                << dendl;
1617       Session *session = mds->sessionmap.get_session(client_name);
1618       if (session) {
1619         dout(20) << " (session prealloc " << session->info.prealloc_inos << ")" << dendl;
1620         if (used_preallocated_ino) {        
1621           if (!session->info.prealloc_inos.empty()) {
1622             inodeno_t ino = session->take_ino(used_preallocated_ino);
1623             session->info.prealloc_inos.erase(ino);
1624             ceph_assert(ino == used_preallocated_ino);                                                                                                                                                                            
1625           }
1626           mds->sessionmap.replay_dirty_session(session);
1627         }
1628         if (!preallocated_inos.empty()) {   
1629           session->free_prealloc_inos.insert(preallocated_inos);
1630           session->info.prealloc_inos.insert(preallocated_inos);
1631           mds->sessionmap.replay_dirty_session(session);
1632         }
1633       } else {
...

#7 Updated by Xiubo Li 8 months ago

This should be a similar issue to https://tracker.ceph.com/issues/58489: the openc/mknod/mkdir/symlink events were flushed, but the sessionmap event was left behind in the active MDS's cache, or dropped by the active MDS when it crashed.

So maybe we should just respawn the MDS automatically here instead of crashing it, like:

    -6> 2023-07-26T23:20:59.242+0000 7f65fc15a700  0 mds.71990086.journaler.mdlog(ro) _finish_read got error -2
    -5> 2023-07-26T23:20:59.242+0000 7f65fc15a700  0 mds.71990086.journaler.mdlog(ro) _finish_read got error -2
    -4> 2023-07-26T23:20:59.242+0000 7f65fc15a700  0 mds.71990086.journaler.mdlog(ro) _finish_read got error -2
    -3> 2023-07-26T23:20:59.242+0000 7f65fc15a700  0 mds.71990086.journaler.mdlog(ro) _finish_read got error -2
    -2> 2023-07-26T23:20:59.242+0000 7f65fa957700  0 mds.0.log _replay journaler got error -2, aborting
    -1> 2023-07-26T23:20:59.242+0000 7f65fa957700  0 mds.0.0 boot_start encountered an error CEPHFS_EAGAIN, respawning since we fell behind journal
     0> 2023-07-26T23:20:59.242+0000 7f65fa957700  1 mds.ocs-storagecluster-cephfilesystem-b respawn!

#8 Updated by Venky Shankar 8 months ago

Xiubo Li wrote:

This should be a similar issue with https://tracker.ceph.com/issues/58489. Just the openc/mknod/mkdir/symblink events flushed but leaving the sessionmap event still in the active mds cache or dropped by the active MDS if it crashed.

So here maybe we should just respawn the mds automatically instead of crashing it, just like:

[...]

I don't exactly recall the series of events that could cause this. Reading up on 58489 again.

#9 Updated by Dhairya Parmar 8 months ago

Xiubo Li wrote:

It aborted in Line#1623. The session->take_ino() may return 0 if the used_preallocated_ino doesn't exist. Then when erasing the ino# 0 it will abort in the interval_set.

[...]

If it tries to remove ino 0 from the interval set, there are a few conditions that should make sure it doesn't abort. In erase() we have auto p = find_inc_m(start), which has this code:

  auto find_inc_m(T start) {
    auto p = m.lower_bound(start);
    if (p != m.begin() &&
        (p == m.end() || p->first > start)) {
      --p;   // might overlap?
      if (p->first + p->second <= start)
        ++p; // it doesn't.
    }
    return p;
  }

This makes sure we never end up with p->first > start, so the assertion ceph_assert(p->first <= start) should not have failed. It makes me wonder if we are getting some garbage value that overflows and leads to the assertion failure?
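For reference, here is a minimal standalone sketch of the same find_inc_m()/erase() selection path (a simplification for illustration, not the include/interval_set.h code itself). It assumes the set holds a single prealloc-style range far above 0, as when take_ino() returns 0; in that case p == m.begin(), the decrement branch is skipped, and the assert can still fire:

  // Standalone C++ analogue of how interval_set<T>::erase() picks its interval.
  #include <cassert>
  #include <cstdint>
  #include <map>

  int main() {
    std::map<uint64_t, uint64_t> m;     // start -> length, disjoint intervals
    m[0x10000000000ull] = 0x100;        // a prealloc-style range, nothing near 0

    uint64_t start = 0;                 // e.g. take_ino() returned 0
    auto p = m.lower_bound(start);      // find_inc_m(start)
    if (p != m.begin() && (p == m.end() || p->first > start)) {
      --p;                              // might overlap?
      if (p->first + p->second <= start)
        ++p;                            // it doesn't
    }
    // p == m.begin() here, so the branch above never decremented and
    // p->first (0x10000000000) is still greater than start (0).
    assert(p->first <= start);          // the ceph_assert that fires in the crash
    return 0;
  }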

#10 Updated by Dhairya Parmar 8 months ago

I have a concern about this code: if p->first is still greater than start even after decrementing, doesn't that mean a single decrement might not be enough? i.e. should we keep decrementing until we are sure it is no greater than start?

  auto find_inc_m(T start) {
    auto p = m.lower_bound(start);
    if (p != m.begin() &&
        (p == m.end() || p->first > start)) {
      --p;   // might overlap?
      if (p->first + p->second <= start)
        ++p; // it doesn't.
    }
    return p;
  }

#11 Updated by Venky Shankar 7 months ago

Xiubo Li wrote:

This should be a similar issue with https://tracker.ceph.com/issues/58489. Just the openc/mknod/mkdir/symblink events flushed but leaving the sessionmap event still in the active mds cache or dropped by the active MDS if it crashed.

So here maybe we should just respawn the mds automatically instead of crashing it, just like:

[...]

Xiubo, were you able to verify if this issue is similar to tracker #58489?

#12 Updated by Dhairya Parmar 7 months ago

Venky Shankar wrote:

Xiubo Li wrote:

This should be a similar issue with https://tracker.ceph.com/issues/58489. Just the openc/mknod/mkdir/symblink events flushed but leaving the sessionmap event still in the active mds cache or dropped by the active MDS if it crashed.

So here maybe we should just respawn the mds automatically instead of crashing it, just like:

[...]

Xiubo, were you able to verify if this issue is similar to tracker #58489?

Venky, the issue seems to be a partial flush of log entries: a standby-replay MDS tries to replay the logs and fails at the line Xiubo mentioned in https://tracker.ceph.com/issues/61009#note-6. It ends up comparing map elements against ino 0 and fails because the selected element's key (see https://tracker.ceph.com/issues/61009#note-10) is still greater than 0. This needs to be fixed at multiple points:

1) In journal.cc, first check whether the used_preallocated_ino exists in session->info.prealloc_inos and only then erase it (see the sketch below).
2) Make sure to flush the log entries before respawning the MDS.
3) Don't decrement just once in interval_set's erase(); instead iterate until p->first <= start. Raised an RFC PR for this: https://github.com/ceph/ceph/pull/53063
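A minimal sketch of point (1), written against the EMetaBlob::replay() snippet quoted in note #6 (a hypothetical guard, not a merged Ceph change; it assumes interval_set::contains() as declared in include/interval_set.h):

  // Hypothetical guard: only erase the ino when the session's prealloc set covers it.
  if (used_preallocated_ino) {
    if (!session->info.prealloc_inos.empty() &&
        session->info.prealloc_inos.contains(used_preallocated_ino)) {
      inodeno_t ino = session->take_ino(used_preallocated_ino);
      session->info.prealloc_inos.erase(ino);
      ceph_assert(ino == used_preallocated_ino);
    } else {
      dout(10) << " used_preallocated_ino " << used_preallocated_ino
               << " not covered by session prealloc_inos, skipping erase" << dendl;
    }
    mds->sessionmap.replay_dirty_session(session);
  }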

#13 Updated by Greg Farnum 7 months ago

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

#14 Updated by Xiubo Li 7 months ago

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again; for example, consider this case:

1. a client mkdir request comes in
2. handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#s, save them in mdr, and return one ino# from them.
3. journal_and_reply(): this will submit an EUpdate(mdlog, "mkdir") log entry to the MDLog.
4. after the above journal log entry is flushed to the Rados pool, it will send a safe reply to the client
5. for the safe reply, respond_to_request() --> reply_client_request() --> apply_allocated_inos() will insert the preallocated ino#s from Step 2 into the corresponding session and then mark the session as dirty:

 3559 void Server::apply_allocated_inos(MDRequestRef& mdr, Session *session)
 3560 {
 3561   dout(10) << "apply_allocated_inos " << mdr->alloc_ino
 3562            << " / " << mdr->prealloc_inos      
 3563            << " / " << mdr->used_prealloc_ino << dendl; 
 3564  
 3565   if (mdr->alloc_ino) {
 3566     mds->inotable->apply_alloc_id(mdr->alloc_ino);
 3567   }
 3568   if (mdr->prealloc_inos.size()) {
 3569     ceph_assert(session);
 3570     session->pending_prealloc_inos.subtract(mdr->prealloc_inos);
 3571     session->free_prealloc_inos.insert(mdr->prealloc_inos);
 3572     session->info.prealloc_inos.insert(mdr->prealloc_inos);
 3573     mds->sessionmap.mark_dirty(session, !mdr->used_prealloc_ino);
 3574     mds->inotable->apply_alloc_ids(mdr->prealloc_inos);
 3575   }
 3576   if (mdr->used_prealloc_ino) {
 3577     ceph_assert(session);
 3578     session->info.prealloc_inos.erase(mdr->used_prealloc_ino);
 3579     mds->sessionmap.mark_dirty(session);
 3580   }
 3581 }                            

6. by now the journal logs in the Rados pool include the ino# for the newly created directory.
7. later, when trimming/expiring the MDLog segments, the MDS will save() the dirty sessions to the pool, including all the preallocated ino#s.

What if the MDS crashes just before Step 7, fails over to the standby-replay MDS, and at the same time a new standby-replay daemon starts?

Maybe there are other cases like the above?

#15 Updated by Dhairya Parmar 7 months ago

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

If this happens, since the dirty sessions were not saved to the pool, does that mean they are now lost? If there is only a single active MDS, is it no longer recoverable? I hope I'm wrong, because this sounds quite scary if true. If this hypothesis of mine is wrong, then in the case Xiubo explained, how does our code make sure they are not lost? The dirty sessions must be held in the MDS cache, right, and if it's gone/restarted then the cache is gone too.

Maybe there are other cases like above ?

#16 Updated by Greg Farnum 7 months ago

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

Do you mean we can trim the journal before the session is updated on disk? That seems like we just need to make the session map flush block journal trimming.

But I see at https://github.com/ceph/ceph/blob/main/src/mds/journal.cc#L279, we are indeed blocking the journal trim on the sessionmap flush. So the issue is just that we let clients use preallocated inodes before we flush them out to disk. We need to rethink this — either by flushing the sessionmap before clients can access new preallocated inodes, or by changing replay so that it can handle this case.

#17 Updated by Venky Shankar 7 months ago

(I got to this rather late - so excuse me for any discussion points that were already resolved.)

Dhairya Parmar wrote:

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

if this happens, since the dirty sessions are not saved to the pool, it implies they are now lost?

The required info is persisted in the mds journal, so they aren't lost.

If there is only a single active MDS then it is no more recoverable?

If someone is running with a single active MDS, they chose to shoot themselves in the foot. This isn't particularly related to this crash.

#18 Updated by Venky Shankar 7 months ago

Greg Farnum wrote:

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

Do you mean we can trim the journal before the session is updated on disk? That seems like we just need to make the session map flush block journal trimming.

But I see at https://github.com/ceph/ceph/blob/main/src/mds/journal.cc#L279, we are indeed blocking the journal trim on the sessionmap flush. So the issue is just that we let clients use preallocated inodes before we flush them out to disk. We need to rethink this — either by flushing the sessionmap before clients can access new preallocated inodes, or by changing replay so that it can handle this case.

That's indeed problematic. Would hitting this issue even require an active MDS crash? The standby-replay daemon following the active MDS journal would see the inode# before it loads the prealloc_inos interval set.

#19 Updated by Venky Shankar 7 months ago

Greg Farnum wrote:

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

Do you mean we can trim the journal before the session is updated on disk? That seems like we just need to make the session map flush block journal trimming.

But I see at https://github.com/ceph/ceph/blob/main/src/mds/journal.cc#L279, we are indeed blocking the journal trim on the sessionmap flush. So the issue is just that we let clients use preallocated inodes before we flush them out to disk. We need to rethink this — either by flushing the sessionmap before clients can access new preallocated inodes, or by changing replay so that it can handle this case.

I can't think of how this can be fixed up in replay apart from adjusting the session's prealloc_inos set on the fly (the EUpdate has client_name to look up in the sessionmap) and then, when an ESession event is later read from the journal, fixing up the prealloc_inos set again?
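A rough sketch of that on-the-fly adjustment (an illustration of the idea only, not a merged change; it assumes interval_set::contains() and insert(start, len) as declared in include/interval_set.h):

  // Illustration only: tolerate a used_preallocated_ino that the replayed
  // sessionmap does not yet cover, instead of asserting in interval_set::erase().
  if (used_preallocated_ino && session) {
    if (!session->info.prealloc_inos.contains(used_preallocated_ino)) {
      // The journal is ahead of the persisted sessionmap; add the single ino
      // so the erase below stays balanced. A later ESession replay would then
      // reconcile the full set again.
      session->info.prealloc_inos.insert(used_preallocated_ino, 1);
    }
    session->info.prealloc_inos.erase(used_preallocated_ino);
    mds->sessionmap.replay_dirty_session(session);
  }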

#20 Updated by Dhairya Parmar 7 months ago

Greg Farnum wrote:

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

Do you mean we can trim the journal before the session is updated on disk? That seems like we just need to make the session map flush block journal trimming.

But I see at https://github.com/ceph/ceph/blob/main/src/mds/journal.cc#L279, we are indeed blocking the journal trim on the sessionmap flush. So the issue is just that we let clients use preallocated inodes before we flush them out to disk. We need to rethink this — either by flushing the sessionmap before clients can access new preallocated inodes, or by changing replay so that it can handle this case.

I read this comment in SessionMap::replay_open_sessions():

  // Server::finish_force_open_sessions() marks sessions dirty one by one.
  // Marking a session dirty may flush all existing dirty sessions. So it's
  // possible that some sessions are already saved in sessionmap.

So if marking a session dirty can flush all the existing dirty sessions, then basically there shouldn't be any issue, right? Or could it be that the MDS crashed while it was flushing?

#21 Updated by Venky Shankar 7 months ago

Dhairya Parmar wrote:

Greg Farnum wrote:

Xiubo Li wrote:

Greg Farnum wrote:

I was talking to Dhairya about this today and am not quite sure I understand.

Xiubo, Venky, are we contending the issue is that the mds journals something like:
"adding new preallocated inode(s including) x to session"
"using inode x"

and then later trims the event adding x to the preallocated list, without trimming the use of x? And this causes trouble?

If that's the case, it seems like we messed up journaling when adding preallocated inodes to the protocol — we need to add some kind of tracking structure that we keep persistent.

I went through the code again, for example in case:

1, a client mkdir request comes
2, handle_client_mkdir() --> prepare_new_inode(): it will preallocate some ino#*s and save it in *mdr and return one ino# from them.
3, journal_and_reply(): this will submit a EUpdate(mdlog, "mkdir"); log entry to the MDLog.
4, after the above journal log entry is flushed to Rados pool, it will send a safe reply to client
5, for safe reply in respond_to_request() --> reply_client_request() --> apply_allocated_inos(): it will insert the preallocated ino#*s from Step2 to the corresponding *session, and then mark the session as dirty:

[...]

6, till now the journal logs in the Rados pool have include the ino# for the new made dirctory.
7, later when trim/expiring the MDLog segments the MDS will save() the dirty sessions to the pool, which including all the preallocated *ino#*s

What if just before the Step7 the MDS crashed and then failover to the standy-reply MDS and then at the same time starts a new standby-replay daemon ?

Do you mean we can trim the journal before the session is updated on disk? That seems like we just need to make the session map flush block journal trimming.

But I see at https://github.com/ceph/ceph/blob/main/src/mds/journal.cc#L279, we are indeed blocking the journal trim on the sessionmap flush. So the issue is just that we let clients use preallocated inodes before we flush them out to disk. We need to rethink this — either by flushing the sessionmap before clients can access new preallocated inodes, or by changing replay so that it can handle this case.

i read this comment in void SessionMap::replay_open_sessions()
[...]

so if marking a session dirty can flush all the existing dirty sessions then basically there shouldn't been any issue right? or there could be a possibility that while it was flushing the mds crashed?

The comment says it may flush all existing dirty sessions. If you see SessionMap::mark_dirty():

void SessionMap::_mark_dirty(Session *s, bool may_save)
{
  if (dirty_sessions.count(s->info.inst.name))
    return;

  if (may_save &&
      dirty_sessions.size() >= g_conf()->mds_sessionmap_keys_per_op) {
    // Pre-empt the usual save() call from journal segment trim, in                                                                                                                                                                                
    // order to avoid building up an oversized OMAP update operation                                                                                                                                                                               
    // from too many sessions modified at once                                                                                                                                                                                                     
    save(new C_MDSInternalNoop, version);
  }

  null_sessions.erase(s->info.inst.name);
  dirty_sessions.insert(s->info.inst.name);
}

void SessionMap::mark_dirty(Session *s, bool may_save)
{
  dout(20) << __func__ << " s=" << s << " name=" << s->info.inst.name
    << " v=" << version << dendl;

  _mark_dirty(s, may_save);
  version++;
  s->pop_pv(version);
}

It doesn't always flush dirty sessions when marked.

#22 Updated by Venky Shankar 7 months ago

  • Related to Bug #58489: mds stuck in 'up:replay' and crashed. added

#23 Updated by Venky Shankar 7 months ago

Xiubo/Dhairya, if the MDS can ensure that the sessionmap is persisted before the client starts using the preallocated inode, then the changes in https://github.com/ceph/ceph/pull/49970 aren't needed. Would you agree?

#24 Updated by Dhairya Parmar 7 months ago

Venky Shankar wrote:

Xiubo/Dhairya, if the MDS can ensure that the sessionmap is persisted before the client starts using the prellocated inode, then the changes in https://github.com/ceph/ceph/pull/49970 aren't needed. Would you agree?

I think yes; if the sessionmap is persisted then we might not need it.

#25 Updated by Venky Shankar 6 months ago

  • Category set to Correctness/Safety
  • Status changed from New to In Progress
  • Assignee changed from Dhairya Parmar to Venky Shankar
  • Target version set to v19.0.0
  • Backport set to reef,quincy,pacific
  • Severity changed from 3 - minor to 2 - major
  • Component(FS) MDS added

Dhairya, I'm taking this one.

#26 Updated by Venky Shankar 6 months ago

Just FYI - a workaround for this crash is to disable async dirops in the (kernel) client (i.e. mount with `wsync` rather than `nowsync`).

#27 Updated by Venky Shankar 6 months ago

Thinking more about this: the MDS could delay sending the prealloc inode range to the client until the session metadata (info.prealloc_inos) is persisted.
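A sketch of that ordering (hypothetical helper and function names, not the actual change in PR #53752): persist the sessionmap that records info.prealloc_inos before the reply that exposes the preallocated range to the client goes out.

  // Hypothetical Server helper illustrating the ordering only.
  void hand_out_prealloc_range(Session *session, MDRequestRef &mdr)
  {
    session->info.prealloc_inos.insert(mdr->prealloc_inos);
    mds->sessionmap.mark_dirty(session, true);

    // persist_sessionmap_then() stands in for SessionMap::save() plus a
    // completion callback; it is not a real Ceph API.
    persist_sessionmap_then([this, mdr]() {
      // Only now does the client learn about the new range, so replay can
      // never see a "used" ino that the persisted sessionmap does not cover.
      send_safe_reply(mdr);   // hypothetical stand-in for the normal reply path
    });
  }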

Question to Xiubo (copied from an internal channel):
Preallocated inode ranges have been handed out by the MDS to clients starting with Octopus; however, the MDS has been preallocating inodes since well before Octopus (I'm looking at the jewel branch and the prealloc code is there in the MDS, and it goes way back). So, if you look at prepare_new_inode() in the jewel branch, it's called with an inode number that the client could send. Do you know if the client was sending that in the request head, and did it expect a different ino# to be sent back by the MDS?

#28 Updated by Venky Shankar 6 months ago

Venky Shankar wrote:

Thinking more about this: the mds could delay sending the prealloc inode range to the client till the session metadata (info.prealloc_inos) is persisted.

Question to Xiubo (copied from an internal channel):
Preallocated inode ranges were handed out by the mds to clients starting octopus, however, the mds has been preallocating inodes way before octopus (I'm seeing the jewel branch and the prealloc stuff is there in the MDS and it goes waaaay back). So, if you see prepare_new_inode() in jewel branch, its called with an inode number that the client could send. Do you know if the client was sending that in request head and did it expect a different ino# to be sent back by the MDS?

nvm - the ino field in the request head is used by client when replaying ops.

#29 Updated by Venky Shankar 6 months ago

Venky Shankar wrote:

Thinking more about this: the mds could delay sending the prealloc inode range to the client till the session metadata (info.prealloc_inos) is persisted.

I'm testing a fix today and if things go well, I'll submit a change for review.

#30 Updated by Venky Shankar 6 months ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 53752

#31 Updated by Venky Shankar 6 months ago

  • Related to Bug #63103: mds: disable delegating inode ranges to clients added

#32 Updated by Igor Fedotov about 1 month ago

@here - curious whether https://tracker.ceph.com/issues/60241 is actually related to this ticket. The former has a completely different backtrace and the only matching thing is the assertion in interval_set...
