Bug #58489

closed

mds stuck in 'up:replay' and crashed.

Added by Kotresh Hiremath Ravishankar over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Xiubo Li
Category:
Correctness/Safety
Target version:
v18.0.0
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
reef,pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
49970
Crash signature (v1):
Crash signature (v2):

Description

The issue was reported by an upstream community user.

The cluster had two filesystems and the active mds of both filesystems was stuck in 'up:replay'.
This was the case for around 2 days. Later, one of the active mds daemons (stuck in the up:replay state) crashed
with the stack trace below.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

  ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7fccd759943f]
  2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
[0x55fb2b98e89c]
  4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
  5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
  6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
  7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
  8: clone()

The upstream communication can be found at https://www.spinics.net/lists/ceph-users/msg75472.html


Files

mds01.ceph04.logaa.bz2 (879 KB) mds01.ceph04.logaa.bz2 Thomas Widhalm, 01/19/2023 01:15 PM
mds01.ceph04.logab.bz2 (756 KB) mds01.ceph04.logab.bz2 Thomas Widhalm, 01/19/2023 01:15 PM
mds01.ceph06.log.bz2 (681 KB) mds01.ceph06.log.bz2 Thomas Widhalm, 01/19/2023 01:15 PM

Related issues 6 (1 open, 5 closed)

Related to CephFS - Bug #59768: crash: void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*): assert(g_conf()->mds_wipe_sessions) (Duplicate, Neeraj Pratap Singh)
Related to CephFS - Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) (Fix Under Review, Venky Shankar)
Related to CephFS - Bug #63103: mds: disable delegating inode ranges to clients (Rejected, Venky Shankar)
Copied to CephFS - Backport #59006: quincy: mds stuck in 'up:replay' and crashed. (Resolved, Xiubo Li)
Copied to CephFS - Backport #59007: pacific: mds stuck in 'up:replay' and crashed. (Resolved, Xiubo Li)
Copied to CephFS - Backport #59404: reef: mds stuck in 'up:replay' and crashed. (Resolved, Xiubo Li)
Actions #1

Updated by Xiubo Li over 1 year ago

There is one case that could trigger this, IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both of the above entries exist, then it should be okay, because the mds' sessionmap can be updated correctly.

But what if the mds only flushed the first ESessions entry and left the EUpdate one in the MDCache? Then, when replaying the journal logs, the mds' sessionmap couldn't be updated, so we will hit this assertion when replaying the EUpdate entry.
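
A minimal, self-contained sketch of the mismatch described above (illustrative types and names, not the actual Ceph classes): the surviving EUpdate entry carries a sessionmapv one ahead of whatever version the replaying MDS has reconstructed, and without the ESessions entry there is no way to close the gap, which is the situation the failed ceph_assert(g_conf()->mds_wipe_sessions) reports.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct SessionMap {
      uint64_t version = 0;   // a freshly booted standby starts from 0
    };

    struct LogEntry {
      enum Type { ESessions, EUpdate } type;
      uint64_t v;             // ESessions: new sessionmap version
                              // EUpdate:   journalled sessionmapv (version + 1)
    };

    // Replays one entry; returns false when an EUpdate's sessionmapv cannot
    // be reconciled with what replay has reconstructed so far.
    bool replay_entry(SessionMap& sm, const LogEntry& e) {
      if (e.type == LogEntry::ESessions) {
        sm.version = e.v;                 // sessionmap catches up here
        return true;
      }
      if (e.v == sm.version + 1) {        // normal case: apply and advance
        sm.version = e.v;
        return true;
      }
      return false;                       // version gap: the ESessions entry is gone
    }

    int main() {
      SessionMap sm;  // standby MDS after failover

      // Journal as it would look if the ESessions entry (version 41) were no
      // longer present, leaving only the EUpdate with sessionmapv = 42.
      std::vector<LogEntry> journal = { {LogEntry::EUpdate, 42} };

      for (const auto& e : journal) {
        if (!replay_entry(sm, e)) {
          std::cout << "would hit ceph_assert(g_conf()->mds_wipe_sessions): "
                       "journalled sessionmapv " << e.v
                    << " vs replayed version " << sm.version << "\n";
          return 1;
        }
      }
      return 0;
    }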

Actions #2

Updated by Venky Shankar over 1 year ago

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

Actions #3

Updated by Thomas Widhalm over 1 year ago

Hi,

Thanks for the help. Here are the debug logs of my two active MDSs. I thought I'd tar the whole bunch for you to search through. If you want just a few sections, please let me know. I didn't mean to dump everything onto you and not contribute myself. :-)

Actions #4

Updated by Venky Shankar over 1 year ago

  • Assignee set to Xiubo Li
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Labels (FS) crash added
Actions #5

Updated by Venky Shankar over 1 year ago

  • Status changed from New to Triaged
Actions #6

Updated by Thomas Widhalm about 1 year ago

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Actions #7

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code, it won't, and it doesn't need to.

Normally, when restarting an MDS, if all the journal logs are flushed and expired, then this issue does not occur when the MDS starts. But I need to read the code to make sure how the following case could happen:

ESession event in SegmentA
EUpdate("openc") event in a following SegmentB

Case A:

When restarting the MDS normally, will the MDS try to expire both SegmentA and SegmentB before stopping? Or will it just flush them out to the pool and let the replaying MDS expire them? If it's the latter, then if SegmentA was expired but SegmentB was left in the pool, the replaying MDS will hit this issue when it tries to replay the events in SegmentB.

Case B:

If the MDS crashes, leaving SegmentB in the pool.
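
A small sketch contrasting the two cases, under the simplifying assumption that expiry simply drops the oldest segment (all names illustrative): after Case B, the surviving portion of the journal no longer contains any ESessions event, which is exactly the state the replaying MDS cannot handle.

    #include <deque>
    #include <iostream>
    #include <string>

    struct Segment {
      std::string name;
      bool carries_esessions;   // does this segment contain an ESessions event?
    };

    // Case A: a clean shutdown flushes and expires every segment first,
    // so a later replay has nothing left to process.
    void clean_shutdown(std::deque<Segment>& journal) {
      journal.clear();
    }

    // Case B: SegmentA gets expired, then the MDS crashes before SegmentB does.
    void crash_after_expiring_front(std::deque<Segment>& journal) {
      journal.pop_front();
    }

    int main() {
      std::deque<Segment> journal = {
          {"SegmentA (ESessions)", true},
          {"SegmentB (EUpdate \"openc\")", false},
      };

      crash_after_expiring_front(journal);

      // What the replaying MDS is left with after the crash:
      bool saw_esessions = false;
      for (const auto& s : journal) {
        saw_esessions = saw_esessions || s.carries_esessions;
        std::cout << "to replay: " << s.name << "\n";
      }
      if (!saw_esessions)
        std::cout << "no ESessions event left to establish the sessionmap "
                     "version during replay\n";
      return 0;
    }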

Actions #8

Updated by Xiubo Li about 1 year ago

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning was there any MDS crash, or did any MDS daemon shut down abnormally?

Thanks

Actions #9

Updated by Xiubo Li about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Normally when restarting a MDS if all the journal logs are flushed and expired, then when the MDS is starting there is no this issue. But I need to read the code to make sure how the following case could happen:

[...]

Case A:

When restarting the MDS normally will the MDS try to expire both the SegmentA and SegmentB before stopping ? Or just flush them out to pool and then let the replay MDS to expire them ? If it's later, then in case if the SegmentA was expired but leaving the SegmentB in the pool, so if the replay MDS try to replay the events in SegementB we will hit this issue.

I checked the code: when the MDS daemons are stopped normally, the MDS will wait for all the MDLog segments to be flushed and trimmed. So this case shouldn't be an issue.

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether there are other cases that could also cause it.

To fix it, I think we need to add one ESessions event in each segment, just like the ESubtreeMap.
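
A rough sketch of that proposal as stated here (not necessarily the shape of the eventual fix): if every segment starts with an event recording the current sessionmap version, the way each segment already starts with an ESubtreeMap, then a replaying MDS can re-establish the version from whichever segments survive. Names are illustrative.

    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <vector>

    struct Event {
      enum Type { SessionsMarker, Update } type;
      uint64_t v;   // SessionsMarker: current version; Update: expected version + 1
    };

    struct Segment { std::vector<Event> events; };

    // With a marker at the head of each surviving segment, a replaying MDS can
    // always recover the version, no matter which older segments were expired
    // before the crash.
    std::optional<uint64_t> replay(const std::vector<Segment>& journal) {
      std::optional<uint64_t> version;
      for (const auto& seg : journal)
        for (const auto& e : seg.events) {
          if (e.type == Event::SessionsMarker)
            version = e.v;                       // version re-established here
          else if (!version || e.v != *version + 1)
            return std::nullopt;                 // would assert today
          else
            version = e.v;
        }
      return version;
    }

    int main() {
      // Only SegmentB survived the crash, but it begins with a marker.
      std::vector<Segment> journal = {
          {{ {Event::SessionsMarker, 41}, {Event::Update, 42} }},
      };
      if (auto v = replay(journal))
        std::cout << "replay ok, sessionmap version " << *v << "\n";
      else
        std::cout << "version gap, replay cannot proceed\n";
      return 0;
    }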

Actions #10

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Is it because it assumes that "case B" (below) will not be hit?

Normally when restarting a MDS if all the journal logs are flushed and expired, then when the MDS is starting there is no this issue. But I need to read the code to make sure how the following case could happen:

[...]

Case A:

When restarting the MDS normally will the MDS try to expire both the SegmentA and SegmentB before stopping ? Or just flush them out to pool and then let the replay MDS to expire them ? If it's later, then in case if the SegmentA was expired but leaving the SegmentB in the pool, so if the replay MDS try to replay the events in SegementB we will hit this issue.

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

Actions #11

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Normally when restarting a MDS if all the journal logs are flushed and expired, then when the MDS is starting there is no this issue. But I need to read the code to make sure how the following case could happen:

[...]

Case A:

When restarting the MDS normally will the MDS try to expire both the SegmentA and SegmentB before stopping ? Or just flush them out to pool and then let the replay MDS to expire them ? If it's later, then in case if the SegmentA was expired but leaving the SegmentB in the pool, so if the replay MDS try to replay the events in SegementB we will hit this issue.

Checked the code when the MDS daemons are stopped normally it will wait for all the MDLog to be flushed and trimmed. So it shouldn't be a issue for this case.

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Actions #12

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Is it because it assumes that "case B" (below) will not be hit?

IMO we didn't consider this case before.

This must have been hit several times before, and IMO this is also why Zheng added the mds_wipe_sessions option to wipe the sessions.

Actions #13

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume SegmentA was trimmed just before the MDS crashed, so the session map version was updated just before the crash. But when the standby MDS is replaying SegmentB, this new MDS daemon will have its session map version set to 0.

So updating the session map version when expiring SegmentA makes no sense here.

Actions #14

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume the Segment A was trimmed just before the MDS crashing and it will update the session map version just before crashing. But when the standby MDS is replaying the Segment B this new MDS daemon will set the session map version to 0.

Why would the session map version get reset to 0 after an mds failover? Maybe I'm missing something somewhere...

Actions #15

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume the Segment A was trimmed just before the MDS crashing and it will update the session map version just before crashing. But when the standby MDS is replaying the Segment B this new MDS daemon will set the session map version to 0.

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS fails over, the standby MDS needs to get this info from the MDLogs.

If I'm not misreading it, the version is initialized to 0 when an MDS boots, and only if the MDS replays an ESessions event will the version be updated. If there is no ESessions event, the version won't be touched.

Actions #16

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume the Segment A was trimmed just before the MDS crashing and it will update the session map version just before crashing. But when the standby MDS is replaying the Segment B this new MDS daemon will set the session map version to 0.

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events could possibly have such issues. Do we know?

Actions #17

Updated by Xiubo Li about 1 year ago

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only in the following 4 cases do the log events store the sessionmapv, via journal_allocated_inos():

Cscope tag: journal_allocated_inos
   #   line  filename / context / line
   1   4576  mds/Server.cc <<handle_client_openc>>
             journal_allocated_inos(mdr, &le->metablob);
   2   6800  mds/Server.cc <<handle_client_mknod>>
             journal_allocated_inos(mdr, &le->metablob);
   3   6883  mds/Server.cc <<handle_client_mkdir>>
             journal_allocated_inos(mdr, &le->metablob);
   4   6968  mds/Server.cc <<handle_client_symlink>>
             journal_allocated_inos(mdr, &le->metablob);

These are all EUpdate events.
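
The four call sites above are all request paths that allocate inode numbers for a client. A hedged sketch of the writer side they imply (illustrative names, not the real journal_allocated_inos() implementation): the EUpdate's metablob records the projected sessionmap version together with the allocated inos, which is the sessionmapv the replay check later compares against.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct MetaBlob {
      uint64_t sessionmapv = 0;                 // version the replayer must match
      std::vector<uint64_t> allocated_inos;     // inos handed to the client
    };

    struct SessionMapWriter {
      uint64_t version = 41;
      uint64_t next_ino = 0x10000000001ULL;

      // Called from the openc/mknod/mkdir/symlink paths before journaling the
      // EUpdate: bump the (projected) version and record it in the metablob.
      void journal_allocated_inos(MetaBlob& mb, unsigned count) {
        mb.sessionmapv = ++version;             // sessionmapv = version + 1
        for (unsigned i = 0; i < count; ++i)
          mb.allocated_inos.push_back(next_ino++);
      }
    };

    int main() {
      SessionMapWriter sm;
      MetaBlob mb;
      sm.journal_allocated_inos(mb, 1);         // e.g. the handle_client_openc path
      std::cout << "EUpdate metablob: sessionmapv=" << mb.sessionmapv
                << " inos=" << mb.allocated_inos.size() << "\n";
      return 0;
    }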

Actions #18

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing an ESessions event might be the way forward, but I'm a bit reluctant to jump to that. Can't the latest session map version be read off during mds boot?

Actions #19

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing a ESEssion event might be the way forward, but I'm a bit reluctant to jump to that. Cannot the latest session map version be read off during mds boot?

Is there any other place that saves the session map version, other than the MDLog?

From my understanding, I'm afraid there isn't. So when an MDS fails over and the new MDS is booting, the session map version can only be inherited from the last MDS by replaying the MDLogs, or else it is just initialized to 0.

And instead of stashing an ESessions event in each segment, we can improve it by stashing one only when needed.

Let me read the code more carefully to see whether we can avoid stashing it.

Actions #20

Updated by Thomas Widhalm about 1 year ago

Xiubo Li wrote:

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning is there any MDS crash or MDS daemons shutdown abnormally ?

Thanks

The whole problem started when I upgraded my hosts from Fedora 36 to Fedora 37. After the upgrade, there seemed to be a major connection issue between the hosts. I'm all but new to debugging operating system misconfigurations but this problem just escaped me. The result was that all PGs were listed as unavailable. Out of pure desperation I restored 3 of my 7 hosts (the others are hardware, so a restore would be much harder) and the PGs were online again. Now there's a mix of Fedora 36 and Fedora 37. What I didn't see in the first place is that, even when everything looked like it was resyncing, I couldn't access RBD or CephFS. After some more tries and unblocking OSDs, RBD is available again. When I start the MDSs, they are still in status `up:replay`. And all but one (randomly chosen) survivor crash after a few minutes.

Actions #21

Updated by Xiubo Li about 1 year ago

Thomas Widhalm wrote:

Xiubo Li wrote:

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning is there any MDS crash or MDS daemons shutdown abnormally ?

Thanks

The whole problem started when I upgraded my hosts from Fedora 36 to Fedora 37. After the upgrade, there seemed to be a major connection issue between the hosts. I'm all but new to debugging operating system misconfigurations but this problem just escaped me. The result was, that all PGs where listed as unavailable. Out of pure desperation I restored 3 of my 7 hosts (the others are hardware, so a restore would be much harder) and the PGs were online again. Now there's a mix of Fedora 36 and Fedora 37. What I didn't see in the first place is that, even when everything looked like it was resyncing, I couldn't access RBD or CephFS. After some more tries and unblocking OSDs RBD is available again. When I start MDS, they still are in status `up:replaying`. And all but one (randomly chosen) survivor crash after a few minutes.

Okay, there might be some unknown factors here. But the MDS must have been stopped abnormally, or there wouldn't be these MDLogs left to replay.

Actions #22

Updated by Thomas Widhalm about 1 year ago

Xiubo Li wrote:

Thomas Widhalm wrote:

Xiubo Li wrote:

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning is there any MDS crash or MDS daemons shutdown abnormally ?

Thanks

The whole problem started when I upgraded my hosts from Fedora 36 to Fedora 37. After the upgrade, there seemed to be a major connection issue between the hosts. I'm all but new to debugging operating system misconfigurations but this problem just escaped me. The result was, that all PGs where listed as unavailable. Out of pure desperation I restored 3 of my 7 hosts (the others are hardware, so a restore would be much harder) and the PGs were online again. Now there's a mix of Fedora 36 and Fedora 37. What I didn't see in the first place is that, even when everything looked like it was resyncing, I couldn't access RBD or CephFS. After some more tries and unblocking OSDs RBD is available again. When I start MDS, they still are in status `up:replaying`. And all but one (randomly chosen) survivor crash after a few minutes.

Okay, there might be some unknown stories. But the MDS should be stopped abnormally, or there shouldn't be these MDLogs replayed.

That might have happened during the upgrade process. I was cautious and upgraded one after the other. I waited for all services to come up again or at least I gave enough time before I rebooted the systems with `/sbin/reboot`. I can imagine that after the upgrade, MDS were in an abnormal state and while I tried to fix the PGs being unavailable, I missed the MDS having problems.

Actions #23

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing a ESEssion event might be the way forward, but I'm a bit reluctant to jump to that. Cannot the latest session map version be read off during mds boot?

Is there any other place will save the session map version other than the MDLog ?

From my understanding I am afraid it couldn't. So in case when an MDS failover and the new MDS is booting the session map version could be inherited from last MDS by replaying the MDLogs, or just be initialized as 0.

Unless we can store it when persisting the session map.

Actions #24

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing a ESEssion event might be the way forward, but I'm a bit reluctant to jump to that. Cannot the latest session map version be read off during mds boot?

Is there any other place will save the session map version other than the MDLog ?

From my understanding I am afraid it couldn't. So in case when an MDS failover and the new MDS is booting the session map version could be inherited from last MDS by replaying the MDLogs, or just be initialized as 0.

Unless we can storing it when persisting session map.

I found that LogSegment::try_to_expire(MDSRank *mds, MDSGatherBuilder &gather_bld, int op_prio) already persists the sessionmap.

I will read the code more carefully.

Actions #25

Updated by Xiubo Li about 1 year ago

The sessionmap is persisted when expiring any MDLog segment, in LogSegment::try_to_expire() -> sessionmap.save(); that means the sessionmap is persisted only when MDLog segments are expired.

If there aren't enough segments, the expiring won't happen. So it's possible that when an MDS crashes the expiry was never triggered, and then after failover the new MDS may not get the sessionmap info.

The inotable has the same logic, but it will force-replay the inotable version instead of asserting in the MDS daemon.
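
A minimal sketch of the difference pointed out above, assuming "force replay" simply means adopting the journalled version with a warning instead of asserting (illustrative code, not the actual journal.cc logic):

    #include <cassert>
    #include <cstdint>
    #include <iostream>

    // Policy used for the sessionmap today: a version gap is fatal unless the
    // operator has set something like mds_wipe_sessions.
    void replay_sessionmap_style(uint64_t& current, uint64_t journalled,
                                 bool wipe_sessions_conf) {
      if (journalled > current + 1) {
        std::cerr << "sessionmap version gap: " << current
                  << " -> " << journalled << "\n";
        assert(wipe_sessions_conf);   // stands in for ceph_assert(g_conf()->mds_wipe_sessions)
      }
      current = journalled;
    }

    // Policy described for the inotable: warn and force the replayed version
    // forward so replay can continue.
    void replay_inotable_style(uint64_t& current, uint64_t journalled) {
      if (journalled > current + 1)
        std::cerr << "inotable version gap, forcing " << current
                  << " -> " << journalled << "\n";
      current = journalled;
    }

    int main() {
      uint64_t inotable_v = 0;
      replay_inotable_style(inotable_v, 42);        // warns, then continues
      std::cout << "inotable replayed to v" << inotable_v << "\n";

      uint64_t sessionmap_v = 0;
      replay_sessionmap_style(sessionmap_v, 42, /*wipe_sessions_conf=*/false);
      // ^ aborts here when assertions are enabled, mirroring the crash above
      return 0;
    }

Running the sketch prints the inotable-style warning and then aborts on the sessionmap-style check, which mirrors the crash in this report.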

Actions #26

Updated by Xiubo Li about 1 year ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 49970
Actions #27

Updated by Venky Shankar about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #28

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59006: quincy: mds stuck in 'up:replay' and crashed. added
Actions #29

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59007: pacific: mds stuck in 'up:replay' and crashed. added
Actions #30

Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
Actions #32

Updated by Venky Shankar about 1 year ago

Laura Flores wrote:

@Venky relevant thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GA77DLSQXCXZVJ4BYQ6KDW4DLU5IFCPG/

Thanks for bringing this to our notice, Laura.

Actions #33

Updated by Xiubo Li about 1 year ago

  • Backport changed from pacific,quincy to reef,pacific,quincy
Actions #34

Updated by Xiubo Li about 1 year ago

  • Copied to Backport #59404: reef: mds stuck in 'up:replay' and crashed. added
Actions #35

Updated by Xiubo Li 10 months ago

  • Status changed from Pending Backport to Resolved
Actions #36

Updated by Venky Shankar 8 months ago

  • Related to Bug #59768: crash: void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*): assert(g_conf()->mds_wipe_sessions) added
Actions #37

Updated by Venky Shankar 8 months ago

  • Related to Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) added
Actions #38

Updated by Venky Shankar 7 months ago

  • Related to Bug #63103: mds: disable delegating inode ranges to clients added