Bug #58489

closed

mds stuck in 'up:replay' and crashed.

Added by Kotresh Hiremath Ravishankar over 1 year ago. Updated 10 months ago.

Status:
Resolved
Priority:
Normal
Assignee:
Xiubo Li
Category:
Correctness/Safety
Target version:
v18.0.0
% Done:

0%

Source:
Community (user)
Tags:
backport_processed
Backport:
reef,pacific,quincy
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
crash
Pull request ID:
49970
Crash signature (v1):
Crash signature (v2):

Description

The issue was reported by an upstream community user.

The cluster had two filesystems and the active mds of both filesystems was stuck in 'up:replay'.
This was the case for around 2 days. Later, one of the active mds daemons (stuck in the up:replay state) crashed
with the stack trace below.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

  ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7fccd759943f]
  2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
  3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
[0x55fb2b98e89c]
  4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
  5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
  6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
  7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
  8: clone()

The upstream communication can be found at https://www.spinics.net/lists/ceph-users/msg75472.html


Files

mds01.ceph04.logaa.bz2 (879 KB) mds01.ceph04.logaa.bz2 Thomas Widhalm, 01/19/2023 01:15 PM
mds01.ceph04.logab.bz2 (756 KB) mds01.ceph04.logab.bz2 Thomas Widhalm, 01/19/2023 01:15 PM
mds01.ceph06.log.bz2 (681 KB) mds01.ceph06.log.bz2 Thomas Widhalm, 01/19/2023 01:15 PM

Related issues 6 (1 open, 5 closed)

Related to CephFS - Bug #59768: crash: void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*): assert(g_conf()->mds_wipe_sessions) (Duplicate, Neeraj Pratap Singh)
Related to CephFS - Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) (Fix Under Review, Venky Shankar)
Related to CephFS - Bug #63103: mds: disable delegating inode ranges to clients (Rejected, Venky Shankar)
Copied to CephFS - Backport #59006: quincy: mds stuck in 'up:replay' and crashed. (Resolved, Xiubo Li)
Copied to CephFS - Backport #59007: pacific: mds stuck in 'up:replay' and crashed. (Resolved, Xiubo Li)
Copied to CephFS - Backport #59404: reef: mds stuck in 'up:replay' and crashed. (Resolved, Xiubo Li)
Actions #1

Updated by Xiubo Li over 1 year ago

There is one case that could trigger this, IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both of the above entries exist, then it should be okay, because the mds' sessionmap can be updated correctly.

But what if the mds only flushed the first ESessions entry and left the EUpdate one in the MDCache? Then, when replaying the journal logs, the mds' sessionmap couldn't be updated, so we will hit this assertion when replaying the EUpdate entry.
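
A minimal, self-contained sketch of the mismatch described above (illustrative types and names, not the actual Ceph classes): the surviving EUpdate entry carries a sessionmapv one ahead of whatever version the replaying MDS has reconstructed, and without the ESessions entry there is no way to close the gap, which is the situation the failed ceph_assert(g_conf()->mds_wipe_sessions) reports.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct SessionMap {
      uint64_t version = 0;   // a freshly booted standby starts from 0
    };

    struct LogEntry {
      enum Type { ESessions, EUpdate } type;
      uint64_t v;             // ESessions: new sessionmap version
                              // EUpdate:   journalled sessionmapv (version + 1)
    };

    // Replays one entry; returns false when an EUpdate's sessionmapv cannot
    // be reconciled with what replay has reconstructed so far.
    bool replay_entry(SessionMap& sm, const LogEntry& e) {
      if (e.type == LogEntry::ESessions) {
        sm.version = e.v;                 // sessionmap catches up here
        return true;
      }
      if (e.v == sm.version + 1) {        // normal case: apply and advance
        sm.version = e.v;
        return true;
      }
      return false;                       // version gap: the ESessions entry is gone
    }

    int main() {
      SessionMap sm;  // standby MDS after failover

      // Journal as it would look if the ESessions entry (version 41) were no
      // longer present, leaving only the EUpdate with sessionmapv = 42.
      std::vector<LogEntry> journal = { {LogEntry::EUpdate, 42} };

      for (const auto& e : journal) {
        if (!replay_entry(sm, e)) {
          std::cout << "would hit ceph_assert(g_conf()->mds_wipe_sessions): "
                       "journalled sessionmapv " << e.v
                    << " vs replayed version " << sm.version << "\n";
          return 1;
        }
      }
      return 0;
    }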

Actions #2

Updated by Venky Shankar over 1 year ago

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

Actions #3

Updated by Thomas Widhalm over 1 year ago

Hi,

Thanks for the help. Here are the debug logs of my two active MDSs. I thought I'd tar the whole bunch for you to search through. If you want just a few sections, please let me know. I didn't mean to dump everything onto you and not contribute myself. :-)

Actions #4

Updated by Venky Shankar over 1 year ago

  • Assignee set to Xiubo Li
  • Target version set to v18.0.0
  • Backport set to pacific,quincy
  • Labels (FS) crash added
Actions #5

Updated by Venky Shankar over 1 year ago

  • Status changed from New to Triaged
Actions #6

Updated by Thomas Widhalm about 1 year ago

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Actions #7

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code, it won't, and it doesn't need to.

Normally, when restarting an MDS, if all the journal logs are flushed and expired, then this issue does not occur when the MDS starts. But I need to read the code to make sure how the following case could happen:

ESession event in SegmentA
EUpdate("openc") event in a following SegmentB

Case A:

When restarting the MDS normally, will the MDS try to expire both SegmentA and SegmentB before stopping? Or will it just flush them out to the pool and let the replaying MDS expire them? If it's the latter, then if SegmentA was expired but SegmentB was left in the pool, the replaying MDS will hit this issue when it tries to replay the events in SegmentB.

Case B:

If the MDS crashes, leaving SegmentB in the pool.
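
A small sketch contrasting the two cases, under the simplifying assumption that expiry simply drops the oldest segment (all names illustrative): after Case B, the surviving portion of the journal no longer contains any ESessions event, which is exactly the state the replaying MDS cannot handle.

    #include <deque>
    #include <iostream>
    #include <string>

    struct Segment {
      std::string name;
      bool carries_esessions;   // does this segment contain an ESessions event?
    };

    // Case A: a clean shutdown flushes and expires every segment first,
    // so a later replay has nothing left to process.
    void clean_shutdown(std::deque<Segment>& journal) {
      journal.clear();
    }

    // Case B: SegmentA gets expired, then the MDS crashes before SegmentB does.
    void crash_after_expiring_front(std::deque<Segment>& journal) {
      journal.pop_front();
    }

    int main() {
      std::deque<Segment> journal = {
          {"SegmentA (ESessions)", true},
          {"SegmentB (EUpdate \"openc\")", false},
      };

      crash_after_expiring_front(journal);

      // What the replaying MDS is left with after the crash:
      bool saw_esessions = false;
      for (const auto& s : journal) {
        saw_esessions = saw_esessions || s.carries_esessions;
        std::cout << "to replay: " << s.name << "\n";
      }
      if (!saw_esessions)
        std::cout << "no ESessions event left to establish the sessionmap "
                     "version during replay\n";
      return 0;
    }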

Actions #8

Updated by Xiubo Li about 1 year ago

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning was there any MDS crash, or did any MDS daemon shut down abnormally?

Thanks

Actions #9

Updated by Xiubo Li about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Normally when restarting a MDS if all the journal logs are flushed and expired, then when the MDS is starting there is no this issue. But I need to read the code to make sure how the following case could happen:

[...]

Case A:

When restarting the MDS normally will the MDS try to expire both the SegmentA and SegmentB before stopping ? Or just flush them out to pool and then let the replay MDS to expire them ? If it's later, then in case if the SegmentA was expired but leaving the SegmentB in the pool, so if the replay MDS try to replay the events in SegementB we will hit this issue.

I checked the code: when the MDS daemons are stopped normally, the MDS will wait for all the MDLog segments to be flushed and trimmed. So this case shouldn't be an issue.

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether there are other cases that could also cause it.

To fix it, I think we need to add one ESessions event in each segment, just like the ESubtreeMap.
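
A rough sketch of that proposal as stated here (not necessarily the shape of the eventual fix): if every segment starts with an event recording the current sessionmap version, the way each segment already starts with an ESubtreeMap, then a replaying MDS can re-establish the version from whichever segments survive. Names are illustrative.

    #include <cstdint>
    #include <iostream>
    #include <optional>
    #include <vector>

    struct Event {
      enum Type { SessionsMarker, Update } type;
      uint64_t v;   // SessionsMarker: current version; Update: expected version + 1
    };

    struct Segment { std::vector<Event> events; };

    // With a marker at the head of each surviving segment, a replaying MDS can
    // always recover the version, no matter which older segments were expired
    // before the crash.
    std::optional<uint64_t> replay(const std::vector<Segment>& journal) {
      std::optional<uint64_t> version;
      for (const auto& seg : journal)
        for (const auto& e : seg.events) {
          if (e.type == Event::SessionsMarker)
            version = e.v;                       // version re-established here
          else if (!version || e.v != *version + 1)
            return std::nullopt;                 // would assert today
          else
            version = e.v;
        }
      return version;
    }

    int main() {
      // Only SegmentB survived the crash, but it begins with a marker.
      std::vector<Segment> journal = {
          {{ {Event::SessionsMarker, 41}, {Event::Update, 42} }},
      };
      if (auto v = replay(journal))
        std::cout << "replay ok, sessionmap version " << *v << "\n";
      else
        std::cout << "version gap, replay cannot proceed\n";
      return 0;
    }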

Actions #10

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Is it because it assumes that "case B" (below) will not be hit?

Normally when restarting a MDS if all the journal logs are flushed and expired, then when the MDS is starting there is no this issue. But I need to read the code to make sure how the following case could happen:

[...]

Case A:

When restarting the MDS normally will the MDS try to expire both the SegmentA and SegmentB before stopping ? Or just flush them out to pool and then let the replay MDS to expire them ? If it's later, then in case if the SegmentA was expired but leaving the SegmentB in the pool, so if the replay MDS try to replay the events in SegementB we will hit this issue.

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

Actions #11

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

There is one case could trigger IMO:

For example, there are two MDLog entries:

ESessions entry --> {..., version, ...}
EUpdate("openc") ---> {..., sessionmapv, ... } // sessionmapv = version + 1

If the mds restarts and both the above entries exist then it should be okay, because the mds' sessionmap could be updated correctly.

What if the mds only trimmed the first ESessions entry and left the EUpdate one in the journal log ? And then when replaying the journal logs since the mds' sessionmap couldn't be updated, so when replaying the EUpdate entry we will hit this.

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Normally when restarting a MDS if all the journal logs are flushed and expired, then when the MDS is starting there is no this issue. But I need to read the code to make sure how the following case could happen:

[...]

Case A:

When restarting the MDS normally will the MDS try to expire both the SegmentA and SegmentB before stopping ? Or just flush them out to pool and then let the replay MDS to expire them ? If it's later, then in case if the SegmentA was expired but leaving the SegmentB in the pool, so if the replay MDS try to replay the events in SegementB we will hit this issue.

Checked the code when the MDS daemons are stopped normally it will wait for all the MDLog to be flushed and trimmed. So it shouldn't be a issue for this case.

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Actions #12

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Wouldn't the ESessions log event update the sessionmap version during segment expiry?

From my reading of the code it won't and no need.

Is it because it assumes that "case B" (below) will not be hit?

IMO we didn't consider this case before.

This must have been hit several times before, and IMO this is also why Zheng added the mds_wipe_sessions option to wipe the sessions.

Actions #13

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume SegmentA was trimmed just before the MDS crashed, so the session map version was updated just before the crash. But when the standby MDS is replaying SegmentB, this new MDS daemon will have its session map version set to 0.

So updating the session map version when expiring SegmentA makes no sense here.

Actions #14

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume the Segment A was trimmed just before the MDS crashing and it will update the session map version just before crashing. But when the standby MDS is replaying the Segment B this new MDS daemon will set the session map version to 0.

Why would the session map version get reset to 0 after an mds failover? Maybe I'm missing something somewhere...

Actions #15

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume the Segment A was trimmed just before the MDS crashing and it will update the session map version just before crashing. But when the standby MDS is replaying the Segment B this new MDS daemon will set the session map version to 0.

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS fails over, the standby MDS needs to get this info from the MDLogs.

If I'm not misreading it, the version is initialized to 0 when an MDS boots, and only if the MDS replays an ESessions event will the version be updated. If there is no ESessions event, the version won't be touched.

Actions #16

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Case B:

If the MDS crashes and leaving the SegmentB in the pool.

This case really could cause the issue, but I am not sure whether is there other cases that also could cause it.

To fix it I think we need to add one ESessions event in each Segment just like the ESubtreemap.

Wouldn't updating the session map version when expiring segment A (ESession) suffice?

Just assume the Segment A was trimmed just before the MDS crashing and it will update the session map version just before crashing. But when the standby MDS is replaying the Segment B this new MDS daemon will set the session map version to 0.

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events could possibly have such issues. Do we know?

Actions #17

Updated by Xiubo Li about 1 year ago

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only in the following 4 cases do the log events store the sessionmapv, via journal_allocated_inos():

Cscope tag: journal_allocated_inos
   #   line  filename / context / line
   1   4576  mds/Server.cc <<handle_client_openc>>
             journal_allocated_inos(mdr, &le->metablob);
   2   6800  mds/Server.cc <<handle_client_mknod>>
             journal_allocated_inos(mdr, &le->metablob);
   3   6883  mds/Server.cc <<handle_client_mkdir>>
             journal_allocated_inos(mdr, &le->metablob);
   4   6968  mds/Server.cc <<handle_client_symlink>>
             journal_allocated_inos(mdr, &le->metablob);

These are all EUpdate events.
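
The four call sites above are all request paths that allocate inode numbers for a client. A hedged sketch of the writer side they imply (illustrative names, not the real journal_allocated_inos() implementation): the EUpdate's metablob records the projected sessionmap version together with the allocated inos, which is the sessionmapv the replay check later compares against.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    struct MetaBlob {
      uint64_t sessionmapv = 0;                 // version the replayer must match
      std::vector<uint64_t> allocated_inos;     // inos handed to the client
    };

    struct SessionMapWriter {
      uint64_t version = 41;
      uint64_t next_ino = 0x10000000001ULL;

      // Called from the openc/mknod/mkdir/symlink paths before journaling the
      // EUpdate: bump the (projected) version and record it in the metablob.
      void journal_allocated_inos(MetaBlob& mb, unsigned count) {
        mb.sessionmapv = ++version;             // sessionmapv = version + 1
        for (unsigned i = 0; i < count; ++i)
          mb.allocated_inos.push_back(next_ino++);
      }
    };

    int main() {
      SessionMapWriter sm;
      MetaBlob mb;
      sm.journal_allocated_inos(mb, 1);         // e.g. the handle_client_openc path
      std::cout << "EUpdate metablob: sessionmapv=" << mb.sessionmapv
                << " inos=" << mb.allocated_inos.size() << "\n";
      return 0;
    }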

Actions #18

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing an ESessions event might be the way forward, but I'm a bit reluctant to jump to that. Can't the latest session map version be read off during mds boot?

Actions #19

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing a ESEssion event might be the way forward, but I'm a bit reluctant to jump to that. Cannot the latest session map version be read off during mds boot?

Is there any other place that saves the session map version, other than the MDLog?

From my understanding, I'm afraid there isn't. So when an MDS fails over and the new MDS is booting, the session map version can only be inherited from the last MDS by replaying the MDLogs, or else it is just initialized to 0.

And instead of stashing an ESessions event in each segment, we can improve it by stashing one only when needed.

Let me read the code more carefully to see whether we can avoid stashing it.

Actions #20

Updated by Thomas Widhalm about 1 year ago

Xiubo Li wrote:

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning is there any MDS crash or MDS daemons shutdown abnormally ?

Thanks

The whole problem started when I upgraded my hosts from Fedora 36 to Fedora 37. After the upgrade, there seemed to be a major connection issue between the hosts. I'm all but new to debugging operating system misconfigurations but this problem just escaped me. The result was that all PGs were listed as unavailable. Out of pure desperation I restored 3 of my 7 hosts (the others are hardware, so a restore would be much harder) and the PGs were online again. Now there's a mix of Fedora 36 and Fedora 37. What I didn't see in the first place is that, even when everything looked like it was resyncing, I couldn't access RBD or CephFS. After some more tries and unblocking OSDs, RBD is available again. When I start the MDSs, they are still in status `up:replay`. And all but one (randomly chosen) survivor crash after a few minutes.

Actions #21

Updated by Xiubo Li about 1 year ago

Thomas Widhalm wrote:

Xiubo Li wrote:

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning is there any MDS crash or MDS daemons shutdown abnormally ?

Thanks

The whole problem started when I upgraded my hosts from Fedora 36 to Fedora 37. After the upgrade, there seemed to be a major connection issue between the hosts. I'm all but new to debugging operating system misconfigurations but this problem just escaped me. The result was, that all PGs where listed as unavailable. Out of pure desperation I restored 3 of my 7 hosts (the others are hardware, so a restore would be much harder) and the PGs were online again. Now there's a mix of Fedora 36 and Fedora 37. What I didn't see in the first place is that, even when everything looked like it was resyncing, I couldn't access RBD or CephFS. After some more tries and unblocking OSDs RBD is available again. When I start MDS, they still are in status `up:replaying`. And all but one (randomly chosen) survivor crash after a few minutes.

Okay, there might be some unknown factors here. But the MDS must have been stopped abnormally, or there wouldn't be these MDLogs left to replay.

Actions #22

Updated by Thomas Widhalm about 1 year ago

Xiubo Li wrote:

Thomas Widhalm wrote:

Xiubo Li wrote:

Thomas Widhalm wrote:

In the meantime I rebooted my hosts for regular maintenance (rolling reboot with only one node down). Since then I can access RBD data, at least read directories. But now, whenever I start an MDS it crashes soon after. I tried to flush the journal again, but that didn't help. If you need any additional data, please let me know.

Hi Thomas,

BTW, in the beginning is there any MDS crash or MDS daemons shutdown abnormally ?

Thanks

The whole problem started when I upgraded my hosts from Fedora 36 to Fedora 37. After the upgrade, there seemed to be a major connection issue between the hosts. I'm all but new to debugging operating system misconfigurations but this problem just escaped me. The result was, that all PGs where listed as unavailable. Out of pure desperation I restored 3 of my 7 hosts (the others are hardware, so a restore would be much harder) and the PGs were online again. Now there's a mix of Fedora 36 and Fedora 37. What I didn't see in the first place is that, even when everything looked like it was resyncing, I couldn't access RBD or CephFS. After some more tries and unblocking OSDs RBD is available again. When I start MDS, they still are in status `up:replaying`. And all but one (randomly chosen) survivor crash after a few minutes.

Okay, there might be some unknown stories. But the MDS should be stopped abnormally, or there shouldn't be these MDLogs replayed.

That might have happened during the upgrade process. I was cautious and upgraded one after the other. I waited for all services to come up again or at least I gave enough time before I rebooted the systems with `/sbin/reboot`. I can imagine that after the upgrade, MDS were in an abnormal state and while I tried to fix the PGs being unavailable, I missed the MDS having problems.

Actions #23

Updated by Venky Shankar about 1 year ago

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing a ESEssion event might be the way forward, but I'm a bit reluctant to jump to that. Cannot the latest session map version be read off during mds boot?

Is there any other place will save the session map version other than the MDLog ?

From my understanding I am afraid it couldn't. So in case when an MDS failover and the new MDS is booting the session map version could be inherited from last MDS by replaying the MDLogs, or just be initialized as 0.

Unless we can store it when persisting the session map.

Actions #24

Updated by Xiubo Li about 1 year ago

Venky Shankar wrote:

Xiubo Li wrote:

Venky Shankar wrote:

Xiubo Li wrote:

[...]

Why would the session map version get reset to 0 after a mds failover? Maybe I'm missing something somewhere...

When an MDS failover the standby MDS needs to get this info from the MDLogs.

If I didn't misreading it. It will be initialized as 0 in the beginning when an MDS is booting. And only if the MDS replay an ESessions event will the version be updated. If there is no ESessions event then the version won't be touched.

Hmmm... so the session map is not loaded at this point :/

I wonder what other log events possibly can have such issues. Do we know?

Only the following 4 cases will the log events store the sessionmapv in journal_allocated_inos():

[...]

These are all EUpdate events.

Hmmm... Stashing a ESEssion event might be the way forward, but I'm a bit reluctant to jump to that. Cannot the latest session map version be read off during mds boot?

Is there any other place will save the session map version other than the MDLog ?

From my understanding I am afraid it couldn't. So in case when an MDS failover and the new MDS is booting the session map version could be inherited from last MDS by replaying the MDLogs, or just be initialized as 0.

Unless we can storing it when persisting session map.

I found that LogSegment::try_to_expire(MDSRank *mds, MDSGatherBuilder &gather_bld, int op_prio) already persists the sessionmap.

I will read the code more carefully.

Actions #25

Updated by Xiubo Li about 1 year ago

The sessionmap is persisted when expiring any MDLog segment, in LogSegment::try_to_expire() -> sessionmap.save(); that means the sessionmap is persisted only when MDLog segments are expired.

If there aren't enough segments, the expiring won't happen. So it's possible that when an MDS crashes the expiry was never triggered, and then after failover the new MDS may not get the sessionmap info.

The inotable has the same logic, but it will force-replay the inotable version instead of asserting in the MDS daemon.
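
A minimal sketch of the difference pointed out above, assuming "force replay" simply means adopting the journalled version with a warning instead of asserting (illustrative code, not the actual journal.cc logic):

    #include <cassert>
    #include <cstdint>
    #include <iostream>

    // Policy used for the sessionmap today: a version gap is fatal unless the
    // operator has set something like mds_wipe_sessions.
    void replay_sessionmap_style(uint64_t& current, uint64_t journalled,
                                 bool wipe_sessions_conf) {
      if (journalled > current + 1) {
        std::cerr << "sessionmap version gap: " << current
                  << " -> " << journalled << "\n";
        assert(wipe_sessions_conf);   // stands in for ceph_assert(g_conf()->mds_wipe_sessions)
      }
      current = journalled;
    }

    // Policy described for the inotable: warn and force the replayed version
    // forward so replay can continue.
    void replay_inotable_style(uint64_t& current, uint64_t journalled) {
      if (journalled > current + 1)
        std::cerr << "inotable version gap, forcing " << current
                  << " -> " << journalled << "\n";
      current = journalled;
    }

    int main() {
      uint64_t inotable_v = 0;
      replay_inotable_style(inotable_v, 42);        // warns, then continues
      std::cout << "inotable replayed to v" << inotable_v << "\n";

      uint64_t sessionmap_v = 0;
      replay_sessionmap_style(sessionmap_v, 42, /*wipe_sessions_conf=*/false);
      // ^ aborts here when assertions are enabled, mirroring the crash above
      return 0;
    }

Running the sketch prints the inotable-style warning and then aborts on the sessionmap-style check, which mirrors the crash in this report.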

Actions #26

Updated by Xiubo Li about 1 year ago

  • Status changed from Triaged to Fix Under Review
  • Pull request ID set to 49970
Actions #27

Updated by Venky Shankar about 1 year ago

  • Status changed from Fix Under Review to Pending Backport
Actions #28

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59006: quincy: mds stuck in 'up:replay' and crashed. added
Actions #29

Updated by Backport Bot about 1 year ago

  • Copied to Backport #59007: pacific: mds stuck in 'up:replay' and crashed. added
Actions #30

Updated by Backport Bot about 1 year ago

  • Tags set to backport_processed
Actions #32

Updated by Venky Shankar about 1 year ago

Laura Flores wrote:

@Venky relevant thread: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/GA77DLSQXCXZVJ4BYQ6KDW4DLU5IFCPG/

Thanks for bringing this to our notice, Laura.

Actions #33

Updated by Xiubo Li about 1 year ago

  • Backport changed from pacific,quincy to reef,pacific,quincy
Actions #34

Updated by Xiubo Li about 1 year ago

  • Copied to Backport #59404: reef: mds stuck in 'up:replay' and crashed. added
Actions #35

Updated by Xiubo Li 10 months ago

  • Status changed from Pending Backport to Resolved
Actions #36

Updated by Venky Shankar 8 months ago

  • Related to Bug #59768: crash: void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*): assert(g_conf()->mds_wipe_sessions) added
Actions #37

Updated by Venky Shankar 8 months ago

  • Related to Bug #61009: crash: void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]: assert(p->first <= start) added
Actions #38

Updated by Venky Shankar 7 months ago

  • Related to Bug #63103: mds: disable delegating inode ranges to clients added