Bug #19118: MDS heartbeat timeout during rejoin, when working with large amount of caps/inodes - CephFS - Ceph

Actions

Copy link

Bug #19118

closed

MDS heartbeat timeout during rejoin, when working with large amount of caps/inodes

Added by Xiaoxi Chen about 7 years ago. Updated over 6 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Category:

Target version:

% Done:

Source:

Community (dev)

Tags:

Backport:

jewel, kraken

Regression:

Severity:

1 - critical

Reviewed:

Affected Versions:

ceph-qa-suite:

Component(FS):

Labels (FS):

Pull request ID:

Crash signature (v1):

Crash signature (v2):

Description

We set an alarm every OPTION seconds, if mds_rank doesnt finish its task within this time, the beacon will be skipped and resulting the mds be killed by mon.

But during mds failover, it is usually take long time for rejoin and active phase, in our case, it take ~3 mins for active phase in one of our cluster (with ~ 4.5M caps, 600K inodes). And one other cluster take quite a long time for "rejoin".

It is a critical bug which will cause MDS flipping and failover again and again across the cluster, make the fs not accessable.

A simple fix might be call reset_timeout(g_conf->mds_map_processing_timeout, 0 ) if we are working on mdsmap ?

We are on 10.2.5.
-----------------------------------------------

2017-03-01 07:47:02.987237 7f5fc8d85700 1 mds.0.0 replay_done (as standby)
2017-03-01 07:47:03.133133 7f5fcd58e700 1 mds.0.1226 handle_mds_map i am now mds.0.1226
2017-03-01 07:47:03.133136 7f5fcd58e700 1 mds.0.1226 handle_mds_map state change up:standby-replay --> up:replay
2017-03-01 07:47:03.987426 7f5fcad89700 1 mds.0.1226 standby_replay_restart (as standby)
2017-03-01 07:47:03.989874 7f5fc8d85700 1 mds.0.1226 replay_done (as standby)
2017-03-01 07:47:03.989881 7f5fc8d85700 1 mds.0.1226 standby_replay_restart (final takeover pass)
2017-03-01 07:47:03.991720 7f5fc8d85700 1 mds.0.1226 replay_done
2017-03-01 07:47:03.991721 7f5fc8d85700 1 mds.0.1226 making mds journal writeable
2017-03-01 07:47:05.008735 7f5fcd58e700 1 mds.0.1226 handle_mds_map i am now mds.0.1226
2017-03-01 07:47:05.008750 7f5fcd58e700 1 mds.0.1226 handle_mds_map state change up:replay --> up:reconnect
2017-03-01 07:47:05.008774 7f5fcd58e700 1 mds.0.1226 reconnect_start
2017-03-01 07:47:05.008778 7f5fcd58e700 1 mds.0.1226 reopen_log
2017-03-01 07:47:05.008788 7f5fcd58e700 1 mds.0.server reconnect_clients -- 2 sessions
2017-03-01 07:47:05.008875 7f5fcd58e700 0 log_channel(cluster) log [DBG] : reconnect by client.29619146 10.153.10.84:0/3595018518 after 0.000032
2017-03-01 07:47:06.630794 7f5fcd58e700 0 log_channel(cluster) log [DBG] : reconnect by client.31154366 10.161.158.160:0/1163665399 after 1.621938
2017-03-01 07:47:07.285004 7f5fcd58e700 1 mds.0.1226 reconnect_done
2017-03-01 07:47:07.779351 7f5fcd58e700 1 mds.0.1226 handle_mds_map i am now mds.0.1226
2017-03-01 07:47:07.779356 7f5fcd58e700 1 mds.0.1226 handle_mds_map state change up:reconnect --> up:rejoin
2017-03-01 07:47:07.779386 7f5fcd58e700 1 mds.0.1226 rejoin_start
2017-03-01 07:47:23.285393 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:23.285400 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:23.655822 7f5fcf592700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:27.285605 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:27.285636 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:28.655983 7f5fcf592700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:31.285621 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:31.285635 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:33.656158 7f5fcf592700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:35.285704 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:35.285777 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:37.612919 7f5fcd58e700 1 mds.0.1226 rejoin_joint_start
2017-03-01 07:47:37.622619 7f5fcad89700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2017-03-01 07:47:38.870725 7f5fcd58e700 1 mds.ceph-mds02-952239 handle_mds_map i (10.156.82.190:6801/1749902841) dne in the mdsmap, respawning myself
2017-03-01 07:47:38.870728 7f5fcd58e700 1 mds.ceph-mds02-952239 respawn
2017-03-01 07:47:38.870729 7f5fcd58e700 1 mds.ceph-mds02-952239 e: '/usr/bin/ceph-mds'
2017-03-01 07:47:38.870731 7f5fcd58e700 1 mds.ceph-mds02-952239 0: '/usr/bin/ceph-mds'

Related issues 2 (0 open — 2 closed)

Actions

Copy link

Updated by Xiaoxi Chen about 7 years ago

Priority changed from Normal to Urgent

Actions

Copy link

Updated by John Spray about 7 years ago

Subject changed from NDS flipping during failover, when working with large amount of caps/inodes to MDS heartbeat , when working with large amount of caps/inodes

May be related to http://tracker.ceph.com/issues/18730, although perhaps not since that one shouldn't be causing the heartbeat mechanism to trip.

The timeout the MDS is hitting here is the mds_beacon_grace setting. This is used on both the mon and the MDS. In this case, you would need to increase it on the MDS side to give the MDS a chance to make it through rejoin.

Before doing that, please could you attach a debugger to the MDS while it is in the stuck state (i.e. while it is giving "heartbeat map not healthy") and grab a backtrace of all threads. It's possible this is happening during an O(caps) loop somewhere.

Actions

Copy link

Updated by John Spray about 7 years ago

Subject changed from MDS heartbeat , when working with large amount of caps/inodes to MDS heartbeat timeout during rejoin, when working with large amount of caps/inodes

Actions

Copy link

Updated by Xiaoxi Chen about 7 years ago

I have that already. I did set the beacon_grace to 600s to walk around the bug and bring the cluster back.

In first case(stuck on active phase), mds is busying on MDCache::export_remaining_imported_caps(), this function spends 3 mins and it looks like an O(caps*clients).

The second case(stuck on rejoin), it seems working on the backtrace(1) and replay the journal(2).

[1]
2017-03-01 07:54:04.231242 7f4e2188a700 10 MDSIOContextBase::complete: 32C_IO_MDC_OpenInoBacktraceFetched
2017-03-01 07:54:04.231244 7f4e2188a700 10 mds.0.cache _open_ino_backtrace_fetched ino 10000171d4d errno 0
2017-03-01 07:54:04.231248 7f4e2188a700 10 mds.0.cache old object in pool 2, retrying pool 11
2017-03-01 07:54:04.231389 7f4e2188a700 10 MDSIOContextBase::complete: 32C_IO_MDC_OpenInoBacktraceFetched
2017-03-01 07:54:04.231400 7f4e2188a700 10 mds.0.cache _open_ino_backtrace_fetched ino 10000171d50 errno 0
2017-03-01 07:54:04.231412 7f4e2188a700 10 mds.0.cache old object in pool 2, retrying pool 11

[2]
2017-03-01 07:56:53.179938 7f4e2188a700 12 mds.0.cache.dir(100000fe965) add_primary_dentry [dentry #1/sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3712.stratus.slc.ebay.com.pem [2,head] auth (dversion lock) pv=0 v=259855 inode=0x7f4ebcb10e70 0x7f4eb60379d0]
2017-03-01 07:56:53.180206 7f4e2188a700 12 mds.0.cache.dir(100000fe965) _fetched got [dentry #1/sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3712.stratus.slc.ebay.com.pem [2,head] auth (dversion lock) pv=0 v=259855 inode=0x7f4ebcb10e70 0x7f4eb60379d0] [inode 1000010bdfc [2,head] /sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3712.stratus.slc.ebay.com.pem auth v221548 s=1992 n(v0 b1992 1=1+0) (iversion lock) 0x7f4ebcb10e70]
2017-03-01 07:56:53.180444 7f4e2188a700 20 mds.0.cache.dir(100000fe965) _fetched pos 54009 marker 'I' dname 'slc5b01c-3701.stratus.slc.ebay.com.pem [2,head]
2017-03-01 07:56:53.180447 7f4e2188a700 20 mds.0.cache.dir(100000fe965) lookup (head, 'slc5b01c-3701.stratus.slc.ebay.com.pem')
2017-03-01 07:56:53.180466 7f4e2188a700 20 mds.0.cache.dir(100000fe965) miss -> (slc5b02c-7hqm.stratus.slc.ebay.com.pem,head)
2017-03-01 07:56:53.180479 7f4e2188a700 12 mds.0.cache.dir(100000fe965) add_primary_dentry [dentry #1/sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3701.stratus.slc.ebay.com.pem [2,head] auth (dversion lock) pv=0 v=259855 inode=0x7f4ebcb11440 0x7f4eb6037b60]

Actions

Copy link

Updated by Xiaoxi Chen about 7 years ago

I have that already. I did set the beacon_grace to 600s to walk around the bug and bring the cluster back.

Seems related with http://tracker.ceph.com/issues/18730, truly.

In first case(stuck on active phase), mds is busying on MDCache::export_remaining_imported_caps(), this function spends 3 mins and it looks like an O(caps*clients). [1]

The second case(stuck on rejoin), it seems working on the backtrace² and replay the journal³.

[1]
2017-02-28 20:32:47.286796 7f658894b700 1 mds.0.263 reconnect_done
2017-02-28 20:32:48.260760 7f658894b700 1 mds.0.263 handle_mds_map i am now mds.0.263
2017-02-28 20:32:48.260882 7f658894b700 1 mds.0.263 handle_mds_map state change up:reconnect --> up:rejoin
2017-02-28 20:32:48.260957 7f658894b700 1 mds.0.263 rejoin_start
2017-02-28 20:33:13.823818 7f658894b700 1 mds.0.263 rejoin_joint_start
2017-02-28 20:33:34.744161 7f6584943700 1 mds.0.263 rejoin_done
2017-02-28 20:33:35.970117 7f658894b700 1 mds.0.263 handle_mds_map i am now mds.0.263
2017-02-28 20:33:35.970127 7f658894b700 1 mds.0.263 handle_mds_map state change up:rejoin --> up:active
2017-02-28 20:33:35.970178 7f658894b700 1 mds.0.263 recovery_done -- successful recovery!
2017-02-28 20:33:35.971970 7f658894b700 1 mds.0.263 active_start
2017-02-28 20:36:27.436389 7f658894b700 0 log_channel(cluster) log [WRN] : failed to reconnect caps for missing inodes:
2017-02-28 20:36:27.436411 7f658894b700 0 log_channel(cluster) log [WRN] : ino 100000665cf
2017-02-28 20:36:27.436413 7f658894b700 0 log_channel(cluster) log [WRN] : ino 100000665d0
2017-02-28 20:36:27.436417 7f658894b700 0 log_channel(cluster) log [WRN] : ino 100000665d1

[2]
2017-03-01 07:54:04.231242 7f4e2188a700 10 MDSIOContextBase::complete: 32C_IO_MDC_OpenInoBacktraceFetched
2017-03-01 07:54:04.231244 7f4e2188a700 10 mds.0.cache _open_ino_backtrace_fetched ino 10000171d4d errno 0
2017-03-01 07:54:04.231248 7f4e2188a700 10 mds.0.cache old object in pool 2, retrying pool 11
2017-03-01 07:54:04.231389 7f4e2188a700 10 MDSIOContextBase::complete: 32C_IO_MDC_OpenInoBacktraceFetched
2017-03-01 07:54:04.231400 7f4e2188a700 10 mds.0.cache _open_ino_backtrace_fetched ino 10000171d50 errno 0
2017-03-01 07:54:04.231412 7f4e2188a700 10 mds.0.cache old object in pool 2, retrying pool 11

[3]
2017-03-01 07:56:53.179938 7f4e2188a700 12 mds.0.cache.dir(100000fe965) add_primary_dentry [dentry #1/sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3712.stratus.slc.ebay.com.pem [2,head] auth (dversion lock) pv=0 v=259855 inode=0x7f4ebcb10e70 0x7f4eb60379d0]
2017-03-01 07:56:53.180206 7f4e2188a700 12 mds.0.cache.dir(100000fe965) _fetched got [dentry #1/sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3712.stratus.slc.ebay.com.pem [2,head] auth (dversion lock) pv=0 v=259855 inode=0x7f4ebcb10e70 0x7f4eb60379d0] [inode 1000010bdfc [2,head] /sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3712.stratus.slc.ebay.com.pem auth v221548 s=1992 n(v0 b1992 1=1+0) (iversion lock) 0x7f4ebcb10e70]
2017-03-01 07:56:53.180444 7f4e2188a700 20 mds.0.cache.dir(100000fe965) _fetched pos 54009 marker 'I' dname 'slc5b01c-3701.stratus.slc.ebay.com.pem [2,head]
2017-03-01 07:56:53.180447 7f4e2188a700 20 mds.0.cache.dir(100000fe965) lookup (head, 'slc5b01c-3701.stratus.slc.ebay.com.pem')
2017-03-01 07:56:53.180466 7f4e2188a700 20 mds.0.cache.dir(100000fe965) miss -> (slc5b02c-7hqm.stratus.slc.ebay.com.pem,head)
2017-03-01 07:56:53.180479 7f4e2188a700 12 mds.0.cache.dir(100000fe965) add_primary_dentry [dentry #1/sharefs_prod/8c285b3b59a843b6aab623314288ee36/ssl/ssl1/ca/ca/signed/slc5b01c-3701.stratus.slc.ebay.com.pem [2,head] auth (dversion lock) pv=0 v=259855 inode=0x7f4ebcb11440 0x7f4eb6037b60]

Actions

Copy link

Updated by John Spray about 7 years ago

Status changed from New to Fix Under Review

https://github.com/ceph/ceph/pull/13807

I've added a periodic heartbeat reset in the long loop in export_remaining_imported_caps.

The 18730 issue should not cause the mds_lock to be held for too long (it's dropped in between IOs) but it's possible that such intense IO is simply causing the tick() to never get the lock, as I don't think we do any fairness logic on locking. I've added a heartbeat_reset to the main dispatch function so that we reset it even if we are so busy dispatching messages that we hold up tick().

The replay case could in principle cause a heartbeat timeout in a similar way, but it seems kind of unlikely. MDLog::_replay_thread does unlock mds_lock in between replaying events, so tick() should be getting an opportunity to run.

If you could try out that patch I would be very interested in the results. (warning: I just wrote it and haven't even checked it compiles)

Actions

Copy link

Updated by Zheng Yan about 7 years ago

John Spray wrote:

https://github.com/ceph/ceph/pull/13807

I've added a periodic heartbeat reset in the long loop in export_remaining_imported_caps.

The 18730 issue should not cause the mds_lock to be held for too long (it's dropped in between IOs) but it's possible that such intense IO is simply causing the tick() to never get the lock, as I don't think we do any fairness logic on locking. I've added a heartbeat_reset to the main dispatch function so that we reset it even if we are so busy dispatching messages that we hold up tick().

I don't understand how intense IO can cause timer thread to starve. There are only few threads (probably one) that execute IO's finish contexts. No matter how frequently they acquire the mds_lock, it's unlike the timer thread gets starved.

The replay case could in principle cause a heartbeat timeout in a similar way, but it seems kind of unlikely. MDLog::_replay_thread does unlock mds_lock in between replaying events, so tick() should be getting an opportunity to run.

If you could try out that patch I would be very interested in the results. (warning: I just wrote it and haven't even checked it compiles)

Actions

Copy link