Bug #19118


MDS heartbeat timeout during rejoin when working with a large number of caps/inodes

Added by Xiaoxi Chen about 7 years ago. Updated over 6 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (dev)
Tags:
Backport: jewel, kraken
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We arm a heartbeat alarm every OPTION seconds; if mds_rank does not finish its current task within that window, the beacon is skipped, and the mon ends up killing the MDS.
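
A toy illustration of this watchdog shape (not Ceph code; the HeartbeatHandle type and the timings are made up for the example): a beacon thread checks a deadline that the worker thread is supposed to keep re-arming, and skips the beacon once the worker falls behind, just like the "skipping beacon, heartbeat map not healthy" lines in the log below.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

struct HeartbeatHandle {
    std::atomic<Clock::rep> deadline{0};   // expiry, in Clock ticks

    // Worker calls this at the start of each unit of work.
    void reset_timeout(std::chrono::seconds grace) {
        deadline = (Clock::now() + grace).time_since_epoch().count();
    }

    // Beacon thread calls this before sending a beacon.
    bool is_healthy() const {
        return Clock::now().time_since_epoch().count() < deadline;
    }
};

int main() {
    HeartbeatHandle hb;
    hb.reset_timeout(std::chrono::seconds(1));  // 1s grace for the demo

    // Beacon thread: refuses to send while the worker looks stuck.
    std::thread beacon([&hb] {
        for (int i = 0; i < 6; ++i) {
            std::puts(hb.is_healthy() ? "beacon sent"
                                      : "skipping beacon, heartbeat not healthy");
            std::this_thread::sleep_for(std::chrono::milliseconds(500));
        }
    });

    // "Worker": a long rejoin-like task that never re-arms the handle,
    // so beacons start getting skipped once the grace period elapses.
    std::this_thread::sleep_for(std::chrono::seconds(3));
    beacon.join();
    return 0;
}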

But during MDS failover it usually takes a long time to get through the rejoin and active phases. In our case, the active phase takes ~3 minutes on one of our clusters (with ~4.5M caps, 600K inodes), and another cluster spends quite a long time in "rejoin".

This is a critical bug: it causes the MDS to flip and fail over again and again across the cluster, making the filesystem inaccessible.

A simple fix might be to call reset_timeout(g_conf->mds_map_processing_timeout, 0) while we are working on the mdsmap?
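
Roughly what that fix would look like, reusing the toy HeartbeatHandle from the sketch above (the proposed mds_map_processing_timeout option does not exist yet, and process_some_inodes() with its batch loop is a hypothetical stand-in for units of mdsmap/rejoin work):

// Reusing the toy HeartbeatHandle from the sketch above.
static void process_some_inodes(int /*batch*/) {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));  // pretend work
}

void handle_mds_map(HeartbeatHandle& hb, std::chrono::seconds grace) {
    hb.reset_timeout(grace);            // re-arm before the long phase starts
    for (int batch = 0; batch < 100; ++batch) {
        process_some_inodes(batch);     // one unit of caps/inodes processing
        hb.reset_timeout(grace);        // re-arm between batches so the
                                        // beacon thread never has to skip
    }
}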

We are on 10.2.5.
-----------------------------------------------

2017-03-01 07:47:02.987237 7f5fc8d85700 1 mds.0.0 replay_done (as standby)
2017-03-01 07:47:03.133133 7f5fcd58e700 1 mds.0.1226 handle_mds_map i am now mds.0.1226
2017-03-01 07:47:03.133136 7f5fcd58e700 1 mds.0.1226 handle_mds_map state change up:standby-replay --> up:replay
2017-03-01 07:47:03.987426 7f5fcad89700 1 mds.0.1226 standby_replay_restart (as standby)
2017-03-01 07:47:03.989874 7f5fc8d85700 1 mds.0.1226 replay_done (as standby)
2017-03-01 07:47:03.989881 7f5fc8d85700 1 mds.0.1226 standby_replay_restart (final takeover pass)
2017-03-01 07:47:03.991720 7f5fc8d85700 1 mds.0.1226 replay_done
2017-03-01 07:47:03.991721 7f5fc8d85700 1 mds.0.1226 making mds journal writeable
2017-03-01 07:47:05.008735 7f5fcd58e700 1 mds.0.1226 handle_mds_map i am now mds.0.1226
2017-03-01 07:47:05.008750 7f5fcd58e700 1 mds.0.1226 handle_mds_map state change up:replay --> up:reconnect
2017-03-01 07:47:05.008774 7f5fcd58e700 1 mds.0.1226 reconnect_start
2017-03-01 07:47:05.008778 7f5fcd58e700 1 mds.0.1226 reopen_log
2017-03-01 07:47:05.008788 7f5fcd58e700 1 mds.0.server reconnect_clients -- 2 sessions
2017-03-01 07:47:05.008875 7f5fcd58e700 0 log_channel(cluster) log [DBG] : reconnect by client.29619146 10.153.10.84:0/3595018518 after 0.000032
2017-03-01 07:47:06.630794 7f5fcd58e700 0 log_channel(cluster) log [DBG] : reconnect by client.31154366 10.161.158.160:0/1163665399 after 1.621938
2017-03-01 07:47:07.285004 7f5fcd58e700 1 mds.0.1226 reconnect_done
2017-03-01 07:47:07.779351 7f5fcd58e700 1 mds.0.1226 handle_mds_map i am now mds.0.1226
2017-03-01 07:47:07.779356 7f5fcd58e700 1 mds.0.1226 handle_mds_map state change up:reconnect --> up:rejoin
2017-03-01 07:47:07.779386 7f5fcd58e700 1 mds.0.1226 rejoin_start
2017-03-01 07:47:23.285393 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:23.285400 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:23.655822 7f5fcf592700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:27.285605 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:27.285636 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:28.655983 7f5fcf592700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:31.285621 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:31.285635 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:33.656158 7f5fcf592700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:35.285704 7f5fca588700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2017-03-01 07:47:35.285777 7f5fca588700 1 mds.beacon.ceph-mds02-952239 _send skipping beacon, heartbeat map not healthy
2017-03-01 07:47:37.612919 7f5fcd58e700 1 mds.0.1226 rejoin_joint_start
2017-03-01 07:47:37.622619 7f5fcad89700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2017-03-01 07:47:38.870725 7f5fcd58e700 1 mds.ceph-mds02-952239 handle_mds_map i (10.156.82.190:6801/1749902841) dne in the mdsmap, respawning myself
2017-03-01 07:47:38.870728 7f5fcd58e700 1 mds.ceph-mds02-952239 respawn
2017-03-01 07:47:38.870729 7f5fcd58e700 1 mds.ceph-mds02-952239 e: '/usr/bin/ceph-mds'
2017-03-01 07:47:38.870731 7f5fcd58e700 1 mds.ceph-mds02-952239 0: '/usr/bin/ceph-mds'


Related issues (2 total: 0 open, 2 closed)

Copied to CephFS - Backport #19334: jewel: MDS heartbeat timeout during rejoin when working with a large number of caps/inodes (Resolved, Nathan Cutler)
Copied to CephFS - Backport #19335: kraken: MDS heartbeat timeout during rejoin when working with a large number of caps/inodes (Resolved, Nathan Cutler)