Bug #36094


mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS))

Added by Frank Yu over 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee: Zheng Yan
Category: -
Target version: v15.0.0
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions: v13.2.1
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID: 32020
Crash signature (v1):
Crash signature (v2):

Description

Hi, I have hit the MDS error "FAILED assert(omap_num_objs <= MAX_OBJECTS)" in my production environment.
Ceph version: Mimic 13.2.1
Clients: CentOS 7.5

mds log:

2018-09-20 08:53:17.610 7f8571755700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f8571755700 time 2018-09-20 08:53:17.609555
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/src/mds/OpenFileTable.cc: 467: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f857d828e1f]
 2: (()+0x284fe7) [0x7f857d828fe7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b13) [0x55e77b081533]
 4: (MDLog::trim(int)+0x5a6) [0x55e77b06ede6]
 5: (MDSRankDispatcher::tick()+0x23b) [0x55e77ae0cfcb]
 6: (FunctionContext::finish(int)+0x2c) [0x55e77adfd70c]
 7: (Context::complete(int)+0x9) [0x55e77adfb189]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f857d82573b]
 9: (SafeTimerThread::entry()+0xd) [0x7f857d826cfd]
 10: (()+0x7e25) [0x7f857b453e25]
 11: (clone()+0x6d) [0x7f857a533bad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 <== client.2772295 10.10.216.2:0/258380382 1423254100 ==== client_request(client.2772295:14342717695 rmxattr #0x100008666eb security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2 ==== 178+0+0 (3162234629 0 0) 0x55f1c9fc5b00 con 0x55e7931db800
 -9999> 2018-09-20 08:53:17.238 7f85741e3700  4 mds.0.server handle_client_request client_request(client.2772295:14342717695 rmxattr #0x100008666eb security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2
 -9998> 2018-09-20 08:53:17.238 7f856cf4c700  5 mds.0.log _submit_thread 2610936708022~994 : EOpen [metablob 0x10000866586, 1 dirs], 1 open files
 -9997> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 --> 10.10.216.2:0/258380382 -- client_reply(???:14342717695 = -61 (61) No data available safe) v1 -- 0x55ef49ea4b00 con 0
 -9996> 2018-09-20 08:53:17.238 7f85761e7700  5 -- 10.10.216.20:6800/4206406544 >> 10.10.216.2:0/258380382 conn(0x55e7931db800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=14770 cs=1 l=0). rx client.2772295 seq 1423254101 0x55f1967b4900 client_request(client.2772295:14342717696 rmxattr #0x100008666ec security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2
 -9995> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 <== client.2772295 10.10.216.2:0/258380382 1423254101 ==== client_request(client.2772295:14342717696 rmxattr #0x100008666ec security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2 ==== 178+0+0 (2702756268 0 0) 0x55f1967b4900 con 0x55e7931db800
 -9994> 2018-09-20 08:53:17.238 7f85741e3700  4 mds.0.server handle_client_request client_request(client.2772295:14342717696 rmxattr #0x100008666ec security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2
 -9993> 2018-09-20 08:53:17.238 7f856cf4c700  5 mds.0.log _submit_thread 2610936709036~993 : EOpen [metablob 0x10000866586, 1 dirs], 1 open files
 -9992> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 --> 10.10.216.2:0/258380382 -- client_reply(???:14342717696 = -61 (61) No data available safe) v1 -- 0x55ea6c5a6840 con 0

Related issues 3 (0 open, 3 closed)

Has duplicate CephFS - Bug #43800: FAILED ceph_assert(omap_num_objs <= MAX_OBJECTS) - primary and standby MDS failure (Duplicate)

Copied to CephFS - Backport #43347: mimic: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #43348: nautilus: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) (Resolved, Patrick Donnelly)
Actions #1

Updated by Frank Yu over 5 years ago

When the active MDS crashed, the other (standby) MDS failed as well:

2018-09-20 08:54:00.031 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806345 10.10.216.35:0/2329806341 after 1.614616
2018-09-20 08:54:00.075 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806327 10.10.216.31:0/1373675291 after 1.659129
2018-09-20 08:54:00.173 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806330 10.10.216.42:0/1428472672 after 1.757113
2018-09-20 08:54:00.787 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872594 10.10.216.46:0/240766887 after 2.371948
2018-09-20 08:54:00.821 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806357 10.10.216.52:0/1803840812 after 2.405711
2018-09-20 08:54:00.825 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2740441 10.10.216.28:0/3807938763 after 2.409857
2018-09-20 08:54:01.212 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806396 10.10.216.61:0/3846051036 after 2.796684
2018-09-20 08:54:02.879 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806369 10.10.216.26:0/1429523594 after 4.463692
2018-09-20 08:54:03.601 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40609 10.10.216.25:0/4272525572 after 5.186328
2018-09-20 08:54:09.458 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40600 10.10.216.34:0/1288037016 after 11.043182
2018-09-20 08:54:11.990 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806399 10.10.216.75:0/2363935332 after 13.575376
2018-09-20 08:54:18.310 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2772295 10.10.216.2:0/258380382 after 19.895174
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40588 10.10.216.29:0/3966208469
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu007.hogpu.cc:cephfs-common (40588), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40597 10.10.216.36:0/3922554286
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu011.hogpu.cc: (40597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740435 10.10.216.3:0/1744283770
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu002.hogpu.cc:cephfs-common (2740435), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740444 10.10.216.30:0/344096429
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-002.hogpu.cc: (2740444), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740447 10.10.216.39:0/795303220
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-003.hogpu.cc:cephfs-common (2740447), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.242 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743528 10.10.216.9:0/1592138423
2018-09-20 08:54:44.242 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu008.hogpu.cc: (2743528), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.243 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743540 10.10.216.11:0/3857891308
2018-09-20 08:54:44.243 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu010.hogpu.cc: (2743540), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806315 10.10.216.23:0/209935866
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-004.hogpu.cc: (2806315), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806339 10.10.216.43:0/289815903
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-012.hogpu.cc:cephfs-common (2806339), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806372 10.10.216.53:0/317404650
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-022.hogpu.cc:cephfs-common (2806372), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806375 10.10.216.57:0/349737048
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-026.hogpu.cc:cephfs-common (2806375), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806378 10.10.216.54:0/864089971
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-023.hogpu.cc:cephfs-common (2806378), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806387 10.10.216.68:0/3434718465
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-031.hogpu.cc:cephfs-common (2806387), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806390 10.10.216.60:0/2888677659
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-029.hogpu.cc: (2806390), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806393 10.10.216.58:0/2243157754
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-027.hogpu.cc: (2806393), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806405 10.10.216.77:0/3435142704
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-033.hogpu.cc:cephfs-common (2806405), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2838481 10.10.224.14:0/1974838532
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu001.hogpu.cc: (2838481), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2872597 10.10.216.51:0/1414240502
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-020.hogpu.cc:cephfs-common (2872597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875816 10.10.224.16:0/4256317726
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu002.hogpu.cc: (2875816), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875852 10.10.224.21:0/766361199
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu003.hogpu.cc: (2875852), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875855 10.10.216.22:0/3341113140
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu004.hogpu.cc: (2875855), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875858 10.10.216.24:0/2069109781
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu005.hogpu.cc: (2875858), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875861 10.10.216.48:0/3466763147
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-017.hogpu.cc: (2875861), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875939 10.10.216.41:0/3785796933
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu013.hogpu.cc:cephfs-common (2875939), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect will complete once clients are evicted
2018-09-20 08:54:44.319 7faaa2fdb700  1 mds.0.2145 reconnect_done
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:reconnect --> up:rejoin
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_start
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_joint_start
[root@kscpu018 ceph]# egrep "2018-09-20 08:5" ceph-mds.kscpu018.hogpu.cc.log
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:boot --> up:replay
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145 replay_start
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145  recovery set is
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145  waiting for osdmap 9834 (which blacklists prior instance)
2018-09-20 08:53:56.571 7faaa27da700  0 mds.0.cache creating system inode with ino:0x100
2018-09-20 08:53:56.571 7faaa27da700  0 mds.0.cache creating system inode with ino:0x1
2018-09-20 08:53:58.276 7faaa17d8700  1 mds.0.2145 replay_done
2018-09-20 08:53:58.276 7faaa17d8700  1 mds.0.2145 making mds journal writeable
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:replay --> up:reconnect
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 reconnect_start
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 reopen_log
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.server reconnect_clients -- 53 sessions
2018-09-20 08:53:58.417 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2743525 10.10.216.8:0/3116427638 after 0.000689
2018-09-20 08:53:58.417 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806318 10.10.216.27:0/615703051 after 0.001044
2018-09-20 08:53:58.417 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2743534 10.10.216.10:0/3573197224 after 0.001102
2018-09-20 08:53:58.418 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2740438 10.10.216.4:0/1318746411 after 0.002246
2018-09-20 08:53:58.420 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806342 10.10.216.45:0/2107016146 after 0.004562
2018-09-20 08:53:58.431 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872600 10.10.216.38:0/480544929 after 0.015399
2018-09-20 08:53:58.436 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806354 10.10.216.47:0/3846352961 after 0.020009
2018-09-20 08:53:58.500 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806324 10.10.216.50:0/1246636586 after 0.083826
2018-09-20 08:53:58.514 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806351 10.10.216.44:0/2088277603 after 0.098046
2018-09-20 08:53:58.515 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806384 10.10.216.59:0/435428671 after 0.098886
2018-09-20 08:53:58.776 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872618 10.10.216.55:0/3083184790 after 0.359853
2018-09-20 08:53:59.190 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806363 10.10.216.37:0/4002017154 after 0.774424
2018-09-20 08:53:59.469 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40579 10.10.216.33:0/3554613509 after 1.052783
2018-09-20 08:53:59.475 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806408 10.10.216.40:0/3697701789 after 1.059244
2018-09-20 08:53:59.476 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40603 10.10.216.32:0/3007002808 after 1.059630
2018-09-20 08:53:59.555 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806381 10.10.216.56:0/3497284560 after 1.139413
2018-09-20 08:53:59.953 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806348 10.10.216.49:0/4132059874 after 1.536633
2018-09-20 08:54:00.031 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806345 10.10.216.35:0/2329806341 after 1.614616
2018-09-20 08:54:00.075 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806327 10.10.216.31:0/1373675291 after 1.659129
2018-09-20 08:54:00.173 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806330 10.10.216.42:0/1428472672 after 1.757113
2018-09-20 08:54:00.787 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872594 10.10.216.46:0/240766887 after 2.371948
2018-09-20 08:54:00.821 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806357 10.10.216.52:0/1803840812 after 2.405711
2018-09-20 08:54:00.825 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2740441 10.10.216.28:0/3807938763 after 2.409857
2018-09-20 08:54:01.212 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806396 10.10.216.61:0/3846051036 after 2.796684
2018-09-20 08:54:02.879 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806369 10.10.216.26:0/1429523594 after 4.463692
2018-09-20 08:54:03.601 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40609 10.10.216.25:0/4272525572 after 5.186328
2018-09-20 08:54:09.458 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40600 10.10.216.34:0/1288037016 after 11.043182
2018-09-20 08:54:11.990 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806399 10.10.216.75:0/2363935332 after 13.575376
2018-09-20 08:54:18.310 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2772295 10.10.216.2:0/258380382 after 19.895174
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40588 10.10.216.29:0/3966208469
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu007.hogpu.cc:cephfs-common (40588), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40597 10.10.216.36:0/3922554286
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu011.hogpu.cc: (40597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740435 10.10.216.3:0/1744283770
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu002.hogpu.cc:cephfs-common (2740435), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740444 10.10.216.30:0/344096429
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-002.hogpu.cc: (2740444), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740447 10.10.216.39:0/795303220
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-003.hogpu.cc:cephfs-common (2740447), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.242 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743528 10.10.216.9:0/1592138423
2018-09-20 08:54:44.242 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu008.hogpu.cc: (2743528), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.243 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743540 10.10.216.11:0/3857891308
2018-09-20 08:54:44.243 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu010.hogpu.cc: (2743540), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806315 10.10.216.23:0/209935866
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-004.hogpu.cc: (2806315), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806339 10.10.216.43:0/289815903
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-012.hogpu.cc:cephfs-common (2806339), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806372 10.10.216.53:0/317404650
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-022.hogpu.cc:cephfs-common (2806372), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806375 10.10.216.57:0/349737048
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-026.hogpu.cc:cephfs-common (2806375), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806378 10.10.216.54:0/864089971
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-023.hogpu.cc:cephfs-common (2806378), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806387 10.10.216.68:0/3434718465
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-031.hogpu.cc:cephfs-common (2806387), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806390 10.10.216.60:0/2888677659
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-029.hogpu.cc: (2806390), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806393 10.10.216.58:0/2243157754
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-027.hogpu.cc: (2806393), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806405 10.10.216.77:0/3435142704
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-033.hogpu.cc:cephfs-common (2806405), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2838481 10.10.224.14:0/1974838532
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu001.hogpu.cc: (2838481), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2872597 10.10.216.51:0/1414240502
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-020.hogpu.cc:cephfs-common (2872597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875816 10.10.224.16:0/4256317726
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu002.hogpu.cc: (2875816), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875852 10.10.224.21:0/766361199
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu003.hogpu.cc: (2875852), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875855 10.10.216.22:0/3341113140
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu004.hogpu.cc: (2875855), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875858 10.10.216.24:0/2069109781
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu005.hogpu.cc: (2875858), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875861 10.10.216.48:0/3466763147
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-017.hogpu.cc: (2875861), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875939 10.10.216.41:0/3785796933
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu013.hogpu.cc:cephfs-common (2875939), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect will complete once clients are evicted
2018-09-20 08:54:44.319 7faaa2fdb700  1 mds.0.2145 reconnect_done
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:reconnect --> up:rejoin
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_start
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_joint_start
2018-09-20 08:56:12.321 7faaa5fe1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:12.321 7faaa5fe1700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:56:16.009 7faaaaa73700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:16.321 7faaa5fe1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:16.321 7faaa5fe1700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:56:20.321 7faaa5fe1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:20.321 7faaa5fe1700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:56:21.010 7faaaaa73700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:21.506 7faaa9270700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-09-20 08:56:21.529 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:56:25.503 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:57:00.862 7faaa2fdb700  1 mds.0.2145 rejoin_done
2018-09-20 08:57:03.788 7faaa2fdb700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:57:03.788 7faaa2fdb700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:57:03.788 7faaa67e2700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-09-20 08:57:03.790 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:57:03.790 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.790 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2743528 10.10.216.9:0/1592138423 after 185.374346 (allowed interval 45)
2018-09-20 08:57:03.790 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.790 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2743540 10.10.216.11:0/3857891308 after 185.374432 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806387 10.10.216.68:0/3434718465 after 185.375176 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806405 10.10.216.77:0/3435142704 after 185.375294 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2872597 10.10.216.51:0/1414240502 after 185.375488 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875939 10.10.216.41:0/3785796933 after 185.375547 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806393 10.10.216.58:0/2243157754 after 185.375599 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2740444 10.10.216.30:0/344096429 after 185.375639 (allowed interval 45)
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.9:0/1592138423 conn(0x562e0b84d000 :6800 s=STATE_OPEN pgs=6055 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.40588 10.10.216.29:0/3966208469 after 185.375695 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875855 10.10.216.22:0/3341113140 after 185.375763 (allowed interval 45)
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.11:0/3857891308 conn(0x5630b943a000 :6800 s=STATE_OPEN pgs=1398 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806315 10.10.216.23:0/209935866 after 185.375828 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806375 10.10.216.57:0/349737048 after 185.375930 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.68:0/3434718465 conn(0x5630b943b200 :6800 s=STATE_OPEN pgs=106255 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875861 10.10.216.48:0/3466763147 after 185.375996 (allowed interval 45)
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.77:0/3435142704 conn(0x5630b943be00 :6800 s=STATE_OPEN pgs=59961 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2740447 10.10.216.39:0/795303220 after 185.376040 (allowed interval 45)
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.41:0/3785796933 conn(0x5630b943d600 :6800 s=STATE_OPEN pgs=24864 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b9429800 :6800 s=STATE_OPEN pgs=203999 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x5630b9428600 :6800 s=STATE_OPEN pgs=94178 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x5630b943c400 :6800 s=STATE_OPEN pgs=155769 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.51:0/1414240502 conn(0x5630b943d000 :6800 s=STATE_OPEN pgs=159091 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.40597 10.10.216.36:0/3922554286 after 185.376227 (allowed interval 45)
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b9428c00 :6800 s=STATE_OPEN pgs=178239 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2740435 10.10.216.3:0/1744283770 after 185.376264 (allowed interval 45)
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.22:0/3341113140 conn(0x5630b943ac00 :6800 s=STATE_OPEN pgs=16088 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.48:0/3466763147 conn(0x5630b9429e00 :6800 s=STATE_OPEN pgs=4586 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.39:0/795303220 conn(0x5630b942aa00 :6800 s=STATE_OPEN pgs=146785 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b941e000 :6800 s=STATE_OPEN pgs=78508 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875858 10.10.216.24:0/2069109781 after 185.376487 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806390 10.10.216.60:0/2888677659 after 185.376529 (allowed interval 45)
2018-09-20 08:57:03.793 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.793 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806339 10.10.216.43:0/289815903 after 185.376622 (allowed interval 45)
2018-09-20 08:57:03.793 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.793 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875816 10.10.224.16:0/4256317726 after 185.376699 (allowed interval 45)
2018-09-20 08:57:03.793 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.793 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806372 10.10.216.53:0/317404650 after 185.376844 (allowed interval 45)
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x562e0b84d600 :6800 s=STATE_OPEN pgs=93341 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x5630b943a600 :6800 s=STATE_OPEN pgs=29413 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.60:0/2888677659 conn(0x5630b9429200 :6800 s=STATE_OPEN pgs=45379 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b943ca00 :6800 s=STATE_OPEN pgs=40409 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.53:0/317404650 conn(0x5630b942a400 :6800 s=STATE_OPEN pgs=206946 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.43:0/289815903 conn(0x5630b9428000 :6800 s=STATE_OPEN pgs=107366 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.16:0/4256317726 conn(0x5630b942b000 :6800 s=STATE_OPEN pgs=47361 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.794 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.794 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2838481 10.10.224.14:0/1974838532 after 185.378390 (allowed interval 45)
2018-09-20 08:57:03.795 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.795 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875852 10.10.224.21:0/766361199 after 185.379375 (allowed interval 45)
2018-09-20 08:57:03.796 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.14:0/1974838532 conn(0x5630b942b600 :6800 s=STATE_OPEN pgs=128816 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.797 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.21:0/766361199 conn(0x5630b943b800 :6800 s=STATE_OPEN pgs=4311 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:06.639 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:06.639 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806378 10.10.216.54:0/864089971 after 188.222679 (allowed interval 45)
2018-09-20 08:57:06.639 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.54:0/864089971 conn(0x5630b941e600 :6800 s=STATE_OPEN pgs=186392 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:07.842 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.22:0/3341113140 conn(0x5630b941ec00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.842 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b941f800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.842 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.77:0/3435142704 conn(0x5630b9420a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.842 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b941fe00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.53:0/317404650 conn(0x5630b941f200 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x5630b9421000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.60:0/2888677659 conn(0x5630b93cc600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.41:0/3785796933 conn(0x5630b93cc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.68:0/3434718465 conn(0x5630b93cd800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.39:0/795303220 conn(0x5630b93cd200 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b9420400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x5630b93ce400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b93cde00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.844 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x5630b93ccc00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.845 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.21:0/766361199 conn(0x562e04770400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:rejoin --> up:active
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 recovery_done -- successful recovery!
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 active_start
2018-09-20 08:57:14.040 7faaa9270700  1 mds.0.2145 cluster recovered.
2018-09-20 08:57:14.040 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.16:0/4256317726 conn(0x5630b9429200 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:14.040 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x562e0476ec00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:30.984 7faaa67e2700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-09-20 08:57:30.986 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b943b800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:30.986 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x562e04770400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:30.999 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x5630b941f800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.007 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b93cf000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.008 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x5630b9420a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.008 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.016 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b941ec00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.017 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:57:31.019 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.019 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2806315 10.10.216.23:0/209935866 after 212.603348 (allowed interval 45)
2018-09-20 08:57:31.019 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.019 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2806375 10.10.216.57:0/349737048 after 212.603525 (allowed interval 45)
2018-09-20 08:57:31.020 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.020 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2875858 10.10.216.24:0/2069109781 after 212.603752 (allowed interval 45)
2018-09-20 08:57:31.020 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.020 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2740444 10.10.216.30:0/344096429 after 212.603870 (allowed interval 45)
2018-09-20 08:57:31.020 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x562f2e411200 :6800 s=STATE_OPEN pgs=155775 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.020 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.020 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.40597 10.10.216.36:0/3922554286 after 212.603974 (allowed interval 45)
2018-09-20 08:57:31.020 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x562f2e412400 :6800 s=STATE_OPEN pgs=178245 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.021 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x562f2e412a00 :6800 s=STATE_OPEN pgs=204005 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.021 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x562f2e411800 :6800 s=STATE_OPEN pgs=78514 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.021 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x562f2e410600 :6800 s=STATE_OPEN pgs=40415 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.022 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.022 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.40588 10.10.216.29:0/3966208469 after 212.605900 (allowed interval 45)
2018-09-20 08:57:31.022 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.022 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b943b800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.022 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b941f800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.023 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x562f2e410000 :6800 s=STATE_OPEN pgs=93347 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.023 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b9420a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.024 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.024 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2740435 10.10.216.3:0/1744283770 after 212.607835 (allowed interval 45)
2018-09-20 08:57:31.024 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x562e04770400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.024 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.024 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2806393 10.10.216.58:0/2243157754 after 212.608103 (allowed interval 45)
2018-09-20 08:57:31.024 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.025 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x562f2e411e00 :6800 s=STATE_OPEN pgs=29419 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.025 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x562f2e410c00 :6800 s=STATE_OPEN pgs=94184 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.025 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.026 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x562f2e410000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
Actions #2

Updated by Patrick Donnelly over 5 years ago

  • Project changed from Ceph to CephFS
  • Subject changed from [mimic 13.2.1] mds crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) to mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS))
  • Description updated (diff)
  • Category deleted (MonClient)
  • Assignee set to Zheng Yan
  • Priority changed from Normal to High
  • Target version changed from v13.2.2 to v14.0.0
  • Start date deleted (09/20/2018)
  • Backport set to mimic
  • Affected Versions v13.2.1 added
  • Affected Versions deleted (v0.21.2)
  • ceph-qa-suite deleted (fs)
  • Component(FS) MDS added
  • Labels (FS) crash added
Actions #3

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #4

Updated by Stefan Kooman over 4 years ago

Today our active MDS crashed with an assert:

2019-10-19 08:14:50.645 7f7906cb7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable:
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f7911b2897e]
2: (()+0x2fab07) [0x7f7911b28b07]
3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
4: (MDLog::trim(int)+0x5a6) [0x75dcd6]
5: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
6: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
7: (Context::complete(int)+0x9) [0x4d31d9]
8: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
9: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
10: (()+0x76ba) [0x7f79113a76ba]
11: (clone()+0x6d) [0x7f7910bd041d]

2019-10-19 08:14:50.649 7f7906cb7700 -1 *** Caught signal (Aborted) **
in thread 7f7906cb7700 thread_name:safe_timer

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x11390) [0x7f79113b1390]
2: (gsignal()+0x38) [0x7f7910afe428]
3: (abort()+0x16a) [0x7f7910b0002a]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f7911b28a86]
5: (()+0x2fab07) [0x7f7911b28b07]
6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
7: (MDLog::trim(int)+0x5a6) [0x75dcd6]
8: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
9: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
10: (Context::complete(int)+0x9) [0x4d31d9]
11: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
12: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
13: (()+0x76ba) [0x7f79113a76ba]
14: (clone()+0x6d) [0x7f7910bd041d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #5

Updated by Stefan Kooman over 4 years ago

A search for this assert gave this thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html

I'm not sure whether this condition is more likely to happen in Mimic 13.2.6?

Questions I have:

- Does this have anything to do with the number of files the MDS has cached (caps)?
- Is there a workaround to avoid this assert, and/or a way to monitor whether it is likely to be hit?

Actions #6

Updated by Stefan Kooman over 4 years ago

And yet another crash ~5 hours later. We had adjusted the mds_cache_memory_limit from 150G -> 32G after the last crash, but that does not seem to be the issue ...

2019-10-19 13:23:28.835 7fb0b0b86700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7fb0b0b86700 time 2019-10-19 13:23:28.840635
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)
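
For reference, a runtime adjustment like the mds_cache_memory_limit change mentioned above can be made through the ceph config database; this is just a sketch (the value is 32 GiB expressed in bytes, 'mds' targets all MDS daemons, and the exact mechanism depends on how your configuration is managed):

ceph config set mds mds_cache_memory_limit 34359738368   # 32 GiB in bytes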

Actions #7

Updated by Stefan Kooman over 4 years ago

Our MDSs kept failing over until we enabled debug output (debug_mds=10/10); the MDS then became active and stayed active. When debug was disabled a few minutes later it kept on working for a while ... before hitting the assert again.
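
For reference, one way to toggle this at runtime is via injectargs; a minimal sketch, assuming the MDS daemon name from the logs above (kscpu018.hogpu.cc) and the stock default of debug_mds=1/5:

ceph tell mds.kscpu018.hogpu.cc injectargs '--debug_mds=10/10'   # enable verbose MDS debugging
ceph tell mds.kscpu018.hogpu.cc injectargs '--debug_mds=1/5'     # back to the default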

Actions #8

Updated by Stefan Kooman over 4 years ago

Yan, Zheng suggested the following:

delete the 'mdsX_openfiles.0' object from the CephFS metadata pool (X is the rank of the crashed MDS)

^^ Do this after all MDS servers are down.
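
For reference, a minimal sketch of this workaround with the rados CLI, assuming rank 0 and a metadata pool named cephfs_metadata (both are placeholders; adjust to your cluster, and only run this once every MDS is stopped):

rados -p cephfs_metadata ls | grep openfiles       # confirm the object name first
rados -p cephfs_metadata rm mds0_openfiles.0       # rank 0 in this example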

This worked and immediately put the MDS in the active state ... let's see if it stays this way.

Actions #9

Updated by Stefan Kooman over 4 years ago

This patch by Yan, Zheng adds some extra debug statements:

diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
index c0f72d581d..2ca737470d 100644
--- a/src/mds/OpenFileTable.cc
+++ b/src/mds/OpenFileTable.cc
@@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c, uint64_t log_seq, int op_p
     }
     if (omap_idx < 0) {
       ++omap_num_objs;
-      assert(omap_num_objs <= MAX_OBJECTS);
+      if (omap_num_objs > MAX_OBJECTS) {
+        dout(1) << "omap_num_objs " << omap_num_objs << dendl;
+        dout(1) << "anchor_map size " << anchor_map.size() << dendl;
+        assert(omap_num_objs <= MAX_OBJECTS);
+      }
       omap_num_items.resize(omap_num_objs);
       omap_updates.resize(omap_num_objs);
       omap_updates.back().clear = true;

Actions #10

Updated by Zheng Yan over 4 years ago

  • Status changed from New to Fix Under Review
  • Backport changed from mimic to nautilus,mimic
  • Pull request ID set to 32020
Actions #11

Updated by Stefan Kooman over 4 years ago

For the sake of completeness, here is the crash logging with the extra debug output:

> -10001> 2019-12-04 20:28:34.929 7fd43ce9b700  5 -- [2001:7b8:80:3:0:2c:3:2]:6800/3833566625 >> [2001:7b8:80:1:0:1:2:10]:6803/727090 conn(0x55e93ca96300 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=141866652 cs=1 l=1). rx osd.90 seq 32255 0x55e9416cb0c0 osd_op_reply(4104640 10001afe266.00000000 [stat,omap-set-header,omap-set-vals] v63840'10049940 uv10049940 ondisk = 0) v8
> -10001> 2019-12-04 20:28:34.929 7fd43ce9b700  1 -- [2001:7b8:80:3:0:2c:3:2]:6800/3833566625 <== osd.90 [2001:7b8:80:1:0:1:2:10]:6803/727090 32255 ==== osd_op_reply(4104640 10001afe266.00000000 [stat,omap-set-header,omap-set-vals] v63840'10049940 uv10049940 ondisk = 0) v8 ==== 248+0+0 (969216453 0 0) 0x55e9416cb0c0 con 0x55e93ca96300
> -10001> 2019-12-04 20:28:34.937 7fd436ca7700  0 mds.0.openfiles omap_num_objs 1025
> -10001> 2019-12-04 20:28:34.937 7fd436ca7700  0 mds.0.openfiles anchor_map size 19678
> -10001> 2019-12-04 20:28:34.937 7fd436ca7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7fd436ca7700 time 2019-12-04 20:28:34.939048
> /build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs <= MAX_OBJECTS)
Actions #12

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43347: mimic: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) added
Actions #14

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43348: nautilus: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) added
Actions #15

Updated by Patrick Donnelly about 4 years ago

  • Has duplicate Bug #43800: FAILED ceph_assert(omap_num_objs <= MAX_OBJECTS) - primary and standby MDS failure added
Actions #16

Updated by Nathan Cutler about 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
