Bug #36094


mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS))

Added by Frank Yu over 5 years ago. Updated about 4 years ago.

Status: Resolved
Priority: High
Assignee: Zheng Yan
Category: -
Target version: v15.0.0
% Done: 0%
Source: Community (user)
Tags:
Backport: nautilus,mimic
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions: v13.2.1
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID: 32020
Crash signature (v1):
Crash signature (v2):

Description

Hi, I have hit the MDS error "FAILED assert(omap_num_objs <= MAX_OBJECTS)" in my production environment.
Ceph version: Mimic 13.2.1
Clients: CentOS 7.5

mds log:

2018-09-20 08:53:17.610 7f8571755700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7f8571755700 time 2018-09-20 08:53:17.609555
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/src/mds/OpenFileTable.cc: 467: FAILED assert(omap_num_objs <= MAX_OBJECTS)

 ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7f857d828e1f]
 2: (()+0x284fe7) [0x7f857d828fe7]
 3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b13) [0x55e77b081533]
 4: (MDLog::trim(int)+0x5a6) [0x55e77b06ede6]
 5: (MDSRankDispatcher::tick()+0x23b) [0x55e77ae0cfcb]
 6: (FunctionContext::finish(int)+0x2c) [0x55e77adfd70c]
 7: (Context::complete(int)+0x9) [0x55e77adfb189]
 8: (SafeTimer::timer_thread()+0x18b) [0x7f857d82573b]
 9: (SafeTimerThread::entry()+0xd) [0x7f857d826cfd]
 10: (()+0x7e25) [0x7f857b453e25]
 11: (clone()+0x6d) [0x7f857a533bad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 <== client.2772295 10.10.216.2:0/258380382 1423254100 ==== client_request(client.2772295:14342717695 rmxattr #0x100008666eb security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2 ==== 178+0+0 (3162234629 0 0) 0x55f1c9fc5b00 con 0x55e7931db800
 -9999> 2018-09-20 08:53:17.238 7f85741e3700  4 mds.0.server handle_client_request client_request(client.2772295:14342717695 rmxattr #0x100008666eb security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2
 -9998> 2018-09-20 08:53:17.238 7f856cf4c700  5 mds.0.log _submit_thread 2610936708022~994 : EOpen [metablob 0x10000866586, 1 dirs], 1 open files
 -9997> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 --> 10.10.216.2:0/258380382 -- client_reply(???:14342717695 = -61 (61) No data available safe) v1 -- 0x55ef49ea4b00 con 0
 -9996> 2018-09-20 08:53:17.238 7f85761e7700  5 -- 10.10.216.20:6800/4206406544 >> 10.10.216.2:0/258380382 conn(0x55e7931db800 :6800 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=14770 cs=1 l=0). rx client.2772295 seq 1423254101 0x55f1967b4900 client_request(client.2772295:14342717696 rmxattr #0x100008666ec security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2
 -9995> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 <== client.2772295 10.10.216.2:0/258380382 1423254101 ==== client_request(client.2772295:14342717696 rmxattr #0x100008666ec security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2 ==== 178+0+0 (2702756268 0 0) 0x55f1967b4900 con 0x55e7931db800
 -9994> 2018-09-20 08:53:17.238 7f85741e3700  4 mds.0.server handle_client_request client_request(client.2772295:14342717696 rmxattr #0x100008666ec security.ima 2018-09-20 08:53:17.238752 caller_uid=0, caller_gid=0{}) v2
 -9993> 2018-09-20 08:53:17.238 7f856cf4c700  5 mds.0.log _submit_thread 2610936709036~993 : EOpen [metablob 0x10000866586, 1 dirs], 1 open files
 -9992> 2018-09-20 08:53:17.238 7f85741e3700  1 -- 10.10.216.20:6800/4206406544 --> 10.10.216.2:0/258380382 -- client_reply(???:14342717696 = -61 (61) No data available safe) v1 -- 0x55ea6c5a6840 con 0

Related issues 3 (0 open, 3 closed)

Has duplicate CephFS - Bug #43800: FAILED ceph_assert(omap_num_objs <= MAX_OBJECTS) - primary and standby MDS failure (Duplicate)

Copied to CephFS - Backport #43347: mimic: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) (Resolved, Patrick Donnelly)
Copied to CephFS - Backport #43348: nautilus: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) (Resolved, Patrick Donnelly)
Actions #1

Updated by Frank Yu over 5 years ago

When the active MDS crashed, the other (standby) MDS failed as well:

2018-09-20 08:54:00.031 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806345 10.10.216.35:0/2329806341 after 1.614616
2018-09-20 08:54:00.075 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806327 10.10.216.31:0/1373675291 after 1.659129
2018-09-20 08:54:00.173 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806330 10.10.216.42:0/1428472672 after 1.757113
2018-09-20 08:54:00.787 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872594 10.10.216.46:0/240766887 after 2.371948
2018-09-20 08:54:00.821 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806357 10.10.216.52:0/1803840812 after 2.405711
2018-09-20 08:54:00.825 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2740441 10.10.216.28:0/3807938763 after 2.409857
2018-09-20 08:54:01.212 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806396 10.10.216.61:0/3846051036 after 2.796684
2018-09-20 08:54:02.879 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806369 10.10.216.26:0/1429523594 after 4.463692
2018-09-20 08:54:03.601 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40609 10.10.216.25:0/4272525572 after 5.186328
2018-09-20 08:54:09.458 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40600 10.10.216.34:0/1288037016 after 11.043182
2018-09-20 08:54:11.990 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806399 10.10.216.75:0/2363935332 after 13.575376
2018-09-20 08:54:18.310 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2772295 10.10.216.2:0/258380382 after 19.895174
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40588 10.10.216.29:0/3966208469
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu007.hogpu.cc:cephfs-common (40588), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40597 10.10.216.36:0/3922554286
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu011.hogpu.cc: (40597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740435 10.10.216.3:0/1744283770
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu002.hogpu.cc:cephfs-common (2740435), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740444 10.10.216.30:0/344096429
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-002.hogpu.cc: (2740444), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740447 10.10.216.39:0/795303220
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-003.hogpu.cc:cephfs-common (2740447), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.242 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743528 10.10.216.9:0/1592138423
2018-09-20 08:54:44.242 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu008.hogpu.cc: (2743528), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.243 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743540 10.10.216.11:0/3857891308
2018-09-20 08:54:44.243 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu010.hogpu.cc: (2743540), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806315 10.10.216.23:0/209935866
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-004.hogpu.cc: (2806315), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806339 10.10.216.43:0/289815903
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-012.hogpu.cc:cephfs-common (2806339), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806372 10.10.216.53:0/317404650
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-022.hogpu.cc:cephfs-common (2806372), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806375 10.10.216.57:0/349737048
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-026.hogpu.cc:cephfs-common (2806375), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806378 10.10.216.54:0/864089971
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-023.hogpu.cc:cephfs-common (2806378), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806387 10.10.216.68:0/3434718465
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-031.hogpu.cc:cephfs-common (2806387), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806390 10.10.216.60:0/2888677659
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-029.hogpu.cc: (2806390), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806393 10.10.216.58:0/2243157754
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-027.hogpu.cc: (2806393), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806405 10.10.216.77:0/3435142704
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-033.hogpu.cc:cephfs-common (2806405), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2838481 10.10.224.14:0/1974838532
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu001.hogpu.cc: (2838481), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2872597 10.10.216.51:0/1414240502
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-020.hogpu.cc:cephfs-common (2872597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875816 10.10.224.16:0/4256317726
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu002.hogpu.cc: (2875816), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875852 10.10.224.21:0/766361199
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu003.hogpu.cc: (2875852), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875855 10.10.216.22:0/3341113140
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu004.hogpu.cc: (2875855), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875858 10.10.216.24:0/2069109781
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu005.hogpu.cc: (2875858), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875861 10.10.216.48:0/3466763147
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-017.hogpu.cc: (2875861), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875939 10.10.216.41:0/3785796933
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu013.hogpu.cc:cephfs-common (2875939), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect will complete once clients are evicted
2018-09-20 08:54:44.319 7faaa2fdb700  1 mds.0.2145 reconnect_done
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:reconnect --> up:rejoin
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_start
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_joint_start
[root@kscpu018 ceph]# egrep "2018-09-20 08:5" ceph-mds.kscpu018.hogpu.cc.log
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:boot --> up:replay
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145 replay_start
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145  recovery set is
2018-09-20 08:53:56.366 7faaa9270700  1 mds.0.2145  waiting for osdmap 9834 (which blacklists prior instance)
2018-09-20 08:53:56.571 7faaa27da700  0 mds.0.cache creating system inode with ino:0x100
2018-09-20 08:53:56.571 7faaa27da700  0 mds.0.cache creating system inode with ino:0x1
2018-09-20 08:53:58.276 7faaa17d8700  1 mds.0.2145 replay_done
2018-09-20 08:53:58.276 7faaa17d8700  1 mds.0.2145 making mds journal writeable
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:replay --> up:reconnect
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 reconnect_start
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.2145 reopen_log
2018-09-20 08:53:58.416 7faaa9270700  1 mds.0.server reconnect_clients -- 53 sessions
2018-09-20 08:53:58.417 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2743525 10.10.216.8:0/3116427638 after 0.000689
2018-09-20 08:53:58.417 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806318 10.10.216.27:0/615703051 after 0.001044
2018-09-20 08:53:58.417 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2743534 10.10.216.10:0/3573197224 after 0.001102
2018-09-20 08:53:58.418 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2740438 10.10.216.4:0/1318746411 after 0.002246
2018-09-20 08:53:58.420 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806342 10.10.216.45:0/2107016146 after 0.004562
2018-09-20 08:53:58.431 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872600 10.10.216.38:0/480544929 after 0.015399
2018-09-20 08:53:58.436 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806354 10.10.216.47:0/3846352961 after 0.020009
2018-09-20 08:53:58.500 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806324 10.10.216.50:0/1246636586 after 0.083826
2018-09-20 08:53:58.514 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806351 10.10.216.44:0/2088277603 after 0.098046
2018-09-20 08:53:58.515 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806384 10.10.216.59:0/435428671 after 0.098886
2018-09-20 08:53:58.776 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872618 10.10.216.55:0/3083184790 after 0.359853
2018-09-20 08:53:59.190 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806363 10.10.216.37:0/4002017154 after 0.774424
2018-09-20 08:53:59.469 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40579 10.10.216.33:0/3554613509 after 1.052783
2018-09-20 08:53:59.475 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806408 10.10.216.40:0/3697701789 after 1.059244
2018-09-20 08:53:59.476 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40603 10.10.216.32:0/3007002808 after 1.059630
2018-09-20 08:53:59.555 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806381 10.10.216.56:0/3497284560 after 1.139413
2018-09-20 08:53:59.953 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806348 10.10.216.49:0/4132059874 after 1.536633
2018-09-20 08:54:00.031 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806345 10.10.216.35:0/2329806341 after 1.614616
2018-09-20 08:54:00.075 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806327 10.10.216.31:0/1373675291 after 1.659129
2018-09-20 08:54:00.173 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806330 10.10.216.42:0/1428472672 after 1.757113
2018-09-20 08:54:00.787 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2872594 10.10.216.46:0/240766887 after 2.371948
2018-09-20 08:54:00.821 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806357 10.10.216.52:0/1803840812 after 2.405711
2018-09-20 08:54:00.825 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2740441 10.10.216.28:0/3807938763 after 2.409857
2018-09-20 08:54:01.212 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806396 10.10.216.61:0/3846051036 after 2.796684
2018-09-20 08:54:02.879 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806369 10.10.216.26:0/1429523594 after 4.463692
2018-09-20 08:54:03.601 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40609 10.10.216.25:0/4272525572 after 5.186328
2018-09-20 08:54:09.458 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.40600 10.10.216.34:0/1288037016 after 11.043182
2018-09-20 08:54:11.990 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2806399 10.10.216.75:0/2363935332 after 13.575376
2018-09-20 08:54:18.310 7faaa9270700  0 log_channel(cluster) log [DBG] : reconnect by client.2772295 10.10.216.2:0/258380382 after 19.895174
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40588 10.10.216.29:0/3966208469
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu007.hogpu.cc:cephfs-common (40588), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.40597 10.10.216.36:0/3922554286
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu011.hogpu.cc: (40597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740435 10.10.216.3:0/1744283770
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu002.hogpu.cc:cephfs-common (2740435), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740444 10.10.216.30:0/344096429
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-002.hogpu.cc: (2740444), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.241 7faaa67e2700  1 mds.0.server reconnect gave up on client.2740447 10.10.216.39:0/795303220
2018-09-20 08:54:44.241 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-003.hogpu.cc:cephfs-common (2740447), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.242 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743528 10.10.216.9:0/1592138423
2018-09-20 08:54:44.242 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu008.hogpu.cc: (2743528), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.243 7faaa67e2700  1 mds.0.server reconnect gave up on client.2743540 10.10.216.11:0/3857891308
2018-09-20 08:54:44.243 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client kscpu010.hogpu.cc: (2743540), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806315 10.10.216.23:0/209935866
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-004.hogpu.cc: (2806315), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806339 10.10.216.43:0/289815903
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-012.hogpu.cc:cephfs-common (2806339), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806372 10.10.216.53:0/317404650
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-022.hogpu.cc:cephfs-common (2806372), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806375 10.10.216.57:0/349737048
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-026.hogpu.cc:cephfs-common (2806375), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806378 10.10.216.54:0/864089971
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-023.hogpu.cc:cephfs-common (2806378), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806387 10.10.216.68:0/3434718465
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-031.hogpu.cc:cephfs-common (2806387), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806390 10.10.216.60:0/2888677659
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-029.hogpu.cc: (2806390), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806393 10.10.216.58:0/2243157754
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-027.hogpu.cc: (2806393), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2806405 10.10.216.77:0/3435142704
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-033.hogpu.cc:cephfs-common (2806405), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2838481 10.10.224.14:0/1974838532
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu001.hogpu.cc: (2838481), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2872597 10.10.216.51:0/1414240502
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-020.hogpu.cc:cephfs-common (2872597), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875816 10.10.224.16:0/4256317726
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu002.hogpu.cc: (2875816), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875852 10.10.224.21:0/766361199
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu003.hogpu.cc: (2875852), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.244 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875855 10.10.216.22:0/3341113140
2018-09-20 08:54:44.244 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu004.hogpu.cc: (2875855), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875858 10.10.216.24:0/2069109781
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu005.hogpu.cc: (2875858), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875861 10.10.216.48:0/3466763147
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu-v-017.hogpu.cc: (2875861), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect gave up on client.2875939 10.10.216.41:0/3785796933
2018-09-20 08:54:44.245 7faaa67e2700  0 log_channel(cluster) log [WRN] : evicting unresponsive client ksgpu013.hogpu.cc:cephfs-common (2875939), after waiting 45 seconds during MDS startup
2018-09-20 08:54:44.245 7faaa67e2700  1 mds.0.server reconnect will complete once clients are evicted
2018-09-20 08:54:44.319 7faaa2fdb700  1 mds.0.2145 reconnect_done
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:reconnect --> up:rejoin
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_start
2018-09-20 08:54:45.398 7faaa9270700  1 mds.0.2145 rejoin_joint_start
2018-09-20 08:56:12.321 7faaa5fe1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:12.321 7faaa5fe1700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:56:16.009 7faaaaa73700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:16.321 7faaa5fe1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:16.321 7faaa5fe1700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:56:20.321 7faaa5fe1700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:20.321 7faaa5fe1700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:56:21.010 7faaaaa73700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:56:21.506 7faaa9270700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-09-20 08:56:21.529 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:56:25.503 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:57:00.862 7faaa2fdb700  1 mds.0.2145 rejoin_done
2018-09-20 08:57:03.788 7faaa2fdb700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2018-09-20 08:57:03.788 7faaa2fdb700  1 mds.beacon.kscpu018.hogpu.cc _send skipping beacon, heartbeat map not healthy
2018-09-20 08:57:03.788 7faaa67e2700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-09-20 08:57:03.790 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:57:03.790 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.790 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2743528 10.10.216.9:0/1592138423 after 185.374346 (allowed interval 45)
2018-09-20 08:57:03.790 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.790 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2743540 10.10.216.11:0/3857891308 after 185.374432 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806387 10.10.216.68:0/3434718465 after 185.375176 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806405 10.10.216.77:0/3435142704 after 185.375294 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2872597 10.10.216.51:0/1414240502 after 185.375488 (allowed interval 45)
2018-09-20 08:57:03.791 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.791 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875939 10.10.216.41:0/3785796933 after 185.375547 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806393 10.10.216.58:0/2243157754 after 185.375599 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2740444 10.10.216.30:0/344096429 after 185.375639 (allowed interval 45)
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.9:0/1592138423 conn(0x562e0b84d000 :6800 s=STATE_OPEN pgs=6055 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.40588 10.10.216.29:0/3966208469 after 185.375695 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875855 10.10.216.22:0/3341113140 after 185.375763 (allowed interval 45)
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.11:0/3857891308 conn(0x5630b943a000 :6800 s=STATE_OPEN pgs=1398 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806315 10.10.216.23:0/209935866 after 185.375828 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806375 10.10.216.57:0/349737048 after 185.375930 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.68:0/3434718465 conn(0x5630b943b200 :6800 s=STATE_OPEN pgs=106255 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875861 10.10.216.48:0/3466763147 after 185.375996 (allowed interval 45)
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.77:0/3435142704 conn(0x5630b943be00 :6800 s=STATE_OPEN pgs=59961 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2740447 10.10.216.39:0/795303220 after 185.376040 (allowed interval 45)
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.41:0/3785796933 conn(0x5630b943d600 :6800 s=STATE_OPEN pgs=24864 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b9429800 :6800 s=STATE_OPEN pgs=203999 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x5630b9428600 :6800 s=STATE_OPEN pgs=94178 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x5630b943c400 :6800 s=STATE_OPEN pgs=155769 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.51:0/1414240502 conn(0x5630b943d000 :6800 s=STATE_OPEN pgs=159091 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.40597 10.10.216.36:0/3922554286 after 185.376227 (allowed interval 45)
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b9428c00 :6800 s=STATE_OPEN pgs=178239 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2740435 10.10.216.3:0/1744283770 after 185.376264 (allowed interval 45)
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.22:0/3341113140 conn(0x5630b943ac00 :6800 s=STATE_OPEN pgs=16088 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.48:0/3466763147 conn(0x5630b9429e00 :6800 s=STATE_OPEN pgs=4586 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.39:0/795303220 conn(0x5630b942aa00 :6800 s=STATE_OPEN pgs=146785 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b941e000 :6800 s=STATE_OPEN pgs=78508 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875858 10.10.216.24:0/2069109781 after 185.376487 (allowed interval 45)
2018-09-20 08:57:03.792 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.792 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806390 10.10.216.60:0/2888677659 after 185.376529 (allowed interval 45)
2018-09-20 08:57:03.793 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.793 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806339 10.10.216.43:0/289815903 after 185.376622 (allowed interval 45)
2018-09-20 08:57:03.793 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.793 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875816 10.10.224.16:0/4256317726 after 185.376699 (allowed interval 45)
2018-09-20 08:57:03.793 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.793 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806372 10.10.216.53:0/317404650 after 185.376844 (allowed interval 45)
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x562e0b84d600 :6800 s=STATE_OPEN pgs=93341 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x5630b943a600 :6800 s=STATE_OPEN pgs=29413 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.60:0/2888677659 conn(0x5630b9429200 :6800 s=STATE_OPEN pgs=45379 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b943ca00 :6800 s=STATE_OPEN pgs=40409 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.53:0/317404650 conn(0x5630b942a400 :6800 s=STATE_OPEN pgs=206946 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.43:0/289815903 conn(0x5630b9428000 :6800 s=STATE_OPEN pgs=107366 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.793 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.16:0/4256317726 conn(0x5630b942b000 :6800 s=STATE_OPEN pgs=47361 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.794 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.794 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2838481 10.10.224.14:0/1974838532 after 185.378390 (allowed interval 45)
2018-09-20 08:57:03.795 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:03.795 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2875852 10.10.224.21:0/766361199 after 185.379375 (allowed interval 45)
2018-09-20 08:57:03.796 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.14:0/1974838532 conn(0x5630b942b600 :6800 s=STATE_OPEN pgs=128816 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:03.797 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.21:0/766361199 conn(0x5630b943b800 :6800 s=STATE_OPEN pgs=4311 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:06.639 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:06.639 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:rejoin) from client.2806378 10.10.216.54:0/864089971 after 188.222679 (allowed interval 45)
2018-09-20 08:57:06.639 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.54:0/864089971 conn(0x5630b941e600 :6800 s=STATE_OPEN pgs=186392 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:07.842 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.22:0/3341113140 conn(0x5630b941ec00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.842 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b941f800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.842 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.77:0/3435142704 conn(0x5630b9420a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.842 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b941fe00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.53:0/317404650 conn(0x5630b941f200 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x5630b9421000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.60:0/2888677659 conn(0x5630b93cc600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.41:0/3785796933 conn(0x5630b93cc000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.68:0/3434718465 conn(0x5630b93cd800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.39:0/795303220 conn(0x5630b93cd200 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b9420400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x5630b93ce400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.843 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b93cde00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.844 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x5630b93ccc00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.845 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.21:0/766361199 conn(0x562e04770400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 handle_mds_map i am now mds.0.2145
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 handle_mds_map state change up:rejoin --> up:active
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 recovery_done -- successful recovery!
2018-09-20 08:57:07.846 7faaa9270700  1 mds.0.2145 active_start
2018-09-20 08:57:14.040 7faaa9270700  1 mds.0.2145 cluster recovered.
2018-09-20 08:57:14.040 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.224.16:0/4256317726 conn(0x5630b9429200 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:14.040 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x562e0476ec00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:30.984 7faaa67e2700  1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
2018-09-20 08:57:30.986 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b943b800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:30.986 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x562e04770400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:30.999 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x5630b941f800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.007 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b93cf000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.008 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x5630b9420a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.008 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.016 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b941ec00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept we reset (peer sent cseq 1), sending RESETSESSION
2018-09-20 08:57:31.017 7faaa9270700  0 mds.beacon.kscpu018.hogpu.cc handle_mds_beacon no longer laggy
2018-09-20 08:57:31.019 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.019 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2806315 10.10.216.23:0/209935866 after 212.603348 (allowed interval 45)
2018-09-20 08:57:31.019 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.019 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2806375 10.10.216.57:0/349737048 after 212.603525 (allowed interval 45)
2018-09-20 08:57:31.020 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.020 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2875858 10.10.216.24:0/2069109781 after 212.603752 (allowed interval 45)
2018-09-20 08:57:31.020 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.020 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2740444 10.10.216.30:0/344096429 after 212.603870 (allowed interval 45)
2018-09-20 08:57:31.020 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x562f2e411200 :6800 s=STATE_OPEN pgs=155775 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.020 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.020 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.40597 10.10.216.36:0/3922554286 after 212.603974 (allowed interval 45)
2018-09-20 08:57:31.020 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x562f2e412400 :6800 s=STATE_OPEN pgs=178245 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.021 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x562f2e412a00 :6800 s=STATE_OPEN pgs=204005 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.021 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x562f2e411800 :6800 s=STATE_OPEN pgs=78514 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.021 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x562f2e410600 :6800 s=STATE_OPEN pgs=40415 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.022 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.022 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.40588 10.10.216.29:0/3966208469 after 212.605900 (allowed interval 45)
2018-09-20 08:57:31.022 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.36:0/3922554286 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.022 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.57:0/349737048 conn(0x5630b943b800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.022 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.24:0/2069109781 conn(0x5630b941f800 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.023 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x562f2e410000 :6800 s=STATE_OPEN pgs=93347 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.023 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.30:0/344096429 conn(0x5630b9420a00 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.024 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.024 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2740435 10.10.216.3:0/1744283770 after 212.607835 (allowed interval 45)
2018-09-20 08:57:31.024 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.23:0/209935866 conn(0x562e04770400 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.024 7faaa9270700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
2018-09-20 08:57:31.024 7faaa9270700  0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:active) from client.2806393 10.10.216.58:0/2243157754 after 212.608103 (allowed interval 45)
2018-09-20 08:57:31.024 7faaab274700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.29:0/3966208469 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.025 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x562f2e411e00 :6800 s=STATE_OPEN pgs=29419 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.025 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x562f2e410c00 :6800 s=STATE_OPEN pgs=94184 cs=1 l=0).fault server, going to standby
2018-09-20 08:57:31.025 7faaaba75700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.58:0/2243157754 conn(0x5630b9421600 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
2018-09-20 08:57:31.026 7faaac276700  0 -- 10.10.216.19:6800/4078227709 >> 10.10.216.3:0/1744283770 conn(0x562f2e410000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept peer reset, then tried to connect to us, replacing
Actions #2

Updated by Patrick Donnelly over 5 years ago

  • Project changed from Ceph to CephFS
  • Subject changed from [mimic 13.2.1] mds crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) to mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS))
  • Description updated (diff)
  • Category deleted (MonClient)
  • Assignee set to Zheng Yan
  • Priority changed from Normal to High
  • Target version changed from v13.2.2 to v14.0.0
  • Start date deleted (09/20/2018)
  • Backport set to mimic
  • Affected Versions v13.2.1 added
  • Affected Versions deleted (v0.21.2)
  • ceph-qa-suite deleted (fs)
  • Component(FS) MDS added
  • Labels (FS) crash added
Actions #3

Updated by Patrick Donnelly about 5 years ago

  • Target version changed from v14.0.0 to v15.0.0
Actions #4

Updated by Stefan Kooman over 4 years ago

Today our active MDS crashed with an assert:

2019-10-19 08:14:50.645 7f7906cb7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable:
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f7911b2897e]
2: (()+0x2fab07) [0x7f7911b28b07]
3: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
4: (MDLog::trim(int)+0x5a6) [0x75dcd6]
5: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
6: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
7: (Context::complete(int)+0x9) [0x4d31d9]
8: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
9: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
10: (()+0x76ba) [0x7f79113a76ba]
11: (clone()+0x6d) [0x7f7910bd041d]

2019-10-19 08:14:50.649 7f7906cb7700 -1 *** Caught signal (Aborted) **
in thread 7f7906cb7700 thread_name:safe_timer

ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
1: (()+0x11390) [0x7f79113b1390]
2: (gsignal()+0x38) [0x7f7910afe428]
3: (abort()+0x16a) [0x7f7910b0002a]
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x256) [0x7f7911b28a86]
5: (()+0x2fab07) [0x7f7911b28b07]
6: (OpenFileTable::commit(MDSInternalContextBase*, unsigned long, int)+0x1b27) [0x7703f7]
7: (MDLog::trim(int)+0x5a6) [0x75dcd6]
8: (MDSRankDispatcher::tick()+0x24b) [0x4f013b]
9: (FunctionContext::finish(int)+0x2c) [0x4d52dc]
10: (Context::complete(int)+0x9) [0x4d31d9]
11: (SafeTimer::timer_thread()+0x18b) [0x7f7911b2520b]
12: (SafeTimerThread::entry()+0xd) [0x7f7911b2686d]
13: (()+0x76ba) [0x7f79113a76ba]
14: (clone()+0x6d) [0x7f7910bd041d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #5

Updated by Stefan Kooman over 4 years ago

A search for this assert gave this thread: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036702.html

I'm not sure whether this condition is more likely to happen in Mimic 13.2.6?

Questions I have:

- Does this have anything to do with the number of files the MDS has cached (caps)?
- Is there a workaround to avoid this assert, and/or a way to monitor whether it is likely to be hit?

Actions #6

Updated by Stefan Kooman over 4 years ago

And yet another crash ~5 hours later. We had adjusted the mds_cache_memory_limit from 150G -> 32G after the last crash, but that does not seem to be the issue ...

2019-10-19 13:23:28.835 7fb0b0b86700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7fb0b0b86700 time 2019-10-19 13:23:28.840635
/build/ceph-13.2.6/src/mds/OpenFileTable.cc: 473: FAILED assert(omap_num_objs <= MAX_OBJECTS)
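
For reference, a runtime adjustment like the mds_cache_memory_limit change mentioned above can be made through the ceph config database; this is just a sketch (the value is 32 GiB expressed in bytes, 'mds' targets all MDS daemons, and the exact mechanism depends on how your configuration is managed):

ceph config set mds mds_cache_memory_limit 34359738368   # 32 GiB in bytes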

Actions #7

Updated by Stefan Kooman over 4 years ago

Our MDSs kept failing over until we enabled debug output (debug_mds=10/10); the MDS then became active and stayed active. When debug was disabled a few minutes later it kept on working for a while ... before hitting the assert again.
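
For reference, one way to toggle this at runtime is via injectargs; a minimal sketch, assuming the MDS daemon name from the logs above (kscpu018.hogpu.cc) and the stock default of debug_mds=1/5:

ceph tell mds.kscpu018.hogpu.cc injectargs '--debug_mds=10/10'   # enable verbose MDS debugging
ceph tell mds.kscpu018.hogpu.cc injectargs '--debug_mds=1/5'     # back to the default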

Actions #8

Updated by Stefan Kooman over 4 years ago

Yan, Zheng suggested the following:

delete the 'mdsX_openfiles.0' object from the CephFS metadata pool (X is the rank of the crashed MDS)

^^ Do this after all MDS servers are down.
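
For reference, a minimal sketch of this workaround with the rados CLI, assuming rank 0 and a metadata pool named cephfs_metadata (both are placeholders; adjust to your cluster, and only run this once every MDS is stopped):

rados -p cephfs_metadata ls | grep openfiles       # confirm the object name first
rados -p cephfs_metadata rm mds0_openfiles.0       # rank 0 in this example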

This worked and immediately put the MDS in the active state ... let's see if it stays this way.

Actions #9

Updated by Stefan Kooman over 4 years ago

This patch by Yan, Zheng adds some extra debug statements:

diff --git a/src/mds/OpenFileTable.cc b/src/mds/OpenFileTable.cc
index c0f72d581d..2ca737470d 100644
--- a/src/mds/OpenFileTable.cc
+++ b/src/mds/OpenFileTable.cc
@@ -470,7 +470,11 @@ void OpenFileTable::commit(MDSInternalContextBase *c, uint64_t log_seq, int op_p
     }
     if (omap_idx < 0) {
       ++omap_num_objs;
-      assert(omap_num_objs <= MAX_OBJECTS);
+      if (omap_num_objs > MAX_OBJECTS) {
+        dout(1) << "omap_num_objs " << omap_num_objs << dendl;
+        dout(1) << "anchor_map size " << anchor_map.size() << dendl;
+        assert(omap_num_objs <= MAX_OBJECTS);
+      }
       omap_num_items.resize(omap_num_objs);
       omap_updates.resize(omap_num_objs);
       omap_updates.back().clear = true;

Actions #10

Updated by Zheng Yan over 4 years ago

  • Status changed from New to Fix Under Review
  • Backport changed from mimic to nautilus,mimic
  • Pull request ID set to 32020
Actions #11

Updated by Stefan Kooman over 4 years ago

For the sake of completeness, here is the crash logging with the extra debug output:

> -10001> 2019-12-04 20:28:34.929 7fd43ce9b700  5 -- [2001:7b8:80:3:0:2c:3:2]:6800/3833566625 >> [2001:7b8:80:1:0:1:2:10]:6803/727090 conn(0x55e93ca96300 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=141866652 cs=1 l=1). rx osd.90 seq 32255 0x55e9416cb0c0 osd_op_reply(4104640 10001afe266.00000000 [stat,omap-set-header,omap-set-vals] v63840'10049940 uv10049940 ondisk = 0) v8
> -10001> 2019-12-04 20:28:34.929 7fd43ce9b700  1 -- [2001:7b8:80:3:0:2c:3:2]:6800/3833566625 <== osd.90 [2001:7b8:80:1:0:1:2:10]:6803/727090 32255 ==== osd_op_reply(4104640 10001afe266.00000000 [stat,omap-set-header,omap-set-vals] v63840'10049940 uv10049940 ondisk = 0) v8 ==== 248+0+0 (969216453 0 0) 0x55e9416cb0c0 con 0x55e93ca96300
> -10001> 2019-12-04 20:28:34.937 7fd436ca7700  0 mds.0.openfiles omap_num_objs 1025
> -10001> 2019-12-04 20:28:34.937 7fd436ca7700  0 mds.0.openfiles anchor_map size 19678
> -10001> 2019-12-04 20:28:34.937 7fd436ca7700 -1 /build/ceph-13.2.6/src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSInternalContextBase*, uint64_t, int)' thread 7fd436ca7700 time 2019-12-04 20:28:34.939048
> /build/ceph-13.2.6/src/mds/OpenFileTable.cc: 476: FAILED assert(omap_num_objs <= MAX_OBJECTS)
Actions #12

Updated by Patrick Donnelly over 4 years ago

  • Status changed from Fix Under Review to Pending Backport
Actions #13

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43347: mimic: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) added
Actions #14

Updated by Nathan Cutler over 4 years ago

  • Copied to Backport #43348: nautilus: mds: crash(FAILED assert(omap_num_objs <= MAX_OBJECTS)) added
Actions #15

Updated by Patrick Donnelly about 4 years ago

  • Has duplicate Bug #43800: FAILED ceph_assert(omap_num_objs <= MAX_OBJECTS) - primary and standby MDS failure added
Actions #16

Updated by Nathan Cutler about 4 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
