
Bug #55858

Pacific 16.2.7 MDS constantly crashing

Added by Mike Lowe 6 months ago. Updated 4 months ago.

Status: Need More Info
Priority: Normal
Category: -
Target version:
% Done: 0%
Source: Community (user)
Tags:
Backport: quincy, pacific
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS): MDS
Labels (FS): crash
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Jun  3 23:21:23 r07s05 bash[1415068]: debug     -1> 2022-06-03T23:21:23.148+0000 7f6b0f1e1700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/mds/Locker.cc: In function 'bool Locker::check_inode_max_size(CInode*, bool, uint64_t, uint64_t, utime_t)' thread 7f6b0f1e1700 time 2022-06-03T23:21:23.147265+0000
Jun  3 23:21:23 r07s05 bash[1415068]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/mds/Locker.cc: 2787: FAILED ceph_assert(in->is_auth())
Jun  3 23:21:23 r07s05 bash[1415068]:  ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
Jun  3 23:21:23 r07s05 bash[1415068]:  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x7f6b17c0ab52]
Jun  3 23:21:23 r07s05 bash[1415068]:  2: /usr/lib64/ceph/libceph-common.so.2(+0x276d6c) [0x7f6b17c0ad6c]
Jun  3 23:21:23 r07s05 bash[1415068]:  3: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0x1aab) [0x55bd76f9298b]
Jun  3 23:21:23 r07s05 bash[1415068]:  4: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x111c) [0x55bd76e2130c]
Jun  3 23:21:23 r07s05 bash[1415068]:  5: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x4ab) [0x55bd76e220cb]
Jun  3 23:21:23 r07s05 bash[1415068]:  6: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf3c) [0x55bd76e50e8c]
Jun  3 23:21:23 r07s05 bash[1415068]:  7: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x33) [0x55bd76f07193]
Jun  3 23:21:23 r07s05 bash[1415068]:  8: (MDSContext::complete(int)+0x56) [0x55bd770c3c06]
Jun  3 23:21:23 r07s05 bash[1415068]:  9: (MDSCacheObject::finish_waiting(unsigned long, int)+0xce) [0x55bd770e5cae]
Jun  3 23:21:23 r07s05 bash[1415068]:  10: (Locker::eval_gather(SimpleLock*, bool, bool*, std::vector<MDSContext*, std::allocator<MDSContext*> >*)+0x13d6) [0x55bd76f97d66]
Jun  3 23:21:23 r07s05 bash[1415068]:  11: (Locker::handle_file_lock(ScatterLock*, boost::intrusive_ptr<MLock const> const&)+0xed1) [0x55bd76fa5dd1]
Jun  3 23:21:23 r07s05 bash[1415068]:  12: (Locker::handle_lock(boost::intrusive_ptr<MLock const> const&)+0x1b3) [0x55bd76fa6943]
Jun  3 23:21:23 r07s05 bash[1415068]:  13: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0xb4) [0x55bd76faab74]
Jun  3 23:21:23 r07s05 bash[1415068]:  14: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbcc) [0x55bd76dc0a2c]
Jun  3 23:21:23 r07s05 bash[1415068]:  15: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55bd76dc33cb]
Jun  3 23:21:23 r07s05 bash[1415068]:  16: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55bd76dc39c5]
Jun  3 23:21:23 r07s05 bash[1415068]:  17: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55bd76db35d8]
Jun  3 23:21:23 r07s05 bash[1415068]:  18: (DispatchQueue::entry()+0x126a) [0x7f6b17e4eaba]
Jun  3 23:21:23 r07s05 bash[1415068]:  19: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f6b17f005d1]
Jun  3 23:21:23 r07s05 bash[1415068]:  20: /lib64/libpthread.so.0(+0x81cf) [0x7f6b16bee1cf]
Jun  3 23:21:23 r07s05 bash[1415068]:  21: clone()
Jun  3 23:21:23 r07s05 bash[1415068]: debug      0> 2022-06-03T23:21:23.148+0000 7f6b0f1e1700 -1 *** Caught signal (Aborted) **
Jun  3 23:21:23 r07s05 bash[1415068]:  in thread 7f6b0f1e1700 thread_name:ms_dispatch
Jun  3 23:21:23 r07s05 bash[1415068]:  ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
Jun  3 23:21:23 r07s05 bash[1415068]:  1: /lib64/libpthread.so.0(+0x12ce0) [0x7f6b16bf8ce0]
Jun  3 23:21:23 r07s05 bash[1415068]:  2: gsignal()
Jun  3 23:21:23 r07s05 bash[1415068]:  3: abort()
Jun  3 23:21:23 r07s05 bash[1415068]:  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x7f6b17c0aba3]
Jun  3 23:21:23 r07s05 bash[1415068]:  5: /usr/lib64/ceph/libceph-common.so.2(+0x276d6c) [0x7f6b17c0ad6c]
Jun  3 23:21:23 r07s05 bash[1415068]:  6: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0x1aab) [0x55bd76f9298b]
Jun  3 23:21:23 r07s05 bash[1415068]:  7: (Server::handle_client_open(boost::intrusive_ptr<MDRequestImpl>&)+0x111c) [0x55bd76e2130c]
Jun  3 23:21:23 r07s05 bash[1415068]:  8: (Server::handle_client_openc(boost::intrusive_ptr<MDRequestImpl>&)+0x4ab) [0x55bd76e220cb]
Jun  3 23:21:23 r07s05 bash[1415068]:  9: (Server::dispatch_client_request(boost::intrusive_ptr<MDRequestImpl>&)+0xf3c) [0x55bd76e50e8c]
Jun  3 23:21:23 r07s05 bash[1415068]:  10: (MDCache::dispatch_request(boost::intrusive_ptr<MDRequestImpl>&)+0x33) [0x55bd76f07193]
Jun  3 23:21:23 r07s05 bash[1415068]:  11: (MDSContext::complete(int)+0x56) [0x55bd770c3c06]
Jun  3 23:21:23 r07s05 bash[1415068]:  12: (MDSCacheObject::finish_waiting(unsigned long, int)+0xce) [0x55bd770e5cae]
Jun  3 23:21:23 r07s05 bash[1415068]:  13: (Locker::eval_gather(SimpleLock*, bool, bool*, std::vector<MDSContext*, std::allocator<MDSContext*> >*)+0x13d6) [0x55bd76f97d66]
Jun  3 23:21:23 r07s05 bash[1415068]:  14: (Locker::handle_file_lock(ScatterLock*, boost::intrusive_ptr<MLock const> const&)+0xed1) [0x55bd76fa5dd1]
Jun  3 23:21:23 r07s05 bash[1415068]:  15: (Locker::handle_lock(boost::intrusive_ptr<MLock const> const&)+0x1b3) [0x55bd76fa6943]
Jun  3 23:21:23 r07s05 bash[1415068]:  16: (Locker::dispatch(boost::intrusive_ptr<Message const> const&)+0xb4) [0x55bd76faab74]
Jun  3 23:21:23 r07s05 bash[1415068]:  17: (MDSRank::handle_message(boost::intrusive_ptr<Message const> const&)+0xbcc) [0x55bd76dc0a2c]
Jun  3 23:21:23 r07s05 bash[1415068]:  18: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x7bb) [0x55bd76dc33cb]
Jun  3 23:21:23 r07s05 bash[1415068]:  19: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x55) [0x55bd76dc39c5]
Jun  3 23:21:23 r07s05 bash[1415068]:  20: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x108) [0x55bd76db35d8]
Jun  3 23:21:23 r07s05 bash[1415068]:  21: (DispatchQueue::entry()+0x126a) [0x7f6b17e4eaba]
Jun  3 23:21:23 r07s05 bash[1415068]:  22: (DispatchQueue::DispatchThread::entry()+0x11) [0x7f6b17f005d1]
Jun  3 23:21:23 r07s05 bash[1415068]:  23: /lib64/libpthread.so.0(+0x81cf) [0x7f6b16bee1cf]
Jun  3 23:21:23 r07s05 bash[1415068]:  24: clone()
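
The assert fires because Locker::check_inode_max_size() expects to run only on the MDS rank that is authoritative for the inode; here the rank handling the client open was not (or was no longer) auth for it. The "Crash signature" fields above are empty; if the crash module is enabled (it is on by default in recent releases), the structured report, including the stack signature, can be pulled from the cluster. A minimal sketch, with <crash-id> as a placeholder:

    # list recent crash reports, then dump one full report (JSON, includes stack_sig)
    ceph crash ls
    ceph crash info <crash-id>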

History

#1 Updated by Venky Shankar 6 months ago

  • Status changed from New to Triaged
  • Assignee set to Kotresh Hiremath Ravishankar
  • Target version set to v18.0.0
  • Backport set to quincy, pacific
  • Labels (FS) crash added

#2 Updated by Mike Lowe 6 months ago

I've identified the problematic clients as kernel clients running 5.18.0. Once the auth was removed for these clients, the MDSs were able to stay running long enough to recover.
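
For illustration, a minimal sketch of the kind of commands involved, assuming the MDS names from the fsmap posted later in comment #7; the session ID and client name are placeholders:

    # session metadata includes each client's kernel_version
    ceph tell mds.fs_name.r07s05.rkzfgs session ls
    # evict one session by its id
    ceph tell mds.fs_name.r07s05.rkzfgs client evict id=<session-id>
    # remove the client's key so it cannot reconnect
    ceph auth rm client.<name>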

#3 Updated by Patrick Donnelly 6 months ago

  • Description updated (diff)

#4 Updated by Patrick Donnelly 5 months ago

  • Description updated (diff)

#5 Updated by Kotresh Hiremath Ravishankar 5 months ago

Hi Mike,

We need more information on this to proceed:

1. Output of 'ceph fs dump'?
2. Was multi-MDS configured when the crash was seen?
3. MDS logs from when the crash was seen. If the issue is reproducible, could you please enable MDS debug logs and share them? (One way to raise the log levels is sketched below.)
4. What is the workload on the cluster?

Thanks,
Kotresh HR
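
For reference, a minimal sketch of raising MDS debug logging cluster-wide; 20 and 1 are the levels commonly requested for MDS triage, and 'ceph config rm' restores the defaults afterwards:

    # raise MDS and messenger debug levels
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1
    # ... reproduce the crash and collect the active MDS log ...
    ceph config rm mds debug_mds
    ceph config rm mds debug_ms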

#6 Updated by Kotresh Hiremath Ravishankar 5 months ago

  • Status changed from Triaged to Need More Info

#7 Updated by Mike Lowe 4 months ago

I've noticed a commonality when this is triggered: Singularity (https://en.wikipedia.org/wiki/Singularity_(software)) is being used.

1.
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 3

Filesystem 'cephfs' (3)
fs_name cephfs
epoch 385251
flags 12
created 2021-09-21T19:34:47.717174+0000
modified 2022-08-02T13:58:29.118779+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 421512
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 8
in 0,1,2,3,4,5,6,7
up {0=3115345405,1=2537486907,2=2537466589,3=2295815428,4=2537462109,5=2537490587,6=3081320000,7=2524704559}
failed
damaged
stopped
data_pools [14,15]
metadata_pool 13
inline_data disabled
balancer
standby_count_wanted 1
[mds.fs_name.r07s07.pggvts{0:3115345405} state up:active seq 18700 addr [v2:xxx.xxx.xxx.12:6928/2179125734,v1:xxx.xxx.xxx.12:6929/2179125734] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s08.rgbvub{1:2537486907} state up:active seq 18877 addr [v2:xxx.xxx.xxx.13:6928/3964695445,v1:xxx.xxx.xxx.13:6929/3964695445] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s06.bwendz{2:2537466589} state up:active seq 1854 addr [v2:xxx.xxx.xxx.11:6928/3205070210,v1:xxx.xxx.xxx.11:6929/3205070210] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s03.xkwnse{3:2295815428} state up:active seq 1267828 addr [v2:xxx.xxx.xxx.8:6800/2513498868,v1:xxx.xxx.xxx.8:6801/2513498868] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s04.jebrjh{4:2537462109} state up:active seq 12596 addr [v2:xxx.xxx.xxx.9:6800/909935435,v1:xxx.xxx.xxx.9:6801/909935435] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s09.vxhfas{5:2537490587} state up:active seq 4937 addr [v2:xxx.xxx.xxx.14:6928/904839211,v1:xxx.xxx.xxx.14:6929/904839211] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s05.rkzfgs{6:3081320000} state up:active seq 81771 addr [v2:xxx.xxx.xxx.10:6800/3613797273,v1:xxx.xxx.xxx.10:6801/3613797273] compat {c=[1],r=[1],i=[7ff]}]
[mds.fs_name.r07s01.cbombv{7:2524704559} state up:active seq 67493 addr [v2:xxx.xxx.xxx.6:6800/1780779886,v1:xxx.xxx.xxx.6:6801/1780779886] compat {c=[1],r=[1],i=[7ff]}]

Standby daemons:

[mds.fs_name.r07s02.zrtfpl{-1:3116010666} state up:standby seq 1 addr [v2:xxx.xxx.xxx.7:6800/3247043630,v1:xxx.xxx.xxx.7:6801/3247043630] compat {c=[1],r=[1],i=[7ff]}]
dumped fsmap epoch 385251

2. Yes
3. It is not easily reproducible
4. Various scientific applications
