Bug #63830
MDS fails to start
Description
I have 2 filesystems, production and backup.
The backup fs is offline, because none of the mds's will go active.
Below, I've added the version, the MDS service spec, the pool IDs and names, the MDS metadata for backup, one of the many crash reports, and the service log output generated when I reset-failed and start one of the MDS services.
I've also been made aware of https://access.redhat.com/solutions/6994879, but I'm not sure it's the same issue.
$ ceph version
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
$ ceph orch ls --service_type mds --export
service_type: mds
service_id: Production
service_name: mds.Production
placement:
  count: 2
  label: mds
---
service_type: mds
service_id: backup
service_name: mds.backup
placement:
  count: 2
  label: mds_backup
$ ceph osd pool ls detail | grep cephfs | awk '{print $1" "$2" "$3}'
pool 24 'cephfs.backup.meta'
pool 25 'cephfs.backup.data'
pool 26 'cephfs.production.data'
pool 27 'cephfs.production.metadata'
$ ceph fs ls
name: backup, metadata pool: cephfs.backup.meta, data pools: [cephfs.backup.data ]
name: production, metadata pool: cephfs.production.metadata, data pools: [cephfs.production.data ]
$ ceph mds metadata | jq .[1]
{
  "name": "backup.ceph03.gcoisu",
  "addr": "[v2:10.1.0.34:6800/3795710591,v1:10.1.0.34:6801/3795710591]",
  "arch": "x86_64",
  "ceph_release": "quincy",
  "ceph_version": "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
  "ceph_version_short": "17.2.7",
  "container_hostname": "ceph03",
  "container_image": "quay.io/ceph/ceph@sha256:1fcdbead4709a7182047f8ff9726e0f17b0b209aaa6656c5c8b2339b818e70bb",
  "cpu": "Intel(R) Celeron(R) J4115 CPU @ 1.80GHz",
  "distro": "centos",
  "distro_description": "CentOS Stream 8",
  "distro_version": "8",
  "hostname": "ceph03",
  "kernel_description": "#1 SMP PREEMPT_DYNAMIC Thu Sep 21 18:07:33 UTC 2023",
  "kernel_version": "5.14.0-368.el9.x86_64",
  "mem_swap_kb": "3055612",
  "mem_total_kb": "32410468",
  "os": "Linux"
}
$ ceph crash info 2023-12-14T12:08:09.595806Z_430af44c-1138-47fd-94c2-69cd6f82001e
{
  "backtrace": [
    "/lib64/libpthread.so.0(+0x12cf0) [0x7f4acf88acf0]",
    "gsignal()",
    "abort()",
    "/lib64/libstdc++.so.6(+0x9009b) [0x7f4acec8409b]",
    "/lib64/libstdc++.so.6(+0x9654c) [0x7f4acec8a54c]",
    "/lib64/libstdc++.so.6(+0x965a7) [0x7f4acec8a5a7]",
    "/lib64/libstdc++.so.6(+0x96808) [0x7f4acec8a808]",
    "(ceph::buffer::v15_2_0::list::iterator_impl<true>::copy(unsigned int, char*)+0xa5) [0x7f4ad0c620e5]",
    "(compact_set_base<long, std::set<long, std::less<long>, mempool::pool_allocator<(mempool::pool_index_t)26, long> > >::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x15f) [0x55a2d20088df]",
    "(inode_t<mempool::mds_co::pool_allocator>::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x55b) [0x55a2d200903b]",
    "(old_inode_t<mempool::mds_co::pool_allocator>::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x123) [0x55a2d2009623]",
    "(EMetaBlob::fullbit::decode(ceph::buffer::v15_2_0::list::iterator_impl<true>&)+0x688) [0x55a2d20eb3f8]",
    "/usr/bin/ceph-mds(+0x592f2d) [0x55a2d20edf2d]",
    "(EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x7bf) [0x55a2d20f5bff]",
    "(EUpdate::replay(MDSRank*)+0x61) [0x55a2d20fdbd1]",
    "(MDLog::_replay_thread()+0x7bb) [0x55a2d208454b]",
    "(MDLog::ReplayThread::entry()+0x11) [0x55a2d1d37041]",
    "/lib64/libpthread.so.0(+0x81ca) [0x7f4acf8801ca]",
    "clone()"
  ],
  "ceph_version": "17.2.7",
  "crash_id": "2023-12-14T12:08:09.595806Z_430af44c-1138-47fd-94c2-69cd6f82001e",
  "entity_name": "mds.backup.ceph03.gcoisu",
  "os_id": "centos",
  "os_name": "CentOS Stream",
  "os_version": "8",
  "os_version_id": "8",
  "process_name": "ceph-mds",
  "stack_sig": "99cdac589b9de540dc8f5016618788241f1ac1c08b8c8bf453437e6cd9792d18",
  "timestamp": "2023-12-14T12:08:09.595806Z",
  "utsname_hostname": "ceph03",
  "utsname_machine": "x86_64",
  "utsname_release": "5.14.0-368.el9.x86_64",
  "utsname_sysname": "Linux",
  "utsname_version": "#1 SMP PREEMPT_DYNAMIC Thu Sep 21 18:07:33 UTC 2023"
}
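For what it's worth, the backtrace shows the abort happening inside `ceph::buffer::list::iterator_impl::copy()` while decoding an `old_inode_t` out of an `EMetaBlob` during journal replay, which is the shape of failure you get when a journal event is truncated or corrupt and a decoder tries to read past the end of the buffer. A minimal, hypothetical Python sketch of that failure mode (this is not Ceph's actual decoder, just an illustration of bounds-checked, length-prefixed decoding):

```python
import struct

class EndOfBuffer(Exception):
    """Analogous to the end-of-buffer error thrown by bufferlist copy()."""

class Decoder:
    """Reads length-prefixed fields from a byte buffer; every read is
    bounds-checked and raises when the encoded length exceeds the
    remaining bytes, like Ceph's bufferlist iterator."""
    def __init__(self, buf: bytes):
        self.buf = buf
        self.off = 0

    def copy(self, n: int) -> bytes:
        # Refuse to read past the end of the buffer.
        if self.off + n > len(self.buf):
            raise EndOfBuffer(f"need {n} bytes at offset {self.off}, "
                              f"only {len(self.buf) - self.off} left")
        out = self.buf[self.off:self.off + n]
        self.off += n
        return out

    def decode_u32(self) -> int:
        return struct.unpack("<I", self.copy(4))[0]

    def decode_set(self) -> list:
        # compact_set-style encoding: a u32 element count, then elements.
        n = self.decode_u32()
        return [struct.unpack("<q", self.copy(8))[0] for _ in range(n)]

# A well-formed "journal entry": a set with 2 elements.
good = struct.pack("<Iqq", 2, 10, 20)
print(Decoder(good).decode_set())  # [10, 20]

# The same entry truncated mid-element: the decoder walks off the end
# of the buffer, which in the MDS surfaces as the posted abort().
try:
    Decoder(good[:-4]).decode_set()
except EndOfBuffer as e:
    print("decode failed:", e)
```

If the on-disk journal really is damaged, the decode error would reproduce on every replay attempt, matching the "many crash reports" observation above.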
Updated by Venky Shankar 5 months ago
- Category set to Correctness/Safety
- Assignee set to Milind Changire
- Target version set to v19.0.0
Milind, I recall you working on a similar issue. This looks to be the standby-replay daemon.
Updated by Venky Shankar 4 months ago
Heðin Ejdesgaard Møller wrote:
> I have 2 filesystems, production and backup.
> The backup fs is offline, because none of the mds's will go active.
The crash posted seems to be from the standby-replay daemon. Why is the active MDS offline? Is that crashing too?
Updated by Venky Shankar 4 months ago
- Priority changed from Normal to High
Venky Shankar wrote:
> Heðin Ejdesgaard Møller wrote:
> > I have 2 filesystems, production and backup.
> > The backup fs is offline, because none of the mds's will go active.
> The crash posted seems to be from the standby-replay daemon. Why is the active MDS offline? Is that crashing too?

My bad. This is the to-be-active MDS daemon going through the boot sequence and crashing in up:replay. I've seen this before; unfortunately, nothing came out of debugging it then. Will have a look.
Updated by Heðin Ejdesgaard Møller 2 months ago
I have made a coredump of the mds service, but its size is ~10MB, so I'm unable to upload it here.
Is there another place where I can upload the file? If not, then please throw me an email and I'll share it with you directly.
Updated by Milind Changire about 2 months ago
Heðin Ejdesgaard Møller wrote:
> I have made a coredump of the mds service, but its size is ~10MB, so I'm unable to upload it here.
> Is there another place where I can upload the file? If not, then please throw me an email and I'll share it with you directly.
Please upload the coredump on Google Drive or some public file sharing service and paste the download link here.
Updated by Heðin Ejdesgaard Møller about 2 months ago
Milind Changire wrote:
> Heðin Ejdesgaard Møller wrote:
> > I have made a coredump of the mds service, but its size is ~10MB, so I'm unable to upload it here.
> > Is there another place where I can upload the file? If not, then please throw me an email and I'll share it with you directly.
> Please upload the coredump on Google Drive or some public file sharing service and paste the download link here.
Hey, I have uploaded it here:
https://drive.google.com/drive/folders/1dxXSynMsOCRzFurGugfcrj0iwrQK8y74?usp=drive_link
Updated by Venky Shankar about 2 months ago
Heðin Ejdesgaard Møller wrote:
> Milind Changire wrote:
> > Heðin Ejdesgaard Møller wrote:
> > > I have made a coredump of the mds service, but its size is ~10MB, so I'm unable to upload it here.
> > > Is there another place where I can upload the file? If not, then please throw me an email and I'll share it with you directly.
> > Please upload the coredump on Google Drive or some public file sharing service and paste the download link here.
> Hey, I have uploaded it here:
> https://drive.google.com/drive/folders/1dxXSynMsOCRzFurGugfcrj0iwrQK8y74?usp=drive_link
Thanks for sharing the coredump.
Milind, PTAL.
Updated by Milind Changire 24 days ago
I rebuilt code tagged at v17.2.7 on my Fedora 35 VM and launched gdb with the locally built ceph-mds and the core dump. gdb reported that the stack was probably corrupt:
#0  0x00007f361b385b8f in ?? ()
[Current thread is 1 (LWP 31)]
(gdb) bt
#0  0x00007f361b385b8f in ?? ()
#1  0x0000000000381000 in MDCache::rejoin_send_acks (this=<error reading variable: Cannot access memory at address 0xfffffffffffffe36>) at /home/mchangir/work/mchangir-ceph.git/src/mds/MDCache.cc:6089
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
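A fully corrupt stack like this often just means the binary and shared libraries gdb loaded don't exactly match the build that produced the core; the crashing daemon ran from the container image recorded in the mds metadata above (quay.io/ceph/ceph@sha256:1fcdbe...), not a local rebuild, so symbol addresses won't line up. A sketch of a gdb command file for retrying against the binary and libraries extracted from that exact image (all paths here are hypothetical):

```
# Load the exact ceph-mds binary the core was produced by,
# extracted from the container image rather than rebuilt locally.
file ./container-rootfs/usr/bin/ceph-mds
core-file ./core.ceph-mds

# Resolve shared libraries from the container's filesystem,
# not the debugging host's.
set sysroot ./container-rootfs
set solib-search-path ./container-rootfs/lib64

# Dump backtraces from every thread, not just the crashing one.
thread apply all bt
```

With a matching binary plus its debuginfo package, the frames that currently show as `?? ()` should resolve to the same `EMetaBlob`/`MDLog` symbols as the crash report's backtrace.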