Bug #20122

open

Ceph MDS crash with assert failure

Added by James Eckersall almost 7 years ago. Updated over 6 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The cluster is running Kraken on CentOS 7.3 and has 3 MDS servers: 01 was up:active and is the one that crashed with the stack trace below, 02 was in up:standby-replay, and 03 was in up:standby.
After the crash, 01 came back up into up:standby and 02 changed to up:replay, but it logged nothing for two and a half hours and stayed stuck in up:replay for that whole time. At that point, two and a half hours after the initial 01 crash, one of our engineers killed the MDS daemon process on 02; 03 then changed from up:standby to up:standby-replay and on to up:active, so service was restored. 01 changed into the up:standby-replay state.
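
A minimal sketch of how such state transitions can be watched and how a stuck daemon can be stopped so a standby takes over, assuming systemd-managed daemons and illustrative daemon IDs (not taken from this cluster):

    # watch the MDS map and which daemon holds each state
    ceph mds stat
    ceph -s
    # stop the stuck daemon so a standby can be promoted
    systemctl stop ceph-mds@mds02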

2017-05-30 22:12:00.933446 7f27cf42c700 1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/mds/CDir.cc: In function 'void CDir::try_remove_dentries_for_stray()' thread 7f27cf42c700 time 2017-05-30 22:12:00.906195
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/ceph-11.2.0/src/mds/CDir.cc: 698: FAILED assert(dn->get_linkage()->is_null())

ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f27db4df9c5]
2: (CDir::try_remove_dentries_for_stray()+0x1c0) [0x7f27db3424c0]
3: (StrayManager::__eval_stray(CDentry*, bool)+0x8a9) [0x7f27db2c60e9]
4: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7f27db2c665e]
5: (MutationImpl::drop_pins()+0xc1) [0x7f27db20e8c1]
6: (MDCache::request_cleanup(std::shared_ptr<MDRequestImpl>&)+0x171) [0x7f27db236151]
7: (MDCache::request_finish(std::shared_ptr<MDRequestImpl>&)+0x160) [0x7f27db236590]
8: (Server::reply_client_request(std::shared_ptr<MDRequestImpl>&, MClientReply*)+0x223) [0x7f27db1b43b3]
9: (Server::respond_to_request(std::shared_ptr<MDRequestImpl>&, int)+0x411) [0x7f27db1b4fc1]
10: (Server::_unlink_local_finish(std::shared_ptr<MDRequestImpl>&, CDentry*, CDentry*, unsigned long)+0x312) [0x7f27db1befa2]
11: (MDSIOContextBase::complete(int)+0xa4) [0x7f27db3c3164]
12: (MDSLogContextBase::complete(int)+0x3c) [0x7f27db3c360c]
13: (Finisher::finisher_thread_entry()+0x1f6) [0x7f27db4deba6]
14: (()+0x7dc5) [0x7f27d91ecdc5]
15: (clone()+0x6d) [0x7f27d82d873d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
<10000 recent entries >
--- end dump of recent events ---
2017-05-30 22:12:00.962721 7f27cf42c700 -1 *** Caught signal (Aborted) **
in thread 7f27cf42c700 thread_name:fn_anonymous

Please let me know if there is any further information you require.
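
As the NOTE in the trace above says, matching symbols are needed to interpret the frames any further. A minimal sketch of one way to do that on CentOS 7, assuming the stock 11.2.0 RPMs and default binary path (both assumptions, not taken from this report):

    # install debug symbols matching the running version, then dump the annotated disassembly
    yum install ceph-debuginfo-11.2.0
    objdump -rdS /usr/bin/ceph-mds > ceph-mds-11.2.0.objdump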

Actions #1

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to CephFS
Actions #2

Updated by Patrick Donnelly almost 7 years ago

Are you able to reliably reproduce this? Do you have any MDS logs during the failure?

Actions #3

Updated by James Poole over 6 years ago

Thank you for your help. We've had several occurrences of the same issue since. It isn't something we can easily replicate, as the circumstances that cause it aren't 100% clear and the cluster supports a production system. The MDS appears to crash completely and logging stops. We are running the 3.10 kernel at the moment and appreciate that there are fixes to the CephFS client in the mainline 4.x kernels, but we wondered if you could advise how likely it is that a newer kernel would fix the problem we are seeing? We have a stack trace from a client:

[Mon Jul 10 06:53:34 2017] ceph: mds0 caps stale
[Mon Jul 10 06:53:36 2017] ceph: mds0 caps renewed
[Mon Jul 10 06:56:34 2017] ceph: mds0 caps stale
[Mon Jul 10 06:56:42 2017] ceph: mds0 caps renewed
[Mon Jul 10 06:59:14 2017] ceph: mds0 caps stale
[Mon Jul 10 06:59:34 2017] ceph: mds0 caps stale
[Mon Jul 10 07:00:13 2017] ceph: mds0 caps renewed
[Mon Jul 10 07:01:14 2017] ceph: mds0 caps stale
[Mon Jul 10 07:01:33 2017] ceph: mds0 caps renewed
[Mon Jul 10 07:02:14 2017] ceph: mds0 caps stale
[Mon Jul 10 07:02:33 2017] ceph: mds0 caps went stale, renewing
[Mon Jul 10 07:02:33 2017] ceph: mds0 caps stale
[Mon Jul 10 07:02:33 2017] ------------[ cut here ]------------
[Mon Jul 10 07:02:33 2017] WARNING: at fs/ceph/inode.c:567 ceph_fill_file_size+0x1f1/0x220 [ceph]()
[Mon Jul 10 07:02:33 2017] Modules linked in: binfmt_misc nf_conntrack_netlink nfnetlink isofs veth rbd cfg80211 rfkill ceph libceph dns_resolver xt_statistic xt_nat xt_recent ipt_REJECT nf_reject_ipv4 vport_vxlan vxlan ip6_udp_tunnel udp_tunnel xt_mark openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_defrag_ipv6 ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio fuse btrfs zlib_deflate raid6_pq xor vfat msdos fat ext4 mbcache jbd2 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables xt_limit nf_log_ipv4 nf_log_common xt_LOG xt_comment xt_multiport vmw_vsock_vmci_transport vsock intel_powerclamp coretemp iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
[Mon Jul 10 07:02:33 2017] ablk_helper cryptd ppdev vmw_balloon pcspkr sg vmw_vmci shpchp i2c_piix4 parport_pc parport xfs libcrc32c sr_mod cdrom ata_generic pata_acpi sd_mod crc_t10dif crct10dif_generic vmwgfx drm_kms_helper syscopyarea sysfillrect ahci sysimgblt fb_sys_fops crct10dif_pclmul ttm crct10dif_common mptsas libahci crc32c_intel ata_piix drm serio_raw scsi_transport_sas mptscsih vmxnet3 libata mptbase i2c_core fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nf_defrag_ipv4]
[Mon Jul 10 07:02:33 2017] CPU: 0 PID: 55515 Comm: kworker/0:1 Not tainted 3.10.0-514.16.1.el7.x86_64 #1
[Mon Jul 10 07:02:33 2017] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/17/2015
[Mon Jul 10 07:02:33 2017] Workqueue: ceph-msgr ceph_con_workfn [libceph]
[Mon Jul 10 07:02:33 2017] 0000000000000000 0000000001dc52cf ffff880247bbb9f8 ffffffff81686ac3
[Mon Jul 10 07:02:33 2017] ffff880247bbba30 ffffffff81085cb0 ffff880208627280 000000000000030e
[Mon Jul 10 07:02:33 2017] 0000000000000000 0000000000000000 0000000000000200 ffff880247bbba40
[Mon Jul 10 07:02:33 2017] Call Trace:
[Mon Jul 10 07:02:33 2017] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[Mon Jul 10 07:02:33 2017] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[Mon Jul 10 07:02:33 2017] [<ffffffff81085dfa>] warn_slowpath_null+0x1a/0x20
[Mon Jul 10 07:02:33 2017] [<ffffffffa07a6b51>] ceph_fill_file_size+0x1f1/0x220 [ceph]
[Mon Jul 10 07:02:33 2017] [<ffffffffa07a72c3>] fill_inode.isra.13+0x253/0xdc0 [ceph]
[Mon Jul 10 07:02:33 2017] [<ffffffffa07a7f75>] ceph_fill_trace+0x145/0xa00 [ceph]
[Mon Jul 10 07:02:33 2017] [<ffffffffa07c86a8>] handle_reply+0x3e8/0xc80 [ceph]
[Mon Jul 10 07:02:33 2017] [<ffffffffa076519d>] ? ceph_x_encrypt+0x4d/0x80 [libceph]
[Mon Jul 10 07:02:33 2017] [<ffffffffa07cad39>] dispatch+0xd9/0xaf0 [ceph]
[Mon Jul 10 07:02:33 2017] [<ffffffff81555f5a>] ? kernel_recvmsg+0x3a/0x50
[Mon Jul 10 07:02:33 2017] [<ffffffffa074cecf>] try_read+0x4df/0x1260 [libceph]
[Mon Jul 10 07:02:33 2017] [<ffffffff810cf22e>] ? dequeue_task_fair+0x41e/0x660
[Mon Jul 10 07:02:33 2017] [<ffffffffa074dd09>] ceph_con_workfn+0xb9/0x650 [libceph]
[Mon Jul 10 07:02:33 2017] [<ffffffff810a845b>] process_one_work+0x17b/0x470
[Mon Jul 10 07:02:33 2017] [<ffffffff810a9296>] worker_thread+0x126/0x410
[Mon Jul 10 07:02:33 2017] [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
[Mon Jul 10 07:02:33 2017] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[Mon Jul 10 07:02:33 2017] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[Mon Jul 10 07:02:33 2017] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[Mon Jul 10 07:02:33 2017] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[Mon Jul 10 07:02:33 2017] ---[ end trace dcfd5316bf96d6e2 ]---

Are there any further debugging steps we can take if/when this reoccurs?

Actions #4

Updated by Patrick Donnelly over 6 years ago

  • Status changed from New to Need More Info
  • Source set to Community (user)

A debug log from the MDS is necessary to diagnose this, I think. See: http://docs.ceph.com/docs/giant/rados/troubleshooting/log-and-debug/

"debug mds = 10" would be sufficient I think.
