Bug #5301
mon: leveldb crash in tcmalloc (closed)
Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -
Description
Hello
I have replaced my crushmap:
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host localhost {
    id -2    # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0   # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
    item osd.2 weight 1.000
}
rack localrack {
    id -3    # do not change unnecessarily
    # weight 2.000
    alg straw
    hash 0   # rjenkins1
    item localhost weight 3.000
}
pool default {
    id -1    # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0   # rjenkins1
    item localrack weight 3.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type osd
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type osd
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type osd
    step emit
}

# end crush map
with:
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host n11c1 {
    id -4    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0   # rjenkins1
    item osd.0 weight 1.000
}
host n14c1 {
    id -5    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0   # rjenkins1
    item osd.1 weight 1.000
}
host n18c1 {
    id -6    # do not change unnecessarily
    # weight 1.000
    alg straw
    hash 0   # rjenkins1
    item osd.2 weight 1.000
}
root default {
    id -1    # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0   # rjenkins1
    item n11c1 weight 1.000
    item n14c1 weight 1.000
    item n18c1 weight 1.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step choose firstn 0 type host
    step emit
}

# end crush map
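The report does not show how the new map was injected; the usual crushtool round-trip for replacing a crushmap looks roughly like this (filenames are illustrative, and the commands must run against a live cluster with admin credentials):

# Export the compiled crushmap currently in use
ceph osd getcrushmap -o crushmap.bin

# Decompile it to editable text
crushtool -d crushmap.bin -o crushmap.txt

# ... edit crushmap.txt (buckets, rules) ...

# Recompile the edited map
crushtool -c crushmap.txt -o crushmap.new

# Inject the new map into the cluster
ceph osd setcrushmap -i crushmap.new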
Cluster status after applying new crushmap:
# ceph -s
   health HEALTH_WARN 394 pgs stale; 394 pgs stuck stale
   monmap e7: 3 mons at {cc2=10.1.128.1:6789/0,n11c1=10.1.128.11:6789/0,n14c1=10.1.128.14:6789/0}, election epoch 110452, quorum 0,1,2 cc2,n11c1,n14c1
   osdmap e1862: 3 osds: 3 up, 3 in
   pgmap v8016076: 1104 pgs: 710 active+clean, 394 stale+active+clean; 24020 MB data, 50278 MB used, 616 GB / 670 GB avail
   mdsmap e7839: 1/1/1 up {0=n11c1=up:active}
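For anyone reproducing this, the stale placement groups behind the HEALTH_WARN above can be enumerated with the standard status commands (these are not from the original report):

# List PGs stuck in the stale state, with their acting OSD sets
ceph pg dump_stuck stale

# Expanded per-PG detail behind the health summary
ceph health detail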
Kernel rbd clients started to crash with error:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.688107] libceph: osd2 10.1.128.18:6801 socket closed (con state OPEN)
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.688727] ------------[ cut here ]------------
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.689348] kernel BUG at /build/buildd-linux_3.8.13-1-amd64-YudJGj/linux-3.8.13/net/ceph/osd_client.c:601!
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.690389] invalid opcode: 0000 [#1] SMP
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.691060] Modules linked in: xt_owner xt_comment iptable_mangle ip_tables x_tables xfs cbc rbd libceph loop nfsd auth_rpcgss nfs_acl nfs lockd dns_resolver fscache sunrpc fuse coretemp crc32c_intel ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 snd_page_alloc ablk_helper snd_timer cryptd snd xts soundcore lrw gf128mul evdev pcspkr joydev ext4 crc16 jbd2 mbcache btrfs zlib_deflate crc32c libcrc32c xen_blkfront xen_netfront
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CPU 0
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Pid: 37, comm: kworker/0:1 Not tainted 3.8-2-amd64 #1 Debian 3.8.13-1
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RIP: e030:[<ffffffffa029619d>] [<ffffffffa029619d>] osd_reset+0x105/0x195 [libceph]
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RSP: e02b:ffff880003521ce8 EFLAGS: 00010206
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RAX: ffff88007c9c0850 RBX: ffff88000375b748 RCX: 0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RDX: ffff88007c9c0820 RSI: ffff88007fc19e05 RDI: ffff88007c6f64a8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RBP: ffff88007c9c0800 R08: ffff88000375b800 R09: 00000000fffffff8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] R10: ffff88007b9f5e00 R11: ffff88007b9f5e00 R12: ffff88007cb9a420
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] R13: ffff88000375b758 R14: ffff88000375b7a0 R15: ffff88007c6f6468
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] FS:  00007f5631331700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CR2: ffffffffff600400 CR3: 000000007b5f9000 CR4: 0000000000002660
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Process kworker/0:1 (pid: 37, threadinfo ffff880003520000, task ffff88007aadf180)
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Stack:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000000 ffff88007c6f64b8 ffff88007c6f6030 ffff88007c6f6060
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000002 ffff88007c6f6428 ffffffffa02a25d0 ffff88007c6f6030
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000000 ffffffffa02925b7 ffffffff810040cf ffffffff81004273
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Call Trace:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffffa02925b7>] ? con_work+0x1b62/0x1c48 [libceph]
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810040cf>] ? arch_local_irq_restore+0x7/0x8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81004273>] ? xen_mc_flush+0x11e/0x161
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81003159>] ? xen_end_context_switch+0xe/0x1c
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff8100d02f>] ? load_TLS+0x7/0xa
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff8105fd0a>] ? mmdrop+0xd/0x1c
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81061581>] ? finish_task_switch+0x83/0xb3
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810532b1>] ? process_one_work+0x18d/0x2d3
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810536bc>] ? worker_thread+0x118/0x1b2
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810535a4>] ? rescuer_thread+0x187/0x187
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81057358>] ? kthread+0x81/0x89
# ceph -v
ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532)