Bug #5301 (closed)

mon: leveldb crash in tcmalloc

Added by Maciej Galkiewicz almost 11 years ago. Updated almost 11 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hello

I have replaced my crushmap:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host localhost {
        id -2           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
        item osd.2 weight 1.000
}
rack localrack {
        id -3           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item localhost weight 3.000
}
pool default {
        id -1           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item localrack weight 3.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}

# end crush map

with:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host n11c1 {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
}
host n14c1 {
        id -5           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.000
}
host n18c1 {
        id -6           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item n11c1 weight 1.000
        item n14c1 weight 1.000
        item n18c1 weight 1.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}

# end crush map
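
For reference, this is the usual workflow for extracting, editing, validating, and injecting a crushmap; the file names below are placeholders, not taken from the original report:

# ceph osd getcrushmap -o crushmap.bin       # extract the compiled map from the cluster
# crushtool -d crushmap.bin -o crushmap.txt  # decompile it to the text form shown above
# crushtool -c crushmap.txt -o crushmap.new  # recompile the edited text
# crushtool -i crushmap.new --test --rule 0 --num-rep 2 --show-statistics  # dry-run the rule mapping
# ceph osd setcrushmap -i crushmap.new       # inject the new map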

Cluster status after applying the new crushmap:

# ceph -s
   health HEALTH_WARN 394 pgs stale; 394 pgs stuck stale
   monmap e7: 3 mons at {cc2=10.1.128.1:6789/0,n11c1=10.1.128.11:6789/0,n14c1=10.1.128.14:6789/0}, election epoch 110452, quorum 0,1,2 cc2,n11c1,n14c1
   osdmap e1862: 3 osds: 3 up, 3 in
    pgmap v8016076: 1104 pgs: 710 active+clean, 394 stale+active+clean; 24020 MB data, 50278 MB used, 616 GB / 670 GB avail
   mdsmap e7839: 1/1/1 up {0=n11c1=up:active}
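
To see which placement groups are stuck and where they were last reported, the standard diagnostics would be (generic commands, not part of the original report):

# ceph health detail
# ceph pg dump_stuck stale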

Kernel rbd clients started to crash with the following error:

Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.688107] libceph: osd2 10.1.128.18:6801 socket closed (con state OPEN)
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.688727] ------------[ cut here ]------------
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.689348] kernel BUG at /build/buildd-linux_3.8.13-1-amd64-YudJGj/linux-3.8.13/net/ceph/osd_client.c:601!
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.690389] invalid opcode: 0000 [#1] SMP 
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.691060] Modules linked in: xt_owner xt_comment iptable_mangle ip_tables x_tables xfs cbc rbd libceph loop nfsd auth_rpcgss nfs_acl nfs lockd dns_resolver fscache sunrpc fuse coretemp crc32c_intel ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 snd_page_alloc ablk_helper snd_timer cryptd snd xts soundcore lrw gf128mul evdev pcspkr joydev ext4 crc16 jbd2 mbcache btrfs zlib_deflate crc32c libcrc32c xen_blkfront xen_netfront
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CPU 0 
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Pid: 37, comm: kworker/0:1 Not tainted 3.8-2-amd64 #1 Debian 3.8.13-1  
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RIP: e030:[<ffffffffa029619d>]  [<ffffffffa029619d>] osd_reset+0x105/0x195 [libceph]
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RSP: e02b:ffff880003521ce8  EFLAGS: 00010206
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RAX: ffff88007c9c0850 RBX: ffff88000375b748 RCX: 0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RDX: ffff88007c9c0820 RSI: ffff88007fc19e05 RDI: ffff88007c6f64a8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RBP: ffff88007c9c0800 R08: ffff88000375b800 R09: 00000000fffffff8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] R10: ffff88007b9f5e00 R11: ffff88007b9f5e00 R12: ffff88007cb9a420
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] R13: ffff88000375b758 R14: ffff88000375b7a0 R15: ffff88007c6f6468
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] FS:  00007f5631331700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CR2: ffffffffff600400 CR3: 000000007b5f9000 CR4: 0000000000002660
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Process kworker/0:1 (pid: 37, threadinfo ffff880003520000, task ffff88007aadf180)
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Stack:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000000 ffff88007c6f64b8 ffff88007c6f6030 ffff88007c6f6060
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000002 ffff88007c6f6428 ffffffffa02a25d0 ffff88007c6f6030
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000000 ffffffffa02925b7 ffffffff810040cf ffffffff81004273
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Call Trace:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffffa02925b7>] ? con_work+0x1b62/0x1c48 [libceph]
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810040cf>] ? arch_local_irq_restore+0x7/0x8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81004273>] ? xen_mc_flush+0x11e/0x161
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81003159>] ? xen_end_context_switch+0xe/0x1c
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff8100d02f>] ? load_TLS+0x7/0xa
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff8105fd0a>] ? mmdrop+0xd/0x1c
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81061581>] ? finish_task_switch+0x83/0xb3
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810532b1>] ? process_one_work+0x18d/0x2d3
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810536bc>] ? worker_thread+0x118/0x1b2
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810535a4>] ? rescuer_thread+0x187/0x187
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81057358>] ? kthread+0x81/0x89
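
The clients were connected to osd2 (10.1.128.18) when the socket closed. As a sketch, the object-to-OSD mapping under the new rules can be checked with the following command, where the pool and object names are placeholders:

# ceph osd map rbd some-object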

# ceph -v
ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532)

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #5239: osd: Segmentation fault in ceph-osd / tcmalloc (Can't reproduce, 06/03/2013)
