Bug #5301 (closed)

mon: leveldb crash in tcmalloc

Added by Maciej Galkiewicz almost 11 years ago. Updated almost 11 years ago.

Status: Can't reproduce
Priority: High
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags: -
Backport: -
Regression: -
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: -
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

Hello

I have replaced my crushmap:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host localhost {
        id -2           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
        item osd.1 weight 1.000
        item osd.2 weight 1.000
}
rack localrack {
        id -3           # do not change unnecessarily
        # weight 2.000
        alg straw
        hash 0  # rjenkins1
        item localhost weight 3.000
}
pool default {
        id -1           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item localrack weight 3.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type osd
        step emit
}

# end crush map

with:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host n11c1 {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
}
host n14c1 {
        id -5           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 1.000
}
host n18c1 {
        id -6           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 1.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 3.000
        alg straw
        hash 0  # rjenkins1
        item n11c1 weight 1.000
        item n14c1 weight 1.000
        item n18c1 weight 1.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step choose firstn 0 type host
        step emit
}

# end crush map
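
For reference, this is the usual workflow for extracting, editing, validating, and injecting a crushmap; the file names below are placeholders, not taken from the original report:

# ceph osd getcrushmap -o crushmap.bin       # extract the compiled map from the cluster
# crushtool -d crushmap.bin -o crushmap.txt  # decompile it to the text form shown above
# crushtool -c crushmap.txt -o crushmap.new  # recompile the edited text
# crushtool -i crushmap.new --test --rule 0 --num-rep 2 --show-statistics  # dry-run the rule mapping
# ceph osd setcrushmap -i crushmap.new       # inject the new map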

Cluster status after applying the new crushmap:

# ceph -s
   health HEALTH_WARN 394 pgs stale; 394 pgs stuck stale
   monmap e7: 3 mons at {cc2=10.1.128.1:6789/0,n11c1=10.1.128.11:6789/0,n14c1=10.1.128.14:6789/0}, election epoch 110452, quorum 0,1,2 cc2,n11c1,n14c1
   osdmap e1862: 3 osds: 3 up, 3 in
    pgmap v8016076: 1104 pgs: 710 active+clean, 394 stale+active+clean; 24020 MB data, 50278 MB used, 616 GB / 670 GB avail
   mdsmap e7839: 1/1/1 up {0=n11c1=up:active}
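
To see which placement groups are stuck and where they were last reported, the standard diagnostics would be (generic commands, not part of the original report):

# ceph health detail
# ceph pg dump_stuck stale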

Kernel rbd clients started to crash with the following error:

Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.688107] libceph: osd2 10.1.128.18:6801 socket closed (con state OPEN)
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.688727] ------------[ cut here ]------------
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.689348] kernel BUG at /build/buildd-linux_3.8.13-1-amd64-YudJGj/linux-3.8.13/net/ceph/osd_client.c:601!
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.690389] invalid opcode: 0000 [#1] SMP 
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.691060] Modules linked in: xt_owner xt_comment iptable_mangle ip_tables x_tables xfs cbc rbd libceph loop nfsd auth_rpcgss nfs_acl nfs lockd dns_resolver fscache sunrpc fuse coretemp crc32c_intel ghash_clmulni_intel snd_pcm aesni_intel aes_x86_64 snd_page_alloc ablk_helper snd_timer cryptd snd xts soundcore lrw gf128mul evdev pcspkr joydev ext4 crc16 jbd2 mbcache btrfs zlib_deflate crc32c libcrc32c xen_blkfront xen_netfront
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CPU 0 
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Pid: 37, comm: kworker/0:1 Not tainted 3.8-2-amd64 #1 Debian 3.8.13-1  
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RIP: e030:[<ffffffffa029619d>]  [<ffffffffa029619d>] osd_reset+0x105/0x195 [libceph]
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RSP: e02b:ffff880003521ce8  EFLAGS: 00010206
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RAX: ffff88007c9c0850 RBX: ffff88000375b748 RCX: 0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RDX: ffff88007c9c0820 RSI: ffff88007fc19e05 RDI: ffff88007c6f64a8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] RBP: ffff88007c9c0800 R08: ffff88000375b800 R09: 00000000fffffff8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] R10: ffff88007b9f5e00 R11: ffff88007b9f5e00 R12: ffff88007cb9a420
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] R13: ffff88000375b758 R14: ffff88000375b7a0 R15: ffff88007c6f6468
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] FS:  00007f5631331700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] CR2: ffffffffff600400 CR3: 000000007b5f9000 CR4: 0000000000002660
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Process kworker/0:1 (pid: 37, threadinfo ffff880003520000, task ffff88007aadf180)
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Stack:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000000 ffff88007c6f64b8 ffff88007c6f6030 ffff88007c6f6060
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000002 ffff88007c6f6428 ffffffffa02a25d0 ffff88007c6f6030
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  0000000000000000 ffffffffa02925b7 ffffffff810040cf ffffffff81004273
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047] Call Trace:
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffffa02925b7>] ? con_work+0x1b62/0x1c48 [libceph]
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810040cf>] ? arch_local_irq_restore+0x7/0x8
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81004273>] ? xen_mc_flush+0x11e/0x161
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81003159>] ? xen_end_context_switch+0xe/0x1c
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff8100d02f>] ? load_TLS+0x7/0xa
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff8105fd0a>] ? mmdrop+0xd/0x1c
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81061581>] ? finish_task_switch+0x83/0xb3
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810532b1>] ? process_one_work+0x18d/0x2d3
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810536bc>] ? worker_thread+0x118/0x1b2
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff810535a4>] ? rescuer_thread+0x187/0x187
Jun 11 14:25:08 i-10-1-73-197 kernel: [850269.692047]  [<ffffffff81057358>] ? kthread+0x81/0x89
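
The clients were connected to osd2 (10.1.128.18) when the socket closed. As a sketch, the object-to-OSD mapping under the new rules can be checked with the following command, where the pool and object names are placeholders:

# ceph osd map rbd some-object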

# ceph -v
ceph version 0.61.3 (92b1e398576d55df8e5888dd1a9545ed3fd99532)

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #5239: osd: Segmentation fault in ceph-osd / tcmalloc (Can't reproduce, 06/03/2013)
