Bug #16875


apama001 load spikes and OSD 65 dies

Added by David Galloway over 7 years ago. Updated over 7 years ago.

Status: Closed
Priority: Normal
Category: Infrastructure Hardware
Target version: -
% Done: 0%
Source: other
Regression: No
Severity: 3 - minor

Description

Output from dmesg when the issue occurs:

[Sat Jul 30 19:43:13 2016] divide error: 0000 [#1] SMP 
[Sat Jul 30 19:43:13 2016] Modules linked in: binfmt_misc xfs libcrc32c intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw ipmi_ssif gf128mul glue_helper hpilo ablk_helper cryptd serio_raw sb_edac ipmi_si edac_core kvm_intel 8250_fintek shpchp acpi_power_meter kvm ipmi_msghandler ioatdma lpc_ich irqbypass mac_hid ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr nfsd iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi auth_rpcgss nfs_acl lockd 8021q grace garp mrp sunrpc stp llc autofs4 btrfs xor raid6_pq mlx4_en vxlan ip6_udp_tunnel udp_tunnel psmouse igb dca ptp pata_acpi pps_core i2c_algo_bit hpsa mlx4_core scsi_transport_sas wmi fjes
[Sat Jul 30 19:43:14 2016] CPU: 0 PID: 51447 Comm: ms_pipe_read Not tainted 4.4.0-28-generic #47-Ubuntu
[Sat Jul 30 19:43:14 2016] Hardware name: HP ProLiant SL4540 Gen8 /, BIOS P74 11/01/2014
[Sat Jul 30 19:43:14 2016] task: ffff880113e61b80 ti: ffff880105074000 task.ti: ffff880105074000
[Sat Jul 30 19:43:14 2016] RIP: 0010:[<ffffffff810b5adc>]  [<ffffffff810b5adc>] task_numa_find_cpu+0x23c/0x710
[Sat Jul 30 19:43:14 2016] RSP: 0000:ffff880105077bd8  EFLAGS: 00010206
[Sat Jul 30 19:43:14 2016] RAX: 0000000000000000 RBX: ffff880105077c78 RCX: 0000000000000000
[Sat Jul 30 19:43:14 2016] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff88061fea1800
[Sat Jul 30 19:43:14 2016] RBP: ffff880105077c40 R08: 000000011483e57e R09: 0000000000000036
[Sat Jul 30 19:43:14 2016] R10: ffffffffffffff20 R11: 0000000000000013 R12: ffff8801af1f44c0
[Sat Jul 30 19:43:14 2016] R13: 0000000000000007 R14: 000000000000007e R15: ffffffffffffff2e
[Sat Jul 30 19:43:14 2016] FS:  00007fdf05c3f700(0000) GS:ffff880627600000(0000) knlGS:0000000000000000
[Sat Jul 30 19:43:14 2016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Jul 30 19:43:14 2016] CR2: 00005557ba0a2750 CR3: 0000000c1b3a2000 CR4: 00000000000406f0
[Sat Jul 30 19:43:14 2016] Stack:
[Sat Jul 30 19:43:14 2016]  ffff88061fea2c00 ffff880627616d70 0000000000000047 ffff880113e61b80
[Sat Jul 30 19:43:14 2016]  000000000000007f fffffffffffffdf7 0000000000016d00 000000000000007f
[Sat Jul 30 19:43:14 2016]  ffff880113e61b80 ffff880105077c78 00000000000002f8 00000000000001b4
[Sat Jul 30 19:43:14 2016] Call Trace:
[Sat Jul 30 19:43:14 2016]  [<ffffffff810b63ee>] task_numa_migrate+0x43e/0x9b0
[Sat Jul 30 19:43:14 2016]  [<ffffffff810b69d9>] numa_migrate_preferred+0x79/0x80
[Sat Jul 30 19:43:14 2016]  [<ffffffff810baff4>] task_numa_fault+0x7f4/0xd40
[Sat Jul 30 19:43:14 2016]  [<ffffffff810ba665>] ? should_numa_migrate_memory+0x55/0x130
[Sat Jul 30 19:43:14 2016]  [<ffffffff811bff10>] handle_mm_fault+0xbc0/0x1820
[Sat Jul 30 19:43:14 2016]  [<ffffffff816ffcf4>] ? SYSC_recvfrom+0x144/0x160
[Sat Jul 30 19:43:14 2016]  [<ffffffff81822f06>] ? __schedule+0x3b6/0xa30
[Sat Jul 30 19:43:14 2016]  [<ffffffff81822f06>] ? __schedule+0x3b6/0xa30
[Sat Jul 30 19:43:14 2016]  [<ffffffff8106b537>] __do_page_fault+0x197/0x400
[Sat Jul 30 19:43:14 2016]  [<ffffffff8106b7c2>] do_page_fault+0x22/0x30
[Sat Jul 30 19:43:14 2016]  [<ffffffff81829838>] page_fault+0x28/0x30
[Sat Jul 30 19:43:14 2016] Code: 55 b0 4c 89 f7 e8 25 c8 ff ff 48 8b 55 b0 49 8b 4e 78 48 8b 82 d8 01 00 00 48 83 c1 01 31 d2 49 0f af 86 b0 00 00 00 4c 8b 73 78 <48> f7 f1 48 8b 4b 20 49 89 c0 48 29 c1 48 8b 45 d0 4c 03 43 48 
[Sat Jul 30 19:43:15 2016] RIP  [<ffffffff810b5adc>] task_numa_find_cpu+0x23c/0x710
[Sat Jul 30 19:43:15 2016]  RSP <ffff880105077bd8>
[Sat Jul 30 19:43:15 2016] ---[ end trace 3a019a2c84ac8efc ]---
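
The trap is a divide-by-zero: the Code bytes above appear to decode to an add rcx,1 followed by the faulting div rcx, and the register dump shows RCX: 0000000000000000. That state is consistent (an assumption, not confirmed anywhere in this ticket) with the 4.4-era CFS load-average underflow, where an unsigned runqueue load average wraps around to ULONG_MAX and the scheduler's "+ 1" guard on the divisor then wraps back to zero. A minimal userspace C sketch of that arithmetic pattern, not actual kernel code:

/* Sketch (assumption, userspace only): how an unsigned underflow can
 * defeat a "+ 1" divide-by-zero guard, matching the add rcx,1 / div rcx
 * sequence and RCX == 0 seen in the oops above. */
#include <stdio.h>

int main(void)
{
    unsigned long load_avg = 5;    /* hypothetical running load average */

    load_avg -= 6;                 /* buggy bookkeeping: wraps to ULONG_MAX */

    unsigned long divisor = load_avg + 1;   /* wraps again, back to 0 */
    printf("load_avg = %lu, divisor = %lu\n", load_avg, divisor);

    if (divisor == 0)              /* the kernel trusts the guard and traps */
        puts("divide would trap here: the divide error in the trace");
    else
        printf("ratio = %lu\n", 100 / divisor);
    return 0;
}

If this is the same issue, what appears to be the matching upstream fix ("sched/fair: Fix cfs_rq avg tracking underflow", mid-2016) suggests a newer kernel, rather than any Ceph or hardware change, would be the remedy.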

#1

Updated by David Galloway over 7 years ago

  • Status changed from New to In Progress

Assuming I'm actually polling the disk correctly, OSD 65's disk is fine. It's a mystery why this particular OSD keeps failing. I updated the host's packages and rebooted it yesterday.
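
For reference, a minimal C sketch of one way to poll a drive's SMART health by shelling out to smartctl (the device path /dev/sda is a placeholder, and this is a hypothetical check, not necessarily how apama001 is monitored):

/* Hypothetical health poll: run `smartctl -H` on a placeholder device
 * and look for the "PASSED" verdict in its output. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *p = popen("smartctl -H /dev/sda", "r");
    if (!p) {
        perror("popen");
        return 1;
    }

    char line[256];
    int healthy = 0;
    while (fgets(line, sizeof(line), p)) {
        /* A healthy ATA drive reports "... test result: PASSED". */
        if (strstr(line, "PASSED"))
            healthy = 1;
    }
    pclose(p);

    puts(healthy ? "disk reports healthy" : "disk did NOT report PASSED");
    return !healthy;
}

An OSD that keeps failing on a disk that polls healthy points away from the media and toward the host itself, consistent with the kernel oops above.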

#2

Updated by David Galloway over 7 years ago

  • Status changed from In Progress to Closed

This host is being reinstalled with RHEL7 and repurposed as a Docker host.
