Bug #970

closed

Kernel crash (cause?: lots of small files)

Added by DongJin Lee about 13 years ago. Updated almost 13 years ago.

Status:
Resolved
Priority:
High
Assignee:
-
Category:
libceph
Target version:
% Done:
0%


Description

ceph: 0.25.2
client: 2.6.38.1-2
default setups, 2 nodes 3 osd each

While copying (writing) a large number of small files (e.g., ~600 million files of 2 KB each = 12 GB),
the ceph client crashes with the error messages below (the copy had been running for about 10 minutes):

Apr 4 11:37:42 ss4 kernel: [ 5625.150841] libceph: mon0 192.168.1.4:6789 session established
(...........file copying...............)
Apr 4 11:46:03 ss4 kernel: [ 6125.643426] libceph: msg_new can't allocate 512 bytes
Apr 4 11:46:03 ss4 kernel: [ 6125.643430] libceph: msg_new can't create type 0 front 512
Apr 4 11:46:03 ss4 kernel: [ 6125.643432] libceph: msgpool osd_op_reply alloc failed
Apr 4 11:46:03 ss4 kernel: [ 6125.643435] libceph: msg_new can't allocate 4096 bytes
Apr 4 11:46:03 ss4 kernel: [ 6125.643436] libceph: msg_new can't create type 0 front 4096
Apr 4 11:46:03 ss4 kernel: [ 6125.643437] libceph: msgpool osd_op alloc failed
Apr 4 11:46:03 ss4 kernel: [ 6125.644350] libceph: msg_new can't allocate 4096 bytes
Apr 4 11:46:03 ss4 kernel: [ 6125.644376] general protection fault: 0000 [#1] SMP

I am unsure whether this is due to the PGs being full.
The attached file has the log.

Thanks a lot


Files

kdump (12.6 KB), DongJin Lee, 04/03/2011 07:43 PM
Actions #1

Updated by Greg Farnum about 13 years ago

Well the first line there is "can't allocate 512 bytes"...looks like you ran out of memory. Was there memory pressure on this machine for some reason?

Actions #2

Updated by Sage Weil about 13 years ago

  • Project changed from Ceph to Linux kernel client
  • Category changed from 24 to libceph
  • Target version deleted (v0.25.3)
Actions #3

Updated by Sage Weil about 13 years ago

  • Target version set to v2.6.39

probably a memory leak?

Actions #4

Updated by Sage Weil almost 13 years ago

  • Story points set to 3
  • Position set to 1
  • Position changed from 1 to 562
Actions #5

Updated by DongJin Lee almost 13 years ago

[ 9326.513556] libceph: client4117 fsid 4f209795-4899-8134-6829-f952b295d00e
[ 9326.513682] libceph: mon0 192.168.1.4:6789 session established
[ 9905.382664] libceph: msg_new can't allocate 4096 bytes
[ 9905.382690] general protection fault: 0000 [#1] SMP 
[ 9905.382786] last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
[ 9905.382903] CPU 1 
[ 9905.382939] Modules linked in: ceph fuse w83793 w83627hf hwmon_vid coretemp edd bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf loop dm_mod sr_mod cdrom sg iTCO_wdt shpchp pci_hotplug igb ioatdma mptctl iTCO_vendor_support ghes i5400_edac edac_core dca i2c_i801 i5k_amb pcspkr container hed button ext4 jbd2 crc16 uhci_hcd radeon ttm drm_kms_helper mptsas mptscsih ehci_hcd drm i2c_algo_bit mptbase scsi_transport_sas usbcore fan thermal processor thermal_sys
[ 9905.384009] 
[ 9905.384009] Pid: 12563, comm: flush-ceph-14 Not tainted 2.6.38.1-2-default #1 Supermicro X7DW3/X7DWN+
[ 9905.384009] RIP: 0010:[<ffffffff812557c2>]  [<ffffffff812557c2>] kref_put+0x22/0x70
[ 9905.384009] RSP: 0018:ffff88011d981770  EFLAGS: 00010202
[ 9905.384009] RAX: ffff8800cb597a38 RBX: 8000000000000000 RCX: 0000000000001c3f
[ 9905.384009] RDX: ffffffff81a15501 RSI: ffffffff8149ba10 RDI: 8000000000000000
[ 9905.384009] RBP: ffffffff8149ba10 R08: 000000000001322d R09: 000000000000000a
[ 9905.384009] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000011200
[ 9905.384009] R13: 0000000000001000 R14: 0000000000000000 R15: ffff8800cb597a48
[ 9905.384009] FS:  0000000000000000(0000) GS:ffff8800cfd00000(0000) knlGS:0000000000000000
[ 9905.384009] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 9905.384009] CR2: 00007fdd88d59000 CR3: 000000012540e000 CR4: 00000000000006e0
[ 9905.384009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9905.384009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 9905.384009] Process flush-ceph-14 (pid: 12563, threadinfo ffff88011d980000, task ffff8801206e6280)
[ 9905.384009] Stack:
[ 9905.384009]  ffffffff00000010 ffff8800cb597a48 ffff8800cb5979c0 ffffffff8149b7e0
[ 9905.384009]  ffff88012ffedc00 ffff8800cb597a48 ffffffff8149b7a0 ffffffff812557d3
[ 9905.384009]  0000000000000000 ffff8800cb5979c0 0000000000001000 ffffffff81496114
[ 9905.384009] Call Trace:
[ 9905.384009]  [<ffffffff8149b7e0>] ceph_msg_last_put+0x40/0x100
[ 9905.384009]  [<ffffffff812557d3>] kref_put+0x33/0x70
[ 9905.384009]  [<ffffffff81496114>] ceph_msg_new+0x124/0x250
[ 9905.384009]  [<ffffffff8149b8bd>] alloc_fn+0x1d/0x50
[ 9905.384009]  [<ffffffff810f5169>] mempool_alloc+0x59/0x130
[ 9905.384009]  [<ffffffff814a1579>] ceph_osdc_alloc_request+0x1c9/0x2e0
[ 9905.384009]  [<ffffffff814a1779>] ceph_osdc_new_request+0xe9/0x210
[ 9905.384009]  [<ffffffffa047ba72>] ceph_writepages_start+0x742/0x11d0 [ceph]
[ 9905.384009]  [<ffffffff81172950>] writeback_single_inode+0x90/0x230
[ 9905.384009]  [<ffffffff81172d15>] generic_writeback_sb_inodes+0xd5/0x160
[ 9905.384009]  [<ffffffff81173050>] writeback_inodes_wb+0x1b0/0x1c0
[ 9905.384009]  [<ffffffff811732d5>] wb_writeback+0x275/0x350
[ 9905.384009]  [<ffffffff81173e81>] wb_do_writeback+0xa1/0x220
[ 9905.384009]  [<ffffffff8117409a>] bdi_writeback_thread+0x9a/0x270
[ 9905.384009]  [<ffffffff81078516>] kthread+0x96/0xa0
[ 9905.384009]  [<ffffffff81003c44>] kernel_thread_helper+0x4/0x10
[ 9905.384009] Code: 2e 0f 1f 84 00 00 00 00 00 48 83 ec 18 48 85 f6 48 89 5c 24 08 48 89 6c 24 10 48 89 fb 48 89 f5 74 41 48 81 fe d0 ba 13 81 74 25 <f0> ff 0b 0f 94 c2 31 c0 84 d2 74 0a 48 89 df ff d5 b8 01 00 00 
[ 9905.384009] RIP  [<ffffffff812557c2>] kref_put+0x22/0x70
[ 9905.384009]  RSP <ffff88011d981770>
[ 9905.413309] ---[ end trace b020ccf21254f339 ]---

A slightly different error this time; any ideas? The data set is about 200 files totalling 10 GB. The fault happened while reading those files. Could it be due to heat or CPU frequency scaling?

Thanks

Actions #6

Updated by Sage Weil almost 13 years ago

  • Position deleted (562)
  • Position set to 7
Actions #7

Updated by Fyodor Ustinov almost 13 years ago

I see the same crash.

[ 1877.996453] libceph: msg_new can't allocate 512 bytes
[ 1877.996607] libceph: msg_new can't create type 0 front 512
[ 1877.996654] libceph: msgpool osd_op_reply alloc failed
[ 1877.996710] libceph: msg_new can't allocate 512 bytes
[ 1877.996766] general protection fault: 0000 [#1] SMP
[ 1877.996811] last sysfs file: /sys/devices/virtual/bdi/ceph-1/uevent
[ 1877.996863] CPU 3
[ 1877.996880] Modules linked in: ceph libceph libcrc32c ppdev vmw_balloon psmouse serio_raw i2c_piix4 parport_pc shpchp lp parport mptspi mptscsih e1000 vmw_pvscsi floppy mptbase vmxnet3
[ 1877.998339]
[ 1877.998354] Pid: 892, comm: flush-ceph-1 Not tainted 2.6.38-8-server #42-Ubuntu VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[ 1877.998476] RIP: 0010:[<ffffffff812dd325>] [<ffffffff812dd325>] kref_put+0x25/0x70
[ 1877.998555] RSP: 0018:ffff8800b14ed6e0 EFLAGS: 00010206
[ 1877.998600] RAX: ffff8800b169d4f8 RBX: 0001000001e4ce01 RCX: 00000000ffffffff
[ 1877.998674] RDX: 0000000000000001 RSI: ffffffffa00ebaa0 RDI: 0001000001e4ce01
[ 1877.998734] RBP: ffff8800b14ed6f0 R08: 0000000000000000 R09: ffffffff816423e0
[ 1877.998793] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffa00ebaa0
[ 1877.999006] R13: 0000000000011200 R14: 0000000000000200 R15: 0000000000000000
[ 1877.999132] FS: 0000000000000000(0000) GS:ffff8800bb580000(0000) knlGS:0000000000000000
[ 1877.999338] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1877.999437] CR2: 0000000003fd7000 CR3: 00000000b2f31000 CR4: 00000000000006e0
[ 1877.999579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1877.999715] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1877.999827] Process flush-ceph-1 (pid: 892, threadinfo ffff8800b14ec000, task ffff8800b0eddb80)
[ 1878.000001] Stack:
[ 1878.000069] ffff8800b169d508 ffff8800b169d480 ffff8800b14ed710 ffffffffa00eb88a
[ 1878.000238] ffff8800b169d508 ffffffffa00eb850 ffff8800b14ed730 ffffffff812dd337
[ 1878.000409] ffff8800b169d480 0000000000000200 ffff8800b14ed780 ffffffffa00ea9bf
[ 1878.000578] Call Trace:
[ 1878.000658] [<ffffffffa00eb88a>] ceph_msg_last_put+0x3a/0xd0 [libceph]
[ 1878.000768] [<ffffffffa00eb850>] ? ceph_msg_last_put+0x0/0xd0 [libceph]
[ 1878.000877] [<ffffffff812dd337>] kref_put+0x37/0x70
[ 1878.000971] [<ffffffffa00ea9bf>] ceph_msg_new+0x20f/0x230 [libceph]
[ 1878.001087] [<ffffffffa00eb945>] alloc_fn+0x25/0x50 [libceph]
[ 1878.001189] [<ffffffff8110de03>] mempool_alloc+0x53/0x130
[ 1878.001286] [<ffffffff8110de62>] ? mempool_alloc+0xb2/0x130
[ 1878.001387] [<ffffffff812e41c9>] ? vsnprintf+0x479/0x620
[ 1878.001486] [<ffffffffa00eba35>] ceph_msgpool_get+0x25/0x60 [libceph]
[ 1878.001601] [<ffffffffa00f0036>] ceph_osdc_alloc_request+0x266/0x310 [libceph]
[ 1878.001768] [<ffffffffa00f01a7>] ceph_osdc_new_request+0xc7/0x1d0 [libceph]
[ 1878.001881] [<ffffffffa00ee422>] ? __map_osds+0xd2/0x3a0 [libceph]
[ 1878.001987] [<ffffffff8110b9d0>] ? find_get_pages_tag+0x40/0x120
[ 1878.002090] [<ffffffff815d779e>] ? _raw_spin_lock+0xe/0x20
[ 1878.002196] [<ffffffffa011cc41>] ceph_writepages_start+0x681/0x970 [ceph]
[ 1878.002310] [<ffffffff8105e463>] ? balance_tasks+0x103/0x1b0
[ 1878.002412] [<ffffffff815d779e>] ? _raw_spin_lock+0xe/0x20
[ 1878.002513] [<ffffffff811167a1>] do_writepages+0x21/0x40
[ 1878.002612] [<ffffffff8118b7af>] writeback_single_inode+0x9f/0x240
[ 1878.002719] [<ffffffff8118bb8b>] writeback_sb_inodes+0xcb/0x160
[ 1878.002821] [<ffffffff8118bddb>] writeback_inodes_wb+0x10b/0x1c0
[ 1878.002923] [<ffffffff8118c20e>] wb_writeback+0x37e/0x490
[ 1878.003019] [<ffffffff815d794f>] ? _raw_spin_lock_irqsave+0x2f/0x40
[ 1878.003124] [<ffffffff81074ceb>] ? lock_timer_base.clone.20+0x3b/0x70
[ 1878.003229] [<ffffffff8118c541>] wb_do_writeback+0x221/0x230
[ 1878.003335] [<ffffffff8118c5d2>] bdi_writeback_thread+0x82/0x260
[ 1878.003437] [<ffffffff8118c550>] ? bdi_writeback_thread+0x0/0x260
[ 1878.003541] [<ffffffff810871f6>] kthread+0x96/0xa0
[ 1878.003634] [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
[ 1878.003735] [<ffffffff81087160>] ? kthread+0x0/0xa0
[ 1878.003826] [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
[ 1878.003927] Code: e8 eb a7 0f 1f 00 55 48 89 e5 48 83 ec 10 48 85 f6 48 89 1c 24 4c 89 64 24 08 48 89 fb 49 89 f4 74 3e 48 81 fe a0 59 15 81 74 22 <f0> ff 0b 0f 94 c2 31 c0 84 d2 74 0b 48 89 df 41 ff d4 b8 01 00
[ 1878.004372] RIP [<ffffffff812dd325>] kref_put+0x25/0x70
[ 1878.004488] RSP <ffff8800b14ed6e0>
[ 1878.004832] ---[ end trace 9d4e026fc60681d0 ]---
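
One observation that may help when reading the two oopses above: in both, the faulting instruction bytes `<f0> ff 0b` decode to `lock decl (%rbx)`, and RBX holds 8000000000000000 in the first trace and 0001000001e4ce01 in the second. Neither value looks like a valid kernel pointer. Assuming 48-bit virtual addressing (4-level paging on x86-64), both are non-canonical addresses, which would be consistent with the CPU raising a general protection fault rather than a page fault. The helper below is a hypothetical user-space illustration of that check, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel): returns true if addr is
 * canonical under 48-bit x86-64 virtual addressing, i.e. bits 63..47 are
 * either all zero or all one.
 */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63..47 */
    return top == 0 || top == 0x1FFFF;  /* all clear or all set */
}
```

Calling `is_canonical_48` on the two RBX values from the traces returns false, while a pointer such as RAX from the first trace (ffff8800cb597a38) passes the check. That suggests kref_put() was handed a value that was never a real pointer, for example an uninitialized field.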

Actions #8

Updated by Anonymous almost 13 years ago

  • Status changed from New to Resolved

Fixed in ceph-client.git master

commit 56f63aeb6360fb3ba9584bd5b094d55283a9e332
Author: Henry C Chang <henry.cy.chang@gmail.com>
Date:   2011-05-03 02:29:56 +0000

    libceph: fix ceph_msg_new error path

    If memory allocation failed, calling ceph_msg_put() will cause GPF
    since some of ceph_msg variables are not initialized first.

    Fix Bug #970.

    Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
    Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
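
The commit message describes a general error-path rule: if a constructor cleans up through the normal "put" routine, everything that routine reads must be initialized before the first step that can fail. The following is a user-space sketch of that rule under stated assumptions, not the actual libceph patch. All names (`struct msg`, `msg_new`, `msg_put`, the `fail_front` flag, the `released` counter) are invented for illustration, and a plain int stands in for the kernel's kref.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a message object; not the real struct ceph_msg. */
struct msg {
    int    refcount;   /* simplified, non-atomic replacement for a kref */
    void  *front;      /* payload buffer, may be NULL */
    size_t front_len;
};

static int released;   /* counts releases so the behaviour can be observed */

/* Release path: reads refcount-protected fields, so they must be valid. */
static void msg_release(struct msg *m)
{
    free(m->front);    /* free(NULL) is a no-op, so a NULL front is safe */
    free(m);
    released++;
}

static void msg_put(struct msg *m)
{
    if (--m->refcount == 0)
        msg_release(m);
}

/* fail_front simulates an allocation failure for the payload buffer. */
static struct msg *msg_new(size_t front_len, int fail_front)
{
    struct msg *m = malloc(sizeof(*m));
    if (!m)
        return NULL;

    /* Initialize every field the release path touches BEFORE anything
     * that can fail; otherwise msg_put() would operate on garbage. */
    memset(m, 0, sizeof(*m));
    m->refcount = 1;
    m->front_len = front_len;

    if (front_len) {
        m->front = fail_front ? NULL : malloc(front_len);
        if (!m->front)
            goto out_put;
    }
    return m;

out_put:
    msg_put(m);        /* safe: refcount and front are already initialized */
    return NULL;
}
```

With this ordering, `msg_new(512, 1)` returns NULL and releases the half-built object exactly once. If the `refcount` assignment were moved below the payload allocation, the error path would decrement an uninitialized counter, which is the kind of failure the traces in this ticket show inside kref_put().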
Actions #9

Updated by Anonymous almost 13 years ago

Correction: the fix is commit ca20892db7567c40e8ed0668f46cf0d085d7db6d in the for-linus branch instead.
