Bug #970

Kernel crash (cause?: lots of small files)

Added by DongJin Lee over 8 years ago. Updated over 8 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: libceph
Target version:
Start date: 04/03/2011
Due date:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:

Description

ceph: 0.25.2
client: 2.6.38.1-2
default setups, 2 nodes 3 osd each

While copying (writing) a large number of small files (e.g., ~600 millions of files sized 2 KB = 12 GB),
the ceph client crashes with the error messages below (after copying for about 10 minutes):

Apr 4 11:37:42 ss4 kernel: [ 5625.150841] libceph: mon0 192.168.1.4:6789 session established
(...........file copying...............)
Apr 4 11:46:03 ss4 kernel: [ 6125.643426] libceph: msg_new can't allocate 512 bytes
Apr 4 11:46:03 ss4 kernel: [ 6125.643430] libceph: msg_new can't create type 0 front 512
Apr 4 11:46:03 ss4 kernel: [ 6125.643432] libceph: msgpool osd_op_reply alloc failed
Apr 4 11:46:03 ss4 kernel: [ 6125.643435] libceph: msg_new can't allocate 4096 bytes
Apr 4 11:46:03 ss4 kernel: [ 6125.643436] libceph: msg_new can't create type 0 front 4096
Apr 4 11:46:03 ss4 kernel: [ 6125.643437] libceph: msgpool osd_op alloc failed
Apr 4 11:46:03 ss4 kernel: [ 6125.644350] libceph: msg_new can't allocate 4096 bytes
Apr 4 11:46:03 ss4 kernel: [ 6125.644376] general protection fault: 0000 [#1] SMP

I am unsure whether this is due to the PGs being full.
The attached file has the log.

Thanks a lot

kdump (12.6 KB) DongJin Lee, 04/03/2011 07:43 PM

History

#1 Updated by Greg Farnum over 8 years ago

Well the first line there is "can't allocate 512 bytes"...looks like you ran out of memory. Was there memory pressure on this machine for some reason?

#2 Updated by Sage Weil over 8 years ago

  • Project changed from Ceph to Linux kernel client
  • Category changed from 24 to libceph
  • Target version deleted (v0.25.3)

#3 Updated by Sage Weil over 8 years ago

  • Target version set to v2.6.39

probably a memory leak?

#4 Updated by Sage Weil over 8 years ago

  • Story points set to 3
  • Position set to 1
  • Position changed from 1 to 562

#5 Updated by DongJin Lee over 8 years ago

[ 9326.513556] libceph: client4117 fsid 4f209795-4899-8134-6829-f952b295d00e
[ 9326.513682] libceph: mon0 192.168.1.4:6789 session established
[ 9905.382664] libceph: msg_new can't allocate 4096 bytes
[ 9905.382690] general protection fault: 0000 [#1] SMP 
[ 9905.382786] last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
[ 9905.382903] CPU 1 
[ 9905.382939] Modules linked in: ceph fuse w83793 w83627hf hwmon_vid coretemp edd bonding cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf loop dm_mod sr_mod cdrom sg iTCO_wdt shpchp pci_hotplug igb ioatdma mptctl iTCO_vendor_support ghes i5400_edac edac_core dca i2c_i801 i5k_amb pcspkr container hed button ext4 jbd2 crc16 uhci_hcd radeon ttm drm_kms_helper mptsas mptscsih ehci_hcd drm i2c_algo_bit mptbase scsi_transport_sas usbcore fan thermal processor thermal_sys
[ 9905.384009] 
[ 9905.384009] Pid: 12563, comm: flush-ceph-14 Not tainted 2.6.38.1-2-default #1 Supermicro X7DW3/X7DWN+
[ 9905.384009] RIP: 0010:[<ffffffff812557c2>]  [<ffffffff812557c2>] kref_put+0x22/0x70
[ 9905.384009] RSP: 0018:ffff88011d981770  EFLAGS: 00010202
[ 9905.384009] RAX: ffff8800cb597a38 RBX: 8000000000000000 RCX: 0000000000001c3f
[ 9905.384009] RDX: ffffffff81a15501 RSI: ffffffff8149ba10 RDI: 8000000000000000
[ 9905.384009] RBP: ffffffff8149ba10 R08: 000000000001322d R09: 000000000000000a
[ 9905.384009] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000011200
[ 9905.384009] R13: 0000000000001000 R14: 0000000000000000 R15: ffff8800cb597a48
[ 9905.384009] FS:  0000000000000000(0000) GS:ffff8800cfd00000(0000) knlGS:0000000000000000
[ 9905.384009] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 9905.384009] CR2: 00007fdd88d59000 CR3: 000000012540e000 CR4: 00000000000006e0
[ 9905.384009] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9905.384009] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 9905.384009] Process flush-ceph-14 (pid: 12563, threadinfo ffff88011d980000, task ffff8801206e6280)
[ 9905.384009] Stack:
[ 9905.384009]  ffffffff00000010 ffff8800cb597a48 ffff8800cb5979c0 ffffffff8149b7e0
[ 9905.384009]  ffff88012ffedc00 ffff8800cb597a48 ffffffff8149b7a0 ffffffff812557d3
[ 9905.384009]  0000000000000000 ffff8800cb5979c0 0000000000001000 ffffffff81496114
[ 9905.384009] Call Trace:
[ 9905.384009]  [<ffffffff8149b7e0>] ceph_msg_last_put+0x40/0x100
[ 9905.384009]  [<ffffffff812557d3>] kref_put+0x33/0x70
[ 9905.384009]  [<ffffffff81496114>] ceph_msg_new+0x124/0x250
[ 9905.384009]  [<ffffffff8149b8bd>] alloc_fn+0x1d/0x50
[ 9905.384009]  [<ffffffff810f5169>] mempool_alloc+0x59/0x130
[ 9905.384009]  [<ffffffff814a1579>] ceph_osdc_alloc_request+0x1c9/0x2e0
[ 9905.384009]  [<ffffffff814a1779>] ceph_osdc_new_request+0xe9/0x210
[ 9905.384009]  [<ffffffffa047ba72>] ceph_writepages_start+0x742/0x11d0 [ceph]
[ 9905.384009]  [<ffffffff81172950>] writeback_single_inode+0x90/0x230
[ 9905.384009]  [<ffffffff81172d15>] generic_writeback_sb_inodes+0xd5/0x160
[ 9905.384009]  [<ffffffff81173050>] writeback_inodes_wb+0x1b0/0x1c0
[ 9905.384009]  [<ffffffff811732d5>] wb_writeback+0x275/0x350
[ 9905.384009]  [<ffffffff81173e81>] wb_do_writeback+0xa1/0x220
[ 9905.384009]  [<ffffffff8117409a>] bdi_writeback_thread+0x9a/0x270
[ 9905.384009]  [<ffffffff81078516>] kthread+0x96/0xa0
[ 9905.384009]  [<ffffffff81003c44>] kernel_thread_helper+0x4/0x10
[ 9905.384009] Code: 2e 0f 1f 84 00 00 00 00 00 48 83 ec 18 48 85 f6 48 89 5c 24 08 48 89 6c 24 10 48 89 fb 48 89 f5 74 41 48 81 fe d0 ba 13 81 74 25 <f0> ff 0b 0f 94 c2 31 c0 84 d2 74 0a 48 89 df ff d5 b8 01 00 00 
[ 9905.384009] RIP  [<ffffffff812557c2>] kref_put+0x22/0x70
[ 9905.384009]  RSP <ffff88011d981770>
[ 9905.413309] ---[ end trace b020ccf21254f339 ]---

A slightly different error this time; any ideas? This time the files total 10 GB, about 200 files. The fault happened while reading those files. Could it be due to heat or CPU frequency scaling?

thanks

#6 Updated by Sage Weil over 8 years ago

  • Position deleted (562)
  • Position set to 7

#7 Updated by Fyodor Ustinov over 8 years ago

The same.

[ 1877.996453] libceph: msg_new can't allocate 512 bytes
[ 1877.996607] libceph: msg_new can't create type 0 front 512
[ 1877.996654] libceph: msgpool osd_op_reply alloc failed
[ 1877.996710] libceph: msg_new can't allocate 512 bytes
[ 1877.996766] general protection fault: 0000 [#1] SMP
[ 1877.996811] last sysfs file: /sys/devices/virtual/bdi/ceph-1/uevent
[ 1877.996863] CPU 3
[ 1877.996880] Modules linked in: ceph libceph libcrc32c ppdev vmw_balloon psmouse serio_raw i2c_piix4 parport_pc shpchp lp parport mptspi mptscsih e1000 vmw_pvscsi floppy mptbase vmxnet3
[ 1877.998339]
[ 1877.998354] Pid: 892, comm: flush-ceph-1 Not tainted 2.6.38-8-server #42-Ubuntu VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform
[ 1877.998476] RIP: 0010:[<ffffffff812dd325>] [<ffffffff812dd325>] kref_put+0x25/0x70
[ 1877.998555] RSP: 0018:ffff8800b14ed6e0 EFLAGS: 00010206
[ 1877.998600] RAX: ffff8800b169d4f8 RBX: 0001000001e4ce01 RCX: 00000000ffffffff
[ 1877.998674] RDX: 0000000000000001 RSI: ffffffffa00ebaa0 RDI: 0001000001e4ce01
[ 1877.998734] RBP: ffff8800b14ed6f0 R08: 0000000000000000 R09: ffffffff816423e0
[ 1877.998793] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffa00ebaa0
[ 1877.999006] R13: 0000000000011200 R14: 0000000000000200 R15: 0000000000000000
[ 1877.999132] FS: 0000000000000000(0000) GS:ffff8800bb580000(0000) knlGS:0000000000000000
[ 1877.999338] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1877.999437] CR2: 0000000003fd7000 CR3: 00000000b2f31000 CR4: 00000000000006e0
[ 1877.999579] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1877.999715] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1877.999827] Process flush-ceph-1 (pid: 892, threadinfo ffff8800b14ec000, task ffff8800b0eddb80)
[ 1878.000001] Stack:
[ 1878.000069] ffff8800b169d508 ffff8800b169d480 ffff8800b14ed710 ffffffffa00eb88a
[ 1878.000238] ffff8800b169d508 ffffffffa00eb850 ffff8800b14ed730 ffffffff812dd337
[ 1878.000409] ffff8800b169d480 0000000000000200 ffff8800b14ed780 ffffffffa00ea9bf
[ 1878.000578] Call Trace:
[ 1878.000658] [<ffffffffa00eb88a>] ceph_msg_last_put+0x3a/0xd0 [libceph]
[ 1878.000768] [<ffffffffa00eb850>] ? ceph_msg_last_put+0x0/0xd0 [libceph]
[ 1878.000877] [<ffffffff812dd337>] kref_put+0x37/0x70
[ 1878.000971] [<ffffffffa00ea9bf>] ceph_msg_new+0x20f/0x230 [libceph]
[ 1878.001087] [<ffffffffa00eb945>] alloc_fn+0x25/0x50 [libceph]
[ 1878.001189] [<ffffffff8110de03>] mempool_alloc+0x53/0x130
[ 1878.001286] [<ffffffff8110de62>] ? mempool_alloc+0xb2/0x130
[ 1878.001387] [<ffffffff812e41c9>] ? vsnprintf+0x479/0x620
[ 1878.001486] [<ffffffffa00eba35>] ceph_msgpool_get+0x25/0x60 [libceph]
[ 1878.001601] [<ffffffffa00f0036>] ceph_osdc_alloc_request+0x266/0x310 [libceph]
[ 1878.001768] [<ffffffffa00f01a7>] ceph_osdc_new_request+0xc7/0x1d0 [libceph]
[ 1878.001881] [<ffffffffa00ee422>] ? __map_osds+0xd2/0x3a0 [libceph]
[ 1878.001987] [<ffffffff8110b9d0>] ? find_get_pages_tag+0x40/0x120
[ 1878.002090] [<ffffffff815d779e>] ? _raw_spin_lock+0xe/0x20
[ 1878.002196] [<ffffffffa011cc41>] ceph_writepages_start+0x681/0x970 [ceph]
[ 1878.002310] [<ffffffff8105e463>] ? balance_tasks+0x103/0x1b0
[ 1878.002412] [<ffffffff815d779e>] ? _raw_spin_lock+0xe/0x20
[ 1878.002513] [<ffffffff811167a1>] do_writepages+0x21/0x40
[ 1878.002612] [<ffffffff8118b7af>] writeback_single_inode+0x9f/0x240
[ 1878.002719] [<ffffffff8118bb8b>] writeback_sb_inodes+0xcb/0x160
[ 1878.002821] [<ffffffff8118bddb>] writeback_inodes_wb+0x10b/0x1c0
[ 1878.002923] [<ffffffff8118c20e>] wb_writeback+0x37e/0x490
[ 1878.003019] [<ffffffff815d794f>] ? _raw_spin_lock_irqsave+0x2f/0x40
[ 1878.003124] [<ffffffff81074ceb>] ? lock_timer_base.clone.20+0x3b/0x70
[ 1878.003229] [<ffffffff8118c541>] wb_do_writeback+0x221/0x230
[ 1878.003335] [<ffffffff8118c5d2>] bdi_writeback_thread+0x82/0x260
[ 1878.003437] [<ffffffff8118c550>] ? bdi_writeback_thread+0x0/0x260
[ 1878.003541] [<ffffffff810871f6>] kthread+0x96/0xa0
[ 1878.003634] [<ffffffff8100cde4>] kernel_thread_helper+0x4/0x10
[ 1878.003735] [<ffffffff81087160>] ? kthread+0x0/0xa0
[ 1878.003826] [<ffffffff8100cde0>] ? kernel_thread_helper+0x0/0x10
[ 1878.003927] Code: e8 eb a7 0f 1f 00 55 48 89 e5 48 83 ec 10 48 85 f6 48 89 1c 24 4c 89 64 24 08 48 89 fb 49 89 f4 74 3e 48 81 fe a0 59 15 81 74 22 <f0> ff 0b 0f 94 c2 31 c0 84 d2 74 0b 48 89 df 41 ff d4 b8 01 00
[ 1878.004372] RIP [<ffffffff812dd325>] kref_put+0x25/0x70
[ 1878.004488] RSP <ffff8800b14ed6e0>
[ 1878.004832] ---[ end trace 9d4e026fc60681d0 ]---
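
One possible reading of the register dumps in the two traces above: the faulting bytes `<f0> ff 0b` decode to `lock decl (%rbx)`, and in both reports RBX (8000000000000000 and 0001000001e4ce01) does not look like a valid kernel pointer. Assuming 48-bit virtual addressing, neither value is a canonical x86-64 address, which would produce a general protection fault rather than an ordinary page fault. The helper below is a hypothetical illustration of that check, not kernel code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is canonical under 48-bit virtual
 * addressing (an assumption about these machines), i.e. bits 63..47
 * are either all zeros or all ones. */
bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the upper 17 bits */
    return top == 0 || top == 0x1FFFF;
}
```

With the values from the traces, `is_canonical48(0x8000000000000000ULL)` and `is_canonical48(0x0001000001e4ce01ULL)` are both false, while a normal-looking pointer from the same dump such as RAX `0xffff8800cb597a38` passes the check.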

#8 Updated by Anonymous over 8 years ago

  • Status changed from New to Resolved

Fixed in ceph-client.git master

commit 56f63aeb6360fb3ba9584bd5b094d55283a9e332
Author: Henry C Chang <henry.cy.chang@gmail.com>
Date:   2011-05-03 02:29:56 +0000

    libceph: fix ceph_msg_new error path

    If memory allocation failed, calling ceph_msg_put() will cause GPF
    since some of ceph_msg variables are not initialized first.

    Fix Bug #970.

    Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com>
    Signed-off-by: Tommi Virtanen <tommi.virtanen@dreamhost.com>
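
The pattern the commit message describes can be sketched in userspace C. This is a simplified model under stated assumptions, not the actual patch: `struct msg`, `struct buf`, `msg_new`, `msg_put`, `buf_put` and `poison_alloc` are hypothetical names, and the poison byte only makes "uninitialized" memory deterministic. The point is that every field the release function inspects is set before the first allocation that can fail, so the error path can drop its reference safely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted attachment, standing in for an optional buffer. */
struct buf {
    int refs;
};

/* Hypothetical message object; NOT the kernel's struct ceph_msg. */
struct msg {
    int refs;
    struct buf *middle;   /* optional, may legitimately stay NULL */
    void *front;
    size_t front_len;
};

/* malloc() that fills the memory with a poison byte, so fields that are
 * never initialized hold deterministic garbage instead of lucky zeros. */
void *poison_alloc(size_t n)
{
    void *p = malloc(n);
    if (p)
        memset(p, 0x6b, n);
    return p;
}

void buf_put(struct buf *b)
{
    if (--b->refs == 0)
        free(b);
}

/* Drop one reference; on the last one, release everything the message
 * owns. This trusts every field, so every field must be valid by now. */
void msg_put(struct msg *m)
{
    assert(m->refs > 0);
    if (--m->refs > 0)
        return;
    if (m->middle)
        buf_put(m->middle);
    free(m->front);
    free(m);
}

/* fail_front != 0 simulates the "can't allocate N bytes" case. */
struct msg *msg_new(size_t front_len, int fail_front)
{
    struct msg *m = poison_alloc(sizeof(*m));
    if (!m)
        return NULL;

    /* Initialize everything msg_put() looks at BEFORE the first step
     * that can fail. */
    m->refs = 1;
    m->middle = NULL;
    m->front = NULL;
    m->front_len = 0;

    m->front = fail_front ? NULL : poison_alloc(front_len);
    if (!m->front) {
        msg_put(m);       /* safe: no field holds garbage */
        return NULL;
    }
    m->front_len = front_len;
    return m;
}
```

In this model, `msg_new(4096, 1)` returns NULL cleanly. If the four initializing assignments were moved below the front allocation, `msg_put()` would run on poisoned fields and follow a garbage `middle` pointer, which is the kind of failure the traces in this report show.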

#9 Updated by Anonymous over 8 years ago

Make that ca20892db7567c40e8ed0668f46cf0d085d7db6d in for-linus instead.
