Project

General

Profile

Bug #4698

osd suicide timed out after 150

Added by Sam Lang almost 11 years ago. Updated almost 11 years ago.

Status:
Won't Fix
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With the following yaml, after running for a while osds start to commit suicide. This is with the latest next branch. Attached is a log from one of the osds.

roles:
- [mon.a, osd.0, osd.1, osd.2]
- [mon.b, osd.3, osd.4, osd.5]
- [mon.c, mds.a, client.0]

tasks:
- install:
branch: next
- ceph:
conf:
client:
ms_inject_delay_type: osd
ms_inject_delay_max: 1
ms_inject_delay_probability: 0.2
- kclient:
- workunit:
clients:
all:
- kernel_untar_build.sh

-5> 2013-04-10 08:43:34.626902 7f722c50b700  1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f7200ff9700' had timed out after 15
-4> 2013-04-10 08:43:34.626920 7f722c50b700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f72007f8700' had timed out after 15
-3> 2013-04-10 08:43:35.179233 7f71fe7f4700 5 osd.5 7 heartbeat: osd_stat(63965 MB used, 365 GB avail, 450 GB total, peers []/[])
-2> 2013-04-10 08:43:39.627074 7f722c50b700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f7200ff9700' had timed out after 15
-1> 2013-04-10 08:43:39.627098 7f722c50b700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f7200ff9700' had suicide timed out after 150
0> 2013-04-10 08:43:39.628723 7f722c50b700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f722c50b700 time 2013-04-10 08:43:39.627107
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

ceph-osd.5.log View (2.63 MB) Sam Lang, 04/10/2013 08:53 AM

ceph-osd.3.log.bz2 (2.49 MB) Sam Lang, 04/11/2013 03:24 PM

ceph-osd.4.log.bz2 (2.69 MB) Sam Lang, 04/11/2013 03:24 PM

ceph-osd.5.log.bz2 (2.41 MB) Sam Lang, 04/11/2013 03:24 PM

History

#1 Updated by Sam Lang almost 11 years ago

  • Category set to OSD

#2 Updated by Ian Colle almost 11 years ago

  • Assignee set to Samuel Just
  • Priority changed from Normal to Urgent

sam - please assess

#3 Updated by Samuel Just almost 11 years ago

Was there anything in dmseg to indicate a hung fs?

#4 Updated by Sam Lang almost 11 years ago

There's this:

[1358446.830181] #0: (&type->s_umount_key#22){+++++.}, at: [<ffffffff811b4709>] sys_syncfs+0x49/0xa0
[1358446.839328] INFO: task ceph-osd:11663 blocked for more than 120 seconds.
[1358446.846215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1358446.854261] ceph-osd D 0000000000000000 0 11663 11570 0x00000004
[1358446.861553] ffff880032a75c48 0000000000000046 0000000000000000 ffff88020be0deb0
[1358446.869198] ffff88020be0deb0 ffff880032a75fd8 ffff880032a75fd8 ffff880032a75fd8
[1358446.876851] ffff880030e2deb0 ffff88020be0deb0 0000000000000000 ffff880032a75e20
[1358446.884537] Call Trace:
[1358446.887169] [<ffffffff8165cb69>] schedule+0x29/0x70
[1358446.892314] [<ffffffff8165a56d>] schedule_timeout+0x1bd/0x270
[1358446.898338] [<ffffffff810b75e6>] ? mark_held_locks+0x86/0x140
[1358446.904360] [<ffffffff8165de30>] ? _raw_spin_unlock_irq+0x30/0x40
[1358446.910736] [<ffffffff810b77a5>] ? trace_hardirqs_on_caller+0x105/0x190
[1358446.917641] [<ffffffff8165c9d6>] wait_for_common+0xd6/0x160
[1358446.923491] [<ffffffff8108cd20>] ? try_to_wake_up+0x2f0/0x2f0
[1358446.929504] [<ffffffff8165cb3d>] wait_for_completion+0x1d/0x20
[1358446.935624] [<ffffffff811ad05b>] sync_inodes_sb+0x13b/0x270
[1358446.941486] [<ffffffff811b44cb>] __sync_filesystem+0x1b/0x60
[1358446.947439] [<ffffffff811b458b>] sync_filesystem+0x4b/0x70
[1358446.953206] [<ffffffff811b4711>] sys_syncfs+0x51/0xa0
[1358446.958526] [<ffffffff81666e59>] system_call_fastpath+0x16/0x1b
[1358446.964721] 1 lock held by ceph-osd/11663:
[1358446.968998] #0: (&type->s_umount_key#22){+++++.}, at: [<ffffffff811b4709>] sys_syncfs+0x49/0xa0

#5 Updated by Greg Farnum almost 11 years ago

So apparently this is pretty consistently suiciding now and it wasn't before. :/ I was trying to figure out if the message delivery delays could impact it but I don't see any way it could, so no idea what we could have changed to cause this — even the message-count throttler changes, while related, don't have any mechanism I can imagine for making it worse.

#6 Updated by Sam Lang almost 11 years ago

logs from osds hitting suicide timeout with:

debug ms = 1
debug filestore = 20
debug filejournal = 20

osd.3 and osd.4 failed with suicide timeout, osd.5 (running on the same node as osd.3 and osd.4) did not fail.

#7 Updated by Ian Colle almost 11 years ago

Any update on this? Is it still happening?

#8 Updated by Samuel Just almost 11 years ago

This appears to be a filesystem problem with ext4.

#9 Updated by Sage Weil almost 11 years ago

  • Priority changed from Urgent to High

#10 Updated by Tamilarasi muthamizhan almost 11 years ago

logs: ubuntu@teuthology:/a/teuthology-2013-05-13_01:30:03-upgrade-master-testing-basic/12483

#10 0x000000000084274f in ceph::__ceph_assert_fail (assertion=0x999e1f "0 == \"hit suicide timeout\"", file=<optimized out>, line=79, 
    func=0x999fc0 "bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)") at common/assert.cc:77
#11 0x00000000007ac1f7 in ceph::HeartbeatMap::_check (this=<optimized out>, h=<optimized out>, who=0x999eb7 "is_healthy", now=<optimized out>) at common/HeartbeatMap.cc:79
#12 0x00000000007acba7 in ceph::HeartbeatMap::is_healthy (this=0x26e00e0) at common/HeartbeatMap.cc:130
#13 0x00000000007ad153 in ceph::HeartbeatMap::check_touch_file (this=0x26e00e0) at common/HeartbeatMap.cc:141
#14 0x000000000084e9c5 in CephContextServiceThread::entry (this=0x26e6240) at common/ceph_context.cc:68
#15 0x00007f07eda51e9a in start_thread (arg=0x7f07eb0d6700) at pthread_create.c:308
#16 0x00007f07ebbe4cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#17 0x0000000000000000 in ?? ()

#11 Updated by Tamilarasi muthamizhan almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-05-13_01:30:03-upgrade-master-testing-basic/12483$ cat config.yaml 
kernel: &id001
  kdb: true
  sha1: b5b09be30cf99f9c699e825629f02e3bce555d44
machine_type: plana
nuke-on-error: true
overrides:
  ceph:
    conf:
      mon:
        debug mon: 20
        debug ms: 20
        debug paxos: 20
    log-whitelist:
    - slow request
    sha1: db29f49f9eab5fe5bf27d57d623489186ced50ac
  s3tests:
    branch: master
  workunit:
    sha1: db29f49f9eab5fe5bf27d57d623489186ced50ac
roles:
- - mon.a
  - mds.a
  - osd.0
  - osd.1
- - mon.b
  - mon.c
  - osd.2
  - osd.3
- - client.0
targets:
  ubuntu@plana71.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCuvUzb6iEIx1l/b1TF9O/aQ6XFUW/Fn4mri84Q7QfyMBc1NLDkFBBPBfvXv44bYooG1E5VBXTawmVgOtANyGArqXVNYmHxbIoAkKibn2eIUDb2UzrQHsSWK/aqeEPe7AdcBC2FM07hw1DxEQvept1Ou1kLFMumMY11W2HQ7xnMI8v3VJ+MVgxfp2GdFeNfmDCjFLF19s1mKYjbqlEhNJQ8kg1oHOHsTPn5PLeYntgtRfJzOgOOUyG3Vnw8XtOPBOg9BG248jzdqZQxsHC9ZR1TM8xbq88Uz2JU5XKng6pf6mtL82opCZBYsqBIKkbyCLRKtErkQ/LB3o81kme0P+Dz
  ubuntu@plana79.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCx0nVMVy140vXGRPqjqx63mfytPlqmoN7YoJ3Si0ti1XtvJTftB9EdQGwqj/tsY95DeUNBtAQs5TBsiLr1E/JHlKt7EXwyWsJNB2ntvkPJOMxoounypjkVgfv91EWmERQGFsalDmIYjSuSCG28g5Vaz8il9D7fH/ykKZ38EQChhPXIpB2bieJOr2Xm6llde1q2rUEltV17EmiQvu9eUuxb9y9h057k6GSqpsTViPADlT7CG7W60bqWs8d7TvV4rvPhUy6oyUp1ar8116NMSFUiaTgVTidDiQ3xyZeguwJAbzh86MQdHVhSi89W/vjvoEP1opjZP3RArB4BoNwzz/Dh
  ubuntu@plana82.front.sepia.ceph.com: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDoUwiVhSsjFPU46JeiJdYRXuw5Y26MGiIj90va3uLJGz6UPar/mpUkJgcx55uQhpp/8hdCMj5CRBTbLgnKURRX4iGVi9WltSXSGATWQC8ZXcSIOrr919a7pzvnWw9KviHFmKz+kqWby0NhrdHsk6x/bk9S7+fTP23+84aXsSGMS7uavPtjUuidOpUNMyw/QrazNKSE7CDLk21lBPs4dnFV+Ei17odL/opthGCNycuzL73geV9yQ5MpPtzB/wYL72Nbkx069ONmqFVItumN+v6km6XAvzb3QEhO62hqRW9avrW99vVfAhKgdeaEMeLYyR6ZrRCExAZ8Xn4rte0tCJHX
tasks:
- internal.lock_machines:
  - 3
  - plana
- internal.save_config: null
- internal.check_lock: null
- internal.connect: null
- internal.check_conflict: null
- internal.check_ceph_data: null
- internal.vm_setup: null
- kernel: *id001
- internal.base: null
- internal.archive: null
- internal.coredump: null
- internal.syslog: null
- internal.timer: null
- chef: null
- clock.check: null
- install:
    branch: bobtail
- ceph:
    log-whitelist:
    - wrongly marked
- ceph-fuse: null
- workunit:
    clients:
      all:
      - suites/dbench.sh
- install.upgrade:
    all:
      branch: next
- ceph.restart:
  - mon.a
  - mon.b
  - mon.c
  - osd.0
  - osd.1
  - osd.2
  - osd.3
  - mds.a
- workunit:
    clients:
      all:
      - kernel_untar_build.sh

#12 Updated by Tamilarasi muthamizhan almost 11 years ago

ubuntu@teuthology:/a/teuthology-2013-05-14_01:30:05-upgrade-master-testing-basic/13144

#13 Updated by Samuel Just almost 11 years ago

  • Assignee deleted (Samuel Just)

#14 Updated by Tamilarasi muthamizhan almost 11 years ago

  • Subject changed from osd: heartbeat_map had suicide timed out after 150 to osd suicide timed out after 150
  • Priority changed from High to Urgent
  • Source changed from Development to Q/A

log: ubuntu@teuthology:/a/teuthology-2013-06-07_01:30:04-upgrade-master-testing-basic/32963

     0> 2013-06-07 11:07:36.883328 7fe64a7b1700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fe64a7b1700 time 2013-06-07 
11:07:36.881723
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.63-235-g0b036ec (0b036ecddbfd82e651666326d6f16b3c000ade18)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x7addb7]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x7ae767]
 3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x7aed13]
 4: (CephContextServiceThread::entry()+0x55) [0x851155]
 5: (()+0x7e9a) [0x7fe64d12ce9a]
 6: (clone()+0x6d) [0x7fe64b2bfccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#15 Updated by Sage Weil almost 11 years ago

  • Status changed from New to Won't Fix

this was an ext4 bug:

2013-06-07T12:07:44.062525-07:00 plana74 kernel: [22812.003900] EXT4-fs warning (device sda1): ext4_da_update_reserve_space:361: ino 2883621, allocated 1 with only 0 reserved metadata blocks (releasing 83 blocks with reserved 83 data blocks)
2013-06-07T12:07:44.062547-07:00 plana74 kernel: [22812.020873] ------------[ cut here ]------------
2013-06-07T12:07:44.078705-07:00 plana74 kernel: [22812.025515] WARNING: at /srv/autobuild-ceph/gitbuilder.git/build/fs/ext4/inode.c:362 ext4_da_update_reserve_space+0x22b/0x2a0()
2013-06-07T12:07:44.078712-07:00 plana74 kernel: [22812.037050] Hardware name: PowerEdge R410
2013-06-07T12:07:44.138453-07:00 plana74 kernel: [22812.041076] Modules linked in: kvm_intel kvm ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs reiserfs coretemp nfsd nfs_acl auth_rpcgss exportfs nfs fscache lockd ghash_clmulni_intel aesni_intel ablk_helper cryptd sunrpc lrw aes_x86_64 xts gf128mul psmouse i7core_edac edac_core microcode lpc_ich mfd_core dcdbas hed serio_raw joydev lp parport hid_generic usbhid hid btrfs raid6_pq ixgbe mptsas mptscsih dca ptp mptbase pps_core scsi_transport_sas mdio bnx2 xor zlib_deflate crc32c_intel libcrc32c [last unloaded: kvm]
2013-06-07T12:07:44.138461-07:00 plana74 kernel: [22812.088618] Pid: 904, comm: flush-8:0 Tainted: G        W    3.9.0-ceph-00303-g19bb6a8 #1
2013-06-07T12:07:44.138463-07:00 plana74 kernel: [22812.096843] Call Trace:
2013-06-07T12:07:44.152818-07:00 plana74 kernel: [22812.099313]  [<ffffffff8105678f>] warn_slowpath_common+0x7f/0xc0
2013-06-07T12:07:44.152825-07:00 plana74 kernel: [22812.105340]  [<ffffffff810567ea>] warn_slowpath_null+0x1a/0x20
2013-06-07T12:07:44.152828-07:00 plana74 kernel: [22812.111210]  [<ffffffff8123514b>] ext4_da_update_reserve_space+0x22b/0x2a0
2013-06-07T12:07:44.159729-07:00 plana74 kernel: [22812.118147]  [<ffffffff81266160>] ext4_ext_map_blocks+0xc10/0x1040
2013-06-07T12:07:44.172356-07:00 plana74 kernel: [22812.124361]  [<ffffffff8127842c>] ? ext4_es_lookup_extent+0x3c/0x1b0
2013-06-07T12:07:44.172363-07:00 plana74 kernel: [22812.130751]  [<ffffffff812354a5>] ext4_map_blocks+0x2e5/0x490
2013-06-07T12:07:44.184592-07:00 plana74 kernel: [22812.136518]  [<ffffffff8123999d>] mpage_da_map_and_submit+0x14d/0x410
2013-06-07T12:07:44.184599-07:00 plana74 kernel: [22812.142990]  [<ffffffff8123a4dc>] ext4_da_writepages+0x35c/0x620
2013-06-07T12:07:44.190624-07:00 plana74 kernel: [22812.149029]  [<ffffffff811b228b>] ? writeback_sb_inodes+0x12b/0x3e0
2013-06-07T12:07:44.202349-07:00 plana74 kernel: [22812.155331]  [<ffffffff8113e0b3>] do_writepages+0x23/0x40
2013-06-07T12:07:44.202356-07:00 plana74 kernel: [22812.160763]  [<ffffffff811afa9e>] __writeback_single_inode+0x4e/0x2a0
2013-06-07T12:07:44.208820-07:00 plana74 kernel: [22812.167238]  [<ffffffff811b23e0>] writeback_sb_inodes+0x280/0x3e0
2013-06-07T12:07:44.220369-07:00 plana74 kernel: [22812.173357]  [<ffffffff811b2709>] wb_writeback+0xf9/0x310
2013-06-07T12:07:44.220376-07:00 plana74 kernel: [22812.178793]  [<ffffffff8105eff7>] ? local_bh_enable_ip+0x97/0x100
2013-06-07T12:07:44.232164-07:00 plana74 kernel: [22812.184907]  [<ffffffff811b425b>] wb_do_writeback+0xbb/0x1e0
2013-06-07T12:07:44.232171-07:00 plana74 kernel: [22812.190597]  [<ffffffff811b4413>] bdi_writeback_thread+0x93/0x270
2013-06-07T12:07:44.238293-07:00 plana74 kernel: [22812.196757]  [<ffffffff811b4380>] ? wb_do_writeback+0x1e0/0x1e0
2013-06-07T12:07:44.249170-07:00 plana74 kernel: [22812.202702]  [<ffffffff8107c46a>] kthread+0xea/0xf0
2013-06-07T12:07:44.249177-07:00 plana74 kernel: [22812.207616]  [<ffffffff8107c380>] ? flush_kthread_work+0x1a0/0x1a0
2013-06-07T12:07:44.260801-07:00 plana74 kernel: [22812.213817]  [<ffffffff8167b4ac>] ret_from_fork+0x7c/0xb0
2013-06-07T12:07:44.260809-07:00 plana74 kernel: [22812.219247]  [<ffffffff8107c380>] ? flush_kthread_work+0x1a0/0x1a0
2013-06-07T12:07:44.271608-07:00 plana74 kernel: [22812.225475] ---[ end trace f0ea84d46d841afe ]---
2013-06-07T12:12:23.941842-07:00 plana74 kernel: Kernel logging (proc) stopped.

Also available in: Atom PDF