Bug #4488

closed

ceph-osd crash on server under heavy load

Added by mipt patriot about 11 years ago. Updated about 11 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The OSD crashes after about 15-30 minutes of running. Reproduced many times.

ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)

Here is the log:

2013-03-18 17:17:59.834820 7f5c239f8700 -1 *** Caught signal (Aborted) **
 in thread 7f5c239f8700

 ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
 1: /usr/bin/ceph-osd() [0x790e59]
 2: (()+0xf500) [0x7f5c45d54500]
 3: (gsignal()+0x35) [0x7f5c44a208a5]
 4: (abort()+0x175) [0x7f5c44a22085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f5c452d9a5d]
 6: (()+0xbcbe6) [0x7f5c452d7be6]
 7: (()+0xbcc13) [0x7f5c452d7c13]
 8: (()+0xbcd0e) [0x7f5c452d7d0e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x83d699]
 10: (SyncEntryTimeout::finish(int)+0xbf) [0x7357bf]
 11: (SafeTimer::timer_thread()+0x453) [0x854f23]
 12: (SafeTimerThread::entry()+0xd) [0x8570ed]
 13: (()+0x7851) [0x7f5c45d4c851]
 14: (clone()+0x6d) [0x7f5c44ad611d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
    -3> 2013-03-18 17:17:59.322403 7f5c40a8a700  5 osd.2 241 tick
    -2> 2013-03-18 17:17:59.825744 7f5c1fdf2700  1 -- [2a01:4f8:190:73ae::2]:6803/54197 <== osd.3 [2a01:4f8:100:9363::2]:0/22824 734 ==== osd_ping(ping e241 stamp 2013-03-18 17:17:59.822076) v2 ==== 47+0+0 (2616145970 0 0) 0x7f5bbc000ea0 con 0x7f5bc4000b10
    -1> 2013-03-18 17:17:59.825796 7f5c1fdf2700  1 -- [2a01:4f8:190:73ae::2]:6803/54197 --> [2a01:4f8:100:9363::2]:0/22824 -- osd_ping(ping_reply e241 stamp 2013-03-18 17:17:59.822076) v2 -- ?+0 0x7f5b980008c0 con 0x7f5bc4000b10
     0> 2013-03-18 17:17:59.834820 7f5c239f8700 -1 *** Caught signal (Aborted) **
 in thread 7f5c239f8700

 ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
 1: /usr/bin/ceph-osd() [0x790e59]
 2: (()+0xf500) [0x7f5c45d54500]
 3: (gsignal()+0x35) [0x7f5c44a208a5]
 4: (abort()+0x175) [0x7f5c44a22085]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f5c452d9a5d]
 6: (()+0xbcbe6) [0x7f5c452d7be6]
 7: (()+0xbcc13) [0x7f5c452d7c13]
 8: (()+0xbcd0e) [0x7f5c452d7d0e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x83d699]
 10: (SyncEntryTimeout::finish(int)+0xbf) [0x7357bf]
 11: (SafeTimer::timer_thread()+0x453) [0x854f23]
 12: (SafeTimerThread::entry()+0xd) [0x8570ed]
 13: (()+0x7851) [0x7f5c45d4c851]
 14: (clone()+0x6d) [0x7f5c44ad611d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   0/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent    100000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.2.log
--- end dump of recent events ---

Files

ceph-osd.dump.bz2 (5.9 MB) - objdump -rdS ceph-osd - mipt patriot, 03/18/2013 09:39 AM
Actions #1

Updated by Sage Weil about 11 years ago

This means that sync(2) or syncfs(2) failed to return within 10 minutes. That points to either a pathologically loaded system (highly unlikely) or a broken disk, file system, or kernel. Check dmesg output?
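
For illustration, here is a minimal C++ sketch of the timer-plus-assert pattern that the backtrace points at (SafeTimer firing SyncEntryTimeout::finish, which asserts when the sync has not completed in time). This is a simplified stand-in under assumed names and timeout values, not the actual FileStore code:

// Simplified stand-in for the SafeTimer / SyncEntryTimeout pattern in the
// backtrace: a timer waits for the sync to complete and aborts if it does not.
// Hypothetical demo code, not the Ceph implementation.
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

struct SyncTimeoutDemo {
  std::mutex m;
  std::condition_variable cv;
  bool sync_done = false;

  // Stand-in for the filestore sync thread; the sleep represents a slow
  // sync(2)/syncfs(2) call on the backing filesystem.
  void sync_entry(std::chrono::seconds sync_takes) {
    std::this_thread::sleep_for(sync_takes);
    std::lock_guard<std::mutex> lock(m);
    sync_done = true;
    cv.notify_all();
  }

  // Stand-in for the timer callback: if the sync has not finished within
  // the commit timeout, fail an assertion and abort, as in the crash above.
  void on_timeout(std::chrono::seconds commit_timeout) {
    std::unique_lock<std::mutex> lock(m);
    bool finished = cv.wait_for(lock, commit_timeout,
                                [this] { return sync_done; });
    assert(finished && "sync did not return before the commit timeout");
  }
};

int main() {
  SyncTimeoutDemo demo;
  // A 1-second "sync" against a 5-second timeout succeeds; swap the numbers
  // to exercise the abort path instead.
  std::thread syncer(&SyncTimeoutDemo::sync_entry, &demo,
                     std::chrono::seconds(1));
  demo.on_timeout(std::chrono::seconds(5));
  syncer.join();
  return 0;
}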

Actions #2

Updated by mipt patriot about 11 years ago

Some errors like this show up in dmesg:

[ 6236.985604] INFO: task ceph-osd:3220 blocked for more than 120 seconds.
[ 6236.985670] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6236.985760] ceph-osd D ffff8804197da440 0 3220 1 0 0x00000080
[ 6236.985852] ffff8803f4439cd8 0000000000000086 ffff8803f4439c48 ffffffff8104e2fc
[ 6236.985953] ffff8803ce344640 0000000000000286 ffff8803f4439cb8 ffffffff8105a1ee
[ 6236.986046] ffff880419425000 ffff88002c31ea28 ffff8804197daa08 000000000001e9c0
[ 6236.986138] Call Trace:
[ 6236.986186] [<ffffffff8104e2fc>] ? check_preempt_curr+0x7c/0x90
[ 6236.986240] [<ffffffff8105a1ee>] ? try_to_wake_up+0xae/0x4c0
[ 6236.986290] [<ffffffff814f0f75>] schedule_timeout+0x215/0x2e0
[ 6236.986339] [<ffffffff810640b4>] ? check_preempt_wakeup+0x1a4/0x260
[ 6236.986388] [<ffffffff814f0bf3>] wait_for_common+0x123/0x180
[ 6236.986438] [<ffffffff8105a600>] ? default_wake_function+0x0/0x20
[ 6236.986487] [<ffffffff814f0d0d>] wait_for_completion+0x1d/0x20
[ 6236.986537] [<ffffffff811c382a>] writeback_inodes_sb_nr_ub+0x7a/0xa0
[ 6236.986586] [<ffffffff811c38b6>] writeback_inodes_sb_ub+0x46/0x50
[ 6236.986636] [<ffffffff811c9ee1>] __sync_filesystem+0x51/0xa0
[ 6236.986685] [<ffffffff811c9f7d>] sync_filesystem_ub+0x4d/0x80
[ 6236.986734] [<ffffffff811ca0c6>] sys_syncfs+0x116/0x170
[ 6236.986783] [<ffffffff8100b102>] system_call_fastpath+0x16/0x1b
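
For reference, one way to confirm that the stall is in the kernel/storage stack rather than in ceph-osd is to time a bare syncfs(2) on the OSD data directory while the load (or the md resync) is in progress. This is only an illustrative sketch; the default path is an assumption (adjust it to your osd data directory), and it needs Linux with glibc 2.14+ and g++:

// Rough standalone timing check for syncfs(2) on the OSD data directory.
// If this takes minutes to return, the filesystem/kernel is the bottleneck.
// Hypothetical example path; pass your own as argv[1]. Build with g++.
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>

int main(int argc, char** argv) {
  const char* dir = argc > 1 ? argv[1] : "/var/lib/ceph/osd/ceph-2";
  int fd = open(dir, O_RDONLY | O_DIRECTORY);
  if (fd < 0) {
    std::perror("open");
    return 1;
  }
  auto t0 = std::chrono::steady_clock::now();
  int r = syncfs(fd);  // the same call the hung task above is stuck in
  auto t1 = std::chrono::steady_clock::now();
  std::chrono::duration<double> dt = t1 - t0;
  std::printf("syncfs returned %d after %.2f seconds\n", r, dt.count());
  close(fd);
  return r == 0 ? 0 : 1;
}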

Actions #3

Updated by mipt patriot about 11 years ago

An md RAID 1 array was resyncing at the same time, so I think it is quite likely that sync/syncfs failed to finish its work in time.

Actions #4

Updated by Sage Weil about 11 years ago

  • Status changed from New to Rejected

That might explain it. 10 minutes is still a really long time!

In any case, this is a problem with the underlying storage.
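
For completeness: the 10-minute figure corresponds to the filestore commit timeout, which defaults to 600 seconds. If the underlying storage is known to be temporarily slow (for example during an md resync), the timeout can be raised in ceph.conf; the value below is only an example, and raising it papers over rather than fixes the storage problem:

[osd]
    ; default is 600 seconds; example value only
    filestore commit timeout = 1200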
