Bug #13428

closed

multiple OSDs crashed on same node

Added by Kenneth Waegeman over 8 years ago. Updated over 8 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Stack traces from different OSDs:
The OSDs are on FileStore. Most of them are in the erasure-coded pool, but one OSD of the replicated cache pool is affected as well:

Full logs attached


    -5> 2015-10-09 08:50:24.476720 7f5249539700  5 -- op tracker -- seq: 68376217, time: 2015-10-09 08:50:24.476720, event: commit_queued_for_journal_write, op: osd_repop(client.205172.0:507328 1.1f1 1/3e80a1f1/10003ec73cb.00000022/head v 5330'782762)
    -4> 2015-10-09 08:50:24.477057 7f525d9aa700  5 -- op tracker -- seq: 68376217, time: 2015-10-09 08:50:24.477057, event: write_thread_in_journal_buffer, op: osd_repop(client.205172.0:507328 1.1f1 1/3e80a1f1/10003ec73cb.00000022/head v 5330'782762)
    -3> 2015-10-09 08:50:24.486811 7f525d1a9700  5 -- op tracker -- seq: 68376217, time: 2015-10-09 08:50:24.486811, event: journaled_completion_queued, op: osd_repop(client.205172.0:507328 1.1f1 1/3e80a1f1/10003ec73cb.00000022/head v 5330'782762)
    -2> 2015-10-09 08:50:24.486851 7f525a9a4700  5 -- op tracker -- seq: 68376217, time: 2015-10-09 08:50:24.486851, event: commit_sent, op: osd_repop(client.205172.0:507328 1.1f1 1/3e80a1f1/10003ec73cb.00000022/head v 5330'782762)
    -1> 2015-10-09 08:50:24.486902 7f525a9a4700  1 -- 10.143.16.19:6806/1035881 --> 10.143.16.13:6803/28439 -- osd_repop_reply(client.205172.0:507328 1.1f1 ondisk, result = 0) v1 -- ?+0 0xd945100 con 0x11759b80
     0> 2015-10-09 08:50:24.489418 7f522bba7700 -1 *** Caught signal (Aborted) **
 in thread 7f522bba7700

 ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da)
 1: /usr/bin/ceph-osd() [0xb476d2]
 2: (()+0xf130) [0x7f526c476130]
 3: (gsignal()+0x37) [0x7f526ac545d7]
 4: (abort()+0x148) [0x7f526ac55cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f526b5589b5]
 6: (()+0x5e926) [0x7f526b556926]
 7: (()+0x5e953) [0x7f526b556953]
 8: (()+0x5eb73) [0x7f526b556b73]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc4bc1a]
 10: (Thread::create(unsigned long)+0x8a) [0xc2f06a]
 11: (Pipe::accept()+0x37db) [0xd36edb]
 12: (Pipe::reader()+0x193d) [0xd3a95d]
 13: (Pipe::Reader::entry()+0xd) [0xd3d52d]
 14: (()+0x7df5) [0x7f526c46edf5]
 15: (clone()+0x6d) [0x7f526ad151ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

    -4> 2015-10-09 09:00:50.862266 7fa38872b700  5 -- op tracker -- seq: 51428597, time: 2015-10-09 09:00:50.862266, event: started, op: osd_sub_op(unknown.0.0:0 2.60es7 MIN [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -3> 2015-10-09 09:00:50.862351 7fa38872b700  1 -- 10.143.16.19:6814/36794 --> 10.143.16.18:6820/36052 -- osd_sub_op_reply(unknown.0.0:0 2.60es0 MIN [scrub-reserve] ack, result = 0) v2 -- ?+1 0x56420b00 con 0x3ea10f20
    -2> 2015-10-09 09:00:50.862432 7fa38872b700  5 -- op tracker -- seq: 51428597, time: 2015-10-09 09:00:50.862432, event: done, op: osd_sub_op(unknown.0.0:0 2.60es7 MIN [scrub-reserve] v 0'0 snapset=0=[]:[] snapc=0=[])
    -1> 2015-10-09 09:00:50.862508 7fa38872b700  5 -- op tracker -- seq: 51428589, time: 2015-10-09 09:00:50.862508, event: reached_pg, op: MOSDPGPush(2.82s3 5458 [PushOp(2/82680082/100030d21bb.00000023/head, version: 4550'40080, data_included: [0~419744], data_size: 419744, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(2/82680082/100030d21bb.00000023/head@4550'40080, copy_subset: [], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4197440, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true)),PushOp(2/63680082/1000214a741.0000032f/head, version: 3287'22904, data_included: [0~419744], data_size: 419744, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(2/63680082/1000214a741.0000032f/head@3287'22904, copy_subset: [], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4197440, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true)),PushOp(2/55680082/1000295f89d.00000000/head, version: 3805'26247, data_included: [0~832], data_size: 832, omap_header_size: 0, omap_entries_size: 0, attrset_size: 5, recovery_info: ObjectRecoveryInfo(2/55680082/1000295f89d.00000000/head@3805'26247, copy_subset: [], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:8320, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true)),PushOp(2/e5680082/10001f0405a.000029f8/head, version: 2454'15544, data_included: [0~419744], data_size: 419744, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(2/e5680082/10001f0405a.000029f8/head@2454'15544, copy_subset: [], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:4197440, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true)),PushOp(2/89680082/100032cfb43.00000000/head, version: 4550'44639, data_included: [0~11648], data_size: 11648, omap_header_size: 0, omap_entries_size: 0, attrset_size: 5, recovery_info: ObjectRecoveryInfo(2/89680082/100032cfb43.00000000/head@4550'44639, copy_subset: [], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_recovered_to:116480, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_recovered_to:, omap_complete:true))])
     0> 2015-10-09 09:00:50.863997 7fa376394700 -1 *** Caught signal (Aborted) **
 in thread 7fa376394700

 ceph version 9.0.3 (7295612d29f953f46e6e88812ef372b89a43b9da)
 1: /usr/bin/ceph-osd() [0xb476d2]
 2: (()+0xf130) [0x7fa3ab463130]
 3: (gsignal()+0x37) [0x7fa3a9c415d7]
 4: (abort()+0x148) [0x7fa3a9c42cc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fa3aa5459b5]
 6: (()+0x5e926) [0x7fa3aa543926]
 7: (()+0x5e953) [0x7fa3aa543953]
 8: (()+0x5eb73) [0x7fa3aa543b73]
 9: (ceph::buffer::create_aligned(unsigned int, unsigned int)+0x1fa) [0xc5433a]
 10: (Pipe::read_message(Message**, AuthSessionHandler*)+0x22bd) [0xd26d7d]
 11: (Pipe::reader()+0xa91) [0xd39ab1]
 12: (Pipe::Reader::entry()+0xd) [0xd3d52d]
 13: (()+0x7df5) [0x7fa3ab45bdf5]
 14: (clone()+0x6d) [0x7fa3a9d021ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Files

cephlog1.tar.gz (263 KB) Kenneth Waegeman, 10/09/2015 10:54 AM
Actions #1

Updated by Kenneth Waegeman over 8 years ago

I can't attach all the logs (they are only 1.2 MB gzipped); the upload fails with:
413 Request Entity Too Large

So only the log of one OSD is attached.

Actions #2

Updated by Loïc Dachary over 8 years ago

  • Status changed from New to Rejected

The only way ceph::buffer::create_aligned can fail is if memory allocation fails. I think the failures you saw were caused by there not being enough memory on the host. If you disagree, please let me know :-)
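
To illustrate the failure mode described above, here is a minimal C++ sketch of an aligned allocator that throws std::bad_alloc when the allocation fails. It is not Ceph's actual code: the names alloc_aligned_or_throw and can_allocate are hypothetical, and the use of posix_memalign is an assumption made for the example. The second backtrace, where the terminate handler sits directly above create_aligned, would be consistent with an exception of this kind going uncaught.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical helper (not Ceph's code): allocate `len` bytes aligned to
// `align`, throwing std::bad_alloc if the allocation cannot be satisfied.
inline void* alloc_aligned_or_throw(std::size_t len, std::size_t align) {
    // posix_memalign needs a power-of-two alignment of at least sizeof(void*).
    assert(align >= sizeof(void*) && (align & (align - 1)) == 0);
    void* p = nullptr;
    const int rc = ::posix_memalign(&p, align, len);
    if (rc != 0) {
        // A non-zero return code means the request failed (for example ENOMEM).
        throw std::bad_alloc();
    }
    return p;
}

// Hypothetical probe: returns true if the allocation succeeds, false if it
// throws. The exception is caught here, so the process keeps running.
inline bool can_allocate(std::size_t len, std::size_t align) {
    try {
        std::free(alloc_aligned_or_throw(len, align));
        return true;
    } catch (const std::bad_alloc&) {
        return false;
    }
}
```

If alloc_aligned_or_throw were called with no enclosing try block, the exception would reach std::terminate, whose default handler calls abort() and raises SIGABRT. That matches the gsignal/abort/__verbose_terminate_handler frames in the traces, although it does not by itself prove the host ran out of memory.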
