Bug #10670 (closed): osd segfault

Added by Andrey Matyashov about 9 years ago. Updated about 9 years ago.

Status:
Rejected
Priority:
Urgent
Assignee:
-
Category:
OSD
Target version:
-
% Done: 0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi!
I have a cluster with 5 nodes. About 12 hours after updating glibc (for CVE-2015-0235), 2 of the nodes died. After a reset, the OSDs segfault when Ceph starts. If I start one of these OSDs manually with debugging enabled, I get the following messages:

root@virt-master:~# ceph-osd -f -d --debug_ms=10 -c /etc/ceph/ceph.conf --name=osd.2
2015-01-28 14:35:46.995375 7f92c1adc840  0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-osd, pid 16275
starting osd.2 at :/0 osd_data /var/lib/ceph/osd/ceph-2 /var/lib/ceph/osd/ceph-2/journal
2015-01-28 14:35:46.996114 7f92c1adc840 10 -- :/0 rank.bind 10.100.23.2:0/0
2015-01-28 14:35:46.996125 7f92c1adc840 10 accepter.accepter.bind
2015-01-28 14:35:46.996139 7f92c1adc840 10 accepter.accepter.bind bound on random port 10.100.23.2:6801/0
2015-01-28 14:35:46.996143 7f92c1adc840 10 accepter.accepter.bind bound to 10.100.23.2:6801/0
2015-01-28 14:35:46.996178 7f92c1adc840  1 -- 10.100.23.2:0/0 learned my addr 10.100.23.2:0/0
2015-01-28 14:35:46.996183 7f92c1adc840  1 accepter.accepter.bind my_inst.addr is 10.100.23.2:6801/16275 need_addr=0
2015-01-28 14:35:46.996187 7f92c1adc840 10 -- :/0 rank.bind :/0
2015-01-28 14:35:46.996188 7f92c1adc840 10 accepter.accepter.bind
2015-01-28 14:35:46.996193 7f92c1adc840 10 accepter.accepter.bind bound on random port 0.0.0.0:6802/0
2015-01-28 14:35:46.996195 7f92c1adc840 10 accepter.accepter.bind bound to 0.0.0.0:6802/0
2015-01-28 14:35:46.996198 7f92c1adc840  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6802/16275 need_addr=1
2015-01-28 14:35:46.996200 7f92c1adc840 10 -- :/0 rank.bind :/0
2015-01-28 14:35:46.996201 7f92c1adc840 10 accepter.accepter.bind
2015-01-28 14:35:46.996205 7f92c1adc840 10 accepter.accepter.bind bound on random port 0.0.0.0:6803/0
2015-01-28 14:35:46.996207 7f92c1adc840 10 accepter.accepter.bind bound to 0.0.0.0:6803/0
2015-01-28 14:35:46.996210 7f92c1adc840  1 accepter.accepter.bind my_inst.addr is 0.0.0.0:6803/16275 need_addr=1
2015-01-28 14:35:46.996212 7f92c1adc840 10 -- :/0 rank.bind 10.100.23.2:0/0
2015-01-28 14:35:46.996213 7f92c1adc840 10 accepter.accepter.bind
2015-01-28 14:35:46.996218 7f92c1adc840 10 accepter.accepter.bind bound on random port 10.100.23.2:6804/0
2015-01-28 14:35:46.996220 7f92c1adc840 10 accepter.accepter.bind bound to 10.100.23.2:6804/0
2015-01-28 14:35:46.996229 7f92c1adc840  1 -- 10.100.23.2:0/0 learned my addr 10.100.23.2:0/0
2015-01-28 14:35:46.996232 7f92c1adc840  1 accepter.accepter.bind my_inst.addr is 10.100.23.2:6804/16275 need_addr=0
2015-01-28 14:35:46.996233 7f92c1adc840 10 -- :/0 rank.bind 10.100.23.2:0/0
2015-01-28 14:35:46.996235 7f92c1adc840 10 accepter.accepter.bind
2015-01-28 14:35:46.996240 7f92c1adc840 10 accepter.accepter.bind bound on random port 10.100.23.2:6805/0
2015-01-28 14:35:46.996242 7f92c1adc840 10 accepter.accepter.bind bound to 10.100.23.2:6805/0
2015-01-28 14:35:46.996250 7f92c1adc840  1 -- 10.100.23.2:0/0 learned my addr 10.100.23.2:0/0
2015-01-28 14:35:46.996253 7f92c1adc840  1 accepter.accepter.bind my_inst.addr is 10.100.23.2:6805/16275 need_addr=0
2015-01-28 14:35:46.998332 7f92c1adc840  0 filestore(/var/lib/ceph/osd/ceph-2) backend xfs (magic 0x58465342)
2015-01-28 14:35:46.998342 7f92c1adc840  1 filestore(/var/lib/ceph/osd/ceph-2)  disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2015-01-28 14:35:47.107992 7f92c1adc840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl is supported and appears to work
2015-01-28 14:35:47.108003 7f92c1adc840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-01-28 14:35:47.116258 7f92c1adc840  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-01-28 14:35:47.126209 7f92c1adc840  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-2) detect_feature: extsize is disabled by conf
2015-01-28 14:35:47.296360 7f92c1adc840  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-01-28 14:35:48.105737 7f92c1adc840  1 journal _open /var/lib/ceph/osd/ceph-2/journal fd 19: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1
*** Caught signal (Segmentation fault) **
 in thread 7f92c1adc840
Segmentation Fault

How can this be fixed?

Thanks!
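
A full backtrace would make a crash like this easier to pin down. A minimal sketch of capturing one with gdb, assuming debug symbols are available (the ceph-dbg package name is an assumption for Debian/Ubuntu-style packaging):

# install gdb and (assumed package name) the Ceph debug symbols
apt-get install gdb ceph-dbg
# run the failing OSD under gdb with the same arguments as above
gdb --args ceph-osd -f -d -c /etc/ceph/ceph.conf --name=osd.2
# at the gdb prompt: 'run' to reproduce the crash, then 'bt' to print the backtrace

Alternatively, enabling core dumps (ulimit -c unlimited) before starting the OSD and loading the resulting core file into gdb afterwards gives the same information.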


Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #11488: 2 OSD segfaults after some commit (Duplicate, 04/28/2015)

Actions #1

Updated by Andrey Matyashov about 9 years ago

After a segmentation fault, the server hangs (sometimes with a kernel panic).

Actions #2

Updated by Samuel Just about 9 years ago

  • Status changed from New to Rejected

It sounds a lot like there is something off with that glibc. You probably want to take it up with the appropriate maintainer for your distro? If your server then hangs, it's pretty unlikely to be ceph related, I think.
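
For reference, a quick way to check which glibc build is actually installed, and whether it carries the CVE-2015-0235 (GHOST) fix, assuming a Debian/Ubuntu-style system:

# version reported by the dynamic loader / libc itself
ldd --version
# installed libc6 package version (Debian/Ubuntu)
dpkg -s libc6 | grep -i '^version'
# search the package changelog for the CVE entry (needs access to the package archive)
apt-get changelog libc6 | grep -i 'CVE-2015-0235'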

Actions #3

Updated by Samuel Just about 9 years ago

Feel free to mark it new again if there is more information.

Actions #4

Updated by Andrey Matyashov about 9 years ago

I found new details for this bug.

I repaired my cluster as follows (a rough command sketch follows the list):
1. disabled automatic start of Ceph on boot on the nodes with the dead OSDs
2. manually started the MONs and MDSs
3. manually started the OSDs one by one, which is how I found the failing OSDs
4. after a reboot, manually started the MONs, MDSs, and only the healthy OSDs
5. deleted the failing OSDs and recreated the inactive PGs
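
A rough sketch of the commands behind these steps, assuming sysvinit-style Ceph packaging of that era, with osd.2 as an example of a failed OSD and osd.3 as a healthy one (the ids and the <pgid> placeholder are illustrative):

# 1. keep Ceph from starting automatically on boot on the affected node
update-rc.d ceph disable
# 2. / 4. start daemon types or individual daemons selectively
service ceph start mon
service ceph start mds
service ceph start osd.3
# 3. a suspect OSD can also be started by hand to watch it crash, as in the log above
ceph-osd -f -d -c /etc/ceph/ceph.conf --name=osd.2
# 5. remove a dead OSD from the cluster map, then recreate PGs that stay inactive
ceph osd out 2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2
ceph pg force_create_pg <pgid>    # repeat per stuck PG; available in 0.87-era releases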

I have one snapshot of a virtual machine. After rebuilding the cluster, I started reverting to this snapshot. While the revert was in progress, 2 other OSDs died again. Maybe this snapshot contains "toxic data"?

