Bug #202

closed

OSD crash during reads from cluster

Added by Wido den Hollander almost 14 years ago. Updated over 13 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:
0%


Description

Today I noticed one OSD crashing during read operations (rsync) from my cluster.

I don't know if it matters, but the crashes started after I added an extra OSD to my cluster; it hasn't been added to the CRUSH map yet.

The full log can be found at: http://zooi.widodh.nl/ceph/ceph05.10199.gz

The backtrace from gdb:

root@ceph05:~# gdb /usr/lib/debug/usr/bin/cosd /core.ceph05.10199 
GNU gdb (GDB) 7.1-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying" 
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/lib/debug/usr/bin/cosd...done.
[New Thread 10250]
[New Thread 10251]
[New Thread 10253]
[New Thread 10252]
[New Thread 10254]
[New Thread 10262]
[New Thread 10255]
[New Thread 10264]
[New Thread 10256]
[New Thread 10265]
[New Thread 10258]
[New Thread 10260]
[New Thread 10267]
[New Thread 10266]
[New Thread 10269]
[New Thread 10268]
[New Thread 10271]
[New Thread 10270]
[New Thread 10275]
[New Thread 10272]
[New Thread 10277]
[New Thread 10279]
[New Thread 10278]
[New Thread 10281]
[New Thread 10283]
[New Thread 10284]
[New Thread 10286]
[New Thread 10285]
[New Thread 10287]
[New Thread 10291]
[New Thread 10199]
[New Thread 10292]
[New Thread 10239]
[New Thread 10201]
[New Thread 10244]
[New Thread 10202]
[New Thread 10246]
[New Thread 10234]
[New Thread 10247]
[New Thread 10235]
[New Thread 10248]
[New Thread 10236]
[New Thread 10238]
[New Thread 10241]
[New Thread 10243]
[New Thread 10245]
[New Thread 10249]
[New Thread 10240]
[New Thread 10237]

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.
Core was generated by `/usr/bin/cosd -i 5 -c /etc/ceph/ceph.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007f60ddddca75 in ?? ()
(gdb) bt
#0  0x00007f60ddddca75 in ?? ()
#1  0x00007f60ddde05c0 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) 
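The `??` frames suggest GDB could not resolve symbols for the shared libraries the crash occurred in, which is consistent with the dynamic-linker warning above. A fuller trace can sometimes be recovered by installing the libc debug symbols and dumping all threads; a sketch of that session follows (the `libc6-dbg` package name is Ubuntu's and is an assumption, not taken from this report):

```
root@ceph05:~# apt-get install libc6-dbg
root@ceph05:~# gdb /usr/lib/debug/usr/bin/cosd /core.ceph05.10199
(gdb) thread apply all bt
```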

After the crash I restarted the OSD, but a few minutes later it crashed again, this time with a bit more information. The log of the second crash can be found at: http://zooi.widodh.nl/ceph/ceph05.10671.gz

The backtrace of the second crash was the same as the first one.

Actions #1

Updated by Wido den Hollander almost 14 years ago

I tried doing a fresh mkfs of only osd5; this is what I did:
  • Removed all data in /srv/ceph/osd5
  • Ran cosd with --mkfs
  • Started osd5 again
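In command form, the steps above were roughly the following (a sketch only: the invocations are assumed from the core dump's command line and the `--mkfs` flag mentioned, not quoted from a shell history):

```
root@ceph05:~# rm -rf /srv/ceph/osd5/*
root@ceph05:~# cosd -i 5 -c /etc/ceph/ceph.conf --mkfs
root@ceph05:~# cosd -i 5 -c /etc/ceph/ceph.conf
```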

After a few minutes of synchronization the OSD crashed again, with these being the last log lines:

root@ceph05:/var/log/ceph# tail -n 10 ceph05.11120
10.06.15_20:54:59.657371 7f07f10bf710 osd5 281 pg[0.191( v 26'112 lc 26'71 (26'110,26'112]+backlog n=109 ec=2 les=248 242/247/29) [0,5] r=1 lcod 26'63 active m=41 snaptrimq=[3~14]] _committed last_complete 26'64 now ondisk
10.06.15_20:54:59.657387 7f07f10bf710 osd5 281 pg[0.1b6( v 26'123 lc 14'5 (26'121,26'123]+backlog n=114 ec=2 les=243 242/242/242) [5,2] r=0 mlcod 0'0 active m=2] _committed last_complete 14'5 now ondisk
10.06.15_20:54:59.657402 7f07f10bf710 osd5 281 pg[0.1b6( v 26'123 lc 14'5 (26'121,26'123]+backlog n=114 ec=2 les=243 242/242/242) [5,2] r=0 mlcod 0'0 active m=2] _committed last_complete 14'5 now ondisk
10.06.15_20:54:59.657580 7f07f10bf710 osd5 281 pg[0.1b6( v 26'123 lc 14'5 (26'121,26'123]+backlog n=114 ec=2 les=243 242/242/242) [5,2] r=0 mlcod 0'0 active m=2] _committed last_complete 14'5 now ondisk
10.06.15_20:54:59.657598 7f07f10bf710 osd5 281 pg[0.1b6( v 26'123 lc 14'5 (26'121,26'123]+backlog n=114 ec=2 les=243 242/242/242) [5,2] r=0 mlcod 0'0 active m=2] _committed last_complete 14'5 now ondisk
10.06.15_20:54:59.657614 7f07f10bf710 osd5 281 pg[0.1b6( v 26'123 lc 14'5 (26'121,26'123]+backlog n=114 ec=2 les=243 242/242/242) [5,2] r=0 mlcod 0'0 active m=2] _committed last_complete 14'5 now ondisk
10.06.15_20:54:59.657631 7f07f10bf710 osd5 281 pg[0.79( v 26'137 lc 23'31 (26'134,26'137]+backlog n=131 ec=2 les=255 242/252/55) [4,5] r=1 lcod 23'19 active m=106 snaptrimq=[3~14]] _committed last_complete 23'20 now ondisk
10.06.15_20:54:59.657647 7f07f10bf710 osd5 281 pg[1.b9( v 241'1156 (136'1154,241'1156]+backlog n=26 ec=2 les=255 242/252/55) [4,5] r=1 lcod 27'1149 active] _committed last_complete 27'1150 now ondisk
10.06.15_20:54:59.658542 7f07f30c3710 filestore(/srv/ceph/osd5) remove /srv/ceph/osd5/current/0.1b6_head/1000000331a.00000001_head = -2
10.06.15_20:54:59.658579 7f07f30c3710 filestore(/srv/ceph/osd5) write /srv/ceph/osd5/current/0.1b6_head/1000000331a.00000001_head 0~3332324
root@ceph05:/var/log/ceph#

The full log can be found at: http://zooi.widodh.nl/ceph/ceph05.11120.gz
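The log lines above all share the same flat `YY.MM.DD_HH:MM:SS.micro` timestamp prefix. When sifting a full log for the moments before a crash, a small helper like this can split such lines apart (a hypothetical sketch, not part of Ceph; it assumes only the prefix format shown in the excerpt):

```python
from datetime import datetime

def parse_osd_line(line):
    """Split an old-style Ceph OSD log line into (timestamp, rest).

    Assumes the 'YY.MM.DD_HH:MM:SS.micro' prefix seen in the excerpt
    above; returns None if the line does not start with one.
    """
    ts, _, rest = line.partition(" ")
    try:
        return datetime.strptime(ts, "%y.%m.%d_%H:%M:%S.%f"), rest
    except ValueError:
        return None

# Example using one of the lines from the tail above.
line = ("10.06.15_20:54:59.658542 7f07f30c3710 filestore(/srv/ceph/osd5) "
        "remove /srv/ceph/osd5/current/0.1b6_head/1000000331a.00000001_head = -2")
when, rest = parse_osd_line(line)
print(when.year, when.second)  # -> 2010 59
```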

The backtrace was again the same as the one already posted, but the symbols seem a bit strange.

Actions #2

Updated by Sage Weil almost 14 years ago

  • Status changed from New to Closed

I'm going to close this. Not much can be done without a stack trace or more specific information. If it comes up again, we can reopen this with more info.

Actions #3

Updated by Sage Weil almost 14 years ago

  • Status changed from Closed to Can't reproduce