Bug #1379

osd segfault during end of recovery

Added by Josh Durgin over 12 years ago. Updated over 12 years ago.

Status: Resolved
Priority: High
Assignee: -
Category: OSD
Target version:
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

From huang jun's email titled 'osd down after adding OSDs':

2011-08-08 20:11:34.506899 7f2f85ff6700 osd10 22  0.87 at 2011-08-08 20:05:16.696631 > 2011-08-07 20:11:34.506050 (86400 seconds ago)
2011-08-08 20:11:34.506903 7f2f85ff6700 osd10 22 sched_scrub done
2011-08-08 20:11:35.191699 7f2f7cee3700 osd10 22 heartbeat_entry woke up
2011-08-08 20:11:35.191724 7f2f7cee3700 osd10 22 heartbeat
2011-08-08 20:11:35.191758 7f2f7cee3700 osd10 22 heartbeat checking stats
2011-08-08 20:11:35.191780 7f2f7cee3700 osd10 22 update_osd_stat osd_stat(1640 KB used, 931 GB avail, 931 GB total, peers []/[])
2011-08-08 20:11:35.191795 7f2f7cee3700 osd10 22 heartbeat: osd_stat(1640 KB used, 931 GB avail, 931 GB total, peers []/[])
2011-08-08 20:11:35.191808 7f2f7cee3700 osd10 22 heartbeat map_locked=1
2011-08-08 20:11:35.191820 7f2f7cee3700 osd10 22 heartbeat check
2011-08-08 20:11:35.191828 7f2f7cee3700 osd10 22 heartbeat lonely?
2011-08-08 20:11:35.191835 7f2f7cee3700 osd10 22 heartbeat put map_lock
2011-08-08 20:11:35.191839 7f2f7cee3700 osd10 22 heartbeat done
2011-08-08 20:11:35.191846 7f2f7cee3700 osd10 22 heartbeat_entry sleeping for 1.1
2011-08-08 20:11:35.506955 7f2f85ff6700 osd10 22 tick
2011-08-08 20:11:35.507011 7f2f85ff6700 osd10 22 scrub_should_schedule loadavg 0.13 < max 0.5 = no, randomly backing off
2011-08-08 20:11:36.001713 7f2f847f3700 filestore(/data/osd10) sync_entry woke after 5.000054
2011-08-08 20:11:36.001745 7f2f847f3700 filestore(/data/osd10) sync_entry committing 2830 sync_epoch 10
2011-08-08 20:11:36.001786 7f2f847f3700 filestore(/data/osd10) sync_entry doing btrfs SYNC
2011-08-08 20:11:36.077118 7f2f847f3700 filestore(/data/osd10) sync_entry commit took 0.075372
2011-08-08 20:11:36.077238 7f2f84ff4700 osd10 22 pg[1.309( empty n=0 ec=2 les/c 6/20 21/21/21) [3] r=-1 stray] _activate_committed 8, that was an old interval
2011-08-08 20:11:36.077278 7f2f84ff4700 osd10 22 pg[1.309( empty n=0 ec=2 les/c 6/20 21/21/21) [3] r=-1 stray] _finish_recovery -- stale
2011-08-08 20:11:36.077291 7f2f847f3700 filestore(/data/osd10) sync_entry committed to op_seq 2830
2011-08-08 20:11:36.077308 7f2f847f3700 filestore(/data/osd10) sync_entry waiting for max_interval 5.000000
2011-08-08 20:11:36.077369 7f2f84ff4700 osd10 22 pg[2.429( empty n=0 ec=2 les/c 6/20 21/21/21) [6] r=-1 stray] _activate_committed 8, that was an old interval
2011-08-08 20:11:36.077412 7f2f84ff4700 osd10 22 pg[2.429( empty n=0 ec=2 les/c 6/20 21/21/21) [6] r=-1 stray] _finish_recovery -- stale
*** Caught signal (Segmentation fault) **
 in thread 0x7f2f84ff4700
 ceph version 0.32 (commit:c08d08baa6a945d989427563e46c992f757ad5eb)
 1: /usr/bin/cosd() [0x581269]
 2: (()+0xef60) [0x7f2f8b854f60]
 3: (PG::_activate_committed(unsigned int)+0x9c) [0x60bc2c]
 4: (Context::complete(int)+0xa) [0x4d9ada]
 5: (C_Contexts::finish(int)+0xdb) [0x4dfcdb]
 6: (Finisher::finisher_thread_entry()+0x188) [0x6a0288]
 7: (()+0x68ba) [0x7f2f8b84c8ba]
 8: (clone()+0x6d) [0x7f2f8a2e602d]

osd.10.log (12.9 MB) Josh Durgin, 08/09/2011 08:47 AM

Associated revisions

Revision 8ce65447
Added by Sage Weil over 12 years ago

osd: fix _activate_committed() crash

Do not dereference acting[0] unless we know it is still valid.

Take a reference when scheduling the transaction, and drop it in the
completion, to ensure that the PG isn't removed out from underneath us.

Fixes: #1379
Signed-off-by: Sage Weil <>
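
The pattern described in the commit message can be illustrated with a minimal C++ sketch. This is not the actual patch: ToyPG, ActivateCommitted and completion_survives_removal are hypothetical names invented here, and the real PG class is far more involved. The sketch assumes a manually reference-counted object with get()/put(), a completion that takes a reference when it is scheduled and drops it when it runs, and a staleness check that is done before acting[0] is read.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a placement group; not Ceph's real PG class.
struct ToyPG {
  int refs = 1;             // one reference held by the owning registry
  bool *destroyed;          // set to true when the last reference is dropped
  std::vector<int> acting;  // acting set; may be empty after a map change
  unsigned interval = 0;    // interval the PG is currently in

  explicit ToyPG(bool *d) : destroyed(d) {}
  void get() { ++refs; }
  void put() {
    if (--refs == 0) {
      *destroyed = true;
      delete this;
    }
  }
};

// Completion that pins the PG from scheduling until it has run.
struct ActivateCommitted {
  ToyPG *pg;
  unsigned epoch;  // interval in which the transaction was scheduled

  ActivateCommitted(ToyPG *p, unsigned e) : pg(p), epoch(e) {
    pg->get();  // take a reference when scheduling
  }
  ActivateCommitted(const ActivateCommitted &) = delete;
  ActivateCommitted &operator=(const ActivateCommitted &) = delete;

  // Returns the primary (acting[0]) if it is still valid, or -1 if stale.
  int complete() {
    int primary = -1;
    // Only touch acting[0] once we know the interval has not changed
    // and the acting set is non-empty.
    if (epoch == pg->interval && !pg->acting.empty())
      primary = pg->acting[0];
    pg->put();  // drop the reference taken at scheduling time
    return primary;
  }
};

// Simulates the PG being removed while the completion is still queued.
bool completion_survives_removal() {
  bool destroyed = false;
  ToyPG *pg = new ToyPG(&destroyed);
  pg->interval = 8;
  pg->acting = {3};

  ActivateCommitted c(pg, 8);  // transaction scheduled in interval 8

  // The map changes and the registry drops its reference.
  pg->interval = 21;
  pg->acting.clear();
  pg->put();

  bool alive_at_completion = !destroyed;  // still pinned by the completion
  int primary = c.complete();             // stale, so acting[0] is not read
  return alive_at_completion && primary == -1 && destroyed;
}
```

In this sketch the object outlives its removal from the registry only until the queued completion has run, at which point the last put() frees it.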

History

#1 Updated by Josh Durgin over 12 years ago

#2 Updated by Sage Weil over 12 years ago

  • Target version set to v0.34

#3 Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High

#4 Updated by Sage Weil over 12 years ago

slang hit this too:

(10:08:17 AM) slang: http://pastebin.com/raw.php?i=fYFGnVPJ

#5 Updated by Sage Weil over 12 years ago

  • Status changed from New to Resolved
