Bug #1758

closed

OSD segfault in SimpleMessenger::send_message

Added by Anonymous over 12 years ago. Updated about 12 years ago.

Status: Can't reproduce
Priority: Normal
Assignee: -
Category: OSD
Target version: -
% Done: 0%


Description

In the 11/29 nightlies, in cfuse_workunit_misc (3335), the OSD on sepia5 segfaulted.
The end of the osd log is:
2011-11-28 23:52:57.849808 7f04f3e5f700 log [INF] : 2.1p0 scrub ok
2011-11-28 23:53:05.560595 7f04fb06f700 journal check_for_full at 103739392 : JOURNAL FULL 103739392 >= 2646015 (max_size 104857600 start 1531904)
2011-11-28 23:53:05.726387 7f04fb06f700 journal check_for_full at 103739392 : JOURNAL FULL 103739392 >= 2646015 (max_size 104857600 start 1531904)
2011-11-28 23:53:49.882656 7f04fb06f700 journal check_for_full at 79925248 : JOURNAL FULL 79925248 >= 544767 (max_size 104857600 start 80470016)
2011-11-28 23:53:50.236698 7f04fb06f700 journal check_for_full at 79925248 : JOURNAL FULL 79925248 >= 544767 (max_size 104857600 start 80470016)
2011-11-28 23:54:09.010185 7f04fb06f700 journal check_for_full at 79917056 : JOURNAL FULL 79917056 >= 8191 (max_size 104857600 start 79925248)
2011-11-28 23:54:09.125164 7f04fb06f700 journal check_for_full at 79917056 : JOURNAL FULL 79917056 >= 8191 (max_size 104857600 start 79925248)
2011-11-28 23:56:12.377305 7f04fb06f700 journal check_for_full at 99950592 : JOURNAL FULL 99950592 >= 1699839 (max_size 104857600 start 101650432)
2011-11-29 00:02:17.485866 7f04f3e5f700 log [INF] : 0.0 scrub ok
2011-11-29 00:02:21.324655 7f04fb06f700 journal check_for_full at 28332032 : JOURNAL FULL 28332032 >= 122879 (max_size 104857600 start 28454912)
2011-11-29 00:02:21.447687 7f04fb06f700 journal check_for_full at 28332032 : JOURNAL FULL 28332032 >= 122879 (max_size 104857600 start 28454912)
2011-11-29 00:02:24.664521 7f04f365e700 log [INF] : 0.1 scrub ok
2011-11-29 00:02:26.710498 7f04f365e700 log [INF] : 0.3 scrub ok
*** Caught signal (Segmentation fault) **
    in thread 7f04f265c700
    ceph version 0.38-250-gc2889fe (c2889fef420611df3dd0de4064c91f6aa9f86625)
    1: /tmp/cephtest/binary/usr/local/bin/ceph-osd() [0x6ad944]
    2: (()+0xfb40) [0x7f0502cf8b40]
    3: (SimpleMessenger::send_message(Message*, Connection*)+0x54) [0x6377f4]
    4: (PG::replica_scrub(MOSDRepScrub*)+0x82c) [0x72a63c]
    5: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x10a) [0x5e4bba]
    6: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5b91d2]
    7: (ThreadPool::worker()+0x7e3) [0x699593]
    8: (ThreadPool::WorkThread::entry()+0x15) [0x5c1235]
    9: (Thread::_entry_func(void*)+0x12) [0x6245f2]
    10: (()+0x7971) [0x7f0502cf0971]
    11: (clone()+0x6d) [0x7f050137b92d]
Actions #1

Updated by Sage Weil over 12 years ago

  • Target version set to v0.40
Actions #2

Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High
Actions #3

Updated by Sage Weil over 12 years ago

Actions #4

Updated by Josh Durgin over 12 years ago

Actions #6

Updated by Greg Farnum over 12 years ago

  • Status changed from New to Resolved
  • Assignee set to Greg Farnum

I checked out a core dump, and the OSD is calling send_message with a null Connection* from PG::replica_scrub::2895. I'm not quite sure how the Connection is getting NULLed out yet, but I suspect it has something to do with this code snippet from ReplicatedPG::sub_op_modify_applied:

  if (last_update_applied == info.last_update && finalizing_scrub) {
    assert(active_rep_scrub);
    osd->rep_scrub_wq.queue(active_rep_scrub);
    active_rep_scrub->put();
    active_rep_scrub = 0;
  }

So it's deleting the message, and the Connection* is probably getting NULLed out by whatever next reuses that memory. Pushed a fix to master in 03b03553b2e386b4c102e24bf90f88297a0f61e7.
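For anyone following along, here is a minimal sketch of the suspected pattern: a ref-counted message is queued for a worker and the caller then drops its reference. It only illustrates the hypothesis above. Msg, WorkQueue and live_after_handoff are hypothetical stand-ins, not Ceph's classes, and whether the real rep_scrub_wq takes its own reference on queue() is an assumption that this thread does not settle.

```cpp
#include <cassert>
#include <deque>

// Hypothetical stand-in for a ref-counted message; not Ceph's Message class.
struct Msg {
  static int live;  // number of Msg objects currently allocated
  int nref = 1;     // the creator starts out holding one reference
  Msg() { ++live; }
  ~Msg() { --live; }
  Msg* get() { ++nref; return this; }
  void put() { if (--nref == 0) delete this; }
};
int Msg::live = 0;

// Hypothetical work queue. takes_ref selects whether queue() grabs its own
// reference (safe) or merely stores the caller's raw pointer (unsafe).
struct WorkQueue {
  bool takes_ref;
  std::deque<Msg*> items;
  explicit WorkQueue(bool t) : takes_ref(t) {}
  void queue(Msg* m) { items.push_back(takes_ref ? m->get() : m); }
};

// Mirrors the shape of the quoted snippet: queue the message, put() it,
// clear the local pointer. Returns how many messages are still alive while
// the item is sitting in the queue waiting for a worker.
int live_after_handoff(bool queue_takes_ref) {
  WorkQueue wq(queue_takes_ref);
  Msg* active = new Msg;
  assert(active);
  wq.queue(active);
  active->put();     // drops the caller's reference
  active = nullptr;
  int live = Msg::live;
  // Cleanup: in the safe variant the "worker" drops the queue's reference.
  // In the unsafe variant the queued pointer dangles and is never touched.
  if (queue_takes_ref) wq.items.front()->put();
  wq.items.clear();
  return live;
}
```

With live_after_handoff(false) the message is already freed while its pointer is still queued, so a worker dereferencing it later would read reused memory, which is consistent with seeing a null Connection*. With live_after_handoff(true) the queued item keeps the message alive until the worker is done with it.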

Actions #7

Updated by Josh Durgin over 12 years ago

  • Status changed from Resolved to New

This happened again yesterday. Core is in teuthology:~teut/coredump/1323401936.7586.core

Actions #8

Updated by Sage Weil over 12 years ago

Can we verify that the last failure was running a commit that included the fix?

Actions #9

Updated by Greg Farnum over 12 years ago

  • Assignee deleted (Greg Farnum)

For the life of me I cannot seem to get useful symbols out of this, though I'm not sure why. I've been using LD_LIBRARY_PATH and have run it on the gitbuilder...

Anyway, without that I can't do anything. If somebody can be more successful I will happily check it out again, but for now I'm releasing this bug.

(Oh, it did run on a commit including the apparently-not-a-fix.)

Actions #10

Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info
Actions #11

Updated by Sage Weil over 12 years ago

  • Priority changed from High to Normal
Actions #12

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
Actions #13

Updated by Sage Weil about 12 years ago

  • Status changed from Need More Info to Can't reproduce

Haven't seen this one in ages, either. Going to assume it's been fixed.
