Bug #1530

closed

osd crash during build_inc_scrub_map

Added by Josh Durgin over 12 years ago. Updated about 12 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in teuthology:~teuthworker/archive/nightly_coverage_2011-09-09/112/teuthology.log

osd/PG.cc: In function 'void PG::build_inc_scrub_map(ScrubMap&, eversion_t)', in thread '0x7f4634573700'
osd/PG.cc: 2896: FAILED assert(0)
 ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
 1: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x434) [0x6dfc64]
 2: (PG::replica_scrub(MOSDRepScrub*)+0x8ff) [0x6e716f]
 3: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x12a) [0x5c6dba]
 4: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5993d2]
 5: (ThreadPool::worker()+0x846) [0x662276]
 6: (ThreadPool::WorkThread::entry()+0x15) [0x5a5765]
 7: (Thread::_entry_func(void*)+0x12) [0x609492]
 8: (()+0x7971) [0x7f4644201971]
 9: (clone()+0x6d) [0x7f4642a9192d]


Files

osd.3.log (8.92 MB), Samuel Just, 11/02/2011 01:33 PM
Actions #1

Updated by Sage Weil over 12 years ago

  • Position set to 42
Actions #2

Updated by Josh Durgin over 12 years ago

This happened again, but with blogbench on cfuse. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-09-19/299/

Actions #3

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.36 to v0.37
Actions #4

Updated by Josh Durgin over 12 years ago

A third recurrence, this time with ffsb on rbd. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-10-05/142/

Actions #5

Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High
Actions #6

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.37 to v0.38
Actions #7

Updated by Josh Durgin over 12 years ago

  • Subject changed from "osd crash during kernel_untar_build workunit on top of rbd" to "osd crash during build_inc_scrub_map"

This happened again during cfuse on ffsb (teuthology:~teuthworker/archive/nightly_coverage_2011-10-20/694).

Actions #8

Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info
Actions #9

Updated by Sage Weil over 12 years ago

I'm going to up the scrub frequency in the teuthology conf to help shake out these problems. There was another bug recently--the op reply reordering--that was triggered by scrubbing (and the requeuing of ops after scrub completion).

Actions #10

Updated by Sage Weil over 12 years ago

  • Assignee set to Sage Weil
Actions #11

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.38 to v0.39
Actions #13

Updated by Sage Weil over 12 years ago

  • Assignee deleted (Sage Weil)

Can someone work on reproducing this? See metropolis:~sage/src/teuthology/j.1530 and hammer.sh.

Actions #14

Updated by Samuel Just over 12 years ago

  • Assignee set to Samuel Just
Actions #15

Updated by Josh Durgin over 12 years ago

Two more tests hit this last night, and two other osds crashed due to an assert in build_inc_scrub_map:

osd/PG.cc: In function 'void PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)', in thread '7f81f1a72700'
osd/PG.cc: 2573: FAILED assert(!o.negative)
 ceph version 0.37-243-g2fc01b5 (commit:2fc01b5292f8c1777c546583543e1b4ab6451e79)
 1: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)+0x78b) [0x704eeb]
 2: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x2a6) [0x705256]
 3: (PG::build_scrub_map(ScrubMap&)+0x37a) [0x708dea]
 4: (PG::scrub()+0x799) [0x740cd9]
 5: (OSD::ScrubWQ::_process(PG*)+0x14) [0x5af8f4]
 6: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x5a62d2]
 7: (ThreadPool::worker()+0x7e3) [0x682f03]
 8: (ThreadPool::WorkThread::entry()+0x15) [0x5afae5]
 9: (Thread::_entry_func(void*)+0x12) [0x60f252]
 10: (()+0x7971) [0x7f8201e00971]
 11: (clone()+0x6d) [0x7f820069092d]

Actions #16

Updated by Josh Durgin over 12 years ago

Possibly related: the snaps workunit failed yesterday and today with bad stats:

2011-11-01 09:12:33.948653 osd.0 10.3.14.133:6800/1000 31 : [ERR] 0.0 scrub stat mismatch, got 295/295 objects, 7/7 clones, 9712167/6337823 bytes, 9655/6356 kb." in cluster log
1317 FAIL scheduled_teuthology@teuthology collection:basic clusters:fixed-3.yaml tasks:cfuse_workunit_snaps.yaml
    "2011-11-02 00:34:10.145251 osd.0 10.3.14.131:6800/895 17 : [ERR] 0.1 scrub stat mismatch, got 145/145 objects, 22/22 clones, 8796326/8591881 bytes, 8663/8451 kb." in cluster log

Actions #17

Updated by Samuel Just over 12 years ago

c861ee105475b3f20f64f51b8611f9b69207ca8c should take care of the assert(!o.negative) error. Still trying to reproduce the other one.

Actions #18

Updated by Samuel Just over 12 years ago

Alright, on IRC, slb seems to have hit a related bug... with logging!

Actions #19

Updated by Samuel Just over 12 years ago

My best guess about this bug goes something like this:
1) osd->generate_backlog does pg->assemble_backlog
2) osd->generate_backlog drops the pg lock to grab the osd_map read lock
3) ...which is held by osd->handle_osd_map
4) at the end of osd->handle_osd_map, we call write_info on the pg since it has progressed to a new peering interval
5) osd->generate_backlog gets the read_lock and the pg lock and promptly bails since the backlog generation has been cancelled
6) osd dies, but not before the write_info transaction is durable

The result of this is that the in-memory backlog generated in assemble_backlog never makes it to disk, but the updated info does, leaving the ondisklog inconsistent with the pg info on OSD restart.
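
To make the suspected interleaving concrete, here is a minimal single-threaded C++ sketch of steps 1-6 above (the names are purely hypothetical stand-ins, not the real PG/OSD members). It ends in the inconsistent state described: the durably written info has advanced, but the backlog assembled in memory never reached disk.

// Hypothetical simulation of the suspected race; illustrative names only.
#include <cassert>
#include <iostream>

struct Durable {                 // state that actually reaches disk before the crash
    int  info_interval   = 0;    // pg info, advanced by write_info()
    bool backlog_on_disk = false;
};

int main() {
    Durable disk;
    bool in_memory_backlog = false;
    bool backlog_cancelled = false;

    // 1) generate_backlog: assemble_backlog builds the backlog in memory only.
    in_memory_backlog = true;

    // 2)-3) generate_backlog drops the pg lock and blocks on the osdmap read lock,
    //       which handle_osd_map is currently holding.

    // 4) handle_osd_map: the pg has moved to a new peering interval, so the pending
    //    backlog generation is cancelled and the updated info is committed durably.
    backlog_cancelled  = true;
    disk.info_interval = 1;       // the write_info transaction becomes durable

    // 5) generate_backlog finally gets both locks, sees the cancellation, and bails
    //    without ever persisting the backlog it assembled.
    if (!backlog_cancelled)
        disk.backlog_on_disk = in_memory_backlog;

    // 6) the osd dies here; on restart the on-disk info is ahead of the on-disk log.
    std::cout << "info interval on disk: " << disk.info_interval
              << ", backlog on disk: " << std::boolalpha << disk.backlog_on_disk << "\n";
    assert(disk.info_interval == 1 && !disk.backlog_on_disk);   // the inconsistency
    return 0;
}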

Actions #20

Updated by Samuel Just over 12 years ago

ae41f3232a39dbf33487ab02cbac292f58debea8

Actions #21

Updated by Samuel Just over 12 years ago

  • Status changed from Need More Info to Resolved
Actions #22

Updated by Anonymous over 12 years ago

Actions #23

Updated by Sage Weil over 12 years ago

  • Status changed from Resolved to 12
Actions #24

Updated by Anonymous over 12 years ago

Still happening in 11/11 nightly
/log/osd.1.log.gz

Actions #26

Updated by Sage Weil over 12 years ago

  • Position deleted (99)
  • Position set to 6
Actions #27

Updated by Anonymous over 12 years ago

Both of the variants described above occurred in the 11/21 nightlies
(2775:sepia17, 2783:sepia81, 2805:sepia82)

Actions #28

Updated by Josh Durgin over 12 years ago

A new crash during scrub, from /var/lib/teut/log/osd.1.log.gz:

osd/PG.cc: In function 'void PG::replica_scrub(MOSDRepScrub*)', in thread '7ffa73315700'
osd/PG.cc: 2865: FAILED assert(last_update_applied == info.last_update)
 ceph version 0.38-241-g32a6837 (commit:32a6837839acd7bf7135e356b45a8636140f8eb1)
 1: (PG::replica_scrub(MOSDRepScrub*)+0x275) [0x729405]
 2: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x10a) [0x5e4c9a]
 3: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5b92c2]
 4: (ThreadPool::worker()+0x7e3) [0x698913]
 5: (ThreadPool::WorkThread::entry()+0x15) [0x5c1325]
 6: (Thread::_entry_func(void*)+0x12) [0x624102]
 7: (()+0x7971) [0x7ffa839a9971]
 8: (clone()+0x6d) [0x7ffa8203492d]

Actions #29

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40
Actions #30

Updated by Josh Durgin over 12 years ago

Another one crashed in PG::replica_scrub yesterday. Core is in teuthology:~teut/coredump/1323087877.10117.core

Actions #31

Updated by Josh Durgin over 12 years ago

I'm guessing this is the new incarnation of this issue?
From teuthology:~teut/log/osd.1.log.gz:

osd/OSD.cc: In function 'MOSDMap* OSD::build_incremental_map_msg(epoch_t, epoch_t)', in thread '7f24566a0700'
osd/OSD.cc: 3662: FAILED assert(0)
 ceph version 0.39-96-g6d5e5bd (commit:6d5e5bdb1ae38cfa30e852d53b4e4a399ad3861c)
 1: (OSD::build_incremental_map_msg(unsigned int, unsigned int)+0x4df) [0x5911ef]
 2: (OSD::send_incremental_map(unsigned int, entity_inst_t const&, bool)+0xa8) [0x591378]
 3: (OSD::_share_map_incoming(entity_inst_t const&, unsigned int, OSD::Session*)+0x74f) [0x59728f]
 4: (OSD::handle_op(MOSDOp*)+0x6fa) [0x59b17a]
 5: (OSD::_dispatch(Message*)+0x9df) [0x5b0baf]
 6: (OSD::ms_dispatch(Message*)+0x1b0) [0x5b1ea0]
 7: (SimpleMessenger::dispatch_entry()+0x99a) [0x626b0a]
 8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x4a8a2c]
 9: (Thread::_entry_func(void*)+0x12) [0x620fb2]
 10: (()+0x7971) [0x7f2462329971]
 11: (clone()+0x6d) [0x7f24609b492d]

Actions #32

Updated by Sage Weil over 12 years ago

Fixed that last thing with c9d0e556c7ad294819c60ca4e3cd4d0191811f18, but I think it's unrelated to the rest of this bug.

Actions #33

Updated by Josh Durgin over 12 years ago

This was the only failure in the run last night. Core at teuthology:~teut/coredump/1324284203.17720.core, build at teuthology:~teuthworker/archive/ceph_builds/nightly_coverage_2011-12-19-a.tgz

Actions #34

Updated by Sage Weil over 12 years ago

  • Position deleted (43)
  • Position set to 31
Actions #35

Updated by Sage Weil over 12 years ago

  • Priority changed from High to Urgent
Actions #36

Updated by Sage Weil over 12 years ago

  • Priority changed from Urgent to Normal
Actions #37

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
  • Position deleted (56)
  • Position set to 3
Actions #38

Updated by Sage Weil over 12 years ago

  • Assignee deleted (Samuel Just)
Actions #39

Updated by Sage Weil about 12 years ago

  • Status changed from 12 to Can't reproduce