Bug #1530

closed

osd crash during build_inc_scrub_map

Added by Josh Durgin over 12 years ago. Updated about 12 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Severity:
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in teuthology:~teuthworker/archive/nightly_coverage_2011-09-09/112/teuthology.log

osd/PG.cc: In function 'void PG::build_inc_scrub_map(ScrubMap&, eversion_t)', in thread '0x7f4634573700'
osd/PG.cc: 2896: FAILED assert(0)
 ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
 1: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x434) [0x6dfc64]
 2: (PG::replica_scrub(MOSDRepScrub*)+0x8ff) [0x6e716f]
 3: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x12a) [0x5c6dba]
 4: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5993d2]
 5: (ThreadPool::worker()+0x846) [0x662276]
 6: (ThreadPool::WorkThread::entry()+0x15) [0x5a5765]
 7: (Thread::_entry_func(void*)+0x12) [0x609492]
 8: (()+0x7971) [0x7f4644201971]
 9: (clone()+0x6d) [0x7f4642a9192d]


Files

osd.3.log (8.92 MB), Samuel Just, 11/02/2011 01:33 PM
Actions #1

Updated by Sage Weil over 12 years ago

  • Position set to 42
Actions #2

Updated by Josh Durgin over 12 years ago

This happened again, but with blogbench on cfuse. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-09-19/299/

Actions #3

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.36 to v0.37
Actions #4

Updated by Josh Durgin over 12 years ago

A third recurrence, this time with ffsb on rbd. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-10-05/142/

Actions #5

Updated by Sage Weil over 12 years ago

  • Priority changed from Normal to High
Actions #6

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.37 to v0.38
Actions #7

Updated by Josh Durgin over 12 years ago

  • Subject changed from "osd crash during kernel_untar_build workunit on top of rbd" to "osd crash during build_inc_scrub_map"

This happened again during cfuse on ffsb (teuthology:~teuthworker/archive/nightly_coverage_2011-10-20/694).

Actions #8

Updated by Sage Weil over 12 years ago

  • Status changed from New to Need More Info
Actions #9

Updated by Sage Weil over 12 years ago

I'm going to up the scrub frequency in the teuthology conf to help shake out these problems. There was another bug recently--the op reply reordering--that was triggered by scrubbing (and the requeuing of ops after scrub completion).

Actions #10

Updated by Sage Weil over 12 years ago

  • Assignee set to Sage Weil
Actions #11

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.38 to v0.39
Actions #13

Updated by Sage Weil over 12 years ago

  • Assignee deleted (Sage Weil)

Can someone work on reproducing this? See metropolis:~sage/src/teuthology/j.1530 and hammer.sh.

Actions #14

Updated by Samuel Just over 12 years ago

  • Assignee set to Samuel Just
Actions #15

Updated by Josh Durgin over 12 years ago

Two more tests hit this last night, and two other osds crashed due to an assert in build_inc_scrub_map:

osd/PG.cc: In function 'void PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)', in thread '7f81f1a72700'
osd/PG.cc: 2573: FAILED assert(!o.negative)
 ceph version 0.37-243-g2fc01b5 (commit:2fc01b5292f8c1777c546583543e1b4ab6451e79)
 1: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)+0x78b) [0x704eeb]
 2: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x2a6) [0x705256]
 3: (PG::build_scrub_map(ScrubMap&)+0x37a) [0x708dea]
 4: (PG::scrub()+0x799) [0x740cd9]
 5: (OSD::ScrubWQ::_process(PG*)+0x14) [0x5af8f4]
 6: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x5a62d2]
 7: (ThreadPool::worker()+0x7e3) [0x682f03]
 8: (ThreadPool::WorkThread::entry()+0x15) [0x5afae5]
 9: (Thread::_entry_func(void*)+0x12) [0x60f252]
 10: (()+0x7971) [0x7f8201e00971]
 11: (clone()+0x6d) [0x7f820069092d]

Actions #16

Updated by Josh Durgin over 12 years ago

Possibly related: the snaps workunit failed yesterday and today with bad stats:

2011-11-01 09:12:33.948653 osd.0 10.3.14.133:6800/1000 31 : [ERR] 0.0 scrub stat mismatch, got 295/295 objects, 7/7 clones, 9712167/6337823 bytes, 9655/6356 kb." in cluster log
1317 FAIL scheduled_teuthology@teuthology collection:basic clusters:fixed-3.yaml tasks:cfuse_workunit_snaps.yaml
    "2011-11-02 00:34:10.145251 osd.0 10.3.14.131:6800/895 17 : [ERR] 0.1 scrub stat mismatch, got 145/145 objects, 22/22 clones, 8796326/8591881 bytes, 8663/8451 kb." in cluster log

Actions #17

Updated by Samuel Just over 12 years ago

c861ee105475b3f20f64f51b8611f9b69207ca8c should take care of the assert(!o.negative) error. Still trying to reproduce the other one.

Actions #18

Updated by Samuel Just over 12 years ago

Alright, on IRC, slb seems to have hit a related bug... with logging!

Actions #19

Updated by Samuel Just over 12 years ago

My best guess about this bug goes something like this:
1) osd->generate_backlog does pg->assemble_backlog
2) osd->generate_backlog drops the pg lock to grab the osd_map read lock
3) ...which is held by osd->handle_osd_map
4) at the end of osd->handle_osd_map, we call write_info on the pg since it has progressed to a new peering interval
5) osd->generate_backlog gets the read_lock and the pg lock and promptly bails since the backlog generation has been cancelled
6) osd dies, but not before the write_info transaction is durable

The result of this is that the in-memory backlog generated in assemble_backlog never makes it to disk, but the updated info does, leaving the ondisklog inconsistent with the pg info on OSD restart.
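
To make the suspected interleaving concrete, here is a minimal single-threaded C++ sketch of steps 1-6 above (the names are purely hypothetical stand-ins, not the real PG/OSD members). It ends in the inconsistent state described: the durably written info has advanced, but the backlog assembled in memory never reached disk.

// Hypothetical simulation of the suspected race; illustrative names only.
#include <cassert>
#include <iostream>

struct Durable {                 // state that actually reaches disk before the crash
    int  info_interval   = 0;    // pg info, advanced by write_info()
    bool backlog_on_disk = false;
};

int main() {
    Durable disk;
    bool in_memory_backlog = false;
    bool backlog_cancelled = false;

    // 1) generate_backlog: assemble_backlog builds the backlog in memory only.
    in_memory_backlog = true;

    // 2)-3) generate_backlog drops the pg lock and blocks on the osdmap read lock,
    //       which handle_osd_map is currently holding.

    // 4) handle_osd_map: the pg has moved to a new peering interval, so the pending
    //    backlog generation is cancelled and the updated info is committed durably.
    backlog_cancelled  = true;
    disk.info_interval = 1;       // the write_info transaction becomes durable

    // 5) generate_backlog finally gets both locks, sees the cancellation, and bails
    //    without ever persisting the backlog it assembled.
    if (!backlog_cancelled)
        disk.backlog_on_disk = in_memory_backlog;

    // 6) the osd dies here; on restart the on-disk info is ahead of the on-disk log.
    std::cout << "info interval on disk: " << disk.info_interval
              << ", backlog on disk: " << std::boolalpha << disk.backlog_on_disk << "\n";
    assert(disk.info_interval == 1 && !disk.backlog_on_disk);   // the inconsistency
    return 0;
}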

Actions #20

Updated by Samuel Just over 12 years ago

ae41f3232a39dbf33487ab02cbac292f58debea8

Actions #21

Updated by Samuel Just over 12 years ago

  • Status changed from Need More Info to Resolved
Actions #22

Updated by Anonymous over 12 years ago

Actions #23

Updated by Sage Weil over 12 years ago

  • Status changed from Resolved to 12
Actions #24

Updated by Anonymous over 12 years ago

Still happening in 11/11 nightly
/log/osd.1.log.gz

Actions #26

Updated by Sage Weil over 12 years ago

  • Position deleted (99)
  • Position set to 6
Actions #27

Updated by Anonymous over 12 years ago

Both of the variants described above occurred in the 11/21 nightlies
(2775:sepia17, 2783:sepia81, 2805:sepia82)

Actions #28

Updated by Josh Durgin over 12 years ago

A new crash during scrub, from /var/lib/teut/log/osd.1.log.gz:

osd/PG.cc: In function 'void PG::replica_scrub(MOSDRepScrub*)', in thread '7ffa73315700'
osd/PG.cc: 2865: FAILED assert(last_update_applied == info.last_update)
 ceph version 0.38-241-g32a6837 (commit:32a6837839acd7bf7135e356b45a8636140f8eb1)
 1: (PG::replica_scrub(MOSDRepScrub*)+0x275) [0x729405]
 2: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x10a) [0x5e4c9a]
 3: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5b92c2]
 4: (ThreadPool::worker()+0x7e3) [0x698913]
 5: (ThreadPool::WorkThread::entry()+0x15) [0x5c1325]
 6: (Thread::_entry_func(void*)+0x12) [0x624102]
 7: (()+0x7971) [0x7ffa839a9971]
 8: (clone()+0x6d) [0x7ffa8203492d]

Actions #29

Updated by Sage Weil over 12 years ago

  • Target version changed from v0.39 to v0.40
Actions #30

Updated by Josh Durgin over 12 years ago

Another one crashed in PG::replica_scrub yesterday. Core is in teuthology:~teut/coredump/1323087877.10117.core

Actions #31

Updated by Josh Durgin over 12 years ago

I'm guessing this is the new incarnation of this issue?
From teuthology:~teut/log/osd.1.log.gz:

osd/OSD.cc: In function 'MOSDMap* OSD::build_incremental_map_msg(epoch_t, epoch_t)', in thread '7f24566a0700'
osd/OSD.cc: 3662: FAILED assert(0)
 ceph version 0.39-96-g6d5e5bd (commit:6d5e5bdb1ae38cfa30e852d53b4e4a399ad3861c)
 1: (OSD::build_incremental_map_msg(unsigned int, unsigned int)+0x4df) [0x5911ef]
 2: (OSD::send_incremental_map(unsigned int, entity_inst_t const&, bool)+0xa8) [0x591378]
 3: (OSD::_share_map_incoming(entity_inst_t const&, unsigned int, OSD::Session*)+0x74f) [0x59728f]
 4: (OSD::handle_op(MOSDOp*)+0x6fa) [0x59b17a]
 5: (OSD::_dispatch(Message*)+0x9df) [0x5b0baf]
 6: (OSD::ms_dispatch(Message*)+0x1b0) [0x5b1ea0]
 7: (SimpleMessenger::dispatch_entry()+0x99a) [0x626b0a]
 8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x4a8a2c]
 9: (Thread::_entry_func(void*)+0x12) [0x620fb2]
 10: (()+0x7971) [0x7f2462329971]
 11: (clone()+0x6d) [0x7f24609b492d]

Actions #32

Updated by Sage Weil over 12 years ago

Fixed that last thing with c9d0e556c7ad294819c60ca4e3cd4d0191811f18, but I think it's unrelated to the rest of this bug.

Actions #33

Updated by Josh Durgin over 12 years ago

This was the only failure in the run last night. Core at teuthology:~teut/coredump/1324284203.17720.core, build at teuthology:~teuthworker/archive/ceph_builds/nightly_coverage_2011-12-19-a.tgz

Actions #34

Updated by Sage Weil over 12 years ago

  • Position deleted (43)
  • Position set to 31
Actions #35

Updated by Sage Weil over 12 years ago

  • Priority changed from High to Urgent
Actions #36

Updated by Sage Weil over 12 years ago

  • Priority changed from Urgent to Normal
Actions #37

Updated by Sage Weil over 12 years ago

  • Target version deleted (v0.40)
  • Position deleted (56)
  • Position set to 3
Actions #38

Updated by Sage Weil over 12 years ago

  • Assignee deleted (Samuel Just)
Actions #39

Updated by Sage Weil about 12 years ago

  • Status changed from 12 to Can't reproduce