Bug #1530
osd crash during build_inc_scrub_map
Status: Closed
% Done: 0%
Description
Logs are in teuthology:~teuthworker/archive/nightly_coverage_2011-09-09/112/teuthology.log
osd/PG.cc: In function 'void PG::build_inc_scrub_map(ScrubMap&, eversion_t)', in thread '0x7f4634573700'
osd/PG.cc: 2896: FAILED assert(0)
ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
1: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x434) [0x6dfc64]
2: (PG::replica_scrub(MOSDRepScrub*)+0x8ff) [0x6e716f]
3: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x12a) [0x5c6dba]
4: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5993d2]
5: (ThreadPool::worker()+0x846) [0x662276]
6: (ThreadPool::WorkThread::entry()+0x15) [0x5a5765]
7: (Thread::_entry_func(void*)+0x12) [0x609492]
8: (()+0x7971) [0x7f4644201971]
9: (clone()+0x6d) [0x7f4642a9192d]
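For context on the crashing function: build_inc_scrub_map builds a scrub map incrementally, covering only objects changed since a given version instead of rescanning the whole PG. A minimal sketch of that idea follows; the types (Version, LogEntry) and the exact semantics are simplified assumptions for illustration, not the actual Ceph code.

    // Minimal sketch: walk the PG log forward from a known version and
    // record which objects were modified or deleted, so only those need
    // to be rescanned. Hypothetical simplified types throughout.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct Version {
      uint64_t epoch = 0, v = 0;
      bool operator<=(const Version& o) const {
        return epoch < o.epoch || (epoch == o.epoch && v <= o.v);
      }
    };

    enum class OpType { Modify, Delete };

    struct LogEntry {
      Version version;
      std::string oid;
      OpType op;
    };

    struct IncScrubMap {
      Version valid_through;                // state is reflected up to here
      std::map<std::string, bool> objects;  // oid -> still exists?
    };

    IncScrubMap build_inc_scrub_map(const std::vector<LogEntry>& log,
                                    Version since) {
      IncScrubMap m;
      m.valid_through = since;
      for (const auto& e : log) {
        if (e.version <= since)
          continue;  // already covered by the previous full map
        m.objects[e.oid] = (e.op != OpType::Delete);
        m.valid_through = e.version;
      }
      return m;
    }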
Updated by Sage Weil over 12 years ago
Updated by Josh Durgin over 12 years ago
This happened again, but with blogbench on cfuse. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-09-19/299/
Updated by Sage Weil over 12 years ago
- Target version changed from v0.36 to v0.37
Updated by Josh Durgin over 12 years ago
A third recurrence, this time with ffsb on rbd. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-10-05/142/
Updated by Sage Weil over 12 years ago
- Target version changed from v0.37 to v0.38
Updated by Josh Durgin over 12 years ago
- Subject changed from osd crash during kernel_untar_build workunit on top of rbd to osd crash during build_inc_scrub_map
This happened again during cfuse on ffsb (teuthology:~teuthworker/archive/nightly_coverage_2011-10-20/694).
Updated by Sage Weil over 12 years ago
- Status changed from New to Need More Info
Updated by Sage Weil over 12 years ago
I'm going to up the scrub frequency in the teuthology conf to help shake out these problems. There was another bug recently--the op reply reordering--that was triggered by scrubbing (and the requeuing of ops after scrub completion).
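For reference, this kind of tuning is a ceph conf override in the teuthology job yaml. A sketch of what that could look like; the exact option names, values, and override layout below are assumptions for illustration, not the change that was actually committed:

    overrides:
      ceph:
        conf:
          osd:
            # scrub much more often than the defaults so scrub-path
            # races surface within a nightly run (illustrative values)
            osd scrub min interval: 60
            osd scrub max interval: 300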
Updated by Sage Weil over 12 years ago
- Target version changed from v0.38 to v0.39
Updated by Josh Durgin over 12 years ago
This happened again in teuthology:~teuthworker/archive/nightly_coverage_2011-11-01/1254/remote/ubuntu@sepia68.ceph.dreamhost.com/log/osd.0.log.gz.
Updated by Sage Weil over 12 years ago
- Assignee deleted (Sage Weil)
Can someone work on reproducing this? See metropolis:~sage/src/teuthology/j.1530 and hammer.sh
Updated by Josh Durgin over 12 years ago
Two more tests hit this last night, and two other osds crashed due to an assert in build_inc_scrub_map:
osd/PG.cc: In function 'void PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)', in thread '7f81f1a72700'
osd/PG.cc: 2573: FAILED assert(!o.negative)
ceph version 0.37-243-g2fc01b5 (commit:2fc01b5292f8c1777c546583543e1b4ab6451e79)
1: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)+0x78b) [0x704eeb]
2: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x2a6) [0x705256]
3: (PG::build_scrub_map(ScrubMap&)+0x37a) [0x708dea]
4: (PG::scrub()+0x799) [0x740cd9]
5: (OSD::ScrubWQ::_process(PG*)+0x14) [0x5af8f4]
6: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x5a62d2]
7: (ThreadPool::worker()+0x7e3) [0x682f03]
8: (ThreadPool::WorkThread::entry()+0x15) [0x5afae5]
9: (Thread::_entry_func(void*)+0x12) [0x60f252]
10: (()+0x7971) [0x7f8201e00971]
11: (clone()+0x6d) [0x7f820069092d]
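For context on assert(!o.negative): a "negative" scrub-map entry records an object known to be absent, and _scan_list is handed a list of objects the log says should exist. A rough sketch of that invariant, with hypothetical simplified types (ScrubObject, a toy in-memory store) rather than the real PG/ObjectStore code:

    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    struct ScrubObject {
      uint64_t size = 0;
      bool negative = false;  // true: object recorded as absent
    };

    // Toy stand-in for ObjectStore::stat: fills *size and returns 0 on
    // success, <0 if the object does not exist.
    static std::map<std::string, uint64_t> g_store;
    static int store_stat(const std::string& oid, uint64_t* size) {
      auto it = g_store.find(oid);
      if (it == g_store.end()) return -1;
      *size = it->second;
      return 0;
    }

    void scan_list(std::map<std::string, ScrubObject>& objects,
                   const std::vector<std::string>& ls) {
      for (const auto& oid : ls) {
        ScrubObject& o = objects[oid];
        assert(!o.negative);  // the crash above is this invariant failing:
                              // the map already recorded the object absent
        uint64_t size = 0;
        if (store_stat(oid, &size) == 0)
          o.size = size;
        else
          o.negative = true;  // object vanished; record the absence
      }
    }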
Updated by Josh Durgin over 12 years ago
Possibly related: the snaps workunit failed yesterday and today with bad stats:
2011-11-01 09:12:33.948653 osd.0 10.3.14.133:6800/1000 31 : [ERR] 0.0 scrub stat mismatch, got 295/295 objects, 7/7 clones, 9712167/6337823 bytes, 9655/6356 kb." in cluster log
1317 FAIL scheduled_teuthology@teuthology collection:basic clusters:fixed-3.yaml tasks:cfuse_workunit_snaps.yaml "2011-11-02 00:34:10.145251 osd.0 10.3.14.131:6800/895 17 : [ERR] 0.1 scrub stat mismatch, got 145/145 objects, 22/22 clones, 8796326/8591881 bytes, 8663/8451 kb." in cluster log
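The [ERR] lines compare totals counted during scrub against the PG's recorded stats, printed as got <scrubbed>/<recorded> for each counter. A small sketch of that comparison shape; PGStats here is a hypothetical stand-in, not the actual stat structures:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>

    struct PGStats { uint64_t objects = 0, clones = 0, bytes = 0, kb = 0; };

    // Returns true when the scrubbed totals match the recorded stats;
    // otherwise logs the pairs in the same shape as the lines above.
    bool check_scrub_stats(const PGStats& got, const PGStats& rec) {
      if (got.objects == rec.objects && got.clones == rec.clones &&
          got.bytes == rec.bytes && got.kb == rec.kb)
        return true;
      std::fprintf(stderr,
          "[ERR] scrub stat mismatch, got %" PRIu64 "/%" PRIu64 " objects, "
          "%" PRIu64 "/%" PRIu64 " clones, %" PRIu64 "/%" PRIu64 " bytes, "
          "%" PRIu64 "/%" PRIu64 " kb\n",
          got.objects, rec.objects, got.clones, rec.clones,
          got.bytes, rec.bytes, got.kb, rec.kb);
      return false;
    }

Note that in both log lines above the object and clone counts match while the byte and kb totals differ, which suggests size accounting diverged rather than an object going missing.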
Updated by Samuel Just over 12 years ago
c861ee105475b3f20f64f51b8611f9b69207ca8c should take care of the assert(!o.negative) error. Still trying to reproduce the other one.
Updated by Samuel Just over 12 years ago
Alright, in IRC, slb seems to have hit a related bug... with logging!
Updated by Samuel Just over 12 years ago
My best guess about this bug goes something like this:
1) osd->generate_backlog does pg->assemble_backlog
2) osd->generate_backlog drops the pg lock to grab the osd_map read lock
3) ...which is held by osd->handle_osd_map
4) at the end of osd->handle_osd_map, we call write_info on the pg since it has progressed to a new peering interval
5) osd->generate_backlog gets the read_lock and the pg lock and promptly bails since the backlog generation has been cancelled
6) osd dies, but not before the write_info transaction is durable
The result is that the in-memory backlog generated in assemble_backlog never makes it to disk, but the updated info does, leaving the ondisklog inconsistent with the pg info on OSD restart.
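To make the suspected interleaving concrete, here is a schematic sketch of the sequence, with hypothetical simplified locking and flags; it is not the actual OSD code:

    #include <mutex>
    #include <shared_mutex>

    std::mutex pg_lock;
    std::shared_mutex osd_map_lock;  // handle_osd_map holds this exclusively
    bool backlog_cancelled = false;

    void generate_backlog() {
      pg_lock.lock();
      // 1) assemble_backlog(): the backlog exists only in memory here
      pg_lock.unlock();              // 2) drop pg lock to take the map lock
      osd_map_lock.lock_shared();    // 3) blocks while handle_osd_map runs
      pg_lock.lock();
      if (backlog_cancelled) {       // 5) new interval cancelled the backlog
        pg_lock.unlock();
        osd_map_lock.unlock_shared();
        return;                      // the in-memory backlog is never written
      }
      // (not reached in this interleaving: persist backlog + info together)
      pg_lock.unlock();
      osd_map_lock.unlock_shared();
    }

    void handle_osd_map() {
      osd_map_lock.lock();           // 3) held across map handling
      backlog_cancelled = true;      // pg advanced to a new peering interval
      // 4) write_info(): the updated info alone becomes durable
      osd_map_lock.unlock();
      // 6) if the osd dies now, on-disk log and info disagree on restart
    }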
Updated by Samuel Just over 12 years ago
Fixed by ae41f3232a39dbf33487ab02cbac292f58debea8.
Updated by Samuel Just over 12 years ago
- Status changed from Need More Info to Resolved
Updated by Anonymous over 12 years ago
We just found surprisingly similar stack traces in three of last night's failures:
nightly_coverage_2011-11-10/1740/remote/ubuntu@sepia40.ceph.dreamhost.com/log/osd.1.log.gz
nightly_coverage_2011-11-10/1747/remote/ubuntu@sepia54.ceph.dreamhost.com/log/osd.1.log.gz
nightly_coverage_2011-11-10/1766/remote/ubuntu@sepia40.ceph.dreamhost.com/log/osd.1.log.gz
Updated by Anonymous over 12 years ago
Still happening in the 11/11 nightly:
1814/remote/ubuntu@sepia55.ceph.dreamhost.com/log/osd.1.log.gz
Updated by Josh Durgin over 12 years ago
Got a couple more of these today: teuthworker/archive/nightly_coverage_2011-11-18-2/2649/remote/ubuntu@sepia56.ceph.dreamhost.com/log/osd.1.log.gz
Updated by Sage Weil over 12 years ago
Updated by Anonymous over 12 years ago
Both of the above-described variants occurred in the 11/21 nightlies
(2775:sepia17, 2783:sepia81, 2805:sepia82)
Updated by Josh Durgin over 12 years ago
A new crash during scrub from /var/lib/teuthworker/archive/nightly_coverage_2011-11-23-b/3051/remote/ubuntu@sepia71.ceph.dreamhost.com/log/osd.1.log.gz
osd/PG.cc: In function 'void PG::replica_scrub(MOSDRepScrub*)', in thread '7ffa73315700'
osd/PG.cc: 2865: FAILED assert(last_update_applied == info.last_update)
ceph version 0.38-241-g32a6837 (commit:32a6837839acd7bf7135e356b45a8636140f8eb1)
1: (PG::replica_scrub(MOSDRepScrub*)+0x275) [0x729405]
2: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x10a) [0x5e4c9a]
3: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5b92c2]
4: (ThreadPool::worker()+0x7e3) [0x698913]
5: (ThreadPool::WorkThread::entry()+0x15) [0x5c1325]
6: (Thread::_entry_func(void*)+0x12) [0x624102]
7: (()+0x7971) [0x7ffa839a9971]
8: (clone()+0x6d) [0x7ffa8203492d]
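For context, the failed assert is a precondition: the replica should only build its scrub map once everything up to info.last_update has actually been applied to the store, so the map reflects the state the primary asked about. A hedged sketch of that invariant with hypothetical simplified types; one plausible fix (whether that is what was eventually done here is not recorded in this ticket) would be to requeue the scrub until the versions catch up instead of asserting:

    #include <cassert>
    #include <cstdint>

    struct Version {
      uint64_t epoch = 0, v = 0;
      bool operator==(const Version& o) const {
        return epoch == o.epoch && v == o.v;
      }
    };

    struct PGState {
      Version last_update_applied;  // last log entry flushed to the store
      Version last_update;          // head of the pg log (info.last_update)
    };

    void replica_scrub(const PGState& pg) {
      // The crash above is this firing: a rep-scrub request was processed
      // while applied state still lagged the log head.
      assert(pg.last_update_applied == pg.last_update);
      // ... build the scrub map and send it to the primary ...
    }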
Updated by Sage Weil over 12 years ago
- Target version changed from v0.39 to v0.40
Updated by Josh Durgin over 12 years ago
Another one crashed in PG::replica_scrub yesterday. Core is in teuthology:~teuthworker/archive/nightly_coverage_2011-12-05-a/3647/remote/ubuntu@sepia44.ceph.dreamhost.com/coredump/1323087877.10117.core
Updated by Josh Durgin over 12 years ago
I'm guessing this is the new incarnation of this issue?
From teuthology:~teuthworker/archive/nightly_coverage_2011-12-13-a/4202/remote/ubuntu@sepia63.ceph.dreamhost.com/log/osd.1.log.gz:
osd/OSD.cc: In function 'MOSDMap* OSD::build_incremental_map_msg(epoch_t, epoch_t)', in thread '7f24566a0700'
osd/OSD.cc: 3662: FAILED assert(0)
ceph version 0.39-96-g6d5e5bd (commit:6d5e5bdb1ae38cfa30e852d53b4e4a399ad3861c)
1: (OSD::build_incremental_map_msg(unsigned int, unsigned int)+0x4df) [0x5911ef]
2: (OSD::send_incremental_map(unsigned int, entity_inst_t const&, bool)+0xa8) [0x591378]
3: (OSD::_share_map_incoming(entity_inst_t const&, unsigned int, OSD::Session*)+0x74f) [0x59728f]
4: (OSD::handle_op(MOSDOp*)+0x6fa) [0x59b17a]
5: (OSD::_dispatch(Message*)+0x9df) [0x5b0baf]
6: (OSD::ms_dispatch(Message*)+0x1b0) [0x5b1ea0]
7: (SimpleMessenger::dispatch_entry()+0x99a) [0x626b0a]
8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x4a8a2c]
9: (Thread::_entry_func(void*)+0x12) [0x620fb2]
10: (()+0x7971) [0x7f2462329971]
11: (clone()+0x6d) [0x7f24609b492d]
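For context on this trace: build_incremental_map_msg walks the requested epoch range and loads the stored incremental map for each epoch, so the assert corresponds to an epoch whose incremental cannot be found. A sketch under those assumptions; MapStore is a hypothetical stand-in for the OSD's map storage, not the real API:

    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    using epoch_t = uint32_t;

    struct MapStore {
      std::map<epoch_t, std::string> incrementals;  // epoch -> encoded inc
      bool get_inc(epoch_t e, std::string* out) const {
        auto it = incrementals.find(e);
        if (it == incrementals.end())
          return false;
        *out = it->second;
        return true;
      }
    };

    // Collect the incremental maps for epochs (since, to].
    std::vector<std::string> build_incremental_map_msg(const MapStore& store,
                                                       epoch_t since,
                                                       epoch_t to) {
      std::vector<std::string> msg;
      for (epoch_t e = since + 1; e <= to; ++e) {
        std::string inc;
        bool found = store.get_inc(e, &inc);
        assert(found);  // the crash above is this failing for some epoch
        msg.push_back(inc);
      }
      return msg;
    }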
Updated by Sage Weil over 12 years ago
Fixed that last thing with c9d0e556c7ad294819c60ca4e3cd4d0191811f18, but I think it's unrelated to the rest of this bug.
Updated by Josh Durgin over 12 years ago
This was the only failure in the run last night. Core at teuthology:~teuthworker/archive/nightly_coverage_2011-12-19-a/4464/remote/ubuntu@sepia64.ceph.dreamhost.com/coredump/1324284203.17720.core, build at teuthology:~teuthworker/archive/ceph_builds/nightly_coverage_2011-12-19-a.tgz
Updated by Sage Weil over 12 years ago
Updated by Sage Weil over 12 years ago
- Target version deleted (v0.40)
Updated by Sage Weil about 12 years ago
- Status changed from Resolved to Can't reproduce