Bug #1530
osd crash during build_inc_scrub_map
Description
Logs are in teuthology:~teuthworker/archive/nightly_coverage_2011-09-09/112/teuthology.log
osd/PG.cc: In function 'void PG::build_inc_scrub_map(ScrubMap&, eversion_t)', in thread '0x7f4634573700'
osd/PG.cc: 2896: FAILED assert(0)
 ceph version 0.34-480-g4281f02 (commit:4281f02193539ee428cf98ad10a2277071ac341f)
 1: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x434) [0x6dfc64]
 2: (PG::replica_scrub(MOSDRepScrub*)+0x8ff) [0x6e716f]
 3: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x12a) [0x5c6dba]
 4: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5993d2]
 5: (ThreadPool::worker()+0x846) [0x662276]
 6: (ThreadPool::WorkThread::entry()+0x15) [0x5a5765]
 7: (Thread::_entry_func(void*)+0x12) [0x609492]
 8: (()+0x7971) [0x7f4644201971]
 9: (clone()+0x6d) [0x7f4642a9192d]
History
#1 Updated by Sage Weil about 12 years ago
- Position set to 42
#2 Updated by Josh Durgin about 12 years ago
This happened again, but with blogbench on cfuse. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-09-19/299/
#3 Updated by Sage Weil about 12 years ago
- Target version changed from v0.36 to v0.37
#4 Updated by Josh Durgin about 12 years ago
A third recurrence, this time with ffsb on rbd. Logs in teuthology:~teuthworker/archive/nightly_coverage_2011-10-05/142/
#5 Updated by Sage Weil about 12 years ago
- Priority changed from Normal to High
#6 Updated by Sage Weil about 12 years ago
- Target version changed from v0.37 to v0.38
#7 Updated by Josh Durgin about 12 years ago
- Subject changed from osd crash during kernel_untar_build workunit on top of rbd to osd crash during build_inc_scrub_map
This happened again during cfuse on ffsb (teuthology:~teuthworker/archive/nightly_coverage_2011-10-20/694).
#8 Updated by Sage Weil about 12 years ago
- Status changed from New to Need More Info
#9 Updated by Sage Weil about 12 years ago
I'm going to up the scrub frequency in the teuthology conf to help shake out these problems. There was another bug recently--the op reply reordering--that was triggered by scrubbing (and the requeuing of ops after scrub completion).
#10 Updated by Sage Weil about 12 years ago
- Assignee set to Sage Weil
#11 Updated by Sage Weil about 12 years ago
- Target version changed from v0.38 to v0.39
#12 Updated by Josh Durgin about 12 years ago
This happened again in teuthology:~teuthworker/archive/nightly_coverage_2011-11-01/1254/remote/ubuntu@sepia68.ceph.dreamhost.com/log/osd.0.log.gz.
#13 Updated by Sage Weil about 12 years ago
- Assignee deleted (Sage Weil)
Can someone work on reproducing this? See metropolis:~sage/src/teuthology/j.1530 and hammer.sh.
#14 Updated by Samuel Just about 12 years ago
- Assignee set to Samuel Just
#15 Updated by Josh Durgin about 12 years ago
Two more tests hit this last night, and two other osds crashed due to an assert in build_inc_scrub_map:
osd/PG.cc: In function 'void PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)', in thread '7f81f1a72700'
osd/PG.cc: 2573: FAILED assert(!o.negative)
 ceph version 0.37-243-g2fc01b5 (commit:2fc01b5292f8c1777c546583543e1b4ab6451e79)
 1: (PG::_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> >&)+0x78b) [0x704eeb]
 2: (PG::build_inc_scrub_map(ScrubMap&, eversion_t)+0x2a6) [0x705256]
 3: (PG::build_scrub_map(ScrubMap&)+0x37a) [0x708dea]
 4: (PG::scrub()+0x799) [0x740cd9]
 5: (OSD::ScrubWQ::_process(PG*)+0x14) [0x5af8f4]
 6: (ThreadPool::WorkQueue<PG>::_void_process(void*)+0x12) [0x5a62d2]
 7: (ThreadPool::worker()+0x7e3) [0x682f03]
 8: (ThreadPool::WorkThread::entry()+0x15) [0x5afae5]
 9: (Thread::_entry_func(void*)+0x12) [0x60f252]
 10: (()+0x7971) [0x7f8201e00971]
 11: (clone()+0x6d) [0x7f820069092d]
#16 Updated by Josh Durgin about 12 years ago
Possibly related: the snaps workunit failed yesterday and today with bad stats:
"2011-11-01 09:12:33.948653 osd.0 10.3.14.133:6800/1000 31 : [ERR] 0.0 scrub stat mismatch, got 295/295 objects, 7/7 clones, 9712167/6337823 bytes, 9655/6356 kb." in cluster log
1317 FAIL scheduled_teuthology@teuthology collection:basic clusters:fixed-3.yaml tasks:cfuse_workunit_snaps.yaml
"2011-11-02 00:34:10.145251 osd.0 10.3.14.131:6800/895 17 : [ERR] 0.1 scrub stat mismatch, got 145/145 objects, 22/22 clones, 8796326/8591881 bytes, 8663/8451 kb." in cluster log
#17 Updated by Samuel Just about 12 years ago
c861ee105475b3f20f64f51b8611f9b69207ca8c should take care of the assert(!o.negative) error. Still trying to reproduce the other one.
#18 Updated by Samuel Just about 12 years ago
Alright, in irc, slb seems to have hit a related bug...with logging!
#19 Updated by Samuel Just about 12 years ago
My best guess about this bug goes something like this:
1) osd->generate_backlog does pg->assemble_backlog
2) osd->generate_backlog drops the pg lock to grab the osd_map read lock
3) ...which is held by osd->handle_osd_map
4) at the end of osd->handle_osd_map, we call write_info on the pg since it has progressed to a new peering interval
5) osd->generate_backlog gets the read_lock and the pg lock and promptly bails since the backlog generation has been cancelled
6) osd dies, but not before the write_info transaction is durable
The result is that the in-memory backlog generated in assemble_backlog never makes it to disk, but the updated info does, leaving the ondisklog inconsistent with the pg info on OSD restart (see the sketch below).
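To make the suspected interleaving concrete, here is a minimal, self-contained sketch that models it with standard C++ threads and locks. The function names (generate_backlog, assemble_backlog, handle_osd_map, write_info) mirror the steps above; everything else (the boolean flags and the sleeps used to force the ordering) is illustrative stand-in code, not the actual OSD implementation.

#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::mutex pg_lock;                // per-PG lock
std::shared_mutex osdmap_lock;     // OSD map read/write lock

bool backlog_in_memory = false;    // set by the stand-in for assemble_backlog
bool backlog_on_disk   = false;    // never set on this path
bool info_on_disk      = false;    // set by the stand-in for write_info
bool backlog_cancelled = false;    // a new peering interval started

void handle_osd_map() {
  std::unique_lock<std::shared_mutex> m(osdmap_lock);  // holds the map lock (step 3)
  // give generate_backlog time to assemble the backlog and block on the map lock
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  std::lock_guard<std::mutex> l(pg_lock);
  backlog_cancelled = true;        // PG moves to a new peering interval
  info_on_disk = true;             // step 4: write_info transaction becomes durable
}                                  // map lock released

void generate_backlog() {
  {
    std::lock_guard<std::mutex> l(pg_lock);
    backlog_in_memory = true;      // step 1: backlog assembled in memory only
  }                                // step 2: pg lock dropped to take the map lock
  std::shared_lock<std::shared_mutex> m(osdmap_lock);  // blocks behind handle_osd_map
  std::lock_guard<std::mutex> l(pg_lock);              // step 5: both locks finally held
  if (backlog_cancelled)
    return;                        // bail out; the in-memory backlog never reaches disk
  backlog_on_disk = true;          // not reached in this interleaving
}

int main() {
  std::thread map_thread(handle_osd_map);              // map handling grabs the lock first
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  std::thread backlog_thread(generate_backlog);
  backlog_thread.join();
  map_thread.join();
  // step 6: an OSD crash at this point leaves info_on_disk=true but
  // backlog_on_disk=false, i.e. the ondisklog no longer matches the pg info.
  std::cout << "info_on_disk=" << info_on_disk
            << " backlog_on_disk=" << backlog_on_disk << std::endl;
  return 0;
}

Built with -std=c++17, this should print info_on_disk=1 backlog_on_disk=0, the same inconsistent on-disk state described above.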
#20 Updated by Samuel Just about 12 years ago
ae41f3232a39dbf33487ab02cbac292f58debea8
#21 Updated by Samuel Just about 12 years ago
- Status changed from Need More Info to Resolved
#22 Updated by Anonymous about 12 years ago
We just found surprisingly similar stack traces in three of last night's failures:
nightly_coverage_2011-11-10/1740/remote/ubuntu@sepia40.ceph.dreamhost.com/log/osd.1.log.gz
nightly_coverage_2011-11-10/1747/remote/ubuntu@sepia54.ceph.dreamhost.com/log/osd.1.log.gz
nightly_coverage_2011-11-10/1766/remote/ubuntu@sepia40.ceph.dreamhost.com/log/osd.1.log.gz
#23 Updated by Sage Weil about 12 years ago
- Status changed from Resolved to 12
#24 Updated by Anonymous about 12 years ago
Still happening in the 11/11 nightly:
1814/remote/ubuntu@sepia55.ceph.dreamhost.com/log/osd.1.log.gz
#25 Updated by Josh Durgin about 12 years ago
Got a couple more of these today: teuthworker/archive/nightly_coverage_2011-11-18-2/2649/remote/ubuntu@sepia56.ceph.dreamhost.com/log/osd.1.log.gz
#26 Updated by Sage Weil about 12 years ago
- Position deleted (99)
- Position set to 6
#27 Updated by Anonymous about 12 years ago
Both of the variants described above occurred in the 11/21 nightlies
(2775:sepia17, 2783:sepia81, 2805:sepia82).
#28 Updated by Josh Durgin about 12 years ago
A new crash during scrub from /var/lib/teuthworker/archive/nightly_coverage_2011-11-23-b/3051/remote/ubuntu@sepia71.ceph.dreamhost.com/log/osd.1.log.gz
osd/PG.cc: In function 'void PG::replica_scrub(MOSDRepScrub*)', in thread '7ffa73315700'
osd/PG.cc: 2865: FAILED assert(last_update_applied == info.last_update)
 ceph version 0.38-241-g32a6837 (commit:32a6837839acd7bf7135e356b45a8636140f8eb1)
 1: (PG::replica_scrub(MOSDRepScrub*)+0x275) [0x729405]
 2: (OSD::RepScrubWQ::_process(MOSDRepScrub*)+0x10a) [0x5e4c9a]
 3: (ThreadPool::WorkQueue<MOSDRepScrub>::_void_process(void*)+0x12) [0x5b92c2]
 4: (ThreadPool::worker()+0x7e3) [0x698913]
 5: (ThreadPool::WorkThread::entry()+0x15) [0x5c1325]
 6: (Thread::_entry_func(void*)+0x12) [0x624102]
 7: (()+0x7971) [0x7ffa839a9971]
 8: (clone()+0x6d) [0x7ffa8203492d]
#29 Updated by Sage Weil about 12 years ago
- Target version changed from v0.39 to v0.40
#30 Updated by Josh Durgin almost 12 years ago
Another one crashed in PG::replica_scrub yesterday. Core is in teuthology:~teuthworker/archive/nightly_coverage_2011-12-05-a/3647/remote/ubuntu@sepia44.ceph.dreamhost.com/coredump/1323087877.10117.core
#31 Updated by Josh Durgin almost 12 years ago
I'm guessing this is the new incarnation of this issue?
From teuthology:~teuthworker/archive/nightly_coverage_2011-12-13-a/4202/remote/ubuntu@sepia63.ceph.dreamhost.com/log/osd.1.log.gz:
osd/OSD.cc: In function 'MOSDMap* OSD::build_incremental_map_msg(epoch_t, epoch_t)', in thread '7f24566a0700'
osd/OSD.cc: 3662: FAILED assert(0)
 ceph version 0.39-96-g6d5e5bd (commit:6d5e5bdb1ae38cfa30e852d53b4e4a399ad3861c)
 1: (OSD::build_incremental_map_msg(unsigned int, unsigned int)+0x4df) [0x5911ef]
 2: (OSD::send_incremental_map(unsigned int, entity_inst_t const&, bool)+0xa8) [0x591378]
 3: (OSD::_share_map_incoming(entity_inst_t const&, unsigned int, OSD::Session*)+0x74f) [0x59728f]
 4: (OSD::handle_op(MOSDOp*)+0x6fa) [0x59b17a]
 5: (OSD::_dispatch(Message*)+0x9df) [0x5b0baf]
 6: (OSD::ms_dispatch(Message*)+0x1b0) [0x5b1ea0]
 7: (SimpleMessenger::dispatch_entry()+0x99a) [0x626b0a]
 8: (SimpleMessenger::DispatchThread::entry()+0x2c) [0x4a8a2c]
 9: (Thread::_entry_func(void*)+0x12) [0x620fb2]
 10: (()+0x7971) [0x7f2462329971]
 11: (clone()+0x6d) [0x7f24609b492d]
#32 Updated by Sage Weil almost 12 years ago
Fixed that last thing with c9d0e556c7ad294819c60ca4e3cd4d0191811f18, but I think it's unrelated to the rest of this bug.
#33 Updated by Josh Durgin almost 12 years ago
This was the only failure in the run last night. Core at teuthology:~teuthworker/archive/nightly_coverage_2011-12-19-a/4464/remote/ubuntu@sepia64.ceph.dreamhost.com/coredump/1324284203.17720.core, build at teuthology:~teuthworker/archive/ceph_builds/nightly_coverage_2011-12-19-a.tgz
#34 Updated by Sage Weil almost 12 years ago
- Position deleted (43)
- Position set to 31
#35 Updated by Sage Weil almost 12 years ago
- Priority changed from High to Urgent
#36 Updated by Sage Weil almost 12 years ago
- Priority changed from Urgent to Normal
#37 Updated by Sage Weil almost 12 years ago
- Target version deleted (v0.40)
- Position deleted (56)
- Position set to 3
#38 Updated by Sage Weil almost 12 years ago
- Assignee deleted (Samuel Just)
#39 Updated by Sage Weil almost 12 years ago
- Status changed from 12 to Can't reproduce