
Bug #10433

OSD osd/ReplicatedPG.cc: 5540: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

Added by Laurent GUERBY over 4 years ago. Updated over 4 years ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
12/26/2014
Due date:
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Hi,

We just got this on one of our OSD after a few days in production of ceph version 0.87-73-g70a5569 (70a5569e34786d4124e37561473f1aa02c80f779):

2014-12-26 08:54:32.320625 7f116dd9b700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7f116dd9b700 time 2014-12-26 08:54:32.310448
osd/ReplicatedPG.cc: 5540: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

ceph version 0.87-73-g70a5569 (70a5569e34786d4124e37561473f1aa02c80f779)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xbc8056]
2: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x2087) [0x851057]
3: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x730) [0x8519c0]
4: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x24c) [0x85242c]
5: (C_Flush::finish(int)+0x99) [0x8c22c9]
6: (Context::complete(int)+0x9) [0x69b8d9]
7: (Finisher::finisher_thread_entry()+0x158) [0xafc4c8]
8: (()+0x80a4) [0x7f11966b90a4]
9: (clone()+0x6d) [0x7f1194c13ccd]

...
-10> 2014-12-26 08:54:32.253186 7f115d40a700 5 - op tracker -- seq: 25018348, time: 2014-12-26 08:54:32.253151, event: header_read, op: osd_sub_op_reply(client.11180247.0:11830200 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-9> 2014-12-26 08:54:32.253188 7f115d40a700 5 - op tracker -- seq: 25018348, time: 2014-12-26 08:54:32.253151, event: throttled, op: osd_sub_op_reply(client.11180247.0:11830200 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-8> 2014-12-26 08:54:32.253200 7f115d40a700 5 - op tracker -- seq: 25018348, time: 2014-12-26 08:54:32.253178, event: all_read, op: osd_sub_op_reply(client.11180247.0:11830200 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-7> 2014-12-26 08:54:32.253202 7f115d40a700 5 - op tracker -- seq: 25018348, time: 0.000000, event: dispatched, op: osd_sub_op_reply(client.11180247.0:11830200 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-6> 2014-12-26 08:54:32.253456 7f115d40a700 1 - 192.168.99.251:6804/8026 <== osd.8 192.168.99.247:6811/16157 1319599 ==== osd_sub_op_reply(client.11180247.0:11830201 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0) v2 ==== 171+0+0 (1299343257 0 0) 0x488d9180 con 0x2645c940
-5> 2014-12-26 08:54:32.253474 7f115d40a700 5 - op tracker -- seq: 25018349, time: 2014-12-26 08:54:32.253432, event: header_read, op: osd_sub_op_reply(client.11180247.0:11830201 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-4> 2014-12-26 08:54:32.253477 7f115d40a700 5 - op tracker -- seq: 25018349, time: 2014-12-26 08:54:32.253433, event: throttled, op: osd_sub_op_reply(client.11180247.0:11830201 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-3> 2014-12-26 08:54:32.253491 7f115d40a700 5 - op tracker -- seq: 25018349, time: 2014-12-26 08:54:32.253454, event: all_read, op: osd_sub_op_reply(client.11180247.0:11830201 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-2> 2014-12-26 08:54:32.253493 7f115d40a700 5 - op tracker -- seq: 25018349, time: 0.000000, event: dispatched, op: osd_sub_op_reply(client.11180247.0:11830201 58.cb 96d664cb/rb.0.10061c.2ae8944a.0000000007b5/head//58 [] ondisk, result = 0)
-1> 2014-12-26 08:54:32.304854 7f118af6d700 5 - op tracker -- seq: 25018288, time: 2014-12-26 08:54:32.304854, event: journaled_completion_queued, op: osd_op(client.11180247.0:11830225 rb.0.10061c.2ae8944a.0000000013f5 [set-alloc-hint object_size 4194304 write_size 4194304,write 557056~20480] 58.f0ed751c ack+ondisk+write+known_if_redirected e30192)
0> 2014-12-26 08:54:32.320625 7f116dd9b700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7f116dd9b700 time 2014-12-26 08:54:32.310448
osd/ReplicatedPG.cc: 5540: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

ceph version 0.87-73-g70a5569 (70a5569e34786d4124e37561473f1aa02c80f779)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x76) [0xbc8056]
2: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x2087) [0x851057]
3: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x730) [0x8519c0]
4: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x24c) [0x85242c]
5: (C_Flush::finish(int)+0x99) [0x8c22c9]
6: (Context::complete(int)+0x9) [0x69b8d9]
7: (Finisher::finisher_thread_entry()+0x158) [0xafc4c8]
8: (()+0x80a4) [0x7f11966b90a4]
9: (clone()+0x6d) [0x7f1194c13ccd]

This happened just after the creation of a new OpenStack Cinder volume in an erasure-coded pool (4+1); that may or may not be related.

ceph-report-20141226.txt.gz is attached, along with the ceph-osd log.

ceph-report-20141226.txt.gz (2.28 MB) Laurent GUERBY, 12/26/2014 12:14 AM

ceph-osd.0.log.gz (239 KB) Laurent GUERBY, 12/26/2014 12:14 AM

ceph-osd.24.log (976 KB) Irek Fasikhov, 02/26/2015 01:34 PM


Related issues

Duplicates Ceph - Bug #8011: osd/ReplicatedPG.cc: 5244: FAILED assert(soid < scrubber.start || soid >= scrubber.end) Resolved 04/07/2014

History

#1 Updated by Laurent GUERBY over 4 years ago

Just got two more OSDs with the same assert failure; I set noscrub and nodeep-scrub just in case.

#2 Updated by Samuel Just over 4 years ago

  • Status changed from New to Duplicate

#3 Updated by Irek Fasikhov over 4 years ago

Samuel,
this is not a duplicate of #8011. Today I hit the same error on a cache pool. I think it is tied to the cache pool, since "ReplicatedPG::try_flush_mark_clean" is called during scrub.

osd/ReplicatedPG.cc: 5318: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x32f5) [0x889275]
 2: (ReplicatedPG::try_flush_mark_clean(boost::shared_ptr<ReplicatedPG::FlushOp>)+0x7df) [0x88a8cf]
 3: (ReplicatedPG::finish_flush(hobject_t, unsigned long, int)+0x2ea) [0x88b17a]
 4: (C_Flush::finish(int)+0xb7) [0x8e28b7]
 5: (Context::complete(int)+0x9) [0x667209]
 6: (Finisher::finisher_thread_entry()+0x1d8) [0x9ed148]
 7: (()+0x79d1) [0x7ff440c279d1]
 8: (clone()+0x6d) [0x7ff43fba08fd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2015-01-29 21:50:19.076172 7ff43d2ee700  1 do_command 'perfcounters_dump' 'result is 12842 bytes

#4 Updated by Irek Fasikhov over 4 years ago

Unfortunately, I still cannot reproduce this on a test setup.

#5 Updated by Keysource Support over 4 years ago

We are facing the same problem on our Proxmox VE cluster. It now occurs every night on our SSD cache pool and seems to be getting worse as the workload grows: our production pool becomes unavailable every night, all I/O freezes, and the virtual machines are unresponsive whenever one or two OSDs in the cache pool go down.

2015-02-25 04:15:16.683363 7f5f2a7dc700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)' thread 7f5f2a7dc700 time 2015-02-25 04:15:16.663532
osd/ReplicatedPG.cc: 5318: FAILED assert(soid < scrubber.start || soid >= scrubber.end)

 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7)
 1: (ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)+0x1320) [0x929d00]
 2: (ReplicatedPG::finish_promote(int, std::tr1::shared_ptr<OpRequest>, ReplicatedPG::CopyResults*, std::tr1::shared_ptr<ObjectContext>)+0x80e) [0x92cade]
 3: (PromoteCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x75) [0x986cf5]
 4: (GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x15) [0x955675]
 5: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x441) [0x93edd1]
 6: (C_Copyfrom::finish(int)+0x86) [0x9878b6]
 7: (Context::complete(int)+0x9) [0x78fd79]
 8: (Finisher::finisher_thread_entry()+0x1c8) [0xad6368]
 9: (()+0x6b50) [0x7f5f55856b50]
 10: (clone()+0x6d) [0x7f5f544647bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#6 Updated by Irek Fasikhov over 4 years ago

#7 Updated by Laurent GUERBY over 4 years ago

Same crash with

ceph version 0.87.1-2-g2cb8a2d (2cb8a2da6d46bc7c9020b1dc040e9062f034f66f)

I assume the backport to giant is still on the TODO list.
