Bug #11156

FAILED assert(soid < scrubber.start || soid >= scrubber.end)

Added by Samuel Just about 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Q/A
Tags:
Backport: firefly
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This is not the same as the previous one. In this case, it's a promote racing with scrub. Odd; I thought that was handled by the blocked status on the obc.
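
For context, a minimal self-contained sketch (invented names, not the actual Ceph source) of the invariant the failed assert enforces: while the scrubber holds the half-open chunk [scrubber.start, scrubber.end), no mutation may land inside it, and finish_ctx asserts exactly that.

#include <cassert>

// Stand-in for the real hobject_t sort key; only the ordering matters here.
struct hobject_t {
    unsigned hash = 0;
    bool operator<(const hobject_t& o) const { return hash < o.hash; }
    bool operator>=(const hobject_t& o) const { return !(*this < o); }
};

struct Scrubber {
    hobject_t start, end;  // chunk currently under scrub, half-open
};

// finish_ctx-style check: the object being written must lie outside the
// scrubbed chunk, otherwise the scrub results would be invalidated.
void assert_outside_scrub_chunk(const hobject_t& soid, const Scrubber& scrubber) {
    assert(soid < scrubber.start || soid >= scrubber.end);
}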

2015-03-18 03:03:52.568891 7f0bb77ac700 10 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] process_copy_chunk 151ebcaf/bar/head/test-rados-api-plana57-5515-37/94 tid 97 (0) Success
2015-03-18 03:03:52.568934 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] _write_copy_chunk 0x49356c0 0 attrs 5 bytes 0 omap header bytes 0 omap data bytes
2015-03-18 03:03:52.568960 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] process_copy_chunk got digest: rx data 0xbf96af60 omap 0xffffffff, source: data 0xbf96af60 omap 0xffffffff flags 3
2015-03-18 03:03:52.568980 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] process_copy_chunk success; committing
2015-03-18 03:03:52.568991 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] process_copy_chunk complete r = (0) Success
2015-03-18 03:03:52.569006 7f0bb77ac700 10 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] finish_promote 151ebcaf/bar/head/test-rados-api-plana57-5515-37/94 r=0 uv6
2015-03-18 03:03:52.569018 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] simple_repop_create 151ebcaf/bar/head/test-rados-api-plana57-5515-37/94
2015-03-18 03:03:52.569034 7f0bb77ac700 10 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] new_repop rep_tid 1010 (no op)
2015-03-18 03:03:52.569057 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] finish_promote new_snapset 2=[2]:[2]+head
2015-03-18 03:03:52.569070 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] finish_promote took lock on obc, rwstate(write n=1 w=0)
2015-03-18 03:03:52.569082 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] finish_ctx 151ebcaf/bar/head/test-rados-api-plana57-5515-37/94 0x716c200 op promote
2015-03-18 03:03:52.569097 7f0bb77ac700 10 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] set mtime to 2015-03-18 03:03:52.569034
2015-03-18 03:03:52.569121 7f0bb77ac700 10 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] final snapset 2=[2]:[2]+head in 151ebcaf/bar/head/test-rados-api-plana57-5515-37/94
2015-03-18 03:03:52.569144 7f0bb77ac700 20 osd.0 pg_epoch: 817 pg[94.7( v 817'2 (0'0,817'2] local-les=796 n=2 ec=795 les/c 796/796 795/795/795) [0,5] r=0 lpr=795 crt=0'0 lcod 810'1 mlcod 810'1 active+clean+scrubbing] finish_ctx extra_reqids [client.4866.0:114,6,client.4866.0:109,2]

ubuntu@teuthology:/a/samuelj-2015-03-17_21:37:11-rados-wip-sam-testing-distro-basic-multi/808722/remote/ceph-osd.0.log

Related to e26023eeb11808c7ef22c0201ac894ce71995501?


Related issues

Duplicated by Ceph - Bug #10832: FAILED assert(soid < scrubber.start || soid >= scrubber.end) Duplicate 02/11/2015
Blocked by Ceph - Bug #11139: "Exception: ceph-objectstore-tool: import failure" in upgrade:firefly-x-hammer-distro-basic-multi run Resolved 03/17/2015

Associated revisions

Revision 65bb4df5 (diff)
Added by Samuel Just about 9 years ago

ReplicatedPG::promote_object: check scrubber and block if necessary

Otherwise, we might attempt to promote into an in-progress scrub
interval causing 11156. I would have added a return value to
promote_object(), but could not find an existing user which
cared to distinguish the cases, even with a null op passed.
All existing users are in maybe_handle_cache. The ones which
pass a null op are for promoting the object in parallel
with a proxy -- a case where not actually performing the promote
does not really matter.

Fixes: #11156
Signed-off-by: Samuel Just <>
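
For illustration, a hedged, self-contained sketch of the approach this commit describes (simplified and invented names such as OpRef, waiting_for_scrub, and start_copy; not the verbatim Ceph code): before promoting, check whether the target falls inside the scrubber's active chunk and, if so, block the op (or skip the promote entirely when there is no op) rather than writing into the scrub interval.

#include <deque>
#include <memory>

struct hobject_t {
    unsigned hash = 0;
    bool operator<(const hobject_t& o) const { return hash < o.hash; }
    bool operator>=(const hobject_t& o) const { return !(*this < o); }
};

struct OpRequest {};
using OpRef = std::shared_ptr<OpRequest>;

struct PG {
    struct {
        bool active = false;
        hobject_t start, end;  // chunk currently being scrubbed, half-open
    } scrubber;

    std::deque<OpRef> waiting_for_scrub;  // requeued when the chunk finishes

    bool write_blocked_by_scrub(const hobject_t& soid) const {
        return scrubber.active && soid >= scrubber.start && soid < scrubber.end;
    }

    void promote_object(const hobject_t& soid, OpRef op) {
        if (write_blocked_by_scrub(soid)) {
            // Promoting now would write into the in-progress scrub
            // interval and trip the assert in finish_ctx. Block the op
            // until the chunk completes; null-op callers (promote in
            // parallel with a proxied read) can simply skip the promote.
            if (op)
                waiting_for_scrub.push_back(op);
            return;
        }
        start_copy(soid);  // kick off the actual copy-from-base promote
    }

    void start_copy(const hobject_t&) { /* elided */ }
};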

Revision 110c3549 (diff)
Added by Samuel Just about 9 years ago

ReplicatedPG::promote_object: check scrubber and block if necessary

Otherwise, we might attempt to promote into an in-progress scrub
interval causing 11156. I would have added a return value to
promote_object(), but could not find an existing user which
cared to distinguish the cases, even with a null op passed.
All existing users are in maybe_handle_cache. The ones which
pass a null op are for promoting the object in parallel
with a proxy -- a case where not actually performing the promote
does not really matter.

Fixes: #11156
Signed-off-by: Samuel Just <>
(cherry picked from commit 65bb4df599541cd2e0f195b905f24f529e255c00)

History

#1 Updated by Samuel Just about 9 years ago

Scrub got there first. The promote is the result of a read, which of course is not blocked behind the scrubber. Bother.

#2 Updated by Samuel Just about 9 years ago

  • Status changed from New to 12

#3 Updated by Sage Weil about 9 years ago

  • Status changed from 12 to Pending Backport
  • Source changed from other to Q/A
  • Backport set to firefly

#4 Updated by Loïc Dachary about 9 years ago

http://pulpito.ceph.com/loic-2015-03-25_17:37:52-upgrade:firefly-x:stress-split-erasure-code-hammer---basic-vps/820686/

Core was generated by `ceph-osd -f -i 11'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f5b4e76e20b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
37    ../nptl/sysdeps/unix/sysv/linux/pt-raise.c: No such file or directory.
(gdb) bt
#0  0x00007f5b4e76e20b in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/pt-raise.c:37
#1  0x000000000098302a in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2  handle_fatal_signal (signum=6) at global/signal_handler.cc:105
#3  <signal handler called>
#4  0x00007f5b4ce15cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#5  0x00007f5b4ce190d8 in __GI_abort () at abort.c:89
#6  0x00007f5b4d7206b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x00007f5b4d71e836 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007f5b4d71e863 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007f5b4d71eaa2 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x0000000000a66802 in ceph::__ceph_assert_fail (assertion=assertion@entry=0xb82a98 "soid < scrubber.start || soid >= scrubber.end", 
    file=file@entry=0xb7e3a0 "osd/ReplicatedPG.cc", line=line@entry=5320, 
    func=func@entry=0xb8a180 <ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)::__PRETTY_FUNCTION__> "void ReplicatedPG::finish_ctx(ReplicatedPG::OpContext*, int, bool)") at common/assert.cc:77
#11 0x00000000007d3ae6 in ReplicatedPG::finish_ctx (this=this@entry=0x32ba400, ctx=ctx@entry=0x46bea00, log_op_type=log_op_type@entry=8, 
    maintain_ssc=maintain_ssc@entry=true) at osd/ReplicatedPG.cc:5320
#12 0x00000000007d7c90 in ReplicatedPG::finish_promote (this=0x32ba400, r=<optimized out>, op=..., results=<optimized out>, obc=...)
    at osd/ReplicatedPG.cc:6025
#13 0x0000000000841e9c in PromoteCallback::finish (this=<optimized out>, results=...) at osd/ReplicatedPG.cc:1687
#14 0x0000000000816ca9 in GenContext<boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type> >::complete (this=0x49b2420, t=...) at ./include/Context.h:45
#15 0x00000000007ec55d in ReplicatedPG::process_copy_chunk (this=0x32ba400, oid=..., tid=tid@entry=926, r=r@entry=0)
    at osd/ReplicatedPG.cc:5711
#16 0x0000000000842b49 in C_Copyfrom::finish (this=0x4b87c60, r=0) at osd/ReplicatedPG.cc:5371
#17 0x0000000000655a19 in Context::complete (this=0x4b87c60, r=<optimized out>) at ./include/Context.h:64
#18 0x00000000009a60d8 in Finisher::finisher_thread_entry (this=0x2ffc340) at common/Finisher.cc:56
#19 0x00007f5b4e766182 in start_thread (arg=0x7f5b35885700) at pthread_create.c:312
#20 0x00007f5b4ced947d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

#5 Updated by Loïc Dachary about 9 years ago

Must be synchronized with https://github.com/ceph/ceph-qa-suite/pull/367, which reproduces it (though not every time; it probably needs a few runs).

#7 Updated by Loïc Dachary about 9 years ago

./virtualenv/bin/teuthology-suite --filter 'upgrade:firefly-x:stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/ec-cache-create.yaml 3-partial-upgrade/firsthalf.yaml 4-thrash/default.yaml 5-mon/mona.yaml 6-workload/ec-cache-agent.yaml 7-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-no-lrc.yaml distros/ubuntu_14.04.yaml}' --suite-branch wip-ec-cache-agent-firefly-backports --priority 101 --suite upgrade/firefly-x/stress-split-erasure-code --machine-type plana,burnupi,mira  --email loic@dachary.org --owner loic@dachary.org --ceph hammer

Because the rados task runs in the background, the test does not do what is expected.

#8 Updated by Loïc Dachary about 9 years ago

./virtualenv/bin/teuthology-suite --filter 'upgrade:firefly-x:stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/ec-cache-create.yaml 3-partial-upgrade/firsthalf.yaml 4-thrash/default.yaml 5-mon/mona.yaml 6-workload/ec-cache-agent.yaml 7-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-no-lrc.yaml distros/ubuntu_14.04.yaml}' --suite-branch wip-ec-cache-agent-firefly-backports --priority 101 --suite upgrade/firefly-x/stress-split-erasure-code --machine-type plana,burnupi,mira  --email loic@dachary.org --owner loic@dachary.org --ceph hammer

#9 Updated by Loïc Dachary about 9 years ago

Same teuthology job as above but on firefly instead of firefly-backports

./virtualenv/bin/teuthology-suite --num 5 --filter 'upgrade:firefly-x:stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/ec-cache-create.yaml 3-partial-upgrade/firsthalf.yaml 4-thrash/default.yaml 5-mon/mona.yaml 6-workload/ec-cache-agent.yaml 7-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-no-lrc.yaml distros/ubuntu_14.04.yaml}' --suite-branch wip-ec-cache-agent --priority 101 --suite upgrade/firefly-x/stress-split-erasure-code --machine-type plana,burnupi,mira  --email loic@dachary.org --owner loic@dachary.org --ceph hammer

#10 Updated by Loïc Dachary almost 9 years ago

Same teuthology job as above, on firefly and expecting success now that http://tracker.ceph.com/issues/11139 has been fixed

./virtualenv/bin/teuthology-suite --num 5 --filter 'upgrade:firefly-x:stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/ec-cache-create.yaml 3-partial-upgrade/firsthalf.yaml 4-thrash/default.yaml 5-mon/mona.yaml 6-workload/ec-cache-agent.yaml 7-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-no-lrc.yaml distros/ubuntu_14.04.yaml}' --suite-branch wip-ec-cache-agent --priority 101 --suite upgrade/firefly-x/stress-split-erasure-code --machine-type plana,burnupi,mira  --email loic@dachary.org --owner loic@dachary.org --ceph hammer

#11 Updated by Loïc Dachary almost 9 years ago

Back to testing firefly-backports, to verify it behaves in a split cluster being upgraded with a cache + erasure-coded pool.

./virtualenv/bin/teuthology-suite --num 5 --filter 'upgrade:firefly-x:stress-split-erasure-code/{0-cluster/start.yaml 1-firefly-install/firefly.yaml 2-workload/ec-cache-create.yaml 3-partial-upgrade/firsthalf.yaml 4-thrash/default.yaml 5-mon/mona.yaml 6-workload/ec-cache-agent.yaml 7-next-mon/monb.yaml 8-next-mon/monc.yaml 9-workload/ec-no-lrc.yaml distros/ubuntu_14.04.yaml}' --suite-branch wip-ec-cache-agent-firefly-backports --priority 101 --suite upgrade/firefly-x/stress-split-erasure-code --machine-type plana,burnupi,mira  --email loic@dachary.org --owner loic@dachary.org --ceph hammer

#12 Updated by Loïc Dachary almost 9 years ago

  • Status changed from Pending Backport to Resolved
