Bug #10794

ceph_test_rados: incorrect buffer at pos ...

Added by Sage Weil about 9 years ago. Updated almost 8 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

2015-02-07T18:04:02.806 INFO:teuthology.orchestra.run.plana41.stderr:set osd.0 primary-affinity to 0.102194 (866972)
2015-02-07T18:04:04.055 INFO:tasks.rados.rados.0.plana41.stdout:2479:  expect (ObjNum 1482 snap 0 seq_num 1482)
2015-02-07T18:04:05.649 INFO:tasks.rados.rados.0.plana41.stdout:incorrect buffer at pos 2784445
2015-02-07T18:04:05.650 INFO:tasks.rados.rados.0.plana41.stderr:2479: oid 231 contents (ObjNum 1482 snap 0 seq_num 1482) corrupt
2015-02-07T18:04:05.673 INFO:tasks.rados.rados.0.plana41.stderr:./test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7f9ae3fff700 time 2015-02-07 18:04:05.648903
2015-02-07T18:04:05.673 INFO:tasks.rados.rados.0.plana41.stderr:./test/osd/RadosModel.h: 1114: FAILED assert(0)
2015-02-07T18:04:05.686 INFO:tasks.rados.rados.0.plana41.stderr: ceph version 0.92-960-g0b7106e (0b7106e217e878a5c6b1a3c5cd4b68afcca31f74)
2015-02-07T18:04:05.686 INFO:tasks.rados.rados.0.plana41.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0x4db60f]
2015-02-07T18:04:05.686 INFO:tasks.rados.rados.0.plana41.stderr: 2: (ReadOp::_finish(TestOp::CallbackInfo*)+0x161e) [0x4c78ee]
2015-02-07T18:04:05.686 INFO:tasks.rados.rados.0.plana41.stderr: 3: (()+0x8121d) [0x7f9b0031d21d]
2015-02-07T18:04:05.686 INFO:tasks.rados.rados.0.plana41.stderr: 4: (()+0x5ef59) [0x7f9b002faf59]
2015-02-07T18:04:05.687 INFO:tasks.rados.rados.0.plana41.stderr: 5: (()+0x124480) [0x7f9b003c0480]
2015-02-07T18:04:05.687 INFO:tasks.rados.rados.0.plana41.stderr: 6: (()+0x7e9a) [0x7f9affe6de9a]
2015-02-07T18:04:05.687 INFO:tasks.rados.rados.0.plana41.stderr: 7: (clone()+0x6d) [0x7f9afea333fd]
2015-02-07T18:04:05.687 INFO:tasks.rados.rados.0.plana41.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

ubuntu@teuthology:/a/sage-2015-02-07_04:24:21-rados-wip-sage-testing-distro-basic-multi/743469

Related issues

Related to Ceph - Feature #14877: ceph-objectstore-tool improvements Resolved 02/26/2016
Related to Ceph - Bug #13382: osd/PG.cc: 758: FAILED assert(info.history.same_interval_since == same_interval_since) Resolved 10/06/2015
Copied to Ceph - Backport #14979: hammer: ceph_test_rados: incorrect buffer at pos ... Resolved

Associated revisions

Revision 70316541 (diff)
Added by David Zafman almost 9 years ago

ceph-objectstore-tool, osd: Fix import handling

Use same_interval_since set to zero to communicate that an import has happened
and the correct value needs to be computed.

Fixes: #10794

Signed-off-by: David Zafman <>
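
To illustrate the approach this commit describes, here is a minimal, hypothetical sketch (the type and function names below are illustrative stand-ins, not the actual Ceph structures): zero acts as a sentinel in same_interval_since, telling the OSD on load that the PG was just imported and the value must be recomputed from the current maps.

#include <cstdint>
#include <iostream>

// Illustrative stand-in for the PG info field discussed in the commit message.
struct pg_info_sketch {
  uint32_t same_interval_since = 0;  // 0 == "imported, needs recomputation"
};

// ceph-objectstore-tool side of the idea: on import, clear the field.
void mark_imported(pg_info_sketch &info) {
  info.same_interval_since = 0;
}

// OSD side of the idea: when loading the PG, a zero value signals that the
// correct interval boundary must be rebuilt from the OSD maps.
void on_pg_load(pg_info_sketch &info, uint32_t recomputed_interval_start) {
  if (info.same_interval_since == 0)
    info.same_interval_since = recomputed_interval_start;
}

int main() {
  pg_info_sketch info;
  mark_imported(info);
  on_pg_load(info, 21);  // e.g. recomputed from the OSD maps after import
  std::cout << info.same_interval_since << std::endl;  // 21
  return 0;
}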

Revision c3f031a3 (diff)
Added by David Zafman about 8 years ago

ceph-objectstore-tool, osd: Fix import handling

Use same_interval_since set to zero to communicate that an import has happened
and the correct value needs to be computed.

Fixes: #10794

Signed-off-by: David Zafman <>
(cherry picked from commit 70316541bbb115d9a35954bfba373cf1dc084b7e)

Conflicts:
src/osd/PG.cc (trivial)

History

#1 Updated by Sage Weil about 9 years ago

  • Project changed from rbd to Ceph

#2 Updated by Samuel Just about 9 years ago

The replicas deep scrubbed successfully at the end of the run after ceph_test_rados crashed. Odd.

#3 Updated by Yuri Weinstein about 9 years ago

Also seen in run http://pulpito.ceph.com/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/
Job: 743717
Logs: http://qa-proxy.ceph.com/teuthology/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/

/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182726796-2015-02-08 13:58:15.665699 7ffcb0c0f700 -1 *** Caught signal (Aborted) **
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182726870- in thread 7ffcb0c0f700
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182726894-
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182726895: ceph version 0.80.8-22-g26966c4 (26966c4aa9cd79cb53db052553a5fc5653f2591b)
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182726971- 1: ceph-osd() [0x9bd9c1]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182726997- 2: (()+0xf710) [0x7ffcca749710]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727030- 3: (gsignal()+0x35) [0x7ffcc961b925]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727068- 4: (abort()+0x175) [0x7ffcc961d105]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727105- 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7ffcc9ed5a5d]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727175- 6: (()+0xbcbe6) [0x7ffcc9ed3be6]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727209- 7: (()+0xbcc13) [0x7ffcc9ed3c13]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727243- 8: (()+0xbcd0e) [0x7ffcc9ed3d0e]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727277- 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f2) [0xaa3ce2]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727369- 10: (PG::RecoveryState::Active::react(PG::MInfoRec const&)+0x309) [0x7a1839]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182727447- 11: (boost::statechart::detail::reaction_result boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl<boost::mpl::list<boost::statechart::custom_reaction<PG::QueryState>, boost::statechart::custom_reaction<PG::ActMap>, boost::statechart::custom_reaction<PG::AdvMap>, boost::statechart::custom_reaction<PG::MInfoRec>, boost::statechart::custom_reaction<PG::MNotifyRec>, boost::statechart::custom_reaction<PG::MLogRec>, boost::statechart::custom_reaction<PG::Backfilled>, boost::statechart::custom_reaction<PG::AllReplicasActivated>, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0> >(boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>&, boost::statechart::event_base const&, void const*)+0x243) [0x7e15e3]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182728628- 12: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x26) [0x7e1786]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182728868- 13: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x4b) [0x7cc53b]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729106- 14: (PG::handle_peering_event(std::tr1::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x32f) [0x791c5f]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729215- 15: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x30c) [0x64edfc]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729335- 16: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x16) [0x6a5126]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729451- 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0xae2ab1]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729519- 18: (ThreadPool::WorkThread::entry()+0x10) [0xae5b90]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729574- 19: (()+0x79d1) [0x7ffcca7419d1]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729608- 20: (clone()+0x6d) [0x7ffcc96d1b6d]
/a/teuthology-2015-02-07_17:18:02-upgrade:firefly-x-hammer-distro-basic-vps/743717/remote/vpm010/log/ceph-osd.7.log.gz:1182729645- NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#4 Updated by Samuel Just about 9 years ago

  • Status changed from New to Can't reproduce

#5 Updated by David Zafman almost 9 years ago

  • Status changed from Can't reproduce to 12
  • Regression set to No

#6 Updated by David Zafman almost 9 years ago

  • Assignee set to David Zafman

#7 Updated by David Zafman almost 9 years ago

An overwrite whose smallest offset was 590180 occurred right after a PG was moved using ceph-objectstore-tool. The incorrect buffer data was seen at 590180. I'm going to look into exactly which OSD received the PG and what happened when that OSD came back in.

$ grep "oid 358\|thrasher\|burnupi1865424-358\|incorrect buffer" teuthology.log
2015-05-11T16:56:13.572 INFO:tasks.thrashosds.thrasher:starting do_thrash
2015-05-11T16:56:13.572 INFO:tasks.thrashosds.thrasher:in_osds: [0, 1, 2, 3, 4, 5] out_osds: [] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:13.572 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:13.572 INFO:tasks.thrashosds.thrasher:Growing pool rbd
2015-05-11T16:56:19.746 INFO:tasks.thrashosds.thrasher:in_osds: [0, 1, 2, 3, 4, 5] out_osds: [] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:19.746 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:19.746 INFO:tasks.thrashosds.thrasher:Removing osd 0, in_osds are: [0, 1, 2, 3, 4, 5]
2015-05-11T16:56:25.069 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5] out_osds: [0] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:25.069 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:25.070 INFO:tasks.thrashosds.thrasher:Growing pool rbd
2015-05-11T16:56:30.322 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5] out_osds: [0] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:30.322 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:30.322 INFO:tasks.thrashosds.thrasher:inject_pause on 3
2015-05-11T16:56:30.322 INFO:tasks.thrashosds.thrasher:Testing filestore_inject_stall pause injection for duration 3
2015-05-11T16:56:30.323 INFO:tasks.thrashosds.thrasher:Checking after 0, should_be_down=False
2015-05-11T16:56:35.402 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5] out_osds: [0] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:35.402 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:35.402 INFO:tasks.thrashosds.thrasher:Setting osd 1 primary_affinity to 1.000000
2015-05-11T16:56:39.835 INFO:tasks.rados.rados.0.burnupi18.stdout:358: write initial oid 358
2015-05-11T16:56:39.860 INFO:tasks.rados.rados.0.burnupi18.stdout:358: writing burnupi1865424-358 from 748892 to 1374785 tid 1
2015-05-11T16:56:39.881 INFO:tasks.rados.rados.0.burnupi18.stdout:358: writing burnupi1865424-358 from 1859666 to 2328416 tid 2
2015-05-11T16:56:39.881 INFO:tasks.rados.rados.0.burnupi18.stdout:358: writing burnupi1865424-358 from 2348891 to 2348892 tid 3
2015-05-11T16:56:40.821 INFO:tasks.rados.rados.0.burnupi18.stdout:358: finishing write tid 1 to burnupi1865424-358
2015-05-11T16:56:40.822 INFO:tasks.rados.rados.0.burnupi18.stdout:358: finishing write tid 2 to burnupi1865424-358
2015-05-11T16:56:40.822 INFO:tasks.rados.rados.0.burnupi18.stdout:358: finishing write tid 3 to burnupi1865424-358
2015-05-11T16:56:40.822 INFO:tasks.rados.rados.0.burnupi18.stdout:358: finishing write tid 5 to burnupi1865424-358
2015-05-11T16:56:40.822 INFO:tasks.rados.rados.0.burnupi18.stdout:358: finishing write tid 6 to burnupi1865424-358
2015-05-11T16:56:40.822 INFO:tasks.rados.rados.0.burnupi18.stdout:update_object_version oid 358 v 105 (ObjNum 357 snap 0 seq_num 357) dirty exists
2015-05-11T16:56:40.822 INFO:tasks.rados.rados.0.burnupi18.stdout:358: left oid 358 (ObjNum 357 snap 0 seq_num 357)
2015-05-11T16:56:41.385 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5] out_osds: [0] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:48.510 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:48.510 INFO:tasks.thrashosds.thrasher:Reweighting osd 4 to 0.57259207371
2015-05-11T16:56:54.657 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5] out_osds: [0] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:56:54.657 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:56:54.658 INFO:tasks.thrashosds.thrasher:Adding osd 0
2015-05-11T16:56:55.155 INFO:tasks.thrashosds.thrasher:Added osd 0
2015-05-11T16:57:00.156 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 4, 5, 0] out_osds: [] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:57:00.156 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:00.156 INFO:tasks.thrashosds.thrasher:Removing osd 4, in_osds are: [1, 2, 3, 4, 5, 0]
2015-05-11T16:57:06.522 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0] out_osds: [4] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:57:06.522 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:06.522 INFO:tasks.thrashosds.thrasher:Killing osd 4, live_osds are [1, 0, 3, 2, 5, 4]
2015-05-11T16:57:12.523 INFO:tasks.thrashosds.thrasher:Testing ceph-objectstore-tool on down osd
2015-05-11T16:57:13.780 INFO:teuthology.orchestra.run.burnupi60.stderr:Read 349e6ec6/burnupi1865424-358/head//1
2015-05-11T16:57:14.242 INFO:teuthology.orchestra.run.burnupi60.stdout:remove 349e6ec6/burnupi1865424-358/head//1
2015-05-11T16:57:16.182 INFO:teuthology.orchestra.run.burnupi60.stdout:Write 349e6ec6/burnupi1865424-358/head//1
2015-05-11T16:57:18.278 INFO:tasks.rados.rados.0.burnupi18.stdout:542: write oid 358 current snap is 0
2015-05-11T16:57:18.299 INFO:tasks.rados.rados.0.burnupi18.stdout:542: writing burnupi1865424-358 from 590180 to 1067690 tid 1
2015-05-11T16:57:18.323 INFO:tasks.rados.rados.0.burnupi18.stdout:542: writing burnupi1865424-358 from 1525601 to 2058017 tid 2
2015-05-11T16:57:18.325 INFO:tasks.rados.rados.0.burnupi18.stdout:542: writing burnupi1865424-358 from 2190179 to 2190180 tid 3
2015-05-11T16:57:21.948 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0] out_osds: [4] dead_osds: [4] live_osds: [1, 0, 3, 2, 5]
2015-05-11T16:57:21.948 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:21.948 INFO:tasks.thrashosds.thrasher:Reviving osd 4
2015-05-11T16:57:22.544 INFO:tasks.rados.rados.0.burnupi18.stdout:542: finishing write tid 1 to burnupi1865424-358
2015-05-11T16:57:22.545 INFO:tasks.rados.rados.0.burnupi18.stdout:542: finishing write tid 2 to burnupi1865424-358
2015-05-11T16:57:22.545 INFO:tasks.rados.rados.0.burnupi18.stdout:542: finishing write tid 3 to burnupi1865424-358
2015-05-11T16:57:22.545 INFO:tasks.rados.rados.0.burnupi18.stdout:542: finishing write tid 5 to burnupi1865424-358
2015-05-11T16:57:22.546 INFO:tasks.rados.rados.0.burnupi18.stdout:542: finishing write tid 6 to burnupi1865424-358
2015-05-11T16:57:22.546 INFO:tasks.rados.rados.0.burnupi18.stdout:update_object_version oid 358 v 144 (ObjNum 523 snap 0 seq_num 523) dirty exists
2015-05-11T16:57:22.546 INFO:tasks.rados.rados.0.burnupi18.stdout:542: left oid 358 (ObjNum 523 snap 0 seq_num 523)
2015-05-11T16:57:32.103 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0] out_osds: [4] dead_osds: [] live_osds: [1, 0, 3, 2, 5, 4]
2015-05-11T16:57:32.104 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:32.104 INFO:tasks.thrashosds.thrasher:Killing osd 3, live_osds are [1, 0, 3, 2, 5, 4]
2015-05-11T16:57:38.105 INFO:tasks.thrashosds.thrasher:Testing ceph-objectstore-tool on down osd
2015-05-11T16:57:47.509 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0] out_osds: [4] dead_osds: [3] live_osds: [1, 0, 2, 5, 4]
2015-05-11T16:57:47.509 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:47.510 INFO:tasks.thrashosds.thrasher:fixing pg num pool rbd
2015-05-11T16:57:53.416 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0] out_osds: [4] dead_osds: [3] live_osds: [1, 0, 2, 5, 4]
2015-05-11T16:57:53.417 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:53.417 INFO:tasks.thrashosds.thrasher:Adding osd 4
2015-05-11T16:57:54.527 INFO:tasks.thrashosds.thrasher:Added osd 4
2015-05-11T16:57:59.527 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0, 4] out_osds: [] dead_osds: [3] live_osds: [1, 0, 2, 5, 4]
2015-05-11T16:57:59.527 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:57:59.528 INFO:tasks.thrashosds.thrasher:Killing osd 1, live_osds are [1, 0, 2, 5, 4]
2015-05-11T16:58:05.529 INFO:tasks.thrashosds.thrasher:Testing ceph-objectstore-tool on down osd
2015-05-11T16:58:06.408 INFO:tasks.thrashosds.thrasher:Moving pg 0.1b from osd.3 to osd.1
2015-05-11T16:58:06.408 INFO:tasks.thrashosds.thrasher:Transfer export file from to
2015-05-11T16:58:11.593 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0, 4] out_osds: [] dead_osds: [3, 1] live_osds: [0, 2, 5, 4]
2015-05-11T16:58:11.593 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:11.593 INFO:tasks.thrashosds.thrasher:Use ceph_objectstore_tool to remove past intervals
2015-05-11T16:58:17.078 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0, 4] out_osds: [] dead_osds: [3, 1] live_osds: [0, 2, 5, 4]
2015-05-11T16:58:17.078 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:17.078 INFO:tasks.thrashosds.thrasher:Killing osd 2, live_osds are [0, 2, 5, 4]
2015-05-11T16:58:23.079 INFO:tasks.thrashosds.thrasher:Testing ceph-objectstore-tool on down osd
2015-05-11T16:58:23.667 INFO:tasks.thrashosds.thrasher:Moving pg 1.7 from osd.1 to osd.2
2015-05-11T16:58:29.332 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0, 4] out_osds: [] dead_osds: [3, 1, 2] live_osds: [0, 5, 4]
2015-05-11T16:58:29.333 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:29.333 INFO:tasks.thrashosds.thrasher:inject_pause on 4
2015-05-11T16:58:29.333 INFO:tasks.thrashosds.thrasher:Testing filestore_inject_stall pause injection for duration 3
2015-05-11T16:58:29.333 INFO:tasks.thrashosds.thrasher:Checking after 0, should_be_down=False
2015-05-11T16:58:34.410 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0, 4] out_osds: [] dead_osds: [3, 1, 2] live_osds: [0, 5, 4]
2015-05-11T16:58:34.411 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:34.411 INFO:tasks.thrashosds.thrasher:Reweighting osd 1 to 0.184489830353
2015-05-11T16:58:39.722 INFO:tasks.thrashosds.thrasher:in_osds: [1, 2, 3, 5, 0, 4] out_osds: [] dead_osds: [3, 1, 2] live_osds: [0, 5, 4]
2015-05-11T16:58:39.722 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:39.723 INFO:tasks.thrashosds.thrasher:Removing osd 2, in_osds are: [1, 2, 3, 5, 0, 4]
2015-05-11T16:58:45.030 INFO:tasks.thrashosds.thrasher:in_osds: [1, 3, 5, 0, 4] out_osds: [2] dead_osds: [3, 1, 2] live_osds: [0, 5, 4]
2015-05-11T16:58:45.030 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:45.030 INFO:tasks.thrashosds.thrasher:Setting osd 3 primary_affinity to 1.000000
2015-05-11T16:58:50.389 INFO:tasks.thrashosds.thrasher:in_osds: [1, 3, 5, 0, 4] out_osds: [2] dead_osds: [3, 1, 2] live_osds: [0, 5, 4]
2015-05-11T16:58:50.389 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:50.389 INFO:tasks.thrashosds.thrasher:Setting osd 1 primary_affinity to 0.000000
2015-05-11T16:58:56.595 INFO:tasks.thrashosds.thrasher:in_osds: [1, 3, 5, 0, 4] out_osds: [2] dead_osds: [3, 1, 2] live_osds: [0, 5, 4]
2015-05-11T16:58:56.596 INFO:tasks.thrashosds.thrasher:choose_action: min_in 3 min_out 0 min_live 2 min_dead 0
2015-05-11T16:58:56.596 INFO:tasks.thrashosds.thrasher:Reviving osd 2
2015-05-11T16:59:06.753 INFO:tasks.thrashosds.thrasher:in_osds: [1, 3, 5, 0, 4] out_osds: [2] dead_osds: [3, 1] live_osds: [0, 5, 4, 2]
2015-05-11T16:59:06.753 INFO:tasks.thrashosds.thrasher:Reviving osd 1
2015-05-11T16:59:11.920 INFO:tasks.thrashosds.thrasher:Reviving osd 3
2015-05-11T16:59:13.409 INFO:tasks.rados.rados.0.burnupi18.stdout:639: read oid 358 snap -1
2015-05-11T16:59:16.346 INFO:tasks.rados.rados.0.burnupi18.stderr:639: oid 358 found incorrect object contents (ObjNum 357 snap 0 seq_num 357), expected (ObjNum 523 snap 0 seq_num 523)
2015-05-11T16:59:16.413 INFO:tasks.rados.rados.0.burnupi18.stdout:incorrect buffer at pos 590180
2015-05-11T16:59:16.413 INFO:tasks.rados.rados.0.burnupi18.stderr:639: oid 358 contents (ObjNum 357 snap 0 seq_num 357) corrupt

#8 Updated by David Zafman almost 9 years ago

pg 1.6 burnupi1865424-358

2015-05-11T16:56:39 [3,4] write burnupi1865424-358  10'102 - 10'105
2015-05-11T16:56:48.510 INFO:tasks.thrashosds.thrasher:Reweighting osd 4 to 0.57259207371              ??????Does this matter
2015-05-11T16:57:00.156 INFO:tasks.thrashosds.thrasher:Removing osd 4, in_osds are: [1, 2, 3, 4, 5, 0]
2015-05-11T16:57:01  down/out 4
2015-05-11 16:57:18.299634 7f4632e48700 10 osd.3 21 dequeue_op 0x50c7a00 prio 63 cost 477510 latency 0.004506 osd_op(client.4123.0:2623 burnupi1865424-358 [write 590180~477510] 1.349e6ec6 ack+ondisk+write+known_if_redirected e21) v5 pg pg[1.6( v 21'140 (0'0,21'140] local-les=21 n=35 ec=6 les/c 21/7 20/20/6) [3,2] r=0 lpr=20 pi=6-19/1 crt=14'131 lcod 21'139 mlcod 7'2 active+recovery_wait+degraded]
2015-05-11T16:57:18 [3,2] write burnupi1865424-358 21'141 - 21'143 partial overwrite at 590180 was the first write 21'141
2015-05-11T16:57:21.948 INFO:tasks.thrashosds.thrasher:Reviving osd 4
2015-05-11T16:57:21 up/in 4
2015-05-11T16:57:32.104 INFO:tasks.thrashosds.thrasher:Killing osd 3, live_osds are [1, 0, 3, 2, 5, 4]
2015-05-11T16:58:45.030 INFO:tasks.thrashosds.thrasher:Setting osd 3 primary_affinity to 1.000000      ?????Does this matter
2015-05-11T16:59:11.920 INFO:tasks.thrashosds.thrasher:Reviving osd 3
[4,3] push/pull burnupi1865424-358
osd.3 tosses its copy and takes osd.4's version, which doesn't include the write at 590180

Doesn't this show that the write to 590180 was divergent version 21'141? So either the test shouldn't have completed its write at 590180, or the read should have been held off until we could figure out the final state of the object.


Earlier write as indicated above:
2015-05-11 16:57:18.364551 7f4632e48700 10 osd.3 pg_epoch: 21 pg[1.6( v 21'141 (0'0,21'141] local-les=21 n=35 ec=6 les/c 21/7 20/20/6) [3,2] r=0 lpr=20 pi=6-19/1 luod=21'140 lua=21'140 crt=14'131 lcod 21'139 mlcod 7'2 active+recovery_wait+degraded] new_repop rep_tid 551 on osd_op(client.4123.0:2624 burnupi1865424-358 [write 1525601~532416] 1.349e6ec6 ack+ondisk+write+known_if_redirected e21) v5
2015-05-11 16:57:18.373898 7f9c658f9700 15 osd.2 21 enqueue_op 0x5325400 prio 127 cost 478171 latency 0.007255 osd_repop(client.4123.0:2623 1.6 349e6ec6/burnupi1865424-358/head//1 v 21'141) v1

2015-05-11 16:59:11 start osd.3

2015-05-11 16:59:12.112405 7f07b42b7980 20 read_log 21'141 (10'105) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2623 2015-05-11 16:57:18.295375
2015-05-11 16:59:12.112418 7f07b42b7980 20 read_log 21'142 (21'141) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2624 2015-05-11 16:57:18.319409

2015-05-11 16:59:12.410568 7f0797eb8700 10 merge_log log((15'136,42'142], crt=27'138) from osd.4 into log((0'0,21'144], crt=14'131)
2015-05-11 16:59:12.410631 7f0797eb8700 10 merge_log divergent 21'142 (21'141) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2624 2015-05-11 16:57:18.319409
2015-05-11 16:59:12.410634 7f0797eb8700 10 merge_log divergent 21'141 (10'105) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2623 2015-05-11 16:57:18.295375
2015-05-11 16:59:12.410757 7f0797eb8700 10 _merge_object_divergent_entries: merging hoid 349e6ec6/burnupi1865424-358/head//1 entries: 21'141 (10'105) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2623 2015-05-11 16:57:18.295375,21'142 (21'141) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2624 2015-05-11 16:57:18.319409,21'143 (21'142) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2625 2015-05-11 16:57:18.320686,21'144 (21'143) modify   349e6ec6/burnupi1865424-358/head//1 by client.4123.0:2626 2015-05-11 16:57:18.320728
2015-05-11 16:59:12.410770 7f0797eb8700 10 _merge_object_divergent_entries: hoid 349e6ec6/burnupi1865424-358/head//1 prior_version: 10'105 first_divergent_update: 21'141 last_divergent_update: 21'144
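
For readers following the merge_log output above, the sketch below (simplified stand-ins, not the real PGLog code) shows the shape of what _merge_object_divergent_entries reports: divergent entries are grouped per object, and the prior_version of the earliest divergent entry is what the object would be rolled back to if the authoritative log wins.

#include <map>
#include <vector>
#include <string>
#include <iostream>

struct EVersion { unsigned epoch = 0; unsigned version = 0; };
struct LogEntry {
  std::string hoid;        // object id, e.g. burnupi1865424-358
  EVersion version;        // version this entry produced
  EVersion prior_version;  // version the object had before this entry
};

int main() {
  // Local entries newer than the authoritative log's head are divergent,
  // as with 21'141 and 21'142 in the log above.
  std::vector<LogEntry> divergent = {
    {"burnupi1865424-358", {21, 141}, {10, 105}},
    {"burnupi1865424-358", {21, 142}, {21, 141}},
  };

  // Group divergent entries per object; the first entry's prior_version is
  // the state the object must fall back to.
  std::map<std::string, std::vector<LogEntry>> per_object;
  for (const auto &e : divergent)
    per_object[e.hoid].push_back(e);

  for (const auto &[hoid, entries] : per_object) {
    std::cout << hoid
              << " prior_version " << entries.front().prior_version.epoch
              << "'" << entries.front().prior_version.version
              << " first_divergent_update " << entries.front().version.epoch
              << "'" << entries.front().version.version
              << " last_divergent_update " << entries.back().version.epoch
              << "'" << entries.back().version.version << "\n";
  }
  return 0;
}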

#9 Updated by David Zafman almost 9 years ago

Per Sam Just:

2015-05-11 16:57:54.379075 7fc0eea28700 10 osd.4 pg_epoch: 26 pg[1.6( v 15'136 (0'0,15'136] local-les=7 n=35 ec=6 les/c 7/7 20/21/6) [4] r=0 lpr=26 pi=7-25/2 crt=14'131 lcod 0'0 mlcod 0'0 inactive NOTIFY] start_peering_interval: check_new_interval output: generate_past_intervals interval(21-25 up [3,2](3) acting [3,2](3)): not rw, up_thru 20 up_from 3 last_epoch_clean 7

If up_thru is 20 and the interval is 21-25, the PG cannot have gone active.
If up_thru is 20 and the interval is 20-25, the PG can have gone active.

osd.4 should have built past_intervals with 20-25, and since the older PG primary could have gone active in that interval, osd.4 would NOT have gone active and allowed the writes that caused the divergence.
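
A rough sketch of that reasoning (simplified, not the actual check_new_interval() code): for a past interval to possibly have gone active ("maybe went rw"), the interval's primary must have had up_thru reach the interval's first epoch, and have been up since before it.

#include <iostream>

struct Interval { unsigned first, last; };

// Simplified condition: the primary's up_thru must cover the interval's first
// epoch, and the primary must have been up since before the interval started.
bool maybe_went_rw(const Interval &i, unsigned primary_up_thru,
                   unsigned primary_up_from) {
  return primary_up_thru >= i.first && primary_up_from <= i.first;
}

int main() {
  unsigned up_thru = 20, up_from = 3;
  // As in the comment: interval 21-25 with up_thru 20 cannot have gone active,
  // while interval 20-25 with up_thru 20 can.
  std::cout << maybe_went_rw({21, 25}, up_thru, up_from) << "\n";  // 0
  std::cout << maybe_went_rw({20, 25}, up_thru, up_from) << "\n";  // 1
  return 0;
}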

#10 Updated by David Zafman almost 9 years ago

  • Status changed from 12 to 7

#11 Updated by David Zafman almost 9 years ago

  • Status changed from 7 to Fix Under Review

#12 Updated by David Zafman almost 9 years ago

  • Status changed from Fix Under Review to Resolved

#14 Updated by Abhishek Varshney over 8 years ago

Should this be backported to hammer? It may be required before http://tracker.ceph.com/issues/13534 can be backported. Please confirm.

#15 Updated by Loïc Dachary over 8 years ago

it may be required before http://tracker.ceph.com/issues/13534 can be backported.

Why would it be required?

#16 Updated by Loïc Dachary about 8 years ago

#17 Updated by Loïc Dachary about 8 years ago

  • Status changed from Resolved to Pending Backport
  • Backport set to hammer

#18 Updated by Loïc Dachary about 8 years ago

  • Copied to Backport #14979: hammer: ceph_test_rados: incorrect buffer at pos ... added

#19 Updated by David Zafman about 8 years ago

  • Related to Bug #13382: osd/PG.cc: 758: FAILED assert(info.history.same_interval_since == same_interval_since) added

#20 Updated by David Zafman almost 8 years ago

  • Status changed from Pending Backport to Resolved
