Bug #48745 (Closed): Segmentation fault in PrimaryLogPG::cancel_manifest_ops
Description
2020-12-19T02:08:00.720 INFO:tasks.ceph.osd.0.smithi104.stderr:*** Caught signal (Segmentation fault) **
 in thread 7efc3c710700 thread_name:tp_osd_tp
2020-12-19T02:08:00.723 INFO:teuthology.orchestra.run.smithi104.stderr:nodeep-scrub is set
 ceph version 16.0.0-8176-gc8682306 (c8682306c75836c231f2bd9f364a5f1c5a0c2247) pacific (dev)
 1: /lib64/libpthread.so.0(+0x12dc0) [0x7efc6255fdc0]
 2: (PrimaryLogPG::cancel_manifest_ops(bool, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x71) [0x56316aebfa61]
 3: (PrimaryLogPG::on_change(ceph::os::Transaction&)+0x18a) [0x56316aeee74a]
 4: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0x869) [0x56316b0441c9]
 5: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x293) [0x56316b060353]
 6: (boost::statechart::simple_state<PeeringState::Reset, PeeringState::PeeringMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xf5) [0x56316b09d455]
 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xa7) [0x56316b086ef7]
 8: (PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x269) [0x56316b040ac9]
 9: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x1e6) [0x56316ae759e6]
 10: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x303) [0x56316ade4523]
 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa4) [0x56316ade6674]
 12: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x56316b025856]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x56316add8848]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56316b41ab54]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56316b41d7f4]
 16: /lib64/libpthread.so.0(+0x82de) [0x7efc625552de]
 17: clone()
rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-recovery} backoff/normal ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/fastclose msgr/async-v2only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8} thrashers/careful thrashosds-health workloads/rados_api_tests}
/a/jafaj-2020-12-15_06:47:10-rados-wip-jan-testing-2020-12-11-0930-distro-basic-smithi/5709664
Updated by Neha Ojha over 3 years ago
- Priority changed from Normal to Urgent
/a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5761073
Updated by Neha Ojha over 3 years ago
Xie Xingguo/Myoungwon Oh: this seems to be a new regression in master; do you know what could have caused it? I don't see any recent changes in cancel_manifest_ops. Unfortunately, we do not have any logs yet since the jobs went dead.
Updated by Neha Ojha over 3 years ago
- Assignee set to Myoungwon Oh
Myoungwon Oh: I am assigning this to you for more inputs.
Updated by Myoungwon Oh over 3 years ago
Hm... I can't find any clues in /a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5761073.
Can we reproduce this?
Also, https://github.com/ceph/ceph/pull/38576 was merged after /a/jafaj-2020-12-15_06:47:10-rados-wip-jan-testing-2020-12-11-0930-distro-basic-smithi/5709664 had run.
So the log above cannot be the log I should refer to.
Updated by Neha Ojha over 3 years ago
Myoungwon Oh wrote:

> Hm... I can't find any clues in /a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5761073.

yeah, there are no logs because the job died

> Can we reproduce this?

It has reproduced at least a couple of times.

> Also, https://github.com/ceph/ceph/pull/38576 was merged after /a/jafaj-2020-12-15_06:47:10-rados-wip-jan-testing-2020-12-11-0930-distro-basic-smithi/5709664 had run.

The runs attached here include https://github.com/ceph/ceph/pull/38576.

> So the log above cannot be the log I should refer to.

not sure I understand what you mean
Updated by Neha Ojha over 3 years ago
- Backport set to pacific
rados/thrash/{0-size-min-size-overrides/2-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-partial-recovery} backoff/peering_and_degraded ceph clusters/{fixed-2 openstack} crc-failures/bad_map_crc_failure d-balancer/crush-compat mon_election/classic msgr-failures/osd-delay msgr/async objectstore/bluestore-hybrid rados supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/rados_api_tests}
2021-01-31T13:58:13.895 INFO:tasks.ceph.osd.6.smithi122.stderr:2021-01-31T13:58:13.886+0000 7f0953530700 -1 osd.6 pg_epoch: 549 pg[10.d( v 548'4 (0'0,548'4] local-lis/les=536/538 n=2 ec=536/536 lis/c=536/536 les/c/f=538/538/0 sis=536) [6,4] r=0 lpr=536 crt=548'4 lcod 547'3 mlcod 547'3 active+clean+snaptrim trimq=[9~1] ps=[3~1,5~1,7~1]] removing snap head
2021-01-31T14:08:06.327 INFO:tasks.ceph.osd.6.smithi122.stderr: in thread 7f0953530700 thread_name:tp_osd_tp
2021-01-31T14:08:06.336 INFO:tasks.ceph.osd.6.smithi122.stderr:2021-01-31T14:08:06.334+0000 7f0953530700 -1 *** Caught signal (Segmentation fault) **
/a/teuthology-2021-01-31_02:31:01-rados-pacific-distro-basic-smithi/5843139 - dead job again
Updated by Myoungwon Oh over 3 years ago
Sorry, it seems that was my bad.
Updated by Kefu Chai over 3 years ago
- Status changed from New to Fix Under Review
- Pull request ID set to 39217
Updated by Kefu Chai over 3 years ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot over 3 years ago
- Copied to Backport #49156: pacific: Segmentation fault in PrimaryLogPG::cancel_manifest_ops added
Updated by Loïc Dachary about 3 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".