Bug #48745 (closed)

Segmentation fault in PrimaryLogPG::cancel_manifest_ops

Added by Neha Ojha over 3 years ago. Updated about 3 years ago.

Status: Resolved
Priority: Urgent
Assignee: Myoungwon Oh
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport: pacific
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID: 39217
Crash signature (v1):
Crash signature (v2):

Description

2020-12-19T02:08:00.720 INFO:tasks.ceph.osd.0.smithi104.stderr:*** Caught signal (Segmentation fault) **
2020-12-19T02:08:00.721 INFO:tasks.ceph.osd.0.smithi104.stderr: in thread 7efc3c710700 thread_name:tp_osd_tp
2020-12-19T02:08:00.723 INFO:teuthology.orchestra.run.smithi104.stderr:nodeep-scrub is set
2020-12-19T02:08:00.724 INFO:tasks.ceph.osd.0.smithi104.stderr: ceph version 16.0.0-8176-gc8682306 (c8682306c75836c231f2bd9f364a5f1c5a0c2247) pacific (dev)
2020-12-19T02:08:00.725 INFO:tasks.ceph.osd.0.smithi104.stderr: 1: /lib64/libpthread.so.0(+0x12dc0) [0x7efc6255fdc0]
2020-12-19T02:08:00.725 INFO:tasks.ceph.osd.0.smithi104.stderr: 2: (PrimaryLogPG::cancel_manifest_ops(bool, std::vector<unsigned long, std::allocator<unsigned long> >*)+0x71) [0x56316aebfa61]
2020-12-19T02:08:00.725 INFO:tasks.ceph.osd.0.smithi104.stderr: 3: (PrimaryLogPG::on_change(ceph::os::Transaction&)+0x18a) [0x56316aeee74a]
2020-12-19T02:08:00.725 INFO:tasks.ceph.osd.0.smithi104.stderr: 4: (PeeringState::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ceph::os::Transaction&)+0x869) [0x56316b0441c9]
2020-12-19T02:08:00.725 INFO:tasks.ceph.osd.0.smithi104.stderr: 5: (PeeringState::Reset::react(PeeringState::AdvMap const&)+0x293) [0x56316b060353]
2020-12-19T02:08:00.726 INFO:tasks.ceph.osd.0.smithi104.stderr: 6: (boost::statechart::simple_state<PeeringState::Reset, PeeringState::PeeringMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xf5) [0x56316b09d455]
2020-12-19T02:08:00.726 INFO:tasks.ceph.osd.0.smithi104.stderr: 7: (boost::statechart::state_machine<PeeringState::PeeringMachine, PeeringState::Initial, std::allocator<boost::statechart::none>, boost::statechart::null_exception_translator>::process_queued_events()+0xa7) [0x56316b086ef7]
2020-12-19T02:08:00.726 INFO:tasks.ceph.osd.0.smithi104.stderr: 8: (PeeringState::advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x269) [0x56316b040ac9]
2020-12-19T02:08:00.726 INFO:tasks.ceph.osd.0.smithi104.stderr: 9: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PeeringCtx&)+0x1e6) [0x56316ae759e6]
2020-12-19T02:08:00.727 INFO:tasks.ceph.osd.0.smithi104.stderr: 10: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PeeringCtx&)+0x303) [0x56316ade4523]
2020-12-19T02:08:00.727 INFO:tasks.ceph.osd.0.smithi104.stderr: 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xa4) [0x56316ade6674]
2020-12-19T02:08:00.727 INFO:tasks.ceph.osd.0.smithi104.stderr: 12: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x56316b025856]
2020-12-19T02:08:00.727 INFO:tasks.ceph.osd.0.smithi104.stderr: 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x56316add8848]
2020-12-19T02:08:00.727 INFO:tasks.ceph.osd.0.smithi104.stderr: 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x56316b41ab54]
2020-12-19T02:08:00.728 INFO:tasks.ceph.osd.0.smithi104.stderr: 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x56316b41d7f4]
2020-12-19T02:08:00.728 INFO:tasks.ceph.osd.0.smithi104.stderr: 16: /lib64/libpthread.so.0(+0x82de) [0x7efc625552de]
2020-12-19T02:08:00.728 INFO:tasks.ceph.osd.0.smithi104.stderr: 17: clone()

rados/thrash/{0-size-min-size-overrides/3-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-recovery} backoff/normal ceph clusters/{fixed-2 openstack} crc-failures/default d-balancer/on mon_election/connectivity msgr-failures/fastclose msgr/async-v2only objectstore/bluestore-comp-lz4 rados supported-random-distro$/{centos_8} thrashers/careful thrashosds-health workloads/rados_api_tests}

/a/jafaj-2020-12-15_06:47:10-rados-wip-jan-testing-2020-12-11-0930-distro-basic-smithi/5709664
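
One way a crash with this shape can arise (frames 2-4 show cancel_manifest_ops() being reached from on_change() during start_peering_interval): a cancel loop that erases entries from the very map of in-flight ops it is iterating invalidates its iterator. The sketch below is a minimal hypothetical illustration of that pattern and its safe counterpart; the types and names are stand-ins, not the actual Ceph code.

#include <map>
#include <memory>
#include <vector>

// Stand-in for the real in-flight manifest op state (hypothetical).
struct ManifestOpSketch {
  unsigned long objecter_tid = 0;
};

std::map<unsigned long, std::shared_ptr<ManifestOpSketch>> manifest_ops;

// DELIBERATELY BUGGY illustration: erase() invalidates 'p', so the loop's
// '++p' is undefined behavior and can segfault much like the trace above.
void cancel_ops_buggy(std::vector<unsigned long>* tids) {
  for (auto p = manifest_ops.begin(); p != manifest_ops.end(); ++p) {
    if (tids)
      tids->push_back(p->second->objecter_tid);  // collect tids for the caller
    manifest_ops.erase(p);                       // invalidates p
  }
}

// Safe counterpart: advance with the iterator that erase() returns (C++11).
void cancel_ops_safe(std::vector<unsigned long>* tids) {
  auto p = manifest_ops.begin();
  while (p != manifest_ops.end()) {
    if (tids)
      tids->push_back(p->second->objecter_tid);
    p = manifest_ops.erase(p);                   // erase and step in one go
  }
}

Whether this specific pattern is the culprit here would need the fix to confirm; it is offered only as the kind of defect that produces a segfault at this call site.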


Related issues 1 (0 open, 1 closed)

Copied to RADOS - Backport #49156: pacific: Segmentation fault in PrimaryLogPG::cancel_manifest_ops (Resolved, Neha Ojha)
Actions #1

Updated by Neha Ojha over 3 years ago

  • Priority changed from Normal to Urgent

/a/nojha-2021-01-07_00:06:49-rados-master-distro-basic-smithi/5761073

Actions #2

Updated by Neha Ojha over 3 years ago

Xie Xingguo/Myoungwon Oh: this seems to be a new regression in master; do you know what could have caused it? I don't see any recent changes in cancel_manifest_ops. Unfortunately, we do not have any logs yet since the jobs went dead.

Actions #3

Updated by Neha Ojha over 3 years ago

  • Assignee set to Myoungwon Oh

Myoungwon Oh: I am assigning this to you for more input.

Actions #4

Updated by Myoungwon Oh over 3 years ago

Hm... I can't find any clues in /a/nojha-2021-01-07_00\:06\:49-rados-master-distro-basic-smithi/5761073.
Can we reproduce this?
Also, https://github.com/ceph/ceph/pull/38576 was merged after /a/jafaj-2020-12-15_06:47:10-rados-wip-jan-testing-2020-12-11-0930-distro-basic-smithi/5709664 had run.
So the log above cannot be the log I should refer to.

Actions #5

Updated by Neha Ojha over 3 years ago

Myoungwon Oh wrote:

> Hm... I can't find any clues in /a/nojha-2021-01-07_00\:06\:49-rados-master-distro-basic-smithi/5761073.

Yeah, there are no logs because the job died.

> Can we reproduce this?

It has reproduced at least a couple of times.

> Also, https://github.com/ceph/ceph/pull/38576 was merged after /a/jafaj-2020-12-15_06:47:10-rados-wip-jan-testing-2020-12-11-0930-distro-basic-smithi/5709664 had run.

The runs attached here include https://github.com/ceph/ceph/pull/38576.

> So the log above cannot be the log I should refer to.

Not sure I understand what you mean.

Actions #6

Updated by Neha Ojha over 3 years ago

  • Backport set to pacific

rados/thrash/{0-size-min-size-overrides/2-size-2-min-size 1-pg-log-overrides/normal_pg_log 2-recovery-overrides/{more-async-partial-recovery} backoff/peering_and_degraded ceph clusters/{fixed-2 openstack} crc-failures/bad_map_crc_failure d-balancer/crush-compat mon_election/classic msgr-failures/osd-delay msgr/async objectstore/bluestore-hybrid rados supported-random-distro$/{ubuntu_latest} thrashers/default thrashosds-health workloads/rados_api_tests}

2021-01-31T13:58:13.895 INFO:tasks.ceph.osd.6.smithi122.stderr:2021-01-31T13:58:13.886+0000 7f0953530700 -1 osd.6 pg_epoch: 549 pg[10.d( v 548'4 (0'0,548'4] local-lis/les=536/538 n=2 ec=536/536 lis/c=536/536 les/c/f=538/538/0 sis=536) [6,4] r=0 lpr=536 crt=548'4 lcod 547'3 mlcod 547'3 active+clean+snaptrim trimq=[9~1] ps=[3~1,5~1,7~1]] removing snap head
2021-01-31T14:08:06.327 INFO:tasks.ceph.osd.6.smithi122.stderr: in thread 7f0953530700 thread_name:tp_osd_tp
2021-01-31T14:08:06.336 INFO:tasks.ceph.osd.6.smithi122.stderr:2021-01-31T14:08:06.334+0000 7f0953530700 -1 *** Caught signal (Segmentation fault) **

/a/teuthology-2021-01-31_02:31:01-rados-pacific-distro-basic-smithi/5843139 - dead job again

Actions #7

Updated by Myoungwon Oh over 3 years ago

Sorry, it seems that was my bad.

https://github.com/ceph/ceph/pull/39217
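
Aside (a hypothetical illustration, not the patch above): since on_change() can fire on every interval change, a cancellation path like this is typically hardened to be idempotent and null-safe, so that ops already torn down by an earlier pass are skipped rather than dereferenced. The names below are invented for the sketch.

#include <memory>
#include <vector>

// Invented stand-in for an in-flight op tracked by the PG (hypothetical).
struct InFlightOpSketch {
  unsigned long tid = 0;
};

void cancel_one(std::shared_ptr<InFlightOpSketch>& op,
                std::vector<unsigned long>* out_tids) {
  if (!op)
    return;                        // already canceled by an earlier on_change
  if (out_tids && op->tid)
    out_tids->push_back(op->tid);  // hand the tid back so the caller cancels it
  op.reset();                      // drop the reference exactly once
}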

Actions #8

Updated by Kefu Chai over 3 years ago

  • Status changed from New to Fix Under Review
  • Pull request ID set to 39217

Actions #9

Updated by Kefu Chai over 3 years ago

  • Status changed from Fix Under Review to Pending Backport

Actions #10

Updated by Backport Bot over 3 years ago

  • Copied to Backport #49156: pacific: Segmentation fault in PrimaryLogPG::cancel_manifest_ops added

Actions #11

Updated by Loïc Dachary about 3 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".
