Bug #44022

closed

mimic: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash

Added by Neha Ojha about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):

de9b47e4c55bb576c605c46fa90d9fda24f77c381514888ba063e0fd01b350ec

Crash signature (v2):

Description

The crash happens on a mimic OSD. Telemetry crash reports have been reporting similar crashes in 14.2.4 (which may or may not be related).

2020-02-06T03:33:51.194 INFO:tasks.ceph.osd.11.smithi158.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8-57-g530fb22/rpm/el7/BUILD/ceph-13.2.8-57-g530fb22/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f60aafd9700 time 2020-02-06 03:33:51.135848
2020-02-06T03:33:51.195 INFO:tasks.ceph.osd.11.smithi158.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8-57-g530fb22/rpm/el7/BUILD/ceph-13.2.8-57-g530fb22/src/osd/PG.cc: 6665: FAILED assert(0 == "we got a bad state machine event")
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: ceph version 13.2.8-57-g530fb22 (530fb2279cf8639e8ecff9ab4891acb72dabbf09) mimic (stable)
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f60cdd6ba6b]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 2: (()+0x26fbf7) [0x7f60cdd6bbf7]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 3: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xa5) [0x55fe690ad6b5]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 4: (()+0x489066) [0x55fe690f0066]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 5: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x343) [0x55fe6912d1f3]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x8c) [0x55fe6912736c]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 7: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x9a) [0x55fe6912b0ca]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55fe69109ceb]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x55fe690ef9f3]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x55fe6901e7df]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55fe692959f0]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x55fe69033802]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f60cdd717a3]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f60cdd72390]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 15: (()+0x7e65) [0x7f60cace2e65]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 16: (clone()+0x6d) [0x7f60c9dd288d]

/a/nojha-2020-02-06_01:27:32-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4736605/

upgrade:mimic-x:stress-split/{0-cluster/{openstack.yaml start.yaml} 1-ceph-install/mimic.yaml 1.1-pg-log-overrides/short_pg_log.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-workload/{radosbench.yaml rbd-cls.yaml rbd-import-export.yaml rbd_api.yaml readwrite.yaml rgw_ragweed_prepare.yaml snaps-few-objects.yaml} 5-finish-upgrade.yaml 6-msgr2.yaml 6-nautilus.yaml 7-final-workload/{rbd-python.yaml rgw-swift-ragweed_check.yaml snaps-many-objects.yaml} debug_upgrade.yaml objectstore/bluestore-bitmap.yaml supported-all-distro/centos_latest.yaml thrashosds-health.yaml}

Actions #1

Updated by Neha Ojha about 4 years ago

-10000> 2020-02-06 03:33:51.134 7f60aafd9700  5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started/Primary 1.613785 0 0.000000
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] clear_primary_state
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] release_backoffs [37:5c000000::::0,37:60000000::::head)
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 20 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] agent_stop
-10000> 2020-02-06 03:33:51.134 7f60aafd9700  5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started 1.613899 0 0.000000
-10000> 2020-02-06 03:33:51.134 7f60aafd9700  5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Crashed
Actions #2

Updated by Neha Ojha about 4 years ago

For whatever reason we do not have complete osd logs for this, but from nojha-2020-02-06_01:27:32-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4736605/remote/smithi158/crash/posted/2020-02-06_03:33:51.204745Z_e39e2a69-e5a1-483b-926b-55d71896e52c/log the problem seems to be the following:

-10000> 2020-02-06 03:33:50.534 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 2718 epoch_requested: 2718 MNotifyRec 37.3a from 3 notify: (query:2718 sent:2718 37.3a( empty local-lis/les=2578/2581 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704)) features: 0x3ffddff8ffacfffb ([2578,2703] intervals=([2642,2703] acting 1)) +create_info
-10000> 2020-02-06 03:33:50.534 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}]  got dup osd.3 info 37.3a( empty local-lis/les=2578/2581 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704), identical to ours

We had received an MNotifyRec while in Started/Primary/Peering/GetInfo just before this, and then received an MLogRec, which isn't expected in GetInfo; thus we see:

-10000> 2020-02-06 03:33:51.134 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 2717 epoch_requested: 2717 MLogRec from 1 +create_info
-10000> 2020-02-06 03:33:51.134 7f60aafd9700  5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetInfo 1.613711 8 0.000521
Actions #3

Updated by Neha Ojha about 4 years ago

  • Subject changed from upgrade/mimic-x: FAILED assert(0 == "we got a bad state machine event") to nautilus: upgrade/mimic-x: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
  • Priority changed from Normal to High
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 op_wq(1) _process OpQueueItem(36.19 PGPeeringEvent(epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info) prio 255 cost 10 e4115) queued
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 op_wq(1) _process 36.19 to_process <OpQueueItem(36.19 PGPeeringEvent(epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info) prio 255 cost 10 e4115)> waiting <> waiting_peering {}
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 op_wq(1) _process OpQueueItem(36.19 PGPeeringEvent(epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info) prio 255 cost 10 e4115) pg 0x55607d65d000
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetInfo 30.389177 13 0.000265
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 30.389191 0 0.000000
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started/Primary 30.389201 0 0.000000
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] clear_primary_state
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] release_backoffs [36:98000000::::0,36:9c000000::::head)
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] agent_stop
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started 30.389244 0 0.000000
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] enter Crashed

/a/nojha-2020-02-26_18:17:32-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4805548/

Actions #4

Updated by Neha Ojha about 4 years ago

On osd.8 (mimic):

This is when we request the log from osd.6:

2020-02-26 21:34:51.436 7fe1a5fb0700  5 osd.8 pg_epoch: 4100 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] enter Started/Primary/Peering/GetLog
2020-02-26 21:34:51.436 7fe1a5fb0700 10 osd.8 pg_epoch: 4100 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering/GetLog>:  requesting log from osd.6

A map change makes us exit Peering/GetLog:

2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering>: Peering advmap
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] affected_by_map osd.1 now up
2020-02-26 21:35:38.307 7fe1a9fb8700  1 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2020-02-26 21:35:38.307 7fe1a9fb8700  5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] exit Started/Primary/Peering/GetLog 46.870220 57 0.000946
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2020-02-26 21:35:38.307 7fe1a9fb8700  5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] exit Started/Primary/Peering 112.040719 0 0.000000
2020-02-26 21:35:38.307 7fe1a9fb8700  5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating mbc={}] exit Started/Primary 112.040728 0 0.000000

Then we get into GetInfo:

2020-02-26 21:35:38.307 7fe1a9fb8700  5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] enter Started/Primary/Peering/GetInfo
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] build_prior all_probe 1,6,8
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>:  querying info from osd.1
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>:  querying info from osd.6

and end up crashing when the response to the earlier GetLog arrives:

2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetInfo 30.389177 13 0.000265
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2020-02-26 21:36:08.696 7fe1a9fb8700  5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 30.389191 0 0.000000

On osd.6 (nautilus):

2020-02-26 21:36:08.695 7f724e879700 10 osd.6 pg_epoch: 4115 pg[36.19( empty local-lis/les=3044/3127 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=1 lpr=4067 pi=[3044,4067)/1 crt=0'0 unknown NOTIFY mbc={}] log request from 8
2020-02-26 21:36:08.695 7f724e879700 10 osd.6 pg_epoch: 4115 pg[36.19( empty local-lis/les=3044/3127 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=1 lpr=4067 pi=[3044,4067)/1 crt=0'0 unknown NOTIFY mbc={}]  sending info+missing+log since 0'0
2020-02-26 21:36:08.695 7f724e879700 10 osd.6 pg_epoch: 4115 pg[36.19( empty local-lis/les=3044/3127 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=1 lpr=4067 pi=[3044,4067)/1 crt=0'0 unknown NOTIFY mbc={}]  sending log((0'0,0'0], crt=0'0) missing(0 may_include_deletes = 1)
Actions #5

Updated by Neha Ojha about 4 years ago

The part that I don't understand is that when osd.6 responded, the epoch_sent/epoch_requested (4115/4100) seemed correct:

2020-02-26 21:36:08.695 7f724e879700  1 -- [v2:172.21.15.92:6810/1012620,v1:172.21.15.92:6811/1012620] --> v1:172.21.15.80:6805/341918 -- PGlog(36.19 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e4115/4100) v5 -- 0x562106cdd000 con 0x562103651400
2020-02-26 21:36:08.697 7f7270baf700 10 --1- [v2:172.21.15.92:6810/1012620,v1:172.21.15.92:6811/1012620] >> v1:172.21.15.80:6805/341918 conn(0x562103651400 0x56210e32a000 :6811 s=OPENED pgs=9 cs=1 l=0).handle_tag_ack got ack seq 180 >= 179 on 0x562106cdd000 PGlog(36.19 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e4115/4100) v5

On osd.8, we see e4115/4115:

2020-02-26 21:36:08.696 7fe1c8eb9700  1 -- 172.21.15.80:6805/341918 <== osd.6 172.21.15.92:6811/1012620 179 ==== PGlog(36.19 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e4115/4115) v5 ==== 1052+0+0 (2777492280 0 0) 0x55607d79e700 con 0x55607b932000

Since lpr=4114, this does not qualify as an old_peering_msg.

Actions #6

Updated by Neha Ojha about 4 years ago

  • Subject changed from nautilus: upgrade/mimic-x: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash to Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
  • Status changed from New to Fix Under Review
  • Assignee set to Neha Ojha
  • Backport set to nautilus
  • Pull request ID set to 33590

In any case, https://github.com/ceph/ceph/pull/33590 will just prevent it from crashing.

Actions #7

Updated by Sage Weil about 4 years ago

Ah, this is why: 168e20ab8b8da3a5aed41b73f9627d10971be67b

commit 168e20ab8b8da3a5aed41b73f9627d10971be67b (refs/remotes/pr/22501, refs/remotes/one/wip-pglog-upgrade, refs/remotes/me/wip-pglog-upgrade)
Author: Sage Weil <sweil@redhat.com>
Date:   Sun Jun 10 21:41:54 2018 -0500

    messages/MOSDPGLog: encode epoch for query_epoch for pre-nautilus OSDs

    This fix is analogous to d5a3ae6008e4399b336f3a72d8dee328f97a0002.  As of
    84a1984b7cb70c74eaf517ffc424228fa954ed37 we set query_epoch to a lower
    value reliably, but we need to encode the latest epoch for pre-nautilus
    OSDs because they do not set last_peering_epoch reliably and having a lower
    query_epoch will cause them to discard this message.

    Signed-off-by: Sage Weil <sage@redhat.com>

diff --git a/src/messages/MOSDPGLog.h b/src/messages/MOSDPGLog.h
index 7138bbf08f0..2ae27f3a589 100644
--- a/src/messages/MOSDPGLog.h
+++ b/src/messages/MOSDPGLog.h
@@ -96,7 +96,12 @@ public:
     encode(info, payload);
     encode(log, payload);
     encode(missing, payload);
-    encode(query_epoch, payload);
+    if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
+      // pre-nautilus OSDs do not set last_peering_reset properly
+      encode(epoch, payload);
+    } else {
+      encode(query_epoch, payload);
+    }
     encode(past_intervals, payload);
     encode(to, payload);
     encode(from, payload);

I think if we want to fix this, we should make this state machine change only in mimic?

Actions #8

Updated by Neha Ojha about 4 years ago

  • Subject changed from Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash to mimic: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
  • Status changed from Fix Under Review to In Progress
  • Backport deleted (nautilus)
Actions #9

Updated by Neha Ojha about 4 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID changed from 33590 to 33594
Actions #11

Updated by Nathan Cutler about 4 years ago

  • Status changed from Fix Under Review to Resolved