Bug #44022
mimic: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
de9b47e4c55bb576c605c46fa90d9fda24f77c381514888ba063e0fd01b350ec
Description
The crash happens on a mimic OSD. Telemetry crash reports have been reporting similar crashes in 14.2.4 (which may or may not be related).
2020-02-06T03:33:51.194 INFO:tasks.ceph.osd.11.smithi158.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8-57-g530fb22/rpm/el7/BUILD/ceph-13.2.8-57-g530fb22/src/osd/PG.cc: In function 'PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine>::my_context)' thread 7f60aafd9700 time 2020-02-06 03:33:51.135848
2020-02-06T03:33:51.195 INFO:tasks.ceph.osd.11.smithi158.stderr:/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.8-57-g530fb22/rpm/el7/BUILD/ceph-13.2.8-57-g530fb22/src/osd/PG.cc: 6665: FAILED assert(0 == "we got a bad state machine event")
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: ceph version 13.2.8-57-g530fb22 (530fb2279cf8639e8ecff9ab4891acb72dabbf09) mimic (stable)
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14b) [0x7f60cdd6ba6b]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 2: (()+0x26fbf7) [0x7f60cdd6bbf7]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 3: (PG::RecoveryState::Crashed::Crashed(boost::statechart::state<PG::RecoveryState::Crashed, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::my_context)+0xa5) [0x55fe690ad6b5]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 4: (()+0x489066) [0x55fe690f0066]
2020-02-06T03:33:51.201 INFO:tasks.ceph.osd.11.smithi158.stderr: 5: (boost::statechart::simple_state<PG::RecoveryState::Primary, PG::RecoveryState::Started, PG::RecoveryState::Peering, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x343) [0x55fe6912d1f3]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 6: (boost::statechart::simple_state<PG::RecoveryState::Peering, PG::RecoveryState::Primary, PG::RecoveryState::GetInfo, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x8c) [0x55fe6912736c]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 7: (boost::statechart::simple_state<PG::RecoveryState::GetInfo, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x9a) [0x55fe6912b0ca]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x6b) [0x55fe69109ceb]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x55fe690ef9f3]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf) [0x55fe6901e7df]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 11: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x55fe692959f0]
2020-02-06T03:33:51.202 INFO:tasks.ceph.osd.11.smithi158.stderr: 12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x55fe69033802]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f60cdd717a3]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f60cdd72390]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 15: (()+0x7e65) [0x7f60cace2e65]
2020-02-06T03:33:51.203 INFO:tasks.ceph.osd.11.smithi158.stderr: 16: (clone()+0x6d) [0x7f60c9dd288d]
/a/nojha-2020-02-06_01:27:32-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4736605/
upgrade:mimic-x:stress-split/{0-cluster/{openstack.yaml start.yaml} 1-ceph-install/mimic.yaml 1.1-pg-log-overrides/short_pg_log.yaml 2-partial-upgrade/firsthalf.yaml 3-thrash/default.yaml 4-workload/{radosbench.yaml rbd-cls.yaml rbd-import-export.yaml rbd_api.yaml readwrite.yaml rgw_ragweed_prepare.yaml snaps-few-objects.yaml} 5-finish-upgrade.yaml 6-msgr2.yaml 6-nautilus.yaml 7-final-workload/{rbd-python.yaml rgw-swift-ragweed_check.yaml snaps-many-objects.yaml} debug_upgrade.yaml objectstore/bluestore-bitmap.yaml supported-all-distro/centos_latest.yaml thrashosds-health.yaml}
History
#1 Updated by Neha Ojha over 3 years ago
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started/Primary 1.613785 0 0.000000
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] clear_primary_state
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] release_backoffs [37:5c000000::::0,37:60000000::::head)
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 20 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] agent_stop
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started 1.613899 0 0.000000
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 unknown mbc={}] enter Crashed
#2 Updated by Neha Ojha over 3 years ago
For whatever reason we do not have complete osd logs for this, but from nojha-2020-02-06_01:27:32-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4736605/remote/smithi158/crash/posted/2020-02-06_03:33:51.204745Z_e39e2a69-e5a1-483b-926b-55d71896e52c/log the problem seems to be the following:
-10000> 2020-02-06 03:33:50.534 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 2718 epoch_requested: 2718 MNotifyRec 37.3a from 3 notify: (query:2718 sent:2718 37.3a( empty local-lis/les=2578/2581 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704)) features: 0x3ffddff8ffacfffb ([2578,2703] intervals=([2642,2703] acting 1)) +create_info
-10000> 2020-02-06 03:33:50.534 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] got dup osd.3 info 37.3a( empty local-lis/les=2578/2581 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704), identical to ours
We had received an MNotifyRec while in Started/Primary/Peering/GetInfo just before this, and then received an MLogRec, which isn't expected in GetInfo, so we see:
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 10 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 2717 epoch_requested: 2717 MLogRec from 1 +create_info
-10000> 2020-02-06 03:33:51.134 7f60aafd9700 5 osd.11 pg_epoch: 2718 pg[37.3a( empty local-lis/les=0/0 n=0 ec=2088/2088 lis/c 2642/2578 les/c/f 2643/2584/0 2704/2704/2704) [11,1] r=0 lpr=2717 pi=[2578,2704)/1 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetInfo 1.613711 8 0.000521
#3 Updated by Neha Ojha over 3 years ago
- Subject changed from upgrade/mimic-x: FAILED assert(0 == "we got a bad state machine event") to nautilus: upgrade/mimic-x: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
- Priority changed from Normal to High
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 op_wq(1) _process OpQueueItem(36.19 PGPeeringEvent(epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info) prio 255 cost 10 e4115) queued
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 op_wq(1) _process 36.19 to_process <OpQueueItem(36.19 PGPeeringEvent(epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info) prio 255 cost 10 e4115)> waiting <> waiting_peering {}
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 op_wq(1) _process OpQueueItem(36.19 PGPeeringEvent(epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info) prio 255 cost 10 e4115) pg 0x55607d65d000
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetInfo 30.389177 13 0.000265
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 30.389191 0 0.000000
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started/Primary 30.389201 0 0.000000
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] clear_primary_state
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] release_backoffs [36:98000000::::0,36:9c000000::::head)
2020-02-26 21:36:08.696 7fe1a9fb8700 20 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] agent_stop
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] exit Started 30.389244 0 0.000000
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 unknown mbc={}] enter Crashed
/a/nojha-2020-02-26_18:17:32-upgrade:mimic-x:stress-split-nautilus-distro-basic-smithi/4805548/
#4 Updated by Neha Ojha over 3 years ago
On osd.8 (mimic):
This is when we request the log from osd.6:
2020-02-26 21:34:51.436 7fe1a5fb0700 5 osd.8 pg_epoch: 4100 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] enter Started/Primary/Peering/GetLog
2020-02-26 21:34:51.436 7fe1a5fb0700 10 osd.8 pg_epoch: 4100 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering/GetLog>: requesting log from osd.6
A map change makes us exit Peering/GetLog:
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering>: Peering advmap
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] affected_by_map osd.1 now up
2020-02-26 21:35:38.307 7fe1a9fb8700 1 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2020-02-26 21:35:38.307 7fe1a9fb8700 5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] exit Started/Primary/Peering/GetLog 46.870220 57 0.000946
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2020-02-26 21:35:38.307 7fe1a9fb8700 5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating+peering mbc={}] exit Started/Primary/Peering 112.040719 0 0.000000
2020-02-26 21:35:38.307 7fe1a9fb8700 5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4067 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 creating mbc={}] exit Started/Primary 112.040728 0 0.000000
Then we get into GetInfo:
2020-02-26 21:35:38.307 7fe1a9fb8700 5 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] enter Started/Primary/Peering/GetInfo
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] build_prior all_probe 1,6,8
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>: querying info from osd.1
2020-02-26 21:35:38.307 7fe1a9fb8700 10 osd.8 pg_epoch: 4114 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetInfo>: querying info from osd.6
and end up crashing when the response to the earlier GetLog arrives:
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 4115 epoch_requested: 4115 MLogRec from 6 +create_info
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering/GetInfo 30.389177 13 0.000265
2020-02-26 21:36:08.696 7fe1a9fb8700 10 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering>: Leaving Peering
2020-02-26 21:36:08.696 7fe1a9fb8700 5 osd.8 pg_epoch: 4116 pg[36.19( empty local-lis/les=0/0 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=0 lpr=4114 pi=[2336,4067)/3 crt=0'0 mlcod 0'0 peering mbc={}] exit Started/Primary/Peering 30.389191 0 0.000000
On osd.6 (nautilus):
2020-02-26 21:36:08.695 7f724e879700 10 osd.6 pg_epoch: 4115 pg[36.19( empty local-lis/les=3044/3127 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=1 lpr=4067 pi=[3044,4067)/1 crt=0'0 unknown NOTIFY mbc={}] log request from 8
2020-02-26 21:36:08.695 7f724e879700 10 osd.6 pg_epoch: 4115 pg[36.19( empty local-lis/les=3044/3127 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=1 lpr=4067 pi=[3044,4067)/1 crt=0'0 unknown NOTIFY mbc={}] sending info+missing+log since 0'0
2020-02-26 21:36:08.695 7f724e879700 10 osd.6 pg_epoch: 4115 pg[36.19( empty local-lis/les=3044/3127 n=0 ec=2336/2336 lis/c 3044/3044 les/c/f 3127/3246/0 4067/4067/4067) [8,6] r=1 lpr=4067 pi=[3044,4067)/1 crt=0'0 unknown NOTIFY mbc={}] sending log((0'0,0'0], crt=0'0) missing(0 may_include_deletes = 1)
#5 Updated by Neha Ojha over 3 years ago
The part that I don't understand is that when osd.6 responded, the epoch_sent/epoch_requested (4115/4100) seemed correct:
2020-02-26 21:36:08.695 7f724e879700 1 -- [v2:172.21.15.92:6810/1012620,v1:172.21.15.92:6811/1012620] --> v1:172.21.15.80:6805/341918 -- PGlog(36.19 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e4115/4100) v5 -- 0x562106cdd000 con 0x562103651400
2020-02-26 21:36:08.697 7f7270baf700 10 --1- [v2:172.21.15.92:6810/1012620,v1:172.21.15.92:6811/1012620] >> v1:172.21.15.80:6805/341918 conn(0x562103651400 0x56210e32a000 :6811 s=OPENED pgs=9 cs=1 l=0).handle_tag_ack got ack seq 180 >= 179 on 0x562106cdd000 PGlog(36.19 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e4115/4100) v5
On osd.8, we see e4115/4115:
2020-02-26 21:36:08.696 7fe1c8eb9700 1 -- 172.21.15.80:6805/341918 <== osd.6 172.21.15.92:6811/1012620 179 ==== PGlog(36.19 log log((0'0,0'0], crt=0'0) pi ([0,0] intervals=) e4115/4115) v5 ==== 1052+0+0 (277749 2280 0 0) 0x55607d79e700 con 0x55607b932000
Since lpr=4114, this does not qualify as an old_peering_msg.
#6 Updated by Neha Ojha over 3 years ago
- Subject changed from nautilus: upgrade/mimic-x: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash to Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
- Status changed from New to Fix Under Review
- Assignee set to Neha Ojha
- Backport set to nautilus
- Pull request ID set to 33590
In any case, https://github.com/ceph/ceph/pull/33590 will just prevent it from crashing.
#7 Updated by Sage Weil over 3 years ago
Ah, this is why: 168e20ab8b8da3a5aed41b73f9627d10971be67b
commit 168e20ab8b8da3a5aed41b73f9627d10971be67b (refs/remotes/pr/22501, refs/remotes/one/wip-pglog-upgrade, refs/remotes/me/wip-pglog-upgrade)
Author: Sage Weil <sweil@redhat.com>
Date:   Sun Jun 10 21:41:54 2018 -0500

    messages/MOSDPGLog: encode epoch for query_epoch for pre-nautilus OSDs

    This fix is analogous to d5a3ae6008e4399b336f3a72d8dee328f97a0002.

    As of 84a1984b7cb70c74eaf517ffc424228fa954ed37 we set query_epoch to a
    lower value reliably, but we need to encode the latest epoch for
    pre-nautilus OSDs because they do not set last_peering_epoch reliably
    and having a lower query_epoch will cause them to discard this message.

    Signed-off-by: Sage Weil <sage@redhat.com>

diff --git a/src/messages/MOSDPGLog.h b/src/messages/MOSDPGLog.h
index 7138bbf08f0..2ae27f3a589 100644
--- a/src/messages/MOSDPGLog.h
+++ b/src/messages/MOSDPGLog.h
@@ -96,7 +96,12 @@ public:
     encode(info, payload);
     encode(log, payload);
     encode(missing, payload);
-    encode(query_epoch, payload);
+    if (!HAVE_FEATURE(features, SERVER_NAUTILUS)) {
+      // pre-nautilus OSDs do not set last_peering_reset properly
+      encode(epoch, payload);
+    } else {
+      encode(query_epoch, payload);
+    }
     encode(past_intervals, payload);
     encode(to, payload);
     encode(from, payload);
I think if we want to fix this, we should make this state machine change only in mimic?
#8 Updated by Neha Ojha over 3 years ago
- Subject changed from Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash to mimic: Receiving MLogRec in Started/Primary/Peering/GetInfo causes an osd crash
- Status changed from Fix Under Review to In Progress
- Backport deleted (nautilus)
#9 Updated by Neha Ojha over 3 years ago
- Status changed from In Progress to Fix Under Review
- Pull request ID changed from 33590 to 33594
#10 Updated by Yuri Weinstein over 3 years ago
#11 Updated by Nathan Cutler over 3 years ago
- Status changed from Fix Under Review to Resolved