From teuthology-2015-01-28_17:05:01-upgrade:giant-x-next-distro-basic-multi/728068 logs I see that osd.0 was killed and revived 3 times. The final time it hit
the assert trying to load pg 0.22. The interesting thing is that the ceph-objectstore-tool was never run in the test case up to that point.
I suspect the write_info() was never called upon pg creation and subsequent handle_activate_map() calls didn't bother because it wasn't over 200 epochs
past. Later epochs think that info was persisted in epoch 649. Here are pg messages during pg creation in epoch 649.
I propose the following change. What isn't clear is why haven't we seen this before now.
diff --git a/src/osd/OSD.cc b/src/osd/OSD.cc
index 4d6a01d..179abab 100644
--- a/src/osd/OSD.cc
+++ b/src/osd/OSD.cc
@@ -7038,6 +7038,8 @@ void OSD::handle_pg_create(OpRequestRef op)
pg->info.last_epoch_started = pg->info.history.last_epoch_started;
creating_pgs.erase(pgid);
pg->handle_create(&rctx);
+ pg->dirty_info = true;
+ pg->dirty_big_info = true;
pg->write_if_dirty(*rctx.transaction);
pg->publish_stats_to_osd();
pg->unlock();
2015-01-28 22:49:01.301575 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22(unlocked)] enter Initial
2015-01-28 22:49:01.301588 7f9ce059a700 20 osd.0 pg_epoch: 649 pg[0.22(unlocked)] enter NotTrimming
2015-01-28 22:49:01.301614 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( DNE empty local-les=0 n=0 ec=0 les/c 0/0 0/0/0) [] r=0 lpr=0 crt=0'0 inactive] init role 0 up [0,10] acting [0,10] history ec=1 les/c 642/642 649/649/649 1 past_intervals
2015-01-28 22:49:01.301672 7f9ce059a700 7 osd.0 649 _create_lock_pg pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=0 pi=641-648/1 crt=0'0 mlcod 0'0 inactive]
2015-01-28 22:49:01.301681 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=0 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] handle_create
2015-01-28 22:49:01.301694 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=0 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] exit Initial 0.000119 0 0.000000
2015-01-28 22:49:01.301708 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=0 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] enter Reset
2015-01-28 22:49:01.301718 7f9ce059a700 20 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=0 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] set_last_peering_reset 649
2015-01-28 22:49:01.301728 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] Clearing blocked outgoing recovery messages
2015-01-28 22:49:01.301741 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] Not blocking outgoing recovery messages
2015-01-28 22:49:01.301764 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] update_heartbeat_peers -> 0,10
2015-01-28 22:49:01.301790 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] take_waiters
2015-01-28 22:49:01.301805 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] exit Reset 0.000097 1 0.000068
2015-01-28 22:49:01.301819 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] enter Started
2015-01-28 22:49:01.301829 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] enter Start
2015-01-28 22:49:01.301838 7f9ce059a700 1 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] state<Start>: transitioning to Primary
2015-01-28 22:49:01.301851 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] exit Start 0.000021 0 0.000000
2015-01-28 22:49:01.301864 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] enter Started/Primary
2015-01-28 22:49:01.301875 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] enter Started/Primary/Peering
2015-01-28 22:49:01.301892 7f9ce059a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetInfo
2015-01-28 22:49:01.301913 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] _calc_past_interval_range: already have past intervals back to 642
2015-01-28 22:49:01.301953 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior interval(641-648 up [10](10) acting [10](10) maybe_went_rw)
2015-01-28 22:49:01.301966 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] PriorSet: build_prior final: probe 0,10 down blocked_by {}
2015-01-28 22:49:01.301992 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] up_thru 627 < same_since 649, must notify monitor
2015-01-28 22:49:01.302005 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>: querying info from osd.10
2015-01-28 22:49:01.302024 7f9ce059a700 15 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 649:1
2015-01-28 22:49:01.302156 7f9ce059a700 10 osd.0 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] is new
2015-01-28 22:49:01.529688 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] handle_peering_event: epoch_sent: 649 epoch_requested: 649 MNotifyRec from 10 notify: (query_epoch:649, epoch_sent:649, info:0.22( empty local-les=642 n=0 ec=1 les/c 642/642 649/649/649)) features: 0x3fffffffffff
2015-01-28 22:49:01.529709 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] got osd.10 0.22( empty local-les=642 n=0 ec=1 les/c 642/642 649/649/649)
2015-01-28 22:49:01.529734 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] update_heartbeat_peers 0,10 unchanged
2015-01-28 22:49:01.529747 7f9cd858a700 20 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>: Adding osd: 10 features: 3fffffffffff
2015-01-28 22:49:01.529758 7f9cd858a700 20 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetInfo>: Common features: 3fffffffffff
2015-01-28 22:49:01.529771 7f9cd858a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] exit Started/Primary/Peering/GetInfo 0.227878 1 0.000280
2015-01-28 22:49:01.529789 7f9cd858a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetLog
2015-01-28 22:49:01.529805 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] calc_acting osd.0 0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649)
2015-01-28 22:49:01.529818 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] calc_acting osd.10 0.22( empty local-les=642 n=0 ec=1 les/c 642/642 649/649/649)
2015-01-28 22:49:01.529849 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] calc_acting newest update on osd.0 with 0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649)
2015-01-28 22:49:01.529864 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] actingbackfill is 0,10
2015-01-28 22:49:01.529874 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] choose_acting want [0,10] (== acting) backfill_targets
2015-01-28 22:49:01.529891 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetLog>: leaving GetLog
2015-01-28 22:49:01.529909 7f9cd858a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] exit Started/Primary/Peering/GetLog 0.000120 0 0.000000
2015-01-28 22:49:01.529923 7f9cd858a700 15 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 649:2
2015-01-28 22:49:01.538026 7f9cd858a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/GetMissing
2015-01-28 22:49:01.538049 7f9cd858a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary/Peering/GetMissing>: still need up_thru update before going active
2015-01-28 22:49:01.538069 7f9cd858a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] exit Started/Primary/Peering/GetMissing 0.000044 0 0.000000
2015-01-28 22:49:01.538085 7f9cd858a700 15 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] publish_stats_to_osd 649:3
2015-01-28 22:49:01.538097 7f9cd858a700 5 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] enter Started/Primary/Peering/WaitUpThru
2015-01-28 22:49:01.554565 7f9ce485c700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] flushed
2015-01-28 22:49:01.625987 7f9cd8d8b700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] handle_peering_event: epoch_sent: 649 epoch_requested: 649 MNotifyRec from 10 notify: (query_epoch:649, epoch_sent:649, info:0.22( empty local-les=642 n=0 ec=1 les/c 642/642 649/649/649)) features: 0x3fffffffffff
2015-01-28 22:49:01.626003 7f9cd8d8b700 7 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary>: handle_pg_notify from osd.10
2015-01-28 22:49:01.626015 7f9cd8d8b700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] state<Started/Primary>: pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] got dup osd.10 info 0.22( empty local-les=642 n=0 ec=1 les/c 642/642 649/649/649), identical to ours
2015-01-28 22:49:01.705367 7f9cd8d8b700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] handle_peering_event: epoch_sent: 649 epoch_requested: 649 FlushedEvt
2015-01-28 22:49:01.705392 7f9cd8d8b700 15 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] requeue_ops
2015-01-28 22:49:02.851155 7f9ce059a700 10 osd.0 pg_epoch: 649 pg[0.22( empty local-les=0 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 peering] null
2015-01-28 22:49:02.861866 7f9cd858a700 20 osd.0 pg_epoch: 650 pg[0.22( empty local-les=650 n=0 ec=1 les/c 642/642 649/649/649) [0,10] r=0 lpr=649 pi=641-648/1 crt=0'0 mlcod 0'0 inactive] handle_activate_map: Not dirtying info: last_persisted is 649 while current is 650
2015-01-28 22:49:04.697427 7f9cd8d8b700 20 osd.0 pg_epoch: 651 pg[0.22( empty local-les=650 n=0 ec=1 les/c 650/650 649/649/649) [0,10] r=0 lpr=649 crt=0'0 mlcod 0'0 active+clean] handle_activate_map: Not dirtying info: last_persisted is 650 while current is 651
...
2015-01-28 22:55:33.006105 7f9cd858a700 20 osd.0 pg_epoch: 804 pg[0.22( empty local-les=650 n=0 ec=1 les/c 650/650 649/649/649) [0,10] r=0 lpr=649 crt=0'0 mlcod 0'0 active+clean] handle_activate_map: Not dirtying info: last_persisted is 650 while current is 804
osd.0 taken down