osd crashes during hit_set_trim and hit_set_remove_all if hit set object doesn't exist
i started an upgrade process to go from 0.94.7 to 10.2.5 on a production cluster that is using cache tiering. this cluster has 3 monitors, 28 storage nodes, around 370 osds. the upgrade of the monitors completed without issue. i then upgraded 2 of the storage nodes, and after the restarts, the osds started crashing during hit_set_trim. here is some of the output from the log.
2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
osd/ReplicatedPG.cc: 10514: FAILED assert(obc)
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbddac5]
2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
10: (()+0x7dc5) [0x7f8c1c209dc5]
11: (clone()+0x6d) [0x7f8c1aceaced]
it started on just one osd and then spread to others until most of the osds that are part of the cache tier were crashing. that was happening on both the osds that were running jewel and on the ones running hammer. in the process of trying to sort this out, the use_gmt_hitset option was set to true and all of the osds were upgraded to hammer. we still have not been able to determine a cause or a fix.
it looks like when hit_set_trim and hit_set_remove_all are being called, they are calling hit_set_archive_object() to generate a name based on a timestamp and then calling get_object_context() which then returns nothing and triggers an assert.
i raised the debug_osd to 10/10 and then analyzed the logs after the crash. i found the following in the ceph osd log afterwards.
2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043 pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923] local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0 266042/266042/266042) [146,116,179] r=0 lpr=266042
pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0 active+degraded NIBBLEWISE] get_object_context: no obc for soid 19:2ba00000:.ceph-internal::hit_set_19.5d4_archive_2017-03-03 05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.98101
6Z:head and !can_create
2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters -- start
2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters -- finish
2017-03-03 03:10:31.921076 7f2194051700 7 osd.146 266043 handle_pg_notify from osd.255
2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters -- start
2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters -- finish
2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time 2017-03-03 03:10:31.918201
osd/ReplicatedPG.cc: 11494: FAILED assert(obc)
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f21acee9425]
2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated const&)+0xac) [0x7f21ac916adc]
5: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x179) [0x7f21a
6: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xcd) [0x7f21ac977ccd]
7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x7f21ac95
8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f21ac8c8552]
11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f21aceda46e]
12: (ThreadPool::WorkThread::entry()+0x10) [0x7f21acedb350]
13: (()+0x7dc5) [0x7f21aac04dc5]
14: (clone()+0x6d) [0x7f21a928f73d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
it looks like it was looking for a hit_set with a timestamp of 05:55:58.459084Z but the timestamp on the log lines is for 3:10:31.921096. we have tried various options of the hit_set_count and hit_set_period but the osds are still trying to find objects for timestamps that don't exist. i've looked through the osd pg directories (i.e. /var/lib/ceph/osd/ceph-##/current/<pg.id>_head) and found hit_set objects but with wildly different timestamps.
there are a few bugs on tracker that look like their related but we haven't been able to find a fix from them yet. http://tracker.ceph.com/issues/13185 is the closest one but it hasn't given us a hint on a fix yet. http://tracker.ceph.com/issues/14399 and http://tracker.ceph.com/issues/9732 also seem related but also haven't helped. we have verified that all nodes are using the same timezone and have consistent timestamps. we're currently looking at options to edit the hit_set_setup() code to skip the hit_set_remove() call and the failing code. this will give us empty hit sets and the thought is that we will then flush the cache tier and remove it then add one back. still not sure if its going to work.
does anyone else have any other thoughts about how to handle this or why this is happening? or what else we could do to get the osds back online? this has crashed almost all of the cache tier osds and the cluster is unusable at the moment.
#1 Updated by Gerben Meijer 12 months ago
So my take on this is that for some reason, either the new jewel OSDs or the old hammer ones started creating GMT timestamped hitset archives as rados objects, which the other OSDs did not find and/or could not replicate, which then triggered asserts somewhere.
All boxes were NTP time synced and on MDT timezone.
- Status changed from New to Need Review
- Assignee set to Kefu Chai
the use-gmt flag is unset for pools created before upgrading, so i don't think that's the case.
https://github.com/ceph/ceph/pull/13821 is created to skip the archive hitset object when we are trying to remove it.
#4 Updated by Gerben Meijer 12 months ago
Kefu Chai wrote:
the use-gmt flag is unset for pools created before upgrading, so i don't think that's the case.
The very first OSD that crashed logged this:
Mar 02 20:46:30 ceph2.silo2.uspro1.unifiedlayer.com ceph-osd: 2017-03-02 20:46:30.973974 7f3f9651f700 -1 osd.254 pg_epoch: 262185 pg[19.643( v 262185'80274196 lc 262179'80273645 (262173'80270577,262185'80274196] local-les=262181 n=379 ec=83762 les/c/f 262181/262179/0 262180/262180/262180) [254,248,168] r=0 lpr=262180 pi=262170-262179/2 crt=262179'80273683 mlcod 262179'80273645 active+recovery_wait+degraded NIBBLEWISE m=1] agent_load_hit_sets: could not load hitset 19:c2600000:.ceph-internal::hit_set_19.643_archive_2017-03-03 03%3a44%3a39.027534Z_2017-03-03 03%3a44%3a39.653995Z:head
It crashed at 20:46 local time MDT on March 2nd, and it is looking for a hitset object that is 7 hours in the future. MDT+7 == GMT. At this point, 6 cache-tier OSDs were on Jewel, all others were still Hammer. We had not changed any flags anywhere, and we definitely weren't using use-gmt.
Here's the cache tier pool XML that I dumped at 21:03 with ceph report:
<pool> <pool>19</pool> <pool_name>ssd-tier</pool_name> <flags>9</flags> <flags_names>hashpspool,incomplete_clones</flags_names> <type>1</type> <size>3</size> <min_size>2</min_size> <crush_ruleset>1</crush_ruleset> <object_hash>2</object_hash> <pg_num>2048</pg_num> <pg_placement_num>2048</pg_placement_num> <crash_replay_interval>0</crash_replay_interval> <last_change>262131</last_change> <last_force_op_resend>0</last_force_op_resend> <auid>0</auid> <snap_mode>selfmanaged</snap_mode> <snap_seq>2</snap_seq> <snap_epoch>262046</snap_epoch> <pool_snaps /> <removed_snaps>[1~1]</removed_snaps> <quota_max_bytes>0</quota_max_bytes> <quota_max_objects>0</quota_max_objects> <tiers /> <tier_of>17</tier_of> <read_tier>-1</read_tier> <write_tier>-1</write_tier> <cache_mode>forward</cache_mode> <target_max_bytes>10995116277760</target_max_bytes> <target_max_objects>0</target_max_objects> <cache_target_dirty_ratio_micro>400000</cache_target_dirty_ratio_micro> <cache_target_dirty_high_ratio_micro>600000</cache_target_dirty_high_ratio_micro> <cache_target_full_ratio_micro>400000</cache_target_full_ratio_micro> <cache_min_flush_age>0</cache_min_flush_age> <cache_min_evict_age>0</cache_min_evict_age> <erasure_code_profile>replicated_ruleset_ssd</erasure_code_profile> <hit_set_params> <type>bloom</type> <false_positive_probability>0.05</false_positive_probability> <target_size>0</target_size> <seed>0</seed> </hit_set_params> <hit_set_period>600</hit_set_period> <hit_set_count>3</hit_set_count> <use_gmt_hitset>false</use_gmt_hitset> <min_read_recency_for_promote>1</min_read_recency_for_promote> <min_write_recency_for_promote>1</min_write_recency_for_promote> <hit_set_grade_decay_rate>0</hit_set_grade_decay_rate> <hit_set_search_last_n>1</hit_set_search_last_n> <grade_table> <value>1000000</value> <value>1000000</value> <value>1000000</value> </grade_table> <stripe_width>0</stripe_width> <expected_num_objects>0</expected_num_objects> <fast_read>false</fast_read> <options /> </pool>
Unfortunately the log rotated for that OSD so I don't have an earlier stack trace. Either way: why is an OSD looking for a hitset 7 hours in the future? Doesn't seem right, or does it?
#7 Updated by Nathan Cutler 12 months ago
- Status changed from Need Review to Pending Backport
- Backport set to hammer, jewel
Apparently, this is a special-case bug that is only present in hammer and jewel, not in master. I will stage backport tracker issues as usual (even though the master PR is "missing"), since we need to track the hammer and jewel fixes separately.