Bug #36686


osd: pg log hard limit can cause crash during upgrade

Added by Josh Durgin over 5 years ago. Updated about 5 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
Correctness/Safety
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous, mimic
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
upgrade/luminous-x
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

During an upgrade from an earlier version, a primary running the new code can send a trim_to value that triggers an assert on a replica still running the old code. This happens when the pg log is trimmed beyond info.last_complete, which can occur during backfill.

This can be triggered by the luminous-x:stress-split suite with a small pg log to force backfill. In this run upgrading from 12.2.5 (no hard limit) to mimic (hard limit present) with osd_min_pg_log_entries = 1 and osd_max_pg_log_entries = 2 to force more backfilling, we hit this assert:

http://pulpito.ceph.com/joshd-2018-11-02_21:55:29-upgrade:luminous-x:stress-split-mimic-distro-basic-smithi/3216207/

/builddir/build/BUILD/ceph-12.2.5/src/osd/PGLog.cc: 170: FAILED assert(trim_to <= info.last_complete)

ceph version 12.2.5-42.0.TEST.bz1636267.el7cp (559ef7e0c955a21506efea93cfccafcf153e74b7) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f64c7427db0]
2: (PGLog::trim(eversion_t, pg_info_t&)+0x26f) [0x7f64c6fc4b8f]
3: (PG::append_log(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, eversion_t, eversion_t, ObjectStore::Transaction&, bool)+0x36d) [0x7f64c6f4f23d]
4: (PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, bool, ObjectStore::Transaction&)+0x74) [0x7f64c7066344]
5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x31a) [0x7f64c718eaea]
6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x327) [0x7f64c719f6e7]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x7f64c709fdb0]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x7f64c700b2cc]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x7f64c6e8e9c9]
10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x7f64c7111ea7]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x7f64c6ebdace]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x7f64c742d8c9]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f64c742f860]
14: (()+0x7dc5) [0x7f64c41dddc5]
15: (clone()+0x6d) [0x7f64c32d276d]
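The mismatch behind the backtrace can be reduced to a few lines. The sketch below is a simplified, illustrative model (the struct and function names are not the actual Ceph code): the old replica asserts that trim_to never passes its own last_complete, while a primary with the hard limit computes trim_to purely from the log head and the max-entries bound, which can overshoot a backfilling replica's last_complete.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for Ceph's eversion_t (simplified for illustration).
struct EVersion {
    uint64_t epoch;
    uint64_t version;
    bool operator<=(const EVersion& o) const {
        return epoch < o.epoch || (epoch == o.epoch && version <= o.version);
    }
};

// Old (pre-hard-limit) replica-side check, mirroring the failed assert
// in PGLog::trim: trim_to must not pass the replica's last_complete.
bool old_replica_accepts(EVersion trim_to, EVersion last_complete) {
    return trim_to <= last_complete;   // violated -> FAILED assert
}

// New primary-side trim target under the hard limit: keep at most
// max_entries behind head, with no regard for replicas' last_complete.
EVersion hard_limit_trim_to(EVersion head, uint64_t max_entries) {
    EVersion t = head;
    t.version = head.version > max_entries ? head.version - max_entries : 0;
    return t;
}
```

For example, with head at (5,100) and osd_max_pg_log_entries = 2, the primary trims to (5,98); a backfilling replica whose last_complete is still (5,50) fails the old check and hits the assert.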

We can avoid this by adding an osdmap flag to enable the hard limit, dependent on OSDs reporting a pg log hard limit feature bit, similar to how we handled recovery deletes.
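A minimal sketch of that gating, assuming a hypothetical flag and feature-bit layout (the identifiers below are illustrative, not the actual Ceph names): the monitor refuses to set the hard-limit flag until every OSD advertises the feature bit, and OSDs only apply hard-limit trimming once the flag is set, so a mixed-version cluster never sees the new trim_to values.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative feature bit (not the real Ceph feature bit value).
constexpr uint64_t FEATURE_PGLOG_HARDLIMIT = 1ull << 0;

struct OSDMap {
    std::vector<uint64_t> osd_features;   // features reported by each OSD
    bool pglog_hardlimit_flag = false;
};

// Monitor-side: refuse to set the flag while any OSD lacks the feature.
bool try_set_hardlimit_flag(OSDMap& map) {
    for (uint64_t f : map.osd_features)
        if (!(f & FEATURE_PGLOG_HARDLIMIT))
            return false;                 // an old OSD is still running
    map.pglog_hardlimit_flag = true;
    return true;
}

// OSD-side: only use the aggressive trim target when the flag is set.
bool use_hard_limit(const OSDMap& map) {
    return map.pglog_hardlimit_flag;
}
```

This mirrors the recovery-deletes approach mentioned above: the new behavior stays off until the whole cluster has proven it can handle it.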

A workaround for users is to upgrade all OSDs to a version with the pg log hard limit and restart them together, or to upgrade only while all PGs are active+clean.


Related issues (4): 0 open, 4 closed

Related to RADOS - Bug #37803: osd/PGLog.cc: 170: FAILED assert(trim_to <= info.last_complete) (Duplicate, 01/07/2019)

Has duplicate rgw - Bug #36706: Ceph ECBackend: assert fail at PGLog::trim (Duplicate, 11/06/2018)

Copied to RADOS - Backport #37902: mimic: osd: pg log hard limit can cause crash during upgrade (Resolved, Neha Ojha)

Copied to RADOS - Backport #37903: luminous: osd: pg log hard limit can cause crash during upgrade (Resolved, Neha Ojha)
