Bug #36686: osd: pg log hard limit can cause crash during upgrade - RADOS - Ceph

Actions

Copy link

Bug #36686

closed

osd: pg log hard limit can cause crash during upgrade

Added by Josh Durgin over 5 years ago. Updated about 5 years ago.

Status:

Resolved

Priority:

Urgent

Assignee:

Neha Ojha

Category:

Correctness/Safety

Target version:

% Done:

Source:

Tags:

Backport:

luminous, mimic

Regression:

Severity:

3 - minor

Reviewed:

Affected Versions:

ceph-qa-suite:

upgrade/luminous-x

Component(RADOS):

OSD

Pull request ID:

25816

Crash signature (v1):

Crash signature (v2):

Description

During an upgrade from an earlier version, a primary running the new code will send a trim_to value to a replica that triggers an assert in the old code in certain circumstances. Namely, when the pg log is being trimmed beyond info.last_clean, during backfill.

This can be triggered by the luminous-x:stress-split suite with a small pg log to force backfill. In this run upgrading from 12.2.5 (no hard limit) to mimic (hard limit present) with osd_min_pg_log_entries = 1 and osd_max_pg_log_entries = 2 to force more backfilling, we hit this assert:

http://pulpito.ceph.com/joshd-2018-11-02_21:55:29-upgrade:luminous-x:stress-split-mimic-distro-basic-smithi/3216207/

/builddir/build/BUILD/ceph-12.2.5/src/osd/PGLog.cc: 170: FAILED assert(trim_to <= info.last_complete)

ceph version 12.2.5-42.0.TEST.bz1636267.el7cp (559ef7e0c955a21506efea93cfccafcf153e74b7) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f64c7427db0]
2: (PGLog::trim(eversion_t, pg_info_t&)+0x26f) [0x7f64c6fc4b8f]
3: (PG::append_log(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, eversion_t, eversion_t, ObjectStore::Transaction&, bool)+0x36d) [0x7f64c6f4f23d]
4: (PrimaryLogPG::log_operation(std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t> const&, eversion_t const&, eversion_t const&, bool, ObjectStore::Transaction&)+0x74) [0x7f64c7066344]
5: (ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, Context*)+0x31a) [0x7f64c718eaea]
6: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x327) [0x7f64c719f6e7]
7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50) [0x7f64c709fdb0]
8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x59c) [0x7f64c700b2cc]
9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3f9) [0x7f64c6e8e9c9]
10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest> const&)+0x57) [0x7f64c7111ea7]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xfce) [0x7f64c6ebdace]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839) [0x7f64c742d8c9]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f64c742f860]
14: (()+0x7dc5) [0x7f64c41dddc5]
15: (clone()+0x6d) [0x7f64c32d276d]

We can avoid this by adding an osdmap flag to enable the hard limit, dependent on OSDs reporting a pg log hard limit feature bit, similar to how we handled recovery deletes.

A workaround for users is to upgrade and restart all OSDs to a version with the pg hard limit, or only upgrade when all PGs are active+clean.

Related issues 4 (0 open — 4 closed)

Actions

Copy link

Also available in: Atom PDF

Project

General

Profile

Ceph » RADOS

Custom queries

Bug #36686

osd: pg log hard limit can cause crash during upgrade

Updated by Neha Ojha over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Yuri Weinstein over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Yuri Weinstein over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Sage Weil over 5 years ago

Updated by Alexander Morozov over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Alexander Morozov over 5 years ago

Updated by Oliver Freyermuth over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Neha Ojha over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Nathan Cutler over 5 years ago

Updated by Nathan Cutler about 5 years ago