Bug #41336

All OSD Failed after Reboot.

Added by Ansgar Jazdzewski about 1 year ago. Updated about 1 year ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): OSD
Pull request ID:
Crash signature:

Description

We faced an issue with a writeback cache + EC pool.

We "solved" the issue by removing the pool and deleting the data on the OSDs using "ceph-objectstore-tool".

Since we have now removed our CephFS, we still do not know whether we could have
solved it without data loss.

History

#1 Updated by Patrick Donnelly about 1 year ago

  • Project changed from Ceph to RADOS
  • Component(RADOS) OSD added

#2 Updated by Neha Ojha about 1 year ago


2019-08-06 12:50:14.208 7f8a32e4f200 -1 /build/ceph-13.2.4/src/osd/ECUtil.h: In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' thread 7f8a32e4f200 time 2019-08-06 12:50:14.208743
/build/ceph-13.2.4/src/osd/ECUtil.h: 34: FAILED assert(stripe_width % stripe_size == 0)

 ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f8a2a1843c2]
 2: (()+0x2e5587) [0x7f8a2a184587]
 3: (ECBackend::ECBackend(PGBackend::Listener*, coll_t const&, boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore*, CephContext*, std::shared_ptr<ceph::ErasureCodeInterface>, unsigned long)+0x4de) [0xa4cbbe]
 4: (PGBackend::build_pg_backend(pg_pool_t const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, PGBackend::Listener*, coll_t, boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore*, CephContext*)+0x2f9) [0x9474e9]
 5: (PrimaryLogPG::PrimaryLogPG(OSDService*, std::shared_ptr<OSDMap const>, PGPool const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, spg_t)+0x138) [0x8f96e8]
 6: (OSD::_make_pg(std::shared_ptr<OSDMap const>, spg_t)+0x11d3) [0x753553]
 7: (OSD::load_pgs()+0x4a9) [0x758339]
 8: (OSD::init()+0xcd3) [0x7619c3]
 9: (main()+0x3678) [0x64d6a8]

#3 Updated by Josh Durgin about 1 year ago

This is fixed in later versions - the monitor makes sure stripe_unit is a valid value when the pool is created. With an existing invalid pool, the only way to fix it would be manually editing the ec profile. Deleting/recreating the pool is simpler.
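
For illustration, the failed assertion from the backtrace (stripe_width % stripe_size == 0 at ECUtil.h:34) can be mirrored in a few lines of Python. This is only a sketch of the arithmetic, not Ceph code: the function name stripe_info is hypothetical, the returned chunk size (stripe_width / stripe_size) is an assumption about what the constructor derives, and the numbers are made up.

```python
def stripe_info(stripe_size, stripe_width):
    """Mirror the check 'stripe_width % stripe_size == 0' from the backtrace.

    Returns the per-chunk size (an assumed derived value) when the check
    passes, and raises ValueError where the OSD would hit the assert.
    """
    if stripe_size <= 0:
        raise ValueError("stripe_size must be positive")
    if stripe_width % stripe_size != 0:
        raise ValueError(
            "stripe_width %d is not a multiple of stripe_size %d"
            % (stripe_width, stripe_size)
        )
    return stripe_width // stripe_size


# Made-up values: a width of 16384 splits evenly into 4 parts.
print(stripe_info(4, 16384))  # 4096

# Made-up values: 4096 is not divisible by 3, so the check fails.
try:
    stripe_info(3, 4096)
except ValueError as err:
    print("would assert:", err)
```

A pool whose stored values fail this check would trip the assert every time an OSD loads its PGs, which matches all OSDs failing after a reboot.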

#4 Updated by Josh Durgin about 1 year ago

  • Status changed from New to Resolved

#5 Updated by Oliver Freyermuth about 1 year ago

Hi,

Two questions:
- How can we find out whether a pool is affected?
  "ceph osd erasure-code-profile get" does not list stripe_unit for me.
- If a pool is affected, how can it be fixed? For a pool of more than 0.5 PB like ours, simply recreating it is not really an option if you want to avoid downtime, do not have twice the space, and want to keep the data.

If this is a common issue affecting all users upgrading, it should surely be mentioned in the release notes before any more data is lost.
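
One possible way to approach the first question, offered as an unconfirmed sketch rather than an official procedure: compare each EC pool's stripe_width with the k of its erasure-code profile, on the assumption that the stripe_size in the failed assert corresponds to k. The function name find_suspect_ec_pools is hypothetical, and the field names ("pools", "pool_name", "stripe_width", "erasure_code_profile", "erasure_code_profiles", "k") are assumptions about the JSON printed by "ceph osd dump --format json"; verify them against your own cluster's output. The sample data below is made up.

```python
def find_suspect_ec_pools(osd_dump):
    """Return names of pools whose stripe_width is not a multiple of k.

    osd_dump is a dict parsed from JSON; the field names used here are
    assumptions and should be checked against real cluster output.
    """
    profiles = osd_dump.get("erasure_code_profiles", {})
    suspects = []
    for pool in osd_dump.get("pools", []):
        profile = profiles.get(pool.get("erasure_code_profile", ""))
        if not profile or "k" not in profile:
            continue  # no EC profile found: treat as a non-EC pool and skip
        k = int(profile["k"])
        width = int(pool.get("stripe_width", 0))
        if k <= 0 or width % k != 0:
            suspects.append(pool["pool_name"])
    return suspects


# Hypothetical sample data, not taken from a real cluster.
sample_dump = {
    "pools": [
        {"pool_name": "ec_ok", "stripe_width": 16384, "erasure_code_profile": "p4"},
        {"pool_name": "ec_bad", "stripe_width": 4096, "erasure_code_profile": "p3"},
        {"pool_name": "rep", "stripe_width": 0, "erasure_code_profile": ""},
    ],
    "erasure_code_profiles": {
        "p4": {"k": "4", "m": "2"},
        "p3": {"k": "3", "m": "2"},
    },
}

print(find_suspect_ec_pools(sample_dump))  # ['ec_bad']
```

This only flags candidates for closer inspection; it does not answer the second question of how to repair an affected pool in place.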
