Bug #45202

Repeated OSD crashes in PrimaryLogPG::hit_set_trim()

Added by KOT MATPOCKuH almost 4 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
Yes
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After network troubles I got 1 PG in the recovery_unfound state.
I tried to solve this problem using the command:

ceph pg 2.f8 mark_unfound_lost revert
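
For reference, a minimal sketch of how the unfound state can be inspected (the PG id is the one from this report; list_missing is the subcommand name used by Mimic-era releases):

ceph health detail            # shows which PGs have unfound objects
ceph pg 2.f8 query            # recovery state, missing/unfound counts
ceph pg 2.f8 list_missing     # enumerates the unfound objects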

About one hour after connectivity was restored, OSD.12 crashed:

 ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
 1: (()+0x911e70) [0x564d0067fe70]
 2: (()+0xf5d0) [0x7f1272dad5d0]
 3: (gsignal()+0x37) [0x7f1271dce2c7]
 4: (abort()+0x148) [0x7f1271dcf9b8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f12762252b2]
 6: (()+0x25a337) [0x7f1276225337]
 7: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x930) [0x564d002ab480]
 8: (PrimaryLogPG::hit_set_persist()+0xa0c) [0x564d002afafc]
 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2989) [0x564d002c5f09]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x564d002cac09]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x564d00124c87]
 12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x564d0039d8c2]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x564d00144ae2]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f127622aec3]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f127622bab0]
 16: (()+0x7dd5) [0x7f1272da5dd5]
 17: (clone()+0x6d) [0x7f1271e95f6d]

Crashes of this OSD were repeated many times.
I tried the following (a rough command sketch follows the list):
- deep-scrub all PGs on this OSD;
- ceph-bluestore-tool fsck --deep yes for this OSD;
- upgrade Ceph on this node from 13.2.4 to 13.2.9.
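
A minimal sketch of the first two steps, assuming osd.12 and the default data path (both are assumptions, not quoted from this report):

ceph pg ls-by-osd 12                 # list the PGs hosted on this OSD
ceph pg deep-scrub 2.f8              # repeat for every PG from the listing

systemctl stop ceph-osd@12
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12 --deep yes
systemctl start ceph-osd@12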

After this I tried to flush PGs from the cache pool using:

rados -p vms-cache cache-try-flush-evict-all

And got a crash of OSD.13 on another node.
Additionally, both OSDs now crash within seconds of starting (from ~5 seconds to <60 seconds).

I set:

ceph osd tier cache-mode vms-cache forward --yes-i-really-mean-it

and decreased target_max_bytes.
About one hour after this change the OSDs stopped crashing, and for the last ~30 minutes they have been working properly.
But I think that when I continue to flush PGs from the cache pool, the OSDs may crash again.
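
The exact target_max_bytes change is not quoted above; a minimal sketch with a placeholder value would be:

ceph osd pool set vms-cache target_max_bytes 100000000000   # placeholder value, not the one actually used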

I collected log output from both OSDs and uploaded the logs using ceph-post-file:
ceph-post-file: 900533d2-8558-11ea-ad44-00144fca4038
ceph-post-file: d45082b8-8558-11ea-ad44-00144fca4038

My problem may be a duplicate of:
Bug #19185
Bug #40388

History

#1 Updated by KOT MATPOCKuH almost 4 years ago

After an hour of work, both OSDs resumed crashing every few seconds.

#2 Updated by KOT MATPOCKuH almost 4 years ago

I got a crash of another OSD on the 4th node, and the last lines in the log are not related to PG 2.f8:

    -2> 2020-04-24 10:30:25.570 7f698f680700  5 -- 192.168.128.1:6807/813546 >> 192.168.128.3:6802/12438 conn(0x555fe4d00000 :6807 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=13277 cs=1 l=0). rx osd.14 seq 614 0x555fe51a0960 pg_info((query:41109 sent:41109 2.107( v 30124'28156278 (6953'28153233,30124'28156278] local-lis/les=41108/41109 n=4 ec=66/66 lis/c 41108/41106 les/c/f 41109/41107/201 41108/41108/5501))=([41106,41107] intervals=([41106,41107] acting 10,14,17)) epoch 41109) v5
    -1> 2020-04-24 10:30:25.570 7f698f680700  1 -- 192.168.128.1:6807/813546 <== osd.14 192.168.128.3:6802/12438 614 ==== pg_info((query:41109 sent:41109 2.107( v 30124'28156278 (6953'28153233,30124'28156278] local-lis/les=41108/41109 n=4 ec=66/66 lis/c 41108/41106 les/c/f 41109/41107/201 41108/41108/5501))=([41106,41107] intervals=([41106,41107] acting 10,14,17)) epoch 41109) v5 ==== 1185+0+0 (2218552892 0 0) 0x555fe51a0960 con 0x555fe4d00000
     0> 2020-04-24 10:30:25.605 7f6971b54700 -1 *** Caught signal (Aborted) ** in thread 7f6971b54700 thread_name:tp_osd_tp

 ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
 1: (()+0x911e70) [0x555fc1861e70]
 2: (()+0xf5d0) [0x7f69946e55d0]
 3: (gsignal()+0x37) [0x7f69937062c7]
 4: (abort()+0x148) [0x7f69937079b8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x242) [0x7f6997b5d2b2]
 6: (()+0x25a337) [0x7f6997b5d337]
 7: (PrimaryLogPG::hit_set_trim(std::unique_ptr<PrimaryLogPG::OpContext, std::default_delete<PrimaryLogPG::OpContext> >&, unsigned int)+0x930) [0x555fc148d480]
 8: (PrimaryLogPG::hit_set_persist()+0xa0c) [0x555fc1491afc]
 9: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x2989) [0x555fc14a7f09]
 10: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xc99) [0x555fc14acc09]
 11: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1b7) [0x555fc1306c87]
 12: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x555fc157f8c2]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x555fc1326ae2]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7f6997b62ec3]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f6997b63ab0]
 16: (()+0x7dd5) [0x7f69946dddd5]
 17: (clone()+0x6d) [0x7f69937cdf6d]

#3 Updated by Pawel Stefanski over 2 years ago

I got the same assert on 14.2.22 when the following scenario is replayed (a rough command sketch follows the list):

1. create a write-back cache tier on an RBD pool
2. restart a few OSDs
3. unfound objects appear
4. mark_unfound_lost revert is applied
5. the cluster is healthy for a short period of time
6. OSDs die; the affected PG had missing objects from the cache layer's hit_set_list
7. the affected OSDs can't be restarted; they die with the same assert as above
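
A minimal sketch of steps 1, 2 and 4, with placeholder pool names, OSD ids and PG id (none of them are taken from this comment):

ceph osd tier add rbd rbd-cache                       # step 1: attach a cache pool
ceph osd tier cache-mode rbd-cache writeback
ceph osd tier set-overlay rbd rbd-cache

systemctl restart ceph-osd@1 ceph-osd@2               # step 2: restart a few OSDs

ceph pg <pgid> mark_unfound_lost revert               # step 4: revert unfound objects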

#4 Updated by Neha Ojha over 2 years ago

Is this reproducible on octopus or pacific, which are not EOL?

#5 Updated by Dmitry Smirnov over 2 years ago

We face the same issue: four simultaneous OSD down events on 4 different hosts, and we are unable to restart all of them together at the same time. The coredump trace is similar to the one reported in this bug.
It is possible to start 3 of them, but only in a specific combination. When we start the last remaining OSD, the other three OSDs crash within 20-30 seconds with similar coredumps.

ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
(ceph::__ceph_abort(char const*, int, char const*, std::string const&)+0xdd) [0x55efc7f34a12]
(BlueStore::ExtentMap::allocate_spanning_blob_id()+0x102) [0x55efc8493472]
(BlueStore::ExtentMap::reshard(KeyValueDB*, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x14b5) [0x55efc8494945]
(BlueStore::_record_onode(boost::intrusive_ptr<BlueStore::Onode>&, std::shared_ptr<KeyValueDB::TransactionImpl>&)+0xb7) [0x55efc8496227]
(BlueStore::_txc_write_nodes(BlueStore::TransContext*, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x87) [0x55efc8497d57]
(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x3bc) [0x55efc84a62ac]
(non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55efc820fa24]
(ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&)+0xb10) [0x55efc8328130]
(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2f6) [0x55efc833c3c6]
(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x4a) [0x55efc822631a]
(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5b3) [0x55efc81d4683]
(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x362) [0x55efc8015882]
(PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x55efc82a5272]
(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x90f) [0x55efc8030d1f]
(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b6) [0x55efc85ef006]
(ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55efc85f1b20]
(()+0x7ea5) [0x7f66d9080ea5]
(clone()+0x6d) [0x7f66d7f439fd]
0> 2021-10-03 20:04:34.781 7f66b7445700 -1 *** Caught signal (Aborted) ** in thread 7f66b7445700 thread_name:tp_osd_tp

I guess this problem is not Ceph-version specific and can occur with multiple Ceph releases. We are still not able to make the entire pool available for use.

#6 Updated by Igor Fedotov over 2 years ago

Dmitry Smirnov wrote:

We face the same issue: four simultaneous OSD down events on 4 different hosts, and we are unable to restart all of them together at the same time. The coredump trace is similar to the one reported in this bug.
It is possible to start 3 of them, but only in a specific combination. When we start the last remaining OSD, the other three OSDs crash within 20-30 seconds with similar coredumps.

[...]

Hey Dmitry,
I'm not sure why you're saying this is the same coredump as in the original post. The backtrace you provided looks pretty different to me. Is it always like that?
If so, could you please set debug_bluestore to 20 and collect/share the relevant OSD log.
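
One possible way to do that (the OSD id is a placeholder, not taken from this thread):

ceph config set osd.12 debug_bluestore 20             # persists across restarts
ceph tell osd.12 injectargs '--debug_bluestore 20'    # applies to a running daemon
# the resulting log is typically /var/log/ceph/ceph-osd.12.log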

#7 Updated by Igor Fedotov over 2 years ago

Igor Fedotov wrote:

[...]

I would rather bet that your issue is a duplicate of https://tracker.ceph.com/issues/48216

#8 Updated by Dmitry Smirnov over 2 years ago

All evidence points to a firmware bug present in dozens of the used Hitachi 12TB HDDs (HGST model HUH721212AL5200) up to firmware version NS09. Dell released a firmware fix which addresses the following condition, but we had no clue about it:

- Fix potential stale data condition where data from a sequential write stream does not invalidate previous overlapping read data in cache.

We managed to get the affected EC pool back to a working state, but 2 of the 1024 PGs contain corrupted/inconsistent data; those two were left with fewer EC shards than the K factor. One PG has 3 shards of the 4+2 EC scheme, the second only 2, and those cannot be reconstructed.
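
For anyone checking for the same issue, a minimal sketch of how the drive firmware revision can be read (the device path is a placeholder; smartctl is assumed to be installed):

smartctl -i /dev/sdX        # prints the device model and firmware version to compare against NS09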
