Bug #52513
BlueStore.cc: 12391: ceph_abort_msg("unexpected error") on operation 15
Description
Two OSDs serving PG 17.7ff [684,768,760] crashed simultaneously:
RECENT_CRASH 2 daemons have recently crashed
    osd.760 crashed on host meta114 at 2021-09-03 21:50:28.138745Z
    osd.768 crashed on host meta115 at 2021-09-03 21:50:28.123223Z
It seems ENOENT is treated as an unexpected error when the object lock is acquired:
    -8> 2021-09-04 00:50:28.077 7f3299342700 -1 bluestore(/var/lib/ceph/osd/ceph-768) _txc_add_transaction error (2) No such file or directory not handled on operation 15 (op 0, counting from 0)
    -7> 2021-09-04 00:50:28.077 7f3299342700 -1 bluestore(/var/lib/ceph/osd/ceph-768) unexpected error code
    -6> 2021-09-04 00:50:28.077 7f3299342700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "setattrs",
            "collection": "17.7ff_head",
            "oid": "#17:ffffffff:::%2fv2%2fmeta%2fd732de8b-8b15-5b57-a54a-fc23aadce4fe%2f88e9261c-832b-5d13-9517-40015c81e84e%2f27%2f11033%2f11033693%2f556500fe714ab37.webp:head#",
            "attr_lens": {
                "_": 376,
                "_lock.libcephv2.lock": 153,
                "snapset": 35
            }
        }
    ]
}
    -5> 2021-09-04 00:50:28.117 7f3299342700 -1 /build/ceph-14.2.22/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f3299342700 time 2021-09-04 00:50:28.083637
/build/ceph-14.2.22/src/os/bluestore/BlueStore.cc: 12391: ceph_abort_msg("unexpected error")

 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x55980fe784c4]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xbde) [0x5598103dbaee]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2aa) [0x5598103e19fa]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55981011b514]
 5: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb09) [0x55981021c0a9]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1a7) [0x55981022a407]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55981012ee57]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x705) [0x5598100dd965]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bf) [0x55980fefbd8f]
 10: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5598101b5b22]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55980ff19835]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x5598105393ec]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55981053c5b0]
 14: (()+0x76db) [0x7f32bd6e66db]
 15: (clone()+0x3f) [0x7f32bc47d71f]
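The abort decision can be illustrated with a short sketch. This is a hedged Python paraphrase of the error-tolerance check in BlueStore::_txc_add_transaction (the op-code values match ObjectStore::Transaction; the function and set names here are hypothetical, and the real list of ops that do not forgive ENOENT is longer, covering the clone/setattr/omap families):

```python
import errno

# ObjectStore::Transaction op codes (subset)
OP_TOUCH = 9
OP_REMOVE = 13
OP_SETATTR = 14
OP_SETATTRS = 15  # "operation 15" in the crash log above

# Ops for which BlueStore does NOT forgive ENOENT, because they imply
# the object must already exist on this replica (abbreviated set).
ENOENT_NOT_OK = {OP_SETATTR, OP_SETATTRS}

def txc_error_is_fatal(r: int, op: int) -> bool:
    """Return True if the store would hit ceph_abort_msg("unexpected error")."""
    if r == 0:
        return False
    if r == -errno.ENOENT and op not in ENOENT_NOT_OK:
        return False   # -ENOENT is usually okay (e.g. removing a gone object)
    if r == -errno.ENODATA:
        return False   # a missing xattr is tolerated
    return True        # anything else aborts the OSD, as in this crash

# setattrs on an object missing from the replica -> fatal abort
print(txc_error_is_fatal(-errno.ENOENT, OP_SETATTRS))  # True
# removing an already-removed object -> tolerated
print(txc_error_is_fatal(-errno.ENOENT, OP_REMOVE))    # False
```

This is why a setattrs replicated to an OSD where the object is missing takes down the OSD rather than being retried or ignored.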
History
#1 Updated by Konstantin Shalygin over 2 years ago
- File tracker_52513.zip added
#2 Updated by Konstantin Shalygin over 2 years ago
The PG was actually inconsistent:
16872:2021-09-07 13:32:13.449625 osd.684 (osd.684) 102 : cluster [ERR] 17.7ff scrub : stat mismatch, got 4207028/4207029 objects, 0/0 clones, 4207028/4207029 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 1950026408/1950027022 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
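Reading the "got X/Y" pairs in the scrub error above (observed vs. expected), the mismatch comes down to a single missing object:

```python
# (observed, expected) pairs taken from the scrub stat-mismatch line above
stats = {
    "objects": (4207028, 4207029),
    "dirty": (4207028, 4207029),
    "bytes": (1950026408, 1950027022),
}

# difference expected - observed, keeping only the categories that disagree
diffs = {k: e - g for k, (g, e) in stats.items() if g != e}
print(diffs)  # {'objects': 1, 'dirty': 1, 'bytes': 614}
```

So the PG is short exactly one object of 614 bytes, consistent with an object having been lost on a replica.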
#3 Updated by Neha Ojha over 2 years ago
- Project changed from bluestore to RADOS
#4 Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Can you capture a coredump or OSD logs with debug_osd=20, debug_bluestore=20, debug_ms=1 if this crash is reproducible?
#5 Updated by Konstantin Shalygin over 2 years ago
@Neha, I will try to reproduce this by removing objects from replicas
#6 Updated by Konstantin Shalygin over 2 years ago
I reproduced this issue:
- put a rados object into the pool
- get the mapping for this object:
osdmap e1555 pool 'meta.ru-1a' (3) object 'lock_bug_object' -> pg 3.cd8d62a0 (3.20) -> up ([2,8,11], p2) acting ([2,8,11], p2)
- then remove this object from osd.8 and osd.11:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 3.20 lock_bug_object remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --pgid 3.20 lock_bug_object remove
- restart osd.8 and osd.11
- set a rados xattr or a rados lock:
# set rados lock
rados lock get lock_bug_object my_lock --pool meta.ru-1a
# set xattr
rados setxattr lock_bug_object lock.libcephv2.lock 153 --pool meta.ru-1a
- the replicas crashed simultaneously
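The osdmap output above shows the raw object hash 3.cd8d62a0 folding to pg 3.20. A small sketch of how Ceph does that folding, using its stable_mod scheme (the pg_num of 64 for this pool is an assumption chosen so the numbers match; the function names here are illustrative):

```python
def ceph_stable_mod(x: int, b: int, bmask: int) -> int:
    """Ceph's stable_mod: map a raw hash onto b PGs (b need not be a power of two)."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)

def raw_pg_to_pg(pool: int, raw_hash: int, pg_num: int) -> str:
    # smallest all-ones mask covering pg_num
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return f"{pool}.{ceph_stable_mod(raw_hash, pg_num, bmask):x}"

# 'lock_bug_object' hashed to raw pg 3.cd8d62a0; with an assumed pg_num
# of 64 that folds to pg 3.20, matching the osdmap output above.
print(raw_pg_to_pg(3, 0xcd8d62a0, 64))  # 3.20
```

This is only to make the repro legible: the same pg id (3.20) is what gets passed to ceph-objectstore-tool on both replicas.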
Debug logs uploaded: https://www.icloud.com/iclouddrive/0GxIuMZh2-9CU4EEmPvRMO3jQ#arch
#7 Updated by Neha Ojha over 2 years ago
- Status changed from Need More Info to New
#8 Updated by Neha Ojha over 2 years ago
Konstantin Shalygin wrote:
I reproduced this issue:
- put a rados object into the pool
- get the mapping for this object
[...]
- then remove this object from osd.8 and osd.11
[...]
- restart osd.8 and osd.11
- set a rados xattr or a rados lock
[...]
- the replicas crashed simultaneously
Debug logs uploaded: https://www.icloud.com/iclouddrive/0GxIuMZh2-9CU4EEmPvRMO3jQ#arch
Thanks, we'll take a look and see if we can find anything interesting.
#9 Updated by Josh Durgin over 2 years ago
Added logs to teuthology:/post/tracker_52513/
#10 Updated by Konstantin Shalygin over 1 year ago
- Target version set to v18.0.0
- Backport set to pacific quincy
- Severity changed from 3 - minor to 2 - major
#11 Updated by Radoslaw Zarzynski over 1 year ago
- Assignee set to Adam Kupczyk
Would you like to have a look? Neha confirms the logs mentioned in #9 are still available.
#12 Updated by Konstantin Shalygin 12 months ago
- Backport changed from pacific quincy to pacific quincy reef