Bug #52513
BlueStore.cc: 12391: ceph_abort_msg("unexpected error") on operation 15
Description
Two OSDs serving PG 17.7ff [684,768,760] crashed simultaneously:
RECENT_CRASH 2 daemons have recently crashed
    osd.760 crashed on host meta114 at 2021-09-03 21:50:28.138745Z
    osd.768 crashed on host meta115 at 2021-09-03 21:50:28.123223Z
It seems ENOENT is treated as an unexpected error when the object lock is acquired:
    -8> 2021-09-04 00:50:28.077 7f3299342700 -1 bluestore(/var/lib/ceph/osd/ceph-768) _txc_add_transaction error (2) No such file or directory not handled on operation 15 (op 0, counting from 0)
    -7> 2021-09-04 00:50:28.077 7f3299342700 -1 bluestore(/var/lib/ceph/osd/ceph-768) unexpected error code
    -6> 2021-09-04 00:50:28.077 7f3299342700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "setattrs",
            "collection": "17.7ff_head",
            "oid": "#17:ffffffff:::%2fv2%2fmeta%2fd732de8b-8b15-5b57-a54a-fc23aadce4fe%2f88e9261c-832b-5d13-9517-40015c81e84e%2f27%2f11033%2f11033693%2f556500fe714ab37.webp:head#",
            "attr_lens": {
                "_": 376,
                "_lock.libcephv2.lock": 153,
                "snapset": 35
            }
        }
    ]
}
    -5> 2021-09-04 00:50:28.117 7f3299342700 -1 /build/ceph-14.2.22/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f3299342700 time 2021-09-04 00:50:28.083637
/build/ceph-14.2.22/src/os/bluestore/BlueStore.cc: 12391: ceph_abort_msg("unexpected error")

 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x55980fe784c4]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xbde) [0x5598103dbaee]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2aa) [0x5598103e19fa]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55981011b514]
 5: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb09) [0x55981021c0a9]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1a7) [0x55981022a407]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55981012ee57]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x705) [0x5598100dd965]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bf) [0x55980fefbd8f]
 10: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5598101b5b22]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55980ff19835]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x5598105393ec]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55981053c5b0]
 14: (()+0x76db) [0x7f32bd6e66db]
 15: (clone()+0x3f) [0x7f32bc47d71f]
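The abort decision can be illustrated with a short sketch. This is a hedged Python paraphrase of the error-tolerance check in BlueStore::_txc_add_transaction (the op-code values match ObjectStore::Transaction; the function and set names here are hypothetical, and the real list of ops that do not forgive ENOENT is longer, covering the clone/setattr/omap families):

```python
import errno

# ObjectStore::Transaction op codes (subset)
OP_TOUCH = 9
OP_REMOVE = 13
OP_SETATTR = 14
OP_SETATTRS = 15  # "operation 15" in the crash log above

# Ops for which BlueStore does NOT forgive ENOENT, because they imply
# the object must already exist on this replica (abbreviated set).
ENOENT_NOT_OK = {OP_SETATTR, OP_SETATTRS}

def txc_error_is_fatal(r: int, op: int) -> bool:
    """Return True if the store would hit ceph_abort_msg("unexpected error")."""
    if r == 0:
        return False
    if r == -errno.ENOENT and op not in ENOENT_NOT_OK:
        return False   # -ENOENT is usually okay (e.g. removing a gone object)
    if r == -errno.ENODATA:
        return False   # a missing xattr is tolerated
    return True        # anything else aborts the OSD, as in this crash

# setattrs on an object missing from the replica -> fatal abort
print(txc_error_is_fatal(-errno.ENOENT, OP_SETATTRS))  # True
# removing an already-removed object -> tolerated
print(txc_error_is_fatal(-errno.ENOENT, OP_REMOVE))    # False
```

This is why a setattrs replicated to an OSD where the object is missing takes down the OSD rather than being retried or ignored.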
History
#1 Updated by Konstantin Shalygin over 2 years ago
- File tracker_52513.zip added
#2 Updated by Konstantin Shalygin over 2 years ago
The PG was actually inconsistent:
16872:2021-09-07 13:32:13.449625 osd.684 (osd.684) 102 : cluster [ERR] 17.7ff scrub : stat mismatch, got 4207028/4207029 objects, 0/0 clones, 4207028/4207029 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 1950026408/1950027022 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.
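Reading the "got X/Y" pairs in the scrub error above (observed vs. expected), the mismatch comes down to a single missing object:

```python
# (observed, expected) pairs taken from the scrub stat-mismatch line above
stats = {
    "objects": (4207028, 4207029),
    "dirty": (4207028, 4207029),
    "bytes": (1950026408, 1950027022),
}

# difference expected - observed, keeping only the categories that disagree
diffs = {k: e - g for k, (g, e) in stats.items() if g != e}
print(diffs)  # {'objects': 1, 'dirty': 1, 'bytes': 614}
```

So the PG is short exactly one object of 614 bytes, consistent with an object having been lost on a replica.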
#3 Updated by Neha Ojha over 2 years ago
- Project changed from bluestore to RADOS
#4 Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Can you capture a coredump or OSD logs with debug_osd=20, debug_bluestore=20, debug_ms=1 if this crash is reproducible?
#5 Updated by Konstantin Shalygin over 2 years ago
@Neha, I will try to reproduce this by removing objects from replicas
#6 Updated by Konstantin Shalygin over 2 years ago
I reproduced this issue:
- put a rados object into the pool
- get the mapping for this object:
osdmap e1555 pool 'meta.ru-1a' (3) object 'lock_bug_object' -> pg 3.cd8d62a0 (3.20) -> up ([2,8,11], p2) acting ([2,8,11], p2)
- then remove this object from osd.8 and osd.11:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 3.20 lock_bug_object remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --pgid 3.20 lock_bug_object remove
- restart osd.8 and osd.11
- set a rados xattr or a rados lock:
# set rados lock
rados lock get lock_bug_object my_lock --pool meta.ru-1a
# set xattr
rados setxattr lock_bug_object lock.libcephv2.lock 153 --pool meta.ru-1a
- the replicas crashed simultaneously
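The osdmap output above shows the raw object hash 3.cd8d62a0 folding to pg 3.20. A small sketch of how Ceph does that folding, using its stable_mod scheme (the pg_num of 64 for this pool is an assumption chosen so the numbers match; the function names here are illustrative):

```python
def ceph_stable_mod(x: int, b: int, bmask: int) -> int:
    """Ceph's stable_mod: map a raw hash onto b PGs (b need not be a power of two)."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)

def raw_pg_to_pg(pool: int, raw_hash: int, pg_num: int) -> str:
    # smallest all-ones mask covering pg_num
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return f"{pool}.{ceph_stable_mod(raw_hash, pg_num, bmask):x}"

# 'lock_bug_object' hashed to raw pg 3.cd8d62a0; with an assumed pg_num
# of 64 that folds to pg 3.20, matching the osdmap output above.
print(raw_pg_to_pg(3, 0xcd8d62a0, 64))  # 3.20
```

This is only to make the repro legible: the same pg id (3.20) is what gets passed to ceph-objectstore-tool on both replicas.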
Debug logs uploaded: https://www.icloud.com/iclouddrive/0GxIuMZh2-9CU4EEmPvRMO3jQ#arch
#7 Updated by Neha Ojha over 2 years ago
- Status changed from Need More Info to New
#8 Updated by Neha Ojha over 2 years ago
Konstantin Shalygin wrote:
I reproduced this issue:
- put a rados object into the pool
- get the mapping for this object
[...]
- then remove this object from osd.8 and osd.11
[...]
- restart osd.8 and osd.11
- set a rados xattr or a rados lock
[...]
- the replicas crashed simultaneously
Debug logs uploaded: https://www.icloud.com/iclouddrive/0GxIuMZh2-9CU4EEmPvRMO3jQ#arch
Thanks, we'll take a look and see if we can find anything interesting.
#9 Updated by Josh Durgin over 2 years ago
Added logs to teuthology:/post/tracker_52513/
#10 Updated by Konstantin Shalygin over 1 year ago
- Target version set to v18.0.0
- Backport set to pacific quincy
- Severity changed from 3 - minor to 2 - major
#11 Updated by Radoslaw Zarzynski over 1 year ago
- Assignee set to Adam Kupczyk
Would you like to have a look? Neha confirms the logs mentioned in #9 are still available.
#12 Updated by Konstantin Shalygin 12 months ago
- Backport changed from pacific quincy to pacific quincy reef