Project

General

Profile

Bug #52513

BlueStore.cc: 12391: ceph_abort_msg(\"unexpected error\") on operation 15

Added by Konstantin Shalygin about 1 year ago. Updated about 1 month ago.

Status:
New
Priority:
Normal
Assignee:
Category:
-
Target version:
% Done:

0%

Source:
Community (user)
Tags:
Backport:
pacific quincy
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We get crash of two simultaneously OSD's served 17.7ff [684,768,760]

RECENT_CRASH 2 daemons have recently crashed
    osd.760 crashed on host meta114 at 2021-09-03 21:50:28.138745Z
    osd.768 crashed on host meta115 at 2021-09-03 21:50:28.123223Z

Seems ENOFILE is unexpected, when object lock is acquired

    -8> 2021-09-04 00:50:28.077 7f3299342700 -1 bluestore(/var/lib/ceph/osd/ceph-768) _txc_add_transaction error (2) No such file or directory not handled on operation 15 (op 0, counting from 0)
    -7> 2021-09-04 00:50:28.077 7f3299342700 -1 bluestore(/var/lib/ceph/osd/ceph-768) unexpected error code
    -6> 2021-09-04 00:50:28.077 7f3299342700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "setattrs",
            "collection": "17.7ff_head",
            "oid": "#17:ffffffff:::%2fv2%2fmeta%2fd732de8b-8b15-5b57-a54a-fc23aadce4fe%2f88e9261c-832b-5d13-9517-40015c81e84e%2f27%2f11033%2f11033693%2f556500fe714ab37.webp:head#",
            "attr_lens": {
                "_": 376,
                "_lock.libcephv2.lock": 153,
                "snapset": 35
            }
        }
    ]
}

    -5> 2021-09-04 00:50:28.117 7f3299342700 -1 /build/ceph-14.2.22/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7f3299342700 time 2021-09-04 00:50:28.083637
/build/ceph-14.2.22/src/os/bluestore/BlueStore.cc: 12391: ceph_abort_msg("unexpected error")

 ceph version 14.2.22 (ca74598065096e6fcbd8433c8779a2be0c889351) nautilus (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xdf) [0x55980fe784c4]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0xbde) [0x5598103dbaee]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2aa) [0x5598103e19fa]
 4: (non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x54) [0x55981011b514]
 5: (ReplicatedBackend::do_repop(boost::intrusive_ptr<OpRequest>)+0xb09) [0x55981021c0a9]
 6: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x1a7) [0x55981022a407]
 7: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x97) [0x55981012ee57]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x705) [0x5598100dd965]
 9: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1bf) [0x55980fefbd8f]
 10: (PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x62) [0x5598101b5b22]
 11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xbf5) [0x55980ff19835]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x4ac) [0x5598105393ec]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55981053c5b0]
 14: (()+0x76db) [0x7f32bd6e66db]
 15: (clone()+0x3f) [0x7f32bc47d71f]

tracker_52513.zip - osd logs (267 KB) Konstantin Shalygin, 09/06/2021 09:21 AM

History

#2 Updated by Konstantin Shalygin about 1 year ago

PG was actually inconsistent

16872:2021-09-07 13:32:13.449625 osd.684 (osd.684) 102 : cluster [ERR] 17.7ff scrub : stat mismatch, got 4207028/4207029 objects, 0/0 clones, 4207028/4207029 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 1950026408/1950027022 bytes, 0/0 manifest objects, 0/0 hit_set_archive bytes.

#3 Updated by Neha Ojha 12 months ago

  • Project changed from bluestore to RADOS

#4 Updated by Neha Ojha 12 months ago

  • Status changed from New to Need More Info

Can you capture a coredump or osd logs with debug_osd=20,debug_bluestore=20,debug_ms=1 if this crash is reproducible?

#5 Updated by Konstantin Shalygin 12 months ago

@Neha, I will try to reproduce this via removing objects from replicas

#6 Updated by Konstantin Shalygin 12 months ago

I was reproduced this issue:

  1. put rados object to pool
  2. get mapping for this object
    osdmap e1555 pool 'meta.ru-1a' (3) object 'lock_bug_object' -> pg 3.cd8d62a0 (3.20) -> up ([2,8,11], p2) acting ([2,8,11], p2)
    
  3. then remove this object from 8,11
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 3.20 lock_bug_object remove
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --pgid 3.20 lock_bug_object remove
    
  4. restart 8,11
  5. set rados xattr or rados lock
    # set rados lock
    rados lock get lock_bug_object my_lock --pool meta.ru-1a
    # set xattr
    rados setxattr lock_bug_object lock.libcephv2.lock 153 --pool meta.ru-1a
    
  6. replicas crashed simultaneously

Debug logs uploaded: https://www.icloud.com/iclouddrive/0GxIuMZh2-9CU4EEmPvRMO3jQ#arch

#7 Updated by Neha Ojha 11 months ago

  • Status changed from Need More Info to New

#8 Updated by Neha Ojha 11 months ago

Konstantin Shalygin wrote:

I was reproduced this issue:

  1. put rados object to pool
  2. get mapping for this object
    [...]
  3. then remove this object from 8,11
    [...]
  4. restart 8,11
  5. set rados xattr or rados lock
    [...]
  6. replicas crashed simultaneously

Debug logs uploaded: https://www.icloud.com/iclouddrive/0GxIuMZh2-9CU4EEmPvRMO3jQ#arch

Thanks, we'll take a look and see if we can find anything interesting.

#9 Updated by Josh Durgin 11 months ago

Added logs to teuthology:/post/tracker_52513/

#10 Updated by Konstantin Shalygin about 2 months ago

  • Target version set to v18.0.0
  • Backport set to pacific quincy
  • Severity changed from 3 - minor to 2 - major

#11 Updated by Radoslaw Zarzynski about 1 month ago

  • Assignee set to Adam Kupczyk

Would you like to have a look? Neha confirms the logs mentioned in #9 are still available.

Also available in: Atom PDF