
Bug #43267

unexpected error in BlueStore::_txc_add_transaction

Added by Jeff Layton about 4 years ago. Updated about 4 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I was testing kcephfs against a vstart cluster and the OSD crashed. fsstress was running at the time, so the cluster was under steady load. Here's the stack trace from the log:

    -2> 2019-12-11T16:13:44.467-0500 7fcbe4ecd700  0 _dump_transaction transaction dump:
{
    "ops": [
        {
            "op_num": 0,
            "op_name": "create",
            "collection": "2.38_head",
            "oid": "#2:1dfec08d:::1000018ce47.00000003:head#" 
        },
        {
            "op_num": 1,
            "op_name": "setattrs",
            "collection": "2.38_head",
            "oid": "#2:1dfec08d:::1000018ce47.00000003:head#",
            "attr_lens": {
                "_": 275,
                "snapset": 35
            }
        },
        {
            "op_num": 2,
            "op_name": "write",
            "collection": "2.38_head",
            "oid": "#2:1dfec08d:::1000018ce47.00000003:head#",
            "length": 2646016,
            "offset": 0,
            "bufferlist length": 2646016
        },
        {
            "op_num": 3,
            "op_name": "omap_setkeys",
            "collection": "2.38_head",
            "oid": "#2:1c000000::::head#",
            "attr_lens": {
                "0000000067.00000000000000279027": 200,
                "_fastinfo": 186
            }
        }
    ]
}

    -1> 2019-12-11T16:13:44.493-0500 7fcbe4ecd700 -1 /home/jlayton/git/ceph/src/os/bluestore/BlueStore.cc: In function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)' thread 7fcbe4ecd700 time 2019-12-11T16:13:44.468445-0500
/home/jlayton/git/ceph/src/os/bluestore/BlueStore.cc: 12661: ceph_abort_msg("unexpected error")

 ceph version 15.0.0-8262-g5f2781461f33 (5f2781461f33a07dfe0070a2c16d68780da6371a) octopus (dev)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xf6) [0x559b1be65cca]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x217e) [0x559b1bc09934]
 3: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2fb) [0x559b1bc06d11]
 4: (PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x8c) [0x559b1b694ccc]
 5: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x610) [0x559b1b9ee834]
 6: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x54e) [0x559b1b644eb6]
 7: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1cc8) [0x559b1b603bec]
 8: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x4efb) [0x559b1b5f3e9b]
 9: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xf34) [0x559b1b5ee15c]
 10: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x38e) [0x559b1b40f308]
 11: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6e) [0x559b1b87fb48]
 12: (ceph::osd::scheduler::OpSchedulerItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4b) [0x559b1b439e37]
 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x3629) [0x559b1b41c379]
 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x59c) [0x559b1be51a8e]
 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x25) [0x559b1be5346b]
 16: (Thread::entry_wrapper()+0x78) [0x559b1be3e244]
 17: (Thread::_entry_func(void*)+0x18) [0x559b1be3e1c2]
 18: (()+0x94e2) [0x7fcc035a64e2]
 19: (clone()+0x43) [0x7fcc03112693]

     0> 2019-12-11T16:13:44.521-0500 7fcbe4ecd700 -1 *** Caught signal (Aborted) **
 in thread 7fcbe4ecd700 thread_name:tp_osd_tp

 ceph version 15.0.0-8262-g5f2781461f33 (5f2781461f33a07dfe0070a2c16d68780da6371a) octopus (dev)
 1: (()+0x2ddffb4) [0x559b1bdacfb4]
 2: (()+0x14b20) [0x7fcc035b1b20]
 3: (gsignal()+0x145) [0x7fcc0304d625]
 4: (abort()+0x12b) [0x7fcc030368d9]
 5: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x37a) [0x559b1be65f4e]
 6: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x217e) [0x559b1bc09934]
 7: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x2fb) [0x559b1bc06d11]
 8: (PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)+0x8c) [0x559b1b694ccc]
 9: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x610) [0x559b1b9ee834]
 10: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x54e) [0x559b1b644eb6]
 11: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0x1cc8) [0x559b1b603bec]
 12: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x4efb) [0x559b1b5f3e9b]
 13: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0xf34) [0x559b1b5ee15c]
 14: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x38e) [0x559b1b40f308]
 15: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6e) [0x559b1b87fb48]
 16: (ceph::osd::scheduler::OpSchedulerItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4b) [0x559b1b439e37]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x3629) [0x559b1b41c379]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x59c) [0x559b1be51a8e]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x25) [0x559b1be5346b]
 20: (Thread::entry_wrapper()+0x78) [0x559b1be3e244]
 21: (Thread::_entry_func(void*)+0x18) [0x559b1be3e1c2]
 22: (()+0x94e2) [0x7fcc035a64e2]
 23: (clone()+0x43) [0x7fcc03112693]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

osd.0.log.xz - OSD log (153 KB) Jeff Layton, 12/11/2019 09:38 PM

History

#1 Updated by Igor Fedotov about 4 years ago

This is simply an out-of-space condition, see:
-6> 2019-12-11T16:13:44.466-0500 7fcbe4ecd700 -1 bluestore(/build/ceph/dev/osd0) _do_alloc_write failed to allocate 0x286000 allocated 0x 12a000 min_alloc_size 0x1000 available 0x 0
-5> 2019-12-11T16:13:44.466-0500 7fcbe4ecd700 -1 bluestore(/build/ceph/dev/osd0) _do_write _do_alloc_write failed with (28) No space left on device
-4> 2019-12-11T16:13:44.467-0500 7fcbe4ecd700 -1 bluestore(/build/ceph/dev/osd0) _txc_add_transaction error (28) No space left on device not handled on operation 10 (op 2, counting from 0)
-3> 2019-12-11T16:13:44.467-0500 7fcbe4ecd700 -1 bluestore(/build/ceph/dev/osd0) ENOSPC from bluestore, misconfigured cluster

The OSD is configured with a 20 GB main device and 2x1 GB for DB/WAL, which seems quite small for this test case:
2019-12-11T13:10:44.715-0500 7fcc02f44d80 1 bdev(0x559b26e14000 /build/ceph/dev/osd0/block) open path /build/ceph/dev/osd0/block
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bdev(0x559b26e14000 /build/ceph/dev/osd0/block) open size 21474836480 (0x500000000, 20 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bluestore(/build/ceph/dev/osd0) _set_cache_sizes cache_size 3221225472 meta 0.4 kv 0.4 data 0.2
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bdev create path /build/ceph/dev/osd0/block.db type kernel
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bdev(0x559b26e14480 /build/ceph/dev/osd0/block.db) open path /build/ceph/dev/osd0/block.db
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bdev(0x559b26e14480 /build/ceph/dev/osd0/block.db) open size 1073741824 (0x40000000, 1 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bluefs add_block_device bdev 1 path /build/ceph/dev/osd0/block.db size 1 GiB
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bdev create path /build/ceph/dev/osd0/block type kernel
2019-12-11T13:10:44.716-0500 7fcc02f44d80 1 bdev(0x559b26e14900 /build/ceph/dev/osd0/block) open path /build/ceph/dev/osd0/block
2019-12-11T13:10:44.734-0500 7fcc02f44d80 1 bdev(0x559b26e14900 /build/ceph/dev/osd0/block) open size 21474836480 (0x500000000, 20 GiB) block_size 4096 (4 KiB) non-rotational discard supported
2019-12-11T13:10:44.734-0500 7fcc02f44d80 1 bluefs add_block_device bdev 2 path /build/ceph/dev/osd0/block size 20 GiB
2019-12-11T13:10:44.734-0500 7fcc02f44d80 1 bdev create path /build/ceph/dev/osd0/block.wal type kernel
2019-12-11T13:10:44.734-0500 7fcc02f44d80 1 bdev(0x559b26e14d80 /build/ceph/dev/osd0/block.wal) open path /build/ceph/dev/osd0/block.wal
2019-12-11T13:10:44.735-0500 7fcc02f44d80 1 bdev(0x559b26e14d80 /build/ceph/dev/osd0/block.wal) open size 1048576000 (0x3e800000, 1000 MiB) block_size 4096 (4 KiB) non-rotational discard supported
2019-12-11T13:10:44.735-0500 7fcc02f44d80 1 bluefs add_block_device bdev 0 path /build/ceph/dev/osd0/block.wal size 1000 MiB
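For future runs with a heavier workload, the vstart OSDs could be created with a larger BlueStore main device. A minimal sketch, assuming vstart.sh's `-o` option for injecting extra ceph.conf settings; the 100 GiB size is a hypothetical example, not a recommendation:

```shell
# Sketch: recreate the vstart cluster with a larger BlueStore main device.
# Assumes vstart.sh accepts extra ceph.conf options via -o (example size only).
# 100 GiB expressed in bytes for bluestore_block_size:
BLOCK_SIZE=$((100 * 1024 * 1024 * 1024))
echo "bluestore_block_size = ${BLOCK_SIZE}"
# From the ceph build directory (not run here; needs a ceph source tree):
#   MON=1 OSD=1 MDS=1 ../src/vstart.sh -n -x \
#       -o "bluestore_block_size = ${BLOCK_SIZE}"
```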

#2 Updated by David Zafman about 4 years ago

This is caused by an out-of-space condition that won't usually happen. Check your BlueStore configuration.

Is your data out of balance? Check:
ceph osd df

The other possibility is that you have raised your full ratios too high, or you have a very small cluster and were able to fill it up before the cluster had a chance to react.

Won't backfill above this amount (default 90%)
osd set-backfillfull-ratio <ratio:float> set usage ratio at which OSDs are marked too full to backfill

Stop accepting client data above this amount (default 95%)
osd set-full-ratio <ratio:float> set usage ratio at which OSDs are marked full

Get warnings before things get critical (default 85%)
osd set-nearfull-ratio <ratio:float>

The config option "osd_failsafe_full_ratio" is the ratio above which the OSD will deliberately crash (default 97%).
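Putting the commands above together, a sketch of inspecting and setting the thresholds on a throwaway test cluster (example ratios only; these require a running cluster, and raising them on a production cluster is risky):

```shell
# Inspect the current thresholds recorded in the OSD map.
ceph osd dump | grep -E 'full_ratio'

# Example values for a small throwaway test cluster only.
ceph osd set-nearfull-ratio 0.85       # warn at 85% usage
ceph osd set-backfillfull-ratio 0.90   # refuse backfill above 90%
ceph osd set-full-ratio 0.95           # stop client writes above 95%
```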

#3 Updated by Jeff Layton about 4 years ago

Nope, it was full. Well spotted:

ID CLASS WEIGHT  REWEIGHT SIZE   RAW USE DATA   OMAP   META    AVAIL  %USE  VAR  PGS STATUS 
 0   ssd 0.02049  1.00000 21 GiB  21 GiB 19 GiB 28 MiB 996 MiB 17 MiB 99.92 1.00  80     up 
                    TOTAL 21 GiB  21 GiB 19 GiB 28 MiB 996 MiB 17 MiB 99.92                 
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

I'll increase the size for my next run.

That said, is crashing really the best way to handle a full OSD? That seems quite unforgiving. I wonder if it would be possible to have it just stop allowing writes and switch to read-only mode?

EDIT: maybe it's best to just kick it out after all

#4 Updated by Jeff Layton about 4 years ago

  • Status changed from New to Rejected
