Bug #53584


FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

Added by 玮文 胡 over 2 years ago. Updated over 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Backfill/Recovery
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): BlueStore, OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

# ceph crash info 2021-12-12T08:09:48.682272Z_d2564665-8c3a-4a94-b425-05281a6f7956
{
    "assert_condition": "pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to)",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc",
    "assert_func": "void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)",
    "assert_line": 670,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fe90d074700 time 2021-12-12T08:09:48.636155+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc: 670: FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fe930596b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x55f45389b59d]",
        "/usr/bin/ceph-osd(+0x56a766) [0x55f45389b766]",
        "(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1b9e) [0x55f453d607ae]",
        "(ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::v15_2_0::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::v15_2_0::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >, RecoveryMessages*)+0x855) [0x55f453d612d5]",
        "(OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x71) [0x55f453d84e91]",
        "(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8f) [0x55f453d53faf]",
        "(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x55f453d6d106]",
        "(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f) [0x55f453d6dbdf]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x55f453b73d12]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x55f453b16d6e]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55f4539a01b9]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55f453bfd868]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x55f4539c01e8]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55f45402b6c4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55f45402e364]",
        "/lib64/libpthread.so.0(+0x814a) [0x7fe93058c14a]",
        "clone()" 
    ],
    "ceph_version": "16.2.6",
    "crash_id": "2021-12-12T08:09:48.682272Z_d2564665-8c3a-4a94-b425-05281a6f7956",
    "entity_name": "osd.16",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "e787f935c8fa491a3b1b5ea5f71cb0958e8c68386adbe75fce0c11fdf3eba84c",
    "timestamp": "2021-12-12T08:09:48.682272Z",
    "utsname_hostname": "gpu014",
    "utsname_machine": "x86_64",
    "utsname_release": "5.8.0-59-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021" 
}

We have one malfunctioning disk that is producing a lot of read errors. This crash started after we marked that OSD out and rebalancing began. Multiple OSDs keep crashing with the same backtrace.

There is a warning in the log immediately before each crash:

log_channel(cluster) log [WRN] : Error(s) ignored for 19:5a01dfb3:::20007abda0b.0000003d:head enough copies available

Pool 19 is an EC pool for CephFS:

pool 19 'cephfs.cephfs.data_ec' erasure profile clay_profile size 10 min_size 9 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 45504 flags hashpspool,ec_overwrites stripe_width 32768 application cephfs.
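
For context, here is a rough sketch (not Ceph source, just an illustration) of the offset arithmetic the failed assertion checks. It assumes the usual relation stripe_width = k * chunk_size, and k=8 data chunks for this size-10 profile (the clay_profile parameters are not shown above, so k=8/m=2 is an assumption):

#include <cassert>
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t stripe_width = 32768;          // from the pool: stripe_width 32768
    const std::uint64_t k = 8;                         // assumed data-chunk count (size 10 = k + m)
    const std::uint64_t chunk_size = stripe_width / k; // 4096 bytes per shard per stripe

    // Recovery advances in whole stripes of logical object data...
    std::uint64_t logical_recovered = 4 * stripe_width;
    // ...so the per-shard bytes expected for that window are logical / k
    // (what sinfo.aligned_logical_offset_to_chunk_offset() yields for stripe-aligned offsets).
    std::uint64_t expected_shard_bytes = logical_recovered / k;
    assert(expected_shard_bytes == 4 * chunk_size);

    // The ceph_assert in continue_recovery_op() fires when the data actually read back
    // for a shard (pop.data.length()) is shorter or longer than this expectation,
    // e.g. after a short read from a failing disk.
    std::cout << "expected per-shard bytes: " << expected_shard_bytes << std::endl;
    return 0;
}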

I've set norebalance and the cluster is now stable. The rebalance triggered by marking the faulty OSD out is already about half done. Is there any workaround we can try so that the rebalance can proceed?
