Bug #53584


FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

Added by 玮文 胡 over 2 years ago. Updated over 2 years ago.

Status: Need More Info
Priority: Normal
Assignee: -
Category: Backfill/Recovery
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression: No
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS): BlueStore, OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

# ceph crash info 2021-12-12T08:09:48.682272Z_d2564665-8c3a-4a94-b425-05281a6f7956
{
    "assert_condition": "pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to)",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc",
    "assert_func": "void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)",
    "assert_line": 670,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fe90d074700 time 2021-12-12T08:09:48.636155+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECBackend.cc: 670: FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12b20) [0x7fe930596b20]",
        "gsignal()",
        "abort()",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x55f45389b59d]",
        "/usr/bin/ceph-osd(+0x56a766) [0x55f45389b766]",
        "(ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)+0x1b9e) [0x55f453d607ae]",
        "(ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::v15_2_0::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::v15_2_0::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >, RecoveryMessages*)+0x855) [0x55f453d612d5]",
        "(OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x71) [0x55f453d84e91]",
        "(ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8f) [0x55f453d53faf]",
        "(ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x55f453d6d106]",
        "(ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f) [0x55f453d6dbdf]",
        "(PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x55f453b73d12]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x55f453b16d6e]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x55f4539a01b9]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x55f453bfd868]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x55f4539c01e8]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55f45402b6c4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55f45402e364]",
        "/lib64/libpthread.so.0(+0x814a) [0x7fe93058c14a]",
        "clone()" 
    ],
    "ceph_version": "16.2.6",
    "crash_id": "2021-12-12T08:09:48.682272Z_d2564665-8c3a-4a94-b425-05281a6f7956",
    "entity_name": "osd.16",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "e787f935c8fa491a3b1b5ea5f71cb0958e8c68386adbe75fce0c11fdf3eba84c",
    "timestamp": "2021-12-12T08:09:48.682272Z",
    "utsname_hostname": "gpu014",
    "utsname_machine": "x86_64",
    "utsname_release": "5.8.0-59-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#66~20.04.1-Ubuntu SMP Thu Jun 17 11:14:10 UTC 2021" 
}

We have one malfunctioning disk that is producing a lot of read errors. This crash started after we marked that OSD out and rebalancing began. Multiple OSDs keep crashing with the same backtrace.

There is a warning in the log immediately before each crash:

log_channel(cluster) log [WRN] : Error(s) ignored for 19:5a01dfb3:::20007abda0b.0000003d:head enough copies available

Pool 19 is an EC pool for CephFS:

pool 19 'cephfs.cephfs.data_ec' erasure profile clay_profile size 10 min_size 9 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 45504 flags hashpspool,ec_overwrites stripe_width 32768 application cephfs.
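
For context, here is a rough sketch (not Ceph source, just an illustration) of the offset arithmetic the failed assertion checks. It assumes the usual relation stripe_width = k * chunk_size, and k=8 data chunks for this size-10 profile (the clay_profile parameters are not shown above, so k=8/m=2 is an assumption):

#include <cassert>
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t stripe_width = 32768;          // from the pool: stripe_width 32768
    const std::uint64_t k = 8;                         // assumed data-chunk count (size 10 = k + m)
    const std::uint64_t chunk_size = stripe_width / k; // 4096 bytes per shard per stripe

    // Recovery advances in whole stripes of logical object data...
    std::uint64_t logical_recovered = 4 * stripe_width;
    // ...so the per-shard bytes expected for that window are logical / k
    // (what sinfo.aligned_logical_offset_to_chunk_offset() yields for stripe-aligned offsets).
    std::uint64_t expected_shard_bytes = logical_recovered / k;
    assert(expected_shard_bytes == 4 * chunk_size);

    // The ceph_assert in continue_recovery_op() fires when the data actually read back
    // for a shard (pop.data.length()) is shorter or longer than this expectation,
    // e.g. after a short read from a failing disk.
    std::cout << "expected per-shard bytes: " << expected_shard_bytes << std::endl;
    return 0;
}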

I've set norebalance and the cluster is now stable. The rebalance triggered by marking the faulty OSD out is already about half done. Is there any workaround we can try so that the rebalance can proceed?
