Bug #51024
OSD - FAILED ceph_assert(clone_size.count(clone)), keeps on restarting after one host reboot
Description
Good day
I'm currently experiencing the same issue described in this mailing-list thread: https://www.mail-archive.com/ceph-users@ceph.io/msg02860.html
My cluster:
- v15.2.10
- ceph-ansible
- Docker containers
- Ubuntu 18.04 LTS
- EC 8+2 pools
- max backfill = 1, max recovery = 1, recovery op = 1
- Host failure domain with 21 hosts in the cluster (24x 16 TiB per host, or 19x 16 TiB + 5x SSD)
- 50/100 GbE split network between hosts and 100 GbE Mellanox switches
I set the cluster into "maintenance mode" (noout, norebalance, nobackfill, norecover) and then proceeded to reboot a host; it came back online and everything was fine. I then restarted a second host; it came back online and it was fine. I restarted a third host, and when it came back online, I had 1 unfound object.
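For reference, the "maintenance mode" described above amounts to roughly the following (a sketch of the standard flag commands; the exact order is not significant):

```shell
# Enter "maintenance mode": suppress marking-out, rebalancing and recovery
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

# ... reboot the host, wait for its OSDs to rejoin ...

# Leave maintenance mode again
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout
```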
Querying the PG showed 2 OSDs missing out of the 10 (K=8 + M=2). Ceph seemed to be actively refusing to peer with the 2 missing OSDs. If it did, everything would be fine; no disks were lost or anything, it was merely a single host reboot.
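The unfound object and missing OSDs can be inspected like this (a sketch; "2.1a" is a hypothetical placeholder for the actual PG id):

```shell
# List unfound objects and the PGs involved
ceph health detail

# Inspect the PG's peering state; "might_have_unfound" lists the
# OSDs Ceph thinks could still hold the missing object
ceph pg 2.1a query

# Show the unfound object(s) themselves
ceph pg 2.1a list_unfound
```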
Repairing the PG or deep-scrubbing all related OSDs didn't do anything. I gave it at least 24-36 hours to do something; nothing happened.
Being an EC pool, I marked the unfound object as deleted.
Now the cluster started to recover.
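Marking the unfound object lost is done with the standard command (sketch; the PG id "2.1a" is a hypothetical placeholder):

```shell
# Give up on the unfound object so recovery can proceed.
# "delete" forgets the object entirely; "revert" would instead roll
# back to a previous version where one exists.
ceph pg 2.1a mark_unfound_lost delete
```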
Minutes later, the acting primary OSD of the PG in question started restarting every 50 seconds. Every 30 minutes or so it would stay up for a couple of minutes, then go back to the 50-second restart cycle. During that time, the PG's 8 OSDs + 2 missing would turn into 7 OSDs + 3 missing.
(I've attached the full log)
The crash references "FAILED ceph_assert(clone_size.count(clone))".
ceph crash info of osd.164:
{
    "assert_condition": "clone_size.count(clone)",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.10/rpm/el8/BUILD/ceph-15.2.10/src/osd/osd_types.cc",
    "assert_func": "uint64_t SnapSet::get_clone_bytes(snapid_t) const",
    "assert_line": 5698,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.10/rpm/el8/BUILD/ceph-15.2.10/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f153298a700 time 2021-05-30T19:42:17.937273+0200\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.10/rpm/el8/BUILD/ceph-15.2.10/src/osd/osd_types.cc: 5698: FAILED ceph_assert(clone_size.count(clone))\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "(()+0x12b20) [0x7f15541b4b20]",
        "(gsignal()+0x10f) [0x7f1552e1c7ff]",
        "(abort()+0x127) [0x7f1552e06c35]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x5638304a3dfb]",
        "(()+0x506fc4) [0x5638304a3fc4]",
        "(SnapSet::get_clone_bytes(snapid_t) const+0xe4) [0x563830790774]",
        "(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x56383069de07]",
        "(PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1aa1) [0x563830703291]",
        "(PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x10f2) [0x563830707e32]",
        "(OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2f5) [0x5638305866b5]",
        "(ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x5638307e309d]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x5638305a427f]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563830be2a64]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563830be56c4]",
        "(()+0x814a) [0x7f15541aa14a]",
        "(clone()+0x43) [0x7f1552ee1f23]"
    ],
    "ceph_version": "15.2.10",
    "crash_id": "2021-05-30T17:42:17.950008Z_c400ec0c-462f-4cb6-936a-90c25fc75938",
    "entity_name": "osd.164",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "32e3356eb5699c8584c185bc2717179272cdb72d805e74c425a44ac00c4af8b8",
    "timestamp": "2021-05-30T17:42:17.950008Z",
    "utsname_hostname": "B-06-03-cephosd",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-62-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#70~18.04.1-Ubuntu SMP Tue Jan 12 17:18:00 UTC 2021"
}
In the mailing-list post, the recommendation was to reweight the main active OSD to 0 to try to mitigate the issue. The OSD keeps restarting, but in the few seconds it is online, the cluster gets close to healthy and backfilling starts, only to lose the OSD again.
I left osd.164 at "reweight 0" overnight and it is currently at 20% completed (2 days remaining); at the time of writing, the cluster shows HEALTH_OK. However, this came at a cost: since the issue happened on my "volumes_data" pool, all my VMs were basically offline during the whole ordeal. I was under the impression that an 8+2 pool (min_size 8) would be able to tolerate one host failure, and in this event it was only a host reboot, with no disk failures.
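The mitigation from the mailing-list thread amounts to draining the crashing primary (a sketch; osd.164 is the affected OSD in this report):

```shell
# Set the reweight to 0 so data backfills away from the OSD
ceph osd reweight 164 0

# Watch the drain and overall recovery progress
ceph -s
ceph osd df | grep '^164'
```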
Files
Updated by Jeremi A almost 3 years ago
I forgot to add.
I pulled v15.2.12 on the affected host and also tried running the OSD on that version. It didn't make a difference.
Updated by Jeremi A almost 3 years ago
https://tracker.ceph.com/issues/48060 is the same issue.
Updated by Dan van der Ster almost 3 years ago
Could be related to https://github.com/ceph/ceph/pull/40572
Updated by Dan van der Ster almost 3 years ago
> I set the cluster into "maintenance mode", noout, norebalance, nobackfill, norecover. And then proceeded to reboot a host, it came back online and everything was fine. I then restarted a second host, it came back online and it was fine. I restarted a third host, and when it came back online, I had 1 unfound object.
It is better to set only noout when rebooting a host. If you set 'norecover' and 'nobackfill', then the PGs will not be allowed to recover writes after the rebooted host is back online.
So did you in fact unset norecover, so that the PGs could all be active+clean before you proceeded to reboot the second host?
If you did not, then indeed your PGs probably dropped below min_size and perhaps https://github.com/ceph/ceph/pull/40572 became relevant (allowing writes below min_size in some cases).
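In other words, a safer per-host reboot procedure looks roughly like this (a sketch of the advice above, not an official runbook):

```shell
# Only noout: OSDs are not marked out during the reboot, but
# recovery/backfill remain enabled once the host is back
ceph osd set noout

# ... reboot the host, wait for its OSDs to rejoin ...

# Wait until all PGs report active+clean before touching the next host
ceph pg stat

# Clear the flag when the last host is done
ceph osd unset noout
```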
Updated by Neha Ojha almost 3 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
Updated by Neha Ojha almost 3 years ago
- Related to Bug #48060: data loss in EC pool added