Bug #51024
OSD - FAILED ceph_assert(clone_size.count(clone)), keeps on restarting after one host reboot
Description
Good day
I'm currently experiencing the same issue described in this mailing-list thread: https://www.mail-archive.com/ceph-users@ceph.io/msg02860.html
My cluster:
- v15.2.10
- ceph-ansible
- Docker containers
- Ubuntu 18.04 LTS
- EC 8+2 pools
- max backfill = 1, max recovery = 1, recovery op = 1
- Host failure domain with 21 hosts in the cluster (24x 16 TiB per host, or 19x 16 TiB + 5x SSD)
- 50/100 GbE split network between hosts and 100 GbE Mellanox switches
I set the cluster into "maintenance mode" (noout, norebalance, nobackfill, norecover) and then proceeded to reboot a host; it came back online and everything was fine. I then restarted a second host; it came back online and it was fine. I restarted a third host, and when it came back online, I had 1 unfound object.
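For reference, the "maintenance mode" described above amounts to roughly the following (a sketch of the standard flag commands; the exact order is not significant):

```shell
# Enter "maintenance mode": suppress marking-out, rebalancing and recovery
ceph osd set noout
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover

# ... reboot the host, wait for its OSDs to rejoin ...

# Leave maintenance mode again
ceph osd unset norecover
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout
```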
Querying the PG showed 2 OSDs missing out of the 10 (K=8 + M=2). Ceph seemed to be actively refusing to peer with the 2 missing OSDs. If it did, everything would be fine; no disks were lost or anything, it was merely a single host reboot.
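The unfound object and missing OSDs can be inspected like this (a sketch; "2.1a" is a hypothetical placeholder for the actual PG id):

```shell
# List unfound objects and the PGs involved
ceph health detail

# Inspect the PG's peering state; "might_have_unfound" lists the
# OSDs Ceph thinks could still hold the missing object
ceph pg 2.1a query

# Show the unfound object(s) themselves
ceph pg 2.1a list_unfound
```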
Repairing the PG or deep-scrubbing all related OSDs didn't do anything. I gave it at least 24-36 hours to do something; nothing happened.
Being an EC pool, I marked the unfound object as deleted.
Now the cluster started to recover.
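Marking the unfound object lost is done with the standard command (sketch; the PG id "2.1a" is a hypothetical placeholder):

```shell
# Give up on the unfound object so recovery can proceed.
# "delete" forgets the object entirely; "revert" would instead roll
# back to a previous version where one exists.
ceph pg 2.1a mark_unfound_lost delete
```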
Minutes later, the acting primary OSD of the PG in question started restarting every 50 seconds. Every 30 minutes or so it would stay up for a couple of minutes, then go back to the 50-second restart cycle. During that time, the PG's 8 OSDs + 2 missing would turn into 7 OSDs + 3 missing.
(I've attached the full log)
The crash references "FAILED ceph_assert(clone_size.count(clone))".
ceph crash info of osd.164:
{
    "assert_condition": "clone_size.count(clone)",
    "assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.10/rpm/el8/BUILD/ceph-15.2.10/src/osd/osd_types.cc",
    "assert_func": "uint64_t SnapSet::get_clone_bytes(snapid_t) const",
    "assert_line": 5698,
    "assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.10/rpm/el8/BUILD/ceph-15.2.10/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f153298a700 time 2021-05-30T19:42:17.937273+0200\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.10/rpm/el8/BUILD/ceph-15.2.10/src/osd/osd_types.cc: 5698: FAILED ceph_assert(clone_size.count(clone))\n",
    "assert_thread_name": "tp_osd_tp",
    "backtrace": [
        "(()+0x12b20) [0x7f15541b4b20]",
        "(gsignal()+0x10f) [0x7f1552e1c7ff]",
        "(abort()+0x127) [0x7f1552e06c35]",
        "(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1a9) [0x5638304a3dfb]",
        "(()+0x506fc4) [0x5638304a3fc4]",
        "(SnapSet::get_clone_bytes(snapid_t) const+0xe4) [0x563830790774]",
        "(PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x297) [0x56383069de07]",
        "(PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1aa1) [0x563830703291]",
        "(PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x10f2) [0x563830707e32]",
        "(OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x2f5) [0x5638305866b5]",
        "(ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x5638307e309d]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x5638305a427f]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x563830be2a64]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x563830be56c4]",
        "(()+0x814a) [0x7f15541aa14a]",
        "(clone()+0x43) [0x7f1552ee1f23]"
    ],
    "ceph_version": "15.2.10",
    "crash_id": "2021-05-30T17:42:17.950008Z_c400ec0c-462f-4cb6-936a-90c25fc75938",
    "entity_name": "osd.164",
    "os_id": "centos",
    "os_name": "CentOS Linux",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "32e3356eb5699c8584c185bc2717179272cdb72d805e74c425a44ac00c4af8b8",
    "timestamp": "2021-05-30T17:42:17.950008Z",
    "utsname_hostname": "B-06-03-cephosd",
    "utsname_machine": "x86_64",
    "utsname_release": "5.4.0-62-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#70~18.04.1-Ubuntu SMP Tue Jan 12 17:18:00 UTC 2021"
}
In the mailing-list post, the recommendation was to reweight the main active OSD to 0 to try to mitigate the issue. The OSD keeps restarting, but in the few seconds it is online, the cluster gets close to healthy and backfilling starts, only to lose the OSD again.
I left osd.164 at "reweight 0" overnight and it is currently at 20% completed (2 days remaining); at the time of writing, the cluster shows HEALTH_OK. However, this came at a cost: since the issue happened on my "volumes_data" pool, all my VMs were basically offline during the whole ordeal. I was under the impression that an 8+2 pool (min_size 8) would be able to tolerate one host failure, and in this event it was only a host reboot, with no disk failures.
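The mitigation from the mailing-list thread amounts to draining the crashing primary (a sketch; osd.164 is the affected OSD in this report):

```shell
# Set the reweight to 0 so data backfills away from the OSD
ceph osd reweight 164 0

# Watch the drain and overall recovery progress
ceph -s
ceph osd df | grep '^164'
```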
Files
Updated by Jeremi A almost 3 years ago
I forgot to add.
I pulled v15.2.12 on the affected host and also tried running the OSD on that version. It didn't make a difference.
Updated by Jeremi A almost 3 years ago
https://tracker.ceph.com/issues/48060 is the same issue.
Updated by Dan van der Ster almost 3 years ago
Could be related to https://github.com/ceph/ceph/pull/40572
Updated by Dan van der Ster almost 3 years ago
> I set the cluster into "maintenance mode", noout, norebalance, nobackfill, norecover. And then proceeded to reboot a host, it came back online and everything was fine. I then restarted a second host, it came back online and it was fine. I restarted a third host, and when it came back online, I had 1 unfound object.
It is better to set only noout when rebooting a host. If you set 'norecover' and 'nobackfill', then the PGs will not be allowed to recover writes after the rebooted host is back online.
So did you in fact unset norecover, so that the PGs could all be active+clean before you proceeded to reboot the second host?
If you did not, then indeed your PGs probably dropped below min_size and perhaps https://github.com/ceph/ceph/pull/40572 became relevant (allowing writes below min_size in some cases).
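In other words, a safer per-host reboot procedure looks roughly like this (a sketch of the advice above, not an official runbook):

```shell
# Only noout: OSDs are not marked out during the reboot, but
# recovery/backfill remain enabled once the host is back
ceph osd set noout

# ... reboot the host, wait for its OSDs to rejoin ...

# Wait until all PGs report active+clean before touching the next host
ceph pg stat

# Clear the flag when the last host is done
ceph osd unset noout
```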
Updated by Neha Ojha almost 3 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
Updated by Neha Ojha almost 3 years ago
- Related to Bug #48060: data loss in EC pool added