Bug #59510
openosd crash
Description
{
  "archived": "2023-04-23 02:43:27.040739",
  "backtrace": [
    "/lib64/libpthread.so.0(+0x12ce0) [0x7f4de5e1bce0]",
    "pthread_kill()",
    "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x48c) [0x5615f3d6534c]",
    "(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x23e) [0x5615f3d6573e]",
    "(PrimaryLogPG::scan_range(int, int, BackfillInterval*, ThreadPool::TPHandle&)+0x15a) [0x5615f38668da]",
    "(PrimaryLogPG::do_scan(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x914) [0x5615f3867d34]",
    "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x776) [0x5615f3868826]",
    "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5615f36effc9]",
    "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5615f394ee78]",
    "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5615f370d4c8]",
    "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5615f3d8a2a4]",
    "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5615f3d8d184]",
    "/lib64/libpthread.so.0(+0x81ca) [0x7f4de5e111ca]",
    "clone()"
  ],
  "ceph_version": "16.2.10",
  "crash_id": "2023-04-23T02:28:30.051101Z_9ba78a52-c740-4505-bb93-f797f394cebe",
  "entity_name": "osd.65",
  "os_id": "centos",
  "os_name": "CentOS Stream",
  "os_version": "8",
  "os_version_id": "8",
  "process_name": "ceph-osd",
  "stack_sig": "1a2700ce6c68288739eb14ca1b2b5f49449c59a5baafbd1e71df3a4316e3bffe",
  "timestamp": "2023-04-23T02:28:30.051101Z",
  "utsname_hostname": "node02",
  "utsname_machine": "x86_64",
  "utsname_release": "3.10.0-1160.45.1.el7.x86_64",
  "utsname_sysname": "Linux",
  "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021"
}
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to Need More Info
It looks like the scan-for-backfill operation was taking a long time and triggered the thread heartbeat timeout. This could even be caused by hardware issues. Could you please check for them, e.g. in dmesg?
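Hardware I/O problems usually leave traces in the kernel log and in SMART data. A quick check might look like the sketch below (the device name /dev/sda is only an example; substitute the OSD's actual backing disk):

```shell
# Scan the kernel ring buffer for common disk/controller error patterns.
dmesg -T | grep -iE 'i/o error|medium error|blk_update_request|ata[0-9]+'

# SMART health summary for the suspect disk (requires smartmontools).
smartctl -H /dev/sda
```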
Updated by can zhu about 1 year ago
Updated by Igor Fedotov about 1 year ago
You might also want to compact this OSD's DB using ceph-kvstore-tool. There is some chance that the timeout is caused by slow DB access, which in turn might be the result of a "degraded" DB.
The latter is a fairly well-known issue, most often affecting [slow] DB disks after bulk data removal.
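As a sketch, an offline compaction for the crashing OSD (osd.65 from the report above) could be done as follows. The data path assumes the default layout of a package-based deployment; the OSD must be stopped while its store is touched offline:

```shell
# Stop the OSD before operating on its store offline.
systemctl stop ceph-osd@65

# Compact the BlueStore RocksDB of osd.65 (adjust the path for
# non-default or containerized deployments).
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-65 compact

# Bring the OSD back up.
systemctl start ceph-osd@65
```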
Updated by can zhu about 1 year ago
The index pool is made of SSDs and the data pool of HDDs; the crash message comes from an HDD. Is there a way to avoid the slow ops? Or maybe we can increase the timeout?
Updated by Radoslaw Zarzynski 12 months ago
Increasing the timeout could obviously help in the short term but won't address the underlying problem. Igor's idea / long shot looks reasonable.
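For reference, the thread-heartbeat limits involved in this backtrace are governed by OSD options such as osd_op_thread_timeout and osd_op_thread_suicide_timeout. Raising them (a stop-gap only, as noted above; the values below are arbitrary examples) could look like:

```shell
# Raise the per-op worker-thread heartbeat grace and the suicide timeout
# for all OSDs via the monitor config store (values are examples only).
ceph config set osd osd_op_thread_timeout 60
ceph config set osd osd_op_thread_suicide_timeout 300
```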