Bug #59510
openosd crash
Description
{ "archived": "2023-04-23 02:43:27.040739", "backtrace": [ "/lib64/libpthread.so.0(+0x12ce0) [0x7f4de5e1bce0]", "pthread_kill()", "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x48c) [0x5615f3d6534c]", "(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x23e) [0x5615f3d6573e]", "(PrimaryLogPG::scan_range(int, int, BackfillInterval*, ThreadPool::TPHandle&)+0x15a) [0x5615f38668da]", "(PrimaryLogPG::do_scan(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x914) [0x5615f3867d34]", "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x776) [0x5615f3868826]", "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5615f36effc9]", "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5615f394ee78]", "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5615f370d4c8]", "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5615f3d8a2a4]", "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5615f3d8d184]", "/lib64/libpthread.so.0(+0x81ca) [0x7f4de5e111ca]", "clone()" ], "ceph_version": "16.2.10", "crash_id": "2023-04-23T02:28:30.051101Z_9ba78a52-c740-4505-bb93-f797f394cebe", "entity_name": "osd.65", "os_id": "centos", "os_name": "CentOS Stream", "os_version": "8", "os_version_id": "8", "process_name": "ceph-osd", "stack_sig": "1a2700ce6c68288739eb14ca1b2b5f49449c59a5baafbd1e71df3a4316e3bffe", "timestamp": "2023-04-23T02:28:30.051101Z", "utsname_hostname": "node02", "utsname_machine": "x86_64", "utsname_release": "3.10.0-1160.45.1.el7.x86_64", "utsname_sysname": "Linux", "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021" }
Updated by Radoslaw Zarzynski about 1 year ago
- Status changed from New to Need More Info
It looks like the scan-for-backfill operation was taking a long time and triggered the thread heartbeat timeout. This could even be due to hardware issues. Could you please check for them, e.g. in dmesg?
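For example, a quick scan for recent disk or controller errors (a sketch only; the device name is a placeholder, not taken from this report):

    # look for I/O, ATA/SCSI, or block-layer errors in the kernel log
    dmesg -T | grep -iE 'error|fail|timeout|ata|scsi|blk'
    # check SMART health of the disk backing the affected OSD (adjust the device)
    smartctl -a /dev/sdX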
Updated by Igor Fedotov about 1 year ago
You might also want to compact this OSD's DB using ceph-kvstore-tool. There is some chance the timeout is caused by slow DB access, which in turn might be the result of a "degraded" DB.
The latter is a fairly well-known issue, most likely affecting [slow] DB disks after bulk data removal.
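For reference, a minimal sketch of an offline compaction (the OSD id and data path are assumptions based on osd.65 from this report and the default layout; the OSD must be stopped first):

    systemctl stop ceph-osd@65
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-65 compact
    systemctl start ceph-osd@65

Recent releases can also trigger an online compaction without stopping the daemon:

    ceph tell osd.65 compact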
Updated by can zhu about 1 year ago
The index pool is made of SSDs and the data pool of HDDs; the crash message comes from an HDD-backed OSD. Is there a way to avoid the slow ops? Or maybe we can increase the timeout?
Updated by Radoslaw Zarzynski about 1 year ago
Increasing the timeout could obviously help in the short term, but it won't deal with the underlying problem. Igor's idea / long shot looks reasonable.
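The pthread_kill() in the backtrace suggests the thread suicide timeout fired, so the options involved are likely osd_op_thread_timeout (heartbeat warning, 15 s by default) and osd_op_thread_suicide_timeout (aborts the OSD, 150 s by default). If you do raise it as a stopgap, it would look something like this (the value is only an illustration):

    ceph config set osd osd_op_thread_suicide_timeout 300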
Updated by can zhu about 1 year ago
Thanks for your response. If we use SSDs for the data pool, do we need to add an NVMe device as the DB device?
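For reference, a typical way to provision a BlueStore OSD with a separate DB device via ceph-volume looks like this (a sketch only; the device names are placeholders, not a sizing recommendation for this cluster):

    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1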