Bug #59510


osd crash

Added by can zhu about 1 year ago. Updated 12 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

{
    "archived": "2023-04-23 02:43:27.040739",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f4de5e1bce0]",
        "pthread_kill()",
        "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x48c) [0x5615f3d6534c]",
        "(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x23e) [0x5615f3d6573e]",
        "(PrimaryLogPG::scan_range(int, int, BackfillInterval*, ThreadPool::TPHandle&)+0x15a) [0x5615f38668da]",
        "(PrimaryLogPG::do_scan(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x914) [0x5615f3867d34]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x776) [0x5615f3868826]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5615f36effc9]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5615f394ee78]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5615f370d4c8]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5615f3d8a2a4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5615f3d8d184]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f4de5e111ca]",
        "clone()" 
    ],
    "ceph_version": "16.2.10",
    "crash_id": "2023-04-23T02:28:30.051101Z_9ba78a52-c740-4505-bb93-f797f394cebe",
    "entity_name": "osd.65",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "1a2700ce6c68288739eb14ca1b2b5f49449c59a5baafbd1e71df3a4316e3bffe",
    "timestamp": "2023-04-23T02:28:30.051101Z",
    "utsname_hostname": "node02",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.45.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021" 
}
Actions #1

Updated by Neha Ojha about 1 year ago

  • Description updated (diff)
Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to Need More Info

It looks like the scan-for-backfill operation was taking a long time and triggered the thread heartbeat timeout. This could even be due to hardware issues. Could you please check for them in, e.g., dmesg?
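
For example, something along these lines might surface media or controller errors around the crash time (a rough sketch only; /dev/sdX is a placeholder for the affected OSD's data disk):

# look for I/O, SCSI or block-layer errors in the kernel log
dmesg -T | egrep -i 'i/o error|medium error|blk_update_request|reset|sector'
# check the SMART health of the suspected disk
smartctl -H -a /dev/sdX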

Actions #3

Updated by can zhu about 1 year ago

Like this?
[6880136.695917] tp_osd_tp6383: segfault at 0 ip 00007ff38f003573 sp 00007ff36ba8a240 error 4 in libtcmalloc.so.4.5.3[7ff38efd8000+4d000]
[7488958.853233] tp_osd_tp64401: segfault at 5648fc9441a8 ip 00007f6f955e9f5c sp 00007f6f706a9f08 error 7 in libc-2.28.so[7f6f9551a000+1bc000]

Actions #4

Updated by Igor Fedotov about 1 year ago

You might also want to compact this OSD's DB using ceph-kvstore-tool. There is some chance that the timeout is caused by slow DB access, which in turn might be the result of a "degraded" DB.
The latter is a fairly well-known issue, most often affecting [slow] DB disks after bulk data removal.
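
For reference, an offline compaction of this OSD could look roughly like the following (a sketch only, assuming the default OSD path /var/lib/ceph/osd/ceph-65; containerized deployments differ):

# avoid rebalancing while the OSD is down
ceph osd set noout
systemctl stop ceph-osd@65
# compact the OSD's RocksDB; the OSD must be stopped for this
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-65 compact
systemctl start ceph-osd@65
ceph osd unset noout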

Actions #5

Updated by can zhu about 1 year ago

The index pool is made of SSDs and the data pool is made of HDDs; the crash message comes from an HDD OSD. Is there a way to avoid the slow ops? Or maybe we can increase the timeout?

Actions #6

Updated by Radoslaw Zarzynski 12 months ago

Increasing the timeout could obviously help in the short term but won't deal with the underlying problem. Igor's idea / long shot looks reasonable.
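
If you do raise it as a stop-gap, the knob involved here is most likely osd_op_thread_suicide_timeout (the thread heartbeat that aborts the OSD when an op thread stalls), e.g.:

# stop-gap only: raise the op thread suicide timeout (default 150 s)
ceph config set osd osd_op_thread_suicide_timeout 300
# or limit it to the affected OSD
ceph config set osd.65 osd_op_thread_suicide_timeout 300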

Actions #7

Updated by can zhu 12 months ago

Thanks for your response. If we use SSDs for the data pool, do we need to add an NVMe device as the DB?
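
For illustration only, a new OSD with its data on an SSD and its RocksDB/WAL on an NVMe partition could be created roughly like this (device names are placeholders):

# data on an SSD, DB (and WAL) on an NVMe partition
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1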
