Bug #59510


osd crash

Added by can zhu about 1 year ago. Updated 12 months ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

{
    "archived": "2023-04-23 02:43:27.040739",
    "backtrace": [
        "/lib64/libpthread.so.0(+0x12ce0) [0x7f4de5e1bce0]",
        "pthread_kill()",
        "(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, std::chrono::time_point<ceph::coarse_mono_clock, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> > >)+0x48c) [0x5615f3d6534c]",
        "(ceph::HeartbeatMap::reset_timeout(ceph::heartbeat_handle_d*, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >)+0x23e) [0x5615f3d6573e]",
        "(PrimaryLogPG::scan_range(int, int, BackfillInterval*, ThreadPool::TPHandle&)+0x15a) [0x5615f38668da]",
        "(PrimaryLogPG::do_scan(boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x914) [0x5615f3867d34]",
        "(PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x776) [0x5615f3868826]",
        "(OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x5615f36effc9]",
        "(ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x5615f394ee78]",
        "(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28) [0x5615f370d4c8]",
        "(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x5615f3d8a2a4]",
        "(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x5615f3d8d184]",
        "/lib64/libpthread.so.0(+0x81ca) [0x7f4de5e111ca]",
        "clone()" 
    ],
    "ceph_version": "16.2.10",
    "crash_id": "2023-04-23T02:28:30.051101Z_9ba78a52-c740-4505-bb93-f797f394cebe",
    "entity_name": "osd.65",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-osd",
    "stack_sig": "1a2700ce6c68288739eb14ca1b2b5f49449c59a5baafbd1e71df3a4316e3bffe",
    "timestamp": "2023-04-23T02:28:30.051101Z",
    "utsname_hostname": "node02",
    "utsname_machine": "x86_64",
    "utsname_release": "3.10.0-1160.45.1.el7.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Wed Oct 13 17:20:51 UTC 2021" 
}
Actions #1

Updated by Neha Ojha about 1 year ago

  • Description updated (diff)
Actions #2

Updated by Radoslaw Zarzynski about 1 year ago

  • Status changed from New to Need More Info

It looks like the scan-for-backfill operation was taking a long time and triggered the thread heartbeat timeout. This could even be due to hardware issues. Could you please check for them in, e.g., dmesg?
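
For example, something along these lines might surface media or controller errors around the crash time (a rough sketch only; /dev/sdX is a placeholder for the affected OSD's data disk):

# look for I/O, SCSI or block-layer errors in the kernel log
dmesg -T | egrep -i 'i/o error|medium error|blk_update_request|reset|sector'
# check the SMART health of the suspected disk
smartctl -H -a /dev/sdX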

Actions #3

Updated by can zhu about 1 year ago

Like this?
[6880136.695917] tp_osd_tp6383: segfault at 0 ip 00007ff38f003573 sp 00007ff36ba8a240 error 4 in libtcmalloc.so.4.5.3[7ff38efd8000+4d000]
[7488958.853233] tp_osd_tp64401: segfault at 5648fc9441a8 ip 00007f6f955e9f5c sp 00007f6f706a9f08 error 7 in libc-2.28.so[7f6f9551a000+1bc000]

Actions #4

Updated by Igor Fedotov about 1 year ago

You might also want to compact this OSD's DB using ceph-kvstore-tool. There is some chance that the timeout is caused by slow DB access, which in turn might be the result of a "degraded" DB.
The latter is a fairly well-known issue, most often affecting [slow] DB disks after bulk data removal.
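
For reference, an offline compaction of this OSD could look roughly like the following (a sketch only, assuming the default OSD path /var/lib/ceph/osd/ceph-65; containerized deployments differ):

# avoid rebalancing while the OSD is down
ceph osd set noout
systemctl stop ceph-osd@65
# compact the OSD's RocksDB; the OSD must be stopped for this
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-65 compact
systemctl start ceph-osd@65
ceph osd unset noout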

Actions #5

Updated by can zhu about 1 year ago

The index pool is made of SSDs and the data pool is made of HDDs; the crash message comes from an HDD OSD. Is there a way to avoid the slow ops? Or maybe we can increase the timeout?

Actions #6

Updated by Radoslaw Zarzynski 12 months ago

Increasing the timeout could obviously help in the short term but won't deal with the underlying problem. Igor's idea / long shot looks reasonable.
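
If you do raise it as a stop-gap, the knob involved here is most likely osd_op_thread_suicide_timeout (the thread heartbeat that aborts the OSD when an op thread stalls), e.g.:

# stop-gap only: raise the op thread suicide timeout (default 150 s)
ceph config set osd osd_op_thread_suicide_timeout 300
# or limit it to the affected OSD
ceph config set osd.65 osd_op_thread_suicide_timeout 300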

Actions #7

Updated by can zhu 12 months ago

Thanks for your response. If we use SSDs for the data pool, do we need to add an NVMe device as the DB?
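
For illustration only, a new OSD with its data on an SSD and its RocksDB/WAL on an NVMe partition could be created roughly like this (device names are placeholders):

# data on an SSD, DB (and WAL) on an NVMe partition
ceph-volume lvm create --bluestore --data /dev/sdX --block.db /dev/nvme0n1p1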
