Bug #52445 (open): OSD asserts on starting too many pushes

Added by Amudhan Pandian over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am running a Ceph version 15.2.5 cluster. In recent days scrubbing reported errors and a few PGs failed because OSDs randomly stop, and a restarted OSD also fails again after some time.

The cluster is in an unusable state.

Attached is the crash log from one of the failed OSDs.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/osd/OSD.cc: 9521: FAILED ceph_assert(started <= reserved_pushes)

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55fcb6621dbe]
 2: (()+0x504fd8) [0x55fcb6621fd8]
 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x55fcb6704c25]
 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55fcb6960a3d]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55fcb67224df]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55fcb6d5b224]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55fcb6d5de84]
 8: (()+0x82de) [0x7f04c1b1c2de]
 9: (clone()+0x43) [0x7f04c0853e83]
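
The assertion indicates that OSD::do_recovery() started more recovery pushes than the scheduler had reserved for that work item. As a hedged starting point for investigation (the exact linkage between these options and reserved_pushes is an assumption here, not confirmed from this log), the recovery throttle settings can be checked with the same ceph config CLI used later in this ticket:

    # Inspect the recovery throttle options that bound how many pushes a
    # single recovery work item is expected to start (Octopus option names).
    ceph config get osd osd_recovery_max_active
    ceph config get osd osd_recovery_max_single_start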

Files

crash-log (506 KB) Amudhan Pandian, 08/28/2021 02:58 PM
Actions #1

Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #2

Updated by Greg Farnum over 2 years ago

  • Subject changed from OSD Stops without any reason to OSD asserts on starting too many pushes
  • Description updated (diff)
Actions #3

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Can you please provide 1) OSD logs with debug_osd=20 and debug_ms=1, 2) ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? If there are, please share them.

Actions #4

Updated by Amudhan Pandian over 2 years ago

Neha Ojha wrote:

Can you please provide 1) OSD logs with debug_osd=20 and debug_ms=1, 2) ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? If there are, please share them.

Multiple OSDs crash. There are scrub errors and PG failures, but I think they are all caused by the OSD crashes. Anyway, I will get you the logs. I run this cluster in podman containers; can you guide me on how to set these changes?

Actions #5

Updated by Amudhan Pandian over 2 years ago

Hi,

I have managed to set the debug levels using the ceph config set command and captured the log output.

  1. Options before changing:
    ceph config get osd debug_osd
    1/5
    ceph config get osd debug_ms
    0/0
  2. Values set as recommended (debug_osd=20 and debug_ms=1):
    ceph config set osd debug_osd 20
    ceph config get osd debug_osd
    20/20
    ceph config set osd debug_ms 1
    ceph config get osd debug_ms
    1/1
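
Once the verbose logs have been captured, the overrides can be cleared again so the OSDs return to their default log levels; a minimal sketch using the same ceph config mechanism (ceph config rm removes a previously set option):

    # drop the temporary debug overrides after collecting the logs
    ceph config rm osd debug_osd
    ceph config rm osd debug_ms
    # confirm the defaults are back
    ceph config get osd debug_osd
    ceph config get osd debug_ms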

The log file from a single failed OSD is more than 500 MB, so I have compressed it and uploaded it to my Google Drive; please download it from the link below.

https://drive.google.com/file/d/1pEqSgsBinC36Sqjr6tvbNadfR4BWzjdM/view?usp=sharing

  ceph.conf output from the node:

    # minimal ceph.conf for b6437922-3edf-11eb-adc2-0cc47a5ec98a
    [global]
    fsid = b6437922-3edf-11eb-adc2-0cc47a5ec98a
    mon_host = [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] [v2:10.0.103.2:3300/0,v1:10.0.103.2:6789/0] [v2:10.0.103.3:3300/0,v1:10.0.103.3:6789/0]

  Output from ceph -s:

    cluster:
    id: b6437922-3edf-11eb-adc2-0cc47a5ec98a
    health: HEALTH_ERR
    17/3505 objects unfound (0.485%)
    17 osds down
    Reduced data availability: 220 pgs inactive, 201 pgs down, 13 pgs peering, 184 pgs stale
    Possible data damage: 4 pgs recovery_unfound
    Degraded data redundancy: 1720/7147 objects degraded (24.066%), 226 pgs degraded, 240 pgs undersized
    170 daemons have recently crashed

    services:
    mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 2w)
    mgr: strg-node4.zlqori(active, since 2w)
    mds: cephfs:1 {0=cephfs.strg-node2.qdfwnt=up:active} 1 up:standby-replay 1 up:standby
    osd: 56 osds: 24 up (since 5d), 41 in (since 5d); 59 remapped pgs

    task status:
    scrub status:
    mds.cephfs.strg-node2.qdfwnt: idle
    mds.cephfs.strg-node3.yfsmqx: idle

    data:
    pools: 3 pools, 577 pgs
    objects: 3.50k objects, 13 GiB
    usage: 256 GiB used, 209 TiB / 209 TiB avail
    pgs: 38.128% pgs not active
    1720/7147 objects degraded (24.066%)
    90/7147 objects misplaced (1.259%)
    17/3505 objects unfound (0.485%)
    162 active+undersized+degraded
    151 stale+down
    109 active+clean
    50 down
    36 active+undersized+degraded+remapped+backfill_wait
    20 active+undersized+remapped
    12 stale+peering
    10 active+recovery_wait+degraded
    10 stale+active+undersized+degraded
    6 stale+activating+undersized
    3 stale+active+recovery_unfound+undersized+degraded
    2 active+recovering+degraded
    2 active+recovery_wait+undersized+degraded+remapped
    1 stale+active+clean
    1 stale+remapped+peering
    1 active+recovery_unfound+undersized+degraded
    1 active+clean+remapped

    io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
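
The health summary above also reports 170 recently crashed daemons. A hedged sketch for pulling those recorded reports from the cluster's crash module, which would show whether the other crashes hit the same assert (<crash-id> is a placeholder for an id taken from the listing):

    # list the crash reports recorded by the cluster
    ceph crash ls
    # print the full report, including the backtrace, for one crash
    ceph crash info <crash-id>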

Actions #6

Updated by Amudhan Pandian over 2 years ago

Neha Ojha wrote:

Can you please provide 1) OSD logs with debug_osd=20 and debug_ms=1, 2) ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? If there are, please share them.

Neha Ojha,

I have provided the info you requested; were you able to check it? The ticket is still in the "Need More Info" status.
Please let me know if you need more data about this issue.

Actions #7

Updated by Neha Ojha over 2 years ago

  • Status changed from Need More Info to New

Thanks. Is it possible for you to share the logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/), which is the standard way for us to access logs?

Actions #8

Updated by Amudhan Pandian over 2 years ago

Neha Ojha wrote:

Thanks. Is it possible for you to share the logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/), which is the standard way for us to access logs?

I was getting errors when uploading the raw log file, which is about 949 MB, so I have uploaded a TAR file instead using ceph-post-file; the upload id is `39bc8150-e23f-4c5d-bb3f-9a1830d6da41`.
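
For reference, a minimal sketch of a ceph-post-file upload like the one described above (the file name and description are placeholders, and the -d description option is assumed from the tool's usage):

    # bundle the oversized log and post it for the Ceph developers
    tar czf osd-debug-logs.tar.gz ceph-osd.NN.log    # NN: the failed OSD's id
    ceph-post-file -d "tracker 52445 osd debug logs" osd-debug-logs.tar.gz
    # the command prints an upload id to quote back in the tracker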
