Bug #52445 (open): OSD asserts on starting too many pushes

Added by Amudhan Pandian over 2 years ago. Updated over 2 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I am running a Ceph version 15.2.5 cluster. In recent days scrubbing reported errors and a few PGs failed because OSDs randomly stop, and a restarted OSD also fails again after some time.

The cluster is in an unusable state.

Attached is the crash log from one of the failed OSDs.

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/osd/OSD.cc: 9521: FAILED ceph_assert(started <= reserved_pushes)

 ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55fcb6621dbe]
 2: (()+0x504fd8) [0x55fcb6621fd8]
 3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x55fcb6704c25]
 4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55fcb6960a3d]
 5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55fcb67224df]
 6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55fcb6d5b224]
 7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55fcb6d5de84]
 8: (()+0x82de) [0x7f04c1b1c2de]
 9: (clone()+0x43) [0x7f04c0853e83]
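
The assertion indicates that OSD::do_recovery() started more recovery pushes than the scheduler had reserved for that work item. As a hedged starting point for investigation (the exact linkage between these options and reserved_pushes is an assumption here, not confirmed from this log), the recovery throttle settings can be checked with the same ceph config CLI used later in this ticket:

    # Inspect the recovery throttle options that bound how many pushes a
    # single recovery work item is expected to start (Octopus option names).
    ceph config get osd osd_recovery_max_active
    ceph config get osd osd_recovery_max_single_start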

Files

crash-log (506 KB) Amudhan Pandian, 08/28/2021 02:58 PM
Actions #1

Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to RADOS
  • Category deleted (OSD)
Actions #2

Updated by Greg Farnum over 2 years ago

  • Subject changed from OSD Stops without any reason to OSD asserts on starting too many pushes
  • Description updated (diff)
Actions #3

Updated by Neha Ojha over 2 years ago

  • Status changed from New to Need More Info

Can you please provide 1) OSD logs with debug_osd=20 and debug_ms=1, 2) ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? If there are, please share them.

Actions #4

Updated by Amudhan Pandian over 2 years ago

Neha Ojha wrote:

Can you please provide 1) OSD logs with debug_osd=20 and debug_ms=1, 2) ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? If there are, please share them.

Multiple OSDs crash. There are scrub errors and PG failures, but I think they are all caused by the OSD crashes. Anyway, I will get you the logs. I run this cluster in podman containers; can you guide me on how to set these changes?

Actions #5

Updated by Amudhan Pandian over 2 years ago

Hi,

I have managed to set the debug levels using the ceph config set command and captured the log output.

  1. Options before changing:
    ceph config get osd debug_osd
    1/5
    ceph config get osd debug_ms
    0/0
  2. Values set as recommended (debug_osd=20 and debug_ms=1):
    ceph config set osd debug_osd 20
    ceph config get osd debug_osd
    20/20
    ceph config set osd debug_ms 1
    ceph config get osd debug_ms
    1/1
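
Once the verbose logs have been captured, the overrides can be cleared again so the OSDs return to their default log levels; a minimal sketch using the same ceph config mechanism (ceph config rm removes a previously set option):

    # drop the temporary debug overrides after collecting the logs
    ceph config rm osd debug_osd
    ceph config rm osd debug_ms
    # confirm the defaults are back
    ceph config get osd debug_osd
    ceph config get osd debug_ms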

The log file from a single failed OSD is more than 500 MB, so I have compressed it and uploaded it to my Google Drive; please download it from the link below.

https://drive.google.com/file/d/1pEqSgsBinC36Sqjr6tvbNadfR4BWzjdM/view?usp=sharing

  ceph.conf output from the node:

    # minimal ceph.conf for b6437922-3edf-11eb-adc2-0cc47a5ec98a
    [global]
    fsid = b6437922-3edf-11eb-adc2-0cc47a5ec98a
    mon_host = [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] [v2:10.0.103.2:3300/0,v1:10.0.103.2:6789/0] [v2:10.0.103.3:3300/0,v1:10.0.103.3:6789/0]

  Output from ceph -s:

    cluster:
    id: b6437922-3edf-11eb-adc2-0cc47a5ec98a
    health: HEALTH_ERR
    17/3505 objects unfound (0.485%)
    17 osds down
    Reduced data availability: 220 pgs inactive, 201 pgs down, 13 pgs peering, 184 pgs stale
    Possible data damage: 4 pgs recovery_unfound
    Degraded data redundancy: 1720/7147 objects degraded (24.066%), 226 pgs degraded, 240 pgs undersized
    170 daemons have recently crashed

    services:
    mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 2w)
    mgr: strg-node4.zlqori(active, since 2w)
    mds: cephfs:1 {0=cephfs.strg-node2.qdfwnt=up:active} 1 up:standby-replay 1 up:standby
    osd: 56 osds: 24 up (since 5d), 41 in (since 5d); 59 remapped pgs

    task status:
    scrub status:
    mds.cephfs.strg-node2.qdfwnt: idle
    mds.cephfs.strg-node3.yfsmqx: idle

    data:
    pools: 3 pools, 577 pgs
    objects: 3.50k objects, 13 GiB
    usage: 256 GiB used, 209 TiB / 209 TiB avail
    pgs: 38.128% pgs not active
    1720/7147 objects degraded (24.066%)
    90/7147 objects misplaced (1.259%)
    17/3505 objects unfound (0.485%)
    162 active+undersized+degraded
    151 stale+down
    109 active+clean
    50 down
    36 active+undersized+degraded+remapped+backfill_wait
    20 active+undersized+remapped
    12 stale+peering
    10 active+recovery_wait+degraded
    10 stale+active+undersized+degraded
    6 stale+activating+undersized
    3 stale+active+recovery_unfound+undersized+degraded
    2 active+recovering+degraded
    2 active+recovery_wait+undersized+degraded+remapped
    1 stale+active+clean
    1 stale+remapped+peering
    1 active+recovery_unfound+undersized+degraded
    1 active+clean+remapped

    io:
    client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
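
The health summary above also reports 170 recently crashed daemons. A hedged sketch for pulling those recorded reports from the cluster's crash module, which would show whether the other crashes hit the same assert (<crash-id> is a placeholder for an id taken from the listing):

    # list the crash reports recorded by the cluster
    ceph crash ls
    # print the full report, including the backtrace, for one crash
    ceph crash info <crash-id>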

Actions #6

Updated by Amudhan Pandian over 2 years ago

Neha Ojha wrote:

Can you please provide 1) OSD logs with debug_osd=20 and debug_ms=1, 2) ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? If there are, please share them.

Neha Ojha,

I have provided the info you requested; were you able to check it? The ticket is still in the "Need More Info" status.
Please let me know if you need more data about this issue.

Actions #7

Updated by Neha Ojha over 2 years ago

  • Status changed from Need More Info to New

Thanks. Is it possible for you to share the logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/), which is the standard way for us to access logs?

Actions #8

Updated by Amudhan Pandian over 2 years ago

Neha Ojha wrote:

Thanks. Is it possible for you to share the logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/), which is the standard way for us to access logs?

I was getting errors when uploading the raw log file, which is about 949 MB, so I have uploaded a TAR file instead using ceph-post-file; the upload id is `39bc8150-e23f-4c5d-bb3f-9a1830d6da41`.
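
For reference, a minimal sketch of a ceph-post-file upload like the one described above (the file name and description are placeholders, and the -d description option is assumed from the tool's usage):

    # bundle the oversized log and post it for the Ceph developers
    tar czf osd-debug-logs.tar.gz ceph-osd.NN.log    # NN: the failed OSD's id
    ceph-post-file -d "tracker 52445 osd debug logs" osd-debug-logs.tar.gz
    # the command prints an upload id to quote back in the tracker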
