Bug #52445
OSD asserts on starting too many pushes
Description
I am running a Ceph 15.2.5 cluster. In recent days scrub reported errors and a few PGs failed because OSDs randomly stop; restarting an OSD also fails after some time.
The cluster is in an unusable state.
The crash log from one of the failed OSDs is attached.
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.7/rpm/el8/BUILD/ceph-15.2.7/src/osd/OSD.cc: 9521: FAILED ceph_assert(started <= reserved_pushes)
ceph version 15.2.7 (88e41c6c49beb18add4fdb6b4326ca466d931db8) octopus (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x55fcb6621dbe]
2: (()+0x504fd8) [0x55fcb6621fd8]
3: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x5f5) [0x55fcb6704c25]
4: (ceph::osd::scheduler::PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x1d) [0x55fcb6960a3d]
5: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55fcb67224df]
6: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55fcb6d5b224]
7: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55fcb6d5de84]
8: (()+0x82de) [0x7f04c1b1c2de]
9: (clone()+0x43) [0x7f04c0853e83]
Files
Updated by Greg Farnum over 2 years ago
- Project changed from Ceph to RADOS
- Category deleted (OSD)
Updated by Greg Farnum over 2 years ago
- Subject changed from OSD Stops without any reason to OSD asserts on starting too many pushes
- Description updated (diff)
Updated by Neha Ojha over 2 years ago
- Status changed from New to Need More Info
Can you please provide: 1) OSD logs with debug_osd=20 and debug_ms=1, 2) your ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? Please share them if there are.
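For reference, recent daemon crash reports can also be pulled from the crash module (enabled by default on Octopus); a minimal sketch, where <crash-id> is a placeholder taken from the listing:
ceph crash ls                 # list recent daemon crashes with their ids
ceph crash info <crash-id>    # full metadata and backtrace for one crash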
Updated by Amudhan Pandian over 2 years ago
Neha Ojha wrote:
Can you please provide: 1) OSD logs with debug_osd=20 and debug_ms=1, 2) your ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? Please share them if there are.
Multiple OSDs crash; there are scrub errors and PG failures, but I think those are all caused by the OSD crashes. Anyway, I will get you the logs. I run this cluster in Podman containers; can you guide me on how to set these changes?
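For a cephadm/Podman deployment, one way to reach the ceph CLI is cephadm shell on any managed host; a rough sketch, where osd.12 is only a placeholder daemon id:
sudo cephadm shell                     # temporary container with the ceph CLI and admin keyring
ceph config set osd debug_osd 20       # applies to every OSD
ceph config set osd.12 debug_osd 20    # or raise it on a single daemon only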
Updated by Amudhan Pandian over 2 years ago
Hi,
I have managed to set the debug levels using the ceph config set command and captured the log output.
- options before changing
ceph config get osd debug_osd
1/5
ceph config get osd debug_ms
0/0
- values set as recommended (debug_osd=20 and debug_ms=1)
ceph config set osd debug_osd 20
ceph config get osd debug_osd
20/20
ceph config set osd debug_ms 1
ceph config get osd debug_ms
1/1
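Once the logs have been captured, the overrides can be removed again so the defaults take effect (debug_osd 20 makes the logs grow very quickly); for example:
ceph config rm osd debug_osd
ceph config rm osd debug_ms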
The log file from a single failed OSD is more than 500 MB, so I have compressed it and uploaded it to my Google Drive; please download it from the link below.
https://drive.google.com/file/d/1pEqSgsBinC36Sqjr6tvbNadfR4BWzjdM/view?usp=sharing
- ceph.conf output from the node
# minimal ceph.conf for b6437922-3edf-11eb-adc2-0cc47a5ec98a
[global]
fsid = b6437922-3edf-11eb-adc2-0cc47a5ec98a
mon_host = [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] [v2:10.0.103.2:3300/0,v1:10.0.103.2:6789/0] [v2:10.0.103.3:3300/0,v1:10.0.103.3:6789/0]
- output from ceph -s
cluster:
id: b6437922-3edf-11eb-adc2-0cc47a5ec98a
health: HEALTH_ERR
17/3505 objects unfound (0.485%)
17 osds down
Reduced data availability: 220 pgs inactive, 201 pgs down, 13 pgs peering, 184 pgs stale
Possible data damage: 4 pgs recovery_unfound
Degraded data redundancy: 1720/7147 objects degraded (24.066%), 226 pgs degraded, 240 pgs undersized
170 daemons have recently crashed
services:
mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 2w)
mgr: strg-node4.zlqori(active, since 2w)
mds: cephfs:1 {0=cephfs.strg-node2.qdfwnt=up:active} 1 up:standby-replay 1 up:standby
osd: 56 osds: 24 up (since 5d), 41 in (since 5d); 59 remapped pgs
task status:
scrub status:
mds.cephfs.strg-node2.qdfwnt: idle
mds.cephfs.strg-node3.yfsmqx: idle
data:
pools: 3 pools, 577 pgs
objects: 3.50k objects, 13 GiB
usage: 256 GiB used, 209 TiB / 209 TiB avail
pgs: 38.128% pgs not active
1720/7147 objects degraded (24.066%)
90/7147 objects misplaced (1.259%)
17/3505 objects unfound (0.485%)
162 active+undersized+degraded
151 stale+down
109 active+clean
50 down
36 active+undersized+degraded+remapped+backfill_wait
20 active+undersized+remapped
12 stale+peering
10 active+recovery_wait+degraded
10 stale+active+undersized+degraded
6 stale+activating+undersized
3 stale+active+recovery_unfound+undersized+degraded
2 active+recovering+degraded
2 active+recovery_wait+undersized+degraded+remapped
1 stale+active+clean
1 stale+remapped+peering
1 active+recovery_unfound+undersized+degraded
1 active+clean+remapped
io:
client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
Updated by Amudhan Pandian over 2 years ago
Neha Ojha wrote:
Can you please provide: 1) OSD logs with debug_osd=20 and debug_ms=1, 2) your ceph.conf, and 3) the output of ceph -s?
Is this crash seen on one OSD or several OSDs? Are there any other crashes? Please share them if there are.
Neha Ojha,
I have provided the info you requested; were you able to check it? The ticket is still in the status "Need More Info".
Please let me know if you need more data about this issue.
Updated by Neha Ojha over 2 years ago
- Status changed from Need More Info to New
Thanks. Is it possible for you to share the logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/)? That is the standard way for us to access logs.
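Roughly, the tool takes an optional description plus one or more files or directories and prints an upload id/tag to paste into the ticket; the description text and log path below are placeholders:
ceph-post-file -d "osd assert started <= reserved_pushes (tracker 52445)" /path/to/ceph-osd.NN.log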
Updated by Amudhan Pandian over 2 years ago
Neha Ojha wrote:
Thanks, is it possible for you to share the logs using ceph-post-file (https://docs.ceph.com/en/pacific/man/8/ceph-post-file/), which is the standard way for us to access logs.
I am getting errors when uploading the raw log file, which is about 949 MB, so I have uploaded a tar file using ceph-post-file instead; the upload id is `39bc8150-e23f-4c5d-bb3f-9a1830d6da41`.