Bug #24227 (closed)

jewel->luminous: osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

Added by Siegfried Hoellrigl almost 6 years ago. Updated about 3 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
OSD
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi!

We have done an (almost) successful upgrade to Ceph Luminous 12.2.5.

The cluster becomes almost healthy, but shortly before that, one OSD crashes (osd.130).
We have already identified a faulty placement group.

Surprisingly, this PG is up and running on two other servers without any problem.

With "ceph pg dump" we can see, that the last column (SNAPTRIMQ_LEN) of PG 5.9b is "27826
", and not zero like on the other pgs...

To solve the problem, we have already purged all snapshots of the rbd pool (ID=5).
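Roughly, that purge amounts to something like the following (assuming the pool is named "rbd"):

# purge every snapshot of every image in the pool
for img in $(rbd ls rbd); do
    rbd snap purge rbd/"$img"
done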
Then we increased verbosity and adjusted the cache size and throttle_bytes settings in ceph.conf like this (a runtime alternative is noted after the block):
[osd.130]
debug bluestore = 20
debug osd = 20
bluestore_throttle_bytes = 0
bluestore_throttle_deferred_bytes = 0
debug throttle = 10
bluestore_cache_size_hdd = 10737418240
bluestore_cache_size_ssd = 10737418240
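As a side note, the debug levels can also be injected into the running daemon without a restart; the bluestore cache and throttle settings may still require an OSD restart to take effect:

ceph tell osd.130 injectargs '--debug_osd 20 --debug_bluestore 20 --debug_throttle 10'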

Then we deleted the PG from the crashing OSD:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-130/ --pgid 5.9b --op remove --force
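For completeness, the same tool can also export the PG copy to a file beforehand as a safety net (the OSD must be stopped while ceph-objectstore-tool runs; the backup path here is only an example):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-130/ --pgid 5.9b --op export --file /root/pg-5.9b.export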

But again, the OSD crashes.

/build/ceph-12.2.5/src/osd/PrimaryLogPG.cc: 358: FAILED assert(p != recovery_info.ss.clone_snaps.end())

What can we do to bring the PG up on all 3 OSDs?


Files

OSD130LOG.zip (190 KB) Siegfried Hoellrigl, 05/22/2018 01:25 PM
pgdump.zip (628 KB) Siegfried Hoellrigl, 05/22/2018 01:25 PM