Bug #21825

closed

OSD won't stay online and crashes with abort

Added by Jérôme Poulin over 6 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
High
Assignee:
David Zafman
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I have an issue where 2 OSDs can't stay up at the same time; one will crash the other, causing down PGs.

Exporting the problematic PG using ceph-objectstore-tool works, but importing crashes.

This is on an upgraded Luminous cluster.
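
As a rough sketch of the export/import attempt described above (the data paths, OSD ids, file name, and PG id 0.ae are placeholders inferred from later comments, not the exact invocations used; the tool must be run with the OSD stopped):

# Export the problematic PG from the source OSD
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-8 --pgid 0.ae --op export --file /tmp/pg0.ae.export
# Import it on the destination OSD (this is the step that crashes)
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op import --file /tmp/pg0.ae.export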


Files

osd.8.crash.log.gz (243 KB) Jérôme Poulin, 10/18/2017 03:32 AM
ceph-objectstore-tool.crash.log (6.05 KB) Jérôme Poulin, 10/18/2017 03:33 AM
osd.3.crash.after.marking.unfound.revert.log (259 KB) Jérôme Poulin, 10/18/2017 05:09 AM
osd.5.crash.log.gz (314 KB) Jérôme Poulin, 10/18/2017 09:46 PM
Actions #1

Updated by David Zafman over 6 years ago

You should bump up the OSD logging to see more of what is happening.
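
A minimal sketch of how to do that (osd.8 is an assumption based on the attached log; pick the debug levels that suit the situation):

# Raise debug levels on the running OSD, then revert once the crash is captured
ceph tell osd.8 injectargs '--debug_osd 20 --debug_ms 1'
# Or set them persistently in the [osd] section of ceph.conf before restarting the daemon:
#   debug osd = 20
#   debug ms = 1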

Actions #2

Updated by David Zafman over 6 years ago

  • Assignee set to David Zafman
  • Source set to Community (user)
Actions #3

Updated by Jérôme Poulin over 6 years ago

After tinkering around with killing and restarting OSDs many times, and marking objects lost and unfound, I finally was able to recover all but 5 objects and have the PG active+clean again. Another crash happened after:
ceph pg 0.ae mark_unfound_lost revert
pg has 5 objects unfound and apparently lost marking

Crash log attached.

root@cadevceph2:~# ceph pg 0.ae list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": "" 
    },
    "num_missing": 5,
    "num_unfound": 5,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.1e54d7a493069.000000000000012d",
                "key": "",
                "snapid": -2,
                "hash": 521486766,
                "max": 0,
                "pool": 0,
                "namespace": "" 
            },
            "need": "14657'6714714",
            "have": "0'0",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.56f1c3d1b58ba.00000000000004a0",
                "key": "",
                "snapid": -2,
                "hash": 2589933998,
                "max": 0,
                "pool": 0,
                "namespace": "" 
            },
            "need": "14663'6714752",
            "have": "0'0",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.15132a3d1b58ba.000000000000036b",
                "key": "",
                "snapid": -2,
                "hash": 2743273902,
                "max": 0,
                "pool": 0,
                "namespace": "" 
            },
            "need": "14678'6714754",
            "have": "14663'6714753",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.25aeef2eb141f2.0000000000000145",
                "key": "",
                "snapid": -2,
                "hash": 2505573294,
                "max": 0,
                "pool": 0,
                "namespace": "" 
            },
            "need": "14678'6714755",
            "have": "0'0",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.4b5782ae8944a.00000000000002f0",
                "key": "",
                "snapid": -2,
                "hash": 1395788718,
                "max": 0,
                "pool": 0,
                "namespace": "" 
            },
            "need": "14663'6714734",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
Actions #4

Updated by Jérôme Poulin over 6 years ago

I think there is more to this. After the PG went active+clean, I shut down osd.3, the PG went active+clean+snaptrim, and then osd.5 crashed for the first time.

I restarted both osd.5 and osd.3. osd.3 is still out, but I can't afford to get those PGs down again; I may try again tomorrow evening since I need to get some sleep.

Actions #5

Updated by Jason Dillaman over 6 years ago

  • Project changed from rbd to RADOS
Actions #6

Updated by Jérôme Poulin over 6 years ago

I had a chance to remove osd.3 today and replace the hard disk with a new one. No crash so far; it is rebalancing.

I dumped the log of yesterday's osd.5 crash. This was the only crash from this OSD.
root@cadevceph9:~# zcat /var/log/ceph/osd.5.log.1.gz | grep "end dump" -B 12000 | gzip -9 > osd.5.crash.log.gz
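
For context on the osd.3 removal mentioned above, a rough sketch of how an OSD is typically retired before swapping its disk on Luminous (not necessarily the exact steps used here):

ceph osd out 3
systemctl stop ceph-osd@3
ceph osd purge 3 --yes-i-really-mean-it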

Actions #7

Updated by Sage Weil over 6 years ago

  • Status changed from New to 12
  • Priority changed from Normal to High

Can you confirm you're not using jemalloc (check /etc/{default,sysconfig}/ceph)?
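
A quick way to check, as a sketch (file locations depend on the distro, and the second command assumes a single ceph-osd process on the host):

grep -i jemalloc /etc/default/ceph /etc/sysconfig/ceph 2>/dev/null
# Also confirm the running OSD hasn't loaded libjemalloc:
grep -i jemalloc /proc/$(pidof -s ceph-osd)/maps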

Actions #8

Updated by Jérôme Poulin over 6 years ago

I did a quick check on my 4 hosts and jemalloc is not enabled. The cluster is now back to active+clean.

Actions #9

Updated by Jérôme Poulin over 6 years ago

Would you be interested in having a copy of the 2 GB PG which causes ceph-objectstore-tool to crash?

Actions #10

Updated by David Zafman over 6 years ago

The crash of ceph-objectstore-tool would be caused by removing a PG using "rm -rf" and then trying to import the PG. In the future, if you use "--op remove" (or, in the latest code, "--op export-remove") before importing, the PG will be properly deleted. The problem is that the information in leveldb/rocksdb hasn't been removed when you just remove the directories right out of the filestore.
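
A sketch of that procedure (data path, file name, and PG id are placeholders; run with the OSD stopped, and some builds may additionally require --force with --op remove):

# Properly remove the stale copy of the PG before importing
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid 0.ae --op remove
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --op import --file /tmp/pg0.ae.export
# Or, with the latest code, export and remove in a single step:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --pgid 0.ae --op export-remove --file /tmp/pg0.ae.export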

Actions #11

Updated by Jérôme Poulin over 6 years ago

For my initial import attempt I used --op remove, since this was a BlueStore OSD, and that attempt also crashed. I'm now not sure which log I provided, whether it was from my attempt after creating a fresh new filestore OSD or from another attempt on BlueStore. One thing I'm sure of is that I tried both --op remove and rm -rf on my new filestore OSD in an attempt to recover the data.

Actions #12

Updated by Greg Farnum over 6 years ago

  • Status changed from 12 to Closed

Looks like stuff is working now.

Actions #13

Updated by Jérôme Poulin over 6 years ago

It looks like this ticket won't be investigated any further; should I delete my ceph-objectstore-tool export of the PG? Importing that PG reproduces the crash on demand.
