Bug #21825
Status: Closed
OSD won't stay online and crashes with abort
Description
I have an issue where two OSDs can't stay up at the same time: one will crash the other, causing down PGs.
Exporting the problematic PG using ceph-objectstore-tool works, but importing crashes.
This is on an upgraded Luminous cluster.
Files
Updated by David Zafman over 6 years ago
You should bump up the OSD logging to see more of what is happening.
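A sketch of how that logging bump can be done (debug level 20 is the maximum; `osd.3` is taken from this report, adjust to the affected OSD):

```
# Raise debug verbosity on a running OSD without restarting it:
ceph tell osd.3 injectargs '--debug-osd 20 --debug-ms 1'
# Or persist it in ceph.conf under [osd] and restart the daemon:
#   debug osd = 20
#   debug ms = 1
```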
Updated by David Zafman over 6 years ago
- Assignee set to David Zafman
- Source set to Community (user)
Updated by Jérôme Poulin over 6 years ago
- File osd.3.crash.after.marking.unfound.revert.log osd.3.crash.after.marking.unfound.revert.log added
After tinkering with killing and starting OSDs many times, and marking objects lost and unfound, I was finally able to recover all but 5 objects and get the PG active+clean again. Another crash happened after:
ceph pg 0.ae mark_unfound_lost revert
pg has 5 objects unfound and apparently lost marking
crash attached.
root@cadevceph2:~# ceph pg 0.ae list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 5,
    "num_unfound": 5,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.1e54d7a493069.000000000000012d",
                "key": "",
                "snapid": -2,
                "hash": 521486766,
                "max": 0,
                "pool": 0,
                "namespace": ""
            },
            "need": "14657'6714714",
            "have": "0'0",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.56f1c3d1b58ba.00000000000004a0",
                "key": "",
                "snapid": -2,
                "hash": 2589933998,
                "max": 0,
                "pool": 0,
                "namespace": ""
            },
            "need": "14663'6714752",
            "have": "0'0",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.15132a3d1b58ba.000000000000036b",
                "key": "",
                "snapid": -2,
                "hash": 2743273902,
                "max": 0,
                "pool": 0,
                "namespace": ""
            },
            "need": "14678'6714754",
            "have": "14663'6714753",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.25aeef2eb141f2.0000000000000145",
                "key": "",
                "snapid": -2,
                "hash": 2505573294,
                "max": 0,
                "pool": 0,
                "namespace": ""
            },
            "need": "14678'6714755",
            "have": "0'0",
            "flags": "none",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.4b5782ae8944a.00000000000002f0",
                "key": "",
                "snapid": -2,
                "hash": 1395788718,
                "max": 0,
                "pool": 0,
                "namespace": ""
            },
            "need": "14663'6714734",
            "have": "0'0",
            "flags": "none",
            "locations": []
        }
    ],
    "more": false
}
Updated by Jérôme Poulin over 6 years ago
I think there is more to this: after the PG was active+clean, I shut down osd.3, the PG went active+clean+snaptrim, and then osd.5 crashed for the first time.
I restarted both osd.5 and osd.3; osd.3 is still out, but I can't afford to get those PGs down again. I may try again tomorrow evening since I have to sleep a bit.
Updated by Jérôme Poulin over 6 years ago
- File osd.5.crash.log.gz osd.5.crash.log.gz added
I had a chance to remove osd.3 today and replace the hard disk with a new one; no crash so far, and it is rebalancing.
I dumped the log of yesterday's osd.5 crash. This was the only crash from this OSD:
root@cadevceph9:~# zcat /var/log/ceph/osd.5.log.1.gz | grep "end dump" -B 12000 | gzip -9 > osd.5.crash.log.gz
Updated by Sage Weil over 6 years ago
- Status changed from New to 12
- Priority changed from Normal to High
Can you confirm you're not using jemalloc (check /etc/{default,sysconfig}/ceph)?
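The check Sage suggests can be done with a quick grep over the environment files he names (the exact paths vary by distro; both candidates are tried here):

```shell
# Look for a jemalloc LD_PRELOAD or TCMALLOC override in the Ceph defaults.
# Prints any matching lines, or a note if neither file mentions jemalloc.
grep -i jemalloc /etc/default/ceph /etc/sysconfig/ceph 2>/dev/null \
  || echo "no jemalloc reference"
```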
Updated by Jérôme Poulin over 6 years ago
I did a quick check on my 4 hosts and jemalloc is not enabled. The cluster is now back to active+clean.
Updated by Jérôme Poulin over 6 years ago
Would you be interested in having a copy of the 2 GB PG which causes ceph-objectstore-tool to crash?
Updated by David Zafman over 6 years ago
The crash of ceph-objectstore-tool would be caused by removing a PG with "rm -rf" and then trying to import it again. In the future, if you use "--op remove" (or, in the latest code, "--op export-remove") before importing, the PG will be properly deleted. The problem is that the information in leveldb/rocksdb is not removed when you just delete the directories right out of filestore.
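The sequence described above looks roughly like this (OSD data paths and the export filename are assumptions; the PG id is the one from this report; the OSD must be stopped first):

```
# Export the PG, then remove it so the leveldb/rocksdb metadata is
# cleaned up too (unlike "rm -rf" on the PG directory):
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --pgid 0.ae --op export --file pg0.ae.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --pgid 0.ae --op remove
# Or, in the latest code, combine both steps:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
    --pgid 0.ae --op export-remove --file pg0.ae.export
# Then import on the target OSD:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-5 \
    --op import --file pg0.ae.export
```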
Updated by Jérôme Poulin over 6 years ago
In my initial attempt to import, I used --op remove since this was a bluestore OSD, and that attempt also crashed. Now I'm not sure which log I provided: it may have been my attempt after creating a fresh new OSD on filestore, or another attempt on bluestore. One thing I'm sure of is that I tried both --op remove and rm -rf on my new filestore OSD in an attempt to recover the data.
Updated by Greg Farnum over 6 years ago
- Status changed from 12 to Closed
Looks like stuff is working now.
Updated by Jérôme Poulin over 6 years ago
It looks like this ticket won't be investigated anymore; should I delete my ceph-objectstore-tool export of the PG? You can import that PG to reproduce the crash on demand.