Bug #21287

1 PG down, OSD fails with "FAILED assert(i->prior_version == last || i->is_error())"

Added by Henrik Korkuc over 6 years ago. Updated over 3 years ago.

Status:
Duplicate
Priority:
High
Assignee:
-
Category:
EC Pools
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

One PG went down for me during a large rebalance (I added racks to the OSD placement, so almost all data had to be shuffled). It is an RBD EC pool; min_size was set to k because some PGs went inactive after the rebalance/OSD restarts.

The OSD fails with FAILED assert(i->prior_version == last || i->is_error()) during peering. This does not seem to be OSD-specific: I moved the PG to another OSD, and it hit the same issue after I marked the original OSD lost. Also, shutting down some other OSDs of that PG lets the previously failing OSD start. For example: OSD 133 fails to start. Shut down OSD 65, and OSD 133 starts successfully. Start OSD 65 again, and OSDs 133, 118 (which has a copy of that PG from OSD 133) and 65 crash. Shut down OSD 381, and OSD 65 can start.
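
For context, the assert is effectively a continuity check on the per-object PG log applied while peering merges log entries: each non-error entry's prior_version must equal the version of the entry just before it. Below is a minimal sketch of that invariant; eversion_t, LogEntry and check_object_log_continuity are simplified stand-ins, not Ceph's actual pg_log code.

@
// Illustrative sketch only (simplified stand-in types, not Ceph's pg_log code).
#include <cassert>
#include <vector>

struct eversion_t {
  unsigned epoch = 0;
  unsigned version = 0;
  bool operator==(const eversion_t &o) const {
    return epoch == o.epoch && version == o.version;
  }
};

struct LogEntry {
  eversion_t version;        // version this update produced
  eversion_t prior_version;  // version it expected to be modifying
  int return_code = 0;
  bool is_error() const { return return_code < 0; }  // simplification
};

// Walk one object's log entries oldest-to-newest and require continuity;
// this mirrors "assert(i->prior_version == last || i->is_error())".
void check_object_log_continuity(const std::vector<LogEntry> &entries) {
  if (entries.empty())
    return;
  eversion_t last = entries.front().prior_version;
  for (const auto &e : entries) {
    assert(e.prior_version == last || e.is_error());
    if (!e.is_error())
      last = e.version;  // error entries do not advance the object's version
  }
}
@

If one update in the middle never makes it into a shard's log (as suggested later in this thread), the next entry's prior_version points at a version that shard never recorded, and the check trips.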

I posted the cluster log and yesterday's OSD 133 log, and will paste an OSD 65 log with more debugging in the next message.

ceph-post-file for cluster log: 0d3a2d70-cb27-4ade-b0d7-6ee6f69ffbc9
ceph-post-file for OSD 133: b48cce81-5571-4575-a626-daa9f43725d7


Related issues

Duplicates RADOS - Bug #22916: OSD crashing in peering Duplicate 02/05/2018

History

#1 Updated by Henrik Korkuc over 6 years ago

By the way, the down PG is 1.1735.

Starting OSD 381 crashes OSDs 65, 133 and 118. Stopping 65 lets the remaining OSDs start; starting it again crashes 118 and 133.

To reduce the number of crash points to debug, I removed 1.1735s0 from OSD 118 (it had been copied there from 133).

I added debug_osd 20/20 to the config, started OSD 133 and, after it booted, started OSD 65. Both of them crashed; logs posted.
OSD 133: c396e54f-f96e-4312-9a18-77ea5a6ed5c2
OSD 65: a76c4057-8acf-46c2-aa02-39528b98dd27
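
For reference, the debug_osd 20/20 setting used above can be applied either persistently in ceph.conf or injected into a running daemon (standard Ceph syntax; osd.133 is the OSD from this report):

@
# ceph.conf: set persistently for all OSDs on the host
[osd]
    debug osd = 20/20

# or inject into a running OSD without a restart
ceph tell osd.133 injectargs '--debug-osd 20/20'
@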

#2 Updated by Henrik Korkuc over 6 years ago

I had to delete the affected pool to reclaim the occupied space, so I am unable to verify any fixes.

#3 Updated by mingxin liu over 6 years ago

We hit this bug too, with RBD on an EC pool.

#4 Updated by mingxin liu over 6 years ago

@{
  "op": "modify",
  "object": "1:4ec1cf44:::xbd_data.100119495cff.0000000000002897:head",
  "version": "1624'39707",
  "prior_version": "1624'39706",
  "reqid": "client.77068.0:18102704",
  "extra_reqids": [],
  "mtime": "2017-11-12 03:44:55.752943",
  "return_code": 0,
  "mod_desc": {
    "object_mod_desc": {
      "can_local_rollback": true,
      "rollback_info_completed": false,
      "ops": [
        {
          "code": "SETATTRS",
          "attrs": [
            "_",
            "hinfo_key",
            "snapset"
          ]
        },
        {
          "code": "ROLLBACK_EXTENTS",
          "gen": 39707,
          "snaps": "[536576,4096]"
        }
      ]
    }
  }
}, {
  "op": "modify",
  "object": "1:4ec1cf44:::xbd_data.100119495cff.0000000000002897:head",
  "version": "1624'39709",
  "prior_version": "1624'39708",
  "reqid": "client.86630.0:19763476",
  "extra_reqids": [],
  "mtime": "2017-11-12 03:44:55.774969",
  "return_code": 0,
  "mod_desc": {
    "object_mod_desc": {
      "can_local_rollback": true,
      "rollback_info_completed": false,
      "ops": [
        {
          "code": "SETATTRS",
          "attrs": [
            "_",
            "hinfo_key",
            "snapset"
          ]
        },
        {
          "code": "ROLLBACK_EXTENTS",
          "gen": 39709,
          "snaps": "[540672,4096]"
        }
      ]
    }
  }
}
@
These are the last two updates; they are divergent entries, and version 39708 has disappeared. I guess the ops were applied out of order.
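
Plugging the two entries above into the continuity check sketched in the description makes the gap concrete (standalone illustrative code, not Ceph's; the versions are copied from the dump):

@
// The divergent-entry walk keeps "last" = version of the previous entry and
// requires the next entry's prior_version to match it.
#include <cassert>
#include <string>

int main() {
  std::string last = "1624'39707";            // version of the first entry above
  std::string prior_version = "1624'39708";   // prior_version of the second entry
  bool is_error = false;                      // return_code is 0, so not an error entry
  // Mirrors assert(i->prior_version == last || i->is_error()):
  assert(prior_version == last || is_error);  // fires: 39708 never made it into the log
  return 0;
}
@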

#5 Updated by mingxin liu over 6 years ago

liu mingxin wrote:

@{
  "op": "modify",
  "object": "1:4ec1cf44:::rbd_data.100119495cff.0000000000002897:head",
  "version": "1624'39707",
  "prior_version": "1624'39706",
  "reqid": "client.77068.0:18102704",
  "extra_reqids": [],
  "mtime": "2017-11-12 03:44:55.752943",
  "return_code": 0,
  "mod_desc": {
    "object_mod_desc": {
      "can_local_rollback": true,
      "rollback_info_completed": false,
      "ops": [
        {
          "code": "SETATTRS",
          "attrs": [
            "_",
            "hinfo_key",
            "snapset"
          ]
        },
        {
          "code": "ROLLBACK_EXTENTS",
          "gen": 39707,
          "snaps": "[536576,4096]"
        }
      ]
    }
  }
}, {
  "op": "modify",
  "object": "1:4ec1cf44:::rbd_data.100119495cff.0000000000002897:head",
  "version": "1624'39709",
  "prior_version": "1624'39708",
  "reqid": "client.86630.0:19763476",
  "extra_reqids": [],
  "mtime": "2017-11-12 03:44:55.774969",
  "return_code": 0,
  "mod_desc": {
    "object_mod_desc": {
      "can_local_rollback": true,
      "rollback_info_completed": false,
      "ops": [
        {
          "code": "SETATTRS",
          "attrs": [
            "_",
            "hinfo_key",
            "snapset"
          ]
        },
        {
          "code": "ROLLBACK_EXTENTS",
          "gen": 39709,
          "snaps": "[540672,4096]"
        }
      ]
    }
  }
}
@
These are the last two updates; they are divergent entries, and version 39708 has disappeared. I guess the ops were applied out of order.

#6 Updated by Shinobu Kinjo over 6 years ago

  • Assignee set to Shinobu Kinjo

#7 Updated by Greg Farnum over 6 years ago

  • Category set to EC Pools
  • Priority changed from Normal to High

#8 Updated by Shinobu Kinjo over 6 years ago

  • Assignee deleted (Shinobu Kinjo)

#9 Updated by lingjie kong about 6 years ago

We hit this bug too, in a 2+1 EC pool. I found that one peer did not receive one of the op messages sent from the primary OSD, while the preceding and following ops had been handled and added to the log, which may cause this assert.

#11 Updated by Chang Liu about 6 years ago

  • Duplicates Bug #22916: OSD crashing in peering added

#12 Updated by Josh Durgin about 6 years ago

  • Status changed from New to Duplicate

#13 Updated by huang jun over 3 years ago

Chang Liu wrote:

see https://github.com/ceph/ceph/pull/16675

It seems that patch was already applied, judging from the "FAILED assert(i->prior_version == last || i->is_error())" message.
