Bug #52079

closed

bluefs mount failed to replay log: (5) Input/output error

Added by Viktor Svecov over 2 years ago. Updated over 2 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
pacific, octopus
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

In the test lab, after a simultaneous power-off of all three OSD nodes, two of them cannot start.

h2 node:

...
debug 2021-08-06T04:49:14.966+0000 7f42e40d5080 -1 bluefs _replay 0xb5000: stop: failed to decode: bad crc 1492738775 expected 0: Malformed input
debug 2021-08-06T04:49:14.966+0000 7f42e40d5080 -1 bluefs mount failed to replay log: (5) Input/output error
debug 2021-08-06T04:49:14.966+0000 7f42e40d5080 -1 bluestore(/var/lib/ceph/osd/ceph-1) _open_bluefs failed bluefs mount: (5) Input/output error
debug 2021-08-06T04:49:14.966+0000 7f42e40d5080 -1 bluestore(/var/lib/ceph/osd/ceph-1) _open_db failed to prepare db environment:
debug 2021-08-06T04:49:14.966+0000 7f42e40d5080  1 bdev(0x5621b7fce400 /var/lib/ceph/osd/ceph-1/block) close
debug 2021-08-06T04:49:15.226+0000 7f42e40d5080 -1 osd.1 0 OSD:init: unable to mount object store
debug 2021-08-06T04:49:15.226+0000 7f42e40d5080 -1  ** ERROR: osd init failed: (5) Input/output error

h3 node:

...
debug 2021-08-06T04:38:55.526+0000 7f139f28e080 -1 bluefs _replay 0xc07000: stop: failed to decode: bad crc 3449997429 expected 0: Malformed input
debug 2021-08-06T04:38:55.526+0000 7f139f28e080 -1 bluefs mount failed to replay log: (5) Input/output error
debug 2021-08-06T04:38:55.526+0000 7f139f28e080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _open_bluefs failed bluefs mount: (5) Input/output error
debug 2021-08-06T04:38:55.526+0000 7f139f28e080 -1 bluestore(/var/lib/ceph/osd/ceph-2) _open_db failed to prepare db environment:
debug 2021-08-06T04:38:55.526+0000 7f139f28e080  1 bdev(0x55898df4e400 /var/lib/ceph/osd/ceph-2/block) close
debug 2021-08-06T04:38:55.686+0000 7f139f28e080 -1 osd.2 0 OSD:init: unable to mount object store
debug 2021-08-06T04:38:55.686+0000 7f139f28e080 -1  ** ERROR: osd init failed: (5) Input/output error

Full log files attached.

ceph-bluestore-tool and ceph-objectstore-tool output the same error messages.
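
For the record, the BlueFS replay failure can also be reproduced offline with ceph-bluestore-tool; a minimal sketch, assuming the standard OSD data paths (the exact invocation we used may have differed):

# runs a BlueStore fsck; mounting BlueFS hits the same replay error
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2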

I am not sure whether this problem is related to any existing issue, especially bug #50965.


Files


Related issues 2 (0 open, 2 closed)

Copied to bluestore - Backport #52492: pacific: bluefs mount failed to replay log: (5) Input/output error (Resolved)
Copied to bluestore - Backport #52493: octopus: bluefs mount failed to replay log: (5) Input/output error (Resolved)
Actions #1

Updated by Igor Fedotov over 2 years ago

Could you please set debug-bluefs to 20, retry the startup attempt, and share the log?
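
One way to do this, assuming the OSD reads the usual ceph.conf (containerized deployments may use a per-daemon config file instead), is to add the option under the [osd] section and restart the daemon; a minimal sketch:

[osd]
debug_bluefs = 20

The same value should also be accepted on the ceph-osd command line, e.g. --debug_bluefs=20.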

Actions #2

Updated by Viktor Svecov over 2 years ago

Thank you for the help. I have attached log files with 'debug_bluefs = 20' from the two nodes.

Actions #3

Updated by Igor Fedotov over 2 years ago

Viktor Svecov wrote:

Thank you for the help. I have attached log files with 'debug_bluefs = 20' from the two nodes.

One of the new log files looks incomplete; could you please update it?

Actions #4

Updated by Viktor Svecov over 2 years ago

Sorry, I didn't notice that the stderr output stopped before the end of the actual log on node h3. The log for OSD node h3 is now complete.

Actions #5

Updated by Igor Fedotov over 2 years ago

Igor Fedotov wrote:

Viktor Svecov wrote:

Thank you for the help. I have attached log files with 'debug_bluefs = 20' from the two nodes.

One of the new log files looks incomplete; could you please update it?

Thanks for the update.
I've just shared my analysis and related questions on the Ceph dev mailing list, see
https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/DNDJQ656DMGLXJG7FPRAKXDVQYSJ7XMP/

I think you can try to recover the OSDs (and thereby additionally confirm my analysis) with the following steps (see the command sketch below):
For osd.1:
fill the 4K block at offset 0xB558715000 (=0xb558660000 + 0xb5000) with zeros (please make a backup first)
then try to start the OSD.

For osd.2:
this should be offset (if my math is valid) 0xcb72b60000 - 0x10000 + 0xc07000 = 0xCB73757000
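
A minimal command sketch of these steps, assuming the offsets are relative to the main block device of each OSD (/var/lib/ceph/osd/ceph-1/block and /var/lib/ceph/osd/ceph-2/block), the OSDs are stopped, and the backup file names are only examples:

# osd.1: back up the 4K block at 0xB558715000, then overwrite it with zeros
dd if=/var/lib/ceph/osd/ceph-1/block of=/root/osd1-0xB558715000.bak bs=4096 count=1 skip=$((0xB558715000 / 4096))
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-1/block bs=4096 count=1 seek=$((0xB558715000 / 4096)) conv=notrunc,fsync

# osd.2: same for the 4K block at 0xCB73757000
dd if=/var/lib/ceph/osd/ceph-2/block of=/root/osd2-0xCB73757000.bak bs=4096 count=1 skip=$((0xCB73757000 / 4096))
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-2/block bs=4096 count=1 seek=$((0xCB73757000 / 4096)) conv=notrunc,fsync

After that, try to start the OSDs again.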

Actions #6

Updated by Viktor Svecov over 2 years ago

You are right. After zeroing the appropriate areas of the OSD block devices, the OSD daemons started. Now all PGs of the Ceph storage cluster are active+clean. Thank you.

What are the future plans? How can I prevent such behaviour of BlueFS in the future?

Actions #7

Updated by Igor Fedotov over 2 years ago

  • Status changed from New to In Progress
  • Backport set to pacific, octopus
Actions #8

Updated by Igor Fedotov over 2 years ago

  • Status changed from In Progress to Fix Under Review
  • Pull request ID set to 42830
Actions #9

Updated by Igor Fedotov over 2 years ago

Viktor Svecov wrote:

You are right. After zeroing the appropriate areas of the OSD block devices, the OSD daemons started. Now all PGs of the Ceph storage cluster are active+clean. Thank you.

What are the future plans? How can I prevent such behaviour of BlueFS in the future?

You'll need to upgrade to the relevant pacific release which has the proper patch. I've just submitted one to master, so it still has to pass all the stages: review, merge into master, backport to pacific, minor pacific release...

Actions #10

Updated by Igor Fedotov over 2 years ago

  • Status changed from Fix Under Review to Resolved
Actions #11

Updated by Igor Fedotov over 2 years ago

  • Status changed from Resolved to Pending Backport
Actions #12

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52492: pacific: bluefs mount failed to replay log: (5) Input/output error added
Actions #13

Updated by Backport Bot over 2 years ago

  • Copied to Backport #52493: octopus: bluefs mount failed to replay log: (5) Input/output error added
Actions #14

Updated by Igor Fedotov over 2 years ago

  • Status changed from Pending Backport to Resolved