Bug #10225

keyvaluestore: OSDs do not start after few weeks of downtime (osd init failed / unable to read osd superblock)

Added by Dmitry Smirnov almost 7 years ago. Updated over 6 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On "Giant" I created seven KV OSDs (on 4 or 5 different hosts) before the cluster went down due to a cascade of OSD crashes (see #9978).
All KV OSDs were stopped, as it was suggested that they might be causing the problem (unfortunately this did not relieve the situation, as filestore-based OSDs are crashing as well...).
Some weeks passed. Now I'm trying to start the KV OSDs (to capture yet another crash log), but none of them will start:

2014-12-03 17:40:26.739455 7f6a23e88880  0 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578), process ceph-osd, pid 28587
2014-12-03 17:40:26.741767 7f6a23e88880  5 basedir /var/lib/ceph/osd/ceph-7
2014-12-03 17:40:26.741782 7f6a23e88880 10 mount fsid is ce1d4ff1-6ea7-48a0-b07e-9c9268acf892
2014-12-03 17:40:30.682296 7f6a23e88880 20 (init)genericobjectmap: seq is 1976429
2014-12-03 17:40:30.682420 7f6a23e88880  5 umount /var/lib/ceph/osd/ceph-7
2014-12-03 17:40:30.953776 7f6a23e88880  5 test_mount basedir /var/lib/ceph/osd/ceph-7
2014-12-03 17:40:30.953921 7f6a23e88880  5 basedir /var/lib/ceph/osd/ceph-7
2014-12-03 17:40:30.953932 7f6a23e88880 10 mount fsid is ce1d4ff1-6ea7-48a0-b07e-9c9268acf892
2014-12-03 17:40:34.745314 7f6a23e88880 20 (init)genericobjectmap: seq is 1976429
2014-12-03 17:40:34.745427 7f6a23e88880 15 read meta/23c2fcde/osd_superblock/0//-1 0~0
2014-12-03 17:40:34.745518 7f6a23e88880 20 lookup_strip_header failed to get strip_header  cid meta oid 23c2fcde/osd_superblock/0//-1
2014-12-03 17:40:34.745525 7f6a23e88880 10 read meta/23c2fcde/osd_superblock/0//-1 0~0 header isn't exist: r = -2
2014-12-03 17:40:34.745529 7f6a23e88880 -1 osd.7 0 OSD::init() : unable to read osd superblock
2014-12-03 17:40:34.745531 7f6a23e88880  5 umount /var/lib/ceph/osd/ceph-7
2014-12-03 17:40:35.012015 7f6a23e88880 -1 ESC[0;31m ** ERROR: osd init failed: (22) Invalid argumentESC[0m

Associated revisions

Revision 61769636 (diff)
Added by xie xingguo over 5 years ago

os/bluestore: end scope of std::hex properly; convert csum error to EIO

Mark's comments:

This passed "ceph_test_objectstore --gtest_filter=*/2".
This PR did not appear to have a significant impact on performance tests.

Closes #10225

os/bluestore: end scope of std::hex properly

To avoid side-effects by accident.

Signed-off-by: xie xingguo <>

os/bluestore: convert csum error to EIO

The verify_csum() method returns either -1 or -EOPNOTSUPP, which
is neither proper nor easy for users to understand.

Signed-off-by: xie xingguo <>

os/bluestore: assert lextent is shared

Otherwise we risk an access violation.

Signed-off-by: xie xingguo <>

os/bluestore: drop duplicated assignment of result code

These two methods never actually fail.

Signed-off-by: xie xingguo <>

os/bluestore: improve _do_read() a little

Signed-off-by: xie xingguo <>

os/bluestore: assert decoding of shard of key to be successful

Otherwise we risk dereferencing a null pointer.

Signed-off-by: xie xingguo <>

History

#1 Updated by Haomai Wang almost 7 years ago

Hmm, I'm not sure why this happened. It seems keyvaluestore lost the "osd_superblock"?

Did you upgrade Ceph?

#2 Updated by Dmitry Smirnov almost 7 years ago

I didn't do anything to Ceph because the cluster was down, so I didn't even start those OSDs. No Ceph upgrades were deployed. As far as I'm aware, the only thing that happened is that the KV OSDs were down for a few weeks. I seriously doubt that all seven KV OSDs on four different hosts could lose their superblocks at the same time. The HDDs are healthy, there are no file system errors, etc. If I'm not mistaken, at least two machines haven't even rebooted since the KV OSDs were stopped...

#3 Updated by Dmitry Smirnov almost 7 years ago

Oh yeah, another thing: some filestore-based OSDs crash at the end of the boot sequence, so I didn't bother starting them for some weeks either. Now I can start all the filestore OSDs that were down for a few weeks, but none of the KV OSDs...

#4 Updated by Sage Weil almost 7 years ago

  • Assignee set to Haomai Wang

Just a reminder that the "_dev" in "keyvaluestore_dev" means "experimental! danger! danger!". This code is not well-tested and should not be used in production.

#5 Updated by Dmitry Smirnov almost 7 years ago

Sage Weil wrote:

Just a reminder that the "_dev" in "keyvaluestore_dev" means "experimental! danger! danger!". This code is not well-tested and should not be used in production.

With all due respect, I started experimenting with "keyvaluestore_dev" because I have the same feelings about the stability of filestore OSDs, and I had a silly little hope that KV might behave just a little bit better... Anyway, one should try new features and report bugs once they are found...

#6 Updated by Dmitry Smirnov almost 7 years ago

  • Assignee deleted (Haomai Wang)

This issue has something to do with downtime. On the KV OSDs I checked the 'superblock' files and found that they are OK and binary-identical to a new superblock produced by "ceph-osd --mkfs". I also identified one OSD that could still start -- the only difference is that it was stopped later than the others, so this mysterious "decay period" had not broken it yet. Finally, out of curiosity/desperation, I tried to bring them up with 0.89 and it worked! Unfortunately #9978 is still not fixed... :(

#7 Updated by Dmitry Smirnov almost 7 years ago

  • Assignee set to Haomai Wang

#8 Updated by Haomai Wang almost 7 years ago

So there is no OSD superblock issue? Only the EC+KV problem that #9978 mentioned?

#9 Updated by Haomai Wang over 6 years ago

  • Status changed from New to Closed
