Bug #15520

closed

OSDs refuse to start, latest osdmap missing

Added by Markus Blank-Burian about 8 years ago. Updated almost 8 years ago.

Status: Rejected
Priority: Urgent
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We had a problem on our production cluster (running 9.2.1) which caused /proc, /dev and /sys to be unmounted. During this time, we received the following error on a large number of OSDs (for various osdmap epochs):

Apr 15 15:25:19 kaa-99 ceph-osd[4167]: 2016-04-15 15:25:19.457774 7f1c817fd700 0 filestore(/local/ceph/osd.43) write couldn't open meta/-1/c188e154/osdmap.276293/0: (2) No such file or directory

After restarting the hosts, the OSDs now refuse to start with:

Apr 15 16:03:53 kaa-99 ceph-osd[4211]: -2> 2016-04-15 16:03:53.089842 7f8e9f840840 10 _load_class version success
Apr 15 16:03:53 kaa-99 ceph-osd[4211]: -1> 2016-04-15 16:03:53.089863 7f8e9f840840 20 osd.43 0 get_map 276424 - loading and decoding 0x7f8e9b841780
Apr 15 16:03:53 kaa-99 ceph-osd[4211]: 0> 2016-04-15 16:03:53.140754 7f8e9f840840 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e9f840840 time 2016-04-15 16:03:53.139563
osd/OSD.h: 847: FAILED assert(ret)

Inserting the map with ceph-objectstore-tool --op set-osdmap does not work and gives the following error:

osdmap (-1/c1882e94/osdmap.276507/0) does not exist.
2016-04-15 17:14:00.335751 7f4b4d75b840 1 journal close /dev/ssd/journal.43

How can I get the OSDs running again?

Actions #1

Updated by Markus Blank-Burian about 8 years ago

Ok, so I tried touching the missing osdmap file, then ceph-objectstore-tool could insert the osdmap. The ceph-osd daemon started without an error. Are there any side effects or is it save to apply to all the other OSDs?

Actions #2

Updated by Kefu Chai about 8 years ago

Markus, you can pass the "--force" option to the ceph-objectstore-tool CLI:

 ceph-objectstore-tool --op set-osdmap <osdmap-file> --epoch <epoch-number> --force

to create an osdmap in the meta collection even if the osdmap does not exist.

"Are there any side effects?"

No, it just creates/overwrites the osdmap file for that OSD.

"is it save to apply to all the other OSDs?"

I think you meant to say "safe", not "save". Yes, the osdmap for a given epoch is the same across all OSDs in the cluster.
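
For reference, a minimal end-to-end sketch of that recovery on osd.43, assuming the monitors are still reachable, the OSD daemon is stopped, and the data/journal paths from the logs above are correct for your host. The epoch is illustrative, and depending on the release the map file is passed positionally (as in the command above) or via --file; check ceph-objectstore-tool --help for your build.

 # fetch the full osdmap for the missing epoch from the monitors
 ceph osd getmap 276424 -o /tmp/osdmap.276424
 # inject it into the stopped OSD's meta collection
 ceph-objectstore-tool --data-path /local/ceph/osd.43 --journal-path /dev/ssd/journal.43 --op set-osdmap --file /tmp/osdmap.276424 --force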

Actions #3

Updated by Markus Blank-Burian about 8 years ago

Thanks for the quick reply. All OSDs are now running again. There were some incomplete PGs in our EC pool, which I had to reset with ceph-objectstore-tool --op mark-complete.
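
For completeness, a hedged sketch of that mark-complete step; the PG id is a placeholder, and the paths are the ones from the logs above. The op is run against a single PG on a stopped OSD and tells it the PG is complete as-is, which can mean accepting data loss, so it is a last resort.

 ceph-objectstore-tool --data-path /local/ceph/osd.43 --journal-path /dev/ssd/journal.43 --pgid <pgid> --op mark-complete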

Actions #4

Updated by Kefu Chai about 8 years ago

  • Status changed from New to Rejected

I think this is not a bug in Ceph, so I am closing this issue as "Rejected".

Actions #5

Updated by shasha lu almost 8 years ago

I think this is a bug in RocksDB.

When a submit_transaction to RocksDB returns an error, Ceph does nothing with it, so the BlueStore data is not actually written correctly. The OSD keeps running fine until you restart it. On restart, the OSD tries to load the latest osdmap and, because of the earlier RocksDB write error, cannot find it.

This pull request handles the error: https://github.com/ceph/ceph/pull/8599. When RocksDB returns an error, the OSD will abort.

You can also simply avoid this by turning off the recycle_log_file_num feature of RocksDB; I think this feature has bugs.

I hit the error too when using the default bluestore_rocksdb_options, which is:
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3,recycle_log_file_num=16

I turned off recycle_log_file_num=16 by adding the following entry to ceph.conf:
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3

After that, RocksDB no longer returned errors.
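
For reference, a hedged ceph.conf sketch of that workaround; placing it in the [osd] section is an assumption, and the OSDs need a restart afterwards.

 [osd]
 # default value minus recycle_log_file_num=16, so RocksDB stops recycling WAL files
 bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=16,min_write_buffer_number_to_merge=3

You can check what a running OSD actually uses via its admin socket, e.g.:

 ceph daemon osd.43 config get bluestore_rocksdb_options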
