Bug #14406

closed

osd aborted when trying to decode osdmap with 0 bytes

Added by huang jun over 8 years ago. Updated about 6 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our Ceph cluster (v0.94.5) is configured with a cache tier plus an erasure-coded pool.
When we adjusted the OSD weights with "ceph osd reweight-by-pg 110 POOL",
one OSD went down and could not be restarted.

The osd log:
-10> 2016-01-19 11:21:25.533873 7f6350b80800 10 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] handle_loaded
-9> 2016-01-19 11:21:25.533902 7f6350b80800 5 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] exit Initial 0.057136 0 0.000000
-8> 2016-01-19 11:21:25.533914 7f6350b80800 5 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] enter Reset
-7> 2016-01-19 11:21:25.533921 7f6350b80800 20 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] set_last_peering_reset 544
-6> 2016-01-19 11:21:25.533928 7f6350b80800 10 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=544 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] Clearing blocked outgoing recovery messages
-5> 2016-01-19 11:21:25.533936 7f6350b80800 10 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=544 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] Not blocking outgoing recovery messages
-4> 2016-01-19 11:21:25.533943 7f6350b80800 10 osd.52 559 load_pgs loaded pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348
les/c/f 544/544/0 543/543/379) [52,21] r=0 lpr=544 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] log((495'12434,502'15465], crt=502'15463)
-3> 2016-01-19 11:21:25.533953 7f6350b80800 10 osd.52 559 pgid 3.1b4 coll 3.1b4_head
-2> 2016-01-19 11:21:25.534080 7f6350b80800 20 osd.52 0 get_map 553 - loading and decoding 0x7f635005fc80
-1> 2016-01-19 11:21:25.534144 7f6350b80800 10 osd.52 0 add_map_bl 553 0 bytes
0> 2016-01-19 11:21:25.536194 7f6350b80800 -1 *** Caught signal (Aborted) **

GDB debug log:
(gdb) bt
#0 0x000000343d60f5db in raise () from /lib64/libpthread.so.0
#1 0x0000000000a57266 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3 <signal handler called>
#4 0x000000343d232925 in raise () from /lib64/libc.so.6
#5 0x000000343d234105 in abort () from /lib64/libc.so.6
#6 0x000000343f6bea5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7 0x000000343f6bcbe6 in ?? () from /usr/lib64/libstdc++.so.6
#8 0x000000343f6bcc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9 0x000000343f6bcd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x0000000000bb2e9f in copy_out (this=0x7ffc6bc9f120, len=1, dest=0x7ffc6bc9f0df "") at include/buffer.h:226
#11 ceph::buffer::list::iterator::copy (this=0x7ffc6bc9f120, len=1, dest=0x7ffc6bc9f0df "") at common/buffer.cc:944
#12 0x0000000000b49524 in decode_raw<unsigned char> (this=0x7effbdc5fc80, bl=...) at include/encoding.h:57
#13 decode (this=0x7effbdc5fc80, bl=...) at include/encoding.h:64
#14 OSDMap::decode (this=0x7effbdc5fc80, bl=...) at osd/OSDMap.cc:2097
#15 0x0000000000b49f17 in OSDMap::decode (this=0x7effbdc5fc80, bl=<value optimized out>) at osd/OSDMap.cc:1970
#16 0x000000000064d0b3 in OSDService::try_get_map (this=0x7effbddc66c0, epoch=553) at osd/OSD.cc:1182
#17 0x0000000000672950 in OSD::load_pgs (this=0x7effbddc5000) at osd/OSD.cc:2827
#18 0x0000000000688cce in OSD::init (this=0x7effbddc5000) at osd/OSD.cc:1893
#19 0x000000000062f6ff in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_osd.cc:523
(gdb) f 17
#17 0x0000000000672950 in OSD::load_pgs (this=0x7effbddc5000) at osd/OSD.cc:2827
2827 OSDMapRef pgosdmap = service.try_get_map(map_epoch);
(gdb) p map_epoch
$1 = 553
(gdb) p pgid
$2 = {pgid = {m_pool = 3, m_seed = 436, m_preferred = -1}, shard = {id = 255 '\377', static NO_SHARD = {id = 255 '\377',
static NO_SHARD = <same as static member of an already seen type>}}}
(gdb) p bl
$3 = {_buffers = empty std::list, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x7ffc6bc9fae0,
ls = 0x7ffc6bc9fae0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}
(gdb) f 16
#16 0x000000000064d0b3 in OSDService::try_get_map (this=0x7effbddc66c0, epoch=553) at osd/OSD.cc:1182
1182 map->decode(bl);
(gdb) p epoch
$4 = 553
(gdb) p bl
$5 = {_buffers = empty std::list, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x7ffc6bc9f2d0,
ls = 0x7ffc6bc9f2d0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}
(gdb) p whoami
$7 = 52

Local osdmap file:
[root@srv5 meta]# stat DIR_2/DIR_8/inc\\uosdmap.553__0_C673C082__none
File: `DIR_2/DIR_8/inc\\uosdmap.553__0_C673C082__none'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 811h/2065d Inode: 805325626 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-01-19 10:34:48.124756907 +0800
Modify: 2016-01-19 10:34:48.124756907 +0800
Change: 2016-01-19 10:34:48.124756907 +0800
[root@srv5 meta]# stat DIR_5/DIR_5/osdmap.553__0_AC963655__none
File: `DIR_5/DIR_5/osdmap.553__0_AC963655__none'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 811h/2065d Inode: 5802706 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-01-19 10:34:48.134757123 +0800
Modify: 2016-01-19 10:34:48.134757123 +0800
Change: 2016-01-19 10:34:48.134757123 +0800

It seems that during the OSD boot procedure the OSD tries to decode a 0-length osdmap, which causes the abort;
the osdmap was read from the local filestore.

I think the core problem is how the osdmap ended up as 0 bytes on disk.
Can we report more information when trying to decode a 0-byte osdmap? (A hedged sketch follows.)
What should the OSD do when it hits this problem?

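For illustration only, here is a minimal sketch of the kind of guard that could answer the second question. It is not the actual hammer source: the backtrace only shows that OSDService::try_get_map (osd/OSD.cc:1182) calls map->decode(bl) on an empty bufferlist, so the surrounding code and the _get_map_bl helper name are assumptions based on that backtrace and the add_map_bl log line.

// Hedged sketch, not the shipped hammer code: check the buffer length before
// handing it to OSDMap::decode(), so a 0-byte map fails with a readable
// message instead of an uncaught buffer::end_of_buffer deep inside decode.
OSDMapRef OSDService::try_get_map(epoch_t epoch)   // signature as seen in frame #16
{
  OSDMap *map = new OSDMap;
  bufferlist bl;
  if (!_get_map_bl(epoch, bl)) {     // assumed helper, named after the add_map_bl log line
    delete map;
    return OSDMapRef();
  }
  if (bl.length() == 0) {
    // Zero bytes on disk (as the stat output above shows) means the stored
    // map is corrupt; say so explicitly and bail out instead of aborting.
    derr << "osdmap epoch " << epoch << " has 0 bytes in the local store; "
         << "refusing to decode a corrupt map" << dendl;
    delete map;
    return OSDMapRef();
  }
  map->decode(bl);
  return _add_map(map);              // assumed: cache the decoded map as usual
}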

Actions #1

Updated by Samuel Just about 8 years ago

  • Status changed from New to Can't reproduce

Sounds like a corrupt osd store. You could probably try to copy the osdmap from another osd. If you can reproduce the issue from when the osd is healthy until it has a 0 length osdmap with debug osd = 20, debug filestore = 20, debug ms = 1, feel free to reopen.

Actions #2

Updated by huang jun about 8 years ago

Yes, I copied the osdmap from another OSD and restarted the OSD, and it works.

Actions #3

Updated by huang jun about 8 years ago

There is a high chance this occurs when we set
filestore_fsync_flushes_journal_data = true
filestore_max_sync_interval = 60
and then cut off the machine's power.

Actions #4

Updated by Emmanuel Lacour almost 8 years ago

It seems I had a similar issue when upgrading from dumpling to firefly to hammer. In the last step I had two OSDs that did not want to start and crashed with the same kind of log:

-3> 2016-05-15 15:25:11.268633 7f5a1e13a840 10 _load_class rbd success
-2> 2016-05-15 15:25:11.268655 7f5a1e13a840 20 osd.6 0 get_map 3975 - loading and decoding 0x4caa240
-1> 2016-05-15 15:25:11.268720 7f5a1e13a840 10 osd.6 0 add_map_bl 3975 2985 bytes
0> 2016-05-15 15:25:11.272129 7f5a1e13a840 -1 *** Caught signal (Aborted) **
in thread 7f5a1e13a840
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: /usr/bin/ceph-osd() [0xbf03dc]
2: (()+0xf0a0) [0x7f5a1d0370a0]
3: (gsignal()+0x35) [0x7f5a1b98a125]
4: (abort()+0x180) [0x7f5a1b98d3a0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f5a1c1ed89d]
6: (()+0x63996) [0x7f5a1c1eb996]
7: (()+0x639c3) [0x7f5a1c1eb9c3]
8: (()+0x63bee) [0x7f5a1c1ebbee]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0xd524e7]
10: (OSDMap::decode_classic(ceph::buffer::list::iterator&)+0x438) [0xca02f8]
11: (OSDMap::decode(ceph::buffer::list::iterator&)+0x66) [0xca07d6]
12: (OSDMap::decode(ceph::buffer::list&)+0x3e) [0xca105e]
13: (OSDService::try_get_map(unsigned int)+0x373) [0x7cda33]
14: (OSDService::get_map(unsigned int)+0x16) [0x836fe6]
15: (OSD::init()+0xd44) [0x7ebac4]
16: (main()+0x253e) [0x78dd6e]
17: (__libc_start_main()+0xfd) [0x7f5a1b976ead]
18: /usr/bin/ceph-osd() [0x793de9]

After many tries I found this bug report and tested syncing the osdmap from another working OSD, though I did not find references on how to do this. I tried to sync the whole current/meta/, but the OSD complained that its ID is not the right one. I then restored the dir and synced only current/meta/DIR*; the OSD started, the cluster is now healthy again, and we do not seem to have had any data loss :)

My complaints:

- there is no real error message that the filestore/osdmap is corrupted; it is very hard to work that out from just the "add_map_bl" line (see the hedged sketch after this list)
- there does not seem to be any documentation about the filestore osdmap objects and how to recover from such corruption

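To make the first complaint concrete, here is a hedged sketch of the kind of wrapper the decode call site could use; it is an assumption for illustration, not Ceph source. Both the 0-byte case above and this truncated 2985-byte map die inside ceph::buffer::list::iterator::copy, whose exceptions derive from ceph::buffer::error, so catching that and printing the epoch and on-disk size would already point at the corrupt object.

try {
  map->decode(bl);
} catch (const ceph::buffer::error &e) {   // end_of_buffer, malformed_input, ...
  // Hedged sketch: still fatal, but the operator is told what is corrupt
  // and roughly where to look (the osdmap objects under current/meta/).
  derr << "failed to decode osdmap epoch " << epoch
       << " (" << bl.length() << " bytes in the local store): " << e.what()
       << "; the stored map appears corrupt, consider restoring the osdmap"
       << " objects under current/meta/ from a healthy OSD" << dendl;
  assert(0 == "corrupt osdmap in local store");
}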

Full log of one of the two OSDs, including both upgrades, all our tries with debug mode on, and finally the last successful start, together with ceph.conf, is here:

http://owncloud.home-dn.net/index.php/s/jctJmSpojLM1Peh

Actions #5

Updated by k z about 6 years ago

I think I have hit the same problem.
The OSD is blocked in the OSDMap::decode function after the system restarted.
The stack is the same as above.
Does anyone know anything about this?
