Project

General

Profile

Actions

Bug #14406

closed

osd aborted when try to decode osdmap with 0 bytes

Added by huang jun over 8 years ago. Updated over 6 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Our ceph cluster(v0.94.5) configured with cache tier + erasure code pool,
and when we adjust the osd weight by "ceph osd reweight-by-pg 110 POOL",
one osd down, and can not restart it.

The osd log:
-10> 2016-01-19 11:21:25.533873 7f6350b80800 10 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] handle_loaded
-9> 2016-01-19 11:21:25.533902 7f6350b80800 5 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] exit Initial 0.057136 0 0.000000
-8> 2016-01-19 11:21:25.533914 7f6350b80800 5 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] enter Reset
-7> 2016-01-19 11:21:25.533921 7f6350b80800 20 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=0 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] set_last_peering_reset 544
-6> 2016-01-19 11:21:25.533928 7f6350b80800 10 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=544 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] Clearing blocked outgoing recovery messages
-5> 2016-01-19 11:21:25.533936 7f6350b80800 10 osd.52 pg_epoch: 544 pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348 les/c/
f 544/544/0 543/543/379) [52,21] r=0 lpr=544 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] Not blocking outgoing recovery messages
-4> 2016-01-19 11:21:25.533943 7f6350b80800 10 osd.52 559 load_pgs loaded pg[3.1b0( v 502'15465 (495'12434,502'15465] local-les=544 n=121 ec=348
les/c/f 544/544/0 543/543/379) [52,21] r=0 lpr=544 crt=502'15463 lcod 0'0 mlcod 0'0 inactive] log((495'12434,502'15465], crt=502'15463)
-3> 2016-01-19 11:21:25.533953 7f6350b80800 10 osd.52 559 pgid 3.1b4 coll 3.1b4_head
-2> 2016-01-19 11:21:25.534080 7f6350b80800 20 osd.52 0 get_map 553 - loading and decoding 0x7f635005fc80
-1> 2016-01-19 11:21:25.534144 7f6350b80800 10 osd.52 0 add_map_bl 553 0 bytes
0> 2016-01-19 11:21:25.536194 7f6350b80800 -1 *** Caught signal (Aborted) **

GBD debug log:
(gdb) bt
#0 0x000000343d60f5db in raise () from /lib64/libpthread.so.0
#1 0x0000000000a57266 in reraise_fatal (signum=6) at global/signal_handler.cc:59
#2 handle_fatal_signal (signum=6) at global/signal_handler.cc:109
#3 <signal handler called>
#4 0x000000343d232925 in raise () from /lib64/libc.so.6
#5 0x000000343d234105 in abort () from /lib64/libc.so.6
#6 0x000000343f6bea5d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#7 0x000000343f6bcbe6 in ?? () from /usr/lib64/libstdc++.so.6
#8 0x000000343f6bcc13 in std::terminate() () from /usr/lib64/libstdc++.so.6
#9 0x000000343f6bcd0e in __cxa_throw () from /usr/lib64/libstdc++.so.6
#10 0x0000000000bb2e9f in copy_out (this=0x7ffc6bc9f120, len=1, dest=0x7ffc6bc9f0df "") at include/buffer.h:226
#11 ceph::buffer::list::iterator::copy (this=0x7ffc6bc9f120, len=1, dest=0x7ffc6bc9f0df "") at common/buffer.cc:944
#12 0x0000000000b49524 in decode_raw<unsigned char> (this=0x7effbdc5fc80, bl=...) at include/encoding.h:57
#13 decode (this=0x7effbdc5fc80, bl=...) at include/encoding.h:64
#14 OSDMap::decode (this=0x7effbdc5fc80, bl=...) at osd/OSDMap.cc:2097
#15 0x0000000000b49f17 in OSDMap::decode (this=0x7effbdc5fc80, bl=<value optimized out>) at osd/OSDMap.cc:1970
#16 0x000000000064d0b3 in OSDService::try_get_map (this=0x7effbddc66c0, epoch=553) at osd/OSD.cc:1182
#17 0x0000000000672950 in OSD::load_pgs (this=0x7effbddc5000) at osd/OSD.cc:2827
#18 0x0000000000688cce in OSD::init (this=0x7effbddc5000) at osd/OSD.cc:1893
#19 0x000000000062f6ff in main (argc=<value optimized out>, argv=<value optimized out>) at ceph_osd.cc:523
(gdb) f 17
#17 0x0000000000672950 in OSD::load_pgs (this=0x7effbddc5000) at osd/OSD.cc:2827
2827 OSDMapRef pgosdmap = service.try_get_map(map_epoch);
(gdb) p map_epoch
$1 = 553
(gdb) p pgid
$2 = {pgid = {m_pool = 3, m_seed = 436, m_preferred = -1}, shard = {id = 255 '\377', static NO_SHARD = {id = 255 '\377',
static NO_SHARD = <same as static member of an already seen type>}}}
(gdb) p bl
$3 = {_buffers = empty std::list, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x7ffc6bc9fae0,
ls = 0x7ffc6bc9fae0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}
(gdb) f 16
#16 0x000000000064d0b3 in OSDService::try_get_map (this=0x7effbddc66c0, epoch=553) at osd/OSD.cc:1182
1182 map->decode(bl);
(gdb) p epoch
$4 = 553
(gdb) p bl
$5 = {_buffers = empty std::list, _len = 0, _memcopy_count = 0, append_buffer = {_raw = 0x0, _off = 0, _len = 0}, last_p = {bl = 0x7ffc6bc9f2d0,
ls = 0x7ffc6bc9f2d0, off = 0, p = {_raw = , _off = 0, _len = 0}, p_off = 0}, static CLAIM_DEFAULT = 0, static CLAIM_ALLOW_NONSHAREABLE = 1}
(gdb) p whoami
$7 = 52

Local osdmap file:
[root@srv5 meta]# stat DIR_2/DIR_8/inc\\uosdmap.553__0_C673C082__none
File: `DIR_2/DIR_8/inc\\uosdmap.553__0_C673C082__none'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 811h/2065d Inode: 805325626 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-01-19 10:34:48.124756907 +0800
Modify: 2016-01-19 10:34:48.124756907 +0800
Change: 2016-01-19 10:34:48.124756907 +0800
[root@srv5 meta]# stat DIR_5/DIR_5/osdmap.553__0_AC963655__none
File: `DIR_5/DIR_5/osdmap.553__0_AC963655__none'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 811h/2065d Inode: 5802706 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-01-19 10:34:48.134757123 +0800
Modify: 2016-01-19 10:34:48.134757123 +0800
Change: 2016-01-19 10:34:48.134757123 +0800

Seems like in osd boot procedure, osd try to decode the 0 length osdmap, which cause the abort.
and the osdmap was read from local filestore.

I think the core problem is why the osdmap length 0 bytes.
can we provide more information when try to decide 0 bytes osdmap.
what should to do when osd met this problem?

Actions

Also available in: Atom PDF