Bug #11429

OSD::load_pgs: we need to handle the case where an upgrade from an earlier version, which ignored non-existent pgs, resurrects a pg with a prehistoric osdmap

Added by Samuel Just almost 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport: hammer,firefly
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

ceph-osd.25.log.tar.gz (173 KB) Irek Fasikhov, 05/26/2015 06:45 AM


Related issues

Related to Ceph - Bug #10617: osd: pgs for deleted pools don't finish getting removed if osd restarts Resolved 01/23/2015
Duplicated by Ceph - Bug #11305: "FAILED assert(ret)" in upgrade:dumpling-x-firefly-distro-basic-vps run Duplicate 04/01/2015
Duplicated by Ceph - Bug #11373: OSD crash in OSDService::get_map Duplicate 04/11/2015
Duplicated by Ceph - Bug #11554: Osds not start after upgrade Duplicate 05/07/2015

Associated revisions

Revision f0c925e3 (diff)
Added by Sage Weil over 8 years ago

suites/rados/singleton-nomsgr/all/11429.yaml: double-hop and fix

- simplify this.. lots of extra cruft we don't need
- restart twice at hammer to ensure that we can still load pgs
post-upgrade
- do the same for the final version.

Fixes: #11429 (again, for ~infernalis)
Fixes: #13060
Signed-off-by: Sage Weil <>

History

#1 Updated by Samuel Just almost 9 years ago

  • Status changed from New to 7

#2 Updated by Samuel Just almost 9 years ago

  • Status changed from 7 to 12

#3 Updated by Sage Weil almost 9 years ago

It seems like the safest option here would be to have users manually run ceph-objectstore-tool remove.

We could make the OSD automatically delete PGs when the map is ancient, but that seems dangerous to me since an epoch-related bug could trigger deletion.

#4 Updated by Ken Dreyer almost 9 years ago

Was there a patch that went in for this?

#5 Updated by Samuel Just almost 9 years ago

Yeah, I tried to make it remove the pg automatically, but it turned out to be complicated. Instead, it'll just skip the pg and complain into the log that the user should manually clean up the pg at some point in the future.
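
The skip-and-warn behaviour described above could be sketched roughly as follows. This is a simplified Python model, not the code from the patch: the function name `pgs_to_load`, its arguments, and the log wording are all hypothetical. What should happen when the map is missing but the pool still exists is not stated in this thread, so the sketch raises instead of guessing.

```python
import logging

log = logging.getLogger("load_pgs_sketch")


def pgs_to_load(pgs, stored_epochs, current_pools):
    """Return the pgids that can be loaded, skipping resurrected ones.

    pgs           -- dict mapping pgid (e.g. "3.1f") to the osdmap epoch it recorded
    stored_epochs -- set of osdmap epochs still present on this OSD
    current_pools -- set of pool ids present in the current osdmap
    (All three inputs are hypothetical stand-ins for the OSD's real state.)
    """
    loadable = []
    for pgid, epoch in sorted(pgs.items()):
        pool = int(pgid.split(".", 1)[0])  # the pool id is the part before the dot
        if epoch in stored_epochs:
            loadable.append(pgid)
        elif pool not in current_pools:
            # Prehistoric map and the pool is gone: skip and complain.
            log.warning(
                "pg %s: no osdmap for epoch %d and pool %d no longer exists; "
                "skipping, clean it up manually later",
                pgid, epoch, pool,
            )
        else:
            # Assumption: a missing map for a live pool is a different problem.
            raise RuntimeError(
                "pg %s: missing osdmap for epoch %d but pool %d exists"
                % (pgid, epoch, pool)
            )
    return loadable
```

The point of the sketch is that a skipped pg is only reported, never deleted, which avoids the risk that an epoch-related bug triggers automatic deletion.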

#6 Updated by Tuomas Juntunen almost 9 years ago

Could someone describe the process for using ceph-objectstore-tool remove?

The suggestion to get the pgs and compare them against the invalid pgs in 'ceph osd pool ls detail' seems too vague.

T
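
One way to make the comparison less vague is sketched below. It assumes two text inputs collected by hand: the pgids stored on the stopped OSD (ceph-objectstore-tool can list them, `--op list-pgs` as far as I recall; check `--help` on your version) and the output of 'ceph osd pool ls detail'. The helper names are hypothetical, and the assumption that each pool line starts with `pool <id> ` should be checked against your own output. A pg whose pool id is absent from the current pool list is a candidate for manual removal.

```python
import re

# Assumption: each pool in 'ceph osd pool ls detail' starts a line "pool <id> '<name>' ..."
POOL_LINE = re.compile(r"^pool (\d+) ")


def parse_pool_ids(pool_ls_detail):
    """Collect the pool ids that exist in the current osdmap."""
    ids = set()
    for line in pool_ls_detail.splitlines():
        match = POOL_LINE.match(line.strip())
        if match:
            ids.add(int(match.group(1)))
    return ids


def orphaned_pgs(pgids, pool_ids):
    """Return the pgids whose pool (the part before the dot) no longer exists."""
    orphans = []
    for pgid in pgids:
        pgid = pgid.strip()
        if not pgid:
            continue  # ignore blank lines in the listing
        pool = int(pgid.split(".", 1)[0])
        if pool not in pool_ids:
            orphans.append(pgid)
    return orphans


if __name__ == "__main__":
    pools = parse_pool_ids("pool 0 'rbd' replicated size 3\npool 2 'data' replicated size 2\n")
    print(orphaned_pgs(["0.1a", "5.3", "2.0", "5.7f"], pools))
```

Review the resulting list by hand before removing anything; the script only reports candidates and does not touch the OSD.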

#7 Updated by Samuel Just almost 9 years ago

  • Status changed from 12 to Pending Backport

#8 Updated by Ken Dreyer almost 9 years ago

Patch that went into master: https://github.com/ceph/ceph/pull/4539

#9 Updated by Loïc Dachary almost 9 years ago

  • Severity changed from 3 - minor to 1 - critical

#12 Updated by Loïc Dachary almost 9 years ago

  • Regression set to No

#14 Updated by Loïc Dachary almost 9 years ago

#15 Updated by Loïc Dachary almost 9 years ago

Since the task installs firefly to reproduce the problem, it will become a no-op as soon as the bug is fixed in v0.80.10+. It should install v0.80.9 instead of firefly.

#16 Updated by Loïc Dachary almost 9 years ago

<sjustwork> loicd: http://tracker.ceph.com/issues/11429 I think the task installs v0.80.8
<sjustwork> not firefly
<loicd> sjustwork: ah cool, my mistake
<loicd> sjustwork: I wonder why I thought it installed firefly... sorry for the noise
<sjustwork> ok
<loicd> I probably read  print: '**** done installing firefly'
<loicd> instead of the line before
<loicd>    branch: v0.80.8
<loicd> oh well

#17 Updated by Loïc Dachary almost 9 years ago

  • Status changed from Pending Backport to Resolved

#18 Updated by Irek Fasikhov almost 9 years ago

  • File ceph-osd.25.log.tar.gz added

Hi Loic, Samuel

The problem remains despite the patch.
Look at the attached log file.

[root@ceph03p24 ceph]# ceph -v
ceph version 0.94.1-116-g63832d4 (63832d4039889b6b704b88b86eaba4aadcfceb2e)

Configuration

[osd]
        osd journal size = 10000
        osd mkfs type = xfs
        osd mkfs options xfs = -f -i size=2048
        osd mount options xfs = rw,noatime,inode64,logbsize=256k,allocsize=1m
        filestore xattr use omap = true

        osd scrub load threshold = 2
        osd recovery op priority = 2
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery threads = 1
        osd crush update on start = false
        osd recovery delay start = 5
        osd snap trim sleep = 0.5
        osd disk thread ioprio class = idle
        osd disk thread ioprio priority = 7

        debug_objecter = 10/10
        debug_ms = 10/10
        debug_filestore = 10/10
        debug_osd = 10/10
        debug_journal = 10/10

#19 Updated by Irek Fasikhov almost 9 years ago

        debug_objecter = 20/20
        debug_ms = 20/20
        debug_filestore = 20/20
        debug_osd = 20/20
        debug_journal = 20/20

#20 Updated by Loïc Dachary almost 9 years ago

  • File deleted (ceph-osd.25.log.tar.gz)

#21 Updated by Loïc Dachary almost 9 years ago

From ceph-osd.25.log.tar.gz

    -5> 2015-05-26 09:42:54.997851 7f0fbfa34880 10 _load_class version success
    -4> 2015-05-26 09:42:54.997862 7f0fbfa34880 20 osd.25 0 get_map 17735 - loading and decoding 0x4589200
    -3> 2015-05-26 09:42:54.997869 7f0fbfa34880 15 filestore(/var/lib/ceph/osd/ceph-25) read meta/4e928679/osdmap.17735/0//-1 0~0
    -2> 2015-05-26 09:42:54.997890 7f0fbfa34880 10 filestore(/var/lib/ceph/osd/ceph-25) error opening file /var/lib/ceph/osd/ceph-25/current/meta/DIR_9/DIR_7/osdmap.17735__0_4E928679__none with flags=2: (2) No such file or directory
    -1> 2015-05-26 09:42:54.997899 7f0fbfa34880 10 filestore(/var/lib/ceph/osd/ceph-25) FileStore::read(meta/4e928679/osdmap.17735/0//-1) open error: (2) No such file or directory
     0> 2015-05-26 09:42:54.999254 7f0fbfa34880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f0fbfa34880 time 2015-05-26 09:42:54.997908
osd/OSD.h: 716: FAILED assert(ret)

 ceph version 0.94.1-116-g63832d4 (63832d4039889b6b704b88b86eaba4aadcfceb2e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc4e15]
 2: (OSDService::get_map(unsigned int)+0x3f) [0x6ffa9f]
 3: (OSD::init()+0x6b7) [0x6b8e17]
 4: (main()+0x27f3) [0x643b63]
 5: (__libc_start_main()+0xf5) [0x7f0fbcdd2af5]
 6: /usr/bin/ceph-osd() [0x65cdc9]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

#22 Updated by Irek Fasikhov almost 9 years ago

Loic, this output was produced with the patch already applied: https://github.com/ceph/ceph/pull/4559
Or do I need to recreate the OSD to fix it?
Thanks

#23 Updated by Loïc Dachary almost 9 years ago

What you're seeing is different: it's a failure to load the osdmap because the epoch in the osd superblock is a reference to an osdmap that does not exist. This bug is about a failure to load an osdmap referenced from a resurrected pg. Would you mind creating another bug report with the same information?

#24 Updated by Irek Fasikhov almost 9 years ago

Loic Dachary wrote:

What you're seeing is different: it's a failure to load the osdmap because the epoch in the osd superblock is a reference to an osdmap that does not exist. This bug is about a failure to load an osdmap referenced from a resurrected pg. Would you mind creating another bug report with the same information?

Of course, I will create one.

#25 Updated by Irek Fasikhov almost 9 years ago

Loic, there is already one, and it is attached to the current issue: #11373

#26 Updated by Loïc Dachary almost 9 years ago

The bug #11373 is a duplicate of this one and the trace shows it crashes in load_pgs. Your problem seems slightly different: it does not involve load_pgs.
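
The distinction drawn here can be checked mechanically against a backtrace. The following is a minimal sketch with a hypothetical helper name, `crash_site`; it assumes the frames carry demangled symbol names as in the traces posted in this thread, and that a crash inside load_pgs would show an `OSD::load_pgs` frame (which inlining could hide).

```python
def crash_site(backtrace):
    """Classify where a failed OSDService::get_map call originated.

    Returns "load_pgs" if the trace passes through OSD::load_pgs,
    "init" if get_map was reached from OSD::init without load_pgs,
    and "other" for anything else.
    """
    frames = [line for line in backtrace.splitlines() if "(" in line]
    if not any("OSDService::get_map" in frame for frame in frames):
        return "other"
    # load_pgs is itself called from OSD::init, so test for it first.
    if any("OSD::load_pgs" in frame for frame in frames):
        return "load_pgs"
    if any("OSD::init" in frame for frame in frames):
        return "init"
    return "other"
```

Applied to the backtrace quoted in comment #21, this returns "init", which matches the observation that the trace does not involve load_pgs.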

#28 Updated by Srikanth Madugundi almost 9 years ago

We recently started seeing this crash on some of our OSDs. We applied the patch to firefly, but it did not fix the crash.

-5> 2015-06-03 05:31:25.986906 7f54fc5ee780 10 register_cxx_method kvs.create_with_omap flags 2 0x7f54ee423000
-4> 2015-06-03 05:31:25.986908 7f54fc5ee780 10 register_cxx_method kvs.omap_remove flags 2 0x7f54ee422110
-3> 2015-06-03 05:31:25.986909 7f54fc5ee780 10 register_cxx_method kvs.maybe_read_for_balance flags 1 0x7f54ee422820
-2> 2015-06-03 05:31:25.986911 7f54fc5ee780 10 _load_class kvs success
-1> 2015-06-03 05:31:25.986926 7f54fc5ee780 20 osd.61 0 get_map 11487 - loading and decoding 0x219f000
0> 2015-06-03 05:31:25.987927 7f54fc5ee780 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f54fc5ee780 time 2015-06-03 05:31:25.986975
osd/OSD.h: 634: FAILED assert(ret)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (OSDService::get_map(unsigned int)+0x3f) [0x68e86f]
2: (OSD::init()+0x2259) [0x64e529]
3: (main()+0x35aa) [0x5f991a]
4: (__libc_start_main()+0xfd) [0x3f3281ed5d]
5: /home/y/bin64/ceph-osd() [0x5f5fb9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
