Bug #11429 (Closed)
OSD::load_pgs: we need to handle the case where an upgrade from earlier versions which ignored non-existent pgs resurrects a pg with a prehistoric osdmap
Added by Samuel Just about 9 years ago. Updated almost 9 years ago.
Files
ceph-osd.25.log.tar.gz (173 KB) — Irek Fasikhov, 05/26/2015 06:45 AM
Updated by Sage Weil about 9 years ago
It seems like the safest option here would be to have users manually run ceph-objectstore-tool remove.
We could make the OSD automatically delete PGs when the map is ancient, but that seems dangerous to me since an epoch-related bug could trigger deletion.
Updated by Ken Dreyer about 9 years ago
Was there a patch that went in for this?
Updated by Samuel Just about 9 years ago
Yeah, I tried to make it remove the pg automatically, but it turned out to be complicated. Instead, it'll just skip the pg and complain into the log that the user should manually clean up the pg at some point in the future.
Updated by Tuomas Juntunen about 9 years ago
Could someone describe the process for using ceph-objectstore-tool remove?
The suggestion to list the pgs on the OSD and compare them against the invalid pgs from 'ceph osd pool ls detail' seems too vague.
T
Updated by Samuel Just about 9 years ago
- Status changed from 12 to Pending Backport
Updated by Ken Dreyer about 9 years ago
Patch that went into master: https://github.com/ceph/ceph/pull/4539
Updated by Loïc Dachary about 9 years ago
- Severity changed from 3 - minor to 1 - critical
Updated by Xinxin Shu almost 9 years ago
- firefly backport https://github.com/ceph/ceph/pull/4556
Updated by Loïc Dachary almost 9 years ago
- hammer backport https://github.com/ceph/ceph/pull/4559
Updated by Loïc Dachary almost 9 years ago
- Regression set to No
- ceph-qa-suite master https://github.com/ceph/ceph-qa-suite/pull/428
Updated by Loïc Dachary almost 9 years ago
- ceph-qa-suite hammer backport https://github.com/ceph/ceph-qa-suite/pull/432
Updated by Loïc Dachary almost 9 years ago
- ceph-qa-suite firefly backport https://github.com/ceph/ceph-qa-suite/pull/435
Updated by Loïc Dachary almost 9 years ago
since the task installs firefly to reproduce the problem, it will become a noop as soon as the bug is fixed in v0.80.10+. It should install v0.80.9 instead of firefly.
Updated by Loïc Dachary almost 9 years ago
<sjustwork> loicd: http://tracker.ceph.com/issues/11429 I think the task installs v0.80.8
<sjustwork> not firefly
<loicd> sjustwork: ah cool, my mistake
<loicd> sjustwork: I wonder why I thought it installed firefly... sorry for the noise
<sjustwork> ok
<loicd> I probably read print: '**** done installing firefly'
<loicd> instead of the line before
<loicd> branch: v0.80.8
<loicd> oh well
Updated by Loïc Dachary almost 9 years ago
- Status changed from Pending Backport to Resolved
Updated by Irek Fasikhov almost 9 years ago
- File ceph-osd.25.log.tar.gz added
Hi Loïc, Samuel,
The problem remains despite the patch.
Look at the attached log file.
[root@ceph03p24 ceph]# ceph -v
ceph version 0.94.1-116-g63832d4 (63832d4039889b6b704b88b86eaba4aadcfceb2e)
Configuration
[osd]
osd journal size = 10000
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,inode64,logbsize=256k,allocsize=1m
filestore xattr use omap = true
osd scrub load threshold = 2
osd recovery op priority = 2
osd max backfills = 1
osd recovery max active = 1
osd recovery threads = 1
osd crush update on start = false
osd recovery delay start = 5
osd snap trim sleep = 0.5
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
debug_objecter = 10/10
debug_ms = 10/10
debug_filestore = 10/10
debug_osd = 10/10
debug_journal = 10/10
Updated by Irek Fasikhov almost 9 years ago
- File ceph-osd.25.log.tar.gz ceph-osd.25.log.tar.gz added
debug_objecter = 20/20
debug_ms = 20/20
debug_filestore = 20/20
debug_osd = 20/20
debug_journal = 20/20
Updated by Loïc Dachary almost 9 years ago
- File deleted (ceph-osd.25.log.tar.gz)
Updated by Loïc Dachary almost 9 years ago
From ceph-osd.25.log.tar.gz
-5> 2015-05-26 09:42:54.997851 7f0fbfa34880 10 _load_class version success
-4> 2015-05-26 09:42:54.997862 7f0fbfa34880 20 osd.25 0 get_map 17735 - loading and decoding 0x4589200
-3> 2015-05-26 09:42:54.997869 7f0fbfa34880 15 filestore(/var/lib/ceph/osd/ceph-25) read meta/4e928679/osdmap.17735/0//-1 0~0
-2> 2015-05-26 09:42:54.997890 7f0fbfa34880 10 filestore(/var/lib/ceph/osd/ceph-25) error opening file /var/lib/ceph/osd/ceph-25/current/meta/DIR_9/DIR_7/osdmap.17735__0_4E928679__none with flags=2: (2) No such file or directory
-1> 2015-05-26 09:42:54.997899 7f0fbfa34880 10 filestore(/var/lib/ceph/osd/ceph-25) FileStore::read(meta/4e928679/osdmap.17735/0//-1) open error: (2) No such file or directory
0> 2015-05-26 09:42:54.999254 7f0fbfa34880 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f0fbfa34880 time 2015-05-26 09:42:54.997908
osd/OSD.h: 716: FAILED assert(ret)
ceph version 0.94.1-116-g63832d4 (63832d4039889b6b704b88b86eaba4aadcfceb2e)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc4e15]
2: (OSDService::get_map(unsigned int)+0x3f) [0x6ffa9f]
3: (OSD::init()+0x6b7) [0x6b8e17]
4: (main()+0x27f3) [0x643b63]
5: (__libc_start_main()+0xf5) [0x7f0fbcdd2af5]
6: /usr/bin/ceph-osd() [0x65cdc9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Updated by Irek Fasikhov almost 9 years ago
Loïc, this output was produced with the patch already applied: https://github.com/ceph/ceph/pull/4559
Or do I need to recreate the OSD to fix this?
Thanks
Updated by Loïc Dachary almost 9 years ago
What you're hitting is different: it's a failure to load the osdmap because the epoch in the osd superblock references an osdmap that does not exist. This bug is about a failure to load an osdmap referenced from a resurrected pg. Would you mind creating another bug report with the same information?
Updated by Irek Fasikhov almost 9 years ago
Loic Dachary wrote:
What you're hitting is different: it's a failure to load the osdmap because the epoch in the osd superblock references an osdmap that does not exist. This bug is about a failure to load an osdmap referenced from a resurrected pg. Would you mind creating another bug report with the same information?
Of course, I will create one.
Updated by Irek Fasikhov almost 9 years ago
Loïc, there is already one, and it is attached to the current task: #11373
Updated by Loïc Dachary almost 9 years ago
The bug #11373 is a duplicate of this one and the trace shows it crashes in load_pgs. Your problem seems slightly different: it does not involve load_pgs.
Updated by Srikanth Madugundi almost 9 years ago
We recently started seeing this crash in some of our OSDs; we applied the patch to firefly, but it did not fix the crash.
-5> 2015-06-03 05:31:25.986906 7f54fc5ee780 10 register_cxx_method kvs.create_with_omap flags 2 0x7f54ee423000
-4> 2015-06-03 05:31:25.986908 7f54fc5ee780 10 register_cxx_method kvs.omap_remove flags 2 0x7f54ee422110
-3> 2015-06-03 05:31:25.986909 7f54fc5ee780 10 register_cxx_method kvs.maybe_read_for_balance flags 1 0x7f54ee422820
-2> 2015-06-03 05:31:25.986911 7f54fc5ee780 10 _load_class kvs success
-1> 2015-06-03 05:31:25.986926 7f54fc5ee780 20 osd.61 0 get_map 11487 - loading and decoding 0x219f000
0> 2015-06-03 05:31:25.987927 7f54fc5ee780 -1 osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f54fc5ee780 time 2015-06-03 05:31:25.986975
osd/OSD.h: 634: FAILED assert(ret)
ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
1: (OSDService::get_map(unsigned int)+0x3f) [0x68e86f]
2: (OSD::init()+0x2259) [0x64e529]
3: (main()+0x35aa) [0x5f991a]
4: (__libc_start_main()+0xfd) [0x3f3281ed5d]
5: /home/y/bin64/ceph-osd() [0x5f5fb9]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.