Bug #19909 (closed)

PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))

Added by Dan van der Ster almost 7 years ago. Updated over 6 years ago.

Status:
Won't Fix
Priority:
High
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

After updating OSDs from 12.0.1 to 12.0.1-2248-g745902a, all OSDs fail like this:

    -1> 2017-05-11 14:44:20.024794 7f797335e700 -1 osd.58 1886 failed to load OSD map for epoch 269, got 0 bytes
     0> 2017-05-11 14:44:20.027699 7f797335e700 -1 /home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2248-g745902a/rpm/el7/BUILD/ceph-12.0.1-2248-g745902a/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f797335e700 time 2017-05-11 14:44:20.024809
/home/jenkins-build/build/workspace/ceph-dev-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.0.1-2248-g745902a/rpm/el7/BUILD/ceph-12.0.1-2248-g745902a/src/osd/OSD.h: 1064: FAILED assert(ret)

 ceph version 12.0.1-2248-g745902a (745902aec6b1a0395ca244ae280abb2e99183189)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f7981b1c0c0]
 2: (OSDService::get_map(unsigned int)+0x3d) [0x7f798164c5bd]
 3: (OSD::build_initial_pg_history(spg_t, unsigned int, utime_t, pg_history_t*, PastIntervals*)+0x186) [0x7f79815efa06]
 4: (OSD::handle_pg_create(boost::intrusive_ptr<OpRequest>)+0x9b4) [0x7f79815faa04]
 5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x1e1) [0x7f79815fcb91]
 6: (OSD::_dispatch(Message*)+0x3a7) [0x7f79815fd5b7]
 7: (OSD::ms_dispatch(Message*)+0x87) [0x7f79815fd9f7]
 8: (DispatchQueue::entry()+0x7a2) [0x7f7981cf6d82]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f7981b9a62d]
 10: (()+0x7dc5) [0x7f797e765dc5]
 11: (clone()+0x6d) [0x7f797d65473d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
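
The "failed to load OSD map for epoch 269, got 0 bytes" line and the assert are two views of the same failure: OSDService::get_map() insists that the requested epoch can be loaded from the OSD's store. A minimal, self-contained sketch of that relationship (stand-in types, not the real OSD.h code):

#include <cassert>
#include <cstdint>
#include <map>
#include <memory>

using epoch_t = uint32_t;
struct FakeOSDMap {};                                // stand-in for OSDMap
using OSDMapRef = std::shared_ptr<const FakeOSDMap>;

std::map<epoch_t, OSDMapRef> store;                  // epochs this OSD actually has on disk

// Returns null when the epoch is absent, like an on-disk read that "got 0 bytes".
OSDMapRef try_get_map(epoch_t e) {
  auto it = store.find(e);
  return it == store.end() ? nullptr : it->second;
}

// Mirrors the assert(ret) at OSD.h:1064: any epoch that cannot be loaded is fatal.
OSDMapRef get_map(epoch_t e) {
  OSDMapRef ret = try_get_map(e);
  assert(ret);
  return ret;
}

int main() {
  store[270] = std::make_shared<FakeOSDMap>();
  get_map(270);      // fine: the epoch exists
  // get_map(269);   // would abort, matching the "epoch 269" failure above
  return 0;
}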

Here's the ceph status:

[14:58][root@p06253939f44928 (ceph_dev:ceph/halpert/mon*2) ~]# ceph status
2017-05-11 14:58:30.902776 7f9be729b700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-05-11 14:58:30.924144 7f9be729b700 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
    cluster 1d0f18a4-bf2d-43cc-9b91-511570f1a037
     health HEALTH_ERR
            17149 pgs are stuck inactive for more than 300 seconds
            6455 pgs degraded
            250 pgs down
            1871 pgs peering
            8576 pgs stale
            6455 pgs stuck degraded
            8573 pgs stuck inactive
            8576 pgs stuck stale
            8576 pgs stuck unclean
            6455 pgs stuck undersized
            6455 pgs undersized
            recovery 119682/240036 objects degraded (49.860%)
            mds0: 180 slow requests are blocked > 30 sec
            50/83 in osds are down
     monmap e4: 3 mons at {p06253939f44928=128.142.162.29:6789/0,p06253939y22952=128.142.162.19:6789/0,p06253939y31416=128.142.162.17:6789/0}
            election epoch 68, quorum 0,1,2 p06253939y31416,p06253939y22952,p06253939f44928
      fsmap e3081: 3/3/3 up {0=cephhalpert-mds-981001588f=up:active,1=cephhalpert-mds-96d8ad3ea3=up:active,2=cephhalpert-mds-135c39f87d=up:active}, 1 up:standby
        mgr active: p06253939f44928 standbys: p06253939y22952, p06253939y31416
     osdmap e1959: 112 osds: 33 up, 83 in; 22 remapped pgs
      pgmap v1984586: 8576 pgs, 3 pools, 5943 MB data, 80012 objects
            198 GB used, 379 TB / 379 TB avail
            99.965% pgs inactive
            119682/240036 objects degraded (49.860%)
                6451 stale+undersized+degraded+peered
                1861 stale+peering
                 250 stale+down
                  10 stale+remapped+peering
                   3 stale+active+undersized+degraded
                   1 stale+activating+undersized+degraded
Actions #1

Updated by Greg Farnum almost 7 years ago

I'm just pattern-matching from going through my email, but https://github.com/ceph/ceph/pull/15046 is about OSDMap decoding and references a pretty new commit. Try with that patch applied?

Actions #2

Updated by Greg Farnum almost 7 years ago

Or Sage/Zheng can confirm if this failure mode matches that error...

Actions #3

Updated by Bertrand Gouny almost 7 years ago

I have the same error with 12.0.3:


2017-05-19 14:50:28.126524 7f3e37383c80 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2017-05-19 14:50:29.209771 7f3e37383c80 -1 osd.1 137157 log_to_monitors {default=true}
2017-05-19 14:50:30.296271 7f3e215e2700 -1 osd.1 137160 failed to load OSD map for epoch 1, got 0 bytes
/build/ceph-12.0.3/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f3e215e2700 time 2017-05-19 14:50:30.296284
/build/ceph-12.0.3/src/osd/OSD.h: 1064: FAILED assert(ret)
 ceph version 12.0.3 (f2337d1b42fa49dbb0a93e4048a42762e3dffbbf)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5599751ec932]
 2: (OSDService::get_map(unsigned int)+0x5d) [0x559974ca97ed]
 3: (OSD::build_initial_pg_history(spg_t, unsigned int, utime_t, pg_history_t*, PastIntervals*)+0x1bc) [0x559974c42d3c]
 4: (OSD::handle_pg_create(boost::intrusive_ptr<OpRequest>)+0xa84) [0x559974c62dc4]
 5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x1b1) [0x559974c64e51]
 6: (OSD::_dispatch(Message*)+0x34d) [0x559974c658bd]
 7: (OSD::ms_dispatch(Message*)+0x87) [0x559974c65c37]
 8: (DispatchQueue::entry()+0xf4a) [0x5599753e25fa]
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x559975274f2d]
 10: (()+0x76ba) [0x7f3e3585a6ba]
 11: (clone()+0x6d) [0x7f3e348d182d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #4

Updated by WANG Guoqin almost 7 years ago

This still happens in 12.0.3, with the patch https://github.com/ceph/ceph/pull/15046 applied.

Actions #5

Updated by Jason McNeil almost 7 years ago

I have no idea what the repercussions are (I'm thinking I'll back up and recreate the cluster), but if you write an osdmap into all of the empty epochs, it appears possible to recover from this.

In my case, osdmap epochs 178 through 2671 were empty and had to have a map set at every epoch before the OSDs would start again. I pulled some epochs via the monitor; others I extracted from the other OSDs (so in theory all OSDs should now have the same history, even if some of it is lies).

I believe that rolling back to 12.0.1 didn't work (a different error, I believe; it seems I didn't save that output).

I've posted more details in a gist, https://gist.github.com/jasonrm/55b26261271b23fb707fa093d2d2af95

Actions #6

Updated by Greg Farnum almost 7 years ago

  • Project changed from Ceph to RADOS
  • Priority changed from Normal to Urgent

I presume this was a bug in the older dev releases, but we should verify that before release.

Actions #7

Updated by WANG Guoqin almost 7 years ago

A month has passed and I'm still not able to figure out where the problem is, nor am I able to recover my cluster. I tried the most recent build from chacra last week, but no luck.

The problem seems to be, as Jason figured out, that some maps are empty. In my case there are about 50k osdmaps, so I'm not willing to do what Jason did.

Not much has turned up online; it seems to be a new problem. Mine is an 18-node OSD cluster: 3 nodes are online all the time, 3 or 4 keep cycling up and down, and about 10 remain offline. I don't have a clue what distinguishes the online OSDs from the offline ones, but they are all on 12.0.3.

P.S. "missing creating pgs; upgrade from post-kraken?" has appeared in the logs for both 12.0.2 and 12.0.3. I'm not sure whether it's relevant, and I saw that the condition is now tolerated in 12.0.3, but why isn't that structure simply upgraded, fixing the problem in the first place? Are we supposed to recreate the whole cluster for every major upgrade?

Actions #8

Updated by Greg Farnum almost 7 years ago

N.0.Y releases such as 12.0.2 are dev releases; you should not run them if you can't afford to rebuild the cluster. Upgrades across releases are supported and usually work even on dev releases, but they're basically just snapshots of the master branch without any special QA, so bugs will arise.
In the case of the "missing creating pgs" message, I don't think it will actually cause issues; the monitors just create the needed structure as an empty one, but I don't recall for sure. And it happens because we have code for upgrading from Kraken to Luminous, but we did not try to support rolling forward dev code that was in the intermediate state.

Actions #9

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix

What Greg said! :)

Actions #10

Updated by red ref almost 7 years ago

I can confirm the second behavior ("failed to load OSD map for epoch 1") on a natively installed 12.0.3 (not in production). I got the same stack trace as #3.

After loading an osdmaptool-crafted osdmap at epoch 1 with ceph-objectstore-tool, right after the "add_map_bl" function ran for osdmap epoch 1, I got:

osd/osd_types.cc: In function 'static bool PastIntervals::check_new_interval(int, int, const std::vector<int>&, const std::vector<int>&, int, int, const std::vector<int>&, const std::vector<int>&, epoch_t, epoch_t, OSDMapRef, OSDMapRef, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)'
osd/osd_types.cc: 3513: FAILED assert(lastmap->get_pools().count(pgid.pool()))

 ceph version 12.0.3 (f2337d1b42fa49dbb0a93e4048a42762e3dffbbf)                                                                                                                                                                                               
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f0c2ad9eba0]                                                                                                                                                             
 2: (PastIntervals::check_new_interval(int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, int, int, std::vector<int, std::allocator<int> > const&, std::vector<int, std::allocator<int> > const&, unsigned int, unsigned int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x5d7) [0x7f0c2aa7e0d7]
 3: (OSD::build_initial_pg_history(spg_t, unsigned int, utime_t, pg_history_t*, PastIntervals*)+0x5ac) [0x7f0c2a86bb9c]                                                                                                                                       
 4: (OSD::handle_pg_create(boost::intrusive_ptr<OpRequest>)+0x9b4) [0x7f0c2a876774]                                                                                                                                                                           
 5: (OSD::dispatch_op(boost::intrusive_ptr<OpRequest>)+0x1e1) [0x7f0c2a878901]                                                                                                                                                                                
 6: (OSD::_dispatch(Message*)+0x3a7) [0x7f0c2a879327]                                                                                                                                                                                                         
 7: (OSD::ms_dispatch(Message*)+0x87) [0x7f0c2a879767]                                                                                                                                                                                                        
 8: (DispatchQueue::entry()+0x7a2) [0x7f0c2af79672]                                                                                                                                                                                                           
 9: (DispatchQueue::DispatchThread::entry()+0xd) [0x7f0c2ae1d07d]                                                                                                                                                                                             
 10: (()+0x7dc5) [0x7f0c27a2edc5]                                                                                                                                                                                                                             
 11: (clone()+0x6d) [0x7f0c2691cced]                                                        

From code analysis (apologies for this rough description):
- PG creation is handled on the mon side by the Subscription structure in mon/Session.h
- "stale+down" PGs don't carry a created epoch (needs further analysis here)
- the "next" attribute passed to the structure for these PGs is not defined
- the default value for "next" in the structure is 0
- this default value is passed to the OSD to construct the PG history (our failing function)
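
To make that last step concrete, here is a minimal, self-contained sketch (simplified stand-in types, not the real PastIntervals code): the previous map handed to check_new_interval() must already contain the PG's pool, which an epoch-0/1 map never will, so the assert at osd_types.cc:3513 fires.

#include <cassert>
#include <cstdint>
#include <map>

using pool_id_t = int64_t;

struct FakeOSDMap {                                 // stand-in for OSDMap
  std::map<pool_id_t, int> pools;                   // pool id -> placeholder pg_pool_t
  const std::map<pool_id_t, int>& get_pools() const { return pools; }
};

// Hypothetical, heavily simplified stand-in for PastIntervals::check_new_interval().
bool check_new_interval(const FakeOSDMap* lastmap, pool_id_t pool) {
  assert(lastmap->get_pools().count(pool));         // the check at osd_types.cc:3513
  // ... interval computation would follow in the real code ...
  return true;
}

int main() {
  FakeOSDMap current;
  current.pools[5] = 0;                             // the pool exists in the current map
  FakeOSDMap epoch1;                                // an early map that predates the pool
  (void)epoch1;                                     // used only in the commented-out call

  check_new_interval(&current, 5);                  // fine
  // check_new_interval(&epoch1, 5);                // would abort with the assert above
  return 0;
}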

I re-compiled with this testing commit (definitely not enough/safe to fix it that way): https://github.com/redref/ceph/commit/12c28242e4e0592c57c3be013a3e07f7bcf72b47

Let me know your thoughts on this. I'm pretty sure I didn't lose data, but I cannot exclude it.

Actions #11

Updated by Greg Farnum almost 7 years ago

red ref, are you saying you created a brand-new cluster with 12.0.3 and saw this on first boot?

Sage, do you think you know what caused this?

Actions #12

Updated by red ref almost 7 years ago

Yes, I'm pretty sure it was 12.0.3. But not on first boot; only after massive failures left me with PGs in stale+down status.

I tried to cross-check it with a new tiny cluster but didn't succeed in getting stale+down PGs.

Actions #13

Updated by Greg Farnum almost 7 years ago

  • Status changed from Won't Fix to New
  • Priority changed from Urgent to High

I didn't follow it all the way through, but it sure looks to me like the acting_primary input to the crashing sequence is the output of a function that can return -1, and there are no checks for that state of the world.

Actions #14

Updated by red ref almost 7 years ago

Basically yes.

In src/mon/Session.h, Subscription::next ends up as -1 or 0.

I am learning the C++ standard as I go, but I rewired it to:
- src/messages/MMonSubscribe.h > MMonSubscribe::what.start

I cannot go further, but it seems this value ultimately comes from the OSD.

As I cannot give more input on the OSD (I trashed it), I'm fine letting this rest until I can reproduce it or it pops up in QA testing.
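
For reference, a minimal sketch of the value flow described above, with stand-in names (the real fields live in mon/Session.h and messages/MMonSubscribe.h): a subscribe start epoch that was never set stays 0, and that 0 is what eventually reaches the OSD as the PG's creation epoch.

#include <cstdint>

using epoch_t = uint32_t;

struct FakeMMonSubscribe {   // stand-in for the client's subscribe message (what.start)
  epoch_t start = 0;         // left at its default in the failing case
};

struct FakeSubscription {    // stand-in for the monitor-side Subscription record (next)
  epoch_t next = 0;
};

int main() {
  FakeMMonSubscribe msg;
  FakeSubscription sub;
  sub.next = msg.start;      // 0 propagates into the subscription...
  // ...and is later used as the PG's creation epoch, which is how
  // build_initial_pg_history() ends up asking for map epoch 0/1.
  return static_cast<int>(sub.next);
}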

Actions #15

Updated by Josh Durgin almost 7 years ago

  • Subject changed from master osd's won't start after upgrade from 12.0.1 to PastIntervals::check_new_interval: assert(lastmap->get_pools().count(pgid.pool()))
Actions #16

Updated by Sage Weil almost 7 years ago

  • Status changed from New to Won't Fix

There was a lot of code churn around the 12.0.3 time period, so this isn't too surprising to me. I'm not sure it's worth investigating.

Actions #17

Updated by sean redmond almost 7 years ago

https://pastebin.com/raw/xmDPg84a was discussed on IRC by @mguz; it seems it may be related, but that was Kraken. Just adding it here for reference.

Actions #18

Updated by WANG Guoqin almost 7 years ago

sean redmond wrote:

https://pastebin.com/raw/xmDPg84a was discussed on IRC by @mguz; it seems it may be related, but that was Kraken. Just adding it here for reference.

I believe it's related to Kraken as well. Which IRC channel was that, and do you have a chat log of it?

Actions #19

Updated by Jason McNeil almost 7 years ago

WANG Guoqin wrote:

Which IRC channel was that, and do you have a chat log of it?

https://gist.githubusercontent.com/jasonrm/99218b043ece7733355892c8700ec967/raw/5a6ebccaf606208eed7701a3b7e5b570550f1847/gistfile1.txt is an edited copy of my log, as http://irclogs.ceph.widodh.nl seems to be broken.

I've tried to trim the chat down to just mguz's issue. Note that he did have to do some cleanup related to an "endless cycle of peering, remapping and peering", so I've left that in, in case it is relevant.

Actions #20

Updated by WANG Guoqin over 6 years ago

I finally have some clues about the situation I'm facing. I don't know if anyone's still watching this thread.

After upgrading to 12.0.3 and restarting all mon and OSD daemons, the osdmaps on all OSDs have been trimmed to roughly epochs 40000-57000; nothing below 40000 remains on any OSD, which I guess is by design, but both of my pools refer to lower osdmap epochs. The pool cephfs_data has last_change 27850, and cephfs_metadata has last_change 79.

Now, each time an OSD daemon tries to start, it calls handle_pg_create for each PG from the old osdmap 27850, thus calls build_initial_pg_history and get_map on epoch 27850, finds that it doesn't exist, and crashes.

I'm still trying to find out how osdmaps are trimmed, and why the pools still refer to old osdmaps.
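
A rough illustration of that sequence, using the epoch numbers from this comment (the guard shown is hypothetical; the real path just calls get_map() on the trimmed epoch and asserts):

#include <cstdint>
#include <iostream>

using epoch_t = uint32_t;

int main() {
  const epoch_t oldest_stored = 40000;   // approximate trim floor seen on the OSDs
  const epoch_t pg_created    = 27850;   // cephfs_data's last_change epoch

  // Hypothetical guard; the real code asserts inside get_map() instead.
  if (pg_created < oldest_stored) {
    std::cout << "epoch " << pg_created
              << " has been trimmed; cannot rebuild the initial PG history from it\n";
    return 1;
  }
  return 0;
}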

Actions #21

Updated by Nathan Cutler over 6 years ago

Can you upgrade to 12.1.1? That's the latest version.

Actions #22

Updated by WANG Guoqin over 6 years ago

Yes, and that doesn't help. None of the OSDs can start up reliably.

Is anyone familiar with the osdmap trimming algorithm, and with how PGs recreate themselves after osdmaps have been trimmed?
