
Bug #23937

FAILED assert(info.history.same_interval_since != 0)

Added by Rafal Wadolowski almost 6 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
OSD
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Two of our osds hit this assert and are now down.

   -10> 2018-04-29 23:18:01.237866 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.3e9s1( v 49315'28553 (46049'27009,49315'28553] local-lis/les=157492/157503 n=7220 ec=488/66 lis/c 157492/137632 les/c/f 157503/137634/110032 157485/157492/151703) [146,33,975,252,861,231]/[146,33,416,252,68,574]p146(0) r=1 lpr=157503 pi=[137632,157492)/8 crt=49314'28552 lcod 0'0 unknown NOTIFY] enter Start
    -9> 2018-04-29 23:18:01.237876 7fa8fbe08700  1 osd.33 pg_epoch: 157504 pg[6.3e9s1( v 49315'28553 (46049'27009,49315'28553] local-lis/les=157492/157503 n=7220 ec=488/66 lis/c 157492/137632 les/c/f 157503/137634/110032 157485/157492/151703) [146,33,975,252,861,231]/[146,33,416,252,68,574]p146(0) r=1 lpr=157503 pi=[137632,157492)/8 crt=49314'28552 lcod 0'0 unknown NOTIFY] state<Start>: transitioning to Stray
    -8> 2018-04-29 23:18:01.237887 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.3e9s1( v 49315'28553 (46049'27009,49315'28553] local-lis/les=157492/157503 n=7220 ec=488/66 lis/c 157492/137632 les/c/f 157503/137634/110032 157485/157492/151703) [146,33,975,252,861,231]/[146,33,416,252,68,574]p146(0) r=1 lpr=157503 pi=[137632,157492)/8 crt=49314'28552 lcod 0'0 unknown NOTIFY] exit Start 0.000021 0 0.000000
    -7> 2018-04-29 23:18:01.237897 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.3e9s1( v 49315'28553 (46049'27009,49315'28553] local-lis/les=157492/157503 n=7220 ec=488/66 lis/c 157492/137632 les/c/f 157503/137634/110032 157485/157492/151703) [146,33,975,252,861,231]/[146,33,416,252,68,574]p146(0) r=1 lpr=157503 pi=[137632,157492)/8 crt=49314'28552 lcod 0'0 unknown NOTIFY] enter Started/Stray
    -6> 2018-04-29 23:18:01.238745 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.6f0s0( v 49315'28967 (46049'27467,49315'28967] lb MIN (bitwise) local-lis/les=157082/157092 n=0 ec=2496/66 lis/c 157082/156477 les/c/f 157092/156538/110032 157424/157424/157424) [2147483647,28,554,167,601,331]p28(1) r=-1 lpr=157485 pi=[156477,157424)/2 crt=49315'28967 lcod 0'0 unknown NOTIFY] exit Reset 25.049585 20 0.000098
    -5> 2018-04-29 23:18:01.238770 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.6f0s0( v 49315'28967 (46049'27467,49315'28967] lb MIN (bitwise) local-lis/les=157082/157092 n=0 ec=2496/66 lis/c 157082/156477 les/c/f 157092/156538/110032 157424/157424/157424) [2147483647,28,554,167,601,331]p28(1) r=-1 lpr=157485 pi=[156477,157424)/2 crt=49315'28967 lcod 0'0 unknown NOTIFY] enter Started
    -4> 2018-04-29 23:18:01.238782 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.6f0s0( v 49315'28967 (46049'27467,49315'28967] lb MIN (bitwise) local-lis/les=157082/157092 n=0 ec=2496/66 lis/c 157082/156477 les/c/f 157092/156538/110032 157424/157424/157424) [2147483647,28,554,167,601,331]p28(1) r=-1 lpr=157485 pi=[156477,157424)/2 crt=49315'28967 lcod 0'0 unknown NOTIFY] enter Start
    -3> 2018-04-29 23:18:01.238792 7fa8fbe08700  1 osd.33 pg_epoch: 157504 pg[6.6f0s0( v 49315'28967 (46049'27467,49315'28967] lb MIN (bitwise) local-lis/les=157082/157092 n=0 ec=2496/66 lis/c 157082/156477 les/c/f 157092/156538/110032 157424/157424/157424) [2147483647,28,554,167,601,331]p28(1) r=-1 lpr=157485 pi=[156477,157424)/2 crt=49315'28967 lcod 0'0 unknown NOTIFY] state<Start>: transitioning to Stray
    -2> 2018-04-29 23:18:01.238803 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.6f0s0( v 49315'28967 (46049'27467,49315'28967] lb MIN (bitwise) local-lis/les=157082/157092 n=0 ec=2496/66 lis/c 157082/156477 les/c/f 157092/156538/110032 157424/157424/157424) [2147483647,28,554,167,601,331]p28(1) r=-1 lpr=157485 pi=[156477,157424)/2 crt=49315'28967 lcod 0'0 unknown NOTIFY] exit Start 0.000021 0 0.000000
    -1> 2018-04-29 23:18:01.238814 7fa8fbe08700  5 osd.33 pg_epoch: 157504 pg[6.6f0s0( v 49315'28967 (46049'27467,49315'28967] lb MIN (bitwise) local-lis/les=157082/157092 n=0 ec=2496/66 lis/c 157082/156477 les/c/f 157092/156538/110032 157424/157424/157424) [2147483647,28,554,167,601,331]p28(1) r=-1 lpr=157485 pi=[156477,157424)/2 crt=49315'28967 lcod 0'0 unknown NOTIFY] enter Started/Stray
     0> 2018-04-29 23:18:01.239367 7fa8fc609700 -1 /build/ceph-12.2.4/src/osd/PG.cc: In function 'void PG::start_peering_interval(OSDMapRef, const std::vector<int>&, int, const std::vector<int>&, int, ObjectStore::Transaction*)' thread 7fa8fc609700 time 2018-04-29 23:18:01.235920
/build/ceph-12.2.4/src/osd/PG.cc: 5450: FAILED assert(info.history.same_interval_since != 0)

 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x56502dfd6872]
 2: (PG::start_peering_interval(std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> > const&, int, std::vector<int, std::allocator<int> > const&, int, ObjectStore::Transaction*)+0x151e) [0x56502dab63fe]
 3: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x479) [0x56502dab69c9]
 4: (boost::statechart::simple_state<PG::RecoveryState::Reset, PG::RecoveryState::RecoveryMachine, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x166) [0x56502db02186]
 5: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x69) [0x56502dad7a69]
 6: (PG::handle_advance_map(std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, std::vector<int, std::allocator<int> >&, int, std::vector<int, std::allocator<int> >&, int, PG::RecoveryCtx*)+0x496) [0x56502da9b2e6]
 7: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x2e4) [0x56502d9dec74]
 8: (OSD::process_peering_events(std::__cxx11::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x1e4) [0x56502d9dfba4]
 9: (ThreadPool::BatchWorkQueue<PG>::_void_process(void*, ThreadPool::TPHandle&)+0x27) [0x56502da535f7]
 10: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb9) [0x56502dfdd779]
 11: (ThreadPool::WorkThread::entry()+0x10) [0x56502dfde880]
 12: (()+0x76ba) [0x7fa9289b96ba]
 13: (clone()+0x6d) [0x7fa927a3041d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
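For reference, the failed assert encodes a simple invariant on the PG's history: same_interval_since records the epoch at which the PG's current interval began, and 0 is the "never set" sentinel, so reaching start_peering_interval with 0 means the on-disk pg info was never initialized. A minimal Python model of just that invariant (an illustration only; the real PG::start_peering_interval does far more):

```python
from dataclasses import dataclass, field

@dataclass
class PGHistory:
    # Epoch at which the PG's current interval began; 0 means "never set".
    same_interval_since: int = 0

@dataclass
class PGInfo:
    history: PGHistory = field(default_factory=PGHistory)

def start_peering_interval(info: PGInfo, new_epoch: int) -> None:
    """Begin a new interval; models only the assert at PG.cc:5450."""
    # A PG entering a new interval must already have a valid history.
    assert info.history.same_interval_since != 0, \
        "FAILED assert(info.history.same_interval_since != 0)"
    info.history.same_interval_since = new_epoch

healthy = PGInfo(PGHistory(same_interval_since=157492))
start_peering_interval(healthy, 157504)  # ok: history was initialized

broken = PGInfo()  # same_interval_since == 0, an uninitialized history
try:
    start_peering_interval(broken, 157504)
except AssertionError as e:
    print("crashed:", e)
```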

History

#1 Updated by Rafal Wadolowski almost 6 years ago

I found a little more:

 -56> 2018-04-30 13:07:55.839696 7f5a29a3d700 -1 log_channel(cluster) log [ERR] : 5.b56s3 required past_interval bounds are empty [138169,0) but past_intervals is not: ([143157,157232] intervals=([144599,144615] acting 4(3),113(1),345(2),412(5)),([144616,144649] acting 4(3),113(1),203(4),345(2),412(5)),([151499,151511] acting 4(3),113(1),203(4),345(2),375(0)),([156704,156883] acting 113(1),203(4),345(2),375(0),412(5)),([156884,157232] acting 4(3),113(1),203(4),345(2),375(0),412(5)))
   -55> 2018-04-30 13:07:55.839705 7f5a29a3d700 -1 osd.33 pg_epoch: 157494 pg[5.b56s3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [375,113,345,2147483647,203,412]p375(0) r=-1 lpr=157491 pi=[143157,157233)/5 crt=0'0 unknown NOTIFY] 5.b56s3 required past_interval bounds are empty [138169,0) but past_intervals is not: ([143157,157232] intervals=([144599,144615] acting 4(3),113(1),345(2),412(5)),([144616,144649] acting 4(3),113(1),203(4),345(2),412(5)),([151499,151511] acting 4(3),113(1),203(4),345(2),375(0)),([156704,156883] acting 113(1),203(4),345(2),375(0),412(5)),([156884,157232] acting 4(3),113(1),203(4),345(2),375(0),412(5)))
   -54> 2018-04-30 13:07:55.839874 7f5a29a3d700 -1 log_channel(cluster) log [ERR] : 5.b56s3 required past_interval bounds are empty [138169,0) but past_intervals is not: ([143157,157232] intervals=([144599,144615] acting 4(3),113(1),345(2),412(5)),([144616,144649] acting 4(3),113(1),203(4),345(2),412(5)),([151499,151511] acting 4(3),113(1),203(4),345(2),375(0)),([156704,156883] acting 113(1),203(4),345(2),375(0),412(5)),([156884,157232] acting 4(3),113(1),203(4),345(2),375(0),412(5)))
   -53> 2018-04-30 13:07:55.839899 7f5a29a3d700 -1 osd.33 pg_epoch: 157495 pg[5.b56s3( DNE empty local-lis/les=0/0 n=0 ec=0/0 lis/c 0/0 les/c/f 0/0/0 0/0/0) [375,113,345,2147483647,203,412]p375(0) r=-1 lpr=157491 pi=[143157,157233)/5 crt=0'0 unknown NOTIFY] 5.b56s3 required past_interval bounds are empty [138169,0) but past_intervals is not: ([143157,157232] intervals=([144599,144615] acting 4(3),113(1),345(2),412(5)),([144616,144649] acting 4(3),113(1),203(4),345(2),412(5)),([151499,151511] acting 4(3),113(1),203(4),345(2),375(0)),([156704,156883] acting 113(1),203(4),345(2),375(0),412(5)),([156884,157232] acting 4(3),113(1),203(4),345(2),375(0),412(5)))
   -52> 2018-04-30 13:07:55.840021 7f5a29a3d700 -1 log_channel(cluster) log [ERR] : 5.b56s3 required past_interval bounds are empty [138169,0) but past_intervals is not: ([143157,157232] intervals=([144599,144615] acting 4(3),113(1),345(2),412(5)),([144616,144649] acting 4(3),113(1),203(4),345(2),412(5)),([151499,151511] acting 4(3),113(1),203(4),345(2),375(0)),([156704,156883] acting 113(1),203(4),345(2),375(0),412(5)),([156884,157232] acting 4(3),113(1),203(4),345(2),375(0),412(5)))

Any ideas?
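The shape of this error can be sketched as follows. This is a simplified model under assumptions, not Ceph's actual code (the real logic lives in PG::check_past_interval_bounds in PG.cc, and the origin of the lower bound 138169 is assumed here): the required bounds form a half-open epoch range whose upper end is same_interval_since, so a value of 0 makes the range [138169,0) empty even though past_intervals still holds recorded intervals, and that mismatch is flagged.

```python
from typing import List, Optional, Tuple

def check_past_interval_bounds(lower: int, same_interval_since: int,
                               past_intervals: List[Tuple[int, int]]) -> Optional[str]:
    """Return an error message on inconsistency, None if consistent.

    Models the idea behind the check: if the required range
    [lower, same_interval_since) is empty, no past intervals
    should be recorded either.
    """
    bounds_empty = lower >= same_interval_since
    if bounds_empty and past_intervals:
        return ("required past_interval bounds are empty "
                f"[{lower},{same_interval_since}) but past_intervals is not")
    return None

# Matches the log line: lower bound 138169 with same_interval_since == 0,
# while past_intervals still holds recorded intervals.
print(check_past_interval_bounds(138169, 0, [(144599, 144615), (144616, 144649)]))
```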

#2 Updated by Rafal Wadolowski almost 6 years ago

This pg has a value of 0 in same_interval_since. I checked this with the following output:
https://paste.fedoraproject.org/paste/Yf5Syb0TssYVIFBGava2yA

#3 Updated by Greg Farnum almost 6 years ago

  • Project changed from Ceph to RADOS
  • Category set to Correctness/Safety
  • Component(RADOS) OSD added

#4 Updated by Josh Durgin almost 6 years ago

Did you import or export any PGs? The on-disk pg info from comment #2 indicates the pg doesn't exist on osd.33 yet.

Is pool 5 recently created?

#5 Updated by Rafal Wadolowski almost 6 years ago

I didn't import or export any pgs; it was a working osd in the cluster.
Is it possible that restarting the osd process could cause this?
Right now I have one pg down, so we can't access all of our data:
https://paste.fedoraproject.org/paste/5grVsv4j~2U0uedfOke0HQ
It is strange, because I have a 4+2 EC configuration and the cluster is waiting for a single osd.

The pool was created weeks ago, and this osd was working fine.
Maybe there should be a command to abandon one shard of a pg, or some other solution?

#6 Updated by Josh Durgin almost 6 years ago

The pg is waiting for state from osd.33 - can you use ceph-post-file to upload the full log from the crash?

You might be able to export pg 5.c0as3 from osd.33, back up the copy of that pg on osd.878, and import the exported pg into osd.878. From the existing trace we can't tell which PG is hitting that crash, so this may or may not cause the same problem on osd.878.

#7 Updated by Rafal Wadolowski almost 6 years ago

I can't post using ceph-post-file, so I uploaded the file here: https://eocloud.eu:8080/swift/v1/rwadolowski/ceph-osd.33.log.4.gz

#8 Updated by huang jun about 5 years ago

@rafal what is the status of this issue now? Did you resolve the problem?

#9 Updated by Rafal Wadolowski about 5 years ago

huang jun wrote:

@rafal what is the status of this issue now? Did you resolve the problem?

I had to redeploy this drive. This issue has happened 2 or 3 times during the cluster's lifetime. Do you know the meaning behind this assert?

#10 Updated by huang jun about 5 years ago

I think it was resolved by this https://github.com/ceph/ceph/pull/20571
and the related issue is http://tracker.ceph.com/issues/21833

#11 Updated by huang jun about 4 years ago

Unfortunately, we still hit this after the patch (https://github.com/ceph/ceph/pull/20571) was applied.
We didn't do any export or import.
It occurs when we expand the cluster by adding osds to a pool.
One osd crashed 3 times due to a heartbeatmap suicide timeout,
and the pg was migrated to other osds; that pg is stray.
