Project

General

Profile

Actions

Bug #20874

closed

osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second.need == i->versi on && miter->second.have == eversion_t()))

Added by Sage Weil over 6 years ago. Updated over 4 years ago.

Status:
Can't reproduce
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
luminous
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

    -8> 2017-08-02 03:05:27.287926 7f45b9a56d00 15 read_log_and_missing  missing 259'1095 (218'220) delete   5:bcbee507:::benchmark_data_smithi017_174432_object3591:head by client.4721.0:20327 2017-08-02 02:51:47.936987 0 (have 218'220)
    -7> 2017-08-02 03:05:27.287933 7f45b9a56d00 15 bluestore(/var/lib/ceph/osd/ceph-1) getattr 5.1d_head #5:be2b1501:::benchmark_data_smithi017_174432_object3577:head# _
    -6> 2017-08-02 03:05:27.287935 7f45b9a56d00 30 bluestore.OnodeSpace(0x7f45c41b0988 in 0x7f45c390f9c0) lookup
    -5> 2017-08-02 03:05:27.287936 7f45b9a56d00 30 bluestore.OnodeSpace(0x7f45c41b0988 in 0x7f45c390f9c0) lookup #5:be2b1501:::benchmark_data_smithi017_174432_object3577:head# miss
    -4> 2017-08-02 03:05:27.287939 7f45b9a56d00 20 bluestore(/var/lib/ceph/osd/ceph-1).collection(5.1d_head 0x7f45c41b0840) get_onode oid #5:be2b1501:::benchmark_data_smithi017_174432_object3577:head# key 0x7f8000000000000005be2b1501216265'nchmark_data_smithi017_174432_object3577!='0xfffffffffffffffeffffffffffffffff'o'
    -3> 2017-08-02 03:05:27.287952 7f45b9a56d00 20 bluestore(/var/lib/ceph/osd/ceph-1).collection(5.1d_head 0x7f45c41b0840)  r -2 v.len 0
    -2> 2017-08-02 03:05:27.287953 7f45b9a56d00 10 bluestore(/var/lib/ceph/osd/ceph-1) getattr 5.1d_head #5:be2b1501:::benchmark_data_smithi017_174432_object3577:head# _ = -2
    -1> 2017-08-02 03:05:27.287955 7f45b9a56d00 15 read_log_and_missing  missing 259'1094 (218'218) delete   5:be2b1501:::benchmark_data_smithi017_174432_object3577:head by client.4721.0:20313 2017-08-02 02:51:47.837432 0
     0> 2017-08-02 03:05:27.321189 7f45b9a56d00 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.1-944-gb1162cb/rpm/el7/BUILD/ceph-12.1.1-944-gb1162cb/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::basic_ostringstream<char>]' thread 7f45b9a56d00 time 2017-08-02 03:05:27.294757
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.1.1-944-gb1162cb/rpm/el7/BUILD/ceph-12.1.1-944-gb1162cb/src/osd/PGLog.h: 1386: FAILED assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))

 ceph version 12.1.1-944-gb1162cb (b1162cb5929635662b7b1cdcf9b9fb41beb10a5a) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f45ba4d3e80]
 2: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, PGLog::IndexedLog&, pg_missing_set<true>&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::string, std::less<std::string>, std::allocator<std::string> >*, bool)+0x210e) [0x7f45ba06248e]
 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x4dc) [0x7f45ba0194ac]
 4: (OSD::load_pgs()+0x97a) [0x7f45b9f678aa]
 5: (OSD::init()+0x1f59) [0x7f45b9f7e779]
 6: (main()+0x2c25) [0x7f45b9e84555]
 7: (__libc_start_main()+0xf5) [0x7f45b60cbb35]

/a/sage-2017-08-02_01:58:49-rados-wip-sage-testing-distro-basic-smithi/1469908

Related issues 1 (1 open0 closed)

Related to RADOS - Bug #45702: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))Fix Under ReviewNitzan Mordechai

Actions
Actions #1

Updated by Josh Durgin over 6 years ago

  • Priority changed from Urgent to Immediate
Actions #2

Updated by Josh Durgin over 6 years ago

This may be a bluestore bug - the log is so large from bluestore debugging that I haven't had time to properly read it to find what the missing set contained earlier in the log.

The transaction to remove the missing object seems to have completed the deletion of the object,
but not the modification to the missing set:

2017-08-02 02:52:00.622327 7fc5a37c9700 10 bluestore(/var/lib/ceph/osd/ceph-1) _remove 5.1d_head #5:be2b1501:::benchmark_data_smithi017_174432_object3577:head# = 0

At this point the osd is killed:

2017-08-02T02:52:00.625 INFO:tasks.thrashosds.thrasher:Killing osd 1, live_osds are [1, 3, 2, 5, 4, 0]
2017-08-02T02:52:00.692 INFO:tasks.ceph.osd.1:Stopped

My guess is that the missing set still contains benchmark_data_smithi017_174432_object3577 with have at the earlier version, failing the 'have == eversion_t()' part of the assert.

Actions #3

Updated by Sage Weil over 6 years ago

  • Backport set to luminous
Actions #4

Updated by Sage Weil over 6 years ago

  • Priority changed from Immediate to Urgent
Actions #5

Updated by Josh Durgin over 6 years ago

  • Priority changed from Urgent to High
Actions #6

Updated by Greg Farnum over 6 years ago

  • Status changed from 12 to Can't reproduce
  • Priority changed from High to Urgent

There have been changes to bluestore, and I think maybe bugfixes to the missing set. And no reproductions.

Actions #7

Updated by Sage Weil over 6 years ago

  • Status changed from Can't reproduce to 12

/a/sage-2018-01-16_03:08:54-rados-wip-sage2-testing-2018-01-15-1257-distro-basic-smithi/2077982

    -6> 2018-01-16 06:11:40.843 7ffbaf9a1d80 15 read_log_and_missing  missing 680'627 (648'139) delete   9:3e312ce4:::benchmark_data_smithi036_156792_object2498:head by client.6950.0:10983 2018-01-16 06:09:54.810319 0 (have 648'139)
    -5> 2018-01-16 06:11:40.843 7ffbaf9a1d80 15 bluestore(/var/lib/ceph/osd/ceph-0) getattr 9.c_head #9:37229082:::benchmark_data_smithi036_156792_object2474:head# _
    -4> 2018-01-16 06:11:40.843 7ffbaf9a1d80 20 bluestore(/var/lib/ceph/osd/ceph-0).collection(9.c_head 0x5597dd2bb800) get_onode oid #9:37229082:::benchmark_data_smithi036_156792_object2474:head# key 0x7f800000000000000937229082216265'nchmark_data_smithi036_156792_object2474!='0xfffffffffffffffeffffffffffffffff'o'
    -3> 2018-01-16 06:11:40.843 7ffbaf9a1d80 20 bluestore(/var/lib/ceph/osd/ceph-0).collection(9.c_head 0x5597dd2bb800)  r -2 v.len 0
    -2> 2018-01-16 06:11:40.843 7ffbaf9a1d80 10 bluestore(/var/lib/ceph/osd/ceph-0) getattr 9.c_head #9:37229082:::benchmark_data_smithi036_156792_object2474:head# _ = -2
    -1> 2018-01-16 06:11:40.843 7ffbaf9a1d80 15 read_log_and_missing  missing 680'626 (648'138) delete   9:37229082:::benchmark_data_smithi036_156792_object2474:head by client.6950.0:10959 2018-01-16 06:09:54.701832 0
     0> 2018-01-16 06:11:40.843 7ffbaf9a1d80 -1 /build/ceph-13.0.0-4861-g2d59c16/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(ObjectStore*, coll_t, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*, std::set<std::__c
xx11::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; std::ostringstream = std::__cxx11::basic_ostringstream<char>]' thread 7ffbaf9a1d80 time 2018-01-16 06:11:40.844750
/build/ceph-13.0.0-4861-g2d59c16/src/osd/PGLog.h: 1418: FAILED assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t()))

leading up to the crash,

2018-01-16 06:10:10.287 7f983a974700  1 -- 172.21.15.36:6806/26487 --> 172.21.15.133:6809/26564 -- MOSDPGRecoveryDeleteReply(9.c e686,683 9:3bec509c:::benchmark_data_smithi036_156792_object2466:head,680'625) v2 -- ?+0 0x55bd5aaaad00 con 0x55bd4ea7b440
2018-01-16 06:10:10.391 7f98128f3700  1 -- 172.21.15.36:6806/26487 <== osd.4 172.21.15.133:6809/26564 28324 ==== MOSDPGRecoveryDelete(9.c e686,683 9:37229082:::benchmark_data_smithi036_156792_object2474:head,680'626) v2 ==== 148+0+0 (4083099826 0 0) 0x55bd5aaaad00 con 0x55bd4ea7b440
2018-01-16 06:10:10.391 7f98128f3700 15 osd.0 686 enqueue_op 0x55bd5e936a80 prio 3 cost 1000 latency 0.000103 epoch 686 MOSDPGRecoveryDelete(9.c e686,683 9:37229082:::benchmark_data_smithi036_156792_object2474:head,680'626) v2
2018-01-16 06:10:10.391 7f98128f3700 20 osd.0 op_wq(4) _enqueue OpQueueItem(9.c PGOpItem(op=MOSDPGRecoveryDelete(9.c e686,683 9:37229082:::benchmark_data_smithi036_156792_object2474:head,680'626) v2) prio 3 cost 1000 e686)
2018-01-16 06:10:10.391 7f982b956700 20 osd.0 op_wq(4) _process OpQueueItem(9.c PGOpItem(op=MOSDPGRecoveryDelete(9.c e686,683 9:37229082:::benchmark_data_smithi036_156792_object2474:head,680'626) v2) prio 3 cost 1000 e686) queued
2018-01-16 06:10:10.391 7f982b956700 20 osd.0 op_wq(4) _process OpQueueItem(9.c PGOpItem(op=MOSDPGRecoveryDelete(9.c e686,683 9:37229082:::benchmark_data_smithi036_156792_object2474:head,680'626) v2) prio 3 cost 1000 e686) pg 0x55bd597b8000
2018-01-16 06:10:10.391 7f982b956700 10 osd.0 686 dequeue_op 0x55bd5e936a80 prio 3 cost 1000 latency 0.000239 MOSDPGRecoveryDelete(9.c e686,683 9:37229082:::benchmark_data_smithi036_156792_object2474:head,680'626) v2 pg pg[9.c( v 686'713 lc 680'625 (0'0,686'713] local-lis/les=683/684 n=263 ec=636/636 lis/c 683/636 les/c/f 684/637/0 683/683/683) [4,7,0] r=2 lpr=683 pi=[636,683)/2 luod=0'0 crt=686'713 active m=17]
2018-01-16 06:10:10.391 7f982b956700 20 osd.0 pg_epoch: 686 pg[9.c( v 686'713 lc 680'625 (0'0,686'713] local-lis/les=683/684 n=263 ec=636/636 lis/c 683/636 les/c/f 684/637/0 683/683/683) [4,7,0] r=2 lpr=683 pi=[636,683)/2 luod=0'0 crt=686'713 active m=17] handle_recovery_delete 0x55bd5e936a80
2018-01-16 06:10:10.391 7f982b956700 20 osd.0 pg_epoch: 686 pg[9.c( v 686'713 lc 680'625 (0'0,686'713] local-lis/les=683/684 n=263 ec=636/636 lis/c 683/636 les/c/f 684/637/0 683/683/683) [4,7,0] r=2 lpr=683 pi=[636,683)/2 luod=0'0 crt=686'713 active m=17] remove_missing_object 9:37229082:::benchmark_data_smithi036_156792_object2474:head 680'626
Actions #8

Updated by Josh Durgin over 6 years ago

  • Assignee set to Josh Durgin
Actions #9

Updated by Sage Weil about 6 years ago

/a/sage-2018-04-11_22:26:40-rados-wip-sage-testing-2018-04-11-1604-distro-basic-smithi/2387226

Actions #10

Updated by Josh Durgin almost 6 years ago

  • Priority changed from Urgent to High
Actions #11

Updated by xie xingguo over 5 years ago

Hit this again:

http://pulpito.ceph.com/xxg-2018-12-19_01:25:39-rados:thrash-wip-no-upmap-for-merge-distro-basic-smithi/3374896/

    -2> 2018-12-19 02:19:34.468 7f9a963498c0 15 read_log_and_missing  missing 887'836 (816'751) delete   2:f5084a30:::smithi17242158-33 ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo:27e by unknown.0.0:0 2018-12-19 02:19:09.081841 0 snaps []
    -1> 2018-12-19 02:19:34.470 7f9a963498c0 -1 /home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/ce
ntos7/MACHINE_SIZE/huge/release/14.0.1-1784-g1143b16/rpm/el7/BUILD/ceph-14.0.1-1784-g1143b16/src/osd/PGLog.h: In function 'static void PGLog::read_log_and_missing(Obje
ctStore*, ObjectStore::CollectionHandle&, ghobject_t, const pg_info_t&, PGLog::IndexedLog&, missing_type&, std::ostringstream&, bool, bool*, const DoutPrefixProvider*,
 std::set<std::basic_string<char> >*, bool) [with missing_type = pg_missing_set<true>; ObjectStore::CollectionHandle = boost::intrusive_ptr<ObjectStore::CollectionImpl
>; std::ostringstream = std::basic_ostringstream<char>]' thread 7f9a963498c0 time 2018-12-19 02:19:34.470059
/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.0.1-1784-g114
3b16/rpm/el7/BUILD/ceph-14.0.1-1784-g1143b16/src/osd/PGLog.h: 1454: FAILED ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter
->second.have == eversion_t()))

 ceph version 14.0.1-1784-g1143b16 (1143b16a07272a8087f9bd95c3f18242d20b1c5f) nautilus (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55f50cf6d53e]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55f50cf6d70c]
 3: (void PGLog::read_log_and_missing<pg_missing_set<true> >(ObjectStore*, boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t, pg_info_t const&, PGLog::Ind
exedLog&, pg_missing_set<true>&, std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, bool, bool*, DoutPrefixProvider const*, std::set<std::
string, std::less<std::string>, std::allocator<std::string> >*, bool)+0x2f10) [0x55f50d1b6640]
 4: (PG::read_state(ObjectStore*)+0x1c5) [0x55f50d182cb5]
 5: (OSD::load_pgs()+0x626) [0x55f50d0a8b56]
 6: (OSD::init()+0xe57) [0x55f50d0cbe27]
 7: (main()+0x1531) [0x55f50cf70eb1]
 8: (__libc_start_main()+0xf5) [0x7f9a91b353d5]
 9: (()+0x53a435) [0x55f50d061435]

     0> 2018-12-19 02:19:34.472 7f9a963498c0 -1 *** Caught signal (Aborted) **
 in thread 7f9a963498c0 thread_name:ceph-osd

 ceph version 14.0.1-1784-g1143b16 (1143b16a07272a8087f9bd95c3f18242d20b1c5f) nautilus (dev)
Actions #12

Updated by Josh Durgin over 4 years ago

  • Status changed from 12 to Can't reproduce
Actions #13

Updated by Neha Ojha about 3 years ago

  • Related to Bug #45702: PGLog::read_log_and_missing: ceph_assert(miter == missing.get_items().end() || (miter->second.need == i->version && miter->second.have == eversion_t())) added
Actions

Also available in: Atom PDF