Bug #20208
ceph-mgr: mon/PGMap.cc: 1367: FAILED assert(end >= 0)
Status: Closed
Description
My wip-19657 is rebased on master at bc8c432a69cfb3ef0cc726c5ecd404a372a29d2b
http://pulpito.ceph.com/dzafman-2017-06-06_20:29:57-rados-wip-19657-distro-basic-smithi/1266972/
remote/smithi065/log/ceph-mgr.x.log.gz
2017-06-07 05:23:31.381731 7ffbaf9e3700  0 mgr tick tick
2017-06-07 05:23:31.381747 7ffbaf9e3700  1 mgr send_beacon active
2017-06-07 05:23:31.381750 7ffbaf9e3700 10 mgr send_beacon sending beacon as gid 4101
2017-06-07 05:23:31.381756 7ffbaf9e3700  0 mgr tick
2017-06-07 05:23:31.381758 7ffbaf9e3700 10 mgr update_delta_stats v189
2017-06-07 05:23:31.385749 7ffbaf9e3700 -1 /build/ceph-12.0.2-2441-g54d26e8/src/mon/PGMap.cc: In function 'void PGMap::stat_pg_sub(const pg_t&, const pg_stat_t&, bool)' thread 7ffbaf9e3700 time 2017-06-07 05:23:31.382130
/build/ceph-12.0.2-2441-g54d26e8/src/mon/PGMap.cc: 1367: FAILED assert(end >= 0)

 ceph version 12.0.2-2441-g54d26e8 (54d26e80341b627b3027566465035307e2738b4f) luminous (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7ffbbb15ea2e]
 2: (()+0x2383ab) [0x7ffbbafdd3ab]
 3: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x2e6) [0x7ffbbafe1866]
 4: (ClusterState::update_delta_stats()+0x93) [0x7ffbbb01c813]
 5: (DaemonServer::send_report()+0xc3) [0x7ffbbb00e1f3]
 6: (Mgr::tick()+0x5d) [0x7ffbbb03ebbd]
 7: (MgrStandby::tick()+0x87) [0x7ffbbb036be7]
 8: (FunctionContext::finish(int)+0x1a) [0x7ffbbb01911a]
 9: (Context::complete(int)+0x9) [0x7ffbbb015719]
 10: (SafeTimer::timer_thread()+0xec) [0x7ffbbb15b3fc]
 11: (SafeTimerThread::entry()+0xd) [0x7ffbbb15cd8d]
 12: (()+0x8184) [0x7ffbb95fa184]
 13: (clone()+0x6d) [0x7ffbb86e037d]
Updated by Loïc Dachary almost 7 years ago
- Status changed from New to 12
- Priority changed from Normal to Urgent
Upgrading to urgent because it breaks some suites on master.
http://pulpito.ceph.com/loic-2017-06-07_08:03:59-ceph-disk-loic-master-distro-basic-vps/
2017-06-06T21:30:15.046 INFO:tasks.ceph.mgr.x.ovh084.stderr: ceph version 12.0.2-2357-gdd1f471 (dd1f47126deb239f70a10183867ef6a20d611d69) luminous (dev)
2017-06-06T21:30:15.046 INFO:tasks.ceph.mgr.x.ovh084.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x557c34084262]
2017-06-06T21:30:15.046 INFO:tasks.ceph.mgr.x.ovh084.stderr: 2: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x15b7) [0x557c33ede957]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 3: (ClusterState::update_delta_stats()+0x1f7) [0x557c33f1fec7]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 4: (DaemonServer::send_report()+0xdf) [0x557c33f106cf]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 5: (Mgr::tick()+0xfd) [0x557c33f4896d]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 6: (MgrStandby::tick()+0x121) [0x557c33f3f911]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 7: (FunctionContext::finish(int)+0x2a) [0x557c33f1b66a]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 8: (Context::complete(int)+0x9) [0x557c33f18159]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 9: (SafeTimer::timer_thread()+0x452) [0x557c34080fa2]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 10: (SafeTimerThread::entry()+0xd) [0x557c340823ad]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 11: (()+0x76ba) [0x7fc73c7356ba]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 12: (clone()+0x6d) [0x7fc73b7a182d]
Updated by Sage Weil almost 7 years ago
- Priority changed from Urgent to Immediate
Updated by Sage Weil almost 7 years ago
- Status changed from 12 to In Progress
- Assignee set to Sage Weil
I think there are two crashes. The second one,
src/mon/PGMap.cc: 1085: FAILED assert(osd_stat.size() == osd_epochs.size())
is, I think, what Loic saw (it's what I'm getting from the ceph-disk suite). It should be fixed by:
diff --git a/src/mon/PGMap.cc b/src/mon/PGMap.cc
index a623e20..b39b6a3 100644
--- a/src/mon/PGMap.cc
+++ b/src/mon/PGMap.cc
@@ -1130,6 +1130,7 @@ void PGMap::apply_incremental(CephContext *cct, const Incremental& inc)
     if (t != osd_stat.end()) {
       stat_osd_sub(t->first, t->second);
       osd_stat.erase(t);
+      osd_epochs.erase(*p);
     }

     // remove these old osds from full/nearfull set(s), too
testing, and still working on the first one.
Updated by Loïc Dachary almost 7 years ago
I'm testing the proposed patch as well.
Updated by Sage Weil almost 7 years ago
https://github.com/ceph/ceph/pull/15573 for the second instance that Loic saw; haven't figured out the first instance yet.
Updated by Yuri Weinstein almost 7 years ago
- Status changed from In Progress to Fix Under Review
Updated by Sage Weil almost 7 years ago
- Status changed from Fix Under Review to Resolved
Updated by Sage Weil almost 7 years ago
- Status changed from Resolved to 12
first instance happened again:
2017-06-16T02:34:24.509 INFO:tasks.ceph.mgr.x.smithi198.stderr:/build/ceph-12.0.3-1748-g62eaa13/src/mon/PGMap.cc: In function 'void PGMap::stat_pg_sub(const pg_t&, const pg_stat_t&, bool)' thread 7f8cd328b700 time 2017-06-16 02:34:24.509213
2017-06-16T02:34:24.509 INFO:tasks.ceph.mgr.x.smithi198.stderr:/build/ceph-12.0.3-1748-g62eaa13/src/mon/PGMap.cc: 1370: FAILED assert(end >= 0)
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: ceph version 12.0.3-1748-g62eaa13 (62eaa139b3add8abb500dc17b522ce9bbfb9fac3) luminous (dev)
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5601ff0cb0d2]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 2: (()+0x2446d9) [0x5601fef246d9]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 3: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x2b0) [0x5601fef24990]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 4: (ClusterState::notify_osdmap(OSDMap const&)+0x675) [0x5601fef6b345]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 5: (()+0x2b19a2) [0x5601fef919a2]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 6: (Mgr::handle_osd_map()+0xb3) [0x5601fef91f43]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 7: (Mgr::ms_dispatch(Message*)+0x164) [0x5601fef93144]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 8: (MgrStandby::ms_dispatch(Message*)+0xb4) [0x5601fef89304]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 9: (DispatchQueue::entry()+0xf4a) [0x5601ff32acfa]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x5601ff17857d]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 11: (()+0x76ba) [0x7f8cd9f416ba]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 12: (clone()+0x6d) [0x7f8cd8fad82d]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-06-16T02:34:24.527 INFO:tasks.ceph.mgr.x.smithi198.stderr:2017-06-16 02:34:24.526256 7f8cd328b700 -1 /build/ceph-12.0.3-1748-g62eaa13/src/mon/PGMap.cc: In function 'void PGMap::stat_pg_sub(const pg_t&, const pg_stat_t&, bool)' thread 7f8cd328b700 time 2017-06-16 02:34:24.509213
/a/sage-2017-06-16_00:46:50-rados-wip-sage-testing-distro-basic-smithi/1292390
unfortunately the core file seems to be bad. :(
Updated by Sage Weil almost 7 years ago
- Status changed from 12 to Need More Info
Updated by Sage Weil almost 7 years ago
- Priority changed from Immediate to Urgent
Updated by Greg Farnum almost 7 years ago
- Project changed from Ceph to mgr
- Category deleted (110)
Updated by Sage Weil almost 7 years ago
/a/sage-2017-06-21_02:01:04-rados-wip-sage-testing2-distro-basic-smithi/1308406
core file again doesn't work. wth..
Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/3e/0788b32f603e75fb8594bb211d83af6d28e3e5
Core was generated by `ceph-mgr -f --cluster ceph -i y'.
Program terminated with signal 6, Aborted.
#0  0x00007f57da86a23b in ?? ()
"/home/ubuntu/1498021974.23889.core" is a core file.
Please specify an executable to debug.
(gdb) q
[ubuntu@smithi148 ~]$ ls /usr/lib/debug/.build-id/3e/0788b32f603e75fb8594bb211d83af6d28e3e5
ls: cannot access /usr/lib/debug/.build-id/3e/0788b32f603e75fb8594bb211d83af6d28e3e5: No such file or directory
[ubuntu@smithi148 ~]$ ls /usr/lib/debug/.build-id/3e/
a14aebdd22407fb52918a54f3b4a8b7a70ca28  a14aebdd22407fb52918a54f3b4a8b7a70ca28.debug
[ubuntu@smithi148 ~]$ rpm -qa | grep debuginfo
ceph-debuginfo-12.0.3-1957.gcbf8433.el7.x86_64
Updated by Sage Weil almost 7 years ago
/a/sage-2017-06-21_17:39:30-rados-wip-bluestore-test-fsck-distro-basic-smithi/1312108
core finally works!
a and b have inc and pg_map dumps
Updated by Greg Farnum almost 7 years ago
Waiting to reproduce with newly-set higher debug log levels.
Updated by Sage Weil almost 7 years ago
revert teuthology.git 31156d94c621e868c1fbb546015fced0bb56a873 when this is found (it raises debug mgr = 30)
Another instance, but without logs:
/a/sage-2017-07-03_15:41:59-rados-wip-sage-testing-distro-basic-smithi/1356383
Updated by Sage Weil almost 7 years ago
Another case, without logs. Another hint, though..
  -198> 2017-07-07 02:18:26.111735 7f313c656700 10 mgr notify_osdmap v135
  -197> 2017-07-07 02:18:26.111739 7f313c656700 10 check_osd_map pool 90 gone, removing pgs
  -196> 2017-07-07 02:18:26.111754 7f313c656700 10 check_osd_map pool 83 gone, removing pgs
  -195> 2017-07-07 02:18:26.111762 7f313c656700 10 check_osd_map pool 15 pg_num 11 != my pg_num 8
  -194> 2017-07-07 02:18:26.111782 7f313c656700 20 check_osd_map adding 15.8
  -193> 2017-07-07 02:18:26.111785 7f313c656700 20 check_osd_map adding 15.9
  -192> 2017-07-07 02:18:26.111787 7f313c656700 20 check_osd_map adding 15.a
  -191> 2017-07-07 02:18:26.111902 7f313c656700 20 deleted pool 83
  -190> 2017-07-07 02:18:26.111924 7f313c656700 20 deleted pool 90
...
  -147> 2017-07-07 02:18:26.403722 7f3138b4c700 10 mgr update_delta_stats v136
...
   -11> 2017-07-07 02:18:27.204029 7f313c656700 10 mgr notify_osdmap v137
   -10> 2017-07-07 02:18:27.204035 7f313c656700 10 check_osd_map pool 15 pg_num 11 != my pg_num 8
    -9> 2017-07-07 02:18:27.204038 7f313c656700 10 check_osd_map pool 91 pg_num 8 != my pg_num 0
    -8> 2017-07-07 02:18:27.204040 7f313c656700 20 check_osd_map adding 91.0
    -7> 2017-07-07 02:18:27.204044 7f313c656700 20 check_osd_map adding 91.1
    -6> 2017-07-07 02:18:27.204045 7f313c656700 20 check_osd_map adding 91.2
    -5> 2017-07-07 02:18:27.204047 7f313c656700 20 check_osd_map adding 91.3
    -4> 2017-07-07 02:18:27.204048 7f313c656700 20 check_osd_map adding 91.4
    -3> 2017-07-07 02:18:27.204050 7f313c656700 20 check_osd_map adding 91.5
    -2> 2017-07-07 02:18:27.204051 7f313c656700 20 check_osd_map adding 91.6
    -1> 2017-07-07 02:18:27.204052 7f313c656700 20 check_osd_map adding 91.7
Note that the pool 15 count didn't match twice! Also, the second time they weren't printed, which means they were already in pending_inc. Probably an update from the OSD?
Updated by Sage Weil almost 7 years ago
- Status changed from Need More Info to Fix Under Review
Updated by Sage Weil almost 7 years ago
- Status changed from Fix Under Review to Resolved