Project

General

Profile

Bug #20208

ceph-mgr: mon/PGMap.cc: 1367: FAILED assert(end >= 0)

Added by David Zafman 4 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
-
Target version:
-
Start date:
06/06/2017
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
Regression:
No
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Release:
Needs Doc:
No

Description

My wip-19657 is rebased on master at bc8c432a69cfb3ef0cc726c5ecd404a372a29d2b

http://pulpito.ceph.com/dzafman-2017-06-06_20:29:57-rados-wip-19657-distro-basic-smithi/1266972/

remote/smithi065/log/ceph-mgr.x.log.gz

2017-06-07 05:23:31.381731 7ffbaf9e3700  0 mgr tick tick
2017-06-07 05:23:31.381747 7ffbaf9e3700  1 mgr send_beacon active
2017-06-07 05:23:31.381750 7ffbaf9e3700 10 mgr send_beacon sending beacon as gid 4101
2017-06-07 05:23:31.381756 7ffbaf9e3700  0 mgr tick
2017-06-07 05:23:31.381758 7ffbaf9e3700 10 mgr update_delta_stats  v189
2017-06-07 05:23:31.385749 7ffbaf9e3700 -1 /build/ceph-12.0.2-2441-g54d26e8/src/mon/PGMap.cc: In function 'void PGMap::stat_pg_sub(const pg_t&, const pg_stat_t&, bool)' thread 7ffbaf9e3700 time 2017-06-07 05:23:31.382130
/build/ceph-12.0.2-2441-g54d26e8/src/mon/PGMap.cc: 1367: FAILED assert(end >= 0)

 ceph version  12.0.2-2441-g54d26e8 (54d26e80341b627b3027566465035307e2738b4f) luminous (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x10e) [0x7ffbbb15ea2e]
 2: (()+0x2383ab) [0x7ffbbafdd3ab]
 3: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x2e6) [0x7ffbbafe1866]
 4: (ClusterState::update_delta_stats()+0x93) [0x7ffbbb01c813]
 5: (DaemonServer::send_report()+0xc3) [0x7ffbbb00e1f3]
 6: (Mgr::tick()+0x5d) [0x7ffbbb03ebbd]
 7: (MgrStandby::tick()+0x87) [0x7ffbbb036be7]
 8: (FunctionContext::finish(int)+0x1a) [0x7ffbbb01911a]
 9: (Context::complete(int)+0x9) [0x7ffbbb015719]
 10: (SafeTimer::timer_thread()+0xec) [0x7ffbbb15b3fc]
 11: (SafeTimerThread::entry()+0xd) [0x7ffbbb15cd8d]
 12: (()+0x8184) [0x7ffbb95fa184]
 13: (clone()+0x6d) [0x7ffbb86e037d]

History

#1 Updated by David Zafman 4 months ago

  • Description updated (diff)

#2 Updated by Loic Dachary 4 months ago

  • Status changed from New to Verified
  • Priority changed from Normal to Urgent

Upgrading to urgent because it breaks some suites on master.

http://pulpito.ceph.com/loic-2017-06-07_08:03:59-ceph-disk-loic-master-distro-basic-vps/

2017-06-06T21:30:15.046 INFO:tasks.ceph.mgr.x.ovh084.stderr: ceph version  12.0.2-2357-gdd1f471 (dd1f47126deb239f70a10183867ef6a20d611d69) luminous (dev)
2017-06-06T21:30:15.046 INFO:tasks.ceph.mgr.x.ovh084.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x557c34084262]
2017-06-06T21:30:15.046 INFO:tasks.ceph.mgr.x.ovh084.stderr: 2: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x15b7) [0x557c33ede957]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 3: (ClusterState::update_delta_stats()+0x1f7) [0x557c33f1fec7]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 4: (DaemonServer::send_report()+0xdf) [0x557c33f106cf]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 5: (Mgr::tick()+0xfd) [0x557c33f4896d]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 6: (MgrStandby::tick()+0x121) [0x557c33f3f911]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 7: (FunctionContext::finish(int)+0x2a) [0x557c33f1b66a]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 8: (Context::complete(int)+0x9) [0x557c33f18159]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 9: (SafeTimer::timer_thread()+0x452) [0x557c34080fa2]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 10: (SafeTimerThread::entry()+0xd) [0x557c340823ad]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 11: (()+0x76ba) [0x7fc73c7356ba]
2017-06-06T21:30:15.047 INFO:tasks.ceph.mgr.x.ovh084.stderr: 12: (clone()+0x6d) [0x7fc73b7a182d]

#3 Updated by Sage Weil 4 months ago

  • Priority changed from Urgent to Immediate

#4 Updated by Sage Weil 4 months ago

  • Status changed from Verified to In Progress
  • Assignee set to Sage Weil

I think there are two crashes.. the second one,

src/mon/PGMap.cc: 1085: FAILED assert(osd_stat.size() == osd_epochs.size())

is I think what Loic saw (it's what I'm getting from the ceph-disk suite). It should be fixed by
diff --git a/src/mon/PGMap.cc b/src/mon/PGMap.cc
index a623e20..b39b6a3 100644
--- a/src/mon/PGMap.cc
+++ b/src/mon/PGMap.cc
@@ -1130,6 +1130,7 @@ void PGMap::apply_incremental(CephContext *cct, const Incremental& inc)
     if (t != osd_stat.end()) {
       stat_osd_sub(t->first, t->second);
       osd_stat.erase(t);
+      osd_epochs.erase(*p);
     }

     // remove these old osds from full/nearfull set(s), too

testing, and still working on the first one.

#5 Updated by Loic Dachary 4 months ago

I'm testing the proposed patch as well.

#6 Updated by Sage Weil 4 months ago

https://github.com/ceph/ceph/pull/15573 for the second instance that loic saw; haven't figured out the first instance yet.

#7 Updated by Yuri Weinstein 4 months ago

  • Status changed from In Progress to Need Review

#8 Updated by Sage Weil 3 months ago

  • Status changed from Need Review to Resolved

#9 Updated by Greg Farnum 3 months ago

  • Category changed from 106 to 110

#10 Updated by Sage Weil 3 months ago

  • Status changed from Resolved to Verified

first instance happened again:

2017-06-16T02:34:24.509 INFO:tasks.ceph.mgr.x.smithi198.stderr:/build/ceph-12.0.3-1748-g62eaa13/src/mon/PGMap.cc: In function 'void PGMap::stat_pg_sub(const pg_t&, const pg_stat_t&, bool)' thread 7f8cd328b700 time 2017-06-16 02:34:24.509213
2017-06-16T02:34:24.509 INFO:tasks.ceph.mgr.x.smithi198.stderr:/build/ceph-12.0.3-1748-g62eaa13/src/mon/PGMap.cc: 1370: FAILED assert(end >= 0)
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: ceph version 12.0.3-1748-g62eaa13 (62eaa139b3add8abb500dc17b522ce9bbfb9fac3) luminous (dev)
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x5601ff0cb0d2]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 2: (()+0x2446d9) [0x5601fef246d9]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 3: (PGMap::apply_incremental(CephContext*, PGMap::Incremental const&)+0x2b0) [0x5601fef24990]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 4: (ClusterState::notify_osdmap(OSDMap const&)+0x675) [0x5601fef6b345]
2017-06-16T02:34:24.525 INFO:tasks.ceph.mgr.x.smithi198.stderr: 5: (()+0x2b19a2) [0x5601fef919a2]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 6: (Mgr::handle_osd_map()+0xb3) [0x5601fef91f43]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 7: (Mgr::ms_dispatch(Message*)+0x164) [0x5601fef93144]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 8: (MgrStandby::ms_dispatch(Message*)+0xb4) [0x5601fef89304]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 9: (DispatchQueue::entry()+0xf4a) [0x5601ff32acfa]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x5601ff17857d]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 11: (()+0x76ba) [0x7f8cd9f416ba]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: 12: (clone()+0x6d) [0x7f8cd8fad82d]
2017-06-16T02:34:24.526 INFO:tasks.ceph.mgr.x.smithi198.stderr: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
2017-06-16T02:34:24.527 INFO:tasks.ceph.mgr.x.smithi198.stderr:2017-06-16 02:34:24.526256 7f8cd328b700 -1 /build/ceph-12.0.3-1748-g62eaa13/src/mon/PGMap.cc: In function 'void PGMap::stat_pg_sub(const pg_t&, const pg_stat_t&, bool)' thread 7f8cd328b700 time 2017-06-16 02:34:24.509213

/a/sage-2017-06-16_00:46:50-rados-wip-sage-testing-distro-basic-smithi/1292390

unfortunately the core file seems to be bad. :(

#11 Updated by Sage Weil 3 months ago

  • Status changed from Verified to Need More Info

#12 Updated by Sage Weil 3 months ago

  • Priority changed from Immediate to Urgent

#13 Updated by Greg Farnum 3 months ago

  • Project changed from Ceph to mgr
  • Category deleted (110)

#14 Updated by Sage Weil 3 months ago

/a/sage-2017-06-21_02:01:04-rados-wip-sage-testing2-distro-basic-smithi/1308406

core file again doesn't work. wth..


Missing separate debuginfo for the main executable file
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/3e/0788b32f603e75fb8594bb211d83af6d28e3e5
Core was generated by `ceph-mgr -f --cluster ceph -i y'.
Program terminated with signal 6, Aborted.
#0  0x00007f57da86a23b in ?? ()
"/home/ubuntu/1498021974.23889.core" is a core file.
Please specify an executable to debug.
(gdb) q
[ubuntu@smithi148 ~]$ ls /usr/lib/debug/.build-id/3e/0788b32f603e75fb8594bb211d83af6d28e3e5
ls: cannot access /usr/lib/debug/.build-id/3e/0788b32f603e75fb8594bb211d83af6d28e3e5: No such file or directory
[ubuntu@smithi148 ~]$ ls /usr/lib/debug/.build-id/3e/
a14aebdd22407fb52918a54f3b4a8b7a70ca28  a14aebdd22407fb52918a54f3b4a8b7a70ca28.debug
[ubuntu@smithi148 ~]$ rpm -qa | grep debuginfo
ceph-debuginfo-12.0.3-1957.gcbf8433.el7.x86_64

#15 Updated by Sage Weil 3 months ago

/a/sage-2017-06-21_17:39:30-rados-wip-bluestore-test-fsck-distro-basic-smithi/1312108

core finally works!

a and b have inc and pg_map dumps

#16 Updated by Greg Farnum 3 months ago

Waiting to reproduce with newly-set higher debug log levels.

#17 Updated by Sage Weil 3 months ago

revert teuthology.git 31156d94c621e868c1fbb546015fced0bb56a873 when this is found (it raises debug mgr = 30)

Another instance, but without logs:

/a/sage-2017-07-03_15:41:59-rados-wip-sage-testing-distro-basic-smithi/1356383

#18 Updated by Sage Weil 3 months ago

Another case, without logs. Another hint, though..

  -198> 2017-07-07 02:18:26.111735 7f313c656700 10 mgr notify_osdmap  v135
  -197> 2017-07-07 02:18:26.111739 7f313c656700 10 check_osd_map pool 90 gone, removing pgs
  -196> 2017-07-07 02:18:26.111754 7f313c656700 10 check_osd_map pool 83 gone, removing pgs
  -195> 2017-07-07 02:18:26.111762 7f313c656700 10 check_osd_map pool 15 pg_num 11 != my pg_num 8
  -194> 2017-07-07 02:18:26.111782 7f313c656700 20 check_osd_map adding 15.8
  -193> 2017-07-07 02:18:26.111785 7f313c656700 20 check_osd_map adding 15.9
  -192> 2017-07-07 02:18:26.111787 7f313c656700 20 check_osd_map adding 15.a
  -191> 2017-07-07 02:18:26.111902 7f313c656700 20  deleted pool 83
  -190> 2017-07-07 02:18:26.111924 7f313c656700 20  deleted pool 90
...
  -147> 2017-07-07 02:18:26.403722 7f3138b4c700 10 mgr update_delta_stats  v136
...
   -11> 2017-07-07 02:18:27.204029 7f313c656700 10 mgr notify_osdmap  v137
   -10> 2017-07-07 02:18:27.204035 7f313c656700 10 check_osd_map pool 15 pg_num 11 != my pg_num 8
    -9> 2017-07-07 02:18:27.204038 7f313c656700 10 check_osd_map pool 91 pg_num 8 != my pg_num 0
    -8> 2017-07-07 02:18:27.204040 7f313c656700 20 check_osd_map adding 91.0
    -7> 2017-07-07 02:18:27.204044 7f313c656700 20 check_osd_map adding 91.1
    -6> 2017-07-07 02:18:27.204045 7f313c656700 20 check_osd_map adding 91.2
    -5> 2017-07-07 02:18:27.204047 7f313c656700 20 check_osd_map adding 91.3
    -4> 2017-07-07 02:18:27.204048 7f313c656700 20 check_osd_map adding 91.4
    -3> 2017-07-07 02:18:27.204050 7f313c656700 20 check_osd_map adding 91.5
    -2> 2017-07-07 02:18:27.204051 7f313c656700 20 check_osd_map adding 91.6
    -1> 2017-07-07 02:18:27.204052 7f313c656700 20 check_osd_map adding 91.7

not that the pool 15 count didn't match twice! also, the second time they weren't printed, which means they were already in pending_inc. probably an update from the osd?

#19 Updated by Sage Weil 2 months ago

  • Status changed from Need More Info to Need Review

#20 Updated by Sage Weil 2 months ago

  • Status changed from Need Review to Resolved

Also available in: Atom PDF