Bug #16982

OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal

Added by Wido den Hollander almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

While upgrading from Hammer to Jewel I'm seeing this happen on multiple OSDs.

These systems are running Ubuntu 14.04 with the 4.4 kernel backported from Ubuntu 16.04

Not sure yet what is happening here.

    -3> 2016-08-10 19:41:33.479095 7f20c2fd0700  5 osd.319 89070 heartbeat: osd_stat(738 GB used, 377 GB avail, 1116 GB total, peers []/[] op hist [])
    -2> 2016-08-10 19:41:36.379292 7f20c2fd0700  5 osd.319 89070 heartbeat: osd_stat(738 GB used, 377 GB avail, 1116 GB total, peers []/[] op hist [])
    -1> 2016-08-10 19:41:38.679495 7f20c2fd0700  5 osd.319 89070 heartbeat: osd_stat(738 GB used, 377 GB avail, 1116 GB total, peers []/[] op hist [])
     0> 2016-08-10 19:41:40.686944 7f20d37f1700 -1 common/buffer.cc: In function 'void ceph::buffer::ptr::copy_in(unsigned int, unsigned int, const char*, bool)' thread 7f20d37f1700 time 2016-08-10 19:41:40.682052
common/buffer.cc: 977: FAILED assert(o+l <= _len)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x563076f97dab]
 2: (ceph::buffer::ptr::copy_in(unsigned int, unsigned int, char const*, bool)+0x248) [0x563076fa09c8]
 3: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0x563076fa0d8c]
 4: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1e1) [0x563076fa23e1]
 5: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x87b) [0x563076d47deb]
 6: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x491) [0x563076c688c1]
 7: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x18d) [0x56307698b08d]
 8: (OSD::handle_osd_map(MOSDMap*)+0x1485) [0x56307694c595]
 9: (OSD::_dispatch(Message*)+0x261) [0x563076962291]
 10: (OSD::ms_dispatch(Message*)+0x20f) [0x5630769628cf]
 11: (DispatchQueue::entry()+0x78b) [0x56307705337b]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x563076f7447d]
 13: (()+0x8184) [0x7f20e78f9184]
 14: (clone()+0x6d) [0x7f20e5a2537d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.319.log
--- end dump of recent events ---
2016-08-10 19:41:40.698677 7f20d37f1700 -1 *** Caught signal (Aborted) **
 in thread 7f20d37f1700 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x8ebb02) [0x563076ea0b02]
 2: (()+0x10330) [0x7f20e7901330]
 3: (gsignal()+0x37) [0x7f20e5961c37]
 4: (abort()+0x148) [0x7f20e5965028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x563076f97f85]
 6: (ceph::buffer::ptr::copy_in(unsigned int, unsigned int, char const*, bool)+0x248) [0x563076fa09c8]
 7: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0x563076fa0d8c]
 8: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1e1) [0x563076fa23e1]
 9: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x87b) [0x563076d47deb]
 10: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x491) [0x563076c688c1]
 11: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x18d) [0x56307698b08d]
 12: (OSD::handle_osd_map(MOSDMap*)+0x1485) [0x56307694c595]
 13: (OSD::_dispatch(Message*)+0x261) [0x563076962291]
 14: (OSD::ms_dispatch(Message*)+0x20f) [0x5630769628cf]
 15: (DispatchQueue::entry()+0x78b) [0x56307705337b]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x563076f7447d]
 17: (()+0x8184) [0x7f20e78f9184]
 18: (clone()+0x6d) [0x7f20e5a2537d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2016-08-10 19:41:40.698677 7f20d37f1700 -1 *** Caught signal (Aborted) **
 in thread 7f20d37f1700 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x8ebb02) [0x563076ea0b02]
 2: (()+0x10330) [0x7f20e7901330]
 3: (gsignal()+0x37) [0x7f20e5961c37]
 4: (abort()+0x148) [0x7f20e5965028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x563076f97f85]
 6: (ceph::buffer::ptr::copy_in(unsigned int, unsigned int, char const*, bool)+0x248) [0x563076fa09c8]
 7: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0x563076fa0d8c]
 8: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1e1) [0x563076fa23e1]
 9: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x87b) [0x563076d47deb]
 10: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x491) [0x563076c688c1]
 11: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x18d) [0x56307698b08d]
 12: (OSD::handle_osd_map(MOSDMap*)+0x1485) [0x56307694c595]
 13: (OSD::_dispatch(Message*)+0x261) [0x563076962291]
 14: (OSD::ms_dispatch(Message*)+0x20f) [0x5630769628cf]
 15: (DispatchQueue::entry()+0x78b) [0x56307705337b]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x563076f7447d]
 17: (()+0x8184) [0x7f20e78f9184]
 18: (clone()+0x6d) [0x7f20e5a2537d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.319.log
--- end dump of recent events ---

ceph-osd.136.log.gz (47.9 KB) Wido den Hollander, 08/11/2016 06:51 AM


Related issues

Related to Ceph - Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause" Resolved 08/10/2016

History

#1 Updated by Wido den Hollander almost 4 years ago

Logs are available at: https://www.holmes.nl/pub/osd.319.log.gz

I have to add, these OSDs are very far behind on OSDMaps due to many failures: OOM, high CPU, etc.

#2 Updated by Wido den Hollander almost 4 years ago

After a chat with sjust on IRC, we think the journal size (100M) might be an issue.

This seems to have been an admin error during deployment.

The OSD is far behind on the Maps and has to catch up. That fails.

#3 Updated by Wido den Hollander almost 4 years ago

  • Subject changed from OSD crash after upgrade to Jewel: common/buffer.cc: 977: FAILED assert(o+l <= _len) to OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal

#4 Updated by Wido den Hollander almost 4 years ago

I have some new logs as requested on IRC: https://www.holmes.nl/pub/osd.319-2.log.gz

The journal has now been moved to an 8 GB file instead of the 100M one.

Still, the OSD crashes.

#5 Updated by Samuel Just almost 4 years ago

Wido: were you able to get the logs we talked about?

#6 Updated by Wido den Hollander almost 4 years ago

Before I go to sleep (long day!), here are some logs as requested on IRC.

I tried to get osd.319 up and running, but it crashed. I have two logs:

- https://www.holmes.nl/pub/osd.319.log.gz
- https://www.holmes.nl/pub/osd.319-2.log.gz

The reason it was sharing 4000 maps (or more) is that I upped osd_map_message_max to 5000 at some point.

While upgrading there were OSDs getting behind on the OSDMaps and they never passed preboot.

This happened with Hammer (0.94.7) and Jewel (10.2.2)

They started to crash with the backtrace from above.

What I found out is that some systems (by accident!) had a journal of 100M.

After I flushed the journals and re-created them at 16G it worked, but most OSDs never advanced to the latest map.

They had to be restarted multiple times and that's what you see in the logs of osd.264: https://www.holmes.nl/pub/ceph-osd.264.log.gz

Eventually, after a lot of restarts they advanced to the latest map.

The monitors were already running 10.2.2 in this case.

From mon01 I also have logs from while it was talking with osd.319: https://www.holmes.nl/pub/mon01.log.gz

So a short recap:

- Upgrade from Hammer to Jewel, hardware failed, major panic
- OSDs were slow due to CPU and memory pressure, falling behind
- OSDs never caught up with the maps
- Bumped the number of shared maps (osd_map_message_max), which seemed to work
- OSDs with a small journal started to crash
- Increased the journal size, but OSDs still never caught up

#7 Updated by Wido den Hollander almost 4 years ago

I have another log to attach.

I tried adding new OSDs to the cluster and they also have to catch up, which never happens until I restart them over and over.

osd.136 in this case is a fresh OSD. You can see it jumps 1,000 maps at a time (osd_map_message_max), but then just waits.

I restart the OSD, it goes 1k maps forward and waits. Restart, repeat, and so on.

#8 Updated by Kefu Chai almost 4 years ago

  • Assignee set to Kefu Chai

#9 Updated by Kefu Chai almost 4 years ago

tl;dr: integer overflow

  • it's not specific to the upgrade scenario. this only happens with osds far behind the cluster, plus a large osd_map_message_max.
  • not all hammer=>jewel upgrades will run into this issue.

workaround

  • smaller osd_map_message_max, so we don't send that many osdmaps in a single message, and hence the transaction for the file journal will be smaller.
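As a concrete illustration of this workaround, the option could be lowered in ceph.conf. The section placement and the value 100 are arbitrary examples for illustration, not a recommendation from this thread:

```ini
# Hypothetical ceph.conf fragment: cap the number of osdmaps packed
# into a single MOSDMap message, so the resulting journal transaction
# stays small. The value 100 is an example, not a tuned number.
[global]
osd_map_message_max = 100
```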

over 4000 full osdmaps in a large cluster were being written into the file journal in a single transaction, and before the transaction goes into the journal, the buffers are rebuilt so they are aligned.

i guess the size of the encoded data is greater than (unsigned)-1, i.e. 4294967295. so at the moment of the crash, in buffer::list::rebuild_aligned_size_and_memory(), the size of nb is a small number that had already overflowed before unaligned.rebuild(nb); for example, a true size of 4294967295 + 201 wraps to 200. and in buffer::ptr::copy_in():

assert(o < _len);   // 0 < 200; good
assert(o+l <= _len); // 0 + 1234 > 200; bad!!

i am still looking at the osd.136.

#10 Updated by Wido den Hollander almost 4 years ago

Logical explanation!

So that explains this 'bug', but doesn't tell why the OSDs will not advance in maps and require a restart to do so.

osd.136's log hopefully has more information about that.

#11 Updated by Kefu Chai almost 4 years ago

but doesn't tell why the OSDs will not advance in maps and require a restart to do so.

assuming you are talking about the case of osd.136, here's what happens:

  1. => i want 0
    • objecter->start(), sub_sent["osdmap"] = {0,1}
  2. booting, i have [86492..91992]
  3. <= [96222, 96222], i have [86492..96222].
  4. => i want >= 96223
    • objecter.osdmap.epoch == 0
    • objecter.osdmap = m.last_osdmap // first map, we want the full thing
    • i think at that moment the "pause" flag was set in the osdmap ("ceph osd pause"), or one or more OSDs were full; that's why the objecter was asking for osdmaps >= 96223 after its osdmap was updated with the one carried by the message from the monitor.
    • sub_new["osdmap"] = {96223, 0}
    • update sub_sent with sub_new, please note "sub_sent.insert(sub_new.begin(), sub_new.end())" does not replace existing items with the same key! // bad! because we will revisit sub_sent afterwards.
    • so sub_sent["osdmap"] = {0, 1}
    • in sub_got(), sub_sent.erase("osdmap"), because 96222 >= sub_sent["osdmap"].start.
  5. => i want = 91993
    • in OSD::osdmap_subscribe(), a one-time subscription was sent, for osdmap == sb.newest + 1 = 91993.
    • since sub_sent["osdmap"] had been erased, this subscription was sent
    • sub_sent["osdmap"] = {91993, 1}
  6. <= [91993, 92992], i have [86492..96222]
    • 1K maps, monitor reset the sent epoch, and replied.
    • in sub_got(), sub_sent.erase("osdmap"), because 91993 >= 91993
  7. => i want >= 96223
    • objecter still wants a map newer than 96222
    • sub_new["osdmap"] = {96223, 1},
    • sub_sent["osdmap"] = {96223, 1}
  8. 1k maps consumed, i have [86492..92992]
  9. <= [96223, 96223], i have [86492..96223].
  10. => i want >= 96224+
    • objecter still wants a map newer than 96223
      • sub_new["osdmap"] = {96224, 0},
      • sub_sent["osdmap"] = {96224, 0}
    • OSD found that message skipped epochs, so
      • osdmap_subscribe(92992 + 1, false), but 92993 < sub_sent["osdmap"].start (96224), so the request was rejected by MonClient.
  11. <= [96224, 96224], i have [86492..96224]. // yes, apparently, what mon had is just 96225 atm,
  12. => i want >= 96225+
  13. <= [96225, 96225], i have [86492..96225].
  14. => i want >= 96226+
  15. <= [96226, 96226], i have [86492..96226].

the problem here: due to "ceph osd pause", the subscription sent by the objecter always gets in the way of the OSD's, so the latter cannot subscribe to the older osdmaps to catch up with the cluster.

#12 Updated by Wido den Hollander almost 4 years ago

Yes, we have set the pause flag to prevent client I/O, but that shouldn't prevent OSDs from fetching their maps, right?

I think this is a bug or do you think differently?

#13 Updated by Kefu Chai almost 4 years ago

  • Copied to Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause" added

#14 Updated by Kefu Chai almost 4 years ago

Wido, no, it should not. agreed, it's a bug, and #17023 has been filed to track it separately.

#15 Updated by Kefu Chai almost 4 years ago

per discussion with sam, we could check the transaction size before sending it down to FileJournal; if it's over the size limit of buffer or some internal limitation of ceph, we should print an error message and crash.

and there are some invalid (combinations of) settings we currently cannot detect; for those, we could

  • print error messages and bail out on seeing them, or
  • override them with the default setting when loading invalid user settings

#16 Updated by Kefu Chai almost 4 years ago

  • Status changed from New to Fix Under Review
  • Source changed from other to Community (user)
  • Release set to jewel

#17 Updated by Kefu Chai almost 4 years ago

  • Status changed from Fix Under Review to Resolved

#18 Updated by Loic Dachary almost 4 years ago

  • Copied to deleted (Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause")

#19 Updated by Loic Dachary almost 4 years ago

  • Related to Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause" added
