Bug #16982

OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal

Added by Wido den Hollander almost 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature:

Description

While upgrading from Hammer to Jewel I'm seeing this happen on multiple OSDs.

These systems are running Ubuntu 14.04 with the 4.4 kernel backported from Ubuntu 16.04

Not sure yet what is happening here.

    -3> 2016-08-10 19:41:33.479095 7f20c2fd0700  5 osd.319 89070 heartbeat: osd_stat(738 GB used, 377 GB avail, 1116 GB total, peers []/[] op hist [])
    -2> 2016-08-10 19:41:36.379292 7f20c2fd0700  5 osd.319 89070 heartbeat: osd_stat(738 GB used, 377 GB avail, 1116 GB total, peers []/[] op hist [])
    -1> 2016-08-10 19:41:38.679495 7f20c2fd0700  5 osd.319 89070 heartbeat: osd_stat(738 GB used, 377 GB avail, 1116 GB total, peers []/[] op hist [])
     0> 2016-08-10 19:41:40.686944 7f20d37f1700 -1 common/buffer.cc: In function 'void ceph::buffer::ptr::copy_in(unsigned int, unsigned int, const char*, bool)' thread 7f20d37f1700 time 2016-08-10 19:41:40.682052
common/buffer.cc: 977: FAILED assert(o+l <= _len)

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x563076f97dab]
 2: (ceph::buffer::ptr::copy_in(unsigned int, unsigned int, char const*, bool)+0x248) [0x563076fa09c8]
 3: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0x563076fa0d8c]
 4: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1e1) [0x563076fa23e1]
 5: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x87b) [0x563076d47deb]
 6: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x491) [0x563076c688c1]
 7: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x18d) [0x56307698b08d]
 8: (OSD::handle_osd_map(MOSDMap*)+0x1485) [0x56307694c595]
 9: (OSD::_dispatch(Message*)+0x261) [0x563076962291]
 10: (OSD::ms_dispatch(Message*)+0x20f) [0x5630769628cf]
 11: (DispatchQueue::entry()+0x78b) [0x56307705337b]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x563076f7447d]
 13: (()+0x8184) [0x7f20e78f9184]
 14: (clone()+0x6d) [0x7f20e5a2537d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.319.log
--- end dump of recent events ---
2016-08-10 19:41:40.698677 7f20d37f1700 -1 *** Caught signal (Aborted) **
 in thread 7f20d37f1700 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x8ebb02) [0x563076ea0b02]
 2: (()+0x10330) [0x7f20e7901330]
 3: (gsignal()+0x37) [0x7f20e5961c37]
 4: (abort()+0x148) [0x7f20e5965028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x563076f97f85]
 6: (ceph::buffer::ptr::copy_in(unsigned int, unsigned int, char const*, bool)+0x248) [0x563076fa09c8]
 7: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0x563076fa0d8c]
 8: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1e1) [0x563076fa23e1]
 9: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x87b) [0x563076d47deb]
 10: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x491) [0x563076c688c1]
 11: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x18d) [0x56307698b08d]
 12: (OSD::handle_osd_map(MOSDMap*)+0x1485) [0x56307694c595]
 13: (OSD::_dispatch(Message*)+0x261) [0x563076962291]
 14: (OSD::ms_dispatch(Message*)+0x20f) [0x5630769628cf]
 15: (DispatchQueue::entry()+0x78b) [0x56307705337b]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x563076f7447d]
 17: (()+0x8184) [0x7f20e78f9184]
 18: (clone()+0x6d) [0x7f20e5a2537d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2016-08-10 19:41:40.698677 7f20d37f1700 -1 *** Caught signal (Aborted) **
 in thread 7f20d37f1700 thread_name:ms_dispatch

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (()+0x8ebb02) [0x563076ea0b02]
 2: (()+0x10330) [0x7f20e7901330]
 3: (gsignal()+0x37) [0x7f20e5961c37]
 4: (abort()+0x148) [0x7f20e5965028]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x265) [0x563076f97f85]
 6: (ceph::buffer::ptr::copy_in(unsigned int, unsigned int, char const*, bool)+0x248) [0x563076fa09c8]
 7: (ceph::buffer::list::rebuild(ceph::buffer::ptr&)+0x3c) [0x563076fa0d8c]
 8: (ceph::buffer::list::rebuild_aligned_size_and_memory(unsigned int, unsigned int)+0x1e1) [0x563076fa23e1]
 9: (FileJournal::prepare_entry(std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, ceph::buffer::list*)+0x87b) [0x563076d47deb]
 10: (FileStore::queue_transactions(ObjectStore::Sequencer*, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x491) [0x563076c688c1]
 11: (ObjectStore::queue_transaction(ObjectStore::Sequencer*, ObjectStore::Transaction&&, Context*, Context*, Context*, std::shared_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x18d) [0x56307698b08d]
 12: (OSD::handle_osd_map(MOSDMap*)+0x1485) [0x56307694c595]
 13: (OSD::_dispatch(Message*)+0x261) [0x563076962291]
 14: (OSD::ms_dispatch(Message*)+0x20f) [0x5630769628cf]
 15: (DispatchQueue::entry()+0x78b) [0x56307705337b]
 16: (DispatchQueue::DispatchThread::entry()+0xd) [0x563076f7447d]
 17: (()+0x8184) [0x7f20e78f9184]
 18: (clone()+0x6d) [0x7f20e5a2537d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.319.log
--- end dump of recent events ---

ceph-osd.136.log.gz (47.9 KB) Wido den Hollander, 08/11/2016 06:51 AM


Related issues

Related to Ceph - Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause" Resolved 08/10/2016

History

#1 Updated by Wido den Hollander almost 4 years ago

Logs are available at: https://www.holmes.nl/pub/osd.319.log.gz

I have to add, these OSDs are very far behind on OSDMaps due to many failures: OOM, high CPU, etc.

#2 Updated by Wido den Hollander almost 4 years ago

After a chat with sjust on IRC, we think the journal size (100M) might be an issue.

This seems to have been an admin error during deployment.

The OSD is far behind on the Maps and has to catch up. That fails.

#3 Updated by Wido den Hollander almost 4 years ago

  • Subject changed from OSD crash after upgrade to Jewel: common/buffer.cc: 977: FAILED assert(o+l <= _len) to OSD crash after upgrade to Jewel: give useful error when trying to commit 4000 maps to a 100MB journal

#4 Updated by Wido den Hollander almost 4 years ago

I have some new logs as requested on IRC: https://www.holmes.nl/pub/osd.319-2.log.gz

The journal has now been moved to an 8 GB file instead of the 100M one.

Still, the OSD crashes.

#5 Updated by Samuel Just almost 4 years ago

Wido: were you able to get the logs we talked about?

#6 Updated by Wido den Hollander almost 4 years ago

Before I go to sleep (long day!), here are some logs as requested on IRC.

I tried to get osd.319 up and running, but it crashed. I have two logs:

- https://www.holmes.nl/pub/osd.319.log.gz
- https://www.holmes.nl/pub/osd.319-2.log.gz

The reason it was sharing 4000 maps (or more) is that I upped osd_map_message_max to 5000 at some point.

While upgrading there were OSDs getting behind on the OSDMaps and they never passed preboot.

This happened with Hammer (0.94.7) and Jewel (10.2.2)

They started to crash with the backtrace from above.

What I found out is that some systems (by accident!) had a journal of 100M.

After I flushed the journals and re-created them at 16G it worked, but most OSDs never advanced to the latest map.

They had to be restarted multiple times and that's what you see in the logs of osd.264: https://www.holmes.nl/pub/ceph-osd.264.log.gz

Eventually, after a lot of restarts they advanced to the latest map.

The monitors were already running 10.2.2 in this case.

From mon01 I also have logs from while it was talking with osd.319: https://www.holmes.nl/pub/mon01.log.gz

So a short recap:

- Upgrade from Hammer to Jewel, hardware failed, major panic
- OSDs were slow due to CPU and memory pressure, falling behind
- OSDs never caught up with the maps
- Bumped the number of shared maps (osd_map_message_max), which seemed to work
- OSDs with a small journal started to crash
- Increased the journal size, but OSDs still never caught up

#7 Updated by Wido den Hollander almost 4 years ago

I have another log to attach.

I tried adding new OSDs to the cluster and they also have to catch up, which never happens until I restart them over and over.

osd.136 in this case is a fresh OSD. You can see it jumps 1,000 maps at a time (osd_map_message_max), but then just waits.

I restart the OSD, it goes 1k maps forward and waits. Restart, repeat, and so on.

#8 Updated by Kefu Chai almost 4 years ago

  • Assignee set to Kefu Chai

#9 Updated by Kefu Chai almost 4 years ago

tl;dr: integer overflow

  • it's not specific to the upgrade scenario. this only happens with osds far behind the cluster, plus a large osd_map_message_max.
  • not all hammer=>jewel upgrades will run into this issue.

workaround

  • smaller osd_map_message_max, so we don't send that many osdmaps in a single message, and hence the transaction for the file journal will be smaller.
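As a concrete illustration of this workaround, the option could be lowered in ceph.conf. The section placement and the value 100 are arbitrary examples for illustration, not a recommendation from this thread:

```ini
# Hypothetical ceph.conf fragment: cap the number of osdmaps packed
# into a single MOSDMap message, so the resulting journal transaction
# stays small. The value 100 is an example, not a tuned number.
[global]
osd_map_message_max = 100
```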

over 4000 full osdmaps in a large cluster were being written into the file journal in a single transaction, and before the transaction goes into the journal, the buffers are rebuilt so they are aligned.

i guess the size of the encoded data is greater than (unsigned)-1, i.e. 4294967295. so at the moment of the crash, in buffer::list::rebuild_aligned_size_and_memory(), the size of nb is a small number that had already overflowed before unaligned.rebuild(nb); for example, a true size of 4294967295 + 201 wraps to 200. and in buffer::ptr::copy_in():

assert(o < _len);   // 0 < 200; good
assert(o+l <= _len); // 0 + 1234 > 200; bad!!

i am still looking at the osd.136.

#10 Updated by Wido den Hollander almost 4 years ago

Logical explanation!

So that explains this 'bug', but doesn't tell why the OSDs will not advance in maps and require a restart to do so.

osd.136's log hopefully has more information about that.

#11 Updated by Kefu Chai almost 4 years ago

but doesn't tell why the OSDs will not advance in maps and require a restart to do so.

assuming you are talking about the case of osd.136, here's what happens:

  1. => i want 0
    • objecter->start(), sub_sent["osdmap"] = {0,1}
  2. booting, i have [86492..91992]
  3. <= [96222, 96222], i have [86492..96222].
  4. => i want >= 96223
    • objecter.osdmap.epoch == 0
    • objecter.osdmap = m.last_osdmap // first map, we want the full thing
    • i think at that moment the "pause" flag was set in the osdmap ("ceph osd pause"), or one or more OSDs were full; that's why the objecter was asking for osdmaps >= 96223 after its osdmap was updated with the one carried by the message from the monitor.
    • sub_new["osdmap"] = {96223, 0}
    • update sub_sent with sub_new, please note "sub_sent.insert(sub_new.begin(), sub_new.end())" does not replace existing items with the same key! // bad! because we will revisit sub_sent afterwards.
    • so sub_sent["osdmap"] = {0, 1}
    • in sub_got(), sub_sent.erase("osdmap"), because 96222 >= sub_sent["osdmap"].start.
  5. => i want = 91993
    • in OSD::osdmap_subscribe(), a one-time subscription was sent, for osdmap == sb.newest + 1 = 91993.
    • since sub_sent["osdmap"] had been erased, this subscription was sent
    • sub_sent["osdmap"] = {91993, 1}
  6. <= [91993, 92992], i have [86492..96222]
    • 1K maps, monitor reset the sent epoch, and replied.
    • in sub_got(), sub_sent.erase("osdmap"), because 91993 >= 91993
  7. => i want >= 96223
    • objecter still wants a map newer than 96222
    • sub_new["osdmap"] = {96223, 1},
    • sub_sent["osdmap"] = {96223, 1}
  8. 1k maps consumed, i have [86492..92992]
  9. <= [96223, 96223], i have [86492..96223].
  10. => i want >= 96224+
    • objecter still wants a map newer than 96223
      • sub_new["osdmap"] = {96224, 0},
      • sub_sent["osdmap"] = {96224, 0}
    • OSD found that message skipped epochs, so
      • osdmap_subscribe(92992 + 1, false), but 92993 < sub_sent["osdmap"].start (96224), so the request was rejected by MonClient.
  11. <= [96224, 96224], i have [86492..96224]. // yes, apparently, what mon had is just 96225 atm,
  12. => i want >= 96225+
  13. <= [96225, 96225], i have [86492..96225].
  14. => i want >= 96226+
  15. <= [96226, 96226], i have [86492..96226].

the problem here: due to "ceph osd pause", the subscription sent by the objecter always gets in the way of the OSD's, so the latter cannot subscribe to the older osdmaps to catch up with the cluster.

#12 Updated by Wido den Hollander almost 4 years ago

Yes, we have set the pause flag to prevent client I/O, but that shouldn't prevent OSDs from fetching their maps, right?

I think this is a bug or do you think differently?

#13 Updated by Kefu Chai almost 4 years ago

  • Copied to Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause" added

#14 Updated by Kefu Chai almost 4 years ago

Wido, no, it should not. agreed, it's a bug, and #17023 has been filed to track it separately.

#15 Updated by Kefu Chai almost 4 years ago

per discussion with sam, we could check the transaction size before sending it down to FileJournal; if it's over the size limit of buffer or some internal limitation of ceph, we should print an error message and crash.

and there are some invalid (combinations of) settings we currently cannot detect; for those, we could

  • print error messages and bail out on seeing them, or
  • override them with the default setting when loading invalid user settings

#16 Updated by Kefu Chai almost 4 years ago

  • Status changed from New to Fix Under Review
  • Source changed from other to Community (user)
  • Release set to jewel

#17 Updated by Kefu Chai almost 4 years ago

  • Status changed from Fix Under Review to Resolved

#18 Updated by Loic Dachary almost 4 years ago

  • Copied to deleted (Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause")

#19 Updated by Loic Dachary almost 4 years ago

  • Related to Bug #17023: OSD failed to subscribe skipped osdmaps after "ceph osd pause" added
