Bug #8643

0.80.1: OSD crash: osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

Added by Dmitry Smirnov almost 10 years ago. Updated almost 10 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
Regression:
Severity:
2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

An OSD (0.80.1) crashed and has been unusable ever since, hitting a similar crash soon after every restart:

    -1> 2014-06-23 08:34:07.440615 7fecc3d85700  0 osd.3 38905 crush map has features 2578087936000, adjusting msgr requires for mons
     0> 2014-06-23 08:34:07.540053 7fecbe57a700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7fecbe57a700 time 2014-06-23 08:34:07.366630
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x55e1cb) [0x7fecd9f181cb]
 2: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x8c7) [0x7fecd9f190d7]
 3: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x93) [0x7fecd9f27203]
 4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x4a) [0x7fecd9f0c07a]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x972) [0x7fecd9f10772]
 6: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x186) [0x7fecd9f183f6]
 7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fecd9d9202b]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x374) [0x7fecd9bd9104]
 9: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1cf) [0x7fecd9bf4fdf]
 10: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fecd9c38c6c]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x1390) [0x7fecda063ab0]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x7fecda064760]
 13: (()+0x80ca) [0x7fecd910e0ca]
 14: (clone()+0x6d) [0x7fecd762fffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---
2014-06-23 08:34:07.614542 7fecbe57a700 -1 *** Caught signal (Aborted) **
 in thread 7fecbe57a700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x5d44c2) [0x7fecd9f8e4c2]
 2: (()+0xf8f0) [0x7fecd91158f0]
 3: (gsignal()+0x37) [0x7fecd757f407]
 4: (abort()+0x148) [0x7fecd7582508]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x175) [0x7fecd7e6ad65]
 6: (()+0x5edd6) [0x7fecd7e68dd6]
 7: (()+0x5ee21) [0x7fecd7e68e21]
 8: (()+0x5f039) [0x7fecd7e69039]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1e3) [0x7fecda0731e3]
 10: (()+0x55e1cb) [0x7fecd9f181cb]
 11: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x8c7) [0x7fecd9f190d7]
 12: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x93) [0x7fecd9f27203]
 13: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x4a) [0x7fecd9f0c07a]
 14: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x972) [0x7fecd9f10772]
 15: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x186) [0x7fecd9f183f6]
 16: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fecd9d9202b]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x374) [0x7fecd9bd9104]
 18: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1cf) [0x7fecd9bf4fdf]
 19: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fecd9c38c6c]
 20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x1390) [0x7fecda063ab0]
 21: (ThreadPool::WorkThread::entry()+0x10) [0x7fecda064760]
 22: (()+0x80ca) [0x7fecd910e0ca]
 23: (clone()+0x6d) [0x7fecd762fffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
     0> 2014-06-23 08:34:07.614542 7fecbe57a700 -1 *** Caught signal (Aborted) **
 in thread 7fecbe57a700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x5d44c2) [0x7fecd9f8e4c2]
 2: (()+0xf8f0) [0x7fecd91158f0]
 3: (gsignal()+0x37) [0x7fecd757f407]
 4: (abort()+0x148) [0x7fecd7582508]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x175) [0x7fecd7e6ad65]
 6: (()+0x5edd6) [0x7fecd7e68dd6]
 7: (()+0x5ee21) [0x7fecd7e68e21]
 8: (()+0x5f039) [0x7fecd7e69039]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1e3) [0x7fecda0731e3]
 10: (()+0x55e1cb) [0x7fecd9f181cb]
 11: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x8c7) [0x7fecd9f190d7]
 12: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x93) [0x7fecd9f27203]
 13: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x4a) [0x7fecd9f0c07a]
 14: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x972) [0x7fecd9f10772]
 15: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x186) [0x7fecd9f183f6]
 16: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7fecd9d9202b]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x374) [0x7fecd9bd9104]
 18: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1cf) [0x7fecd9bf4fdf]
 19: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7fecd9c38c6c]
 20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x1390) [0x7fecda063ab0]
 21: (ThreadPool::WorkThread::entry()+0x10) [0x7fecda064760]
 22: (()+0x80ca) [0x7fecd910e0ca]
 23: (clone()+0x6d) [0x7fecd762fffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---
2014-06-23 08:35:52.549587 7f300e0f97c0  0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 8214
2014-06-23 08:35:52.601996 7f300e0f97c0  0 filestore(/var/lib/ceph/osd/ceph-3) mount detected btrfs
2014-06-23 08:35:52.759509 7f300e0f97c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: FIEMAP ioctl is supported and appears to work
2014-06-23 08:35:52.759524 7f300e0f97c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-06-23 08:35:53.310845 7f300e0f97c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-23 08:35:53.310950 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: CLONE_RANGE ioctl is supported
2014-06-23 08:35:54.084294 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: SNAP_CREATE is supported
2014-06-23 08:35:54.099785 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: SNAP_DESTROY is supported
2014-06-23 08:35:54.100296 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: START_SYNC is supported (transid 215110)
2014-06-23 08:35:54.437508 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: WAIT_SYNC is supported
2014-06-23 08:35:54.457076 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: SNAP_CREATE_V2 is supported
2014-06-23 08:35:55.101438 7f300e0f97c0  0 filestore(/var/lib/ceph/osd/ceph-3) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-06-23 08:35:55.106213 7f300e0f97c0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-06-23 08:35:55.106248 7f300e0f97c0  1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 21: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-23 08:35:55.355847 7f300e0f97c0  1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 21: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-23 08:35:59.636873 7f300e0f97c0  1 journal close /var/lib/ceph/osd/ceph-3/journal
2014-06-23 08:35:59.637686 7f300e0f97c0  0 filestore(/var/lib/ceph/osd/ceph-3) mount detected btrfs
2014-06-23 08:35:59.813314 7f300e0f97c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: FIEMAP ioctl is supported and appears to work
2014-06-23 08:35:59.813333 7f300e0f97c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-06-23 08:36:01.187740 7f300e0f97c0  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-23 08:36:01.187843 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: CLONE_RANGE ioctl is supported
2014-06-23 08:36:01.898978 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: SNAP_CREATE is supported
2014-06-23 08:36:01.899616 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: SNAP_DESTROY is supported
2014-06-23 08:36:01.900004 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: START_SYNC is supported (transid 215116)
2014-06-23 08:36:02.293731 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: WAIT_SYNC is supported
2014-06-23 08:36:02.311463 7f300e0f97c0  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-3) detect_feature: SNAP_CREATE_V2 is supported
2014-06-23 08:36:02.872252 7f300e0f97c0  0 filestore(/var/lib/ceph/osd/ceph-3) mount: WRITEAHEAD journal mode explicitly enabled in conf
2014-06-23 08:36:02.876856 7f300e0f97c0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2014-06-23 08:36:02.876923 7f300e0f97c0  1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 23: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-23 08:36:02.901217 7f300e0f97c0  1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 23: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-23 08:36:02.959723 7f300e0f97c0  0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2014-06-23 08:36:02.964149 7f300e0f97c0  0 osd.3 38905 crush map has features 2303210029056, adjusting msgr requires for clients
2014-06-23 08:36:02.964162 7f300e0f97c0  0 osd.3 38905 crush map has features 2578087936000, adjusting msgr requires for mons
2014-06-23 08:36:02.964166 7f300e0f97c0  0 osd.3 38905 crush map has features 2578087936000, adjusting msgr requires for osds
2014-06-23 08:36:02.964179 7f300e0f97c0  0 osd.3 38905 load_pgs
2014-06-23 08:36:28.804657 7f300e0f97c0  0 osd.3 38905 load_pgs opened 172 pgs
2014-06-23 08:36:28.859886 7f2ff971a700  0 osd.3 38905 ignoring osdmap until we have initialized
2014-06-23 08:36:28.859995 7f2ff971a700  0 osd.3 38905 ignoring osdmap until we have initialized
2014-06-23 08:36:28.951890 7f300e0f97c0  0 osd.3 38905 done with init, starting boot process
2014-06-23 08:36:30.167677 7f2ff8718700  0 osd.3 38910 crush map has features 2578087936000, adjusting msgr requires for mons
2014-06-23 08:36:31.245243 7f2ff8718700  0 osd.3 38911 crush map has features 2578087936000, adjusting msgr requires for mons
2014-06-23 08:40:49.971715 7f2ff971a700  0 osd.3 38912 crush map has features 2578087936000, adjusting msgr requires for mons
2014-06-23 08:40:50.152739 7f2ff270c700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f2ff270c700 time 2014-06-23 08:40:50.150396
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x55e1cb) [0x7f300e6a21cb]
 2: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x8c7) [0x7f300e6a30d7]
 3: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x93) [0x7f300e6b1203]
 4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x4a) [0x7f300e69607a]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x972) [0x7f300e69a772]
 6: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x186) [0x7f300e6a23f6]
 7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7f300e51c02b]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x374) [0x7f300e363104]
 9: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1cf) [0x7f300e37efdf]
 10: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7f300e3c2c6c]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x1390) [0x7f300e7edab0]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f300e7ee760]
 13: (()+0x80ca) [0x7f300d8980ca]
 14: (clone()+0x6d) [0x7f300bdb9ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
-10000> 2014-06-23 08:40:25.049314 7f2ff270c700  5 -- op tracker -- , seq: 10201, time: 2014-06-23 08:40:25.049314, event: started, request: osd_sub_op(client.4205572.1:25905681 2.13 23243613/rb.0.31cd20.238e1f29.00000000a21b/head//2 [] v 38911'18861948 snapset=0=[]:[] snapc=0=[]) v10
 -9999> 2014-06-23 08:40:25.049383 7f2ff270c700  5 -- op tracker -- , seq: 10201, time: 2014-06-23 08:40:25.049383, event: started, request: osd_sub_op(client.4205572.1:25905681 2.13 23243613/rb.0.31cd20.238e1f29.00000000a21b/head//2 [] v 38911'18861948 snapset=0=[]:[] snapc=0=[]) v10
...
...
...
    -3> 2014-06-23 08:40:50.150251 7f2ff8718700  5 -- op tracker -- , seq: 10864, time: 2014-06-23 08:40:50.150251, event: started, request: pg_query(14.27s0 epoch 38912) v3
    -2> 2014-06-23 08:40:50.150272 7f2ff8718700  5 -- op tracker -- , seq: 10864, time: 2014-06-23 08:40:50.150272, event: done, request: pg_query(14.27s0 epoch 38912) v3
    -1> 2014-06-23 08:40:50.150310 7f2ff2f0d700  1 -- 192.168.0.2:6806/8214 --> 192.168.0.250:6830/13227 -- pg_notify(14.27s0(1) epoch 38912) v5 -- ?+0 0x7f3038877880 con 0x7f3034e9d1e0
     0> 2014-06-23 08:40:50.152739 7f2ff270c700 -1 osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f2ff270c700 time 2014-06-23 08:40:50.150396
osd/ECBackend.cc: 529: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x55e1cb) [0x7f300e6a21cb]
 2: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x8c7) [0x7f300e6a30d7]
 3: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x93) [0x7f300e6b1203]
 4: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x4a) [0x7f300e69607a]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x972) [0x7f300e69a772]
 6: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x186) [0x7f300e6a23f6]
 7: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7f300e51c02b]
 8: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x374) [0x7f300e363104]
 9: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1cf) [0x7f300e37efdf]
 10: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7f300e3c2c6c]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0x1390) [0x7f300e7edab0]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f300e7ee760]
 13: (()+0x80ca) [0x7f300d8980ca]
 14: (clone()+0x6d) [0x7f300bdb9ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---
2014-06-23 08:40:50.234635 7f2ff270c700 -1 *** Caught signal (Aborted) **
 in thread 7f2ff270c700

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (()+0x5d44c2) [0x7f300e7184c2]
 2: (()+0xf8f0) [0x7f300d89f8f0]
 3: (gsignal()+0x37) [0x7f300bd09407]
 4: (abort()+0x148) [0x7f300bd0c508]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x175) [0x7f300c5f4d65]
 6: (()+0x5edd6) [0x7f300c5f2dd6]
 7: (()+0x5ee21) [0x7f300c5f2e21]
 8: (()+0x5f039) [0x7f300c5f3039]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1e3) [0x7f300e7fd1e3]
 10: (()+0x55e1cb) [0x7f300e6a21cb]
 11: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, boost::optional<std::map<std::string, ceph::buffer::list, std::less<std::string>, std::allocator<std::pair<std::string const, ceph::buffer::list> > > >, RecoveryMessages*)+0x8c7) [0x7f300e6a30d7]
 12: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x93) [0x7f300e6b1203]
 13: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x4a) [0x7f300e69607a]
 14: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*)+0x972) [0x7f300e69a772]
 15: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x186) [0x7f300e6a23f6]
 16: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x2db) [0x7f300e51c02b]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x374) [0x7f300e363104]
 18: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x1cf) [0x7f300e37efdf]
 19: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x9c) [0x7f300e3c2c6c]
 20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x1390) [0x7f300e7edab0]
 21: (ThreadPool::WorkThread::entry()+0x10) [0x7f300e7ee760]
 22: (()+0x80ca) [0x7f300d8980ca]
 23: (clone()+0x6d) [0x7f300bdb9ffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---


Files

ceph-osd.3.log.xz (5.73 MB) ceph-osd.3.log.xz Dmitry Smirnov, 06/23/2014 12:24 PM
ceph-osd.0.log.txt.xz (1.28 MB) ceph-osd.0.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.1.log.txt.xz (1.14 MB) ceph-osd.1.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.2.log.txt.xz (530 KB) ceph-osd.2.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.3.log.txt.xz (1.54 MB) ceph-osd.3.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.4.log.txt.xz (2.49 MB) ceph-osd.4.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.5.log.txt.xz (1.98 MB) ceph-osd.5.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.6.log.txt.xz (534 KB) ceph-osd.6.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.7.log.txt.xz (108 KB) ceph-osd.7.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.9.log.txt.xz (256 KB) ceph-osd.9.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.10.log.txt.xz (61.5 KB) ceph-osd.10.log.txt.xz Dmitry Smirnov, 06/24/2014 09:57 PM
ceph-osd.11.log.txt.xz (5.3 MB) ceph-osd.11.log.txt.xz Dmitry Smirnov, 06/24/2014 10:06 PM
ceph-osd.12.log.txt.xz (667 KB) ceph-osd.12.log.txt.xz Dmitry Smirnov, 06/24/2014 10:06 PM
ceph-osd.10.log.keyvaluestore-dev.xz (13.2 KB) ceph-osd.10.log.keyvaluestore-dev.xz Dmitry Smirnov, 07/01/2014 12:25 PM

Related issues: 1 (0 open, 1 closed)

Related to Ceph - Bug #8660: pg in forever "down+peering" state (Closed, 06/25/2014)

Actions #1

Updated by Sage Weil almost 10 years ago

  • Priority changed from Normal to Urgent
Actions #2

Updated by Samuel Just almost 10 years ago

  • Status changed from New to Need More Info

Please reproduce with
debug osd = 20
debug filestore = 20
debug ms = 1
on all osds and attach those logs.
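
A minimal sketch of how those levels might be applied, either persistently in ceph.conf or injected at runtime (the wildcard injectargs form is an assumption about this release; fall back to per-OSD `ceph tell osd.N injectargs` if it is not accepted):

    # ceph.conf on every OSD host, then restart the daemons
    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1

    # or, without a restart
    ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'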

Actions #3

Updated by Dmitry Smirnov almost 10 years ago

Here is the first log from the affected OSD with increased verbosity, captured from startup to crash.
Attaching logs from all OSDs will take time; please advise if that is still necessary. Thanks.

Actions #4

Updated by Samuel Just almost 10 years ago

It is still necessary.

Actions #5

Updated by Dmitry Smirnov almost 10 years ago

Here are the detailed logs. This time osd.11 crashed as described. I tried to take the affected OSD out, but the error just moved to another OSD, so recovery/remapping/backfilling never finished.

Actions #6

Updated by Dmitry Smirnov almost 10 years ago

Adding the remaining logs (I can attach only 10 files at once)...

Actions #7

Updated by Samuel Just almost 10 years ago

I think 248520c7/100001015d5.00000000/head//14 was inconsistently updated (pg 14.7). You can probably clear this issue by truncating it to size 0 on all replicas where you find it (to match the metadata). Probably osd 4 and osd 11? I suspect something wonky is going on with your journal. What kernel version are you running? Are you using block journals or file journals? What fs are you using? Is it the same on all osds? What settings are you using for the ec pool?
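
A minimal sketch of that repair on one replica, assuming the on-disk layout shown later in this ticket (the find pattern is an example; the OSD must be stopped first, and the same is repeated on every OSD holding a shard of the object):

    /etc/init.d/ceph stop osd.11
    # locate the on-disk file backing object 100001015d5.00000000 in pg 14.7
    find /var/lib/ceph/osd/ceph-11/current -name '100001015d5.00000000__head_248520C7__*'
    # truncate the file found above to 0 bytes so it matches the object metadata
    truncate --size=0 <path printed by find>
    /etc/init.d/ceph start osd.11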

Actions #8

Updated by Dmitry Smirnov almost 10 years ago

The problem has gotten worse; see #8660, which is probably another manifestation of this issue...

Samuel Just wrote:

I think 248520c7/100001015d5.00000000/head//14 was inconsistently updated (pg 14.7). You can probably clear this issue by truncating it to size 0 on all replicas where you find it (to match the metadata). Probably osd 4 and osd 11?

I'm not sure how to do it, sorry. On OSD.11 I found the file

/var/lib/ceph/osd/ceph-11/current/14.7s2_head/DIR_7/DIR_C/DIR_0/100001015d5.00000000__head_248520C7__e_ffffffffffffffff_2

so I stopped OSD.11 and issued the following command:

truncate 100001015d5.00000000__head_248520C7__e_ffffffffffffffff_2 --size=0

Could you please confirm if I did what you meant?

This action did not fix this (or the #8660) problem. OSD.11 still crashes within a few minutes of starting.

I suspect something wonky is going on with your journal.

Hmm, why?

What kernel version are you running?

3.14.4 and 3.14.7.

Are you using block journals or file journals?

File journals on ext4 on SSDs (only journals on SSDs).

What fs are you using? Is it the same on all osds?

Most OSDs are on Btrfs (with snapshots disabled). Some are on ext4 (I was diversifying to avoid kernel crashes related to Btrfs snapshotting, but now I'm happy with Btrfs stability). Only OSD.{1,11} are on ext4, but it looks like the problem originally appeared on a Btrfs-based OSD.

What settings are you using for the ec pool?

directory=/usr/lib/x86_64-linux-gnu/ceph/erasure-code
k=2
m=2
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van
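
For reference, a profile with these settings would be created and inspected roughly like this on Firefly (a sketch; the profile name, pool name and PG counts are made up for illustration):

    ceph osd erasure-code-profile set myprofile k=2 m=2 \
        plugin=jerasure technique=reed_sol_van ruleset-failure-domain=host
    ceph osd erasure-code-profile get myprofile
    ceph osd pool create ecpool 128 128 erasure myprofile
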
Actions #9

Updated by Samuel Just almost 10 years ago

Try disabling journal aio. I've seen similar behavior on another ticket with those kernel versions. I think something is preventing the osd journal from working right. You'll probably have to truncate the similar file on osd 4 as well.
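
Presumably that means something like the following ceph.conf fragment, followed by an OSD restart (a sketch; as noted further down in this ticket, the option may be a no-op for file-based journals, where aio is already disabled):

    [osd]
        journal aio = false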

Actions #10

Updated by Samuel Just almost 10 years ago

All of your journals are on ext4? Or are the journals on the btrfs osds also on btrfs?

Actions #11

Updated by Samuel Just almost 10 years ago

  • Assignee set to Samuel Just
Actions #12

Updated by Dmitry Smirnov almost 10 years ago

All journals are on ext4 on SSD. I'll disable "journal aio", but I don't understand how the journal can be an issue. There was no power-down or unclean reboot. Actually, there were no reboots for a while, and this issue was not picked up by deep-scrub. Could it be that the journal is sensitive to OSD crashes? As usual, there were a few OSD crashes during backfill, #8495 in particular...

I'm still unable to resolve this issue... I wish I could pick up some of your log-interpreting skills.
How did you determine that "100001015d5.00000000__head_248520C7__e_ffffffffffffffff_2" should be truncated?
I truncated it on OSD.4 as well, but it's no use, as there may be more files to truncate...

Actions #13

Updated by Dmitry Smirnov almost 10 years ago

I need more help here, please. OSD.11 is still crashing as soon as it starts. Earlier I set it "out", so it has little data left, but apparently it is critical for recovering PG 14.7.
Frankly, Ceph feels very fragile, as it should not be possible to lose a PG on a healthy cluster without any HDD/OSD failures. In my case recovery should be possible without two OSDs, yet it doesn't work as expected and I have little hope for data recovery... :( Those OSD crashes should not be tolerated...

I suspect the PG peering problem #8660 could be related to my experiments with "ceph osd primary-affinity", but I fail to (re-)set it to bring PG 14.7 back to life. Is there any OSD repair utility or instructions? There are no file system errors on OSD.11, so this problem appears to be a filestore issue or a lack of error handling in the OSD code.

Actions #14

Updated by Samuel Just almost 10 years ago

No, you are using erasure coding with M=K=2. You will need at least two osds to recover the pg. That part is correct. I determined that that object was the problem by following the thread which asserted back to determine which object it was attempting to recover when it crashed. I then followed that back further and determined that for some reason the on disk size does not match the metadata.

The osd journal is used pretty much any time the osd restarts. If the aio subsystem is somehow corrupting or incorrectly reporting success for journal writes, upon restart the osd might lose committed updates, resulting, perhaps, in what we are seeing here.

Are you using this combined with a cache tier? 100001015d5.00000000 looks like an rbd block I think?

Actions #15

Updated by Samuel Just almost 10 years ago

No, I guess it looks like a cephfs block. Is this cephfs with caching? Or somehow cephfs without caching?

Actions #16

Updated by Dmitry Smirnov almost 10 years ago

Thanks for your help and explanations, Samuel.

Samuel Just wrote:

No, you are using erasure coding with M=K=2. You will need at least two osds to recover the pg. That part is correct.

OK, so that PG (14.7) is lost, right? I still don't understand how it happened that 3 of 4 parts were lost (m=2, k=2 should allow recovery from two failures, right?). It is alarming that such a thing could happen without any HDD failures. IMHO an OSD should never crash, so there is a bug, and a serious one. Mark the PG 'incomplete' and/or log errors, but don't crash. I think in this incident, because the OSD was down for too long, other PG pieces may have become lost before they had a chance to be recovered.

So what are my recovery options, assuming there is no way to fix the broken PG? There is no hope of fixing it, right?
Should I drop the EC pool and re-create it? Is there any way to drop just one PG? Or any other ideas?

Are you using this combined with a cache tier? 100001015d5.00000000 looks like an rbd block I think?

Yes, of course. That appears to be the only way to use an EC pool for RBD images and CephFS (I use both on the caching pool backed by the EC pool).
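
For context, the usual Firefly-era sequence for fronting an EC pool with a replicated cache tier looks roughly like this (a sketch; pool names and PG count are illustrative):

    ceph osd pool create cachepool 128
    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool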

Actions #17

Updated by Samuel Just almost 10 years ago

Are you certain that you have never had cephfs writing directly to this pool? I have a theory for the size mismatch, but it requires that cephfs created this object directly on the ec pool (which of course wouldn't work for long, but it might have appeared to for a brief amount of time).

Actions #18

Updated by Samuel Just almost 10 years ago

Dropping the pool containing pg 14.7 would be an option. At what version did you start using this pool? I think I recall some bugs which might have caused this.

Actions #19

Updated by Dmitry Smirnov almost 10 years ago

Samuel Just wrote:

Are you certain that you have never had cephfs writing directly to this pool?

Yes, absolutely. :)
I've learned it the hard way (see #8623).

After my early attempt to use CephFS directly on the EC pool, I removed all files, unregistered the EC pool from the MDS, and dropped all objects from the EC pool (or maybe even recreated the EC pool from scratch). I started fresh by creating a caching pool on top of the empty EC pool. Only the caching pool is registered on the MDS and can be accessed through a directory on CephFS.

Unless my earlier experiments with the EC pool from over a week ago came back and destroyed the OSD, I doubt it would be possible to somehow inherit those mismatches...

Can we do something about the crash to keep the OSD operational in such a situation? Then we could close this bug, I presume...

Actions #20

Updated by Dmitry Smirnov almost 10 years ago

Samuel Just wrote:

Dropping the pool containing pg 14.7 would be an option.

I was afraid you would say that... :( I reckon there is no hope for the data...

At what version did you start using this pool? I think I recall some bugs which might have caused this.

0.80.1

Actions #21

Updated by Dmitry Smirnov almost 10 years ago

Hi Samuel,

As advised I introduced

journal aio = false

which I doubt will do anything, because even before the change OSDs were logging the following:
 0 filestore(/var/lib/ceph/osd/ceph-3) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
-1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
 1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 21: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
 1 journal _open /var/lib/ceph/osd/ceph-3/journal fd 21: 4294967296 bytes, block size 4096 bytes, directio = 1, aio = 0
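
One way to double-check what a running daemon actually uses is the admin socket (a sketch; the socket path assumes the default layout):

    ceph --admin-daemon /var/run/ceph/ceph-osd.3.asok config show | grep journal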

Earlier I rebuilt OSD.11 from scratch. Now, all of a sudden, I have five OSDs down and not starting:

     0> 2014-06-27 16:14:04.881979 7fe84d0d1700 -1 osd/PGLog.cc: In function 'void PGLog::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread 7fe84d0d1700 time 2014-06-27 16:14:04.879447
osd/PGLog.cc: 512: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

If one more OSD dies during recovery, I may not be able to save even the replicated pools. :((
Losing data sucks, and I have almost lost all confidence in Ceph...
I'm so dangerously close to downtime and loss of all my data that it's not funny... :(

Any ideas? What does the above error mean? Thanks.

Actions #22

Updated by Samuel Just almost 10 years ago

If you can reproduce that crash, please attach a log with
debug osd = 20
debug filestore = 20
debug ms = 1

Actions #23

Updated by Dmitry Smirnov almost 10 years ago

First, let's focus on the original issue, please. Some hours later I was able to start the OSDs that were previously dying with

FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

But OSDs keep crashing very often, so to keep the cluster operational and to rescue some data I had to run the following on all OSD hosts except one:

while sleep 16; do /etc/init.d/ceph start osd; done
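
A slightly gentler variant of that loop (a sketch, not from the original report) would invoke the init script only when no ceph-osd process is left running on the host:

    while sleep 16; do
        pgrep -x ceph-osd >/dev/null || /etc/init.d/ceph start osd
    done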

Running this rescue loop is terribly slow, as almost always some PGs are in the "peering" state, but at least I can rescue some data.
That was the only way I could prevent the cluster from going down, because now almost all OSDs are crashing as follows:

     0> 2014-06-28 08:37:13.323351 7fe1dd61b700 -1 osd/ECUtil.h: In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int, ceph::buffer::list>&)' thread 7fe1dd61b700 time 2014-06-28 08:37:13.321152
osd/ECUtil.h: 116: FAILED assert(to_append.size() == cumulative_shard_hashes.size())

 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (TransGenerator::operator()(ECTransaction::AppendOp const&)+0x1840) [0x7fe1fcf96890]
 2: (ECTransaction::generate_transactions(std::map<hobject_t, std::tr1::shared_ptr<ECUtil::HashInfo>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::tr1::shared_ptr<ECUtil::HashInfo> > > >&, std::tr1::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t const&, std::map<unsigned char, ObjectStore::Transaction, std::less<unsigned char>, std::allocator<std::pair<unsigned char const, ObjectStore::Transaction> > >*, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >*) const+0x1c6) [0x7fe1fcf8faa6]
 3: (ECBackend::start_write(ECBackend::Op*)+0x1e5) [0x7fe1fcf730e5]
 4: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::tr1::shared_ptr<OpRequest>)+0xed6) [0x7fe1fcf750e6]
 5: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x4be) [0x7fe1fce0784e]
 6: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1187) [0x7fe1fce53e37]
 7: (CopyFromCallback::finish(boost::tuples::tuple<int, ReplicatedPG::CopyResults*, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>)+0x38) [0x7fe1fce94b78]
 8: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x4df) [0x7fe1fce2a4cf]
 9: (C_Copyfrom::finish(int)+0x12a) [0x7fe1fce9425a]
 10: (Context::complete(int)+0x9) [0x7fe1fcc90559]
 11: (Finisher::finisher_thread_entry()+0x1b8) [0x7fe1fd00fcd8]
 12: (()+0x80ca) [0x7fe1fc16e0ca]
 13: (clone()+0x6d) [0x7fe1fa68fffd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Can you produce a patch to put the PG into the "incomplete" or "inconsistent" state (whichever is appropriate) instead of crashing?

For now my priority is to stabilise the situation and drop the EC pool as soon as I rescue enough data from it.
I'll flood you with logs for the remaining issues later. ;)

Actions #24

Updated by Samuel Just almost 10 years ago

It's not really practical to do that, partly because it would result in the pool becoming unusable anyway. You appear to have severely corrupted on-disk state, perhaps due to something in your hardware or kernel preventing the osd from maintaining transactions properly. It would be good to get to the bottom of that, but the simplest way to do that will involve starting from a clean cluster and trying to narrow down the conditions which are causing inconsistency.

osd/ECUtil.h: 116: FAILED assert(to_append.size() == cumulative_shard_hashes.size())

is probably due to some writes happening on a corrupted object. You probably want to stop all writes while you recover data and then drop the ec pool (and associated cache pools). Clearly, we have some work to do on improving the robustness of EC pools and cache pools with corrupted on-disk state.

Actions #25

Updated by Dmitry Smirnov almost 10 years ago

I'm a bit sceptical of your analysis. I have some confidence in my systems. Due to concerns regarding data integrity I do routine checks of file systems for corruption, monitor logs, and do hardware testing before adding new components to the cluster. Of course it is possible that something funny is going on in the kernel etc., but as far as I can tell problems manifest only when I manually initiate remapping/backfilling by replacing an OSD. Only then do half a dozen uncaught asserts plague OSDs at random, and the degraded cluster just gets worse with every crash.

Besides, I doubt it is practical to submit detailed logs when several OSDs (two or more) crash every 30 seconds.
At least I could not allow it to continue any longer, or I would risk giving up on Ceph...

Now I have dropped the EC and cache pools, sacrificing the data on them (yet again). The cluster is now healthy. I'm going to do some scrubbing and maintenance. Then I hope to isolate the problem and submit more logs if I notice any problems in OSDs or their logs.

I do believe that the development focus (and priority) should be on OSD stability. Crashes are extremely unfriendly and dangerous. It is difficult to recover a cluster when the notorious "domino effect" (several OSDs dying together) prevents successful recovery (backfilling, remapping, etc.).

Perhaps my experience demonstrates a "corrupt one PG and risk losing the entire cluster" situation, because during recovery the error propagated to other OSDs and brought the entire cluster down by crashing more than half of the OSDs.
This is very fragile indeed...

Actions #26

Updated by Dmitry Smirnov almost 10 years ago

Although recovery was an extremely frustrating experience for me due to all those spontaneous OSD crashes, I may have found a culprit for this issue.
Shortly before the problem manifested, I replaced a few OSDs using the following procedure:

  • ceph osd out {N}
  • ... wait till remapping finish
  • /etc/init.d/ceph stop osd.{N}
  • umount /var/lib/ceph/osd/ceph-{N}
  • mkfs.btrfs -L osd.{N} /dev/mapper/osd.{N} -f
  • mount -v /var/lib/ceph/osd/ceph-{N}
  • ... create "keyring" (file) and "journal" (symlink to "osd.{N}.journal" at SSD mount point)
    in "/var/lib/ceph/osd/ceph-{N}"

However, I forgot to remove the old journal file before issuing the following command:

  • ceph-osd -i {N} --mkfs --mkjournal

which produced the following (sample) output:

 -1 journal check: ondisk fsid c36bb6d9-d028-4160-b0c7-bbde4ed3662a doesn't match expected aae08a37-99b8-41b9-b3a7-0a8ad45ce9d0, invalid (someone else's?) journal
...
 -1 created object store /var/lib/ceph/osd/ceph-{N} journal /var/lib/ceph/osd/ceph-{N}/journal for osd.{N} fsid {UUID}

Contrary to what one could expect, `ceph-osd` did not blank out the journal file despite reporting successful journal creation.
The warning about the existing journal did not trigger creation of a new journal: `ceph-osd` merely wrote a new UUID signature to the existing journal file.

As I just confirmed, "ceph-osd --mkjournal" does not blank (fill with zeroes) an existing journal.

I suspect that garbage from the old journal may have been flushed in when I started the new (presumably blank) OSD.
If this could be responsible for the corruption, then we have two issues:

  • a problem with `ceph-osd` journal creation when the old journal is not cleared (see the sketch below)
  • a problem with extremely unstable/fragile OSDs blocking recovery by crashing when a filestore inconsistency is encountered.

IMHO unreliable OSDs are dangerous for recovery. Stability is most appreciated during recovery, but that's exactly when OSDs are most vulnerable.
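
Putting the corrected replacement procedure together as one sketch (the SSD journal path is an example; the key added step is clearing the stale journal before --mkjournal):

    ceph osd out {N}
    # ... wait for remapping to finish ...
    /etc/init.d/ceph stop osd.{N}
    umount /var/lib/ceph/osd/ceph-{N}
    mkfs.btrfs -L osd.{N} /dev/mapper/osd.{N} -f
    mount -v /var/lib/ceph/osd/ceph-{N}
    # recreate the "keyring" file and the "journal" symlink as before, then
    # remove the stale journal file so --mkjournal starts from a clean slate
    rm -f /path/to/ssd/osd.{N}.journal
    ceph-osd -i {N} --mkfs --mkjournal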

Actions #27

Updated by Samuel Just almost 10 years ago

Hmm. I looked into the journal code. Each entry is checksummed, and the checksum actually includes the uuid. Rewriting the journal header (including the uuid) actually is sufficient to clear the journal. IIRC, the only reason we zero a new one is to force the filesystem to actually allocate the space. I think there must be something else going on. The recreating-osd angle is intriguing though; I'll look into adding that to our cluster thrasher. Perhaps that caused some interaction we don't currently test.

btrfs might also be a good angle. Most serious users I've seen are using xfs, so perhaps we are missing something in the filestore code specialized for btrfs.

Actions #28

Updated by Dmitry Smirnov almost 10 years ago

Samuel Just wrote:

Hmm. I looked into the journal code. Each entry is checksumed, and the checksum actually includes the uuid. Rewriting the journal header (including the uuid) actually is sufficient to clear the journal. IIRC, the only reason we 0 a new one is to force the filesystem to actually allocate the space.

Interesting, thanks for explaining how it works.

I think there must be something else going on. The recreating osd angle is intriguing though, I'll look into adding that to our cluster thrasher. Perhaps that caused a some interaction we don't currently test.

I'm not so sure about this... A re-created OSD needs to be completely dropped from the cluster using "ceph osd rm {N}" and created with "ceph osd create" before it becomes operational. I don't see how this procedure could fail to work properly...

btrfs might also be a good angle. Most serious users I've seen are using xfs, so perhaps we are missing something in the filestore code specialized for btrfs.

We use Btrfs because it is the only file system which does CRC checking on data. Personally, I gave up on XFS because it tended to lose too much data on a kernel crash or sudden loss of power. Besides, fsck needs an enormous amount of RAM to fix XFS, so sometimes it is not possible to repair the file system due to lack of RAM.

We stress-tested Btrfs long enough to build confidence in it. In internal testing it proved very stable (if no snapshots are used), and unlike xfs and ext4 we have never lost data on Btrfs.

Please advise if there are any Btrfs-related OSD/filestore options worth trying.

Actions #29

Updated by Samuel Just almost 10 years ago

Hmm, so you have snapshotting disabled on btrfs? Specifically what config options did you change to do that? I wonder whether the filestore is doing something stupid by respecting the config options in some cases and not in others (since it detects btrfs). There's a whole bunch of logic for preventing replay of non-idempotent transactions when stable snapshots are not used. Perhaps some of that is conditional on btrfs rather than stable_snapshots.

Actions #30

Updated by Dmitry Smirnov almost 10 years ago

Samuel Just wrote:

Hmm, so you have snapshotting disabled on btrfs?

Yes. Otherwise the kernel eventually crashes and the btrfs file system becomes corrupted beyond repair... Disabling snapshots is necessary on Btrfs to avoid reboots and loss of OSDs...

Specifically what config options did you change to do that?

filestore btrfs snap = false
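
For reference, the btrfs-specific filestore toggles from this era would sit in ceph.conf roughly as follows (a sketch; "filestore btrfs clone range" defaults to true and is shown only for completeness):

    [osd]
        filestore btrfs snap = false
        filestore btrfs clone range = true
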
Actions #31

Updated by Dmitry Smirnov almost 10 years ago

Something strange is going on. After this incident I've been running deep-scrub on all OSDs, and some minor errors like this were found:

osd.0 [ERR] 20.1e deep-scrub stat mismatch, got 802/802 objects, 0/0 clones, 798/798 dirty, 0/0 omap, 4/4 hit_set_archive, 0/0 whiteouts, 157648635/157648628 bytes.

"ceph pg repair" fixes it and following deep-scrub found no errors.
However, some hours later, deep-scrub found a similar error on the same PG...
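
For completeness, the usual check-and-repair cycle for this (a sketch; 20.1e is the pg id from the log line above):

    ceph health detail | grep inconsistent
    ceph pg deep-scrub 20.1e
    ceph pg repair 20.1e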

I introduced one OSD with

osd objectstore = keyvaluestore-dev

to see how it behaves, since it has no journal at all.
Just in case, I marked it with
ceph osd primary-affinity osd.{N} 0

There are three other OSDs on the host.

The new leveldb-based OSD is also on Btrfs.
Within minutes after starting, it died with (full OSD log attached):

os/GenericObjectMap.cc: 1118: FAILED assert(start <= header.oid)

I tried to start it several times, but it did not last long.
I found it quite suspicious, so I introduced

journal dio = false

to "ceph.conf" and restarted all OSDs on the host.
Somehow it fixed the issue, and the leveldb-based OSD is now working...

I'm not sure why "dio" has such an effect, but it looks like it might be a good idea to disable it for now...

Actions #32

Updated by Dmitry Smirnov almost 10 years ago

Ah, sorry, the leveldb-based OSD just died again with "FAILED assert(start <= header.oid)", so perhaps "journal dio" did not have any influence on this...

Actions #33

Updated by Samuel Just almost 10 years ago

osd objectstore = keyvaluestore-dev

is so experimental that it's not even in our automated testing. I'm not really very surprised that it caused an issue. If you could open a new normal-priority bug with that log, that would be great. journal dio should have no effect on keyvaluestore-dev anyway. Also, primary affinity only affects whether the osd will be chosen as a primary; it does not affect whether it is chosen to store data. The stat mismatches are probably a result of performing the repairs before.

Actions #34

Updated by Dmitry Smirnov almost 10 years ago

This issue has not happened again since I dropped the broken pool.
To close this bug, I'd suggest dropping the assert: skip the PG/operation when repair is not possible, log an error, and keep serving other requests.
Such behaviour would have saved me days(!) during this incident. Crashing is wrong. If repair is not possible, it should be safe to log, skip, and move on. Thanks.

Actions #35

Updated by Samuel Just almost 10 years ago

  • Status changed from Need More Info to Closed
