Bug #4065

Crash of 0.56.2 OSD on Ubuntu 12.04 LTS

Added by Matthias Babisch about 11 years ago. Updated about 11 years ago.

Status: Can't reproduce
Priority: Urgent
Assignee:
Category: -
Target version: -
% Done: 0%
Source: Development
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Hi.

I am new to this and new to Ceph, so please bear with me...
I tried to set up a Ceph cluster here at home to try it out. (I did set unsafe tunables and set my own crushmap.)
I set up one OSD at first and then added another one. After I did that, the cluster started repairing. (I expected that.)

But during this operation my first OSD crashed with an abort (assert?). I would consider that a bug, even if I did something wrong...

The binary was installed today from the European mirror. If you need more information, just ask.

The OSD log contains this at the end:
-3> 2013-02-09 08:21:41.477497 b08bfb40  1 -- 192.168.98.24:6802/11185 --> osd.1 192.168.98.11:6801/7503 -- pg_info(1 pgs e35:2.49) v3 -- ?+0 0xa425480
-2> 2013-02-09 08:21:41.477519 b08bfb40  1 -- 192.168.98.24:6802/11185 --> osd.1 192.168.98.11:6801/7503 -- pg_info(1 pgs e35:1.4a) v3 -- ?+0 0xa425360
-1> 2013-02-09 08:21:41.477542 b08bfb40  1 -- 192.168.98.24:6802/11185 --> osd.1 192.168.98.11:6801/7503 -- pg_info(1 pgs e35:2.36) v3 -- ?+0 0xa425240
 0> 2013-02-09 08:21:41.477484 a88afb40 -1 *** Caught signal (Aborted) **
in thread a88afb40

ceph version 0.56.2 (586538e22afba85c59beda49789ec42024e7a061)
1: /usr/bin/ceph-osd() [0x83e55c3]
2: [0xb7749400]
3: [0xb7749424]
4: (gsignal()+0x4f) [0xb72831df]
5: (abort()+0x175) [0xb7286825]
6: (__gnu_cxx::__verbose_terminate_handler()+0x14d) [0xb74f713d]
7: (()+0xaaed3) [0xb74f4ed3]
8: (()+0xaaf0f) [0xb74f4f0f]
9: (()+0xab05e) [0xb74f505e]
10: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x143) [0x84a82b3]
11: (object_info_t::decode(ceph::buffer::list::iterator&)+0x3e) [0x8515e8e]
12: (ReplicatedPG::send_push(int, int, ObjectRecoveryInfo const&, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0xf19) [0x81c8279]
13: (ReplicatedPG::sub_op_pull(std::tr1::shared_ptr<OpRequest>)+0x3e2) [0x81cc162]
14: (ReplicatedPG::do_sub_op(std::tr1::shared_ptr<OpRequest>)+0x18f) [0x81f24bf]
15: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x1b1) [0x82cd8c1]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>)+0x373) [0x822e303]
17: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>)+0x461) [0x8244dc1]
18: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_process(boost::intrusive_ptr<PG>, ThreadPool::TPHandle&)+0x3b) [0x82823fb]
19: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG> >::_void_process(void*, ThreadPool::TPHandle&)+0x97) [0x8282617]
20: (ThreadPool::worker(ThreadPool::WorkThread*)+0x488) [0x849b228]
21: (ThreadPool::WorkThread::entry()+0x22) [0x849d342]
22: (Thread::_entry_func(void*)+0xf) [0x849246f]
23: (()+0x6d4c) [0xb75bed4c]
24: (clone()+0x5e) [0xb7343d3e]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 100000
max_new 1000
log_file /var/log/ceph/ceph-osd.0.log
--- end dump of recent events ---
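
To illustrate what frames 5-11 of the backtrace seem to show, here is a minimal sketch (not Ceph code; BufferIter and ObjectInfo are made-up stand-ins for ceph::buffer::list::iterator and object_info_t): a decode routine asks for more bytes than the buffer actually holds, the copy throws, nothing on the worker thread catches the exception, and std::terminate() ends in abort().

    // Sketch only: a truncated/corrupt encoded blob makes decode() throw,
    // the exception is never caught, and the process aborts (SIGABRT).
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <stdexcept>
    #include <vector>

    struct BufferIter {                      // stand-in for ceph::buffer::list::iterator
        std::vector<char> data;
        std::size_t off = 0;
        void copy(std::size_t len, char *dst) {
            if (off + len > data.size())     // fewer bytes left than requested
                throw std::runtime_error("end of buffer");
            std::memcpy(dst, data.data() + off, len);
            off += len;
        }
    };

    struct ObjectInfo {                      // stand-in for object_info_t
        std::uint64_t size = 0;
        void decode(BufferIter &p) {         // assumes the encoded blob is intact
            p.copy(sizeof(size), reinterpret_cast<char *>(&size));
        }
    };

    int main() {
        BufferIter it;
        it.data = {1, 2, 3};                 // 3 bytes where 8 are expected
        ObjectInfo oi;
        oi.decode(it);                       // throws -> std::terminate() -> abort()
    }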

last-1000-lines.txt (155 KB) Matthias Babisch, 02/11/2013 09:04 PM

History

#1 Updated by Ian Colle about 11 years ago

  • Assignee set to Samuel Just
  • Priority changed from Normal to Urgent

#2 Updated by Samuel Just about 11 years ago

I suspect the log included more output. Can you attach the previous 1000 lines?
-Sam

#3 Updated by Samuel Just about 11 years ago

  • Status changed from New to Need More Info

#4 Updated by Samuel Just about 11 years ago

  • Project changed from CephFS to Ceph

#5 Updated by Matthias Babisch about 11 years ago

Last 1000 lines of the log, no problem.

#6 Updated by Samuel Just about 11 years ago

There appears to be a corrupt attribute on one of the objects on that OSD. If this is reproducible, can you restart the osd with
debug osd = 20
debug filestore = 20
debug ms = 1

in the ceph.conf under the [osd] section?
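
For reference, assuming the usual /etc/ceph/ceph.conf layout, that would look like:

    [osd]
        debug osd = 20
        debug filestore = 20
        debug ms = 1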

Also, what filesystem are you using? Did you see anything in dmesg that might indicate a filesystem error?
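
For example, something like the following would show recent kernel errors and the drive's SMART status (the device name is a placeholder; smartctl requires the smartmontools package):

    dmesg | grep -iE 'btrfs|error'
    smartctl -a /dev/sdX    # replace sdX with the disk backing the OSD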

#7 Updated by Matthias Babisch about 11 years ago

I am using btrfs, and it seems the disk in question is failing. Sadly, I didn't notice before; I suspect this is the cause of the trouble. There are several bad blocks which are not repairable.

I checked the disk before using it, but not after...

Do you want me to reproduce the problems anyway?

#8 Updated by Ian Colle about 11 years ago

  • Status changed from Need More Info to Can't reproduce

Appears to be a disk issue.
