Bug #14154

closed

OSDs keep crashing

Added by Ali chips over 8 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Regression:
No
Severity:
3 - minor

Description

We have several OSDs that keep crashing.
Here are the last bits of the log from one of the OSDs:

-4> 2015-12-22 16:26:31.882232 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 2015-12-22 16:26:31.882159, event: header_read, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
-3> 2015-12-22 16:26:31.882242 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 2015-12-22 16:26:31.882160, event: throttled, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
-2> 2015-12-22 16:26:31.882249 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 2015-12-22 16:26:31.882193, event: all_read, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
-1> 2015-12-22 16:26:31.882254 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 0.000000, event: dispatched, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
0> 2015-12-22 16:26:31.892858 7fccef2fe700 -1 *** Caught signal (Aborted) **
in thread 7fccef2fe700
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: /usr/bin/ceph-osd() [0xac8a32]
2: (()+0xf100) [0x7fcd1df7d100]
3: (gsignal()+0x37) [0x7fcd1c9965f7]
4: (abort()+0x148) [0x7fcd1c997ce8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fcd1d29a9d5]
6: (()+0x5e946) [0x7fcd1d298946]
7: (()+0x5e973) [0x7fcd1d298973]
8: (()+0x5eb93) [0x7fcd1d298b93]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc9f7a]
10: (ECBackend::get_hash_info(hobject_t const&)+0x9f1) [0xa2b491]
11: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::tr1::shared_ptr<OpRequest>)+0x608) [0xa309b8]
12: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*)+0x7ba) [0x84583a]
13: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1008) [0x892578]
14: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8977d7]
15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x8332aa]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x695385]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6958a9]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbb929f]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbbb3d0]
20: (()+0x7dc5) [0x7fcd1df75dc5]
21: (clone()+0x6d) [0x7fcd1ca5721d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.205.log
--- end dump of recent events ---
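
Per the NOTE in the backtrace, the raw frame addresses can be mapped back to symbols and source lines against the matching ceph-osd binary. A minimal sketch, assuming the debuginfo for ceph 0.94.5 is installed and the on-disk binary matches the one that crashed (the address 0xa2b491 is frame 10 from the trace above):

    # disassemble with interleaved source, as the NOTE suggests;
    # needs the ceph debuginfo package for source lines to appear
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump

    # resolve a single frame address to function and file:line
    # (-C demangles, -f prints the enclosing function name)
    addr2line -C -f -e /usr/bin/ceph-osd 0xa2b491

Per the backtrace, that frame falls inside ECBackend::get_hash_info, which is where the assertion fires.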

This is my first bug report here, so I am not sure what else could be useful; I'll be happy to provide any additional info.
This has put our cluster in really bad shape, especially since it is happening at the same time that we have a complete node down.

# ceph -s
    cluster fbc973b6-2ef9-4929-8b3c-1580b18b8875
    health HEALTH_WARN
    281 pgs backfill
    2 pgs backfilling
    4107 pgs degraded
    2 pgs down
    2 pgs peering
    3343 pgs recovery_wait
    2449 pgs stuck degraded
    2 pgs stuck inactive
    4092 pgs stuck unclean
    762 pgs stuck undersized
    2419 pgs undersized
    626 requests are blocked > 32 sec
    recovery 28570932/685781130 objects degraded (4.166%)
    recovery 34172588/685781130 objects misplaced (4.983%)
    recovery 16/102757014 unfound (0.000%)
    23/322 in osds are down
    noout,noscrub,nodeep-scrub flag(s) set
    1 mons down, quorum 0,2 monitor01,monitor03
    monmap e21: 3 mons at {monitor01=192.168.217.202:6789/0,monitor02=192.168.217.203:6789/0,monitor03=192.168.217.204:6789/0}
    election epoch 2108, quorum 0,2 monitor01,monitor03
    osdmap e340566: 327 osds: 302 up, 322 in; 1104 remapped pgs
    flags noout,noscrub,nodeep-scrub
    pgmap v43572317: 14380 pgs, 29 pools, 377 TB data, 100348 kobjects
    669 TB used, 263 TB / 932 TB avail
    28570932/685781130 objects degraded (4.166%)
    34172588/685781130 objects misplaced (4.983%)
    16/102757014 unfound (0.000%)
    10216 active+clean
    1532 active+recovery_wait+degraded
    999 active+recovery_wait+undersized+degraded
    663 active+recovery_wait+undersized+degraded+remapped
    527 active+undersized+degraded
    226 active+undersized+degraded+remapped+wait_backfill
    149 active+recovery_wait+degraded+remapped
    49 active+remapped+wait_backfill
    6 active+degraded+remapped+wait_backfill
    4 active+remapped
    2 active+undersized+degraded+remapped+backfilling
    2 active+undersized+degraded+remapped
    1 active+clean+scrubbing+deep
    1 active+clean+scrubbing
    1 down+peering
    1 active+degraded+remapped
    1 down+remapped+peering
    recovery io 1109 kB/s, 0 objects/s
    client io 2813 MB/s rd, 49172 kB/s wr, 2446 op/s
# ceph -v
    ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
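
Since the report notes it is unclear what else would be useful: for crashes like this, the items usually requested are the cluster topology, the stuck PGs, and the full log from a crashing OSD. A sketch of how to gather them (osd.205 is taken from the log_file path in the crash dump above; substitute the actual crashing OSD ids):

    ceph health detail              # expands the HEALTH_WARN summary above
    ceph osd tree                   # shows which OSDs/hosts are down
    ceph pg dump_stuck unclean      # lists the stuck-unclean PGs
    # full log from one crashing OSD, path from the crash dump above
    tar czf osd205-log.tgz /var/log/ceph/ceph-osd.205.log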
#1

Updated by Josh Durgin about 7 years ago

  • Status changed from New to Can't reproduce

In general this kind of error comes from corrupt on-disk state. If it's still happening, re-open and we can investigate.
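
For reference, frame 10 of the backtrace is ECBackend::get_hash_info, which checks the hash metadata stored alongside an erasure-coded shard against what is actually on disk, so an assertion there is consistent with corrupt on-disk state. One way to confirm which assertion (and which object) tripped is to pull the FAILED assert line out of the OSD log, just above the "Caught signal (Aborted)" marker. A sketch, assuming the log path from the dump above:

    # the failed assert (with file:line) is logged shortly before the signal
    grep -B 20 'Caught signal (Aborted)' /var/log/ceph/ceph-osd.205.log \
        | grep -E 'FAILED assert|ECBackend'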
