Bug #14154

closed

OSDs keep crashing

Added by Ali chips over 8 years ago. Updated about 7 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Regression:
No
Severity:
3 - minor

Description

We have several OSDs that keep crashing.
Here are the last bits of the log from one of the OSDs:

-4> 2015-12-22 16:26:31.882232 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 2015-12-22 16:26:31.882159, event: header_read, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
-3> 2015-12-22 16:26:31.882242 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 2015-12-22 16:26:31.882160, event: throttled, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
-2> 2015-12-22 16:26:31.882249 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 2015-12-22 16:26:31.882193, event: all_read, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
-1> 2015-12-22 16:26:31.882254 7fccd63b6700  5 -- op tracker -- seq: 2776, time: 0.000000, event: dispatched, op: MOSDECSubOpRead(108.3dds4 340537 ECSubRead(tid=15755, to_read={36b52bdd/default.36046497.3__shadow_614153467.2~N5hssNbnSMjn2UXMtsONQ92838IhZmd.74_1/head//108=0,349536,0}, attrs_to_read=))
0> 2015-12-22 16:26:31.892858 7fccef2fe700 -1 *** Caught signal (Aborted) **
in thread 7fccef2fe700
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: /usr/bin/ceph-osd() [0xac8a32]
2: (()+0xf100) [0x7fcd1df7d100]
3: (gsignal()+0x37) [0x7fcd1c9965f7]
4: (abort()+0x148) [0x7fcd1c997ce8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fcd1d29a9d5]
6: (()+0x5e946) [0x7fcd1d298946]
7: (()+0x5e973) [0x7fcd1d298973]
8: (()+0x5eb93) [0x7fcd1d298b93]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xbc9f7a]
10: (ECBackend::get_hash_info(hobject_t const&)+0x9f1) [0xa2b491]
11: (ECBackend::submit_transaction(hobject_t const&, eversion_t const&, PGBackend::PGTransaction*, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, boost::optional<pg_hit_set_history_t>&, Context*, Context*, Context*, unsigned long, osd_reqid_t, std::tr1::shared_ptr<OpRequest>)+0x608) [0xa309b8]
12: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*)+0x7ba) [0x84583a]
13: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0x1008) [0x892578]
14: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x45d7) [0x8977d7]
15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x8332aa]
16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x695385]
17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6958a9]
18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbb929f]
19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbbb3d0]
20: (()+0x7dc5) [0x7fcd1df75dc5]
21: (clone()+0x6d) [0x7fcd1ca5721d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.205.log
--- end dump of recent events ---
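
Per the NOTE in the backtrace, the raw frame addresses can be mapped back to symbols and source lines against the matching ceph-osd binary. A minimal sketch, assuming the debuginfo for ceph 0.94.5 is installed and the on-disk binary matches the one that crashed (the address 0xa2b491 is frame 10 from the trace above):

    # disassemble with interleaved source, as the NOTE suggests;
    # needs the ceph debuginfo package for source lines to appear
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.objdump

    # resolve a single frame address to function and file:line
    # (-C demangles, -f prints the enclosing function name)
    addr2line -C -f -e /usr/bin/ceph-osd 0xa2b491

Per the backtrace, that frame falls inside ECBackend::get_hash_info, which is where the assertion fires.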

This is my first bug report here, so I am not sure what else could be useful; I'll be happy to provide any additional info.
This has put our cluster in really bad shape, especially since it is happening at the same time that we have a complete node down.

# ceph -s
    cluster fbc973b6-2ef9-4929-8b3c-1580b18b8875
    health HEALTH_WARN
    281 pgs backfill
    2 pgs backfilling
    4107 pgs degraded
    2 pgs down
    2 pgs peering
    3343 pgs recovery_wait
    2449 pgs stuck degraded
    2 pgs stuck inactive
    4092 pgs stuck unclean
    762 pgs stuck undersized
    2419 pgs undersized
    626 requests are blocked > 32 sec
    recovery 28570932/685781130 objects degraded (4.166%)
    recovery 34172588/685781130 objects misplaced (4.983%)
    recovery 16/102757014 unfound (0.000%)
    23/322 in osds are down
    noout,noscrub,nodeep-scrub flag(s) set
    1 mons down, quorum 0,2 monitor01,monitor03
    monmap e21: 3 mons at {monitor01=192.168.217.202:6789/0,monitor02=192.168.217.203:6789/0,monitor03=192.168.217.204:6789/0}
    election epoch 2108, quorum 0,2 monitor01,monitor03
    osdmap e340566: 327 osds: 302 up, 322 in; 1104 remapped pgs
    flags noout,noscrub,nodeep-scrub
    pgmap v43572317: 14380 pgs, 29 pools, 377 TB data, 100348 kobjects
    669 TB used, 263 TB / 932 TB avail
    28570932/685781130 objects degraded (4.166%)
    34172588/685781130 objects misplaced (4.983%)
    16/102757014 unfound (0.000%)
    10216 active+clean
    1532 active+recovery_wait+degraded
    999 active+recovery_wait+undersized+degraded
    663 active+recovery_wait+undersized+degraded+remapped
    527 active+undersized+degraded
    226 active+undersized+degraded+remapped+wait_backfill
    149 active+recovery_wait+degraded+remapped
    49 active+remapped+wait_backfill
    6 active+degraded+remapped+wait_backfill
    4 active+remapped
    2 active+undersized+degraded+remapped+backfilling
    2 active+undersized+degraded+remapped
    1 active+clean+scrubbing+deep
    1 active+clean+scrubbing
    1 down+peering
    1 active+degraded+remapped
    1 down+remapped+peering
    recovery io 1109 kB/s, 0 objects/s
    client io 2813 MB/s rd, 49172 kB/s wr, 2446 op/s
# ceph -v
    ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
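
Since the report notes it is unclear what else would be useful: for crashes like this, the items usually requested are the cluster topology, the stuck PGs, and the full log from a crashing OSD. A sketch of how to gather them (osd.205 is taken from the log_file path in the crash dump above; substitute the actual crashing OSD ids):

    ceph health detail              # expands the HEALTH_WARN summary above
    ceph osd tree                   # shows which OSDs/hosts are down
    ceph pg dump_stuck unclean      # lists the stuck-unclean PGs
    # full log from one crashing OSD, path from the crash dump above
    tar czf osd205-log.tgz /var/log/ceph/ceph-osd.205.log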
#1

Updated by Josh Durgin about 7 years ago

  • Status changed from New to Can't reproduce

In general this kind of error comes from corrupt on-disk state. If it's still happening, re-open and we can investigate.
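
For reference, frame 10 of the backtrace is ECBackend::get_hash_info, which checks the hash metadata stored alongside an erasure-coded shard against what is actually on disk, so an assertion there is consistent with corrupt on-disk state. One way to confirm which assertion (and which object) tripped is to pull the FAILED assert line out of the OSD log, just above the "Caught signal (Aborted)" marker. A sketch, assuming the log path from the dump above:

    # the failed assert (with file:line) is logged shortly before the signal
    grep -B 20 'Caught signal (Aborted)' /var/log/ceph/ceph-osd.205.log \
        | grep -E 'FAILED assert|ECBackend'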
