Bug #8733 (closed)

OSD crashed at void ECBackend::handle_sub_read

Added by Jingjing Zhao almost 10 years ago. Updated over 9 years ago.

Status: Resolved
Priority: Urgent
Assignee: -
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport: Firefly
Regression:
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When one OSD (out of 219 total) was taken out to trigger recovery, 30 OSDs crashed after about 20 minutes. All of them crashed with the same error.

2014-07-02 07:09:48.846158 7fdb620f9700 10 osd.60 1430 dequeue_op 0x34a5ab40
prio 10 cost 0 latency 0.000216 MOSDECSubOpRead(3.e29s2 1430
ECSubRead(tid=2587872,
to_read={64280e29/default.5470.106_gw02c902.com_636c00157cb41cee8345a98979397e6c/head//3=0,
1056768,a5280e29/default.17216.315_gw01c902.com_95356b91b9f5fd455749a6e55fd1c965/head//3=0,
1056768,e6280e29/default.5722.321_osd187.ceph.com_4b7d05ee419528ad931a84c8dfa88b46/head//3=0,
1056768,77280e29/default.5470.719_gw01c902.com_ed268e9b5a26ffd759a48358d0b67018/head//3=0,
1056768,d8280e29/default.5470.508_gw04c902.com_9308381b2ca7c3c379167dcb7302e165/head//3=0,1056768},
attrs_to_read=)) v1 pg pg[3.e29s2( v 1430'24852 (1385'21848,1430'24852]
local-les=1399 n=24852 ec=315 les/c 1399/1394 1396/1398/315)
[26,44,60,203,19,151,211,90,81,128,198]/[26,44,60,2147483647,19,151,211,90,81,128,198]
r=2 lpr=1398 pi=352-1397/6 luod=0'0 crt=633'13712 active+remapped]
2014-07-02 07:09:48.846196 7fdb620f9700 10 osd.60 pg_epoch: 1430 pg[3.e29s2( v
1430'24852 (1385'21848,1430'24852] local-les=1399 n=24852 ec=315 les/c
1399/1394 1396/1398/315)
[26,44,60,203,19,151,211,90,81,128,198]/[26,44,60,2147483647,19,151,211,90,81,128,198]
r=2 lpr=1398 pi=352-1397/6 luod=0'0 crt=633'13712 active+remapped]
handle_message: MOSDECSubOpRead(3.e29s2 1430 ECSubRead(tid=2587872,
to_read={64280e29/default.5470.106_gw02c902.com_636c00157cb41cee8345a98979397e6c/head//3=0,
1056768,a5280e29/default.17216.315_gw01c902.com_95356b91b9f5fd455749a6e55fd1c965/head//3=0,
1056768,e6280e29/default.5722.321_osd187.ceph.com_4b7d05ee419528ad931a84c8dfa88b46/head//3=0,
1056768,77280e29/default.5470.719_gw01c902.com_ed268e9b5a26ffd759a48358d0b67018/head//3=0,
1056768,d8280e29/default.5470.508_gw04c902.com_9308381b2ca7c3c379167dcb7302e165/head//3=0,1056768},
attrs_to_read=)) v1
2014-07-02 07:09:48.906100 7fdb620f9700 -1 osd/ECBackend.cc: In function 'void
ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)' thread
7fdb620f9700 time 2014-07-02 07:09:48.895522
osd/ECBackend.cc: 875: FAILED assert(0)
ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
 1: (ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)+0xca6)
[0x94d7e6]
 2: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x452)
[0x95d062]
 3: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x250) [0x7eca30]
 4: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x37c) [0x60e63c]
 5: (OSD::OpWQ::_process(boost::intrusive_ptr<PG>,
ThreadPool::TPHandle&)+0x63d) [0x63e97d]
 6: (ThreadPool::WorkQueueVal<std::pair<boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest> >, boost::intrusive_ptr<PG>
>::_void_process(void*, ThreadPool::TPHandle&)+0xae) [0x67649e]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0x551) [0xa8c301]
 8: (ThreadPool::WorkThread::entry()+0x10) [0xa8f340]
 9: /lib64/libpthread.so.0() [0x3087407851]
 10: (clone()+0x6d) [0x30870e890d]

More information:
1. The pool is using EC.
2. Ceph version: 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74)
3. Restarting the OSD did not help; it crashed again later.
4. There are a lot of objects in the cluster: 320TB used, 477TB/797TB avail.


Files

part_osd.log (90.7 KB), Zhi Zhang, 07/08/2014 05:33 AM

Related issues: 1 (0 open, 1 closed)

Has duplicate: Ceph - Bug #8694: OSD crashed (assertion failure) at FileStore::_collection_move_rename (Duplicate, 06/29/2014)

Actions #1

Updated by Zhi Zhang almost 10 years ago

As described above, this crash happens when the OSD fails to read part of an object on a peer OSD: that part doesn't exist there, although it is expected to be there.

The part doesn't exist because it is still on the original OSD, which is already down and out, and it was not migrated to the new OSD during recovery.

The reason this part was not migrated to the new OSD appears to be related to the EC object's generation.

When a PG is doing recovery, the primary OSD of this PG lists its collection in chunks according to "osd_backfill_scan_min": "64" and "osd_backfill_scan_max": "512".

After it gets the object list (for example, 512 objects), it filters out the objects whose version (generation) is not "ffffffffffffffff". The last object becomes the marker for the next round of listing but won't be included in that next list. The subsequent recovery process is based on this list.

As we know, an EC object has a generation (0, 1, 2, 3, ...), and as of v0.80.1 old generations are not deleted. If an EC object has a generation other than "ffffffffffffffff" and happens to be the last entry of the object list above, it will be filtered out and become the marker for the next round of listing. So it will be included in neither the current list nor the next one, even if this EC object also has a "ffffffffffffffff" generation.

Actions #2

Updated by Zhi Zhang almost 10 years ago

To correct my last comment:

The marker for the next round of listing is not the last entry of the current object list, so we now think this may not be related to the EC object's generation.

From the latest observation, FileStore returns the object list (for example, 512 objects) to the OSD, and also returns the marker for the next round of listing.

What we expect is that the marker should be included in the next list, but actually it isn't. So it looks like every object that became a marker is missed.

Please see the log attached.

hobject 7d6499c5/default.5007.39_osd12.com_f999f8f4eff3615826e034b8f22e87b1/head//3 was a marker, but the new list starting from this hobject didn't include it. So this object was missed and never recovered on the peer OSD.
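
In other words, the resume filter should already keep the marker object itself, since it is not strictly less than itself. A rough sketch of that expected behaviour, with a hypothetical list_from helper and plain string names standing in for hobjects:

// Rough sketch of the expected resume semantics, with hypothetical names:
// entries strictly before the marker were listed in earlier rounds, while
// the marker itself and everything after it should appear in this round.
#include <cstddef>
#include <set>
#include <string>
#include <vector>

std::vector<std::string> list_from(const std::set<std::string>& collection,
                                   const std::string& marker,
                                   std::size_t max_entries) {
  std::vector<std::string> out;
  for (const std::string& obj : collection) {
    if (obj < marker)
      continue;                  // already listed in a previous round
    out.push_back(obj);          // the marker object itself is expected here
    if (out.size() == max_entries)
      break;
  }
  return out;
}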

Actions #3

Updated by Zhi Zhang almost 10 years ago

Here is the latest finding:

With erasure coding, a new struct ghobject_t was introduced to represent an object in Ceph; ghobject_t carries generation and shard info.

FileStore returns the marker for the next round of listing, but the marker is represented by struct hobject_t:

// FileStore hands the resume marker back as an hobject_t, so the
// generation and shard carried by the ghobject_t _next are dropped:
if (r == 0)
  *next = _next.hobj;

When the next round of listing is done, this marker is converted back to ghobject_t, but the generation and shard info have already been lost. Every object in the collection is then compared against this marker, and the same object as the marker but with generation info gets skipped. So it looks like every object that became a marker is missed and never migrated.

For example,

object as a marker ====> next_object:
6b5a056a/default.5007.394_osd15.com_b9191bd1f0ef4622559930c63fe9fdeb/head//3

the same object with generation ====> i->second:
6b5a056a/default.5007.394_osd15.com_b9191bd1f0ef4622559930c63fe9fdeb/head//3/18446744073709551615/0

In the following comparison, the condition "i->second < *next_object" is met, so this object is missed:

// Every listed object is compared against the (now generation-less)
// marker; the on-disk object with generation info sorts before it:
if (next_object && i->second < *next_object)
  continue;
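
To make the skip concrete, here is a minimal, self-contained sketch. It is not the actual Ceph code: GObj, marker_from_hobj and the sentinel values are simplified stand-ins for ghobject_t/hobject_t, chosen only to reproduce the effect in the example above, where the object that lost its generation/shard info through the marker round-trip sorts after its on-disk counterpart and the `<` filter skips it.

// Minimal, self-contained sketch of the marker round-trip; GObj,
// marker_from_hobj and the sentinel values are simplified stand-ins for
// Ceph's ghobject_t/hobject_t, chosen only to reproduce the effect above.
#include <cstdint>
#include <iostream>
#include <string>
#include <tuple>

constexpr uint64_t NO_GEN   = UINT64_MAX;  // "ffffffffffffffff"
constexpr uint32_t NO_SHARD = UINT32_MAX;  // stand-in for "no shard"

struct GObj {                     // simplified ghobject_t
  std::string name;               // stands in for hash/name/key/pool
  uint64_t generation = NO_GEN;
  uint32_t shard = NO_SHARD;

  bool operator<(const GObj& o) const {
    return std::tie(name, generation, shard) <
           std::tie(o.name, o.generation, o.shard);
  }
};

// The marker comes back as a plain hobject_t equivalent: only the name
// survives, generation and shard fall back to their "none" sentinels.
GObj marker_from_hobj(const GObj& g) {
  return GObj{g.name};
}

int main() {
  // Object actually present in the collection (shard 0 of an EC object):
  GObj on_disk{"6b5a056a/default.5007.394_osd15.com_"
               "b9191bd1f0ef4622559930c63fe9fdeb/head//3",
               NO_GEN, 0};

  // Marker recorded at the end of the previous listing round:
  GObj next_object = marker_from_hobj(on_disk);

  // The resume filter from the snippet above:
  if (on_disk < next_object) {
    // Taken here: shard 0 sorts before NO_SHARD in this simplified
    // ordering, so the object the marker points at is skipped and never
    // scheduled for recovery.
    std::cout << "object skipped -> missed by recovery" << std::endl;
  }
  return 0;
}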

Actions #4

Updated by Sage Weil almost 10 years ago

  • Priority changed from Normal to Urgent
  • Source changed from other to Community (user)
Actions #6

Updated by Sage Weil almost 10 years ago

  • Status changed from New to Pending Backport
Actions #7

Updated by Ian Colle over 9 years ago

  • Backport set to Firefly
Actions #8

Updated by Sage Weil over 9 years ago

  • Status changed from Pending Backport to Resolved