Bug #9978


keyvaluestore: void ECBackend::handle_sub_read

Added by Dmitry Smirnov over 9 years ago. Updated over 8 years ago.

Status: Closed
Priority: High
Assignee:
Category: OSD
Target version: -
% Done: 0%
Source: Community (user)
Tags:
Backport:
Regression:
Severity: 2 - major
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

On 0.87 "Giant" I'm repeatedly hit by the following assert, which typically crashes 4 OSDs at once:

   -10> 2014-11-01 04:48:38.513707 7fbf06370700  1 -- 192.168.0.204:6802/29215 --> 192.168.0.201:6804/15191 -- MOSDPGPush(19.11 83067 [PushOp(b2a6ced1/rbd_data.8b861374b0dc
51.0000000000012cf4/head//19, version: 67156'1718071, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 2, recovery_info: ObjectReco
veryInfo(b2a6ced1/rbd_data.8b861374b0dc51.0000000000012cf4/head//19@67156'1718071, copy_subset: [], clone_subset: {}), after_progress: ObjectRecoveryProgress(!first, data_r
ecovered_to:0, data_complete:true, omap_recovered_to:, omap_complete:true), before_progress: ObjectRecoveryProgress(first, data_recovered_to:0, data_complete:false, omap_re
covered_to:, omap_complete:false))]) v2 -- ?+0 0x7fbf35a83600 con 0x7fbf4d647b20
    -9> 2014-11-01 04:48:38.513716 7fbf06370700  5 osd.2 pg_epoch: 83067 pg[19.11( v 78404'1880414 (78135'1877414,78404'1880414] local-les=83009 n=3248 ec=44362 les/c 83009
/82202 82987/82998/82998) [6,12,11]/[2,12] r=0 lpr=82998 pi=80090-82997/49 rops=1 bft=6,11 crt=0'0 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] backf
ill_pos is 292aced1/rbd_data.5014112ae8944a.00000000000086d2/head//19
    -8> 2014-11-01 04:48:38.513768 7fbf06370700  1 -- 192.168.0.204:6802/29215 --> 192.168.0.204:6807/29400 -- pg_backfill(progress 19.11 e 83067/83067 lb 8651ced1/rbd_data
.5014112ae8944a.000000000000907f/head//19) v3 -- ?+0 0x7fbf4ed68d00 con 0x7fbf3bf9d1e0
    -7> 2014-11-01 04:48:38.513801 7fbf06370700  1 -- 192.168.0.204:6802/29215 --> 192.168.0.201:6804/15191 -- pg_backfill(progress 19.11 e 83067/83067 lb 8651ced1/rbd_data
.5014112ae8944a.000000000000907f/head//19) v3 -- ?+0 0x7fbf38c78480 con 0x7fbf4d647b20
    -6> 2014-11-01 04:48:38.604249 7fbefcb1e700  1 -- 192.168.0.204:6802/29215 <== osd.7 192.168.0.2:6818/31087 1095 ==== MOSDECSubOpRead(18.5s4 83067 ECSubRead(tid=2, to_r
ead={20cf7305/10000252fc3.00000059/head//18=0,4194304}, attrs_to_read=20cf7305/10000252fc3.00000059/head//18)) v1 ==== 199+0+0 (328060131 0 0) 0x7fbf2fd9e6c0 con 0x7fbf4da2
8160
    -5> 2014-11-01 04:48:38.604279 7fbefcb1e700  5 -- op tracker -- seq: 26245, time: 2014-11-01 04:48:38.604149, event: header_read, op: MOSDECSubOpRead(18.5s4 83067 ECSub
Read(tid=2, to_read={20cf7305/10000252fc3.00000059/head//18=0,4194304}, attrs_to_read=20cf7305/10000252fc3.00000059/head//18))
    -4> 2014-11-01 04:48:38.604294 7fbefcb1e700  5 -- op tracker -- seq: 26245, time: 2014-11-01 04:48:38.604153, event: throttled, op: MOSDECSubOpRead(18.5s4 83067 ECSubRe
ad(tid=2, to_read={20cf7305/10000252fc3.00000059/head//18=0,4194304}, attrs_to_read=20cf7305/10000252fc3.00000059/head//18))
    -3> 2014-11-01 04:48:38.604302 7fbefcb1e700  5 -- op tracker -- seq: 26245, time: 2014-11-01 04:48:38.604244, event: all_read, op: MOSDECSubOpRead(18.5s4 83067 ECSubRea
d(tid=2, to_read={20cf7305/10000252fc3.00000059/head//18=0,4194304}, attrs_to_read=20cf7305/10000252fc3.00000059/head//18))
    -2> 2014-11-01 04:48:38.604308 7fbefcb1e700  5 -- op tracker -- seq: 26245, time: 0.000000, event: dispatched, op: MOSDECSubOpRead(18.5s4 83067 ECSubRead(tid=2, to_read
={20cf7305/10000252fc3.00000059/head//18=0,4194304}, attrs_to_read=20cf7305/10000252fc3.00000059/head//18))
    -1> 2014-11-01 04:48:38.604357 7fbf08b75700  5 -- op tracker -- seq: 26245, time: 2014-11-01 04:48:38.604356, event: reached_pg, op: MOSDECSubOpRead(18.5s4 83067 ECSubR
ead(tid=2, to_read={20cf7305/10000252fc3.00000059/head//18=0,4194304}, attrs_to_read=20cf7305/10000252fc3.00000059/head//18))
     0> 2014-11-01 04:48:38.605682 7fbf08b75700 -1 osd/ECBackend.cc: In function 'void ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)' thread 7fbf08b75700 time 2014-11-01 04:48:38.604443
osd/ECBackend.cc: 876: FAILED assert(0)

 ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7fbf2b7e1122]
 2: (ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)+0x649) [0x7fbf2b62ece9]
 3: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4b5) [0x7fbf2b637b55]
 4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x36e) [0x7fbf2b42116e]
 5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x408) [0x7fbf2b279ad8]
 6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x350) [0x7fbf2b27a070]
 7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x848) [0x7fbf2b7d06a8]
 8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fbf2b7d2b00]
 9: (()+0x80a4) [0x7fbf29e4b0a4]
 10: (clone()+0x6d) [0x7fbf283a5cbd]

Files

ceph-osd.1.log.xz (125 KB) ceph-osd.1.log.xz Dmitry Smirnov, 11/05/2014 06:32 PM
ceph-osd.5.log.xz (145 KB) ceph-osd.5.log.xz Dmitry Smirnov, 11/06/2014 12:32 AM
ceph-osd.2.log.xz (265 KB) ceph-osd.2.log.xz Dmitry Smirnov, 11/18/2014 11:31 PM
ceph-osd.14.log.xz (259 KB) ceph-osd.14.log.xz Dmitry Smirnov, 11/18/2014 11:31 PM

Related issues 1 (0 open, 1 closed)

Is duplicate of Ceph - Bug #9727: 0.86 EC+ KV OSDs crashing (Duplicate, Haomai Wang, 10/10/2014)

Actions #1

Updated by Dmitry Smirnov over 9 years ago

  • Description updated (diff)
Actions #2

Updated by Dmitry Smirnov over 9 years ago

Any news on this please? I can barely use my cluster since upgrading to Giant -- OSDs are crashing during backfill all over the place -- everything is very slow and I'm already tired of restarting OSDs...
Please help.

Actions #3

Updated by Samuel Just over 9 years ago

Are you saying these osds are using the KV backend?

Actions #4

Updated by Samuel Just over 9 years ago

Because the bug you say this duplicates is about KV osds.

Actions #5

Updated by Dmitry Smirnov over 9 years ago

Sorry for the confusion -- I have the impression that it may not be related to the store type.
Attaching a more detailed log...

Actions #6

Updated by Dmitry Smirnov over 9 years ago

Here is another log from a filestore-based OSD that just crashed.

Actions #7

Updated by Dmitry Smirnov over 9 years ago

  • Assignee set to Haomai Wang
Actions #8

Updated by Haomai Wang over 9 years ago

  • Status changed from New to Duplicate
Actions #9

Updated by Haomai Wang over 9 years ago

  • Status changed from Duplicate to New

Sorry, my mistake.

Actions #10

Updated by Dmitry Smirnov over 9 years ago

Please let me know if you need more logs.
Due to this error, all OSDs (except a few) in my cluster are down or crash within minutes of starting, even before peering is completed...
Now, without access to the data, I have little left to do other than gather logs...
This happens when only filestore-based OSDs are up, as well as when both filestore and keyvaluestore OSDs are up.
The type of OSD certainly doesn't matter...

Actions #11

Updated by Samuel Just over 9 years ago

Type of osd probably does matter if the KV osds are distributing faulty information.

Actions #12

Updated by Dmitry Smirnov over 9 years ago

Samuel Just wrote:

Type of osd probably does matter if the KV osds are distributing faulty information.

Earlier I stopped all KV OSDs and still had this issue...

Actions #13

Updated by Haomai Wang over 9 years ago

Hmm, thanks.

I'm not sure who is working on this or has dived into it; I need more free time to diagnose it. EC+KeyValueStore is a good match.

And are there any other complaints or reports about this assertion without KeyValueStore involved?

Actions #14

Updated by Dmitry Smirnov over 9 years ago

Haomai Wang wrote:

EC+KeyValueStore is a good match.

Ironically that was the worst thing I ever tried in Ceph... The last week was miserable -- I did nothing but try to recover my cluster, but no matter what I do I can't get enough OSDs up to bring up the EC pools... What happens is that eventually a KV OSD poisons newly created filestore OSDs during backfill, even if all KV OSDs are out... The only way I can bring a poisoned OSD up is to delete the folder with the PG causing the crash... Anyway, it seems that I've lost (m) OSDs and my EC pools are doomed... It feels very bad to lose data on Ceph for the third time, even after using it for over 10 months. It feels so fragile... I'm beginning to consider giving up on it... :(

And are there any other complaints or reports about this assertion without KeyValueStore involved?

See above, except for the rant... It seems that the problem originates on a KV OSD when some bad EC replica migrates from it. The problem then persists even when the KV OSD is "out" and "down"...

Actions #15

Updated by Haomai Wang over 9 years ago

It's a pity to see this.

Thanks, Dmitry Smirnov -- could you give a detailed summary of what you did and your suggestions for EC+KV? And what do you guess the problem is?

I will dive into it in the near future.

Actions #16

Updated by Dmitry Smirnov over 9 years ago

Thanks, and sorry for the lack of details. I had been curious about KV-based OSDs for a long time, and after upgrading to 0.87 I decided to try introducing a new KV OSD. Initially it was added with primary-affinity=0, and it worked like this for a week with no problems whatsoever. Some in/out tests were run, as well as (deep-)scrubs etc. Then I took testing a bit further and tried primary-affinity=1 for some time, and again found that the new OSD was stable and responding properly. Then a few more KV OSDs were added, and I let them fill up to 1 TiB each. The problem hit accidentally, and at catastrophic scale, probably after a reboot of a host with filestore-based OSDs: one OSD crashed as described, followed by the crash of another OSD, and so on, so fast that the cascade of crashes seemed almost simultaneous. The OSD crash happens early after boot, once peering is over, so no replicas are sent. Then, like a bushfire, it spreads to the next OSD where another replica of the EC pool is located, and so on. I use the following erasure profile on two pools:

k=2
m=3
plugin=jerasure
ruleset-failure-domain=host
technique=reed_sol_van

I've noticed that typically four or five OSDs crash, no matter whether KV- or filestore-based, but always enough to take down the whole pool.
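
For reference, a profile equivalent to the one above would normally be created and attached to a pool with commands along these lines (a minimal sketch only; the profile name, pool name and PG count are illustrative, and on 0.87 the failure-domain key is spelled `ruleset-failure-domain`, as in the profile):

    # Sketch: recreate an equivalent k=2, m=3 jerasure profile and an EC pool using it.
    ceph osd erasure-code-profile set ec-k2-m3 \
        k=2 m=3 plugin=jerasure technique=reed_sol_van \
        ruleset-failure-domain=host
    ceph osd erasure-code-profile get ec-k2-m3            # verify the profile contents
    ceph osd pool create ecpool 128 128 erasure ec-k2-m3  # pool name and PG count illustrative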

As for suggestions, I'd recommend more testing, but I also recognise what looks like a fundamental flaw in Ceph development. OSD stability issues are not new, and I have experienced them first hand ever since 0.72. Bugs happen, that's fine, but plugging untested code paths with asserts is wrong and ugly. OSDs must be stable, and when something goes wrong they should log the problem loudly enough, stop operating on the problematic PG and possibly mark it "down", while keeping the OSD up and running rather than committing suicide. If in doubt, make this behaviour configurable. What happened to me is not (just) because of the bug but more due to a systematic error in OSD development -- this is not how mission-critical software should work.

Actions #17

Updated by Dmitry Smirnov over 9 years ago

Do you need any additional "debug_keyvaluestore=20/20" logs? It's been another week... Is there any progress? Any hope of reviving my cluster?

Actions #18

Updated by Haomai Wang over 9 years ago

Yep, if you are free you can paste crash logs with debug_keyvaluestore=20/20.

Actions #19

Updated by Dmitry Smirnov over 9 years ago

Here are two new logs -- only filestore OSDs are up, all KV OSDs are down.

Actions #20

Updated by Dmitry Smirnov over 9 years ago

It's been another week -- is there any chance of getting this fixed, please?

Actions #21

Updated by Haomai Wang over 9 years ago

Hi Dmitry,

A related bug has been fixed, but I'm not fully sure whether it fixes this problem. So could you provide a keyvaluestore crash log with debug_keyvaluestore=20/20?

Actions #22

Updated by Dmitry Smirnov over 9 years ago

Haomai Wang wrote:

A related bug has been fixed, but I'm not fully sure whether it fixes this problem.

I'm not sure which bug you have in mind. The fix for #10119 did not help my problem at all. :(

So could you provide a keyvaluestore crash log with debug_keyvaluestore=20/20?

Unfortunately after weeks of downtime no KV OSDs are starting any more:

read meta/23c2fcde/osd_superblock/0//-1 0~0
lookup_strip_header failed to get strip_header  cid meta oid 23c2fcde/osd_superblock/0//-1
read meta/23c2fcde/osd_superblock/0//-1 0~0 header isn't exist: r = -2
OSD::init() : unable to read osd superblock
 ** ERROR: osd init failed: (22) Invalid argument

Actions #23

Updated by Sage Weil over 9 years ago

  • Subject changed from void ECBackend::handle_sub_read to keyvaluestore: void ECBackend::handle_sub_read
Actions #24

Updated by Dmitry Smirnov over 9 years ago

I reported the KV OSD init problem as #10225.

Actions #25

Updated by Dmitry Smirnov over 9 years ago

Can this bug get a little attention please? It has a profound effect on my crippled cluster, and I'm talking about filestore OSDs, not KV ones...

Actions #26

Updated by Sage Weil over 9 years ago

Dmitry Smirnov wrote:

Can this bug get a little attention please? It has a profound effect on my crippled cluster, and I'm talking about filestore OSDs, not KV ones...

This looks like fallout from a KV OSD elsewhere in the cluster. This assert happens when a replica/shard doesn't have the object it's supposed to. You might be able to recover by shutting down the KV OSDs.

Actions #27

Updated by Dmitry Smirnov over 9 years ago

No, I was not able to recover by stopping all KV OSDs. It did not help. I don't know why you guys do not understand that the problem keeps happening even without KV OSDs. Here is what I did:

  • Disabled backfilling completely (`osd max backfills = 0` in `/etc/ceph/ceph.conf`) to avoid other asserts like `OnRecoveryReadComplete::finish` / `osd/ECBackend.cc: 198: FAILED assert(res.r == 0)`.
  • Ran the following loop to keep OSDs up: `while sleep 66; do /etc/init.d/ceph start osd ; done`.
  • Marked especially unstable OSDs as "out" (`ceph osd out NN`).
  • Copied all data from the degraded EC pool. (Another EC pool did not survive due to an unfound object, despite redundancy that was supposed to be enough to tolerate the failure of 3 hosts.)
    Not deleting files from the degraded EC pools helped keep OSDs alive. It was possible to use replicated pool(s), as this problem affects only EC pools. Copying the data took about a week.
  • Dropped all EC pools.
  • Decommissioned Ceph in favour of a different storage solution (LizardFS, unless a better alternative is found within a week).

It was possible to read data from the degraded EC pools, but backfill was causing OSD crashes within seconds. I've just finished recovering the data, so I can't tell how much corruption there is. I expect some. So far some files seem to have been recovered successfully, while at least one ~350 GiB file failed its integrity check...
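
In command form, the mitigation in the list above boils down to roughly the following (a sketch only; `NN` is an OSD id placeholder, and the runtime `injectargs` line is an assumption -- an alternative to editing `ceph.conf` and restarting, not necessarily what was done here):

    # Disable backfill entirely, either via "osd max backfills = 0" under [osd] in
    # /etc/ceph/ceph.conf, or injected at runtime (assumption):
    ceph tell osd.\* injectargs '--osd_max_backfills 0'
    # Keep restarting OSDs that keep asserting:
    while sleep 66; do /etc/init.d/ceph start osd ; done
    # Take especially unstable OSDs out of data placement:
    ceph osd out NN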

Actions #28

Updated by Sage Weil over 9 years ago

The OSDs share all kinds of information with each other, and that information is generally trusted. That means that a misbehaving OSD that shares bad metadata with its peers can break the rest of the cluster. The fact that all KV OSDs are now stopped and/or removed from the cluster doesn't change the fact that, while they were present, they may have 'poisoned' some PGs with incorrect metadata.

I'm really really sorry that it wasn't clear that keyvaluestore-dev was an experimental and untested backend that should not be trusted with important data. I had thought the '-dev' suffix would be sufficient to scare people off but I was mistaken. I'm afraid this has unduly soured your ceph experience. Unfortunately we have enough to do that we can't sink a lot of developer time into repairing corruptions caused by unsupported code. I'm really sorry. :(

If you have a few minutes, I would appreciate it if you reviewed this thread about how to more aggressively flag experimental features so that others do not repeat your experience:

http://marc.info/?t=141780100300006&r=1&w=2

In particular, it would be helpful to know if the third option (requiring a config option like 'enable unrecoverable data corrupting features = keyvaluestore' in order to use an untested feature) sounds like a good one to you.

Thanks, Dmitry!

Actions #29

Updated by Dmitry Smirnov over 9 years ago

Dear Sage,
Thanks for your kind words, but please do not draw the wrong conclusion from this incident:
the problem was not that I did not realise that the KV backend is experimental. I take full responsibility for trying it, and I knowingly considered the risk of evaluating KV OSDs.

The real problem is that such a thing as OSD "poisoning" is possible at all. I have already experienced similar problems before, like a cascade of "domino effect" OSD failures spreading like a bushfire. I tried the KV backend only because I was super unhappy with the fragile behaviour of filestore OSDs, and it was naive to expect KV to be any better.

This was not a bad experience compared to other incidents, and I would have given up on Ceph anyway -- this was just the very last chance I gave it before walking away... Merely a problem in an experimental KV backend wouldn't be a reason to give up. Unfortunately I see a bigger picture in which Ceph is full of design flaws that make it unfit for its purpose in my eyes... That is my conclusion. Sorry.

Actions #30

Updated by Samuel Just over 9 years ago

  • Status changed from New to Closed
Actions #31

Updated by xingyi wu over 8 years ago

Unfortunately, we encountered this bug recently. I found a workaround and hope that anyone who comes to this page in desperation can try it.
We use rgw+EC to store big files; the cluster is about 3 PB. We encountered this bug when adding new nodes to the cluster: one PG went down, and three OSDs kept core dumping while backfilling from the old OSD set to the new OSD set, because handle_sub_read hit an assertion.
The method is quite simple: we stopped backfilling with "ceph osd set nobackfill". After that no backfill happens, and so no OSD core dumps and no data loss.
Then we found the reason for the core dump: a suspicious object that could not be found in the local file system. We decided to back up the object and then delete it; after that we started backfilling again, and the backfill procedure ran happily.
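
In command form, the workaround described above is roughly the following (a sketch; the object backup/removal step is left as a comment because the report does not say which tool was used -- using `ceph-objectstore-tool` against the stopped OSD is an assumption):

    ceph osd set nobackfill      # stop all backfill so the affected OSDs stop asserting
    # Locate the suspicious object on the shard that cannot find it, back it up,
    # then remove it (e.g., assumption: with ceph-objectstore-tool on the stopped OSD).
    ceph osd unset nobackfill    # allow backfill to resume once the bad object is gone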
