Bug #10059
osd/ECBackend.cc: 876: FAILED assert(0)
Status: Closed
Description
-1> 2014-11-09 14:13:01.334410 7f8b93c8b700 10 filestore(/var/lib/ceph/osd/ceph-3) FileStore::read(1.1ds0_head/78348bbd/benchmark_data_plana76_6098_object6568/head//1/18446744073709551615/0) open error: (2) No such file or directory
0> 2014-11-09 14:13:01.337431 7f8b93c8b700 -1 osd/ECBackend.cc: In function 'void ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)' thread 7f8b93c8b700 time 2014-11-09 14:13:01.334418
osd/ECBackend.cc: 876: FAILED assert(0)
ceph version 0.87-573-g6977d02 (6977d02f0d31c453cdf554a8f1796f290c1a3b89)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7f) [0xaa870f]
2: (ECBackend::handle_sub_read(pg_shard_t, ECSubRead&, ECSubReadReply*)+0x932) [0x949782]
3: (ECBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x4b1) [0x957461]
4: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x15a) [0x7d571a]
5: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x17f) [0x656f9f]
6: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x65f) [0x6579ff]
7: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x652) [0xa99352]
8: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa9aa80]
9: (()+0x7e9a) [0x7f8baf2a6e9a]
-5> 2014-11-09 14:13:01.334302 7f8b93c8b700 5 - op tracker -- seq: 65517, time: 2014-11-09 14:13:01.334302, event: reached_pg, op: MOSDECSubOpRead(1.1ds0 103 ECSubRead(tid=12070, to_read={78348bbd/benchmark_data_plana76_6098_object6568/head//1=0,524288}, attrs_to_read=78348bbd/benchmark_data_plana76_6098_object6568/head//1))
-4> 2014-11-09 14:13:01.334313 7f8b93c8b700 10 osd.3 pg_epoch: 103 pg[1.1ds0( v 92'740 (0'0,92'740] local-les=92 n=183 ec=9 les/c 92/92 93/98/98) [1,3,0] r=-1 lpr=98 pi=91-97/3 luod=0'0 crt=92'737 lcod 92'737 active] handle_message: MOSDECSubOpRead(1.1ds0 103 ECSubRead(tid=12070, to_read={78348bbd/benchmark_data_plana76_6098_object6568/head//1=0,524288}, attrs_to_read=78348bbd/benchmark_data_plana76_6098_object6568/head//1)) v1
-3> 2014-11-09 14:13:01.334333 7f8b93c8b700 15 filestore(/var/lib/ceph/osd/ceph-3) read 1.1ds0_head/78348bbd/benchmark_data_plana76_6098_object6568/head//1/18446744073709551615/0 0~524288
-2> 2014-11-09 14:13:01.334397 7f8b93c8b700 10 filestore(/var/lib/ceph/osd/ceph-3) error opening file /var/lib/ceph/osd/ceph-3/current/1.1ds0_head/DIR_D/DIR_B/benchmark\udata\uplana76\u6098\uobject6568__head_78348BBD__1_ffffffffffffffff_0 with flags=2: (2) No such file or directory
2014-11-09 14:12:33.028797 7f8b93c8b700 20 osd.3 pg_epoch: 92 pg[1.1ds0( v 92'740 (0'0,92'740] local-les=92 n=183 ec=9 les/c 92/92 90/91/91) [2147483647,4,5]/[3,4,5] r=0 lpr=91 luod=92'738 crt=92'737 lcod 92'737 mlcod 92'737 active+remapped] check_op tid 12344: Op(78348bbd/benchmark_data_plana76_6098_object6568/head//1 v=92'739 tt=0'0 tid=12344 reqid=client.4115.0:19498 client_op=osd_op(client.4115.0:19498 benchmark_data_plana76_6098_object6568 [delete] 1.78348bbd ack+ondisk+write+known_if_redirected e92) pending_commit=5(2) pending_apply=5(2))
...
2014-11-09 14:12:57.902407 7f8b96c91700 20 merge_log 92'735 (83'362) delete 4451151d/benchmark_data_plana76_6098_object6501/head//1 by client.4115.0:19431 2014-11-09 14:12:23.055232
2014-11-09 14:12:57.902415 7f8b96c91700 20 merge_log 92'736 (83'363) delete d91728bd/benchmark_data_plana76_6098_object6505/head//1 by client.4115.0:19435 2014-11-09 14:12:23.447074
2014-11-09 14:12:57.902422 7f8b96c91700 20 merge_log 92'737 (83'364) delete e9a88cbd/benchmark_data_plana76_6098_object6515/head//1 by client.4115.0:19445 2014-11-09 14:12:24.123917
2014-11-09 14:12:57.902430 7f8b96c91700 20 merge_log 92'738 (83'365) delete a1d82bbd/benchmark_data_plana76_6098_object6521/head//1 by client.4115.0:19451 2014-11-09 14:12:24.421805
2014-11-09 14:12:57.902751 7f8b96c91700 10 merge_log result log((0'0,92'738], crt=92'738) missing(185) changed=1
2014-11-09 14:12:57.903244 7f8b96c91700 5 osd.3 pg_epoch: 100 pg[1.1ds1( v 92'738 lc 0'0 (0'0,92'738] local-les=0 n=185 ec=9 les/c 92/92 93/98/98) [1,3,0] r=1 lpr=100 pi=91-97/3 crt=92'738 inactive m=185] exit Started/Stray 0.011615 1 0.000080
So the object was deleted, but that log entry ended up divergent. A different shard (shard 1) was then created on the OSD, and the primary did not realize that this shard did not have a copy of the object.
ubuntu@teuthology:/a/sage-2014-11-09_07:49:57-rados-next-testing-basic-multi/592008/remote
Updated by Dmitry Smirnov over 9 years ago
This bug makes me cry, as it is the reason my cluster has been completely down for over 10 days now... Duplicate added...
Updated by Samuel Just over 9 years ago
This is almost certainly unrelated to those two bugs. This is a specific edge case in divergent write recovery.
Updated by Dmitry Smirnov over 9 years ago
Updated by Samuel Just over 9 years ago
- Status changed from 7 to Pending Backport
- Backport set to giant,firefly
Updated by Dmitry Smirnov over 9 years ago
I applied c87bde64 on top of 0.89 on a degraded cluster, but it caused many PGs to get stuck in the "inactive" state, so I had to revert it...
Updated by Loïc Dachary about 9 years ago
- firefly backport https://github.com/ceph/ceph/pull/3955
Updated by Loïc Dachary about 9 years ago
- original pull request https://github.com/ceph/ceph/pull/3154
Updated by Loïc Dachary about 9 years ago
- giant backport https://github.com/ceph/ceph/pull/4090
Updated by Loïc Dachary about 9 years ago
Updated by Loïc Dachary about 9 years ago
- Status changed from Pending Backport to Resolved