Bug #10166
closedfiemap or FileStore::do_sparse_copy_range bug: osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))
0%
Description
Assertion: osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0)) ceph version 0.88-229-g47d6780 (47d6780849fe08191e5c0cf71d09f348609894aa) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xc97142] 2: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x5f0) [0x96f6b0] 3: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x172) [0x970032] 4: (ReplicatedBackend::do_push(std::tr1::shared_ptr<OpRequest>)+0x50) [0xb09d80] 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1e6) [0xb07726] 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x126) [0x941066] 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3be) [0x7b59fe] 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2af) [0x7b5e1f] 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x82b) [0xc88ceb] 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc8aa40] 11: (()+0x6b50) [0x7fc7492ceb50] 12: (clone()+0x6d) [0x7fc747ceaa7d]
Updated by Sage Weil over 9 years ago
- Subject changed from "Assertion: osd/ReplicatedPG.cc: 8706" in upgrade:giant-x-next-distro-basic-vps run to osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))
- Category set to OSD
- Status changed from New to 12
- Priority changed from Normal to Urgent
ubuntu@teuthology:/a/teuthology-2014-11-23_18:13:01-upgrade:firefly-x-giant-distro-basic-multi/615700
Updated by Samuel Just over 9 years ago
/a/teuthology-2014-11-23_18:13:01-upgrade:firefly-x-giant-distro-basic-multi/615700
2014-11-23 19:18:14.224698 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) write 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 2097152~858903
2014-11-23 19:18:14.225564 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) write 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 2097152~858903 = 858903
[size should be 2097152+858903=2956055, which appears to be right]
2014-11-23 19:18:14.225580 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) _omap_setkeys 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.225625 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/2a0//18 not skipping op, *spos 15102.0.1
2014-11-23 19:18:14.225691 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) setattrs 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.225695 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) setattrs 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 = 0
2014-11-23 19:18:14.225702 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) collection_add 18.19_head/2e57ac39/plana3419355-7/2a0//18 from 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.325294 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/2a0//18 setting spos to 15102.0.3
2014-11-23 19:18:14.325299 7f4d4817a700 20 filestore set_map_header: setting 661 oid 2e57ac39/plana3419355-7/2a0//18 parent seq 0
2014-11-23 19:18:14.392082 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) collection_add 18.19_head/2e57ac39/plana3419355-7/2a0//18 from 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 = 0
2014-11-23 19:18:14.392095 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) remove 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.392169 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/2a0//18 setting spos to 15102.0.4
2014-11-23 19:18:14.392173 7f4d4817a700 20 filestore set_map_header: setting 661 oid 2e57ac39/plana3419355-7/2a0//18 parent seq 0
2014-11-23 19:18:14.450728 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) remove 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 = 0
...
2014-11-23 19:18:14.642835 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 0~556055 to 0
2014-11-23 19:18:14.676100 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/265//18 setting spos to 15111.0.5
2014-11-23 19:18:14.676107 7f4d4817a700 20 filestore set_map_header: setting 664 oid 2e57ac39/plana3419355-7/265//18 parent seq 0
2014-11-23 19:18:14.709426 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 0~556055 to 0 = 31767
2014-11-23 19:18:14.709458 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 1040596~476694 to 1040596
2014-11-23 19:18:14.826178 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/265//18 setting spos to 15111.0.6
2014-11-23 19:18:14.826184 7f4d4817a700 20 filestore set_map_header: setting 664 oid 2e57ac39/plana3419355-7/265//18 parent seq 0
2014-11-23 19:18:14.884547 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 1040596~476694 to 1040596 = 83478
2014-11-23 19:18:14.884563 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624
2014-11-23 19:18:14.943120 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/265//18 setting spos to 15111.0.7
2014-11-23 19:18:14.943127 7f4d4817a700 20 filestore set_map_header: setting 664 oid 2e57ac39/plana3419355-7/265//18 parent seq 0
2014-11-23 19:18:15.018070 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624 = 53904
from that, the line
2014-11-23 19:18:14.225564 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) write 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 2097152~858903 = 858903
indicates that size should be 2097152+858903=2956055, which appears to be right.
Later,
2014-11-23 19:18:15.018070 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624 = 53904
indicates a short read stopping at offset 2272624+53904=2326528 which is significantly less than the actual size. That appears to be the bug, no idea how it could have happened yet.
Later, we get another clone range with a short read:
2014-11-23 19:18:17.237803 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/297//18 2128446~596417 to 2128446 = 71106
2128446+71106=2199552 which is even smaller.
This occurred on a test with plana34,plana47,plana50. plana34 had osd 2 where the problem occurred. plana 34 is ubuntu 12.04.4 with kernel 3.18.0-rc6-ceph-00024-geb0e5fd.
plana 47 is 12.04.5 with kernel 3.18.0-rc6-ceph-00024-geb0e5fd
plana 50 is 12.04.4 with kernel 3.17.0-rc6-ceph-00002-ge8acad6
Updated by Samuel Just over 9 years ago
Going back to the full log, it appears to be related to _do_sparse_copy_range and therefore fiemap:
2014-11-23 19:18:14.884547 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 > 18.19_head/2e57ac39/plana3419355-7/265//18 1040596~476694 to 1040596 = 83478 start
2014-11-23 19:18:14.884563 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624
2014-11-23 19:18:14.884578 7f4d4817a700 20 filestore(/var/lib/ceph/osd/ceph-2) _do_clone_range copy 2272624~54548 to 2272624
2014-11-23 19:18:14.884582 7f4d4817a700 20 filestore(/var/lib/ceph/osd/ceph-2) _do_sparse_copy_range 2272624~54548 to 2272624
2014-11-23 19:18:14.885233 7f4d4b180700 5 osd.2 862 tick
2014-11-23 19:18:14.885262 7f4d4b180700 20 osd.2 862 scrub_random_backoff lost coin flip, randomly backing off
2014-11-23 19:18:14.885267 7f4d4b180700 10 osd.2 862 do_waiters -
2014-11-23 19:18:14.885269 7f4d4b180700 10 osd.2 862 do_waiters -- finish
2014-11-23 19:18:14.919282 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) _do_sparse_copy_range fm_mapped_extents=1 fe_logical=2272624 fe_length=53904
2014-11-23 19:18:14.919378 7f4d4817a700 20 filestore(/var/lib/ceph/osd/ceph-2) _do_sparse_copy_range 2272624~54548 to 2272624 = 53904
2014-11-23 19:18:14.919384 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) _set_replay_guard 15111.0.7
Probably due to teuthology commit 0f97481ce44e0487ac6cffa051a05590f907a081 on Nov 20 enabling fiemap in the nightlies.
Updated by Samuel Just over 9 years ago
- Subject changed from osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0)) to fiemap or FileStore::do_sparse_copy_range bug: osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))
Updated by Samuel Just over 9 years ago
Hmm, this might be as simple as truncating out to the full copy_range size.
Updated by Haomai Wang over 9 years ago
- Status changed from 12 to Fix Under Review
- Assignee set to Haomai Wang
https://github.com/ceph/ceph/pull/3000
I'm not fully ensure that this issue is caused by inconsistence size. Maybe we need to test it again after this PR.
Updated by Haomai Wang over 9 years ago
- Status changed from Fix Under Review to Resolved