Project

General

Profile

Actions

Bug #10166

closed

fiemap or FileStore::do_sparse_copy_range bug: osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))

Added by Yuri Weinstein over 9 years ago. Updated over 9 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
Category:
OSD
Target version:
-
% Done:

0%

Source:
Q/A
Tags:
Backport:
Regression:
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Logs are in http://qa-proxy.ceph.com/teuthology/teuthology-2014-11-20_17:05:01-upgrade:giant-x-next-distro-basic-vps/611381/

Assertion: osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))
ceph version 0.88-229-g47d6780 (47d6780849fe08191e5c0cf71d09f348609894aa)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x72) [0xc97142]
 2: (ReplicatedBackend::handle_pull_response(pg_shard_t, PushOp&, PullOp*, std::list<hobject_t, std::allocator<hobject_t> >*, ObjectStore::Transaction*)+0x5f0) [0x96f6b0]
 3: (ReplicatedBackend::_do_pull_response(std::tr1::shared_ptr<OpRequest>)+0x172) [0x970032]
 4: (ReplicatedBackend::do_push(std::tr1::shared_ptr<OpRequest>)+0x50) [0xb09d80]
 5: (ReplicatedBackend::handle_message(std::tr1::shared_ptr<OpRequest>)+0x1e6) [0xb07726]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x126) [0x941066]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3be) [0x7b59fe]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2af) [0x7b5e1f]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x82b) [0xc88ceb]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc8aa40]
 11: (()+0x6b50) [0x7fc7492ceb50]
 12: (clone()+0x6d) [0x7fc747ceaa7d]

Related issues 1 (0 open1 closed)

Has duplicate Ceph - Bug #10165: ceph_test_rados got short readDuplicate11/21/2014

Actions
Actions #1

Updated by Sage Weil over 9 years ago

  • Subject changed from "Assertion: osd/ReplicatedPG.cc: 8706" in upgrade:giant-x-next-distro-basic-vps run to osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))
  • Category set to OSD
  • Status changed from New to 12
  • Priority changed from Normal to Urgent

ubuntu@teuthology:/a/teuthology-2014-11-23_18:13:01-upgrade:firefly-x-giant-distro-basic-multi/615700

Actions #2

Updated by Samuel Just over 9 years ago

/a/teuthology-2014-11-23_18:13:01-upgrade:firefly-x-giant-distro-basic-multi/615700

2014-11-23 19:18:14.224698 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) write 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 2097152~858903
2014-11-23 19:18:14.225564 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) write 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 2097152~858903 = 858903
[size should be 2097152+858903=2956055, which appears to be right]
2014-11-23 19:18:14.225580 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) _omap_setkeys 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.225625 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/2a0//18 not skipping op, *spos 15102.0.1
2014-11-23 19:18:14.225691 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) setattrs 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.225695 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) setattrs 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 = 0
2014-11-23 19:18:14.225702 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) collection_add 18.19_head/2e57ac39/plana3419355-7/2a0//18 from 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.325294 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/2a0//18 setting spos to 15102.0.3
2014-11-23 19:18:14.325299 7f4d4817a700 20 filestore set_map_header: setting 661 oid 2e57ac39/plana3419355-7/2a0//18 parent seq 0
2014-11-23 19:18:14.392082 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) collection_add 18.19_head/2e57ac39/plana3419355-7/2a0//18 from 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 = 0
2014-11-23 19:18:14.392095 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) remove 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18
2014-11-23 19:18:14.392169 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/2a0//18 setting spos to 15102.0.4
2014-11-23 19:18:14.392173 7f4d4817a700 20 filestore set_map_header: setting 661 oid 2e57ac39/plana3419355-7/2a0//18 parent seq 0
2014-11-23 19:18:14.450728 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) remove 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 = 0
...
2014-11-23 19:18:14.642835 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 0~556055 to 0
2014-11-23 19:18:14.676100 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/265//18 setting spos to 15111.0.5
2014-11-23 19:18:14.676107 7f4d4817a700 20 filestore set_map_header: setting 664 oid 2e57ac39/plana3419355-7/265//18 parent seq 0
2014-11-23 19:18:14.709426 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 0~556055 to 0 = 31767
2014-11-23 19:18:14.709458 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 1040596~476694 to 1040596
2014-11-23 19:18:14.826178 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/265//18 setting spos to 15111.0.6
2014-11-23 19:18:14.826184 7f4d4817a700 20 filestore set_map_header: setting 664 oid 2e57ac39/plana3419355-7/265//18 parent seq 0
2014-11-23 19:18:14.884547 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 1040596~476694 to 1040596 = 83478
2014-11-23 19:18:14.884563 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624
2014-11-23 19:18:14.943120 7f4d4817a700 10 filestore oid: 2e57ac39/plana3419355-7/265//18 setting spos to 15111.0.7
2014-11-23 19:18:14.943127 7f4d4817a700 20 filestore set_map_header: setting 664 oid 2e57ac39/plana3419355-7/265//18 parent seq 0
2014-11-23 19:18:15.018070 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624 = 53904

from that, the line

2014-11-23 19:18:14.225564 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) write 18.19_TEMP/2e57ac39/plana3419355-7/2a0//18 2097152~858903 = 858903

indicates that size should be 2097152+858903=2956055, which appears to be right.

Later,

2014-11-23 19:18:15.018070 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624 = 53904

indicates a short read stopping at offset 2272624+53904=2326528 which is significantly less than the actual size. That appears to be the bug, no idea how it could have happened yet.

Later, we get another clone range with a short read:

2014-11-23 19:18:17.237803 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/297//18 2128446~596417 to 2128446 = 71106

2128446+71106=2199552 which is even smaller.

This occurred on a test with plana34,plana47,plana50. plana34 had osd 2 where the problem occurred. plana 34 is ubuntu 12.04.4 with kernel 3.18.0-rc6-ceph-00024-geb0e5fd.

plana 47 is 12.04.5 with kernel 3.18.0-rc6-ceph-00024-geb0e5fd

plana 50 is 12.04.4 with kernel 3.17.0-rc6-ceph-00002-ge8acad6

Actions #3

Updated by Samuel Just over 9 years ago

Going back to the full log, it appears to be related to _do_sparse_copy_range and therefore fiemap:

2014-11-23 19:18:14.884547 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 > 18.19_head/2e57ac39/plana3419355-7/265//18 1040596~476694 to 1040596 = 83478
2014-11-23 19:18:14.884563 7f4d4817a700 15 filestore(/var/lib/ceph/osd/ceph-2) clone_range 18.19_head/2e57ac39/plana3419355-7/2a0//18 -> 18.19_head/2e57ac39/plana3419355-7/265//18 2272624~54548 to 2272624
2014-11-23 19:18:14.884578 7f4d4817a700 20 filestore(/var/lib/ceph/osd/ceph-2) _do_clone_range copy 2272624~54548 to 2272624
2014-11-23 19:18:14.884582 7f4d4817a700 20 filestore(/var/lib/ceph/osd/ceph-2) _do_sparse_copy_range 2272624~54548 to 2272624
2014-11-23 19:18:14.885233 7f4d4b180700 5 osd.2 862 tick
2014-11-23 19:18:14.885262 7f4d4b180700 20 osd.2 862 scrub_random_backoff lost coin flip, randomly backing off
2014-11-23 19:18:14.885267 7f4d4b180700 10 osd.2 862 do_waiters -
start
2014-11-23 19:18:14.885269 7f4d4b180700 10 osd.2 862 do_waiters -- finish
2014-11-23 19:18:14.919282 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) _do_sparse_copy_range fm_mapped_extents=1 fe_logical=2272624 fe_length=53904
2014-11-23 19:18:14.919378 7f4d4817a700 20 filestore(/var/lib/ceph/osd/ceph-2) _do_sparse_copy_range 2272624~54548 to 2272624 = 53904
2014-11-23 19:18:14.919384 7f4d4817a700 10 filestore(/var/lib/ceph/osd/ceph-2) _set_replay_guard 15111.0.7

Probably due to teuthology commit 0f97481ce44e0487ac6cffa051a05590f907a081 on Nov 20 enabling fiemap in the nightlies.

Actions #4

Updated by Samuel Just over 9 years ago

  • Subject changed from osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0)) to fiemap or FileStore::do_sparse_copy_range bug: osd/ReplicatedPG.cc: 8706: FAILED assert((data_included.empty() && data.length() == 0) || (!data_included.empty() && data.length() > 0))
Actions #5

Updated by Samuel Just over 9 years ago

Hmm, this might be as simple as truncating out to the full copy_range size.

Actions #6

Updated by Haomai Wang over 9 years ago

  • Status changed from 12 to Fix Under Review
  • Assignee set to Haomai Wang

https://github.com/ceph/ceph/pull/3000

I'm not fully ensure that this issue is caused by inconsistence size. Maybe we need to test it again after this PR.

Actions #7

Updated by Haomai Wang over 9 years ago

  • Status changed from Fix Under Review to Resolved
Actions

Also available in: Atom PDF