Bug #41923 (closed)

3 different ceph-osd asserts caused by enabling auto-scaler

Added by David Zafman over 4 years ago. Updated about 4 years ago.

Status: Can't reproduce
Priority: Normal
Assignee:
Category: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Changed the config option osd_pool_default_pg_autoscale_mode to "on".
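The report doesn't say how the option was changed; for reference, a typical way to enable it globally (an assumed reproduction step, not taken from the report) is the mon config database, or ceph.conf before the daemons start:

    # Assumption: enable the autoscaler default for new pools via the config database
    ceph config set global osd_pool_default_pg_autoscale_mode on

    # ...or equivalently in the [global] section of ceph.conf:
    #     osd_pool_default_pg_autoscale_mode = on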

Saw these 4 core dumps in 3 different sub-tests.

../qa/run-standalone.sh "osd-scrub-repair.sh TEST_XXXXXXXX"
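For example, the first failing sub-test can presumably be reproduced on its own with:

    ../qa/run-standalone.sh "osd-scrub-repair.sh TEST_corrupt_scrub_erasure_overwrites"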

TEST_corrupt_scrub_erasure_overwrites
/home/dzafman/ceph/src/osd/ECBackend.cc: 641: FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
/home/dzafman/ceph/src/osd/ECBackend.cc: 583: FAILED ceph_assert(op.hinfo)

TEST_repair_stats 
   -12> 2019-09-18T14:13:30.560-0700 7fd515db1700  1 -- 127.0.0.1:0/19306 <== osd.0 v2:127.0.0.1:6816/19643 43 ==== osd_ping(ping_reply e26 up_from 23 ping_stamp 2019-09-18T14:13:30.563990-0700/166.699510694s send_stamp 161.692563315s delta_ub -5.006947379s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55c3725fc400 con 0x55c36fbc3180
   -11> 2019-09-18T14:13:30.560-0700 7fd515db1700 20 osd.1 26 handle_osd_ping new stamps hbstamp(osd.0 up_from 23 peer_clock_delta [-5.007942020s,-5.006947379s])
   -10> 2019-09-18T14:13:30.796-0700 7fd50b357700  5 prioritycache tune_memory target: 4294967296 mapped: 57991168 unmapped: 81920 heap: 58073088 old mem: 2845415832 new mem: 2845415832
    -9> 2019-09-18T14:13:31.084-0700 7fd512571700 10 osd.1 26 tick
    -8> 2019-09-18T14:13:31.084-0700 7fd512571700 10 osd.1 26 do_waiters -- start
    -7> 2019-09-18T14:13:31.084-0700 7fd512571700 10 osd.1 26 do_waiters -- finish
    -6> 2019-09-18T14:13:31.084-0700 7fd512571700 20 osd.1 26 tick last_purged_snaps_scrub 2019-09-18T14:10:08.873595-0700 next 2019-09-19T14:10:08.873595-0700
    -5> 2019-09-18T14:13:31.811-0700 7fd50b357700  5 prioritycache tune_memory target: 4294967296 mapped: 57991168 unmapped: 81920 heap: 58073088 old mem: 2845415832 new mem: 2845415832
    -4> 2019-09-18T14:13:31.875-0700 7fd515db1700  1 -- [v2:127.0.0.1:6808/19306,v1:127.0.0.1:6809/19306] <== osd.0 127.0.0.1:0/19643 57 ==== osd_ping(ping e26 up_from 23 ping_stamp 2019-09-18T14:13:31.880600-0700/163.008322854s send_stamp 163.008322854s delta_ub -5.006947379s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55c3725fc400 con 0x55c3724d6d80
    -3> 2019-09-18T14:13:31.875-0700 7fd515db1700 20 osd.1 26 handle_osd_ping new stamps hbstamp(osd.0 up_from 23 peer_clock_delta [-5.008228232s,-5.006947379s])
    -2> 2019-09-18T14:13:31.875-0700 7fd515db1700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd4fa541700' had timed out after 15
    -1> 2019-09-18T14:13:31.875-0700 7fd515db1700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd4fa541700' had suicide timed out after 150
             0> 2019-09-18T14:13:31.899-0700 7fd4fa541700 -1 *** Caught signal (Aborted) **
 in thread 7fd4fa541700 thread_name:tp_osd_tp

 ceph version v15.0.0-5169-g0d5f330188 (0d5f33018877851db13181511d4868396079a5b9) octopus (dev)
 1: (()+0x2e29f48) [0x55c363fb6f48]
 2: (()+0x12890) [0x7fd519e93890]
 3: (pthread_cond_wait()+0x243) [0x7fd519e8e9f3]
 4: (ceph::condition_variable_debug::wait(std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >&)+0xab) [0x55c3640ab3a7]
 5: (BlueStore::OpSequencer::drain_preceding(BlueStore::TransContext*)+0x6b) [0x55c363e47be3]
 6: (BlueStore::_osr_drain_preceding(BlueStore::TransContext*)+0x268) [0x55c363e0b0a8]
 7: (BlueStore::_split_collection(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Collection>&, unsigned int, int)+0x24b) [0x55c363e2f7a5]
 8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x54d) [0x55c363e15277]
 9: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e8) [0x55c363e14204]
 10: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x96) [0x55c363628000]
 11: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x806) [0x55c3635ffc1e]
 12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x380) [0x55c3636057ee]
 13: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55c363aaa28b]
 14: (OpQueueItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4b) [0x55c36363211d]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x3631) [0x55c363612cf1]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x59c) [0x55c36405b882]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x25) [0x55c36405d25f]
 18: (Thread::entry_wrapper()+0x78) [0x55c364047ec6]
 19: (Thread::_entry_func(void*)+0x18) [0x55c364047e44]
 20: (()+0x76db) [0x7fd519e886db]
 21: (clone()+0x3f) [0x7fd518bd188f]
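The two heartbeat_map lines just before the abort show the OSD::osd_op_tp thread exceeding its 15 s heartbeat grace and then its 150 s suicide grace while apparently blocked in BlueStore::_osr_drain_preceding during the collection split, so this crash is a suicide timeout rather than an explicit assert. Those thresholds correspond to the osd_op_thread_timeout and osd_op_thread_suicide_timeout options; when debugging such a hang interactively one could presumably raise them, e.g.:

    # Assumed debugging aid, not part of the report: give the op worker threads
    # more time before the heartbeat map kills the OSD.
    [osd]
        osd_op_thread_timeout = 600
        osd_op_thread_suicide_timeout = 6000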

TEST_repair_stats_ec
/home/dzafman/ceph/src/osd/ECBackend.cc: 478: FAILED ceph_assert(op.xattrs.size())


Related issues: 1 (0 open, 1 closed)

Related to RADOS - Bug #41900: auto-scaler breaks many standalone tests (Resolved, David Zafman, 09/17/2019)
