Bug #41923

closed

3 different ceph-osd asserts caused by enabling auto-scaler

Added by David Zafman over 4 years ago. Updated about 4 years ago.

Status:
Can't reproduce
Priority:
Normal
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(RADOS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Changed config osd_pool_default_pg_autoscale_mode to "on"
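
For reference, a minimal sketch of one way the mode can be flipped on a running cluster (the standalone test harness may set it differently):

  # enable the autoscaler by default for newly created pools
  ceph config set global osd_pool_default_pg_autoscale_mode on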

Saw these 4 core dumps on 3 different sub-tests.

../qa/run-standalone.sh "osd-scrub-repair.sh TEST_XXXXXXXX"

TEST_corrupt_scrub_erasure_overwrites
/home/dzafman/ceph/src/osd/ECBackend.cc: 641: FAILED ceph_assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset( after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
/home/dzafman/ceph/src/osd/ECBackend.cc: 583: FAILED ceph_assert(op.hinfo)

TEST_repair_stats 
   -12> 2019-09-18T14:13:30.560-0700 7fd515db1700  1 -- 127.0.0.1:0/19306 <== osd.0 v2:127.0.0.1:6816/19643 43 ==== osd_ping(ping_reply e26 up_from 23 ping_stamp 2019-09-18T14:13:30.563990-0700/166.699510694s send_stamp 161.692563315s delta_ub -5.006947379s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55c3725fc400 con 0x55c36fbc3180
   -11> 2019-09-18T14:13:30.560-0700 7fd515db1700 20 osd.1 26 handle_osd_ping new stamps hbstamp(osd.0 up_from 23 peer_clock_delta [-5.007942020s,-5.006947379s])
   -10> 2019-09-18T14:13:30.796-0700 7fd50b357700  5 prioritycache tune_memory target: 4294967296 mapped: 57991168 unmapped: 81920 heap: 58073088 old mem: 2845415832 new mem: 2845415832
    -9> 2019-09-18T14:13:31.084-0700 7fd512571700 10 osd.1 26 tick
    -8> 2019-09-18T14:13:31.084-0700 7fd512571700 10 osd.1 26 do_waiters -- start
    -7> 2019-09-18T14:13:31.084-0700 7fd512571700 10 osd.1 26 do_waiters -- finish
    -6> 2019-09-18T14:13:31.084-0700 7fd512571700 20 osd.1 26 tick last_purged_snaps_scrub 2019-09-18T14:10:08.873595-0700 next 2019-09-19T14:10:08.873595-0700
    -5> 2019-09-18T14:13:31.811-0700 7fd50b357700  5 prioritycache tune_memory target: 4294967296 mapped: 57991168 unmapped: 81920 heap: 58073088 old mem: 2845415832 new mem: 2845415832
    -4> 2019-09-18T14:13:31.875-0700 7fd515db1700  1 -- [v2:127.0.0.1:6808/19306,v1:127.0.0.1:6809/19306] <== osd.0 127.0.0.1:0/19643 57 ==== osd_ping(ping e26 up_from 23 ping_stamp 2019-09-18T14:13:31.880600-0700/163.008322854s send_stamp 163.008322854s delta_ub -5.006947379s) v5 ==== 2033+0+0 (crc 0 0 0) 0x55c3725fc400 con 0x55c3724d6d80
    -3> 2019-09-18T14:13:31.875-0700 7fd515db1700 20 osd.1 26 handle_osd_ping new stamps hbstamp(osd.0 up_from 23 peer_clock_delta [-5.008228232s,-5.006947379s])
    -2> 2019-09-18T14:13:31.875-0700 7fd515db1700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd4fa541700' had timed out after 15
    -1> 2019-09-18T14:13:31.875-0700 7fd515db1700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fd4fa541700' had suicide timed out after 150
             0> 2019-09-18T14:13:31.899-0700 7fd4fa541700 -1 *** Caught signal (Aborted) **
 in thread 7fd4fa541700 thread_name:tp_osd_tp

 ceph version v15.0.0-5169-g0d5f330188 (0d5f33018877851db13181511d4868396079a5b9) octopus (dev)
 1: (()+0x2e29f48) [0x55c363fb6f48]
 2: (()+0x12890) [0x7fd519e93890]
 3: (pthread_cond_wait()+0x243) [0x7fd519e8e9f3]
 4: (ceph::condition_variable_debug::wait(std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >&)+0xab) [0x55c3640ab3a7]
 5: (BlueStore::OpSequencer::drain_preceding(BlueStore::TransContext*)+0x6b) [0x55c363e47be3]
 6: (BlueStore::_osr_drain_preceding(BlueStore::TransContext*)+0x268) [0x55c363e0b0a8]
 7: (BlueStore::_split_collection(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Collection>&, unsigned int, int)+0x24b) [0x55c363e2f7a5]
 8: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)+0x54d) [0x55c363e15277]
 9: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5e8) [0x55c363e14204]
 10: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x96) [0x55c363628000]
 11: (OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0x806) [0x55c3635ffc1e]
 12: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x380) [0x55c3636057ee]
 13: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55c363aaa28b]
 14: (OpQueueItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x4b) [0x55c36363211d]
 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x3631) [0x55c363612cf1]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x59c) [0x55c36405b882]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x25) [0x55c36405d25f]
 18: (Thread::entry_wrapper()+0x78) [0x55c364047ec6]
 19: (Thread::_entry_func(void*)+0x18) [0x55c364047e44]
 20: (()+0x76db) [0x7fd519e886db]
 21: (clone()+0x3f) [0x7fd518bd188f]

TEST_repair_stats_ec
/home/dzafman/ceph/src/osd/ECBackend.cc: 478: FAILED ceph_assert(op.xattrs.size())


Related issues 1 (0 open, 1 closed)

Related to RADOS - Bug #41900: auto-scaler breaks many standalone tests (Resolved, David Zafman, 09/17/2019)

Actions #1

Updated by David Zafman over 4 years ago

  • Description updated (diff)
Actions #2

Updated by David Zafman over 4 years ago

Looking at the assert(op.hinfo), it is caused by the corruption injected by the test. I'll verify that these asserts are expected with a corrupt PG when trying to scale, just as they would crash recovery/backfill.
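
For context, the EC corruption in these sub-tests is injected while an OSD is stopped, roughly along these lines (a sketch, not the exact helper the test calls; the data path and object name are illustrative):

  # strip the erasure-coding hash-info xattr from one shard so scrub/repair/recovery sees a corrupt object
  ceph-objectstore-tool --data-path td/osd-scrub-repair/0 SOMEOBJ rm-attr hinfo_key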

So my only concern with this tracker is the heartbeat suicide timeout hit by the TEST_repair_stats sub-test.
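
The 15s and 150s in the log above look like the op thread pool timeouts; if the slowdown needs to be tolerated while investigating, they can be raised, e.g. (a sketch, values illustrative):

  # assumed knobs behind "had timed out after 15" / "had suicide timed out after 150"
  ceph config set osd osd_op_thread_timeout 60
  ceph config set osd osd_op_thread_suicide_timeout 600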

Actions #3

Updated by David Zafman over 4 years ago

I think this stack better reflects the thread that hit the suicide timeout. However, every time I've seen this, thread 1 has been waiting in BlueStore::OpSequencer::drain_preceding().
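
For reference, the per-thread backtrace below comes from gdb; something along these lines reproduces it against the core (binary and core paths are illustrative):

  # dump backtraces for every thread in the core to find the one stuck in drain_preceding()
  gdb -batch -ex 'thread apply all bt' ./bin/ceph-osd core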

Thread 6 (Thread 0x7fdaaf8b0700 (LWP 44827)):
#0  0x00007fdab26939d0 in __GI___nanosleep (requested_time=requested_time@entry=0x7fdaaf8ac1c0, remaining=remaining@entry=0x7fdaaf8ac1c0)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:28
#1  0x00007fdab26938aa in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#2  0x000055854ebf5d62 in ceph::HeartbeatMap::_check (this=0x558559bd8e70, h=0x55855ac09630, who=0x55854f802887 "is_healthy", now=6329731)
    at /home/dzafman/ceph/src/common/HeartbeatMap.cc:79
#3  0x000055854ebf6805 in ceph::HeartbeatMap::is_healthy (this=0x558559bd8e70) at /home/dzafman/ceph/src/common/HeartbeatMap.cc:140
#4  0x000055854e19a7b2 in OSD::handle_osd_ping (this=0x55855a98e000, m=0x55855abfc200) at /home/dzafman/ceph/src/osd/OSD.cc:4790
#5  0x000055854e1b2f11 in OSD::heartbeat_dispatch (this=0x55855a98e000, m=0x55855abfc200) at /home/dzafman/ceph/src/osd/OSD.cc:7103
#6  0x000055854e200647 in OSD::HeartbeatDispatcher::ms_fast_dispatch (this=0x55855a98f0c8, m=0x55855abfc200) at /home/dzafman/ceph/src/osd/OSD.h:1547
#7  0x000055854e1faaea in Dispatcher::ms_fast_dispatch2 (this=0x55855a98f0c8, m=...) at /home/dzafman/ceph/src/msg/Dispatcher.h:84
#8  0x000055854f1289ee in Messenger::ms_fast_dispatch (this=0x55855a81a800, m=...) at /home/dzafman/ceph/src/msg/Messenger.h:669
#9  0x000055854f126eee in DispatchQueue::fast_dispatch (this=0x55855a81ab98, m=...) at /home/dzafman/ceph/src/msg/DispatchQueue.cc:72
#10 0x000055854f134521 in DispatchQueue::fast_dispatch (this=0x55855a81ab98, m=0x55855abfc200) at /home/dzafman/ceph/src/msg/DispatchQueue.h:203
#11 0x000055854f16dbe4 in ProtocolV2::handle_message (this=0x55855ae16580) at /home/dzafman/ceph/src/msg/async/ProtocolV2.cc:1517
#12 0x000055854f169fac in ProtocolV2::handle_read_frame_dispatch (this=0x55855ae16580) at /home/dzafman/ceph/src/msg/async/ProtocolV2.cc:1136
#13 0x000055854f16c336 in ProtocolV2::handle_read_frame_epilogue_main (this=0x55855ae16580, buffer=..., r=0) at /home/dzafman/ceph/src/msg/async/ProtocolV2.cc:1360
#14 0x000055854f197128 in CtRxNode<ProtocolV2>::call (this=0x55855ae16990, foo=0x55855ae16580) at /home/dzafman/ceph/src/msg/async/Protocol.h:67
#15 0x000055854f15dd3e in ProtocolV2::run_continuation (this=0x55855ae16580, continuation=...) at /home/dzafman/ceph/src/msg/async/ProtocolV2.cc:45
#16 0x000055854f164be8 in ProtocolV2::<lambda(char*, int)>::operator()(char *, int) const (__closure=0x55855ae8d618, buffer=0x55855ac22b00 "", r=0)
    at /home/dzafman/ceph/src/msg/async/ProtocolV2.cc:727
#17 0x000055854f17fbce in std::_Function_handler<void(char*, long int), ProtocolV2::read(CONTINUATION_RXBPTR_TYPE<ProtocolV2>&, rx_buffer_t&&)::<lambda(char*, int)> >::_M_invoke(const std::_Any_data &, char *&&, long &&) (__functor=..., __args#0=@0x7fdaaf8acfc0: 0x55855ac22b00 "", __args#1=@0x7fdaaf8acfb8: 0) at /usr/include/c++/9/bits/std_function.h:300
#18 0x000055854f1357e1 in std::function<void (char*, long)>::operator()(char*, long) const (this=0x55855ae8d618, __args#0=0x55855ac22b00 "", __args#1=0)
    at /usr/include/c++/9/bits/std_function.h:690
#19 0x000055854f130455 in AsyncConnection::process (this=0x55855ae8d200) at /home/dzafman/ceph/src/msg/async/AsyncConnection.cc:457
#20 0x000055854f134d34 in C_handle_read::do_request (this=0x558559b8b1a0, fd_or_id=65) at /home/dzafman/ceph/src/msg/async/AsyncConnection.cc:71
#21 0x000055854ee73d1d in EventCenter::process_events (this=0x558559b8d280, timeout_microseconds=30000000, working_dur=0x7fdaaf8ad378)
    at /home/dzafman/ceph/src/msg/async/Event.cc:406
#22 0x000055854ee7bbb2 in NetworkStack::<lambda()>::operator()(void) const (__closure=0x55855a8beab8) at /home/dzafman/ceph/src/msg/async/Stack.cc:53
#23 0x000055854ee7d088 in std::_Function_handler<void(), NetworkStack::add_thread(unsigned int)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/9/bits/std_function.h:300
#24 0x000055854e476f5a in std::function<void ()>::operator()() const (this=0x55855a8beab8) at /usr/include/c++/9/bits/std_function.h:690
#25 0x000055854ee95581 in std::__invoke_impl<void, std::function<void ()>>(std::__invoke_other, std::function<void ()>&&) (__f=...) at /usr/include/c++/9/bits/invoke.h:60
#26 0x000055854ee954a5 in std::__invoke<std::function<void ()>>(std::function<void ()>&&) (__fn=...) at /usr/include/c++/9/bits/invoke.h:95
#27 0x000055854ee95406 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x55855a8beab8) at /usr/include/c++/9/thread:244
#28 0x000055854ee95360 in std::thread::_Invoker<std::tuple<std::function<void ()> > >::operator()() (this=0x55855a8beab8) at /usr/include/c++/9/thread:251
#29 0x000055854ee952a4 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<std::function<void ()> > > >::_M_run() (this=0x55855a8beab0) at /usr/include/c++/9/thread:195
#30 0x00007fdab30261b0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#31 0x00007fdab39876db in start_thread (arg=0x7fdaaf8b0700) at pthread_create.c:463
#32 0x00007fdab26d088f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Actions #4

Updated by David Zafman over 4 years ago

  • Related to Bug #41900: auto-scaler breaks many standalone tests added
Actions #5

Updated by Sage Weil about 4 years ago

  • Status changed from New to Can't reproduce