Bug #45994

OSD crash - in thread tp_osd_tp

Added by Nokia ceph-users about 1 year ago. Updated about 2 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have recently been seeing random OSD crashes in thread tp_osd_tp, with the backtrace below, on one of our Nautilus clusters. There was no heavy activity at the time; only minimal, constant read and write traffic was ongoing in the cluster.

Environment: 5-node 14.2.2 Nautilus cluster with 60 OSDs per node, CentOS 7.6

-6> 2020-05-09 12:25:41.738 7f51ad4d7700  4 mgrc ms_handle_reset ms_handle_reset con 0x55e88ac17000
-5> 2020-05-09 12:25:41.738 7f51ad4d7700 4 mgrc reconnect Terminating session with v2:172.25.20.39:7040/44588
-4> 2020-05-09 12:25:41.738 7f51ad4d7700 4 mgrc reconnect Starting new session with [v2:172.25.20.39:7040/44588,v1:172.25.20.39:7041/44588]
-3> 2020-05-09 12:25:42.227 7f51ad4d7700 4 mgrc handle_mgr_configure stats_period=5
-2> 2020-05-09 12:35:43.861 7f51a9f3b700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.61 down, but it is still running
-1> 2020-05-09 12:35:43.861 7f51a9f3b700 0 log_channel(cluster) log [DBG] : map e2959 wrongly marked me down at e2958
0> 2020-05-09 12:37:29.499 7f519b71e700 -1 *** Caught signal (Aborted) **
in thread 7f519b71e700 thread_name:tp_osd_tp
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
1: (()+0xf5d0) [0x7f51be5965d0]
2: (pthread_cond_wait()+0xc5) [0x7f51be592965]
3: (std::condition_variable::wait(std::unique_lock<std::mutex>&)+0xc) [0x7f51bdce682c]
4: (BlueStore::Collection::flush()+0x86) [0x55e84b47f606]
5: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x51) [0x55e84b4c7961]
6: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, std::vector<hobject_t, std::allocator<hobject_t> >*, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x147) [0x55e84b268187]
7: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x28a) [0x55e84b1144da]
8: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x169c) [0x55e84b142d0c]
9: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x55e84b143d4f]
10: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x12) [0x55e84b2f32e2]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55e84b072ef4]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55e84b671ce3]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e84b674d80]
14: (()+0x7dd5) [0x7f51be58edd5]
15: (clone()+0x6d) [0x7f51bd44dead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Ceph status when the OSD was marked down:

ceph -s
  cluster:
    id:     d948df33-50a5-4230-8ede-796a0727e09f
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum os1,os2,os3,os4,os5 (age 3w)
    mgr: os4(active, since 3w), standbys: os3, os1, os2, os5
    osd: 300 osds: 299 up (since 4d), 299 in (since 3d)

  data:
    pools:   1 pools, 8192 pgs
    objects: 32.82M objects, 57 TiB
    usage:   76 TiB used, 3.1 PiB / 3.2 PiB avail
    pgs:     8150 active+clean
             42   active+clean+scrubbing

  io:
    client: 13 KiB/s rd, 126 MiB/s wr, 9 op/s rd, 214 op/s wr

Attaching the OSD log from one of the occurrences. Please let me know if any other information should be attached.

Thanks,

ceph-osd-61.log View (216 KB) Nokia ceph-users, 06/15/2020 06:39 AM


Related issues

Related to bluestore - Bug #45765: BlueStore::_collection_list causes huge latency growth pg deletion Resolved
Related to bluestore - Bug #40741: Mass OSD failure, unable to restart Triaged

History

#1 Updated by Nokia ceph-users about 1 year ago

Hi,
We have found that the issue was caused by a heartbeat timeout, and resolved it by increasing the timer. Can this ticket be closed, please? (I am not finding the option to do so.)
Thanks
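For reference, the "timer" mentioned above is presumably the OSD thread-pool suicide timeout that fires in this backtrace. On a Nautilus cluster it can be raised roughly as follows; the value 300 is illustrative (the Nautilus default is 150 seconds), and as the next comment notes, raising it masks the symptom rather than fixing the cause:

```shell
# Raise the op-thread suicide timeout (in seconds) for all OSDs at runtime.
# 300 is an illustrative value; the Nautilus default is 150.
ceph config set osd osd_op_thread_suicide_timeout 300

# Or set it persistently in ceph.conf on each OSD host:
# [osd]
# osd_op_thread_suicide_timeout = 300
```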

#2 Updated by Igor Fedotov about 1 year ago

Increasing the suicide timeout doesn't look like the proper way of dealing with this issue.

I presume you're suffering from a highly fragmented KV database, which makes DB access operations (e.g. collection listing) run slowly.

Related tickets are:
https://tracker.ceph.com/issues/45765
https://tracker.ceph.com/issues/40741

The trigger for such behavior seems to be massive prior data removal. For now, the only workaround is manual DB compaction using ceph-kvstore-tool.
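A manual compaction run might look roughly like this; the OSD id (61) and the default data path are illustrative, and the daemon must be stopped before its store is touched:

```shell
# Stop the affected OSD, compact its RocksDB offline, then restart it.
systemctl stop ceph-osd@61
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-61 compact
systemctl start ceph-osd@61
```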

I also suggest upgrading your cluster; v14.2.2 seems to be pretty outdated now.

#3 Updated by Igor Fedotov about 1 year ago

  • Related to Bug #45765: BlueStore::_collection_list causes huge latency growth pg deletion added

#4 Updated by Igor Fedotov about 1 year ago

  • Related to Backport #40471: nautilus: cephfs-shell: Fix flake8 warnings and errors added

#5 Updated by Igor Fedotov about 1 year ago

  • Project changed from Ceph to bluestore
  • Status changed from New to Triaged

#6 Updated by Igor Fedotov about 1 year ago

  • Related to Bug #40741: Mass OSD failure, unable to restart added

#7 Updated by Igor Fedotov about 1 year ago

  • Related to deleted (Backport #40471: nautilus: cephfs-shell: Fix flake8 warnings and errors)

#8 Updated by Igor Fedotov about 2 months ago

  • Status changed from Triaged to Duplicate
