Bug #45994

OSD crash - in thread tp_osd_tp

Added by Nokia ceph-users about 1 year ago. Updated about 2 months ago.

Status:
Duplicate
Priority:
Normal
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

We have recently been seeing random OSD crashes in thread tp_osd_tp, with the backtrace below, on one of our Nautilus clusters. There was no heavy activity at the time; only minimal, constant read and write traffic was ongoing in the cluster.

Environment: 5-node 14.2.2 Nautilus cluster with 60 OSDs per node, CentOS 7.6

-6> 2020-05-09 12:25:41.738 7f51ad4d7700  4 mgrc ms_handle_reset ms_handle_reset con 0x55e88ac17000
-5> 2020-05-09 12:25:41.738 7f51ad4d7700 4 mgrc reconnect Terminating session with v2:172.25.20.39:7040/44588
-4> 2020-05-09 12:25:41.738 7f51ad4d7700 4 mgrc reconnect Starting new session with [v2:172.25.20.39:7040/44588,v1:172.25.20.39:7041/44588]
-3> 2020-05-09 12:25:42.227 7f51ad4d7700 4 mgrc handle_mgr_configure stats_period=5
-2> 2020-05-09 12:35:43.861 7f51a9f3b700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.61 down, but it is still running
-1> 2020-05-09 12:35:43.861 7f51a9f3b700 0 log_channel(cluster) log [DBG] : map e2959 wrongly marked me down at e2958
0> 2020-05-09 12:37:29.499 7f519b71e700 -1 *** Caught signal (Aborted) **
in thread 7f519b71e700 thread_name:tp_osd_tp
ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus (stable)
1: (()+0xf5d0) [0x7f51be5965d0]
2: (pthread_cond_wait()+0xc5) [0x7f51be592965]
3: (std::condition_variable::wait(std::unique_lock<std::mutex>&)+0xc) [0x7f51bdce682c]
4: (BlueStore::Collection::flush()+0x86) [0x55e84b47f606]
5: (BlueStore::collection_list(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, ghobject_t const&, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x51) [0x55e84b4c7961]
6: (PGBackend::objects_list_range(hobject_t const&, hobject_t const&, std::vector<hobject_t, std::allocator<hobject_t> >*, std::vector<ghobject_t, std::allocator<ghobject_t> >*)+0x147) [0x55e84b268187]
7: (PG::build_scrub_map_chunk(ScrubMap&, ScrubMapBuilder&, hobject_t, hobject_t, bool, ThreadPool::TPHandle&)+0x28a) [0x55e84b1144da]
8: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x169c) [0x55e84b142d0c]
9: (PG::scrub(unsigned int, ThreadPool::TPHandle&)+0xaf) [0x55e84b143d4f]
10: (PGScrub::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x12) [0x55e84b2f32e2]
11: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x9f4) [0x55e84b072ef4]
12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x433) [0x55e84b671ce3]
13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55e84b674d80]
14: (()+0x7dd5) [0x7f51be58edd5]
15: (clone()+0x6d) [0x7f51bd44dead]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Ceph status when the OSD was marked down:

ceph -s
  cluster:
    id:     d948df33-50a5-4230-8ede-796a0727e09f
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum os1,os2,os3,os4,os5 (age 3w)
    mgr: os4(active, since 3w), standbys: os3, os1, os2, os5
    osd: 300 osds: 299 up (since 4d), 299 in (since 3d)

  data:
    pools:   1 pools, 8192 pgs
    objects: 32.82M objects, 57 TiB
    usage:   76 TiB used, 3.1 PiB / 3.2 PiB avail
    pgs:     8150 active+clean
             42   active+clean+scrubbing

  io:
    client: 13 KiB/s rd, 126 MiB/s wr, 9 op/s rd, 214 op/s wr

Attaching the OSD log from one of the occurrences. Please let me know if any other information should be attached.

Thanks,

ceph-osd-61.log View (216 KB) Nokia ceph-users, 06/15/2020 06:39 AM


Related issues

Related to bluestore - Bug #45765: BlueStore::_collection_list causes huge latency growth pg deletion Resolved
Related to bluestore - Bug #40741: Mass OSD failure, unable to restart Triaged

History

#1 Updated by Nokia ceph-users about 1 year ago

Hi,
We have found that the issue was caused by a heartbeat timeout, and resolved it by increasing the timer. Can this ticket be closed, please? (I am not finding the option to do so.)
Thanks
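For reference, the "timer" mentioned above is presumably the OSD thread-pool suicide timeout that fires in this backtrace. On a Nautilus cluster it can be raised roughly as follows; the value 300 is illustrative (the Nautilus default is 150 seconds), and as the next comment notes, raising it masks the symptom rather than fixing the cause:

```shell
# Raise the op-thread suicide timeout (in seconds) for all OSDs at runtime.
# 300 is an illustrative value; the Nautilus default is 150.
ceph config set osd osd_op_thread_suicide_timeout 300

# Or set it persistently in ceph.conf on each OSD host:
# [osd]
# osd_op_thread_suicide_timeout = 300
```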

#2 Updated by Igor Fedotov about 1 year ago

Increasing the suicide timeout doesn't look like the proper way of dealing with this issue.

I presume you're suffering from a highly fragmented KV database, which makes DB access operations (e.g. collection listing) run slowly.

Related tickets are:
https://tracker.ceph.com/issues/45765
https://tracker.ceph.com/issues/40741

The trigger for such behavior seems to be massive prior data removal. For now, the only workaround is manual DB compaction using ceph-kvstore-tool.
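A manual compaction run might look roughly like this; the OSD id (61) and the default data path are illustrative, and the daemon must be stopped before its store is touched:

```shell
# Stop the affected OSD, compact its RocksDB offline, then restart it.
systemctl stop ceph-osd@61
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-61 compact
systemctl start ceph-osd@61
```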

I also suggest upgrading your cluster; v14.2.2 seems to be pretty outdated now.

#3 Updated by Igor Fedotov about 1 year ago

  • Related to Bug #45765: BlueStore::_collection_list causes huge latency growth pg deletion added

#4 Updated by Igor Fedotov about 1 year ago

  • Related to Backport #40471: nautilus: cephfs-shell: Fix flake8 warnings and errors added

#5 Updated by Igor Fedotov about 1 year ago

  • Project changed from Ceph to bluestore
  • Status changed from New to Triaged

#6 Updated by Igor Fedotov about 1 year ago

  • Related to Bug #40741: Mass OSD failure, unable to restart added

#7 Updated by Igor Fedotov about 1 year ago

  • Related to deleted (Backport #40471: nautilus: cephfs-shell: Fix flake8 warnings and errors)

#8 Updated by Igor Fedotov about 2 months ago

  • Status changed from Triaged to Duplicate
