Bug #57531


Multiple zombie processes, increasing over time

Added by zhiwei wang over 1 year ago. Updated over 1 year ago.

Status:
Need More Info
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Trying to reproduce bug #57411 on Octopus (15.2.17, cephadm, Docker 20.10)

*****************************************************
[root@ceph120 ~]# ps aux | grep " Z "
ceph 3274 0.0 0.0 0 0 ? Z Sep12 0:00 [ssh] <defunct>
ceph 3275 0.0 0.0 0 0 ? Z Sep12 0:00 [ssh] <defunct>
ceph 3284 0.0 0.0 0 0 ? Z Sep12 0:00 [ssh] <defunct>
ceph 20946 0.0 0.0 0 0 ? Z 02:29 0:00 [ssh] <defunct>
ceph 20947 0.0 0.0 0 0 ? Z 02:29 0:00 [ssh] <defunct>
ceph 20948 0.0 0.0 0 0 ? Z 02:29 0:00 [ssh] <defunct>
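A zombie stays in the process table until its parent reaps it with wait(), so listing each zombie's parent usually points at the component that is leaking them (in a cephadm deployment, the [ssh] <defunct> entries are often children of the cephadm/mgr orchestration). A minimal sketch for gathering that information:

```shell
#!/bin/sh
# List every zombie (state Z) together with its parent PID and the parent's
# command name, so the process failing to reap them can be identified.
ps -eo pid,stat | awk '$2 ~ /^Z/ {print $1}' | while read -r pid; do
    ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    pcomm=$(ps -o comm= -p "$ppid")
    echo "zombie $pid  parent $ppid ($pcomm)"
done
```

If all zombies share one parent, restarting that daemon (or attaching strace to it) is the next step; killing the zombies themselves has no effect.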
Actions #1

Updated by Venky Shankar over 1 year ago

  • Status changed from New to Need More Info

I assume you are talking about https://tracker.ceph.com/issues/57411 here? If yes, could you please provide more debug information (starting with client/mds logs)?
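For reference, the requested client/MDS debug logs can usually be gathered by raising log levels through the centralized config system (a sketch with typical debugging values; adjust levels and targets as needed, and remember to lower them afterwards):

```shell
# Raise MDS verbosity (applies to all mds daemons via centralized config)
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1

# Raise ceph-fuse client verbosity; the client logs to its configured log file
ceph config set client debug_client 20
ceph config set client debug_ms 1
```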

Actions #2

Updated by zhiwei wang over 1 year ago

Venky Shankar wrote:

I assume you are talking about https://tracker.ceph.com/issues/57411 here? If yes, could you please provide more debug information (starting with client/mds logs)?

I tried to reproduce bug #57411 (studying file storage) with 1 client (mounted via ceph-fuse on ceph120). There were 12 zombie processes after 24 hours and 24 zombie processes after 36 hours, along with many ERR and WRN entries in /var/log/messages.
My test environment:
CPU: Intel(R) Core(TM) i5-10400F CPU @ 2.90GHz
Memory: 32G
Disk: SATA 2T

OS:        VMware ESXi 7.0.0
ceph cluster: 4c8g * 3 (ceph120 ceph121 ceph122)
osds: 30g * 9

[root@ceph120 ~]# ceph -s
cluster:
id: 665572f0-3116-11ed-8085-000c29d2dd1d
health: HEALTH_WARN
insufficient standby MDS daemons available

services:
mon: 3 daemons, quorum ceph120,ceph121,ceph122 (age 28s)
mgr: ceph121.mfcbnn(active, since 54m), standbys: ceph120.zvsgqm
mds: amdsfs:2 {0=amdsfs.ceph120.lmbxfw=up:active,1=amdsfs.ceph122.xdeeaa=up:active}
osd: 9 osds: 9 up (since 2m), 9 in (since 8d)
data:
pools: 3 pools, 65 pgs
objects: 3.26k objects, 1.4 GiB
usage: 14 GiB used, 436 GiB / 450 GiB avail
pgs: 65 active+clean
io:
client: 7.7 MiB/s wr, 0 op/s rd, 4 op/s wr

[root@ceph120 ~]# ceph orch ps --daemon-type osd
NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID
osd.0 ceph122 running (54m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 c627d7412f17
osd.1 ceph120 running (54m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 134c8245da5c
osd.2 ceph121 running (55m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 1deeee0aafdc
osd.3 ceph122 running (54m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 672ba7d1145e
osd.4 ceph120 running (54m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 937679de396e
osd.5 ceph121 running (55m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 52ed830cdc88
osd.6 ceph122 running (54m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 99827a8723d0
osd.7 ceph120 running (54m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 374496962f68
osd.8 ceph121 running (55m) 9m ago 8d 15.2.13 docker.io/ceph/ceph:v15 2cf504fded39 cecc8dae70c7

ceph crash ls-new
--------------------------
2022-09-13T17:39:02.924158Z_f70ba90f-7708-4311-b448-0e9d02f21baa
2022-09-13T17:39:03.053510Z_f2dcc016-0497-4c10-97fd-c46a6b743ead
--------------------------
ceph crash info 2022-09-13T17:39:02.924158Z_f70ba90f-7708-4311-b448-0e9d02f21baa
-------------------------- {
"assert_condition": "abort",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/common/HeartbeatMap.cc",
"assert_func": "bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)",
"assert_line": 80,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)' thread 7fa5b0055700 time 2022-09-13T17:39:02.453684+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/common/HeartbeatMap.cc: 80: ceph_abort_msg(\"hit suicide timeout\")\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"(()+0x12b20) [0x7fa5d08eeb20]",
"(pthread_kill()+0x35) [0x7fa5d08eb8d5]",
"(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x258) [0x55a1e8128098]",
"(ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x21d) [0x55a1e81289ed]",
"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x8de) [0x55a1e7fdcbee]",
"(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x55a1e7b4c825]",
"(OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0xf3) [0x55a1e7ae8103]",
"(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8) [0x55a1e7b16cb8]",
"(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55a1e7d48906]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55a1e7b0992f]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55a1e8149f84]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a1e814cbe4]",
"(()+0x814a) [0x7fa5d08e414a]",
"(clone()+0x43) [0x7fa5cf61bf23]"
],
"ceph_version": "15.2.13",
"crash_id": "2022-09-13T17:39:02.924158Z_f70ba90f-7708-4311-b448-0e9d02f21baa",
"entity_name": "osd.4",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "2aeef976ce0aa37e5c29ef54ede141193838149c213ebceb51f88d5ac5379ce5",
"timestamp": "2022-09-13T17:39:02.924158Z",
"utsname_hostname": "ceph120",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-348.7.1.el8_5.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Dec 22 13:25:12 UTC 2021"
}
--------------------------
ceph crash info 2022-09-13T17:39:03.053510Z_f2dcc016-0497-4c10-97fd-c46a6b743ead
-------------------------- {
"assert_condition": "abort",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/common/HeartbeatMap.cc",
"assert_func": "bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)",
"assert_line": 80,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, ceph::time_detail::coarse_mono_clock::rep)' thread 7fa5b0055700 time 2022-09-13T17:39:02.453684+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.13/rpm/el8/BUILD/ceph-15.2.13/src/common/HeartbeatMap.cc: 80: ceph_abort_msg(\"hit suicide timeout\")\n",
"assert_thread_name": "tp_osd_tp",
"backtrace": [
"(()+0x12b20) [0x7fa5d08eeb20]",
"(abort()+0x203) [0x7fa5cf540d11]",
"(ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x1b6) [0x55a1e7a093e9]",
"(ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, unsigned long)+0x295) [0x55a1e81280d5]",
"(ceph::HeartbeatMap::clear_timeout(ceph::heartbeat_handle_d*)+0x21d) [0x55a1e81289ed]",
"(BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x8de) [0x55a1e7fdcbee]",
"(ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ceph::os::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x85) [0x55a1e7b4c825]",
"(OSD::dispatch_context(PeeringCtx&, PG*, std::shared_ptr<OSDMap const>, ThreadPool::TPHandle*)+0xf3) [0x55a1e7ae8103]",
"(OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0x2d8) [0x55a1e7b16cb8]",
"(ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55a1e7d48906]",
"(OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x12ef) [0x55a1e7b0992f]",
"(ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x55a1e8149f84]",
"(ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55a1e814cbe4]",
"(()+0x814a) [0x7fa5d08e414a]",
"(clone()+0x43) [0x7fa5cf61bf23]"
],
"ceph_version": "15.2.13",
"crash_id": "2022-09-13T17:39:03.053510Z_f2dcc016-0497-4c10-97fd-c46a6b743ead",
"entity_name": "osd.4",
"os_id": "centos",
"os_name": "CentOS Linux",
"os_version": "8",
"os_version_id": "8",
"process_name": "ceph-osd",
"stack_sig": "e54e4527e10436663dac5cdf953e82c939cc1da1f18576cd4391d9538aad55e0",
"timestamp": "2022-09-13T17:39:03.053510Z",
"utsname_hostname": "ceph120",
"utsname_machine": "x86_64",
"utsname_release": "4.18.0-348.7.1.el8_5.x86_64",
"utsname_sysname": "Linux",
"utsname_version": "#1 SMP Wed Dec 22 13:25:12 UTC 2021"
}
----------------------

Actions #3

Updated by Venky Shankar over 1 year ago

Are you saying the zombie processes are ceph-osd daemons?

Actions #4

Updated by Venky Shankar over 1 year ago

... or the daemon crashes are a different issue than the zombie processes (ceph-mds??).
