Bug #17001

closed

async messenger: OSD crashes when ms_async_op_threads=1

Added by Dong Wu over 7 years ago. Updated over 7 years ago.

Status:
Rejected
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
% Done:

0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

With the default async messenger config everything works, but when I set ms_async_op_threads=1 and restart the OSD, the OSD crashes. Here is the log leading up to the crash:

2016-08-11 14:40:30.933833 7f4caedf1800 0 set uid:gid to 64045:64045 (ceph:ceph)
2016-08-11 14:40:30.933857 7f4caedf1800 0 ceph version 11.0.0-1362-g2c7ec07 (2c7ec0730ecf193d2436eb48d2b2897952874b4c), process ceph-osd, pid 28259
2016-08-11 14:40:30.933890 7f4caedf1800 5 object store type is filestore
2016-08-11 14:40:30.934414 7f4caedf1800 10 WorkerPool -- get_worker
2016-08-11 14:40:30.934426 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc380 load: 0
2016-08-11 14:40:30.934431 7f4caedf1800 20 WorkerPool -- get_worker picked 0x7f4ca98dc380 as best worker with load 0
2016-08-11 14:40:30.934459 7f4caedf1800 10 WorkerPool -- get_worker
2016-08-11 14:40:30.934460 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc380 load: 1
2016-08-11 14:40:30.934461 7f4caedf1800 20 WorkerPool -- get_worker picked 0x7f4ca98dc380 as best worker with load 1
2016-08-11 14:40:30.934474 7f4caedf1800 10 WorkerPool -- get_worker
2016-08-11 14:40:30.934476 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc380 load: 2
2016-08-11 14:40:30.934477 7f4caedf1800 20 WorkerPool -- get_worker creating worker
2016-08-11 14:40:30.934674 7f4ca7bff700 10 Worker -- entry starting
2016-08-11 14:40:30.934735 7f4ca7bff700 1 Event(0x7f4ca98dc748 nevent=5000 time_id=1).set_owner idx=1 owner=139967208617728
2016-08-11 14:40:30.934753 7f4ca7bff700 20 Event(0x7f4ca98dc748 nevent=5000 time_id=1).create_file_event create event started fd=9 mask=1 original mask is 0
2016-08-11 14:40:30.934765 7f4ca7bff700 20 EpollDriver.add_event add event fd=9 cur_mask=0 add_mask=1 to 8
2016-08-11 14:40:30.934781 7f4ca7bff700 10 Event(0x7f4ca98dc748 nevent=5000 time_id=1).create_file_event create event end fd=9 mask=1 original mask is 1
2016-08-11 14:40:30.934792 7f4ca7bff700 20 Worker -- entry calling event process
2016-08-11 14:40:30.934802 7f4ca7bff700 10 Event(0x7f4ca98dc748 nevent=5000 time_id=1).process_events wait second 30 usec 0
2016-08-11 14:40:30.934880 7f4caedf1800 10 WorkerPool -- get_worker
2016-08-11 14:40:30.934891 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc380 load: 2
2016-08-11 14:40:30.934893 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc700 load: 1
2016-08-11 14:40:30.934895 7f4caedf1800 20 WorkerPool -- get_worker picked 0x7f4ca98dc700 as best worker with load 1
2016-08-11 14:40:30.934914 7f4caedf1800 10 WorkerPool -- get_worker
2016-08-11 14:40:30.934915 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc380 load: 2
2016-08-11 14:40:30.934917 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc700 load: 2
2016-08-11 14:40:30.934917 7f4caedf1800 20 WorkerPool -- get_worker picked 0x7f4ca98dc380 as best worker with load 2
2016-08-11 14:40:30.934926 7f4caedf1800 10 WorkerPool -- get_worker
2016-08-11 14:40:30.934927 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc380 load: 3
2016-08-11 14:40:30.934928 7f4caedf1800 20 WorkerPool -- get_worker Worker 0x7f4ca98dc700 load: 2
2016-08-11 14:40:30.934931 7f4caedf1800 20 WorkerPool -- get_worker picked 0x7f4ca98dc700 as best worker with load 2
.......................

2016-08-11 14:40:31.058651 7f4caedf1800 10 osd.0 10 create_logger
2016-08-11 14:40:31.058665 7f4caedf1800 10 -- 0.0.0.0:6800/28259 ready 0.0.0.0:6800/28259
2016-08-11 14:40:31.058940 7f4c999ff700 10 Worker -- entry starting
2016-08-11 14:40:31.058966 7f4c991fe700 10 Worker -- entry starting
2016-08-11 14:40:31.058970 7f4c999ff700 1 Event(0x7f4ca98dc3c8 nevent=5000 time_id=1).set_owner idx=0 owner=139966971639552
2016-08-11 14:40:31.059033 7f4c999ff700 20 Event(0x7f4ca98dc3c8 nevent=5000 time_id=1).create_file_event create event started fd=6 mask=1 original mask is 0
2016-08-11 14:40:31.059039 7f4c999ff700 20 EpollDriver.add_event add event fd=6 cur_mask=0 add_mask=1 to 5
2016-08-11 14:40:31.059047 7f4c999ff700 10 Event(0x7f4ca98dc3c8 nevent=5000 time_id=1).create_file_event create event end fd=6 mask=1 original mask is 1
2016-08-11 14:40:31.059048 7f4c999ff700 20 Worker -- entry calling event process
2016-08-11 14:40:31.059051 7f4c999ff700 10 Event(0x7f4ca98dc3c8 nevent=5000 time_id=1).process_events wait second 30 usec 0
2016-08-11 14:40:31.060587 7f4c991fe700 1 msg/async/Event.cc: In function 'void EventCenter::set_owner()' thread 7f4c991fe700 time 2016-08-11 14:40:31.058997
msg/async/Event.cc: 140: FAILED assert(global_centers && !global_centers->centers[idx])

ceph version 11.0.0-1362-g2c7ec07 (2c7ec0730ecf193d2436eb48d2b2897952874b4c)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x7f4cae84c692]
2: (EventCenter::set_owner()+0x6b4) [0x7f4cae924f74]
3: (Worker::entry()+0xb1) [0x7f4cae8fff51]
4: (()+0x80a4) [0x7f4cacc3b0a4]
5: (clone()+0x6d) [0x7f4caab2887d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #1

Updated by Dong Wu over 7 years ago

Reading the source, I found that when the OSD starts it creates 6 messengers: ms_public, ms_cluster, ms_hbclient, ms_hb_back_server, ms_hb_front_server and ms_objecter. Each one calls Messenger::create, which constructs an AsyncMessenger, which in turn calls pool->get_worker().
But with ms_async_op_threads=1, WorkerPool::get_worker() sees min_load=2 while workers.size()=1, so it allocates a new worker and creates its thread immediately, e.g.:
AsyncMessenger constructor --> WorkerPool::get_worker() --> new worker and create --> center.set_owner() --> init global_centers and set global_centers->centers[1]

At this point the WorkerPool has not been started yet, and workers.size() is now 2. When WorkerPool::start runs, it iterates over all workers and creates each one, but global_centers->centers[1] is already set, so the assert "assert(global_centers && !global_centers->centers[idx])" fails.
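
To make the sequence easier to follow, here is a minimal single-threaded sketch of the double-create described above. It is not Ceph's code: MiniPool, Worker, simulate_osd_start and the max_workers cap are hypothetical names invented for illustration, the growth threshold of 2 is taken from the min_load=2 case mentioned above, and a failure counter stands in for the real assert so the model can be run safely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a messenger worker; not Ceph's Worker class.
struct Worker {
  std::size_t idx;  // slot this worker's event center would claim
  int load;         // number of messengers assigned to this worker
};

// Hypothetical, simplified model of the worker pool (assumes op_threads >= 1).
class MiniPool {
 public:
  MiniPool(std::size_t op_threads, std::size_t max_workers)
      : max_workers_(max_workers) {
    // Initial workers are allocated here but NOT created until start().
    for (std::size_t i = 0; i < op_threads; ++i) add_worker();
  }

  // Pick the least-loaded worker; grow the pool when every worker has load >= 2.
  std::size_t get_worker() {
    std::size_t best = 0;
    for (std::size_t i = 1; i < workers_.size(); ++i)
      if (workers_[i].load < workers_[best].load) best = i;
    // max_workers_ is an assumed cap, chosen only so the run resembles the log.
    if (workers_[best].load >= 2 && workers_.size() < max_workers_) {
      best = add_worker();
      create(best);  // the new worker is created right away, before start()
    }
    ++workers_[best].load;
    return best;
  }

  // Creates every worker in the list, including ones already created above.
  void start() {
    for (std::size_t i = 0; i < workers_.size(); ++i) create(i);
  }

  int failures() const { return failures_; }

 private:
  std::size_t add_worker() {
    workers_.push_back(Worker{workers_.size(), 0});
    owned_.push_back(false);
    return workers_.size() - 1;
  }

  // Models set_owner(): claiming an already-claimed slot is where the real
  // code hits assert(global_centers && !global_centers->centers[idx]).
  void create(std::size_t i) {
    if (owned_[workers_[i].idx]) { ++failures_; return; }
    owned_[workers_[i].idx] = true;
  }

  std::vector<Worker> workers_;
  std::vector<bool> owned_;  // stands in for global_centers->centers[]
  std::size_t max_workers_;
  int failures_ = 0;
};

// Construct `messengers` messengers, then start the pool; returns how many
// times a slot was claimed twice.
int simulate_osd_start(std::size_t op_threads, int messengers,
                       std::size_t max_workers) {
  MiniPool pool(op_threads, max_workers);
  for (int i = 0; i < messengers; ++i) pool.get_worker();
  pool.start();
  return pool.failures();
}
```

In this model, simulate_osd_start(1, 6, 2) returns 1: the third get_worker() call creates the worker with idx=1 early, and start() then tries to claim the same slot again. With three initial workers and six messengers no worker reaches the growth condition, so simulate_osd_start(3, 6, 5) returns 0.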

Actions #3

Updated by Haomai Wang over 7 years ago

This code has since been removed.

Actions #4

Updated by Haomai Wang over 7 years ago

  • Status changed from New to Rejected