Bug #16396

Fix shutting down mds timed-out due to deadlock

Added by Zhi Zhang almost 8 years ago. Updated over 7 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

This issue was found in jewel when restarting/stopping mds. It took a long time for mds to stop completely, until the mds thread timed out. The cause: when shutting down the async msgr, it dispatches an external barrier event to a worker and tries to wake the worker up, but that worker has already processed a clean_handler event, so its event status is already marked as woken up. The barrier event therefore never gets processed, and the msgr thread gets stuck waiting for the barrier count.
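
To make the failure mode concrete, here is a minimal, self-contained C++ sketch of the pattern described above. This is not the Ceph implementation; the Worker class, its "woken" flag, and dispatch_event_external() are hypothetical stand-ins loosely modeled on the names in the log below. The point it illustrates: when the dispatcher signals a worker only on the false-to-true transition of a stateful wakeup flag, a flag left set by an earlier event (the clean_handler here) silently swallows the wakeup for the barrier event, and the shutdown path blocks on the barrier count forever. The worker must reset the flag itself before checking for new work.

#include <condition_variable>
#include <deque>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical Worker: a stripped-down event loop with an external event
// queue and a per-worker "woken" flag.  Not Ceph code.
class Worker {
  std::mutex m;
  std::condition_variable cv;
  std::deque<std::function<void()>> external_events;
  bool woken = false;    // the stateful wakeup marker at the heart of the bug
  bool stopping = false;
  std::thread t{[this] { loop(); }};

  void loop() {
    std::unique_lock<std::mutex> l(m);
    while (!stopping) {
      // Clear the flag *before* draining the queue.  If it were left set
      // by an earlier event (as in the reported bug), a later dispatch
      // would see woken == true, skip the notify, and its event would
      // never run.
      woken = false;
      while (!external_events.empty()) {
        auto fn = std::move(external_events.front());
        external_events.pop_front();
        l.unlock();
        fn();            // run the event (e.g. a barrier completion) unlocked
        l.lock();
      }
      cv.wait(l, [this] { return woken || stopping; });
    }
  }

public:
  void dispatch_event_external(std::function<void()> fn) {
    std::lock_guard<std::mutex> g(m);
    external_events.push_back(std::move(fn));
    if (!woken) {        // wake only on the false -> true transition
      woken = true;
      cv.notify_one();
    }
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> g(m);
      stopping = true;
      cv.notify_one();
    }
    t.join();
  }
};

int main() {
  Worker w;
  // A barrier like the one mark_down_all waits on: two events must run.
  std::mutex bm;
  std::condition_variable bcv;
  int barrier_count = 2;
  for (int i = 0; i < 2; ++i) {
    w.dispatch_event_external([&] {
      std::lock_guard<std::mutex> g(bm);
      if (--barrier_count == 0)
        bcv.notify_one();
    });
  }
  std::unique_lock<std::mutex> l(bm);
  // With the buggy flag handling this wait would hang, exactly like the
  // "barrier wait for 2 barrier" line in the log below.
  bcv.wait(l, [&] { return barrier_count == 0; });
  std::cout << "barrier drained, shutting down" << std::endl;
  w.shutdown();
}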

2016-06-20 16:14:45.889943 7f02d91af700  1 mds.0.70 shutdown: messenger
2016-06-20 16:14:45.889948 7f02d91af700 10 -- xxx:6801/26647 shutdown xxx:6801/26647
2016-06-20 16:14:45.889963 7f02d91af700 10  Processor -- stop
2016-06-20 16:14:45.889965 7f02d91af700 20 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event started fd=13 mask=1 original mask is 1
2016-06-20 16:14:45.889977 7f02d91af700 20 EpollDriver.del_event del event fd=13 cur_mask=1 delmask=1 to 7
2016-06-20 16:14:45.889996 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event end fd=13 mask=1 original mask is 0
2016-06-20 16:14:45.890022 7f02d91af700  1 -- xxx:6801/26647 mark_down_all
2016-06-20 16:14:45.890028 7f02d91af700  5 -- xxx:6801/26647 mark_down_all mark down xxx:0/4223556835 0x7f02ed6f9a00
2016-06-20 16:14:45.890048 7f02d91af700  1 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).wakeup
2016-06-20 16:14:45.890072 7f02dc9b6700 20 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).process_events event_wq process is 8 mask is 1
2016-06-20 16:14:45.890066 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4bf920 pending 1
2016-06-20 16:14:45.890081 7f02d91af700  1 -- xxx:6801/26647 >> xxx:0/4223556835 conn(0x7f02ed6f9a00 sd=30 :6801 s=STATE_OPEN pgs=268 cs=1 l=0).mark_down started.
2016-06-20 16:14:45.890087 7f02d91af700  1 -- xxx:6801/26647 >> xxx:0/4223556835 conn(0x7f02ed6f9a00 sd=30 :6801 s=STATE_OPEN pgs=268 cs=1 l=0)._stop
2016-06-20 16:14:45.890093 7f02d91af700 20 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event started fd=30 mask=3 original mask is 1
2016-06-20 16:14:45.890096 7f02d91af700 20 EpollDriver.del_event del event fd=30 cur_mask=1 delmask=3 to 7
2016-06-20 16:14:45.890099 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event end fd=30 mask=3 original mask is 0
2016-06-20 16:14:45.890112 7f02d91af700 10 -- xxx:6801/26647 >> xxx:0/4223556835 conn(0x7f02ed6f9a00 sd=30 :6801 s=STATE_OPEN pgs=268 cs=1 l=0).discard_out_queue started
2016-06-20 16:14:45.890128 7f02d91af700 10  WorkerPool -- release_worker
2016-06-20 16:14:45.890146 7f02d91af700 10  WorkerPool -- release_worker found worker, releasing

2016-06-20 16:14:45.890174 7f02d91af700  1 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).wakeup
2016-06-20 16:14:45.890181 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c8560 pending 1

2016-06-20 16:14:45.890196 7f02d91af700  5 -- xxx:6801/26647 mark_down_all delete 0x7f02ed6f9a00
2016-06-20 16:14:45.890204 7f02d91af700 10  WorkerPool -- barrier started.
2016-06-20 16:14:45.890207 7f02d91af700  1 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).wakeup
2016-06-20 16:14:45.890211 7f02d91af700 10 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c82b0 pending 1
2016-06-20 16:14:45.890221 7f02dd1b7700 20 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).process_events event_wq process is 5 mask is 1
2016-06-20 16:14:45.890230 7f02dd1b7700 20  Worker -- entry calling event process
2016-06-20 16:14:45.890232 7f02dd1b7700 10 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).process_events wait second 30 usec 0

2016-06-20 16:14:45.890214 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c8500 pending 2

2016-06-20 16:14:45.890259 7f02d91af700  1 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).wakeup
2016-06-20 16:14:45.890264 7f02d91af700 10 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c8160 pending 1
2016-06-20 16:14:45.890267 7f02d91af700 10  WorkerPool -- barrier wait for 2 barrier
2016-06-20 16:14:45.890274 7f02dc1b5700 20 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).process_events event_wq process is 11 mask is 1
2016-06-20 16:14:45.890299 7f02dc1b5700 20  Worker -- entry calling event process
2016-06-20 16:14:45.890301 7f02dc1b5700 10 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).process_events wait second 30 usec 0
2016-06-20 16:15:00.916638 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-06-20 16:15:05.916812 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-06-20 16:15:10.916986 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-06-20 16:15:15.917192 7f02dd1b7700 20  Worker -- entry calling event process
2016-06-20 16:15:15.917192 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-06-20 16:15:15.917208 7f02dd1b7700 10 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).process_events wait second 30 usec 0
2016-06-20 16:15:15.920320 7f02dc1b5700 20  Worker -- entry calling event process
2016-06-20 16:15:15.920328 7f02dc1b5700 10 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).process_events wait second 30 usec 0
2016-06-20 16:15:20.917282 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-06-20 16:15:25.917358 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
2016-06-20 16:15:30.917440 7f02de1b9700  1 heartbeat_map is_healthy 'MDSRank' had timed out after 15

Related issues (1 total: 0 open, 1 closed)

Copied to CephFS - Backport #16620: jewel: Fix shutting down mds timed-out due to deadlock (Resolved, assignee: Abhishek Varshney)
#2

Updated by Zhi Zhang almost 8 years ago

  • Project changed from Ceph to CephFS
  • Subject changed from "msg/async: fix trying to wake up worker event which is already waked up when shuting down msgr" to "Fix shutting down mds timed-out due to deadlock"
  • Category changed from msgr to 47
#3

Updated by John Spray almost 8 years ago

  • Status changed from New to Pending Backport
  • Backport set to jewel
#4

Updated by Loïc Dachary almost 8 years ago

  • Copied to Backport #16620: jewel: Fix shutting down mds timed-out due to deadlock added
#5

Updated by Greg Farnum almost 8 years ago

  • Category changed from 47 to Correctness/Safety
  • Component(FS) MDS added
#6

Updated by Loïc Dachary over 7 years ago

  • Status changed from Pending Backport to Resolved