Actions
Bug #16396
closedFix shutting down mds timed-out due to deadlock
Status:
Resolved
Priority:
Normal
Assignee:
-
Category:
Correctness/Safety
Target version:
-
% Done:
0%
Source:
Community (dev)
Tags:
Backport:
jewel
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
MDS
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):
Description
This issue was found in jewel when restarting/stopping mds. It took long time for mds to completely stop until mds thread timed out. This is caused by when shuting down async msgr, it will try to dispatch external barrier event to a worker and wake it up, but this worker has been processed clean_handler event before, so its event status is already waked up. Hence this barrier event can't be processed further and msgr thread will be stuck at waiting for barrier count.
2016-06-20 16:14:45.889943 7f02d91af700 1 mds.0.70 shutdown: messenger 2016-06-20 16:14:45.889948 7f02d91af700 10 -- xxx:6801/26647 shutdown xxx:6801/26647 2016-06-20 16:14:45.889963 7f02d91af700 10 Processor -- stop 2016-06-20 16:14:45.889965 7f02d91af700 20 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event started fd=13 mask=1 original mask is 1 2016-06-20 16:14:45.889977 7f02d91af700 20 EpollDriver.del_event del event fd=13 cur_mask=1 delmask=1 to 7 2016-06-20 16:14:45.889996 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event end fd=13 mask=1 original mask is 0 2016-06-20 16:14:45.890022 7f02d91af700 1 -- xxx:6801/26647 mark_down_all 2016-06-20 16:14:45.890028 7f02d91af700 5 -- xxx:6801/26647 mark_down_all mark down xxx:0/4223556835 0x7f02ed6f9a00 2016-06-20 16:14:45.890048 7f02d91af700 1 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).wakeup 2016-06-20 16:14:45.890072 7f02dc9b6700 20 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).process_events event_wq process is 8 mask is 1 2016-06-20 16:14:45.890066 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4bf920 pending 1 2016-06-20 16:14:45.890081 7f02d91af700 1 -- xxx:6801/26647 >> xxx:0/4223556835 conn(0x7f02ed6f9a00 sd=30 :6801 s=STATE_OPEN pgs=268 cs=1 l=0).mark_down started. 2016-06-20 16:14:45.890087 7f02d91af700 1 -- xxx:6801/26647 >> xxx:0/4223556835 conn(0x7f02ed6f9a00 sd=30 :6801 s=STATE_OPEN pgs=268 cs=1 l=0)._stop 2016-06-20 16:14:45.890093 7f02d91af700 20 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event started fd=30 mask=3 original mask is 1 2016-06-20 16:14:45.890096 7f02d91af700 20 EpollDriver.del_event del event fd=30 cur_mask=1 delmask=3 to 7 2016-06-20 16:14:45.890099 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).delete_file_event delete event end fd=30 mask=3 original mask is 0 2016-06-20 16:14:45.890112 7f02d91af700 10 -- xxx:6801/26647 >> xxx:0/4223556835 conn(0x7f02ed6f9a00 sd=30 :6801 s=STATE_OPEN pgs=268 cs=1 l=0).discard_out_queue started 2016-06-20 16:14:45.890128 7f02d91af700 10 WorkerPool -- release_worker 2016-06-20 16:14:45.890146 7f02d91af700 10 WorkerPool -- release_worker found worker, releasing 2016-06-20 16:14:45.890174 7f02d91af700 1 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).wakeup 2016-06-20 16:14:45.890181 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c8560 pending 1 2016-06-20 16:14:45.890196 7f02d91af700 5 -- xxx:6801/26647 mark_down_all delete 0x7f02ed6f9a00 2016-06-20 16:14:45.890204 7f02d91af700 10 WorkerPool -- barrier started. 2016-06-20 16:14:45.890207 7f02d91af700 1 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).wakeup 2016-06-20 16:14:45.890211 7f02d91af700 10 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c82b0 pending 1 2016-06-20 16:14:45.890221 7f02dd1b7700 20 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).process_events event_wq process is 5 mask is 1 2016-06-20 16:14:45.890230 7f02dd1b7700 20 Worker -- entry calling event process 2016-06-20 16:14:45.890232 7f02dd1b7700 10 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).process_events wait second 30 usec 0 2016-06-20 16:14:45.890214 7f02d91af700 10 Event(0x7f02ed588548 owner=139650267834112 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c8500 pending 2 2016-06-20 16:14:45.890259 7f02d91af700 1 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).wakeup 2016-06-20 16:14:45.890264 7f02d91af700 10 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).dispatch_event_external 0x7f02ed4c8160 pending 1 2016-06-20 16:14:45.890267 7f02d91af700 10 WorkerPool -- barrier wait for 2 barrier 2016-06-20 16:14:45.890274 7f02dc1b5700 20 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).process_events event_wq process is 11 mask is 1 2016-06-20 16:14:45.890299 7f02dc1b5700 20 Worker -- entry calling event process 2016-06-20 16:14:45.890301 7f02dc1b5700 10 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).process_events wait second 30 usec 0 2016-06-20 16:15:00.916638 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2016-06-20 16:15:05.916812 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2016-06-20 16:15:10.916986 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2016-06-20 16:15:15.917192 7f02dd1b7700 20 Worker -- entry calling event process 2016-06-20 16:15:15.917192 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2016-06-20 16:15:15.917208 7f02dd1b7700 10 Event(0x7f02ed588048 owner=139650276226816 nevent=5000 time_id=1).process_events wait second 30 usec 0 2016-06-20 16:15:15.920320 7f02dc1b5700 20 Worker -- entry calling event process 2016-06-20 16:15:15.920328 7f02dc1b5700 10 Event(0x7f02ed5882c8 owner=139650259441408 nevent=5000 time_id=1).process_events wait second 30 usec 0 2016-06-20 16:15:20.917282 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2016-06-20 16:15:25.917358 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15 2016-06-20 16:15:30.917440 7f02de1b9700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
Updated by Zhi Zhang almost 8 years ago
- Project changed from Ceph to CephFS
- Subject changed from msg/async: fix trying to wake up worker event which is already waked up when shuting down msgr to Fix shutting down mds timed-out due to deadlock
- Category changed from msgr to 47
Updated by John Spray almost 8 years ago
- Status changed from New to Pending Backport
- Backport set to jewel
Updated by Loïc Dachary almost 8 years ago
- Copied to Backport #16620: jewel: Fix shutting down mds timed-out due to deadlock added
Updated by Greg Farnum almost 8 years ago
- Category changed from 47 to Correctness/Safety
- Component(FS) MDS added
Updated by Loïc Dachary over 7 years ago
- Status changed from Pending Backport to Resolved
Actions