Bug #17913
librbd io deadlock after host lost network connectivity (closed)
Status:
Rejected
Priority:
Normal
Assignee:
-
Target version:
-
% Done:
0%
Source:
Community (user)
Tags:
Backport:
Regression:
Yes
Severity:
3 - minor
Reviewed:
Description
During a recent network outage we found one VM with a totally deadlocked attached rbd volume. The full ceph-client.log is attached, but the interesting part is below.
The qemu-kvm process lost network connectivity at 2016-11-08 09:06:21.834225. Then, a few minutes later:
2016-11-08 09:14:06.136622 7f6a8f2dc700 0 -- 10.16.105.66:0/1983621456 >> 188.184.36.166:6789/0 pipe(0x7f79846ef000 sd=1034 :0 s=1 pgs=0 cs=0 l=1 c=0x7f797751f440).fault
2016-11-08 09:14:31.375413 7f6a54523700 -1 librbd::ImageWatcher: 0x7f7983aa4ac0 image watch failed: 140159521141248, (107) Transport endpoint is not connected
2016-11-08 09:14:39.567941 7f6a93dfd700 0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481051 rbd_data.6ae581d1314c088.00000000000156ee [set-alloc-hint object_size 4194304 write_size 4194304,write 4153344~40960] 4.47b99b35 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.201:6897/3581993, failed lossy con, dropping message 0x7f79847f0300
2016-11-08 09:14:42.024813 7f6a93dfd700 0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481168 rbd_data.6ae581d1314c088.0000000000038ee5 [set-alloc-hint object_size 4194304 write_size 4194304,write 507904~4096] 4.91a10f63 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.203:6817/3677267, failed lossy con, dropping message 0x7f79766e5700
2016-11-08 09:14:51.332193 7f6a93dfd700 0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481353 rbd_data.6ae581d1314c088.0000000000027804 [set-alloc-hint object_size 4194304 write_size 4194304,write 3637248~4096] 4.85873218 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.222:6866/3426863, failed lossy con, dropping message 0x7f7977250000
2016-11-08 09:14:52.344017 7f6a93dfd700 0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481432 rbd_data.6ae581d1314c088.0000000000020030 [set-alloc-hint object_size 4194304 write_size 4194304,write 1597440~45056] 4.3418d482 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.229:6928/3546026, failed lossy con, dropping message 0x7f7977dbb400
2016-11-08 09:15:08.629234 7f6a94dff700 -1 librbd::ImageWatcher: 0x7f796f2b4fc0 image watch failed: 140159514277632, (107) Transport endpoint is not connected
2016-11-08 09:15:08.641940 7f6a93dfd700 0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481923 rbd_data.6ae581d1314c088.0000000000038d3e [set-alloc-hint object_size 4194304 write_size 4194304,write 983040~4096] 4.56cc8de3 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.231:6940/4190058, failed lossy con, dropping message 0x7f7974508f00
2016-11-08 09:15:08.643951 7f6a93dfd700 0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481925 rbd_data.6ae581d1314c088.0000000000038e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 827392~8192] 4.fa7a45b3 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.203:6910/3679986, failed lossy con, dropping message 0x7f7974508f00
2016-11-08 09:16:23.371172 7f795232f700 1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f6a93dfd700' had timed out after 60
The client was running librbd 0.94.9, as was the Ceph cluster.
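When a librbd client hangs like this, the standard way to inspect it is through the client's Ceph admin socket. A minimal sketch, assuming the hypervisor's ceph.conf has an admin socket enabled for client processes (the socket path template and log path here are illustrative, not taken from this report):

```ini
# ceph.conf on the hypervisor -- illustrative sketch.
# Enables a per-process admin socket so a hung librbd client
# (e.g. inside qemu-kvm) can be queried while it is stuck.
[client]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
    log file = /var/log/ceph/ceph-client.log
```

With that in place, querying the socket of the stuck qemu-kvm process with `ceph --admin-daemon /var/run/ceph/<socket>.asok objecter_requests` lists the in-flight OSD operations, which would show whether the writes dropped on the "failed lossy con" above were ever resent after the network came back.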