Bug #17913

librbd io deadlock after host lost network connectivity

Added by Dan van der Ster over 7 years ago. Updated over 6 years ago.

Status: Rejected
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source: Community (user)
Tags: -
Backport: -
Regression: Yes
Severity: 3 - minor
Reviewed: -
Affected Versions: -
ceph-qa-suite: rbd
Pull request ID: -
Crash signature (v1): -
Crash signature (v2): -

Description

During a recent network outage we found one VM whose attached rbd volume was completely deadlocked. The full ceph-client.log is attached; the interesting part is below.

The qemu-kvm process lost network connectivity at 2016-11-08 09:06:21.834225. A few minutes later:

2016-11-08 09:14:06.136622 7f6a8f2dc700  0 -- 10.16.105.66:0/1983621456 >> 188.184.36.166:6789/0 pipe(0x7f79846ef000 sd=1034 :0 s=1 pgs=0 cs=0 l=1 c=0x7f797751f440).fault
2016-11-08 09:14:31.375413 7f6a54523700 -1 librbd::ImageWatcher: 0x7f7983aa4ac0 image watch failed: 140159521141248, (107) Transport endpoint is not connected
2016-11-08 09:14:39.567941 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481051 rbd_data.6ae581d1314c088.00000000000156ee [set-alloc-hint object_size 4194304 write_size 4194304,write 4153344~40960] 4.47b99b35 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.201:6897/3581993, failed lossy con, dropping message 0x7f79847f0300
2016-11-08 09:14:42.024813 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481168 rbd_data.6ae581d1314c088.0000000000038ee5 [set-alloc-hint object_size 4194304 write_size 4194304,write 507904~4096] 4.91a10f63 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.203:6817/3677267, failed lossy con, dropping message 0x7f79766e5700
2016-11-08 09:14:51.332193 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481353 rbd_data.6ae581d1314c088.0000000000027804 [set-alloc-hint object_size 4194304 write_size 4194304,write 3637248~4096] 4.85873218 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.222:6866/3426863, failed lossy con, dropping message 0x7f7977250000
2016-11-08 09:14:52.344017 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481432 rbd_data.6ae581d1314c088.0000000000020030 [set-alloc-hint object_size 4194304 write_size 4194304,write 1597440~45056] 4.3418d482 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.229:6928/3546026, failed lossy con, dropping message 0x7f7977dbb400
2016-11-08 09:15:08.629234 7f6a94dff700 -1 librbd::ImageWatcher: 0x7f796f2b4fc0 image watch failed: 140159514277632, (107) Transport endpoint is not connected
2016-11-08 09:15:08.641940 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481923 rbd_data.6ae581d1314c088.0000000000038d3e [set-alloc-hint object_size 4194304 write_size 4194304,write 983040~4096] 4.56cc8de3 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.231:6940/4190058, failed lossy con, dropping message 0x7f7974508f00
2016-11-08 09:15:08.643951 7f6a93dfd700  0 -- 10.16.105.66:0/1983621456 submit_message osd_op(client.118493800.0:51481925 rbd_data.6ae581d1314c088.0000000000038e01 [set-alloc-hint object_size 4194304 write_size 4194304,write 827392~8192] 4.fa7a45b3 ack+ondisk+write+known_if_redirected e549025) v5 remote, 128.142.161.203:6910/3679986, failed lossy con, dropping message 0x7f7974508f00
2016-11-08 09:16:23.371172 7f795232f700  1 heartbeat_map is_healthy 'librbd::thread_pool thread 0x7f6a93dfd700' had timed out after 60

The client was running librbd 0.94.9, as was the Ceph cluster.
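
For anyone hitting this in production: since the hang manifests as a write that never returns, a client-side watchdog can at least detect the wedge. A rough sketch using the python-rados and python-rbd bindings follows; the pool name, image name, and timeout values are placeholders rather than anything from this report, and rados_osd_op_timeout is the client config option that makes librados fail blocked ops instead of retrying forever (it defaults to 0, i.e. wait indefinitely).

#!/usr/bin/env python
# Sketch only: detect a wedged librbd write with a watchdog thread.
# Pool/image names and timeouts are placeholder assumptions.
import sys
import threading

import rados
import rbd

POOL = "rbd"          # placeholder pool name
IMAGE = "test-image"  # placeholder image name
WATCHDOG_SECS = 120   # how long to wait before declaring the I/O stuck

def probe_write(ioctx):
    # One small write; in the deadlock above this call would block forever
    # once the messenger starts dropping messages on the failed lossy con.
    image = rbd.Image(ioctx, IMAGE)
    try:
        image.write(b"\0" * 4096, 0)
    finally:
        image.close()

def main():
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    # rados_osd_op_timeout (default 0 = wait forever) makes librados fail
    # ops that do not complete in time instead of retrying indefinitely.
    cluster.conf_set("rados_osd_op_timeout", "60")
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)

    worker = threading.Thread(target=probe_write, args=(ioctx,))
    worker.daemon = True  # let the process exit even if the write is stuck
    worker.start()
    worker.join(WATCHDOG_SECS)

    if worker.is_alive():
        print("rbd write still blocked after %ds: client looks wedged" % WATCHDOG_SECS)
        sys.exit(1)  # skip cleanup; the blocked thread will never release it

    ioctx.close()
    cluster.shutdown()
    print("rbd write completed")

if __name__ == "__main__":
    main()

Note that rados_osd_op_timeout is a workaround rather than a fix: it turns the indefinite block into an I/O error the guest can at least observe.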


Files

ceph-client.1669.log.gz (52.5 KB) - Dan van der Ster, 11/15/2016 01:44 PM
ceph.log.gz (138 KB) - Christian Theune, 03/29/2017 02:52 PM
thread-apply-bt-all.txt (58.4 KB) - Christian Theune, 03/29/2017 03:21 PM