How did it fail to write data?
We disconnected communication between the iSCSI gateway node and the Ceph cluster, so the write failed and the initiator switched to another path.
Did you failback the path to the original target?
Yes, we re-established the connection between the iSCSI gateway node and the Ceph cluster. Mapping a new rbd on this node and writing data to it succeeds. But on the iSCSI initiator, the recovered path's status is still shown as "enabled" and "failed ready running".
dmesg:
rbd: rbd0:write 400000 at f9ff978000 result -108
blk_update_request: I/O error,dev rbd0,sector 2097138624
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae1f00,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae1a00,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae0b00,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae1400,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae0600,err:-108
Nodes A and B are iSCSI gateways. After node A fails, its watch is deleted and the node is added to the blacklist.
After node A is restored and removed from the blacklist, the path status on the iSCSI initiator is still failed, and writes still fail. We found that once a node has been blacklisted, its task queue is never resumed.
rbd.c
    ret = __rbd_register_watch(rbd_dev);
    if (ret) {
        rbd_warn(rbd_dev, "failed to reregister watch: %d", ret);
        if (ret == -EBLACKLISTED || ret == -ENOENT) {
            set_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags);
            wake_requests(rbd_dev, true);
        } else {
            queue_delayed_work(rbd_dev->task_wq,
                               &rbd_dev->watch_dwork,
                               RBD_RETRY_DELAY);
        }
        mutex_unlock(&rbd_dev->watch_mutex);
        return;
    }
Why not resume the task queue when ret is -EBLACKLISTED or -ENOENT?