Bug #25150

closed

Failed to write data to rbd device even after removal from blacklist

Added by aiai li over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
rbd
Target version:
-
% Done:
0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

An rbd image is mapped on multiple nodes that act as iSCSI targets, and the iSCSI initiator uses multipath with path_grouping_policy set to failover.
The iSCSI initiator sends data to the target. When the active path breaks because of a network problem, I/O switches to the alternate path and the original path is added to the blacklist. After the path is restored and removed from the blacklist, the rbd device on that path still fails to write data.

Actions #1

Updated by Jason Dillaman over 5 years ago

  • Status changed from New to Need More Info

@aiai li: this needs more information to assess. How did it fail to write data? Did you failback the path to the original target? What was the error? What did the tcmu-runner logs show?

Actions #2

Updated by aiai li over 5 years ago

How did it fail to write data?
We disconnected communication between the iSCSI gateway node and the Ceph cluster, so writes failed and I/O switched to another path.

Did you failback the path to the original target?
Yes, we re-established the connection between the iSCSI gateway node and the Ceph cluster. Mapping a new rbd image on this node and writing data to it succeeds. But on the iSCSI initiator, the recovered path's status is shown as "enabled" and "failed ready running".
dmesg:
rbd: rbd0:write 400000 at f9ff978000 result -108
blk_update_request: I/O error,dev rbd0,sector 2097138624
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae1f00,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae1a00,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae0b00,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae1400,err:-108
test_bit(BIO_UPTODATE)failed for bio:ffff8804e8ae0600,err:-108
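
For readers decoding the dmesg output above: the "result -108" and "err:-108" values are negative errno codes. The following is a minimal userspace sketch, not kernel code, that maps such a result to a symbolic name. The helper name result_name is hypothetical, and the note that the kernel Ceph client aliases EBLACKLISTED to ESHUTDOWN is an assumption based on the kernel headers as I understand them.

```c
#include <assert.h>
#include <errno.h>

/*
 * Map a negative kernel-style result code (as printed in dmesg) to a
 * symbolic errno name. Only the values discussed in this ticket are
 * covered. result_name is a hypothetical helper, not a kernel function.
 */
static const char *result_name(int result)
{
    assert(result <= 0);    /* kernel result codes are 0 or -errno */

    switch (-result) {
    case 0:
        return "OK";
    case ENOENT:
        return "ENOENT";
    case ESHUTDOWN:
        /* Assumption: the kernel Ceph client defines EBLACKLISTED as ESHUTDOWN. */
        return "ESHUTDOWN";
    default:
        return "UNKNOWN";
    }
}
```

On Linux with the generic errno numbering, ESHUTDOWN is 108, so result_name(-108) would return "ESHUTDOWN", which is consistent with the device having been marked blacklisted.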

Nodes A and B are iSCSI gateways. After node A fails, its watch is deleted and it is added to the blacklist.
After node A is restored and removed from the blacklist, its status on the iSCSI initiator is still failed, and writes fail. We found that once a node has been blacklisted, it can no longer resume its task queue.

rbd.c

    ret = __rbd_register_watch(rbd_dev);
    if (ret) {
        rbd_warn(rbd_dev, "failed to reregister watch: %d", ret);
        if (ret == -EBLACKLISTED || ret == -ENOENT) {
            set_bit(RBD_DEV_FLAG_BLACKLISTED, &rbd_dev->flags);
            wake_requests(rbd_dev, true);
        } else {
            queue_delayed_work(rbd_dev->task_wq,
                       &rbd_dev->watch_dwork,
                       RBD_RETRY_DELAY);
        }
        mutex_unlock(&rbd_dev->watch_mutex);
        return;
    }

Why not resume the task queue when ret is -EBLACKLISTED or -ENOENT?

Actions #3

Updated by Jason Dillaman over 5 years ago

  • Project changed from rbd to Linux kernel client

Clearing out the ticket backlog and I noticed this should be against krbd.

Actions #4

Updated by Ilya Dryomov over 5 years ago

  • Status changed from Need More Info to Closed

Currently, this is the intended mode of operation. Once the rbd client is blacklisted, there is no going back -- the affected device(s) should be remapped. This is what Mike's multipath helper was intended to do.
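
The behaviour described here can be sketched as a small userspace model. It is a simplified illustration under stated assumptions, not the krbd implementation: the model_* names, the watch_action enum, and MODEL_EBLACKLISTED are all hypothetical stand-ins, and only the branch structure follows the rbd.c snippet quoted in this ticket.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in; assumed to mirror the kernel's EBLACKLISTED alias. */
#define MODEL_EBLACKLISTED ESHUTDOWN

enum watch_action { WATCH_OK, WATCH_RETRY_LATER, WATCH_GIVE_UP };

/* Models one mapped device; the flag is never cleared once set. */
struct model_dev {
    bool blacklisted;
};

/* Decide what to do with the result of a watch re-registration attempt. */
static enum watch_action model_reregister_watch(struct model_dev *dev, int ret)
{
    assert(dev != NULL);

    if (ret == 0)
        return WATCH_OK;
    if (ret == -MODEL_EBLACKLISTED || ret == -ENOENT) {
        dev->blacklisted = true;    /* sticky: no retry is scheduled */
        return WATCH_GIVE_UP;
    }
    return WATCH_RETRY_LATER;       /* transient error: try again after a delay */
}

/* Writes on a blacklisted device fail until the device is remapped. */
static int model_submit_write(const struct model_dev *dev)
{
    assert(dev != NULL);
    return dev->blacklisted ? -MODEL_EBLACKLISTED : 0;
}
```

In this model, removing the client from the OSD blacklist changes nothing on the device side, because nothing ever clears the flag. Only a fresh struct model_dev, standing in for unmapping and remapping the device, accepts writes again.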

Doing iSCSI multipathing from scratch by hand is very error prone and can result in data corruption. If you want to access Ceph through iSCSI, check out http://docs.ceph.com/docs/master/rbd/iscsi-overview/.

Actions #5

Updated by Ilya Dryomov over 5 years ago

  • Category set to rbd