
Bug #63317

[v0.0.5] OMAP does not have entry of subsystem created during nvmeof service restart

Added by Rahul Lepakshi 4 months ago.

Status: New
Priority: Normal
Assignee: -
Target version: -
% Done: 0%
Source:
Tags:
Backport:
Regression: No
Severity: 1 - critical
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

The main issue, hit during an nvmeof service restart, is:

python3 -m control.cli: error: create_subsystem failed: code=StatusCode.UNKNOWN message=Exception calling application: [errno 125] Failed to operate write op for oid nvmeof.state

Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: INFO:control.grpc:Received request to create subsystem nqn.2016-06.io.spdk:cnode2, ana reporting: False  
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: INFO:control.grpc:create_subsystem nqn.2016-06.io.spdk:cnode2: True
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: ERROR:control.state:Unable to add key to omap: [errno 125] Failed to operate write op for oid nvmeof.state. Exiting!
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: ERROR:control.grpc:Error persisting create_subsystem nqn.2016-06.io.spdk:cnode2: [errno 125] Failed to operate write op for oid nvmeof.state
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: ERROR:grpc._server:Exception calling application: [errno 125] Failed to operate write op for oid nvmeof.state
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: Traceback (most recent call last):
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "/src/__pypackages__/3.9/lib/grpc/_server.py", line 494, in _call_behavior
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:     response_or_iterator = behavior(argument, context)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "/src/control/grpc.py", line 301, in create_subsystem
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:     return self.create_subsystem_safe(request, context)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "/src/control/grpc.py", line 290, in create_subsystem_safe
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:     self.gateway_state.add_subsystem(request.subsystem_nqn,
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "/src/control/state.py", line 367, in add_subsystem
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:     self.omap.add_subsystem(subsystem_nqn, val)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "/src/control/state.py", line 70, in add_subsystem
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:     self._add_key(key, val)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "/src/control/state.py", line 250, in _add_key
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:     self.ioctx.operate_write_op(write_op, self.omap_name)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]:   File "rados.pyx", line 3654, in rados.Ioctx.operate_write_op
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: rados.OSError: [errno 125] Failed to operate write op for oid nvmeof.state
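For context, errno 125 is ECANCELED on Linux. librados typically returns it from operate_write_op when a guard/compare step inside the compound op (for example an omap version check) no longer matches, rather than on a real I/O failure. A minimal sketch decoding the code (assumes a Linux errno table):

```python
import errno

# [errno 125] from rados.Ioctx.operate_write_op is ECANCELED on Linux.
# librados commonly returns it when an assertion or compare step inside a
# compound write op fails (e.g. an omap version guard mismatch), not on
# an actual disk/network error.
code = 125
name = errno.errorcode[code]
print(name)  # -> 'ECANCELED' on Linux
```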

https://github.com/ceph/ceph-nvmeof/pull/258 prevents corruption on any gRPC call, and I am not seeing corruption here. However, during this restart phase, adding a GW entity (in my case create_subsystem) was retried multiple times after the command failed. Eventually it was acknowledged and subsystem cnode2 was created, but the corresponding entry was never written to the omap.
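One way to avoid the SPDK/omap divergence described above is to roll back the SPDK-side creation when persisting to the omap fails, so the two stores cannot drift apart across a restart. A hypothetical, in-memory sketch — SpdkStub, OmapStub, and OmapPersistError are invented stand-ins for illustration, not the gateway's real classes:

```python
class OmapPersistError(OSError):
    """Stand-in for the rados.OSError ([errno 125]) seen in the logs."""

class SpdkStub:
    """Invented stand-in for the SPDK JSON-RPC side."""
    def __init__(self):
        self.subsystems = set()
    def create_subsystem(self, nqn):
        self.subsystems.add(nqn)
    def delete_subsystem(self, nqn):
        self.subsystems.discard(nqn)

class OmapStub:
    """Invented stand-in for the omap-backed gateway state."""
    def __init__(self, fail=False):
        self.keys = {}
        self.fail = fail
    def add_subsystem(self, nqn, val):
        if self.fail:
            raise OmapPersistError("Failed to operate write op for oid nvmeof.state")
        self.keys[f"subsystem_{nqn}"] = val

def create_subsystem_safe(spdk, omap, nqn, val):
    """Create in SPDK first, then persist; undo the SPDK side if persisting fails."""
    spdk.create_subsystem(nqn)
    try:
        omap.add_subsystem(nqn, val)
    except OmapPersistError:
        spdk.delete_subsystem(nqn)  # keep SPDK and omap consistent
        raise

spdk, omap = SpdkStub(), OmapStub(fail=True)
try:
    create_subsystem_safe(spdk, omap, "nqn.2016-06.io.spdk:cnode2", "{}")
except OmapPersistError:
    pass
# Neither store keeps cnode2, so a restart cannot show a subsystem the omap lacks.
print(spdk.subsystems, omap.keys)  # -> set() {}
```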

The user runs the following:

[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 create_subsystem  --subnqn nqn.2016-06.io.spdk:cnode2 --serial 2 --max-namespaces 500
usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
                              [--server-port SERVER_PORT]
                              [--client-key CLIENT_KEY]
                              [--client-cert CLIENT_CERT]
                              [--server-cert SERVER_CERT]
                              {create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
                              ...
python3 -m control.cli: error: create_subsystem failed: code=StatusCode.UNKNOWN message=Exception calling application: [errno 125] Failed to operate write op for oid nvmeof.state
[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 create_subsystem  --subnqn nqn.2016-06.io.spdk:cnode2 --serial 2 --max-namespaces 500
usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
                              [--server-port SERVER_PORT]
                              [--client-key CLIENT_KEY]
                              [--client-cert CLIENT_CERT]
                              [--server-cert SERVER_CERT]
                              {create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
                              ...
python3 -m control.cli: error: create_subsystem failed: code=StatusCode.INTERNAL message=request:
{
  "nqn": "nqn.2016-06.io.spdk:cnode2",
  "serial_number": "2",
  "model_number": "SPDK bdev Controller",
  "max_namespaces": 500,
  "min_cntlid": 1,
  "max_cntlid": 65519,
  "method": "nvmf_create_subsystem",
  "req_id": 189
}
Got JSON-RPC error response
response:
{
  "code": -32603,
  "message": "Unable to create subsystem nqn.2016-06.io.spdk:cnode2" 
}

Somewhere between the 1st and 2nd create_subsystem requests, subsystem nqn.2016-06.io.spdk:cnode2 was created, as journalctl reports it already exists - search for "already exists" in http://magna002.ceph.redhat.com/cephci-jenkins/nvmeof_5_upstream.log

get_subsystems output:

[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2016-06.io.spdk:cnode2",
        "subtype": "NVMe",
        "listen_addresses": [],
        "allow_any_host": false,
        "hosts": [],
        "serial_number": "2",
        "model_number": "SPDK bdev Controller",
        "max_namespaces": 500,
        "min_cntlid": 1,
        "max_cntlid": 65519,
        "namespaces": []
    },
    {
        "nqn": "nqn.2016-06.io.spdk:cnode1",
        "subtype": "NVMe",
        "listen_addresses": [
            {
                "transport": "TCP",
                "trtype": "TCP",
                "adrfam": "IPv4",
                "traddr": "10.0.209.152",
                "trsvcid": "5001" 
            }
        ],
        "allow_any_host": true,
        "hosts": [],
        "serial_number": "1",
        "model_number": "SPDK bdev Controller",
        "max_namespaces": 500,
        "min_cntlid": 1,
        "max_cntlid": 65519,
        "namespaces": [
            {
                "nsid": 1,
                "bdev_name": "UWI1-bdev1",
                "name": "UWI1-bdev1",
                "nguid": "65228668C41842058561AFBBC15C9C96",
                "uuid": "65228668-c418-4205-8561-afbbc15c9c96" 
            },
            {
                "nsid": 2,
                "bdev_name": "UWI1-bdev2",
                "name": "UWI1-bdev2",
                "nguid": "BBF19AD5F17240F9AA7A42B33376D35A",
                "uuid": "bbf19ad5-f172-40f9-aa7a-42b33376d35a" 

But the omap does not have an entry for it - http://magna002.ceph.redhat.com/cephci-jenkins/nvmeof_5_restart_omap.log

namespace_nqn.2016-06.io.spdk:cnode1_98
value (95 bytes) :
00000000  7b 0a 20 20 22 73 75 62  73 79 73 74 65 6d 5f 6e  |{.  "subsystem_n|
00000010  71 6e 22 3a 20 22 6e 71  6e 2e 32 30 31 36 2d 30  |qn": "nqn.2016-0|
00000020  36 2e 69 6f 2e 73 70 64  6b 3a 63 6e 6f 64 65 31  |6.io.spdk:cnode1|
00000030  22 2c 0a 20 20 22 62 64  65 76 5f 6e 61 6d 65 22  |",.  "bdev_name"|
00000040  3a 20 22 55 57 49 31 2d  62 64 65 76 39 38 22 2c  |: "UWI1-bdev98",|
00000050  0a 20 20 22 6e 73 69 64  22 3a 20 39 38 0a 7d     |.  "nsid": 98.}|
0000005f

namespace_nqn.2016-06.io.spdk:cnode1_99
value (95 bytes) :
00000000  7b 0a 20 20 22 73 75 62  73 79 73 74 65 6d 5f 6e  |{.  "subsystem_n|
00000010  71 6e 22 3a 20 22 6e 71  6e 2e 32 30 31 36 2d 30  |qn": "nqn.2016-0|
00000020  36 2e 69 6f 2e 73 70 64  6b 3a 63 6e 6f 64 65 31  |6.io.spdk:cnode1|
00000030  22 2c 0a 20 20 22 62 64  65 76 5f 6e 61 6d 65 22  |",.  "bdev_name"|
00000040  3a 20 22 55 57 49 31 2d  62 64 65 76 39 39 22 2c  |: "UWI1-bdev99",|
00000050  0a 20 20 22 6e 73 69 64  22 3a 20 39 39 0a 7d     |.  "nsid": 99.}|
0000005f

omap_version
value (3 bytes) :
00000000  35 31 34                                          |514|
00000003

subsystem_nqn.2016-06.io.spdk:cnode1
value (100 bytes) :
00000000  7b 0a 20 20 22 73 75 62  73 79 73 74 65 6d 5f 6e  |{.  "subsystem_n|
00000010  71 6e 22 3a 20 22 6e 71  6e 2e 32 30 31 36 2d 30  |qn": "nqn.2016-0|
00000020  36 2e 69 6f 2e 73 70 64  6b 3a 63 6e 6f 64 65 31  |6.io.spdk:cnode1|
00000030  22 2c 0a 20 20 22 73 65  72 69 61 6c 5f 6e 75 6d  |",.  "serial_num|
00000040  62 65 72 22 3a 20 22 31  22 2c 0a 20 20 22 6d 61  |ber": "1",.  "ma|
00000050  78 5f 6e 61 6d 65 73 70  61 63 65 73 22 3a 20 35  |x_namespaces": 5|
00000060  30 30 0a 7d                                       |00.}|
00000064
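The omap values above are plain JSON blobs, one per gateway entity. Decoding the subsystem_nqn.2016-06.io.spdk:cnode1 value transcribed from the hexdump confirms the cnode1 record is intact, while no subsystem_ key for cnode2 exists at all:

```python
import json

# Value of key subsystem_nqn.2016-06.io.spdk:cnode1, transcribed from the
# hexdump above (100 bytes, matching the "value (100 bytes)" line).
raw = (b'{\n  "subsystem_nqn": "nqn.2016-06.io.spdk:cnode1",\n'
       b'  "serial_number": "1",\n  "max_namespaces": 500\n}')
assert len(raw) == 100

doc = json.loads(raw)
print(doc["subsystem_nqn"], doc["max_namespaces"])
# -> nqn.2016-06.io.spdk:cnode1 500
```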

Further, on restarting the GW, subsystem nqn.2016-06.io.spdk:cnode2, which was displayed earlier, is no longer listed.
get_subsystems output:

[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
    {
        "nqn": "nqn.2016-06.io.spdk:cnode1",
        "subtype": "NVMe",
        "listen_addresses": [
            {
                "transport": "TCP",
                "trtype": "TCP",
                "adrfam": "IPv4",
                "traddr": "10.0.209.152",
                "trsvcid": "5001" 
            }
        ],
        "allow_any_host": true,
        "hosts": [],
        "serial_number": "1",
        "model_number": "SPDK bdev Controller",
        "max_namespaces": 500,
        "min_cntlid": 1,
        "max_cntlid": 65519,
        "namespaces": [
            {
                "nsid": 1,
                "bdev_name": "UWI1-bdev1",
                "name": "UWI1-bdev1",
                "nguid": "65228668C41842058561AFBBC15C9C96",
                "uuid": "65228668-c418-4205-8561-afbbc15c9c96" 
            },
            {
                "nsid": 2,
                "bdev_name": "UWI1-bdev2",
                "name": "UWI1-bdev2",
                "nguid": "BBF19AD5F17240F9AA7A42B33376D35A",
                "uuid": "bbf19ad5-f172-40f9-aa7a-42b33376d35a" 
            },
