Bug #63317
Status: open (v0.0.5)
OMAP does not have an entry for a subsystem created during nvmeof service restart
Description
The main issue is the following error, hit during an nvmeof service restart:
python3 -m control.cli: error: create_subsystem failed: code=StatusCode.UNKNOWN message=Exception calling application: [errno 125] Failed to operate write op for oid nvmeof.state
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: INFO:control.grpc:Received request to create subsystem nqn.2016-06.io.spdk:cnode2, ana reporting: False
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: INFO:control.grpc:create_subsystem nqn.2016-06.io.spdk:cnode2: True
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: ERROR:control.state:Unable to add key to omap: [errno 125] Failed to operate write op for oid nvmeof.state. Exiting!
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: ERROR:control.grpc:Error persisting create_subsystem nqn.2016-06.io.spdk:cnode2: [errno 125] Failed to operate write op for oid nvmeof.state
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: ERROR:grpc._server:Exception calling application: [errno 125] Failed to operate write op for oid nvmeof.state
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: Traceback (most recent call last):
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "/src/__pypackages__/3.9/lib/grpc/_server.py", line 494, in _call_behavior
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: response_or_iterator = behavior(argument, context)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "/src/control/grpc.py", line 301, in create_subsystem
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: return self.create_subsystem_safe(request, context)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "/src/control/grpc.py", line 290, in create_subsystem_safe
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: self.gateway_state.add_subsystem(request.subsystem_nqn,
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "/src/control/state.py", line 367, in add_subsystem
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: self.omap.add_subsystem(subsystem_nqn, val)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "/src/control/state.py", line 70, in add_subsystem
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: self._add_key(key, val)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "/src/control/state.py", line 250, in _add_key
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: self.ioctx.operate_write_op(write_op, self.omap_name)
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: File "rados.pyx", line 3654, in rados.Ioctx.operate_write_op
Oct 25 06:41:40 ceph-nvmf3-ryxj6t-node5 ceph-7ffac0a2-730f-11ee-9dde-fa163e47fc24-nvmeof-rbd-ceph-nvmf3-ryxj6t-node5-ywspzj[118520]: rados.OSError: [errno 125] Failed to operate write op for oid nvmeof.state
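For reference, errno 125 on Linux is ECANCELED ("Operation canceled"); librados raises rados.OSError with this errno when the compound write op on nvmeof.state is aborted rather than completed. A quick way to confirm the mapping (assuming a Linux errno table):

```python
import errno

# On Linux, errno 125 maps to ECANCELED ("Operation canceled"),
# which is what the rados.OSError in the traceback above carries.
print(errno.errorcode[125])  # prints 'ECANCELED' on Linux
```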
https://github.com/ceph/ceph-nvmeof/pull/258 prevents corruption on gRPC calls, and I am not seeing any corruption. However, during this restart phase, adding a gateway entity (in my case create_subsystem) was reissued multiple times after the command failures; it was eventually acknowledged and subsystem cnode2 was created in SPDK, but the same entry was never written to the OMAP object.
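The failure mode above can be sketched as a two-phase update that is not atomic: the SPDK RPC succeeds first, the OMAP persist fails second, and nothing rolls the SPDK side back. This is a hypothetical simulation with illustrative names, not the gateway's actual code:

```python
# Hypothetical sketch (not the gateway's actual code) of why SPDK and the
# OMAP object diverge: the subsystem is created in SPDK first, and only
# then persisted to OMAP, so an OMAP write failure leaves SPDK ahead.

class OmapWriteError(OSError):
    pass

def create_subsystem(nqn, spdk_subsystems, omap_state, omap_healthy=True):
    # Phase 1: the SPDK RPC succeeds -> subsystem shows up in get_subsystems.
    spdk_subsystems.add(nqn)
    # Phase 2: persist to the OMAP object; during restart this can fail.
    if not omap_healthy:
        raise OmapWriteError(125, "Failed to operate write op for oid nvmeof.state")
    omap_state["subsystem_" + nqn] = "{}"

spdk, omap = set(), {}
try:
    create_subsystem("nqn.2016-06.io.spdk:cnode2", spdk, omap, omap_healthy=False)
except OmapWriteError:
    pass

print("nqn.2016-06.io.spdk:cnode2" in spdk)            # True: live in SPDK
print("subsystem_nqn.2016-06.io.spdk:cnode2" in omap)  # False: missing from OMAP
```

This matches the symptom below: get_subsystems (served from SPDK) lists cnode2, while the OMAP dump has no subsystem_...cnode2 key.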
The user runs the following:
[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 create_subsystem --subnqn nqn.2016-06.io.spdk:cnode2 --serial 2 --max-namespaces 500
usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
[--server-port SERVER_PORT]
[--client-key CLIENT_KEY]
[--client-cert CLIENT_CERT]
[--server-cert SERVER_CERT]
{create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
...
python3 -m control.cli: error: create_subsystem failed: code=StatusCode.UNKNOWN message=Exception calling application: [errno 125] Failed to operate write op for oid nvmeof.state
[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 create_subsystem --subnqn nqn.2016-06.io.spdk:cnode2 --serial 2 --max-namespaces 500
usage: python3 -m control.cli [-h] [--server-address SERVER_ADDRESS]
[--server-port SERVER_PORT]
[--client-key CLIENT_KEY]
[--client-cert CLIENT_CERT]
[--server-cert SERVER_CERT]
{create_bdev,delete_bdev,create_subsystem,delete_subsystem,add_namespace,remove_namespace,add_host,remove_host,create_listener,delete_listener,get_subsystems}
...
python3 -m control.cli: error: create_subsystem failed: code=StatusCode.INTERNAL message=request:
{
"nqn": "nqn.2016-06.io.spdk:cnode2",
"serial_number": "2",
"model_number": "SPDK bdev Controller",
"max_namespaces": 500,
"min_cntlid": 1,
"max_cntlid": 65519,
"method": "nvmf_create_subsystem",
"req_id": 189
}
Got JSON-RPC error response
response:
{
"code": -32603,
"message": "Unable to create subsystem nqn.2016-06.io.spdk:cnode2"
}
Somewhere between the 1st and 2nd create_subsystem requests, subsystem nqn.2016-06.io.spdk:cnode2 did get created: journalctl reports that it already exists (search for "already exists" in http://magna002.ceph.redhat.com/cephci-jenkins/nvmeof_5_upstream.log).
get_subsystems output:
[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
{
"nqn": "nqn.2016-06.io.spdk:cnode2",
"subtype": "NVMe",
"listen_addresses": [],
"allow_any_host": false,
"hosts": [],
"serial_number": "2",
"model_number": "SPDK bdev Controller",
"max_namespaces": 500,
"min_cntlid": 1,
"max_cntlid": 65519,
"namespaces": []
},
{
"nqn": "nqn.2016-06.io.spdk:cnode1",
"subtype": "NVMe",
"listen_addresses": [
{
"transport": "TCP",
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.209.152",
"trsvcid": "5001"
}
],
"allow_any_host": true,
"hosts": [],
"serial_number": "1",
"model_number": "SPDK bdev Controller",
"max_namespaces": 500,
"min_cntlid": 1,
"max_cntlid": 65519,
"namespaces": [
{
"nsid": 1,
"bdev_name": "UWI1-bdev1",
"name": "UWI1-bdev1",
"nguid": "65228668C41842058561AFBBC15C9C96",
"uuid": "65228668-c418-4205-8561-afbbc15c9c96"
},
{
"nsid": 2,
"bdev_name": "UWI1-bdev2",
"name": "UWI1-bdev2",
"nguid": "BBF19AD5F17240F9AA7A42B33376D35A",
"uuid": "bbf19ad5-f172-40f9-aa7a-42b33376d35a"
But the OMAP object does not have an entry for it; see http://magna002.ceph.redhat.com/cephci-jenkins/nvmeof_5_restart_omap.log
namespace_nqn.2016-06.io.spdk:cnode1_98
value (95 bytes) :
00000000 7b 0a 20 20 22 73 75 62 73 79 73 74 65 6d 5f 6e |{. "subsystem_n|
00000010 71 6e 22 3a 20 22 6e 71 6e 2e 32 30 31 36 2d 30 |qn": "nqn.2016-0|
00000020 36 2e 69 6f 2e 73 70 64 6b 3a 63 6e 6f 64 65 31 |6.io.spdk:cnode1|
00000030 22 2c 0a 20 20 22 62 64 65 76 5f 6e 61 6d 65 22 |",. "bdev_name"|
00000040 3a 20 22 55 57 49 31 2d 62 64 65 76 39 38 22 2c |: "UWI1-bdev98",|
00000050 0a 20 20 22 6e 73 69 64 22 3a 20 39 38 0a 7d |. "nsid": 98.}|
0000005f
namespace_nqn.2016-06.io.spdk:cnode1_99
value (95 bytes) :
00000000 7b 0a 20 20 22 73 75 62 73 79 73 74 65 6d 5f 6e |{. "subsystem_n|
00000010 71 6e 22 3a 20 22 6e 71 6e 2e 32 30 31 36 2d 30 |qn": "nqn.2016-0|
00000020 36 2e 69 6f 2e 73 70 64 6b 3a 63 6e 6f 64 65 31 |6.io.spdk:cnode1|
00000030 22 2c 0a 20 20 22 62 64 65 76 5f 6e 61 6d 65 22 |",. "bdev_name"|
00000040 3a 20 22 55 57 49 31 2d 62 64 65 76 39 39 22 2c |: "UWI1-bdev99",|
00000050 0a 20 20 22 6e 73 69 64 22 3a 20 39 39 0a 7d |. "nsid": 99.}|
0000005f
omap_version
value (3 bytes) :
00000000 35 31 34 |514|
00000003
subsystem_nqn.2016-06.io.spdk:cnode1
value (100 bytes) :
00000000 7b 0a 20 20 22 73 75 62 73 79 73 74 65 6d 5f 6e |{. "subsystem_n|
00000010 71 6e 22 3a 20 22 6e 71 6e 2e 32 30 31 36 2d 30 |qn": "nqn.2016-0|
00000020 36 2e 69 6f 2e 73 70 64 6b 3a 63 6e 6f 64 65 31 |6.io.spdk:cnode1|
00000030 22 2c 0a 20 20 22 73 65 72 69 61 6c 5f 6e 75 6d |",. "serial_num|
00000040 62 65 72 22 3a 20 22 31 22 2c 0a 20 20 22 6d 61 |ber": "1",. "ma|
00000050 78 5f 6e 61 6d 65 73 70 61 63 65 73 22 3a 20 35 |x_namespaces": 5|
00000060 30 30 0a 7d |00.}|
00000064
Further, after restarting the GW, subsystem nqn.2016-06.io.spdk:cnode2 is no longer displayed, even though it was shown earlier.
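This is consistent with the gateway rebuilding its SPDK state from the OMAP object on startup: since cnode2 was never persisted, it does not come back. A hypothetical restore-on-restart sketch, using the OMAP keys observed in the dump above:

```python
# Hypothetical restore-on-restart sketch: the gateway replays only what is
# in the OMAP object, so an unpersisted subsystem does not reappear.

def restore_from_omap(omap_state):
    prefix = "subsystem_"
    return sorted(k[len(prefix):] for k in omap_state if k.startswith(prefix))

# OMAP contents observed above: cnode1 is present, cnode2 never made it in.
omap = {
    "omap_version": "514",
    "subsystem_nqn.2016-06.io.spdk:cnode1": "{...}",
    "namespace_nqn.2016-06.io.spdk:cnode1_98": "{...}",
}

print(restore_from_omap(omap))  # ['nqn.2016-06.io.spdk:cnode1'] -- no cnode2
```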
get_subsystems output:
[root@ceph-nvmf3-ryxj6t-node5 cephuser]# podman run quay.io/ceph/nvmeof-cli:v0.0.5 --server-address 10.0.209.152 --server-port 5500 get_subsystems
INFO:__main__:Get subsystems:
[
{
"nqn": "nqn.2016-06.io.spdk:cnode1",
"subtype": "NVMe",
"listen_addresses": [
{
"transport": "TCP",
"trtype": "TCP",
"adrfam": "IPv4",
"traddr": "10.0.209.152",
"trsvcid": "5001"
}
],
"allow_any_host": true,
"hosts": [],
"serial_number": "1",
"model_number": "SPDK bdev Controller",
"max_namespaces": 500,
"min_cntlid": 1,
"max_cntlid": 65519,
"namespaces": [
{
"nsid": 1,
"bdev_name": "UWI1-bdev1",
"name": "UWI1-bdev1",
"nguid": "65228668C41842058561AFBBC15C9C96",
"uuid": "65228668-c418-4205-8561-afbbc15c9c96"
},
{
"nsid": 2,
"bdev_name": "UWI1-bdev2",
"name": "UWI1-bdev2",
"nguid": "BBF19AD5F17240F9AA7A42B33376D35A",
"uuid": "bbf19ad5-f172-40f9-aa7a-42b33376d35a"
},