TLDR skip to *** MON.A ****** below.
So this looks like a race. The calls seem to be serialized in the client code yet somehow we see them being processed out of order by mon.a with the 'rm' finishing after the 'set' and therefore the 'pool create' failing because the profile is empty (the error message should print the profile but prints nothing in this case indicating it's empty.
$ zgrep testprofile-test-rados-api-smithi124-44005-9 ceph.audit.log.gz|sed -n 4,8p
2021-02-23T00:18:59.624810+0000 mon.a (mon.0) 499 : audit [INF] from='client.8453 172.21.15.124:0/3414301217' entity='client.admin' cmd=[{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]: dispatch
2021-02-23T00:18:59.632775+0000 mon.a (mon.0) 500 : audit [INF] from='client.? ' entity='client.admin' cmd=[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]: dispatch
2021-02-23T00:19:00.092935+0000 mon.a (mon.0) 502 : audit [INF] from='client.8453 172.21.15.124:0/3414301217' entity='client.admin' cmd='[{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]': finished
2021-02-23T00:19:00.093691+0000 mon.a (mon.0) 503 : audit [INF] from='client.? ' entity='client.admin' cmd='[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]': finished
2021-02-23T00:19:00.095129+0000 mon.a (mon.0) 505 : audit [INF] from='client.8453 172.21.15.124:0/3414301217' entity='client.admin' cmd=[{"prefix": "osd pool create", "pool": "test-rados-api-smithi124-44005-9", "pool_type":"erasure", "pg_num":8, "pgp_num":8, "erasure_code_profile":"testprofile-test-rados-api-smithi124-44005-9"}]: dispatch
I think failure injection is playing a major part here. Initially, the 'rm' command is sent to mon.c and immediately after we see a failure injection. I'll try to document the sequence as much as possible below.
Connection to mon.c and send 'rm' with socket failure injected.
2021-02-23T00:18:59.411+0000 7fc65f2e4540 10 start_mon_command cmd=[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]
2021-02-23T00:18:59.411+0000 7fc65f2e4540 10 monclient: _send_command 1 [{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]
2021-02-23T00:18:59.411+0000 7fc65f2e4540 10 monclient: _send_mon_message to mon.c at v2:172.21.15.124:3301/0
2021-02-23T00:18:59.411+0000 7fc65f2e4540 1 -- 172.21.15.124:0/3414301217 --> [v2:172.21.15.124:3301/0,v1:172.21.15.124:6790/0] -- mon_command({"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"} v 0) v1 -- 0x558bf8e3af60 con 0x558bf8e814b0
2021-02-23T00:18:59.412+0000 7fc64eddb700 4 set_mon_vals no callback set
2021-02-23T00:18:59.413+0000 7fc64d5d8700 0 -- 172.21.15.124:0/3414301217 >> [v2:172.21.15.124:3301/0,v1:172.21.15.124:6790/0] conn(0x558bf8e814b0 msgr2=0x558bf8d80a00 secure :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until injecting socket failure
Reconnect to mon.a and send 'rm' again.
2021-02-23T00:18:59.414+0000 7fc65083e700 10 monclient: _finish_hunting 0
2021-02-23T00:18:59.414+0000 7fc65083e700 1 monclient: found mon.a
2021-02-23T00:18:59.414+0000 7fc65083e700 20 monclient: _un_backoff reopen_interval_multipler now 1
2021-02-23T00:18:59.414+0000 7fc65083e700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.124:3300/0
2021-02-23T00:18:59.414+0000 7fc65083e700 1 -- 172.21.15.124:0/3414301217 --> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] -- mon_subscribe({config=0+,mgrmap=0+,monmap=2+,osdmap=0}) v3 -- 0x7fc638015360 con 0x558bf8e31bb0
2021-02-23T00:18:59.414+0000 7fc65083e700 10 monclient: _send_command 1 [{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]
2021-02-23T00:18:59.414+0000 7fc65083e700 10 monclient: _send_mon_message to mon.a at v2:172.21.15.124:3300/0
mon.0 (a) acks the command.
2021-02-23T00:18:59.568+0000 7fc63f7fe700 1 -- 172.21.15.124:0/3414301217 <== mon.0 v2:172.21.15.124:3300/0 5 ==== mon_command_ack([{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]=0 erasure-code-profile testprofile-test-rados-api-smithi124-44005-9 does not exist v56) v1 ==== 211+0+0 (secure 0 0 0) 0x7fc640072c10 con 0x558bf8e31bb0
2021-02-23T00:18:59.568+0000 7fc63f7fe700 10 monclient: handle_mon_command_ack 1 [{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]
2021-02-23T00:18:59.568+0000 7fc63f7fe700 10 monclient: _finish_command 1 = system:0 erasure-code-profile testprofile-test-rados-api-smithi124-44005-9 does not exist
Send 'profile set' command to mon.a and get ack.
2021-02-23T00:18:59.611+0000 7fc65f2e4540 10 start_mon_command cmd=[{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]
2021-02-23T00:18:59.611+0000 7fc65f2e4540 10 monclient: _send_command 3 [{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]
2021-02-23T00:18:59.611+0000 7fc65f2e4540 10 monclient: _send_mon_message to mon.a at v2:172.21.15.124:3300/0
2021-02-23T00:18:59.611+0000 7fc65f2e4540 1 -- 172.21.15.124:0/3414301217 --> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] -- mon_command({"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]} v 0) v1 -- 0x558bf8e3a440 con 0x558bf8e31bb0
2021-02-23T00:19:00.091+0000 7fc63f7fe700 1 -- 172.21.15.124:0/3414301217 <== mon.0 v2:172.21.15.124:3300/0 7 ==== mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]=0 v57) v1 ==== 188+0+0 (secure 0 0 0) 0x7fc640022010 con 0x558bf8e31bb0
2021-02-23T00:19:00.091+0000 7fc63f7fe700 10 monclient: handle_mon_command_ack 3 [{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]
2021-02-23T00:19:00.091+0000 7fc63f7fe700 10 monclient: _finish_command 3 = system:0
Send out pool create and get error.
2021-02-23T00:19:00.091+0000 7fc65f2e4540 10 start_mon_command cmd=[{"prefix": "osd pool create", "pool": "test-rados-api-smithi124-44005-9", "pool_type":"erasure", "pg_num":8, "pgp_num":8, "erasure_code_profile":"testprofile-test-rados-api-smithi124-44005-9"}]
2021-02-23T00:19:00.091+0000 7fc65f2e4540 10 monclient: _send_command 4 [{"prefix": "osd pool create", "pool": "test-rados-api-smithi124-44005-9", "pool_type":"erasure", "pg_num":8, "pgp_num":8, "erasure_code_profile":"testprofile-test-rados-api-smithi124-44005-9"}]
2021-02-23T00:19:00.091+0000 7fc65f2e4540 10 monclient: _send_mon_message to mon.a at v2:172.21.15.124:3300/0
2021-02-23T00:19:00.091+0000 7fc65f2e4540 1 -- 172.21.15.124:0/3414301217 --> [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] -- mon_command({"prefix": "osd pool create", "pool": "test-rados-api-smithi124-44005-9", "pool_type":"erasure", "pg_num":8, "pgp_num":8, "erasure_code_profile":"testprofile-test-rados-api-smithi124-44005-9"} v 0) v1 -- 0x558bf8e34510 con 0x558bf8e31bb0
2021-02-23T00:19:00.094+0000 7fc63f7fe700 1 -- 172.21.15.124:0/3414301217 <== mon.0 v2:172.21.15.124:3300/0 8 ==== mon_command_ack([{"prefix": "osd pool create", "pool": "test-rados-api-smithi124-44005-9", "pool_type":"erasure", "pg_num":8, "pgp_num":8, "erasure_code_profile":"testprofile-test-rados-api-smithi124-44005-9"}]=-22 cannot determine the erasure code plugin because there is no 'plugin' entry in the erasure_code_profile {} v57) v1 ==== 332+0+0 (secure 0 0 0) 0x7fc640022240 con 0x558bf8e31bb0
Meanwhile in the audit log.
mon.c receives 'rm' command.
2021-02-23T00:18:59.414603+0000 mon.c (mon.2) 79 : audit [INF] from='client.? 172.21.15.124:0/3414301217' entity='client.admin' cmd=[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]: dispatch
mon.a receives 'rm' command.
2021-02-23T00:18:59.565685+0000 mon.a (mon.0) 497 : audit [INF] from='client.8453 172.21.15.124:0/3414301217' entity='client.admin' cmd=[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]: dispatch
mon.a reports 'rm' command finished after the pool create error.
2021-02-23T00:19:00.093691+0000 mon.a (mon.0) 503 : audit [INF] from='client.? ' entity='client.admin' cmd='[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]': finished
Around the same time on mon.c we see the following.
2021-02-23T00:19:00.097+0000 7f582b5af700 1 -- [v2:172.21.15.124:3301/0,v1:172.21.15.124:6790/0] <== mon.0 v2:172.21.15.124:3300/0 2082 ==== route(mon_command_ack([{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]=0 v57) v1 tid 1226) v3 ==== 278+0+0 (secure 0 0 0) 0x555752e17e40 con 0x5557521e2800
2021-02-23T00:19:00.097+0000 7f582b5af700 20 mon.c@2(peon) e1 _ms_dispatch existing session 0x555752254d80 for mon.0
2021-02-23T00:19:00.097+0000 7f582b5af700 20 mon.c@2(peon) e1 entity caps allow *
2021-02-23T00:19:00.097+0000 7f582b5af700 20 is_capable service=mon command= read addr v2:172.21.15.124:3300/0 on cap allow *
2021-02-23T00:19:00.097+0000 7f582b5af700 20 allow so far , doing grant allow *
2021-02-23T00:19:00.097+0000 7f582b5af700 20 allow all
2021-02-23T00:19:00.097+0000 7f582b5af700 20 is_capable service=mon command= exec addr v2:172.21.15.124:3300/0 on cap allow *
2021-02-23T00:19:00.097+0000 7f582b5af700 20 allow so far , doing grant allow *
2021-02-23T00:19:00.097+0000 7f582b5af700 20 allow all
2021-02-23T00:19:00.097+0000 7f582b5af700 10 mon.c@2(peon) e1 handle_route tid 1226 mon_command_ack([{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]=0 v57) v1
2021-02-23T00:19:00.097+0000 7f582b5af700 10 mon.c@2(peon) e1 don't have routed request tid 1226
We see what appears to be further confirmation of this in mon.c's log.
2021-02-23T00:19:00.638+0000 7f582b5af700 7 mon.c@2(peon).log v89 update_from_paxos applying incremental log 89 2021-02-23T00:19:00.092935+0000 mon.a (mon.0) 502 : audit [INF] from='client.8453 172.21.15.124:0/3414301217' entity='client.admin' cmd='[{"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]}]': finished
2021-02-23T00:19:00.638+0000 7f582b5af700 20 mon.c@2(peon).log v89 update_from_paxos logging for channel 'audit' to file '/var/log/ceph/ceph.audit.log'
2021-02-23T00:19:00.638+0000 7f582b5af700 7 mon.c@2(peon).log v89 update_from_paxos applying incremental log 89 2021-02-23T00:19:00.093691+0000 mon.a (mon.0) 503 : audit [INF] from='client.? ' entity='client.admin' cmd='[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]': finished
- MON.A ******
While on mon.a (which provides the definitive data set and probably renders most of the above redundant)
Receive the 'rm' command from the client.
2021-02-23T00:18:59.551+0000 7f1b91d60700 10 mon.a@0(leader).log v87 logging 2021-02-23T00:18:59.414603+0000 mon.c (mon.2) 79 : audit [INF] from='client.? 172.21.15.124:0/3414301217' entity='client.admin' cmd=[{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]: dispatch
Return DNE (expected response).
2021-02-23T00:18:59.568+0000 7f1b91d60700 2 mon.a@0(leader) e1 send_reply 0x56185fc6b4a0 0x561860f9b380 mon_command_ack([{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]=0 erasure-code-profile testprofile-test-rados-api-smithi124-44005-9 does not exist v56) v1
2021-02-23T00:18:59.568+0000 7f1b91d60700 1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] --> 172.21.15.124:0/3414301217 -- mon_command_ack([{"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"}]=0 erasure-code-profile testprofile-test-rados-api-smithi124-44005-9 does not exist v56) v1 -- 0x561860f9b380 con 0x561860c2fc00
Receive the 'ec profile set' command, set the plugin to 'jerasure'.
2021-02-23T00:18:59.623+0000 7f1b91d60700 0 mon.a@0(leader) e1 handle_command mon_command({"prefix": "osd erasure-code-profile set", "name": "testprofile-test-rados-api-smithi124-44005-9", "profile": [ "k=2", "m=1", "crush-failure-domain=osd"]} v 0) v1
...
2021-02-23T00:18:59.628+0000 7f1b91d60700 20 mon.a@0(leader).osd e56 erasure code profile set testprofile-test-rados-api-smithi124-44005-9={crush-device-class=,crush-failure-domain=osd,crush-root=default,jerasure-per-chunk-alignment=false,k=2,m=1,plugin=jerasure,ruleset-failure-domain=osd,technique=reed_sol_van,w=8}
Receive a forwarded 'rm' command from mon.c
2021-02-23T00:18:59.630+0000 7f1b91d60700 1 -- [v2:172.21.15.124:3300/0,v1:172.21.15.124:6789/0] <== mon.2 v2:172.21.15.124:3301/0 1856 ==== forward(mon_command({"prefix": "osd erasure-code-profile rm", "name": "testprofile-test-rados-api-smithi124-44005-9"} v 0) v1 caps allow * tid 1226 con_features 4540138303579357183) v4 ==== 339+0+0 (unknown 0 0 0) 0x561860312480 con 0x56185e599c00
Because of this 'ec profile set' creation is cancelled.
2021-02-23T00:18:59.631+0000 7f1b91d60700 20 mon.a@0(leader).osd e56 erasure code profile rm testprofile-test-rados-api-smithi124-44005-9: creation canceled
The code.
src/mon/OSDMonitor.cc:
10917 } else if (prefix == "osd erasure-code-profile rm") {
10918 string name;
10919 cmd_getval(cmdmap, "name", name);
10920
10921 if (erasure_code_profile_in_use(pending_inc.new_pools, name, &ss))
10922 goto wait;
10923
10924 if (erasure_code_profile_in_use(osdmap.pools, name, &ss)) {
10925 err = -EBUSY;
10926 goto reply;
10927 }
10928
10929 if (osdmap.has_erasure_code_profile(name) ||
10930 »·······pending_inc.new_erasure_code_profiles.count(name)) {
10931 if (osdmap.has_erasure_code_profile(name)) {
10932 »·······pending_inc.old_erasure_code_profiles.push_back(name);
10933 } else {
10934 »·······dout(20) << "erasure code profile rm " << name << ": creation canceled" << dendl;
10935 »·······pending_inc.new_erasure_code_profiles.erase(name);
10936 }
So the socket failure caused there to effectively be two 'rm' commands and the 'first' command ended up being forwarded to the relevant mon well after the 'second' command had been completed and late enough that it actually caused cancellation of the creation of the profile which the test code assumes will be created well after the rm has completed.
This kind of makes a mockery of this, or any similar test, making any assumptions about the order of execution of commands it issues via rados_mon_command().