Bug #10488
Closed
osd erasure-code-profile set is sometimes not idempotent
Description
When the check at http://workbench.dachary.org/ceph/ceph/blob/giant/src/mon/OSDMonitor.cc#L4531 compares the profile to be created (for instance k=2 m=1) with the existing one, it does not take into account the key=value pairs that are added implicitly by the jerasure plugin. Therefore ceph osd erasure-code-profile set myprofile k=2 m=1 will not compare equal the second time it is run.
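The comparison problem described above can be sketched as follows. This is an illustrative sketch, not the actual OSDMonitor.cc code; the stored profile keys and defaults shown here are examples, not taken from the source.

```python
# Profile as stored after the jerasure plugin fills in its implicit
# defaults (the extra keys below are illustrative assumptions).
stored = {
    "k": "2",
    "m": "1",
    "plugin": "jerasure",
    "technique": "reed_sol_van",  # added implicitly by the plugin
}

# Profile as supplied on the second `ceph osd erasure-code-profile set`.
requested = {"k": "2", "m": "1"}

# A naive whole-profile equality test fails, so the second, identical
# call is rejected even though nothing changed.
print(stored == requested)  # False

# Comparing only the keys the user supplied would make the call idempotent.
print(all(stored.get(key) == value
          for key, value in requested.items()))  # True
```

The sketch only shows why a whole-dictionary comparison rejects a repeated call when one side contains plugin-supplied defaults.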
It shows up in the mon logs as:
2015-01-07 18:06:34.333705 7fe7dcffd700 10 mon.d@6(peon) e1 handle_route mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]}]=-1 will not override erasure code profile testprofile v220) v1 to unknown.0 :/0
2015-01-07 18:06:34.333712 7fe7dcffd700 1 -- 10.214.131.5:6792/0 --> 10.214.134.136:0/64022480 -- mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]}]=-1 will not override erasure code profile testprofile v220) v1 -- ?+0 0x20d8960 con 0x32609a0
2015-01-07 18:06:34.335084 7fe7dcffd700 10 mon.d@6(peon) e1 ms_handle_reset 0x32609a0 10.214.134.136:0/64022480
It manifested itself in http://pulpito.ceph.com/sage-2015-01-06_09:44:19-rados-wip-sage-testing-firefly---basic-multi/688168/ (reported in http://tracker.ceph.com/issues/10483), where http://workbench.dachary.org/ceph/ceph/blob/master/src/test/librados/test.cc#L57 failed with:
2015-01-07T18:05:54.843 INFO:tasks.workunit.client.0.mira057.stdout:[ RUN ] LibRadosAioEC.FlushAsync
2015-01-07T18:05:55.228 INFO:teuthology.orchestra.run.plana35:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.134.136:6792 mon_status'
2015-01-07T18:05:55.607 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['e', 'i', 'a', 'g']
2015-01-07T18:05:55.607 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2015-01-07T18:05:55.608 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.e
2015-01-07T18:05:55.608 INFO:tasks.mon_thrash.mon_thrasher:killing mon.e
2015-01-07T18:06:01.606 INFO:tasks.ceph.mon.e:Stopped
2015-01-07T18:06:01.606 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.i
2015-01-07T18:06:01.606 INFO:tasks.mon_thrash.mon_thrasher:killing mon.i
2015-01-07T18:06:07.607 INFO:tasks.ceph.mon.i:Stopped
2015-01-07T18:06:07.847 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.a
2015-01-07T18:06:07.847 INFO:tasks.mon_thrash.mon_thrasher:killing mon.a
2015-01-07T18:06:13.608 INFO:tasks.ceph.mon.a:Stopped
2015-01-07T18:06:13.608 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.g
2015-01-07T18:06:13.608 INFO:tasks.mon_thrash.mon_thrasher:killing mon.g
2015-01-07T18:06:19.608 INFO:tasks.ceph.mon.g:Stopped
2015-01-07T18:06:19.608 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 5
2015-01-07T18:06:19.609 INFO:teuthology.orchestra.run.plana35:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
2015-01-07T18:06:19.799 INFO:teuthology.orchestra.run.plana35.stderr:2015-01-07 18:06:19.798735 7fba7c2c1700 0 -- :/1026241 >> 10.214.134.136:6790/0 pipe(0x7fba78039010 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba780392a0).fault
2015-01-07T18:06:28.799 INFO:teuthology.orchestra.run.plana35.stderr:2015-01-07 18:06:28.799283 7fba7c1c0700 0 -- 10.214.131.5:0/1026241 >> 10.214.131.5:6789/0 pipe(0x7fba68000cd0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba680029c0).fault
2015-01-07T18:06:31.170 INFO:tasks.workunit.client.0.mira057.stdout:test/librados/aio.cc:2279: Failure
2015-01-07T18:06:31.170 INFO:tasks.workunit.client.0.mira057.stdout:Value of: test_data.init()
2015-01-07T18:06:31.170 INFO:tasks.workunit.client.0.mira057.stdout: Actual: "create_one_ec_pool(test-rados-api-mira057-22480-52) failed: error rados_mon_command erasure-code-profile set name:testprofile failed with error -1"
2015-01-07T18:06:31.171 INFO:tasks.workunit.client.0.mira057.stdout:Expected: ""
2015-01-07T18:06:31.171 INFO:tasks.workunit.client.0.mira057.stdout:[ FAILED ] LibRadosAioEC.FlushAsync (36327 ms)
2015-01-07T18:06:31.171 INFO:tasks.workunit.client.0.mira057.stdout:[ RUN ] LibRadosAioEC.FlushAsyncPP
2015-01-07T18:06:31.194 INFO:tasks.workunit.client.0.mira057.stdout:test/librados/aio.cc:2319: Failure
2015-01-07T18:06:31.194 INFO:tasks.workunit.client.0.mira057.stdout:Value of: test_data.init()
2015-01-07T18:06:31.195 INFO:tasks.workunit.client.0.mira057.stdout: Actual: "create_one_ec_pool(test-rados-api-mira057-22480-53) failed: error mon_command erasure-code-profile set name:testprofile failed with error -1"
2015-01-07T18:06:31.195 INFO:tasks.workunit.client.0.mira057.stdout:Expected: ""
2015-01-07T18:06:31.195 INFO:tasks.workunit.client.0.mira057.stdout:[ FAILED ] LibRadosAioEC.FlushAsyncPP (24 ms)
2015-01-07T18:06:31.196 INFO:tasks.workunit.client.0.mira057.stdout:[ RUN ] LibRadosAioEC.RoundTripWriteFull
2015-01-07T18:06:31.206 INFO:tasks.workunit.client.0.mira057.stdout:test/librados/aio.cc:2364: Failure
Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/sage-2015-01-11_12:05:20-rados-firefly-distro-basic-multi/697291
Updated by Sage Weil over 9 years ago
ubuntu@teuthology:/a/sage-2015-01-19_18:35:10-rados-wip-dho-distro-basic-multi/713735
Updated by Loïc Dachary about 9 years ago
- Status changed from In Progress to Won't Fix
Although there is a potential for confusion, erasure-code-profile set is idempotent. https://github.com/ceph/ceph/pull/4064 clarifies the error message and adds tests that verify idempotency. My analysis of the original error was incorrect. What's more likely is that the profile already existed and was initialized with different values (k=3 instead of k=2, for instance). If that happens again, the error message including the profiles being compared will make it easier to diagnose.
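The intended behavior can be sketched as follows. This is an illustrative sketch of the semantics described above, not the code from https://github.com/ceph/ceph/pull/4064; the function name and message wording are hypothetical, loosely modeled on the "will not override erasure code profile" message seen in the logs.

```python
def profile_set(profiles, name, new_profile):
    """Create profile `name`; refuse to silently override an existing,
    different profile, but accept a repeated identical call (idempotent)."""
    existing = profiles.get(name)
    if existing is not None and existing != new_profile:
        # Include both profiles in the message so a mismatch is easy to diagnose.
        return (-1, "will not override erasure code profile {}: "
                    "existing {} != requested {}".format(name, existing, new_profile))
    profiles[name] = new_profile
    return (0, "")

profiles = {}
# First call creates the profile.
assert profile_set(profiles, "myprofile", {"k": "2", "m": "1"}) == (0, "")
# Re-running with the same values succeeds: the call is idempotent.
assert profile_set(profiles, "myprofile", {"k": "2", "m": "1"}) == (0, "")
# Different values are rejected, and the message shows both profiles.
code, msg = profile_set(profiles, "myprofile", {"k": "3", "m": "1"})
assert code == -1 and "existing" in msg
```

The key point is the error path: by printing both the existing and the requested profile, a future occurrence of this failure immediately reveals which values diverged.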
Updated by Loïc Dachary about 9 years ago
- Status changed from Won't Fix to Fix Under Review
Updated by Loïc Dachary about 9 years ago
There is, however, a situation where erasure-code-profile set fails shortly after a mon fails: /a/sage-2015-01-19_18:35:10-rados-wip-dho-distro-basic-multi/713735/teuthology.log
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['b']
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.b
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:killing mon.b
2015-01-19T22:08:06.539 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2015-01-19T22:08:12.539 INFO:tasks.ceph.mon.b:Stopped
2015-01-19T22:08:12.539 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
2015-01-19T22:08:12.539 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
2015-01-19T22:08:12.610 INFO:teuthology.orchestra.run.burnupi34.stderr:2015-01-19 22:08:12.609764 7f613857b700 0 -- :/1048066 >> 10.214.131.8:6789/0 pipe(0x7f613c0310e0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f613c031370).fault
2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum_status is {"election_epoch":14,"quorum":[1,2],"quorum_names":["a","c"],"quorum_leader_name":"a","monmap":{"epoch":1,"fsid":"0eec1192-ff0c-425b-9fcc-97493942a122","modified":"2015-01-19 22:03:07.508836","created":"2015-01-19 22:03:07.508836","mons":[{"rank":0,"name":"b","addr":"10.214.131.8:6789\/0"},{"rank":1,"name":"a","addr":"10.214.135.22:6789\/0"},{"rank":2,"name":"c","addr":"10.214.135.22:6790\/0"}]}}
2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum is size 2
2015-01-19T22:08:21.604 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6789 mon_status'
2015-01-19T22:08:22.667 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6790 mon_status'
2015-01-19T22:08:23.728 INFO:tasks.mon_thrash.mon_thrasher:waiting for 90.0 secs before reviving monitors
2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:test/librados/aio.cc:1945: Failure
2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:Value of: test_data.init()
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout: Actual: "create_one_ec_pool(test-rados-api-plana32-28352-43) failed: error mon_command erasure-code-profile set name:testprofile failed with error -1"
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:Expected: ""
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:[ FAILED ] LibRadosAioEC.RoundTripAppendPP (39013 ms)
Updated by Loïc Dachary about 9 years ago
Trying to reproduce manually does not help: the problem does not always show up.
loic@fold:~/software/ceph/ceph/src$ MON=3 OSD=0 MDS=0 ./vstart.sh -X -n
=== mon.c ===
Stopping Ceph mon.c on fold...done
=== mon.b ===
Stopping Ceph mon.b on fold...done
=== mon.a ===
Stopping Ceph mon.a on fold...done
hostname fold
ip 10.159.203.253
ip 10.159.203.253
port
creating /home/loic/software/ceph/ceph/src/keyring
./monmaptool --create --clobber --add a 10.159.203.253:6789 --add b 10.159.203.253:6790 --add c 10.159.203.253:6791 --print /tmp/ceph_monmap.22569
./monmaptool: monmap file /tmp/ceph_monmap.22569
./monmaptool: generated fsid 5a5300e2-eb04-4975-ab20-2a0418d7fac1
epoch 0
fsid 5a5300e2-eb04-4975-ab20-2a0418d7fac1
last_changed 2015-03-18 12:19:28.625882
created 2015-03-18 12:19:28.625882
0: 10.159.203.253:6789/0 mon.a
1: 10.159.203.253:6790/0 mon.b
2: 10.159.203.253:6791/0 mon.c
./monmaptool: writing epoch 0 to /tmp/ceph_monmap.22569 (3 monitors)
rm -rf /home/loic/software/ceph/ceph/src/dev/mon.a
mkdir -p /home/loic/software/ceph/ceph/src/dev/mon.a
./ceph-mon --mkfs -c /home/loic/software/ceph/ceph/src/ceph.conf -i a --monmap=/tmp/ceph_monmap.22569 --keyring=/home/loic/software/ceph/ceph/src/keyring
./ceph-mon: set fsid to 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
./ceph-mon: created monfs at /home/loic/software/ceph/ceph/src/dev/mon.a for mon.a
rm -rf /home/loic/software/ceph/ceph/src/dev/mon.b
mkdir -p /home/loic/software/ceph/ceph/src/dev/mon.b
./ceph-mon --mkfs -c /home/loic/software/ceph/ceph/src/ceph.conf -i b --monmap=/tmp/ceph_monmap.22569 --keyring=/home/loic/software/ceph/ceph/src/keyring
./ceph-mon: set fsid to 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
./ceph-mon: created monfs at /home/loic/software/ceph/ceph/src/dev/mon.b for mon.b
rm -rf /home/loic/software/ceph/ceph/src/dev/mon.c
mkdir -p /home/loic/software/ceph/ceph/src/dev/mon.c
./ceph-mon --mkfs -c /home/loic/software/ceph/ceph/src/ceph.conf -i c --monmap=/tmp/ceph_monmap.22569 --keyring=/home/loic/software/ceph/ceph/src/keyring
./ceph-mon: set fsid to 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
./ceph-mon: created monfs at /home/loic/software/ceph/ceph/src/dev/mon.c for mon.c
./ceph-mon -i a -c /home/loic/software/ceph/ceph/src/ceph.conf
./ceph-mon -i b -c /home/loic/software/ceph/ceph/src/ceph.conf
./ceph-mon -i c -c /home/loic/software/ceph/ceph/src/ceph.conf
started. stop.sh to stop. see out/* (e.g. 'tail -f out/????') for debug output.
export PYTHONPATH=./pybind
export LD_LIBRARY_PATH=.libs
loic@fold:~/software/ceph/ceph/src$ ./ceph osd erasure-code-profile set pp k=2 m=1 ruleset-failure-domain=osd
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
loic@fold:~/software/ceph/ceph/src$ ./ceph osd erasure-code-profile set pp k=2 m=1 ruleset-failure-domain=osd
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
loic@fold:~/software/ceph/ceph/src$ pkill -f 'ceph-mon -i a'
loic@fold:~/software/ceph/ceph/src$ ./ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2015-03-18 12:19:55.205630 7f1548c17700 0 -- 10.159.203.253:0/1023070 >> 10.159.203.253:6789/0 pipe(0x7f1538000cd0 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1538004f70).fault
    cluster 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
     health HEALTH_ERR
            8 pgs stuck inactive
            8 pgs stuck unclean
            no osds
            1 mons down, quorum 1,2 b,c
     monmap e1: 3 mons at {a=10.159.203.253:6789/0,b=10.159.203.253:6790/0,c=10.159.203.253:6791/0}
            election epoch 6, quorum 1,2 b,c
     osdmap e2: 0 osds: 0 up, 0 in
      pgmap v3: 8 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                   8 creating
loic@fold:~/software/ceph/ceph/src$ ./ceph osd erasure-code-profile set pp k=2 m=1 ruleset-failure-domain=osd
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
loic@fold:~/software/ceph/ceph/src$
Updated by Loïc Dachary about 9 years ago
In /a/sage-2015-01-19_18:35:10-rados-wip-dho-distro-basic-multi/
- ceph-mon -i b
2015-01-19 22:06:53.606894 7f7ff6407700 7 mon.b@0(leader).osd e104 prepare_update mon_command({"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]} v 0) v1 from client.4251 10.214.131.8:0/33028352
2015-01-19 22:06:53.607035 7f7ff6407700 20 mon.b@0(leader).osd e104 erasure code profile testprofile set
- the thrasher kills ceph-mon -i b and ceph-mon -i c in various ways for a few minutes until
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['b']
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.b
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:killing mon.b
2015-01-19T22:08:06.539 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2015-01-19T22:08:12.539 INFO:tasks.ceph.mon.b:Stopped
2015-01-19T22:08:12.539 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
2015-01-19T22:08:12.539 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
2015-01-19T22:08:12.610 INFO:teuthology.orchestra.run.burnupi34.stderr:2015-01-19 22:08:12.609764 7f613857b700 0 -- :/1048066 >> 10.214.131.8:6789/0 pipe(0x7f613c0310e0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f613c031370).fault
2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum_status is {"election_epoch":14,"quorum":[1,2],"quorum_names":["a","c"],"quorum_leader_name":"a","monmap":{"epoch":1,"fsid":"0eec1192-ff0c-425b-9fcc-97493942a122","modified":"2015-01-19 22:03:07.508836","created":"2015-01-19 22:03:07.508836","mons":[{"rank":0,"name":"b","addr":"10.214.131.8:6789\/0"},{"rank":1,"name":"a","addr":"10.214.135.22:6789\/0"},{"rank":2,"name":"c","addr":"10.214.135.22:6790\/0"}]}}
2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum is size 2
2015-01-19T22:08:21.604 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6789 mon_status'
2015-01-19T22:08:22.667 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6790 mon_status'
2015-01-19T22:08:23.728 INFO:tasks.mon_thrash.mon_thrasher:waiting for 90.0 secs before reviving monitors
2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:test/librados/aio.cc:1945: Failure
2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:Value of: test_data.init()
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout: Actual: "create_one_ec_pool(test-rados-api-plana32-28352-43) failed: error mon_command erasure-code-profile set name:testprofile failed with error -1"
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:Expected: ""
- ceph-mon -i a will consistently refuse to create the profile afterwards. This is the first time ceph-mon -i a handled an erasure-code-profile set command.
2015-01-19 22:08:44.493205 7f78dfa98700 1 -- 10.214.135.22:6789/0 --> 10.214.131.8:0/43028352 -- mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]}]=-1 will not override erasure code profile testprofile v165) v1 -- ?+0 0x2646760 con 0x2671c80
Updated by Loïc Dachary about 9 years ago
The mon.b log shows the http://tracker.ceph.com/issues/11144 race shortly before the monitor is killed. Maybe that is the reason the error shows up, but I can't quite figure out why.
Updated by Loïc Dachary about 9 years ago
- Backport changed from firefly,giant to firefly
It's a rare condition that does not warrant backporting to giant.
Updated by Loïc Dachary about 9 years ago
It happened once in March 2015 at http://qa-proxy.ceph.com/teuthology/teuthology-2015-03-03_09:46:42-rados-firefly-distro-basic-multi/786177/
Updated by Loïc Dachary about 9 years ago
Unless someone has a brilliant idea, I'm tempted to merge https://github.com/ceph/ceph/pull/4064 and wait a few weeks until it shows up again. Looking at the diverging profiles will provide additional information.
Updated by Loïc Dachary about 9 years ago
- Subject changed from osd erasure-code-profile set is not idempotent for jerasure to osd erasure-code-profile set is sometimes not idempotent
Updated by Loïc Dachary about 9 years ago
Looking for errors:

cd /a
ls > /tmp/1
grep '2015-03-03.*rados.*' /tmp/1 | while read dir ; do
    echo '======= ' $dir
    for subdir in $dir/* ; do
        if grep --quiet rados/test.sh $subdir/orig.config.yaml ; then
            file=$subdir/teuthology.log
            echo $file
            grep 'failed: error mon_command erasure-code-profile' $file
        fi
    done
done
Updated by Loïc Dachary about 9 years ago
- Status changed from Fix Under Review to 7
Updated by Samuel Just about 9 years ago
- Status changed from 7 to Pending Backport
Updated by Loïc Dachary about 9 years ago
- Status changed from Pending Backport to 7
We'll wait for it to show up again, which should happen in the nightlies within a month maximum.
Updated by Loïc Dachary about 9 years ago
no errors found in runs prior to 2015-04-02
Updated by Loïc Dachary about 9 years ago
- Status changed from 7 to Resolved
no errors found in runs prior to 2015-04-22