Bug #10488 (closed)

osd erasure-code-profile set is sometimes not idempotent

Added by Loïc Dachary over 9 years ago. Updated about 9 years ago.

Status: Resolved
Priority: Urgent
% Done: 0%
Source: other
Backport: firefly
Severity: 3 - minor

Description

When http://workbench.dachary.org/ceph/ceph/blob/giant/src/mon/OSDMonitor.cc#L4531 compares the profile to be created (for instance k=2 m=1) with the existing one, it does not take into account the key=value pairs that are added implicitly by the jerasure plugin. Therefore ceph osd erasure-code-profile set myprofile k=2 m=1 will not compare equal when run twice.
It shows in the mon logs as:

2015-01-07 18:06:34.333705 7fe7dcffd700 10 mon.d@6(peon) e1 handle_route mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]}]=-1 will not override erasure code profile testprofile v220) v1 to unknown.0 :/0
2015-01-07 18:06:34.333712 7fe7dcffd700  1 -- 10.214.131.5:6792/0 --> 10.214.134.136:0/64022480 -- mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]}]=-1 will not override erasure code profile testprofile v220) v1 -- ?+0 0x20d8960 con 0x32609a0
2015-01-07 18:06:34.335084 7fe7dcffd700 10 mon.d@6(peon) e1 ms_handle_reset 0x32609a0 10.214.134.136:0/64022480

It also manifested itself in http://pulpito.ceph.com/sage-2015-01-06_09:44:19-rados-wip-sage-testing-firefly---basic-multi/688168/ (reported in http://tracker.ceph.com/issues/10483), where http://workbench.dachary.org/ceph/ceph/blob/master/src/test/librados/test.cc#L57 failed with:
2015-01-07T18:05:54.843 INFO:tasks.workunit.client.0.mira057.stdout:[ RUN      ] LibRadosAioEC.FlushAsync
2015-01-07T18:05:55.228 INFO:teuthology.orchestra.run.plana35:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.134.136:6792 mon_status'
2015-01-07T18:05:55.607 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['e', 'i', 'a', 'g']
2015-01-07T18:05:55.607 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2015-01-07T18:05:55.608 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.e
2015-01-07T18:05:55.608 INFO:tasks.mon_thrash.mon_thrasher:killing mon.e
2015-01-07T18:06:01.606 INFO:tasks.ceph.mon.e:Stopped
2015-01-07T18:06:01.606 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.i
2015-01-07T18:06:01.606 INFO:tasks.mon_thrash.mon_thrasher:killing mon.i
2015-01-07T18:06:07.607 INFO:tasks.ceph.mon.i:Stopped
2015-01-07T18:06:07.847 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.a
2015-01-07T18:06:07.847 INFO:tasks.mon_thrash.mon_thrasher:killing mon.a
2015-01-07T18:06:13.608 INFO:tasks.ceph.mon.a:Stopped
2015-01-07T18:06:13.608 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.g
2015-01-07T18:06:13.608 INFO:tasks.mon_thrash.mon_thrasher:killing mon.g
2015-01-07T18:06:19.608 INFO:tasks.ceph.mon.g:Stopped
2015-01-07T18:06:19.608 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 5
2015-01-07T18:06:19.609 INFO:teuthology.orchestra.run.plana35:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
2015-01-07T18:06:19.799 INFO:teuthology.orchestra.run.plana35.stderr:2015-01-07 18:06:19.798735 7fba7c2c1700  0 -- :/1026241 >> 10.214.134.136:6790/0 pipe(0x7fba78039010 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba780392a0).fault
2015-01-07T18:06:28.799 INFO:teuthology.orchestra.run.plana35.stderr:2015-01-07 18:06:28.799283 7fba7c1c0700  0 -- 10.214.131.5:0/1026241 >> 10.214.131.5:6789/0 pipe(0x7fba68000cd0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7fba680029c0).fault
2015-01-07T18:06:31.170 INFO:tasks.workunit.client.0.mira057.stdout:test/librados/aio.cc:2279: Failure
2015-01-07T18:06:31.170 INFO:tasks.workunit.client.0.mira057.stdout:Value of: test_data.init()
2015-01-07T18:06:31.170 INFO:tasks.workunit.client.0.mira057.stdout:  Actual: "create_one_ec_pool(test-rados-api-mira057-22480-52) failed: error rados_mon_command erasure-code-profile set name:testprofile failed with error -1" 
2015-01-07T18:06:31.171 INFO:tasks.workunit.client.0.mira057.stdout:Expected: "" 
2015-01-07T18:06:31.171 INFO:tasks.workunit.client.0.mira057.stdout:[  FAILED  ] LibRadosAioEC.FlushAsync (36327 ms)
2015-01-07T18:06:31.171 INFO:tasks.workunit.client.0.mira057.stdout:[ RUN      ] LibRadosAioEC.FlushAsyncPP
2015-01-07T18:06:31.194 INFO:tasks.workunit.client.0.mira057.stdout:test/librados/aio.cc:2319: Failure
2015-01-07T18:06:31.194 INFO:tasks.workunit.client.0.mira057.stdout:Value of: test_data.init()
2015-01-07T18:06:31.195 INFO:tasks.workunit.client.0.mira057.stdout:  Actual: "create_one_ec_pool(test-rados-api-mira057-22480-53) failed: error mon_command erasure-code-profile set name:testprofile failed with error -1" 
2015-01-07T18:06:31.195 INFO:tasks.workunit.client.0.mira057.stdout:Expected: "" 
2015-01-07T18:06:31.195 INFO:tasks.workunit.client.0.mira057.stdout:[  FAILED  ] LibRadosAioEC.FlushAsyncPP (24 ms)
2015-01-07T18:06:31.196 INFO:tasks.workunit.client.0.mira057.stdout:[ RUN      ] LibRadosAioEC.RoundTripWriteFull
2015-01-07T18:06:31.206 INFO:tasks.workunit.client.0.mira057.stdout:test/librados/aio.cc:2364: Failure
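
A minimal sketch of the expected behaviour (assuming a running cluster and the firefly/giant ceph CLI; the profile name and values are only illustrative): the second, identical invocation should be accepted, while one that changes a value is refused with the -1 "will not override" error quoted above.

# first creation succeeds
ceph osd erasure-code-profile set testprofile k=2 m=1 ruleset-failure-domain=osd
# re-running with exactly the same key=value pairs should be idempotent and succeed
ceph osd erasure-code-profile set testprofile k=2 m=1 ruleset-failure-domain=osd
# changing any value (here k) is expected to be refused:
#   -1 "will not override erasure code profile testprofile"
ceph osd erasure-code-profile set testprofile k=3 m=1 ruleset-failure-domain=osd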


Related issues (2): 0 open, 2 closed

Related to Ceph - Bug #11144: erasure-code-profile set races with erasure-code-profile rm (Resolved, Loïc Dachary, 03/18/2015)

Has duplicate Ceph - Bug #10483: erasure-code-profile set fails on mon thrashing (Duplicate, 01/08/2015)

Actions #1

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/sage-2015-01-11_12:05:20-rados-firefly-distro-basic-multi/697291

Actions #2

Updated by Sage Weil over 9 years ago

ubuntu@teuthology:/a/sage-2015-01-19_18:35:10-rados-wip-dho-distro-basic-multi/713735

Actions #3

Updated by Loïc Dachary about 9 years ago

  • Status changed from 12 to In Progress
Actions #4

Updated by Loïc Dachary about 9 years ago

  • Status changed from In Progress to Won't Fix

Although there is a potential for confusion, erasure-code-profile set is idempotent. https://github.com/ceph/ceph/pull/4064 clarifies the error message and adds tests that verify idempotency. My analysis of the original error was incorrect. What is more likely is that the profile already existed and was initialized with different values (k=3 instead of k=2, for instance). If that happens again, the error message, which now includes the profiles being compared, will make it easier to diagnose.
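
A quick way to check whether that is the case (a sketch assuming the firefly/giant ceph CLI; the profile name is illustrative) is to dump the existing profile and compare its key=value pairs with the ones being requested:

# list the erasure code profiles known to the cluster
ceph osd erasure-code-profile ls
# show the stored key=value pairs (k, m, plugin, ruleset-failure-domain, ...) for the suspect profile
ceph osd erasure-code-profile get testprofile
# if the stored values differ from the ones passed to "set", the -1 "will not override" error is the expected outcome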

Actions #5

Updated by Loïc Dachary about 9 years ago

  • Status changed from Won't Fix to Fix Under Review
Actions #6

Updated by Loïc Dachary about 9 years ago

There is, however, a situation where erasure-code-profile set fails shortly after a mon fails. See /a/sage-2015-01-19_18:35:10-rados-wip-dho-distro-basic-multi/713735/teuthology.log:

2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['b']
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.b
2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:killing mon.b
2015-01-19T22:08:06.539 DEBUG:tasks.ceph.mon.b:waiting for process to exit
2015-01-19T22:08:12.539 INFO:tasks.ceph.mon.b:Stopped
2015-01-19T22:08:12.539 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
2015-01-19T22:08:12.539 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
2015-01-19T22:08:12.610 INFO:teuthology.orchestra.run.burnupi34.stderr:2015-01-19 22:08:12.609764 7f613857b700  0 -- :/1048066 >> 10.214.131.8:6789/0 pipe(0x7f613c0310e0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f613c0313\
70).fault
2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum_status is {"election_epoch":14,"quorum":[1,2],"quorum_names":["a","c"],"quorum_leader_name":"a","monmap":{"epoch":1,"fsid":"0eec1192-ff0c-425b-9f\
cc-97493942a122","modified":"2015-01-19 22:03:07.508836","created":"2015-01-19 22:03:07.508836","mons":[{"rank":0,"name":"b","addr":"10.214.131.8:6789\/0"},{"rank":1,"name":"a","addr":"10.214.135.22:6789\/0"},{"\
rank":2,"name":"c","addr":"10.214.135.22:6790\/0"}]}}

2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum is size 2
2015-01-19T22:08:21.604 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6789 mon_status'
2015-01-19T22:08:22.667 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6790 mon_status'
2015-01-19T22:08:23.728 INFO:tasks.mon_thrash.mon_thrasher:waiting for 90.0 secs before reviving monitors
2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:test/librados/aio.cc:1945: Failure
2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:Value of: test_data.init()
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:  Actual: "create_one_ec_pool(test-rados-api-plana32-28352-43) failed: error mon_command erasure-code-profile set name:testprofile failed with \
error -1" 
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:Expected: "" 
2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:[  FAILED  ] LibRadosAioEC.RoundTripAppendPP (39013 ms)

Actions #7

Updated by Loïc Dachary about 9 years ago

Trying to reproduce manually does not help: the problem does not always show up.

loic@fold:~/software/ceph/ceph/src$ MON=3 OSD=0 MDS=0 ./vstart.sh -X -n
=== mon.c === 
Stopping Ceph mon.c on fold...done
=== mon.b === 
Stopping Ceph mon.b on fold...done
=== mon.a === 
Stopping Ceph mon.a on fold...done
hostname fold
ip 10.159.203.253
ip 10.159.203.253
port 
creating /home/loic/software/ceph/ceph/src/keyring
./monmaptool --create --clobber --add a 10.159.203.253:6789 --add b 10.159.203.253:6790 --add c 10.159.203.253:6791 --print /tmp/ceph_monmap.22569
./monmaptool: monmap file /tmp/ceph_monmap.22569
./monmaptool: generated fsid 5a5300e2-eb04-4975-ab20-2a0418d7fac1
epoch 0
fsid 5a5300e2-eb04-4975-ab20-2a0418d7fac1
last_changed 2015-03-18 12:19:28.625882
created 2015-03-18 12:19:28.625882
0: 10.159.203.253:6789/0 mon.a
1: 10.159.203.253:6790/0 mon.b
2: 10.159.203.253:6791/0 mon.c
./monmaptool: writing epoch 0 to /tmp/ceph_monmap.22569 (3 monitors)
rm -rf /home/loic/software/ceph/ceph/src/dev/mon.a
mkdir -p /home/loic/software/ceph/ceph/src/dev/mon.a
./ceph-mon --mkfs -c /home/loic/software/ceph/ceph/src/ceph.conf -i a --monmap=/tmp/ceph_monmap.22569 --keyring=/home/loic/software/ceph/ceph/src/keyring
./ceph-mon: set fsid to 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
./ceph-mon: created monfs at /home/loic/software/ceph/ceph/src/dev/mon.a for mon.a
rm -rf /home/loic/software/ceph/ceph/src/dev/mon.b
mkdir -p /home/loic/software/ceph/ceph/src/dev/mon.b
./ceph-mon --mkfs -c /home/loic/software/ceph/ceph/src/ceph.conf -i b --monmap=/tmp/ceph_monmap.22569 --keyring=/home/loic/software/ceph/ceph/src/keyring
./ceph-mon: set fsid to 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
./ceph-mon: created monfs at /home/loic/software/ceph/ceph/src/dev/mon.b for mon.b
rm -rf /home/loic/software/ceph/ceph/src/dev/mon.c
mkdir -p /home/loic/software/ceph/ceph/src/dev/mon.c
./ceph-mon --mkfs -c /home/loic/software/ceph/ceph/src/ceph.conf -i c --monmap=/tmp/ceph_monmap.22569 --keyring=/home/loic/software/ceph/ceph/src/keyring
./ceph-mon: set fsid to 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
./ceph-mon: created monfs at /home/loic/software/ceph/ceph/src/dev/mon.c for mon.c
./ceph-mon -i a -c /home/loic/software/ceph/ceph/src/ceph.conf
./ceph-mon -i b -c /home/loic/software/ceph/ceph/src/ceph.conf
./ceph-mon -i c -c /home/loic/software/ceph/ceph/src/ceph.conf
started.  stop.sh to stop.  see out/* (e.g. 'tail -f out/????') for debug output.

export PYTHONPATH=./pybind
export LD_LIBRARY_PATH=.libs
loic@fold:~/software/ceph/ceph/src$ ./ceph osd erasure-code-profile set pp k=2 m=1 ruleset-failure-domain=osd
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
loic@fold:~/software/ceph/ceph/src$ ./ceph osd erasure-code-profile set pp k=2 m=1 ruleset-failure-domain=osd
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
loic@fold:~/software/ceph/ceph/src$ pkill -f 'ceph-mon -i a'
loic@fold:~/software/ceph/ceph/src$ ./ceph -s
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2015-03-18 12:19:55.205630 7f1548c17700  0 -- 10.159.203.253:0/1023070 >> 10.159.203.253:6789/0 pipe(0x7f1538000cd0 sd=8 :0 s=1 pgs=0 cs=0 l=1 c=0x7f1538004f70).fault
    cluster 59b2a95e-c4ad-46a5-a36d-ab7ec3f039d7
     health HEALTH_ERR
            8 pgs stuck inactive
            8 pgs stuck unclean
            no osds
            1 mons down, quorum 1,2 b,c
     monmap e1: 3 mons at {a=10.159.203.253:6789/0,b=10.159.203.253:6790/0,c=10.159.203.253:6791/0}
            election epoch 6, quorum 1,2 b,c
     osdmap e2: 0 osds: 0 up, 0 in
      pgmap v3: 8 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                   8 creating
loic@fold:~/software/ceph/ceph/src$ ./ceph osd erasure-code-profile set pp k=2 m=1 ruleset-failure-domain=osd
*** DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
loic@fold:~/software/ceph/ceph/src$ 

Actions #8

Updated by Loïc Dachary about 9 years ago

In /a/sage-2015-01-19_18:35:10-rados-wip-dho-distro-basic-multi/

  • ceph-mon -i b
    2015-01-19 22:06:53.606894 7f7ff6407700  7 mon.b@0(leader).osd e104 prepare_update mon_command({"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain\
    =osd"]} v 0) v1 from client.4251 10.214.131.8:0/33028352
    2015-01-19 22:06:53.607035 7f7ff6407700 20 mon.b@0(leader).osd e104 erasure code profile testprofile set
    
  • the thrasher kills ceph-mon -i b and ceph-mon -i c in various ways for a few minutes, until
    2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to thrash: ['b']
    2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:monitors to freeze: []
    2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:thrashing mon.b
    2015-01-19T22:08:06.539 INFO:tasks.mon_thrash.mon_thrasher:killing mon.b
    2015-01-19T22:08:06.539 DEBUG:tasks.ceph.mon.b:waiting for process to exit
    2015-01-19T22:08:12.539 INFO:tasks.ceph.mon.b:Stopped
    2015-01-19T22:08:12.539 INFO:tasks.mon_thrash.ceph_manager:waiting for quorum size 2
    2015-01-19T22:08:12.539 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph quorum_status'
    2015-01-19T22:08:12.610 INFO:teuthology.orchestra.run.burnupi34.stderr:2015-01-19 22:08:12.609764 7f613857b700  0 -- :/1048066 >> 10.214.131.8:6789/0 pipe(0x7f613c0310e0 sd=7 :0 s=1 pgs=0 cs=0 l=1 c=0x7f613c0313\
    70).fault
    2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum_status is {"election_epoch":14,"quorum":[1,2],"quorum_names":["a","c"],"quorum_leader_name":"a","monmap":{"epoch":1,"fsid":"0eec1192-ff0c-425b-9f\
    cc-97493942a122","modified":"2015-01-19 22:03:07.508836","created":"2015-01-19 22:03:07.508836","mons":[{"rank":0,"name":"b","addr":"10.214.131.8:6789\/0"},{"rank":1,"name":"a","addr":"10.214.135.22:6789\/0"},{"\
    rank":2,"name":"c","addr":"10.214.135.22:6790\/0"}]}}
    
    2015-01-19T22:08:21.604 INFO:tasks.mon_thrash.ceph_manager:quorum is size 2
    2015-01-19T22:08:21.604 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6789 mon_status'
    2015-01-19T22:08:22.667 INFO:teuthology.orchestra.run.burnupi34:Running: 'adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph -m 10.214.135.22:6790 mon_status'
    2015-01-19T22:08:23.728 INFO:tasks.mon_thrash.mon_thrasher:waiting for 90.0 secs before reviving monitors
    2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:test/librados/aio.cc:1945: Failure
    2015-01-19T22:08:44.496 INFO:tasks.workunit.client.0.plana32.stdout:Value of: test_data.init()
    2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:  Actual: "create_one_ec_pool(test-rados-api-plana32-28352-43) failed: error mon_command erasure-code-profile set name:testprofile failed with \
    error -1" 
    2015-01-19T22:08:44.497 INFO:tasks.workunit.client.0.plana32.stdout:Expected: "" 
    
  • ceph-mon -i a will consistently refuse to create the profile afterwards. This is the first time ceph-mon -i a handled an erasure-code-profile set command.
    2015-01-19 22:08:44.493205 7f78dfa98700  1 -- 10.214.135.22:6789/0 --> 10.214.131.8:0/43028352 -- mon_command_ack([{"prefix": "osd erasure-code-profile set", "name": "testprofile", "profile": [ "k=2", "m=1", "ruleset-failure-domain=osd"]}]=-1 will not override erasure code profile testprofile v165) v1 -- ?+0 0x2646760 con 0x2671c80
    
Actions #9

Updated by Loïc Dachary about 9 years ago

The mon.b log has the http://tracker.ceph.com/issues/11144 race shortly before it was killed. Maybe it is the reason the error shows up, but I can't quite figure out why.

Actions #10

Updated by Loïc Dachary about 9 years ago

  • Backport changed from firefly,giant to firefly

It's a rare condition that does not deserve backporting to giant.

Actions #12

Updated by Loïc Dachary about 9 years ago

Unless someone has a brilliant idea, I'm tempted to merge https://github.com/ceph/ceph/pull/4064 and wait a few weeks until it shows up again. Looking at the diverging profiles will provide additional information.

Actions #13

Updated by Loïc Dachary about 9 years ago

  • Subject changed from osd erasure-code-profile set is not idempotent for jerasure to osd erasure-code-profile set is sometimes not idempotent
Actions #14

Updated by Loïc Dachary about 9 years ago

Looking for errors in the 2015-03-03 rados runs:

cd /a ; ls > /tmp/1 ; grep '2015-03-03.*rados.*' /tmp/1 | while read dir ; do echo '======= ' $dir ; for subdir in $dir/* ; do if grep --quiet rados/test.sh $subdir/orig.config.yaml ; then file=$subdir/teuthology.log ; echo $file ; grep 'failed: error mon_command erasure-code-profile' $file ; fi ; done ; done

Actions #15

Updated by Loïc Dachary about 9 years ago

  • Status changed from Fix Under Review to 7
Actions #16

Updated by Samuel Just about 9 years ago

  • Status changed from 7 to Pending Backport
Actions #17

Updated by Loïc Dachary about 9 years ago

  • Status changed from Pending Backport to 7

We'll wait for it to show up again, which should happen in the nightlies within a month maximum.

Actions #18

Updated by Loïc Dachary about 9 years ago

No errors found in runs prior to 2015-04-02.

Actions #19

Updated by Loïc Dachary about 9 years ago

  • Status changed from 7 to Resolved

No errors found in runs prior to 2015-04-22.
