Bug #15011
TEST_crush_reject_empty: ceph osd setcrushmap error on BTRFS
0%
Description
When running a make check on current master on my local btrfs system I have the following trace :
../qa/workunits/ceph-helpers.sh:1152: expect_failure: ./ceph osd setcrushmap -i testdir/osd-crush/empty_map.map../qa/workunits/ceph-helpers.sh:1155: expect_failure: success=false
../qa/workunits/ceph-helpers.sh:1158: expect_failure: false
../qa/workunits/ceph-helpers.sh:1158: expect_failure: grep --quiet 'Error EINVAL' testdir/osd-crush/out
../qa/workunits/ceph-helpers.sh:1159: expect_failure: cat testdir/osd-crush/out
- DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
File "./ceph", line 953, in <module>
retval = main()
File "./ceph", line 887, in main
sigdict, inbuf, verbose)
File "./ceph", line 470, in new_style_command
inbuf=inbuf)
File "/ceph/src/pybind/ceph_argparse.py", line 1270, in json_command
raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode byte 0xef in position 23: invalid continuation byte
I'm adding to this bug the complete log file & the empty_map.map file.
Associated revisions
SubProcess: Avoid buffer corruption when calling err()
Some code like crush/CrushTester.cc, uses err() to get the output status of a
spawned command.
As per reported in bug #15011, some tests were failing because of some utf8
decoding issues. The implied buffers were generated by the output of err()
function of SubProcess.
err() is returning the str().c_str() value of 'errstr' object which is not
living much longer that the function itself.
This patch returns a std::string to avoid the temporary allocation and fixes
the associated err() calls accordingly.
Since this commit, the make check is now passing everytime.
Fixes: #15011
Signed-off-by: Erwan Velu <erwan@redhat.com>
History
#1 Updated by Anonymous about 8 years ago
I got the same trace in another test :
../qa/workunits/cephtool/test.sh:1702: test_mon_crushmap_validation: printf '%s\n' '#!/bin/shcat > /dev/null
exit 0'
../qa/workunits/cephtool/test.sh:1707: test_mon_crushmap_validation: ceph osd setcrushmap -i /tmp/cephtool27928/map
- DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2016-03-08 17:39:47.522648 7f8e0a1a2700 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-03-08 17:39:47.522655 7f8e0a1a2700 0 lockdep start
2016-03-08 17:39:47.528534 7f8e0a1a2700 -1 WARNING: the following dangerous and experimental features are enabled: *
set crush map
2016-03-08 17:39:47.793462 7f8e0a1a2700 0 lockdep stop
../qa/workunits/cephtool/test.sh:1709: test_mon_crushmap_validation: printf '%s\n' '#!/bin/sh
cat > /dev/null
exit 1'
../qa/workunits/cephtool/test.sh:1714: test_mon_crushmap_validation: expect_false ceph osd setcrushmap -i /tmp/cephtool27928/map
../qa/workunits/cephtool/test.sh:30: expect_false: set -x
../qa/workunits/cephtool/test.sh:31: expect_false: ceph osd setcrushmap -i /tmp/cephtool27928/map - DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
2016-03-08 17:39:47.975821 7fd1eb8f0700 -1 WARNING: the following dangerous and experimental features are enabled: *
2016-03-08 17:39:47.975828 7fd1eb8f0700 0 lockdep start
2016-03-08 17:39:47.981257 7fd1eb8f0700 -1 WARNING: the following dangerous and experimental features are enabled: *
Traceback (most recent call last):
File "/ceph/src/ceph", line 953, in <module>
retval = main()
File "/ceph/src/ceph", line 887, in main
sigdict, inbuf, verbose)
File "/ceph/src/ceph", line 470, in new_style_command
inbuf=inbuf)
File "/ceph/src/pybind/ceph_argparse.py", line 1270, in json_command
raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{'prefix': u'osd setcrushmap'}": exception "['{"prefix": "osd setcrushmap"}']": exception 'utf8' codec can't decode byte 0xf0 in position 22: invalid continuation byte
#2 Updated by Anonymous about 8 years ago
And more suprisingly in osd pool create testing :
./test/mon/osd-crush.sh:129: TEST_crush_rule_create_erasure: expect_failure testdir/osd-crush 'Error EINVAL' ./ceph osd pool create mypool 1 1 erasure../qa/workunits/ceph-helpers.sh:1146: expect_failure: local dir=testdir/osd-crush
../qa/workunits/ceph-helpers.sh:1147: expect_failure: shift
../qa/workunits/ceph-helpers.sh:1148: expect_failure: local 'expected=Error EINVAL'
../qa/workunits/ceph-helpers.sh:1149: expect_failure: shift
../qa/workunits/ceph-helpers.sh:1150: expect_failure: local success
../qa/workunits/ceph-helpers.sh:1152: expect_failure: ./ceph osd pool create mypool 1 1 erasure
../qa/workunits/ceph-helpers.sh:1155: expect_failure: success=false
../qa/workunits/ceph-helpers.sh:1158: expect_failure: false
../qa/workunits/ceph-helpers.sh:1158: expect_failure: grep --quiet 'Error EINVAL' testdir/osd-crush/out
../qa/workunits/ceph-helpers.sh:1159: expect_failure: cat testdir/osd-crush/out
- DEVELOPER MODE: setting PATH, PYTHONPATH and LD_LIBRARY_PATH ***
Traceback (most recent call last):
File "./ceph", line 953, in <module>
retval = main()
File "./ceph", line 887, in main
sigdict, inbuf, verbose)
File "./ceph", line 470, in new_style_command
inbuf=inbuf)
File "/ceph/src/pybind/ceph_argparse.py", line 1270, in json_command
raise RuntimeError('"{0}": exception {1}'.format(argdict, e))
RuntimeError: "{u'pool_type': u'erasure', 'prefix': u'osd pool create', u'pg_num': 1L, u'pool': u'mypool', u'pgp_num': 1L}": exception "['{"pool_type": "erasure", "prefix": "osd pool create", "pg_num": 1, "pool": "mypool", "pgp_num": 1}']": exception 'utf8' codec can't decode byte 0xe0 in position 33: invalid continuation byte
../qa/workunits/ceph-helpers.sh:1160: expect_failure: return 1
./test/mon/osd-crush.sh:130: TEST_crush_rule_create_erasure: return 1
./test/mon/osd-crush.sh:32: run: return 1
#3 Updated by Anonymous about 8 years ago
I worked on it trying to understand where this decoding error comes from.
The decoding complains comes from https://github.com/ceph/ceph/blob/master/src/pybind/rados/rados.pyx#L1155
Getting deeper in the calls, it happen that _outs looks like corrupted like this :
2016-03-10 09:23:01.549686 7fb54ebd3700 -1 mon.a@0(leader).osd e7 error on crush map: ÀeƵ<8d>U
We are able to find this kind of corrupted string inside outs (some other exists to).
Looking at the code, I ended up here : https://github.com/ceph/ceph/blob/2b9775db2c062ddac6794a25a041beced5f6784f/src/mon/OSDMonitor.cc#L5400
This call is pretty strange as since commit https://github.com/ceph/ceph/commit/aa238e5ed50f44a94caf84567267e4f6be8732a2, CrushTester::test_with_crushtool() expects 4 arguments and not 3, which could explains such issue.
I'm not confident enough on this part to make a patch for adding the missing ruleset but something is really strange here.
#4 Updated by Anonymous about 8 years ago
Ok I got wrong on my assumptions. The prototypes makes the last argument optional. I have so to determine why and who is corrupting this buffer.
#5 Updated by Anonymous about 8 years ago
Continued my debugging and reach that state :
In https://github.com/ceph/ceph/blob/master/src/common/SubProcess.h#L364, errstr is perfectly set (I added instrement code to prove it)
but at this particular time, if we access errstr via its method (https://github.com/ceph/ceph/blob/master/src/common/SubProcess.h#L222) it is just corrupted
to prove that, I added the following code
+ FILE * f;
+ f=fopen("/tmp/join", "w");
+ fprintf(f, "|cmd=%s| exit status=%d; err=|%s| err2=|%s|\n", cmd.c_str(), WEXITSTATUS,err(),errstr.str().c_str());
+ fclose(f);
cmd=crushtool | exit status=1; err= | `����U | err2= | crushtool: exit status: 1 |
Note that err() function does only {return (errstr.str().c_str()}
#6 Updated by Greg Farnum almost 7 years ago
- Status changed from New to Resolved